Modern Open Table Format Integration with ODP

Apache Hive introduced the pioneering data lake table format for warehousing in the big data ecosystem, but its design and architecture imposed significant limitations. More recent table formats give data engineers and architects far stronger capabilities for meeting compliance and maintenance requirements.

These capabilities include ACID transactions, time travel, historical snapshots, branching and tagging, and Change Data Capture (CDC) procedures that simplify upserts and deletes. On HDFS or object storage, these table formats support full CRUD operations and deliver improved performance and scalability through modernized data organization strategies.

The following sections offer an introductory guide to using these open table formats with ODP's Spark integration, using Spark Scala and Spark SQL. For each format, the guide gives a concise overview of its capabilities and usage, with code snippets for inserting, updating, deleting, and exercising other available features.

Hudi

Apache Hudi (pronounced “hoodie”) is a next-generation streaming data lake platform that brings core warehouse and database functionality directly to the data lake. Hudi provides tables, transactions, efficient upserts and deletes, advanced indexes, streaming ingestion services, data clustering and compaction optimizations, and concurrency control, all while keeping your data in open source file formats.

Hudi Spark 3 Support Matrix

Hudi Version      Supported Spark 3 Versions
0.14.x            3.4.x (default build), 3.3.x, 3.2.x, 3.1.x, 3.0.x
0.13.x            3.3.x (default build), 3.2.x, 3.1.x
0.12.x            3.3.x (default build), 3.2.x, 3.1.x
0.11.x            3.2.x (default build, Spark bundle only), 3.1.x
0.10.x            3.1.x (default build), 3.0.x
0.7.0 - 0.9.0     3.0.x
0.6.0 and prior   Not supported

Working with Apache Hudi and Spark: Data Operations Guide

Spark Shell

This command launches the Spark shell with specific configurations set to use Kryo for serialization and to integrate Apache Hudi as the catalog in Spark SQL, allowing for Hudi's features and tables to be accessed directly within Spark SQL queries.

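A representative launch command follows; the bundle coordinates and version are illustrative and should match the Spark and Hudi versions shipped with your ODP environment, which may already place the Hudi bundle on the classpath.

```bash
# Launch spark-shell with Kryo serialization and the Hudi catalog/extension enabled
spark-shell \
  --packages org.apache.hudi:hudi-spark3.4-bundle_2.12:0.14.1 \
  --conf 'spark.serializer=org.apache.spark.serializer.KryoSerializer' \
  --conf 'spark.kryo.registrator=org.apache.spark.HoodieSparkKryoRegistrar' \
  --conf 'spark.sql.extensions=org.apache.spark.sql.hudi.HoodieSparkSessionExtension' \
  --conf 'spark.sql.catalog.spark_catalog=org.apache.spark.sql.hudi.catalog.HoodieCatalog'
```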

Import Statements

This code block imports necessary libraries for Spark and Hudi operations, and it initializes variables for the table name and base path to be used in subsequent Hudi data operations.

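For example (the table name and base path are illustrative; any HDFS or object-store URI can serve as the base path):

```scala
// Spark and Hudi helpers used by the snippets below
import org.apache.hudi.DataSourceReadOptions._
import org.apache.hudi.DataSourceWriteOptions._
import org.apache.spark.sql.SaveMode._
import org.apache.spark.sql.functions._
import spark.implicits._

// Table name and storage location reused by the following examples
val tableName = "trips_table"
val basePath  = "file:///tmp/hudi/trips_table"
```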

Create Table, Insert Data, and Query Data

This code block demonstrates how to create a Hudi table, insert data into it, and then query that data using Spark SQL, showcasing a complete cycle of table creation and data manipulation.

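A minimal sketch using a small, made-up trips dataset; the column names, record key, and partition field below are illustrative:

```scala
// Build a small DataFrame of trips and write it out as a new Hudi table
val columns = Seq("ts", "uuid", "rider", "driver", "fare", "city")
val inserts = Seq(
  (1695159649087L, "trip-1", "rider-A", "driver-K", 19.10, "san_francisco"),
  (1695091554788L, "trip-2", "rider-C", "driver-M", 27.70, "san_francisco"),
  (1695332066204L, "trip-3", "rider-E", "driver-O", 93.50, "chennai")
).toDF(columns: _*)

inserts.write.format("hudi").
  option("hoodie.table.name", tableName).
  option("hoodie.datasource.write.recordkey.field", "uuid").
  option("hoodie.datasource.write.partitionpath.field", "city").
  option("hoodie.datasource.write.precombine.field", "ts").
  mode(Overwrite).
  save(basePath)

// Read the table back and query it with Spark SQL
spark.read.format("hudi").load(basePath).createOrReplaceTempView("trips_view")
spark.sql("SELECT uuid, rider, fare, city FROM trips_view WHERE fare > 20.0").show()
```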

Update Data and Read Data

This code block reads data from a Hudi table, modifies the 'fare' column for a specific rider, and updates the table with the new information.

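For instance, continuing with the illustrative trips table from the previous step:

```scala
// Load current rows, raise the fare for one rider, and upsert the change back
val updates = spark.read.format("hudi").load(basePath).
  filter($"rider" === "rider-C").
  withColumn("fare", $"fare" * 1.1)

updates.write.format("hudi").
  option("hoodie.table.name", tableName).
  option("hoodie.datasource.write.recordkey.field", "uuid").
  option("hoodie.datasource.write.partitionpath.field", "city").
  option("hoodie.datasource.write.precombine.field", "ts").
  option("hoodie.datasource.write.operation", "upsert").
  mode(Append).
  save(basePath)

// Read back the updated record
spark.read.format("hudi").load(basePath).filter($"rider" === "rider-C").show()
```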

Merge Data and Read Data

This code block demonstrates how to merge data from a source Hudi table into a target Hudi table, illustrating the integration of datasets within the Hudi framework.

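A sketch using Spark SQL MERGE INTO; it assumes the trips table has also been registered in the session catalog (for example via CREATE TABLE ... USING hudi) and uses a hypothetical fare_corrections view as the source:

```scala
// Stage corrected fares as a temp view, then MERGE them into the Hudi table.
// MERGE INTO requires the Hudi Spark SQL extension enabled at shell startup.
Seq(("trip-1", 21.50), ("trip-2", 30.00)).toDF("uuid", "new_fare").
  createOrReplaceTempView("fare_corrections")

spark.sql(s"""
  MERGE INTO $tableName AS t
  USING fare_corrections AS s
  ON t.uuid = s.uuid
  WHEN MATCHED THEN UPDATE SET t.fare = s.new_fare
""")

// Read the merged result back
spark.sql(s"SELECT uuid, rider, fare, city FROM $tableName").show()
```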

Deleting Data

This code block loads data from a Hudi table, filters out records corresponding to a specific rider, and prepares these records for deletion from the table.

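Continuing the same example:

```scala
// Select the rows to remove, then write them back with the delete operation
val deletes = spark.read.format("hudi").load(basePath).filter($"rider" === "rider-E")

deletes.write.format("hudi").
  option("hoodie.table.name", tableName).
  option("hoodie.datasource.write.recordkey.field", "uuid").
  option("hoodie.datasource.write.partitionpath.field", "city").
  option("hoodie.datasource.write.precombine.field", "ts").
  option("hoodie.datasource.write.operation", "delete").
  mode(Append).
  save(basePath)

// The deleted rider no longer appears in subsequent reads
spark.read.format("hudi").load(basePath).filter($"rider" === "rider-E").show()
```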

Time Travel Query

Time travel queries in Hudi allow you to view and query data as it appeared at specific points in time, using different timestamp formats to access historical data snapshots.

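For example, using Hudi's as.of.instant read option (the timestamps below are illustrative):

```scala
// Query the table as it looked at an earlier instant; several timestamp formats are accepted
spark.read.format("hudi").
  option("as.of.instant", "20240101123045678").          // Hudi commit instant format
  load(basePath).show()

spark.read.format("hudi").
  option("as.of.instant", "2024-01-01 12:30:45.678").    // equivalent wall-clock format
  load(basePath).show()
```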

Change Data Capture (CDC) Query

Hudi offers comprehensive support for Change Data Capture (CDC) queries, which are essential for applications requiring a detailed record of changes, including before and after snapshots of records, within a specified commit time range.

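A sketch of an incremental read in CDC format; it assumes the table was written with hoodie.table.cdc.enabled=true, and the begin instant below is illustrative:

```scala
// Incremental query in CDC format: returns before/after images of changed records
spark.read.format("hudi").
  option("hoodie.datasource.query.type", "incremental").
  option("hoodie.datasource.query.incremental.format", "cdc").
  option("hoodie.datasource.read.begin.instanttime", "20240101000000000").
  load(basePath).
  show(false)
```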

CDC queries are currently only supported on Copy-on-Write tables.

Iceberg

Apache Iceberg is an open table format designed for large-scale analytic datasets. Iceberg integrates with computing engines like Spark, Trino, PrestoDB, Flink, Hive, and Impala, offering a high-performance table format that functions similarly to a SQL table.

User Experience

Iceberg ensures a smooth and predictable user experience. Schema evolution is reliable and does not accidentally restore deleted data. Users can achieve fast queries without needing to understand partitioning.

  • Schema evolution: Supports adding, dropping, updating, or renaming columns without unintended consequences (see the sketch after this list).
  • Hidden partitioning: Prevents user errors that could lead to silently incorrect results or dramatically slow queries.
  • Partition layout evolution: Adapts the table's layout as data volumes or query patterns shift.
  • Time travel: Facilitates reproducible queries using the exact same table snapshot and allows easy examination of historical changes.
  • Version Rollback: Enables users to quickly resolve issues by reverting tables to a stable state.
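
As a brief illustration of schema evolution and hidden partitioning (assuming an Iceberg catalog named local, configured as in the Spark shell setup later in this guide; the table and column names are illustrative):

```scala
// Hidden partitioning: partition by a transform of a column; queries never reference it directly
spark.sql("""
  CREATE TABLE local.db.events (id bigint, ts timestamp, payload string)
  USING iceberg
  PARTITIONED BY (days(ts))
""")

// Schema evolution: add and rename columns in place, without rewriting existing data files
spark.sql("ALTER TABLE local.db.events ADD COLUMN severity int")
spark.sql("ALTER TABLE local.db.events RENAME COLUMN payload TO message")
```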

Iceberg Spark Support Matrix

Spark Version   Lifecycle Stage   Initial Iceberg Support   Latest Iceberg Support   Latest Runtime Jar
2.4             End of Life       0.7.0-incubating          1.2.1                    iceberg-spark-runtime-2.4
3.0             End of Life       0.9.0                     1.0.0                    iceberg-spark-runtime-3.0_2.12
3.1             End of Life       0.12.0                    1.3.1                    iceberg-spark-runtime-3.1_2.12 [1]
3.2             End of Life       0.13.0                    1.4.3                    iceberg-spark-runtime-3.2_2.12
3.3             Maintained        0.14.0                    1.5.0                    iceberg-spark-runtime-3.3_2.12
3.4             Maintained        1.3.0                     1.5.0                    iceberg-spark-runtime-3.4_2.12
3.5             Maintained        1.4.0                     1.5.0                    iceberg-spark-runtime-3.5_2.12

Working with Apache Iceberg and Spark: Data Operations Guide

Spark Shell

Start the Spark shell with necessary configurations for Apache Iceberg integration.

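A representative launch command; the runtime artifact must match your Spark version, and the version, catalog name, and warehouse path below are illustrative:

```bash
# Launch spark-shell with the Iceberg runtime, SQL extensions, and a local Hadoop catalog
spark-shell \
  --packages org.apache.iceberg:iceberg-spark-runtime-3.4_2.12:1.5.0 \
  --conf spark.sql.extensions=org.apache.iceberg.spark.extensions.IcebergSparkSessionExtensions \
  --conf spark.sql.catalog.local=org.apache.iceberg.spark.SparkCatalog \
  --conf spark.sql.catalog.local.type=hadoop \
  --conf spark.sql.catalog.local.warehouse=/tmp/iceberg_warehouse
```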

Import Statements

Include the necessary Apache Iceberg and Spark libraries to enable data operations.

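For example (most Iceberg operations in this guide go through Spark SQL, so only a few helpers are needed; the catalog, namespace, and table names are illustrative):

```scala
// Spark helpers used by the snippets below
import org.apache.spark.sql.functions._
import spark.implicits._

// Fully qualified Iceberg table name (catalog.namespace.table) reused by the following examples
val tableName = "local.db.trips"
```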

Create Table, Insert Data, and Query Data

Example code to create an Iceberg table, insert records, and perform a query to retrieve data.

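A minimal sketch, continuing with the illustrative local.db.trips table:

```scala
// Create a partitioned Iceberg table, insert a few rows, and query it back
spark.sql(s"""
  CREATE TABLE IF NOT EXISTS $tableName
    (uuid string, rider string, driver string, fare double, city string)
  USING iceberg
  PARTITIONED BY (city)
""")

spark.sql(s"""
  INSERT INTO $tableName VALUES
    ('trip-1', 'rider-A', 'driver-K', 19.10, 'san_francisco'),
    ('trip-2', 'rider-C', 'driver-M', 27.70, 'san_francisco'),
    ('trip-3', 'rider-E', 'driver-O', 93.50, 'chennai')
""")

spark.sql(s"SELECT uuid, rider, fare, city FROM $tableName WHERE fare > 20.0").show()
```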

Update and Read Data

Demonstrate updating records in an Iceberg table and reading the updated data.

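For instance:

```scala
// Raise the fare for one rider, then read the changed rows back
spark.sql(s"UPDATE $tableName SET fare = fare * 1.1 WHERE rider = 'rider-C'")
spark.sql(s"SELECT uuid, rider, fare FROM $tableName WHERE rider = 'rider-C'").show()
```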

Merge and Read Data

Show how to merge data from one Iceberg table into another and read the merged data.

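A sketch that stages changes in a temporary view (trip_updates, a hypothetical name) and merges them into the table:

```scala
// Upsert a small batch of changes: matched keys are updated, unmatched keys are inserted
Seq(("trip-1", "rider-A", "driver-K", 25.00, "san_francisco"),
    ("trip-9", "rider-Z", "driver-Q", 41.00, "chennai")).
  toDF("uuid", "rider", "driver", "fare", "city").
  createOrReplaceTempView("trip_updates")

spark.sql(s"""
  MERGE INTO $tableName AS t
  USING trip_updates AS s
  ON t.uuid = s.uuid
  WHEN MATCHED THEN UPDATE SET *
  WHEN NOT MATCHED THEN INSERT *
""")

spark.sql(s"SELECT * FROM $tableName ORDER BY uuid").show()
```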

Delete Data

Example code for deleting specific records from an Iceberg table.

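For example:

```scala
// Remove all trips for one rider and confirm they are gone
spark.sql(s"DELETE FROM $tableName WHERE rider = 'rider-A'")
spark.sql(s"SELECT uuid, rider, city FROM $tableName").show()
```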

Read Historical Snapshots

Retrieve data from historical snapshots of an Iceberg table using specific timestamps.

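A sketch using Iceberg's snapshots metadata table and its snapshot-id / as-of-timestamp read options; the snapshot id and timestamp values below are illustrative:

```scala
// List the table's snapshots, then read the data as of a snapshot id or a point in time
spark.sql(s"SELECT snapshot_id, committed_at, operation FROM $tableName.snapshots").show(false)

spark.read.option("snapshot-id", 5937117119577207000L).   // illustrative snapshot id
  format("iceberg").load(tableName).show()

spark.read.option("as-of-timestamp", "1704110445000").    // milliseconds since epoch
  format("iceberg").load(tableName).show()
```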

Time Travel Query

Query data from different points in time using Iceberg’s time travel capability to access historical table states.

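For example, using the TIMESTAMP AS OF and VERSION AS OF SQL syntax available on Spark 3.3 and later (the timestamp and snapshot id are illustrative):

```scala
// SQL time travel: query the table as of a timestamp or a specific snapshot
spark.sql(s"SELECT * FROM $tableName TIMESTAMP AS OF '2024-01-01 12:30:45'").show()
spark.sql(s"SELECT * FROM $tableName VERSION AS OF 5937117119577207000").show()
```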

Delta Lake

Delta Lake is an open source project that enables building a lakehouse architecture on top of existing data lakes. It provides ACID transactions and scalable metadata handling, and it unifies streaming and batch data processing over storage systems such as S3, ADLS, GCS, and HDFS.

Delta Lake offers the following:

  • ACID transactions in Spark: Guarantees that all readers access consistent and accurate data through serializable isolation levels.
  • Scalable Metadata Management: Utilizes Spark's distributed computing capabilities to efficiently manage extensive metadata for petabyte-scale tables containing billions of files.
  • Unified Streaming and Batch Processing: Delta Lake allows tables to function both as batch tables and as streaming sources and sinks, seamlessly integrating streaming data ingest, batch historic backfill, and interactive queries.
  • Schema Enforcement: Automatically manages schema variations to prevent the insertion of incorrect records during data ingestion (see the sketch after this list).
  • Time Travel: Enables data versioning for rollbacks, comprehensive historical audits, and reproducible machine learning experiments.
  • Advanced Data Operations: Supports merges, updates, and deletes to facilitate complex scenarios such as change-data-capture, slowly-changing-dimension (SCD) operations, and streaming upserts.
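
As a brief illustration of schema enforcement (assuming a Delta table already exists at the illustrative path /tmp/delta/trips used in the walkthrough below):

```scala
import org.apache.spark.sql.AnalysisException
import spark.implicits._

// A frame with every table column plus an extra "note" column the table does not have
val mismatched = Seq(("trip-9", "rider-Z", "driver-Q", 12.5, "chennai", "promo ride")).
  toDF("uuid", "rider", "driver", "fare", "city", "note")

// Schema enforcement rejects the unexpected column rather than silently widening the table
try {
  mismatched.write.format("delta").mode("append").save("/tmp/delta/trips")
} catch {
  case e: AnalysisException => println(s"Write rejected by schema enforcement: ${e.getMessage}")
}

// Opting in to schema evolution adds the new column instead of failing the write
mismatched.write.format("delta").option("mergeSchema", "true").mode("append").save("/tmp/delta/trips")
```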

Delta Lake Version   Apache Spark Version
3.1.x                3.5.x
3.0.x                3.5.x
2.4.x                3.4.x
2.3.x                3.3.x
2.2.x                3.3.x
2.1.x                3.3.x
2.0.x                3.2.x
1.2.x                3.2.x
1.1.x                3.2.x

Working with Delta Lake and Spark: Data Operations Guide

Spark Shell

Initialize the Spark shell with configurations optimized for working with Delta Lake.

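A representative launch command; the package version is illustrative and must match your Spark version (Delta 2.x releases ship as io.delta:delta-core_2.12 instead of delta-spark_2.12):

```bash
# Launch spark-shell with the Delta Lake package, SQL extension, and catalog
spark-shell \
  --packages io.delta:delta-spark_2.12:3.1.0 \
  --conf "spark.sql.extensions=io.delta.sql.DeltaSparkSessionExtension" \
  --conf "spark.sql.catalog.spark_catalog=org.apache.spark.sql.delta.catalog.DeltaCatalog"
```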

Import Statements

Import the necessary libraries for Delta Lake to enable data manipulation and querying.

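For example (the storage path is illustrative):

```scala
// Delta Lake's Scala API plus Spark helpers used by the snippets below
import io.delta.tables._
import org.apache.spark.sql.functions._
import spark.implicits._

// Storage location for the Delta table used throughout the following examples
val deltaPath = "/tmp/delta/trips"
```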

Create Table, Insert Data, and Read Data

Example code to create a Delta Lake table, insert data into it, and perform queries to retrieve the data.

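A minimal sketch with a small, made-up trips dataset:

```scala
// Write a small DataFrame as a new Delta table, append more rows, then read it back
val columns = Seq("uuid", "rider", "driver", "fare", "city")

Seq(("trip-1", "rider-A", "driver-K", 19.10, "san_francisco"),
    ("trip-2", "rider-C", "driver-M", 27.70, "san_francisco")).
  toDF(columns: _*).
  write.format("delta").mode("overwrite").save(deltaPath)

Seq(("trip-3", "rider-E", "driver-O", 93.50, "chennai")).
  toDF(columns: _*).
  write.format("delta").mode("append").save(deltaPath)

spark.read.format("delta").load(deltaPath).filter($"fare" > 20.0).show()
```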

Update Data and Read Data

Show how to update records in a Delta Lake table.

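For instance, using the DeltaTable Scala API:

```scala
// Raise the fare for one rider in place, then read the updated rows back
val deltaTable = DeltaTable.forPath(spark, deltaPath)

deltaTable.update(
  condition = expr("rider = 'rider-C'"),
  set = Map("fare" -> expr("fare * 1.1")))

spark.read.format("delta").load(deltaPath).filter($"rider" === "rider-C").show()
```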

Merge Data and Read Data

Demonstrate merging data from one Delta Lake table into another and reading the resultant data.

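A sketch that upserts a small batch of changes:

```scala
// Matched keys are updated, unmatched keys are inserted
val changes = Seq(
  ("trip-1", "rider-A", "driver-K", 25.00, "san_francisco"),  // existing key: updated
  ("trip-9", "rider-Z", "driver-Q", 41.00, "chennai")         // new key: inserted
).toDF("uuid", "rider", "driver", "fare", "city")

DeltaTable.forPath(spark, deltaPath).as("t").
  merge(changes.as("s"), "t.uuid = s.uuid").
  whenMatched.updateAll().
  whenNotMatched.insertAll().
  execute()

spark.read.format("delta").load(deltaPath).orderBy("uuid").show()
```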

Delete Data

Example code for deleting specific records from a Delta Lake table, demonstrating data management capabilities.

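For example:

```scala
// Remove all trips for one rider and confirm they are gone
DeltaTable.forPath(spark, deltaPath).delete("rider = 'rider-A'")

spark.read.format("delta").load(deltaPath).show()
```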

Time Travel Query

Use Delta Lake's time travel feature to query data from different historical versions of a table.

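For example (the version number and timestamp are illustrative):

```scala
// Read earlier versions of the table by version number or by timestamp
spark.read.format("delta").option("versionAsOf", 0).load(deltaPath).show()
spark.read.format("delta").option("timestampAsOf", "2024-01-01 12:30:45").load(deltaPath).show()

// The table's commit history shows which versions and timestamps are available
DeltaTable.forPath(spark, deltaPath).history().select("version", "timestamp", "operation").show(false)
```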