Hudi

Hudi with Spark

Apache Hudi (pronounced "hoodie") is the next-generation streaming data lake platform. Apache Hudi brings core warehouse and database functionality directly to a data lake. Hudi provides tables, transactions, efficient upserts/deletes, advanced indexes, streaming ingestion services, data clustering/compaction optimizations, and concurrency, all while keeping your data in open-source file formats.

Hudi Spark 3 Support Matrix

Hudi             Supported Spark 3 Version
0.15.x           3.5.x (default build), 3.4.x, 3.3.x, 3.2.x, 3.1.x, 3.0.x
0.14.x           3.4.x (default build), 3.3.x, 3.2.x, 3.1.x, 3.0.x
0.13.x           3.3.x (default build), 3.2.x, 3.1.x
0.12.x           3.3.x (default build), 3.2.x, 3.1.x
0.11.x           3.2.x (default build, Spark bundle only), 3.1.x
0.10.x           3.1.x (default build), 3.0.x
0.7.0 - 0.9.0    3.0.x
0.6.0 and prior  not supported

Spark Shell

This command launches the Spark shell configured to use Kryo serialization and to register Hudi's catalog and SQL extensions in Spark SQL, so Hudi tables and features can be used directly from Spark SQL queries.

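A sketch of such a launch command, assuming Hudi 0.15.0 on Spark 3.5 with Scala 2.12 (pick the bundle matching your versions from the support matrix above):

    # Launch the Spark shell with the Hudi bundle and the Hudi SQL integration settings
    spark-shell \
      --packages org.apache.hudi:hudi-spark3.5-bundle_2.12:0.15.0 \
      --conf 'spark.serializer=org.apache.spark.serializer.KryoSerializer' \
      --conf 'spark.sql.catalog.spark_catalog=org.apache.spark.sql.hudi.catalog.HoodieCatalog' \
      --conf 'spark.sql.extensions=org.apache.spark.sql.hudi.HoodieSparkSessionExtension' \
      --conf 'spark.kryo.registrator=org.apache.spark.HoodieSparkKryoRegistrar'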

Import Statements

This code block imports the libraries needed for Spark and Hudi operations and initializes the table name and base path variables used by the subsequent examples.

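A sketch following the Hudi quickstart; the table name and base path are placeholders to adjust for your environment:

    // Imports used by the examples that follow
    import org.apache.spark.sql.SaveMode._
    import org.apache.spark.sql.functions._
    import spark.implicits._

    // Table name and storage location shared by the examples (placeholders)
    val tableName = "trips_table"
    val basePath = "file:///tmp/trips_table"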

Create Table, Insert Data, and Query Data

This code block demonstrates how to create a Hudi table, insert data into it, and then query that data using Spark SQL, showcasing a complete cycle of table creation and data manipulation.

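A sketch adapted from the Hudi quickstart, assuming the trips schema used there (ts, uuid, rider, driver, fare, city):

    // Build a small DataFrame of trip records
    val columns = Seq("ts", "uuid", "rider", "driver", "fare", "city")
    val data = Seq(
      (1695159649087L, "334e26e9-8355-45cc-97c6-c31daf0df330", "rider-A", "driver-K", 19.10, "san_francisco"),
      (1695091554788L, "e96c4396-3fad-413a-a942-4cb36106d721", "rider-C", "driver-M", 27.70, "san_francisco"),
      (1695046462179L, "9909a8b1-2d15-4d3d-8ec9-efc48c536a00", "rider-D", "driver-L", 33.90, "sao_paulo"))
    val inserts = spark.createDataFrame(data).toDF(columns: _*)

    // Writing in Overwrite mode creates the Hudi table at basePath
    inserts.write.format("hudi").
      option("hoodie.datasource.write.partitionpath.field", "city").
      option("hoodie.table.name", tableName).
      mode(Overwrite).
      save(basePath)

    // Read the table back and query it with Spark SQL
    val tripsDF = spark.read.format("hudi").load(basePath)
    tripsDF.createOrReplaceTempView("trips_table")
    spark.sql("SELECT uuid, fare, rider, driver, city FROM trips_table WHERE fare > 20.0").show()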

Update Data and Read Data

This code block reads data from a Hudi table, modifies the 'fare' column for a specific rider, and updates the table with the new information.

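A sketch, assuming the table created above; writing in Append mode performs Hudi's default upsert:

    // Read current data, raise the fare for one rider, and upsert the changes back
    val updatesDF = spark.read.format("hudi").load(basePath).
      filter($"rider" === "rider-D").
      withColumn("fare", col("fare") * 10)

    updatesDF.write.format("hudi").
      option("hoodie.datasource.write.operation", "upsert").
      option("hoodie.datasource.write.partitionpath.field", "city").
      option("hoodie.table.name", tableName).
      mode(Append).
      save(basePath)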

Merge Data and Read Data

This code block merges data from a source Hudi table into a target Hudi table.

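A sketch that merges via an upsert, which combines records by key; the source path fare_adjustments is a hypothetical example:

    // Read the (hypothetical) source Hudi table holding corrected records
    val sourceDF = spark.read.format("hudi").load("file:///tmp/fare_adjustments")

    // Upserting merges by record key: existing keys are updated, new keys are inserted
    sourceDF.write.format("hudi").
      option("hoodie.datasource.write.operation", "upsert").
      option("hoodie.datasource.write.partitionpath.field", "city").
      option("hoodie.table.name", tableName).
      mode(Append).
      save(basePath)

    // Read the merged result back
    spark.read.format("hudi").load(basePath).show()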

Delete Data

This code block loads data from a Hudi table, filters the records for a specific rider, and deletes those records from the table.

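A sketch, assuming the table created above contains records for rider-F:

    // Select the records to delete, then write them back with the delete operation
    val deletesDF = spark.read.format("hudi").load(basePath).filter($"rider" === "rider-F")

    deletesDF.write.format("hudi").
      option("hoodie.datasource.write.operation", "delete").
      option("hoodie.datasource.write.partitionpath.field", "city").
      option("hoodie.table.name", tableName).
      mode(Append).
      save(basePath)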

Time Travel Query

Time travel queries in Hudi allow you to view and query data as it appeared at specific points in time, using different timestamp formats to access historical data snapshots.

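A sketch of the supported timestamp formats; the instants are illustrative and must match commits on your table's timeline:

    // As of a specific commit instant (yyyyMMddHHmmssSSS)
    spark.read.format("hudi").
      option("as.of.instant", "20210728141108100").
      load(basePath)

    // Equivalent, using a human-readable timestamp
    spark.read.format("hudi").
      option("as.of.instant", "2021-07-28 14:11:08.200").
      load(basePath)

    // Date only, interpreted as "2021-07-28 00:00:00"
    spark.read.format("hudi").
      option("as.of.instant", "2021-07-28").
      load(basePath)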

Change Data Capture (CDC) Query

Hudi also exposes first-class support for Change Data Capture (CDC) queries. CDC queries are useful for applications that need to obtain all the changes, along with before/after images of records, for a given commit time range.

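A sketch; CDC must have been enabled when the table was written (hoodie.table.cdc.enabled=true), and the begin instant of 0 here means "from the start of the timeline":

    // Read all changes since the given commit time, with before/after record images
    spark.read.format("hudi").
      option("hoodie.datasource.query.type", "incremental").
      option("hoodie.datasource.query.incremental.format", "cdc").
      option("hoodie.datasource.read.begin.instanttime", "0").
      load(basePath).
      show(false)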

CDC queries are currently supported only on Copy-on-Write tables.

For more details, see the Apache Hudi Quick Start Guide.

Hudi with Hive

Hudi    Supported Hive Version
0.14.x  Hive 4.x.x

Hive Shell

  1. Update the Hive Environment Configuration:
    • Edit the hive-env.sh file to include the Hudi JARs in the HIVE_AUX_JARS_PATH:
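      A sketch of the hive-env.sh change; the JAR path is an assumption and should point at the hudi-hadoop-mr-bundle JAR shipped with your Hudi release:

        # hive-env.sh: make the Hudi input formats visible to Hive (path is illustrative)
        export HIVE_AUX_JARS_PATH=/opt/hudi/hudi-hadoop-mr-bundle-0.14.0.jar:$HIVE_AUX_JARS_PATH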
  2. Configure the Hive Properties:
    • Update the hive-site.xml file with the following properties:
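      A sketch of a commonly needed property; treat the exact property set as an assumption to validate against your deployment:

        <!-- hive-site.xml: use Hudi's combine input format so Merge-on-Read tables are read correctly -->
        <property>
          <name>hive.input.format</name>
          <value>org.apache.hudi.hadoop.hive.HoodieCombineHiveInputFormat</value>
        </property>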
  3. Restart the Hive Services:

    • Restart the Hive services to apply the changes.
  4. Create Hudi Tables:

    • Open the Hive shell and create Hudi tables as needed:
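      A sketch of opening the Hive shell via Beeline; the connection string assumes a local HiveServer2:

        # Connect to HiveServer2 (adjust host and port for your deployment)
        beeline -u jdbc:hive2://localhost:10000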

To perform operations on Hudi tables and use time travel features with Hive, follow these steps:

First, create new Hudi tables with unique names to avoid conflicts with existing tables. The CREATE TABLE statements follow:

Create Hudi Tables

Create Hudi COW Table

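A sketch registering an external Copy-on-Write table over an existing Hudi dataset; the table name, schema, and location are illustrative assumptions:

    CREATE EXTERNAL TABLE hudi_cow_table (
      uuid STRING,
      rider STRING,
      driver STRING,
      fare DOUBLE
    )
    PARTITIONED BY (city STRING)
    ROW FORMAT SERDE 'org.apache.hadoop.hive.ql.io.parquet.serde.ParquetHiveSerDe'
    STORED AS INPUTFORMAT 'org.apache.hudi.hadoop.HoodieParquetInputFormat'
    OUTPUTFORMAT 'org.apache.hadoop.hive.ql.io.parquet.MapredParquetOutputFormat'
    LOCATION '/tmp/hudi/hudi_cow_table';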

Insert Data into Hudi COW Table

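Whether Hive can write to Hudi tables depends on your build's Hudi write support; in many deployments the data is written from Spark and Hive is used for reads. A sketch, assuming writes are supported:

    -- Values are illustrative; assumes Hive-side writes to Hudi are available
    INSERT INTO hudi_cow_table PARTITION (city = 'san_francisco')
    VALUES ('uuid-001', 'rider-A', 'driver-K', 19.10);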

Create Hudi MOR Table

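A sketch of the read-optimized (_ro) view of a Merge-on-Read table, which reads only compacted base files; schema and location remain illustrative:

    CREATE EXTERNAL TABLE hudi_mor_table_ro (
      uuid STRING,
      rider STRING,
      driver STRING,
      fare DOUBLE
    )
    PARTITIONED BY (city STRING)
    ROW FORMAT SERDE 'org.apache.hadoop.hive.ql.io.parquet.serde.ParquetHiveSerDe'
    STORED AS INPUTFORMAT 'org.apache.hudi.hadoop.HoodieParquetInputFormat'
    OUTPUTFORMAT 'org.apache.hadoop.hive.ql.io.parquet.MapredParquetOutputFormat'
    LOCATION '/tmp/hudi/hudi_mor_table';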

Insert Data into Hudi MOR Table

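As with the COW table, Hive-side writes depend on your build's Hudi write support; a sketch, assuming it is available:

    -- Values are illustrative
    INSERT INTO hudi_mor_table_ro PARTITION (city = 'chennai')
    VALUES ('uuid-002', 'rider-J', 'driver-T', 17.85);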

Create Hudi Realtime Table

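A sketch of the realtime (_rt) view over the same Merge-on-Read dataset, which merges base files with pending log files at query time:

    CREATE EXTERNAL TABLE hudi_mor_table_rt (
      uuid STRING,
      rider STRING,
      driver STRING,
      fare DOUBLE
    )
    PARTITIONED BY (city STRING)
    ROW FORMAT SERDE 'org.apache.hadoop.hive.ql.io.parquet.serde.ParquetHiveSerDe'
    STORED AS INPUTFORMAT 'org.apache.hudi.hadoop.realtime.HoodieParquetRealtimeInputFormat'
    OUTPUTFORMAT 'org.apache.hadoop.hive.ql.io.parquet.MapredParquetOutputFormat'
    LOCATION '/tmp/hudi/hudi_mor_table';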

Insert Data into Hudi Realtime Table

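Again a sketch, assuming your Hive build supports Hudi writes:

    -- Values are illustrative
    INSERT INTO hudi_mor_table_rt PARTITION (city = 'sao_paulo')
    VALUES ('uuid-003', 'rider-F', 'driver-P', 34.15);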
  5. Perform Time Travel Queries:

Hive does not natively support querying historical versions of a Hudi table, so use Spark for full time-travel capabilities. From Hive, you can still query the latest data; for specific snapshots, use Spark as shown below.

Query Latest Data (Hive)

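A sketch against the illustrative tables created above:

    -- Latest committed data from the Copy-on-Write table
    SELECT * FROM hudi_cow_table;

    -- Latest merged data (base plus log files) from the Merge-on-Read realtime view
    SELECT * FROM hudi_mor_table_rt;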

Query Specific Snapshot (Using Spark)

If you need to query historical snapshots, use Spark to read from specific commit times or timestamps:

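A sketch in the Spark shell; the path and instant are illustrative:

    // Time-travel read of the Hudi dataset as of a given commit instant
    spark.read.format("hudi").
      option("as.of.instant", "20240101000000").
      load("/tmp/hudi/hudi_cow_table").
      show()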

For more details, see the Hive Metastore documentation.
