Iceberg

Iceberg with Spark

Apache Iceberg is an open table format for huge analytic datasets. Iceberg adds tables to compute engines, including Spark, Trino, PrestoDB, Flink, Hive, and Impala, using a high-performance table format that works just like an SQL table.

User Experience

Iceberg avoids unpleasant surprises. Schema evolution works and won't inadvertently un-delete data. Users don't need to know about partitioning to get fast queries.

  • Schema evolution: Supports adding, dropping, updating, and renaming columns without unintended consequences.
  • Hidden partitioning: Prevents user errors that can lead to silently incorrect results or dramatically slow queries.
  • Partition layout evolution: Updates the layout of a table as data volume or query patterns change.
  • Time travel: Enables reproducible queries that use the same table snapshot or lets users easily examine changes.
  • Version rollback: Allows users to quickly correct problems by resetting tables to a good state.

Iceberg support by Spark version:

| Spark Version | Lifecycle Stage | Initial Iceberg Support | Latest Iceberg Support | Latest Runtime Jar |
| --- | --- | --- | --- | --- |
| 2.4 | End of Life | 0.7.0-incubating | 1.2.1 | iceberg-spark-runtime-2.4 |
| 3.0 | End of Life | 0.9.0 | 1.0.0 | iceberg-spark-runtime-3.0_2.12 |
| 3.1 | End of Life | 0.12.0 | 1.3.1 | iceberg-spark-runtime-3.1_2.12 [1] |
| 3.2 | End of Life | 0.13.0 | 1.4.3 | iceberg-spark-runtime-3.2_2.12 |
| 3.3 | Maintained | 0.14.0 | 1.5.0 | iceberg-spark-runtime-3.3_2.12 |
| 3.4 | Maintained | 1.3.0 | 1.5.0 | iceberg-spark-runtime-3.4_2.12 |
| 3.5 | Maintained | 1.4.0 | 1.6.0 | iceberg-spark-runtime-3.5_2.12 |

Spark Shell

Start the Spark shell with necessary configurations for Apache Iceberg integration.

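A minimal sketch of such an invocation, assembled in Python so that each option is visible. The runtime artifact and version come from the Spark 3.5 row of the support table; the catalog name local, the hadoop catalog type, and the warehouse path are illustrative assumptions, not settings taken from this page.

```python
import shlex

# Assumptions: Spark 3.5 with Scala 2.12 and Iceberg 1.6.0 (the 3.5 row of the
# support table). The catalog name "local" and the warehouse path are examples.
ICEBERG_VERSION = "1.6.0"
RUNTIME = f"org.apache.iceberg:iceberg-spark-runtime-3.5_2.12:{ICEBERG_VERSION}"


def spark_shell_command(catalog: str = "local", warehouse: str = "/tmp/warehouse") -> list[str]:
    """Build the argument list for a spark-shell session with one Iceberg catalog."""
    conf = {
        # Loads Iceberg's SQL extensions into the Spark session.
        "spark.sql.extensions": "org.apache.iceberg.spark.extensions.IcebergSparkSessionExtensions",
        # Registers a catalog whose tables live under the warehouse directory.
        f"spark.sql.catalog.{catalog}": "org.apache.iceberg.spark.SparkCatalog",
        f"spark.sql.catalog.{catalog}.type": "hadoop",
        f"spark.sql.catalog.{catalog}.warehouse": warehouse,
    }
    args = ["spark-shell", "--packages", RUNTIME]
    for key, value in conf.items():
        args += ["--conf", f"{key}={value}"]
    return args


command = spark_shell_command()
print(shlex.join(command))  # a single line that can be pasted into a terminal
```

Swapping the runtime artifact for another row of the table is the only change needed to target a different Spark version, provided the Scala suffix matches the Spark build.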

Import Statements

Include the necessary Apache Iceberg and Spark libraries to enable data operations.

Create a Table, Insert Data, and Query Data

Example code to create an Iceberg table, insert records, and perform a query to retrieve data.

Update and Read Data

Demonstrate updating records in an Iceberg table and reading the updated data.

Merge and Read Data

Show how to merge data from one Iceberg table into another and read the merged data.

Delete Data

Example code for deleting specific records from an Iceberg table.

Read Historical Snapshots

Retrieve data from historical snapshots of an Iceberg table using specific timestamps.

Time Travel Query

Query data from different points in time using Iceberg’s time travel capability to access historical table states.

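As a rough illustration of what a timestamp-based time travel read resolves to, the sketch below picks the snapshot that was current at a given time. The Snapshot class and the snapshot_as_of function are hypothetical names for this example and are not part of the Iceberg API; in recent Spark versions the corresponding SQL clauses are TIMESTAMP AS OF and VERSION AS OF.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class Snapshot:
    snapshot_id: int
    committed_at_ms: int  # commit time in milliseconds since the epoch


def snapshot_as_of(history: list[Snapshot], timestamp_ms: int) -> Optional[Snapshot]:
    """Return the snapshot that was current at timestamp_ms.

    Returns None when the requested time is earlier than the first commit.
    """
    candidates = [s for s in history if s.committed_at_ms <= timestamp_ms]
    return max(candidates, key=lambda s: s.committed_at_ms, default=None)


# Three commits; a read "as of" t=2500 should see the second one.
history = [Snapshot(101, 1_000), Snapshot(102, 2_000), Snapshot(103, 3_000)]
as_of = snapshot_as_of(history, 2_500)
print(as_of)
```

Because a snapshot is immutable once committed, repeating the same query against the same snapshot returns the same rows, which is what makes time travel reads reproducible.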

For more details, see the Apache Iceberg Getting Started guide.

Iceberg with Hive

| Feature Support | Hive 2 / 3 | Hive 4 |
| --- | --- | --- |
| SQL create table | ✔️ | ✔️ |
| SQL create table as select (CTAS) | ✔️ | ✔️ |
| SQL create table like table (CTLT) | ✔️ | ✔️ |
| SQL drop table | ✔️ | ✔️ |
| SQL insert into | ✔️ | ✔️ |
| SQL insert overwrite | ✔️ | ✔️ |
| SQL delete from | | ✔️ |
| SQL update | | ✔️ |
| SQL merge into | | ✔️ |
| Branches and tags | | ✔️ |

Custom Iceberg catalog

To globally register different catalogs, set the following Hadoop configurations:

| Configuration Key | Description |
| --- | --- |
| iceberg.catalog.<catalog_name>.type | type of catalog: hive, hadoop, or left unset if using a custom catalog |
| iceberg.catalog.<catalog_name>.catalog-impl | catalog implementation, must not be null if type is empty |
| iceberg.catalog.<catalog_name>.<key> | any config key and value pairs for the catalog |

Here are some examples using the Hive CLI:

Register a HiveCatalog:

Register a HadoopCatalog:

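The key naming rule for catalog configuration can be sketched as a small Python helper that renders Hive CLI SET statements. The helper name catalog_settings, the catalog names another_hive and hadoop, and the example.com addresses are placeholders chosen for illustration.

```python
def catalog_settings(name: str, catalog_type: str = "", catalog_impl: str = "",
                     **properties: str) -> list[str]:
    """Render Hive CLI SET statements that register an Iceberg catalog called `name`."""
    if not catalog_type and not catalog_impl:
        # Mirrors the documented rule: catalog-impl must not be null if type is empty.
        raise ValueError("catalog-impl must be set when type is empty")
    prefix = f"iceberg.catalog.{name}"
    settings = {}
    if catalog_type:
        settings[f"{prefix}.type"] = catalog_type
    if catalog_impl:
        settings[f"{prefix}.catalog-impl"] = catalog_impl
    # Any further key/value pairs are passed through under the catalog prefix.
    for key, value in properties.items():
        settings[f"{prefix}.{key}"] = value
    return [f"SET {key}={value};" for key, value in settings.items()]


hive_catalog = catalog_settings(
    "another_hive", "hive",
    uri="thrift://example.com:9083",
    warehouse="hdfs://example.com:8020/warehouse",
)
hadoop_catalog = catalog_settings(
    "hadoop", "hadoop",
    warehouse="hdfs://example.com:8020/warehouse",
)
print("\n".join(hive_catalog + hadoop_catalog))
```

Each printed line is a statement that could be entered in the Hive CLI; the same keys can instead be placed in a Hadoop configuration file.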

Feature Support

The feature table at the start of this section shows which operations each Hive release supports for Iceberg tables.

To enable Hive support globally for an application, set iceberg.engine.hive.enabled=true in its Hadoop configuration. For example, setting this in the hive-site.xml loaded by Spark will enable the storage handler for all tables created by Spark.

Create Table

Non-partitioned tables

The Hive CREATE EXTERNAL TABLE command creates an Iceberg table when you specify the storage handler as follows:

You can specify the default file format (Avro, Parquet, or ORC) when creating the table. The default is Parquet:

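The shape of these statements can be sketched with a small Python helper that renders the DDL text. The storage handler class is the one shipped with Iceberg's Hive integration; the helper name create_table_sql and the table x are hypothetical, and using the Iceberg table property write.format.default to choose the file format is an assumption about how the format is passed in this setup.

```python
STORAGE_HANDLER = "org.apache.iceberg.mr.hive.HiveIcebergStorageHandler"
FILE_FORMATS = {"avro", "parquet", "orc"}


def create_table_sql(table: str, columns: dict[str, str], file_format: str = "parquet") -> str:
    """Render a CREATE EXTERNAL TABLE statement for a non-partitioned Iceberg table."""
    fmt = file_format.lower()
    if fmt not in FILE_FORMATS:
        raise ValueError(f"unsupported file format: {file_format}")
    column_list = ", ".join(f"{name} {hive_type}" for name, hive_type in columns.items())
    statement = f"CREATE EXTERNAL TABLE {table} ({column_list}) STORED BY '{STORAGE_HANDLER}'"
    if fmt != "parquet":
        # Parquet is the default, so the property is only emitted for other formats.
        # Assumption: the format is set through an Iceberg table property.
        statement += f" TBLPROPERTIES ('write.format.default'='{fmt}')"
    return statement + ";"


default_ddl = create_table_sql("x", {"i": "int"})
orc_ddl = create_table_sql("x", {"i": "int"}, file_format="ORC")
print(default_ddl)
print(orc_ddl)
```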

Partitioned tables

You can create Iceberg partitioned tables using a command familiar to those who create non-Iceberg tables:

The resulting table does not create partitions in HMS but instead converts partition data into Iceberg identity partitions.

Use the DESCRIBE command to get information about the Iceberg identity partitions:

Create Table as Select

The CREATE TABLE AS SELECT operation resembles the native Hive operation with a single important difference. The Iceberg table and the corresponding Hive table are created at the beginning of the query execution, but the data is inserted and committed only when the query finishes. For a transient period, therefore, the table already exists but contains no data.

Create Table like Table

With the Iceberg storage handler, CREATE TABLE LIKE creates a new table named target that has the same schema and properties as an existing table named source.

Drop Table

Tables can be dropped using the DROP TABLE command:

Insert Into

Hive supports the standard single-table INSERT INTO operation:

Multi-table inserts are also supported, but they are not atomic. Commits occur one table at a time. Partial changes are visible during the commit process, and failures can leave partial changes committed. Changes within a single table remain atomic.

INSERT INTO operations on branches work like the table-level operations. However, the branch must be provided as follows:

Here is an example of inserting into multiple tables at once in Hive SQL:

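The non-atomic behavior of a multi-table insert can be illustrated with a toy model in Python. The Table class and multi_table_insert function are hypothetical and only mimic the commit ordering; they do not call Hive or Iceberg.

```python
class Table:
    """Toy table whose commit applies all staged rows at once, or none of them."""

    def __init__(self, name: str, fail_on_commit: bool = False):
        self.name = name
        self.rows: list[tuple] = []
        self.fail_on_commit = fail_on_commit

    def commit(self, staged: list[tuple]) -> None:
        if self.fail_on_commit:
            raise RuntimeError(f"commit to {self.name} failed")
        # One swap to the new state: a single table's change stays atomic.
        self.rows = self.rows + staged


def multi_table_insert(plan: list[tuple[Table, list[tuple]]]) -> None:
    """Commit each table's rows in turn; earlier commits are not rolled back."""
    for table, staged in plan:
        table.commit(staged)


target1 = Table("target1")
target2 = Table("target2", fail_on_commit=True)
try:
    multi_table_insert([(target1, [(1, "Ann")]), (target2, [("Lee", 1)])])
except RuntimeError as error:
    print(error)

# target1 keeps its committed rows even though the second commit failed.
print(target1.rows, target2.rows)
```

Readers that need both tables to change together should treat the two targets as independently updated and verify each one after the statement completes.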

Insert Into Partition

Hive supports partition-level INSERT INTO operation:

The partition specification supports only identity-partition columns. Transform columns in partition specification are not supported.

Insert Overwrite

INSERT OVERWRITE replaces data in the table with the result of a query. Overwrites are atomic operations for Iceberg tables. For non-partitioned tables, the existing content of the table is always removed. For partitioned tables, only the partitions that receive rows from the SELECT query are replaced.

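The difference between the two cases can be sketched as a toy model in Python. The insert_overwrite function is a hypothetical illustration that works on in-memory rows and a single identity partition column; it is not Iceberg code.

```python
from typing import Optional

Row = dict


def insert_overwrite(table_rows: list[Row], new_rows: list[Row],
                     partition_column: Optional[str] = None) -> list[Row]:
    """Return the table contents after an INSERT OVERWRITE of new_rows."""
    if partition_column is None:
        # Non-partitioned table: the existing content is always removed.
        return list(new_rows)
    # Partitioned table: only partitions that receive new rows are replaced.
    replaced = {row[partition_column] for row in new_rows}
    kept = [row for row in table_rows if row[partition_column] not in replaced]
    return kept + list(new_rows)


table = [{"day": "mon", "v": 1}, {"day": "tue", "v": 2}]
query_result = [{"day": "tue", "v": 9}]

partitioned = insert_overwrite(table, query_result, partition_column="day")
unpartitioned = insert_overwrite(table, query_result)
print(partitioned)
print(unpartitioned)
```

In the partitioned case the mon partition survives because the query produced no rows for it, while the same statement on a non-partitioned table leaves only the query result.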

Insert Overwrite Partition

Hive supports partition-level INSERT OVERWRITE operation:

The partition specification supports only identity-partition columns. Transform columns in partition specification are not supported.

Delete From

Hive supports DELETE FROM queries to remove data from tables.

Delete queries accept a filter to match rows to delete.

If the delete filter matches entire partitions of the table, Iceberg will perform a metadata-only delete. If the filter matches individual rows of a table, then Iceberg will rewrite only the affected data files.
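
The two delete paths can be sketched as a toy model in Python. The delete_from function is hypothetical and treats each partition as one unit of data files, which is a simplification; it only classifies what a delete filter would do.

```python
from typing import Callable

Row = dict
Partitions = dict[str, list[Row]]


def delete_from(partitions: Partitions,
                matches: Callable[[Row], bool]) -> tuple[Partitions, list[str], list[str]]:
    """Apply a delete filter and report dropped and rewritten partitions."""
    result: Partitions = {}
    dropped: list[str] = []
    rewritten: list[str] = []
    for key, rows in partitions.items():
        remaining = [row for row in rows if not matches(row)]
        if not remaining:
            dropped.append(key)      # whole partition matched: metadata-only delete
        elif len(remaining) < len(rows):
            rewritten.append(key)    # some rows matched: affected data is rewritten
            result[key] = remaining
        else:
            result[key] = rows       # no rows matched: left untouched
    return result, dropped, rewritten


partitions = {
    "2024-01": [{"id": 1}, {"id": 2}],
    "2024-02": [{"id": 3}, {"id": 4}],
    "2024-03": [{"id": 5}],
}
after, dropped, rewritten = delete_from(partitions, lambda row: row["id"] <= 3)
print(dropped, rewritten)
```

Filters that line up with partition boundaries are therefore the cheapest deletes, because no data files need to be read or written.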

Update

Hive supports UPDATE queries which accept a filter to match rows to update.

For more complex row-level updates based on incoming data, see the section on MERGE INTO.

Merge Into

Hive supports MERGE INTO queries, which can express row-level updates.

MERGE INTO updates a table, called the target table, using a set of updates from another query, called the source. The update for a row in the target table is found using the ON clause, which works like a join condition.

Updates to rows in the target table are listed using WHEN MATCHED ... THEN .... Multiple MATCHED clauses can be added with conditions that determine when each match should be applied. The first matching expression is used.

Source rows (updates) that do not match can be inserted:

Only one record in the source data can update any given row of the target table, or else an error will be thrown.
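
The MERGE INTO rules described in this section (the ON condition, first matching WHEN MATCHED clause, inserting unmatched source rows, and the single-source-row restriction) can be sketched as a toy model in Python. The merge_into function and its parameters are hypothetical and operate on in-memory rows, not on Hive tables.

```python
from typing import Callable, Optional

Row = dict
# A WHEN MATCHED clause: (condition, action). The action returns the new
# target row, or None to delete the target row.
MatchedClause = tuple[Callable[[Row, Row], bool], Callable[[Row, Row], Optional[Row]]]


def merge_into(target: list[Row], source: list[Row], on: Callable[[Row, Row], bool],
               when_matched: list[MatchedClause],
               when_not_matched_insert: Optional[Callable[[Row], Row]] = None) -> list[Row]:
    """Return the target rows after merging the source rows into them."""
    result: list[Row] = []
    matched_sources: set[int] = set()
    for t in target:
        hits = [i for i, s in enumerate(source) if on(t, s)]
        if len(hits) > 1:
            raise ValueError("more than one source row matches a target row")
        if not hits:
            result.append(t)
            continue
        s = source[hits[0]]
        matched_sources.add(hits[0])
        new_row: Optional[Row] = t
        for condition, action in when_matched:
            if condition(t, s):
                new_row = action(t, s)  # the first matching clause is used
                break
        if new_row is not None:
            result.append(new_row)
    if when_not_matched_insert is not None:
        result += [when_not_matched_insert(s)
                   for i, s in enumerate(source) if i not in matched_sources]
    return result


target = [{"id": 1, "count": 10}, {"id": 2, "count": 20}]
source = [
    {"id": 1, "op": "delete", "count": 0},
    {"id": 2, "op": "increment", "count": 5},
    {"id": 3, "op": "increment", "count": 7},
]
clauses: list[MatchedClause] = [
    (lambda t, s: s["op"] == "delete", lambda t, s: None),
    (lambda t, s: s["op"] == "increment", lambda t, s: {**t, "count": t["count"] + s["count"]}),
]
merged = merge_into(target, source, lambda t, s: t["id"] == s["id"], clauses,
                    when_not_matched_insert=lambda s: {"id": s["id"], "count": s["count"]})
print(merged)
```

Here the row with id 1 is deleted, the row with id 2 is incremented, and the source row with id 3 is inserted because it matches no target row.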

Hive Table Operations using Iceberg

Creating an Iceberg Table

This walkthrough uses a table named iceberg_table, created in Hive with the Iceberg storage format.

Inserting Data into the Iceberg Table

Rows are then inserted into iceberg_table.

Describing the Iceberg Table

To view the schema of iceberg_table, use the DESCRIBE command.

Summary of Commands

1. Querying the Table

The query ran successfully and returned the expected rows.

2. Updating the Table

The update operation succeeded and modified the matching rows.

3. Deleting from the Table

The delete operation ran, but no rows appear to have been affected.

4. Describing the Table Formatted

The command successfully described the table and its metadata.

5. Showing Extended Table Information

6. Showing Functions
Time Travel Queries

Since DESCRIBE HISTORY and the snapshots and current_snapshot tables aren’t available, here are alternative steps to check time travel functionality if supported by your Hive/Iceberg setup:

  1. Create Snapshot: Insert data, which creates a new snapshot.
  2. Query Specific Snapshot: If you have snapshots, you can specify timestamps or snapshot IDs in your queries to check historical data.
  3. Check Table Versions: If available, you can check table versions directly.

For more details, see Hive - Apache Iceberg.
