Configure Apache Hive with S3A (applies only to 3.3.6.1-101)

This document describes how Apache Hive supports secure, session-level access to Amazon S3 using S3A without requiring cluster-wide configuration or service restarts. It explains the limitations of traditional static credential management and introduces a JCEKS-based credential provider approach to safely enable DDL and DML operations.

The document outlines current constraints around multi-bucket access, Spark, Impala, and Hive Metastore integration. It also summarizes supported stack versions, known limitations, and recommended workarounds across the Hive ecosystem.

Open Source Pull Request Reference

Problem Statements

Apache Hive integration with S3A presents the following challenges:

  1. Accessing a single S3 bucket at the session level
  2. Accessing multiple S3 buckets within a single session
  3. Providing fine-grained access control for a specific path within an S3 bucket

This document primarily addresses Problem Statement 1 and outlines the current limitations related to Problem Statement 2.

Existing Configuration Challenges

To access S3A storage, the following configurations must traditionally be defined in Ambari → HDFS → core-site.xml:

  • fs.s3a.endpoint
  • fs.s3a.access.key
  • fs.s3a.secret.key

Drawbacks of This Approach

  • Only one S3 bucket can be configured for the entire cluster.
  • All users gain access to the configured bucket.
  • Any configuration change requires a service restart, which is unsuitable for production environments.

Session-Level Credential Limitations

If a user attempts to define S3A credentials at the session level, Data Definition Language (DDL) operations fail because the Hive Metastore (HMS) does not receive these credentials.

This occurs because HMS operates independently from the HiveServer2 (HS2) session and lacks access to runtime credentials.

Introduced Improvements

To address this issue, the following Hive enhancements were introduced:

  • HIVE-16913
  • HIVE-28272

With the solution proposed in HIVE-28272, S3A-related configurations are exposed via metaconf. This enables users to define S3A credentials using Beeline, allowing successful execution of DDL operations.
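For illustration, a session could then supply the credentials directly from Beeline. The use of the metaconf: prefix for these specific keys is an assumption about how HIVE-28272 exposes them, and the values are placeholders:

  -- illustrative only; exact metaconf key handling depends on HIVE-28272
  set metaconf:fs.s3a.access.key=<access-key>;
  set metaconf:fs.s3a.secret.key=<secret-key>;
  set metaconf:fs.s3a.endpoint=<s3-endpoint>;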

Remaining Issue with DML Operations

During INSERT (DML) operations, queries fail because:

  • Tez containers cannot create staging directories in S3A table locations.
  • S3A credentials are not propagated from HS2 to Tez.
  • Credentials such as access key and secret key are part of the hidden configuration list.
  • DAGUtils clears these sensitive configurations before passing them to the job configuration.

Removing S3A credentials from the hidden configuration list allows both DDL and DML operations to succeed. However, this approach introduces a critical security risk, as credentials become visible in:

  • Tez View
  • Tez DAGs
  • Hive protocol logging

This approach is strongly discouraged.

Proposed Solution

Based on the Hadoop AWS module documentation, the recommended and secure approach is to use Hadoop Credential Providers (JCEKS) to store and propagate S3A credentials.

Create Hadoop Credential File

Execute the following commands to create credentials securely:

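A minimal sketch based on the Hadoop credential provider commands from the Hadoop AWS documentation; the NameNode host, JCEKS path, and key values are placeholders:

  # store the S3A access key and secret key in a JCEKS file on HDFS (placeholders)
  hadoop credential create fs.s3a.access.key -value <access-key> \
      -provider jceks://hdfs@<namenode-host>:8020/user/hive/s3.jceks
  hadoop credential create fs.s3a.secret.key -value <secret-key> \
      -provider jceks://hdfs@<namenode-host>:8020/user/hive/s3.jceks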

Credential files can be stored on any Hadoop-supported filesystem. File permissions are automatically restricted to the creator. However, directory permissions must be manually verified to ensure restricted access.

Configure Session-Level Access in Beeline

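A sketch of the session-level settings, assuming the JCEKS file created above; the provider path and endpoint are placeholders, and the metaconf: form is shown only as an assumed way of forwarding the value to HMS:

  -- session-level settings in Beeline (placeholders; metaconf usage is an assumption)
  set fs.s3a.security.credential.provider.path=jceks://hdfs@<namenode-host>:8020/user/hive/s3.jceks;
  set metaconf:fs.s3a.security.credential.provider.path=jceks://hdfs@<namenode-host>:8020/user/hive/s3.jceks;
  set fs.s3a.endpoint=<s3-endpoint>;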

The property hadoop.security.credential.provider.path is global to all filesystems and secrets. There is another property, fs.s3a.security.credential.provider.path, which lists credential providers only for S3A filesystems. The two properties are combined into one, with the list of providers in the fs.s3a. property taking precedence over that of the hadoop.security list (i.e., they are prepended to the common list).

This approach ensures:

  • Credentials are never exposed in plaintext.
  • No service restart is required.
  • Access is limited to authorized users.

Validated Steps and Example

ODP Hive Configuration Requirements

S3 access key, secret key, and endpoint configurations must not be defined statically in Hive or HDFS configuration files.

To allow runtime configuration changes, include the following property in hive-site.xml:

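One commonly used property for this purpose, shown purely as an illustration (the exact setting may differ per deployment), is the SQL-standard authorization whitelist append, which permits the S3A and credential-provider keys to be set per session:

  # assumption: allow these keys to be SET at the session level
  hive.security.authorization.sqlstd.confwhitelist.append=fs\.s3a\..*|hadoop\.security\.credential\.provider\.path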

When Hadoop credentials are created and stored on HDFS or a Unix file system, file permissions are automatically set so that only the creating user can read the file. However, directory permissions are not modified, so users must verify that the containing directory is readable exclusively by the current user.

Credential Creation Example

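A sketch of the credential creation, assuming a local JCEKS file under /etc/hive/conf (as used in the Beeline session below); the file name, key values, and ownership are placeholders:

  # create the JCEKS file on the local filesystem of the HS2/HMS host (placeholders)
  hadoop credential create fs.s3a.access.key -value <access-key> \
      -provider localjceks://file/etc/hive/conf/s3.jceks
  hadoop credential create fs.s3a.secret.key -value <secret-key> \
      -provider localjceks://file/etc/hive/conf/s3.jceks
  # restrict access to the hive user and verify the directory permissions
  chown hive:hadoop /etc/hive/conf/s3.jceks
  chmod 600 /etc/hive/conf/s3.jceks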

Verify path ownership and permissions based on user requirements.

Example Beeline Session

The S3 bucket used in this example is odp-ranger-test, and the subsequent commands are executed via Beeline. The hive user initiates the session because it has the requisite access to the local JCEKS file in the Hive configuration directory at /etc/hive/conf. Upon initialization, the initial session configuration persists and is not overridden, because the warehouse configuration is loaded only once.

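A sketch of such a session, assuming the local JCEKS file above; the HS2 host, endpoint, and table definition are placeholders, and the metaconf: form is again an assumption:

  # connect as the hive user (placeholder connection string)
  beeline -u "jdbc:hive2://<hs2-host>:10000/default" -n hive

  -- session-level S3A settings (placeholders)
  set fs.s3a.security.credential.provider.path=localjceks://file/etc/hive/conf/s3.jceks;
  set metaconf:fs.s3a.security.credential.provider.path=localjceks://file/etc/hive/conf/s3.jceks;
  set fs.s3a.endpoint=<s3-endpoint>;

  -- DDL and DML against the example bucket
  CREATE EXTERNAL TABLE s3_demo (id INT, name STRING)
  LOCATION 's3a://odp-ranger-test/warehouse/s3_demo';
  INSERT INTO s3_demo VALUES (1, 'one');
  SELECT * FROM s3_demo;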

Current Limitation – Accessing Multiple Buckets in a Session

Hadoop AWS supports per-bucket credential providers, for example:

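For instance, following the per-bucket configuration pattern (bucket names and provider paths are placeholders):

  # per-bucket credential providers (placeholders)
  fs.s3a.bucket.odp-ranger-test.security.credential.provider.path=jceks://hdfs@<namenode-host>:8020/user/hive/bucket1.jceks
  fs.s3a.bucket.<second-bucket>.security.credential.provider.path=jceks://hdfs@<namenode-host>:8020/user/hive/bucket2.jceks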

However, Hive does not currently allow dynamic creation of metaconf keys unless they are explicitly defined in MetastoreConf.java.

As a result:

  • Per-bucket configuration keys cannot be set dynamically.
  • Accessing multiple buckets within a single session is not supported.

To make this work, metaconf must be allowed to create configuration keys that start with the fs.s3a.bucket prefix at runtime; otherwise, a user cannot access multiple buckets simultaneously within a session. Supporting multi-bucket access in a single session therefore requires additional work, which is being tracked separately.

HDFS Support

Starting with Hadoop 3.3, S3A delegation token support (HADOOP-14556) allows S3A properties to be defined at runtime without requiring service restarts.

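For example, the S3A settings can be supplied per command as generic options instead of being stored in core-site.xml; the provider path, endpoint, and bucket are placeholders:

  # list an S3A bucket using runtime-supplied configuration (placeholders)
  hdfs dfs \
      -D hadoop.security.credential.provider.path=jceks://hdfs@<namenode-host>:8020/user/hive/s3.jceks \
      -D fs.s3a.endpoint=<s3-endpoint> \
      -ls s3a://odp-ranger-test/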

Spark S3 Support

Spark 3 does not support propagating Hive Metastore (metaconf) properties from the session to HMS.

As a result:

  • Session-level S3A configuration updates are ineffective for HMS-backed operations.
  • File-level reads and writes work.
  • DDL and DML operations requiring HMS interaction fail after data is written.

Impala Support

Session-level credential updates are not supported in Impala.

Reference: https://impala.apache.org/docs/build/html/topics/impala_s3.html

Ranger S3 Support

Addressing Problem Statement 3 (fine-grained access control at the path level) requires S3 governance solutions such as Apache Ranger.

This capability is currently under internal evaluation.

Matrix Support

To implement static configurations, please refer to the standard documentation and incorporate the following configuration into the Ambari Hive configuration:

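A typical change of this kind, shown purely as an assumed illustration, removes the S3A keys from Hive's hidden configuration list so that they can be set and displayed as static values:

  # assumption: override hive.conf.hidden.list so the S3A keys are no longer hidden
  hive.conf.hidden.list=<default list with fs.s3a.access.key and fs.s3a.secret.key removed>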

Caution: This configuration modification is insecure, as it permits configurations containing sensitive information (secrets) to be displayed in plain text.

Mandatory Configurations to Be Incorporated into the HDFS, Hive, Impala, Tez, and Spark3 XML Configurations

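A minimal static setup would typically define at least the following keys in each service's configuration; the values are placeholders and the exact list may vary by deployment:

  fs.s3a.endpoint=<s3-endpoint>
  fs.s3a.access.key=<access-key>
  fs.s3a.secret.key=<secret-key>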

The primary distinction between session-level and static configuration lies in their usability for Spark operations on non-ACID tables, because Hive Metastore configurations cannot be propagated from a Spark session, as detailed in the Spark S3 Support section above.

S3 session-level credential support and the matrix below are valid for the ODP-3.2 (v3.2.3.5-2 onwards) and ODP-3.3 (v3.3.6.3-101 onwards) stacks.

| Operation(s) | Hive (JDBC/Beeline) | Spark (Direct) without HWC | Spark with HWC * | Remarks |
|---|---|---|---|---|
| Create Table (Partitioned & Unpartitioned) |  |  |  | Create the table using Hive, as it is a metastore operation |
| Insert into/overwrite (Partitioned & Unpartitioned), static or dynamic |  |  |  | Create the table using Beeline, but insert data using Spark DataFrame operations instead of INSERT SQL statements. Run msck repair in Beeline to view new or updated partition results. |
| show partitions / msck repair |  |  |  | Metastore operation |
| Alter table |  |  |  | Metastore operation |
| Drop table |  |  |  | Metastore operation |
| Alter partition |  |  |  | Metastore operation |
| Drop partition |  |  |  | Metastore operation |
| Update ACID table (Partitioned & Unpartitioned) |  |  |  | ACID operation |
| Delete ACID table (Partitioned & Unpartitioned) |  |  |  | ACID operation |

Spark + Hive Integration (Write) - Known Limitation

INSERT via Spark SQL: Failure

S3 credential propagation to the Hive Metastore Server (HMS) is unsuccessful.

Issue Description: When an INSERT operation is submitted via Spark, the Hive Metastore Server (HMS) attempts to validate the S3A location. However, the HMS lacks the necessary S3A credentials that are present in the Spark session. This represents a known limitation when employing session-level S3A configurations.

Recommended Workaround: Utilize the DataFrame API in lieu of SQL INSERT statements.

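A sketch of the workaround, assuming a spark-shell session and the local JCEKS file from the earlier example; the data, table location, and paths are placeholders:

  # launch spark-shell with the credential provider (placeholder path)
  spark-shell --conf spark.hadoop.hadoop.security.credential.provider.path=localjceks://file/etc/hive/conf/s3.jceks

  // inside spark-shell: write with the DataFrame API instead of an INSERT SQL statement
  val df = Seq((1, "one"), (2, "two")).toDF("id", "name")
  df.write.mode("append").parquet("s3a://odp-ranger-test/warehouse/s3_demo")

For partitioned tables, run msck repair from Beeline afterwards so that HMS registers the newly written data, as noted in the matrix above.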

Please utilize the following Spark configuration for establishing a connection with the Hive Warehouse Connector (HWC) using the prepared JCEKS file:

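A sketch of such a launch, assuming an HWC assembly jar and the JCEKS file above; the jar path and version, JDBC URL, metastore URI, and endpoint are placeholders:

  spark-shell \
      --jars <path-to>/hive-warehouse-connector-assembly-<version>.jar \
      --conf spark.sql.hive.hiveserver2.jdbc.url="jdbc:hive2://<hs2-host>:10000/default" \
      --conf spark.datasource.hive.warehouse.metastoreUri="thrift://<hms-host>:9083" \
      --conf spark.datasource.hive.warehouse.load.staging.dir=/tmp \
      --conf spark.hadoop.hadoop.security.credential.provider.path=localjceks://file/etc/hive/conf/s3.jceks \
      --conf spark.hadoop.fs.s3a.endpoint=<s3-endpoint>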

Minimum Beeline configurations required to utilize the aforementioned operations:

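A sketch, assuming the same local JCEKS file; the provider path and endpoint are placeholders:

  -- minimal session-level settings (placeholders)
  set fs.s3a.security.credential.provider.path=localjceks://file/etc/hive/conf/s3.jceks;
  set metaconf:fs.s3a.security.credential.provider.path=localjceks://file/etc/hive/conf/s3.jceks;
  set fs.s3a.endpoint=<s3-endpoint>;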