Configure ODP with Ceph S3 Using S3A

This document explains how to configure ODP to use Ceph RGW object storage through Hadoop's S3A connector for HDFS, Hive, Spark3, Impala, and Trino, and how to troubleshoot common issues.

Integrating ODP with Ceph RGW Object Storage over the S3A protocol lets you manage data on open-source object storage. This setup uses Hadoop's native S3A connector and requires no additional plugins.

This guide applies only to ODP version 3.3.6.3-101; it is not applicable to version 3.3.6.3-1.

For erasure coding (EC), Ceph uses the Jerasure plugin by default but also supports other accelerator plugins such as ISA-L. Choose the accelerator that best fits your EC deployment.

The following sections cover the prerequisites, user setup, configuration, and troubleshooting steps.

Prerequisites

  • Ceph RGW endpoint URL and valid access/secret keys.
  • Network connectivity from ODP components to the Ceph RGW endpoint (HTTP/HTTPS).
  • Administrative access to Hadoop, Hive, and Trino configuration directories.

Create Ceph/IAM User

  • For ODP-Ceph integration, there are two distinct user types: the Ceph native user and the IAM user.

  • Both user types are required:

    • The Ceph native user bootstraps initial access.
    • The IAM user enables Ranger policy enforcement.
  • Understanding the difference is important before proceeding.

| User Type | Created via | Used for | Recognized by Ranger |
| --- | --- | --- | --- |
| Ceph native user | `radosgw-admin user create` | Bootstrapping initial IAM API access | ❌ No |
| IAM user | `iam create-user` (AWS CLI) | Ranger policy enforcement via the RGW IAM API | ✅ Yes |

Create Ceph RGW Admin User

  1. Create a new IAM account.
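The original command block was not preserved here. As a sketch, on a Ceph release whose RGW supports IAM accounts (Squid or later), an account can be created with `radosgw-admin`; the account name and email below are illustrative:

```bash
# Assumption: Ceph Squid (19.x) or later, where RGW supports IAM accounts.
# Account name and email are illustrative placeholders.
radosgw-admin account create \
    --account-name="odp-account" \
    --email="odp-admin@example.com"
```

The command prints the new account's details, including an account ID used in the next step.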
  2. Create a Ceph admin user under the IAM account.
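The exact command was not preserved. A minimal sketch, assuming the account ID returned by the previous step and illustrative user names:

```bash
# Assumption: the --account-id value is the one returned by "account create".
# User ID and display name are illustrative.
radosgw-admin user create \
    --uid="ceph-admin" \
    --display-name="Ceph Admin" \
    --account-id="<account-id>" \
    --account-root \
    --gen-access-key --gen-secret
```

The output includes the generated Access Key ID and Secret Access Key.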

Make sure to copy and store both the Access Key ID and Secret Access Key in a secure location.

  3. Register the Ceph admin user's credentials with the AWS CLI.
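The original block was stripped during extraction. A sketch using `aws configure set`, with the keys from the previous step; the profile name `ceph-admin` is an illustrative assumption:

```bash
# Store the Ceph admin user's keys under a named AWS CLI profile
# (profile name is illustrative).
aws configure set aws_access_key_id     <Access-Key-ID>     --profile ceph-admin
aws configure set aws_secret_access_key <Secret-Access-Key> --profile ceph-admin
aws configure set region                default             --profile ceph-admin
```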

Create IAM Admin User

  1. Create the IAM admin user with the AWS CLI.
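The command block is missing from this page. A sketch against the RGW endpoint used elsewhere in this document; the user and profile names are illustrative assumptions:

```bash
# Create the IAM user and its access keys against the RGW IAM endpoint.
# User name and profile name are illustrative.
aws iam create-user \
    --user-name odp-iam-admin \
    --endpoint-url http://10.100.11.65:8070 \
    --profile ceph-admin

aws iam create-access-key \
    --user-name odp-iam-admin \
    --endpoint-url http://10.100.11.65:8070 \
    --profile ceph-admin
```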

Make sure to copy and store both the Access Key ID and Secret Access Key in a secure location.

  2. Assign S3/IAM permissions to the IAM admin user.
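The original policy block was not preserved. One way to grant the needed permissions is an inline policy via `aws iam put-user-policy`; the policy below is an illustrative broad grant and should be narrowed for production:

```bash
# Attach an inline policy granting S3 and IAM access
# (user name, policy name, and policy document are illustrative).
aws iam put-user-policy \
    --user-name odp-iam-admin \
    --policy-name odp-s3-iam-full \
    --policy-document '{
      "Version": "2012-10-17",
      "Statement": [
        { "Effect": "Allow", "Action": ["s3:*", "iam:*"], "Resource": "*" }
      ]
    }' \
    --endpoint-url http://10.100.11.65:8070 \
    --profile ceph-admin
```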

Configuration

To access data in a Ceph S3 bucket, apply the following configuration changes.

Hadoop Configuration for HDFS/Spark3/Impala

  • Add configurations in HDFS → Configs → Advanced → Custom core-site

| Property | Value |
| --- | --- |
| fs.s3a.access.key | `<AWS-access-key>` |
| fs.s3a.secret.key | `<AWS-secret-key>` |
| fs.s3a.endpoint | http://10.100.11.65:8070 |
| fs.s3a.aws.credentials.provider | org.apache.hadoop.fs.s3a.SimpleAWSCredentialsProvider |
| fs.s3a.impl | org.apache.hadoop.fs.s3a.S3AFileSystem |
| fs.s3a.path.style.access | true |
| fs.s3a.connection.ssl.enabled | false |
  • Add configurations in MapReduce → Configs → Advanced → Custom mapred-site

| Property | Value |
| --- | --- |
| fs.s3a.impl | org.apache.hadoop.fs.s3a.S3AFileSystem |

Hive Configuration

  1. Create a .jceks credential store from the access/secret keys.
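The command block was lost in extraction. A sketch using Hadoop's `hadoop credential` tool; the store path on HDFS is an illustrative assumption:

```bash
# Store the S3A keys in a JCEKS credential provider on HDFS
# (provider path is illustrative).
hadoop credential create fs.s3a.access.key \
    -value <AWS-access-key> \
    -provider jceks://hdfs/user/hive/s3a.jceks
hadoop credential create fs.s3a.secret.key \
    -value <AWS-secret-key> \
    -provider jceks://hdfs/user/hive/s3a.jceks
```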
  2. Add the following property under Hive → Config → Custom hive-site.
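The property itself was stripped from this page. Assuming the step points Hive at the credential store created above, it would be Hadoop's credential provider path property (the jceks path is the illustrative one used above):

```properties
hadoop.security.credential.provider.path=jceks://hdfs/user/hive/s3a.jceks
```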
  3. Set the session-level configurations in the Hive CLI or Beeline.
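The original settings were not preserved. As a sketch, session-level S3A overrides can be set with `SET` statements in Beeline; which properties to set here is an assumption based on the core-site values used elsewhere in this document:

```sql
-- Illustrative session-level overrides; adjust to your environment.
SET fs.s3a.endpoint=http://10.100.11.65:8070;
SET fs.s3a.path.style.access=true;
SET fs.s3a.connection.ssl.enabled=false;
```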

Trino Configuration

  • Add configurations in Trino → Configs → Advanced → Advanced trino-hive → Hive Config
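The configuration block was stripped from this page. A sketch of the corresponding entries in `hive.properties`, using Trino's legacy `hive.s3.*` property names and the endpoint from the Hadoop configuration above:

```properties
# Illustrative values; Trino's own S3 client reads these from hive.properties.
hive.s3.endpoint=http://10.100.11.65:8070
hive.s3.aws-access-key=<AWS-access-key>
hive.s3.aws-secret-key=<AWS-secret-key>
hive.s3.path-style-access=true
hive.s3.ssl.enabled=false
```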

Trino uses its own S3 client (TrinoS3FileSystem) rather than Hadoop's S3A connector (S3AFileSystem), so the S3 credentials and endpoint must be configured in hive.properties.

Ceph with Ranger S3 Plugin

Quick Connection Test

  1. Create a test bucket.
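The command was not preserved. With S3A configured as above, a bucket can be created with an HDFS shell `mkdir` against an `s3a://` URI; the bucket name is illustrative:

```bash
# Bucket name is illustrative.
hdfs dfs -mkdir s3a://odp-test-bucket/
```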
  2. Upload a test file to the bucket via HDFS.
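A sketch of the upload step, matching the `test.txt` file referenced below (bucket name illustrative):

```bash
# Create a small local file and copy it into the S3A bucket.
echo "hello from ODP" > test.txt
hdfs dfs -put test.txt s3a://odp-test-bucket/
hdfs dfs -ls s3a://odp-test-bucket/
```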
  3. Check the test file on the Ceph cluster.
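The verification command was stripped. Two illustrative ways to confirm the object landed, from a Ceph node or via the AWS CLI profile configured earlier (bucket and profile names are assumptions):

```bash
# From a Ceph node, list the bucket's contents:
radosgw-admin bucket list --bucket=odp-test-bucket

# Or via the AWS CLI against the RGW endpoint:
aws s3 ls s3://odp-test-bucket/ \
    --endpoint-url http://10.100.11.65:8070 \
    --profile ceph-admin
```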

You should see the test.txt file in the Ceph bucket.

Troubleshooting

If Ranger policy creation against Ceph fails with an IAM token error, verify the following settings.

Ranger Policy Creation for Ceph Fails with “IAM Invalid Token” Error
