Configuring ADLS Gen2 with ODP

Azure Data Lake Storage Gen2 (ADLS Gen2) is a scalable, secure, and highly available cloud storage service.

If you deploy an ODP cluster alongside Azure Data Lake Storage for data processing and analysis, this guide explains how to connect the cluster to ADLS Gen2. By following these procedures, you will be able to:

  • Facilitate secure and optimized data transfers between your ODP cluster and ADLS Gen2.
  • Seamlessly integrate the features of ADLS Gen2 within your ODP ecosystem.

The following sections walk you through the setup process step by step.

Configure OAuth Using Core-site.xml

To configure OAuth using core-site.xml, perform the following:

  1. Configure OAuth for Azure in the ODP cluster by adding the following properties to custom core-site:

     Key                                        Value
     fs.azure.account.auth.type                 OAuth
     fs.azure.account.oauth.provider.type       org.apache.hadoop.fs.azurebfs.oauth2.ClientCredsTokenProvider
     fs.azure.account.oauth2.client.endpoint    OAuth 2.0 token endpoint (v2)
     fs.azure.account.oauth2.client.id          Client ID
     fs.azure.account.oauth2.client.secret      Client Secret

  2. Restart the required services.
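
If you edit core-site.xml by hand rather than through a management UI, the properties above could be rendered as an XML fragment. The following is a minimal sketch, not a definitive configuration: the tenant ID, client ID, and client secret values are hypothetical placeholders, the helper function names are made up for illustration, and the endpoint URL assumes the usual Microsoft identity platform v2.0 token endpoint format.

```shell
#!/bin/sh
# Hypothetical placeholder values; replace them with your app registration details.
TENANT_ID="YOUR_TENANT_ID"
CLIENT_ID="YOUR_CLIENT_ID"
CLIENT_SECRET="YOUR_CLIENT_SECRET"

# Print one <property> element for core-site.xml (values are not XML-escaped).
render_property() {
  printf '  <property>\n    <name>%s</name>\n    <value>%s</value>\n  </property>\n' "$1" "$2"
}

# Print the five OAuth properties listed in the table above.
oauth_core_site_fragment() {
  render_property fs.azure.account.auth.type "OAuth"
  render_property fs.azure.account.oauth.provider.type \
    "org.apache.hadoop.fs.azurebfs.oauth2.ClientCredsTokenProvider"
  # Assumed v2.0 token endpoint format for the tenant.
  render_property fs.azure.account.oauth2.client.endpoint \
    "https://login.microsoftonline.com/${TENANT_ID}/oauth2/v2.0/token"
  render_property fs.azure.account.oauth2.client.id "${CLIENT_ID}"
  render_property fs.azure.account.oauth2.client.secret "${CLIENT_SECRET}"
}

oauth_core_site_fragment
```

Paste the printed elements inside the existing <configuration> element. A secret that contains characters such as & or < would need XML escaping first.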

After completing these steps, you can configure TLS (Transport Layer Security) for ADLS as described in the Configure TLS section.

Configure OAuth using Hadoop Credential Provider

Storing credentials in the core-site.xml file is not the most secure way to connect your Hadoop cluster to ADLS, because the client ID and secret are kept in plain text. For better security, use the Hadoop Credential Provider instead.

To use the Hadoop Credential Provider, perform the following:

  1. Create a password for the Hadoop Credential Provider and export it to the environment.
  2. Save the credentials of your Azure account (the client ID and client secret) in the credential provider.
  3. Whenever you submit a job, reference the saved credentials on the command line.
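
The three steps above can be sketched with the standard Hadoop credential commands. This is an illustrative sketch under stated assumptions, not the exact commands for your environment: the password value, the keystore path, and the CONTAINER and ACCOUNT names are hypothetical placeholders, and the `run` helper is made up here so that the commands are only printed on a machine without Hadoop installed.

```shell
#!/bin/sh
# Run a command if its binary is installed; otherwise only print it (dry run).
run() {
  if command -v "$1" >/dev/null 2>&1; then
    "$@"
  else
    echo "[dry-run] $*"
  fi
}

# Step 1: password that protects the keystore, read by Hadoop from the environment.
export HADOOP_CREDSTORE_PASSWORD="change-me"   # hypothetical value

# Hypothetical keystore location on HDFS.
PROVIDER_PATH="jceks://hdfs/user/example/adls.jceks"

# Step 2: store the client ID and secret. Without -value, the command
# prompts for the credential instead of exposing it in shell history.
run hadoop credential create fs.azure.account.oauth2.client.id -provider "$PROVIDER_PATH"
run hadoop credential create fs.azure.account.oauth2.client.secret -provider "$PROVIDER_PATH"

# Step 3: point a command at the keystore when submitting it.
run hdfs dfs -D hadoop.security.credential.provider.path="$PROVIDER_PATH" \
  -ls "abfs://CONTAINER@ACCOUNT.dfs.core.windows.net/"
```

The same `-D hadoop.security.credential.provider.path=...` option can be passed to other Hadoop commands that accept generic options.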

Configure TLS

To improve TLS (Transport Layer Security) performance with Azure Data Lake Storage Gen2, switch from the default Java TLS implementation to the native OpenSSL TLS implementation.

To switch to the native OpenSSL TLS implementation in ADLS Gen2, perform the following:

  1. Verify the location of the OpenSSL libraries on the hosts.
  2. Adjust the HADOOP_OPTS property to include the path to the OpenSSL libraries (for instance, /usr/lib64), and save your changes.
  3. To confirm that native TLS acceleration is configured, verify the OpenSSL setup from any host within your cluster.
  4. Restart all required services.
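
The steps above might look like the following on a host. Treat this as a hedged sketch: it assumes the cluster uses the wildfly-openssl library, which reads the `org.wildfly.openssl.path` system property, and that the `org.wildfly.openssl.SSL` class is on the Hadoop classpath and can be run directly. The search directories and the `run` helper are illustrative choices, not part of the original procedure.

```shell
#!/bin/sh
# Run a command if its binary is installed; otherwise only print it (dry run).
run() {
  if command -v "$1" >/dev/null 2>&1; then
    "$@"
  else
    echo "[dry-run] $*"
  fi
}

# Step 1: look for the OpenSSL shared library in a few common directories.
find /usr/lib64 /usr/lib /lib64 /lib -maxdepth 2 -name 'libssl.so*' 2>/dev/null | head -n 5

# Directory reported by the search; /usr/lib64 is the example used in the text.
OPENSSL_LIB_DIR="/usr/lib64"

# Step 2: append the library path to HADOOP_OPTS (assumed wildfly-openssl property).
export HADOOP_OPTS="${HADOOP_OPTS:-} -Dorg.wildfly.openssl.path=${OPENSSL_LIB_DIR}"

# Step 3: ask the library to load OpenSSL (assumed verification command).
run hadoop org.wildfly.openssl.SSL
```

On a managed cluster, HADOOP_OPTS is normally set through the cluster manager's hadoop-env configuration rather than exported in an interactive shell, so the export above only affects the current session.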

Access ADLS from a Cluster

To access containers in your storage account, add the account key to the custom HDFS core-site configuration.

For example, if you intend to access containers within your storage account named workshop, configure the following parameters:

Key                                                        Value
fs.azure.account.auth.type.workshop.dfs.core.windows.net   SharedKey
fs.azure.account.key.workshop.dfs.core.windows.net         Account Key

After you set these properties, you can access ADLS Gen2 from any host in your cluster. To verify the connection, run a test command against a container from one of the hosts.
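
One way to test the connection is to list a container with the Hadoop file system shell. The sketch below uses the workshop account from the example above; the container name is a hypothetical placeholder, and the `run` helper is made up here so that the command is only printed on a machine without Hadoop installed.

```shell
#!/bin/sh
# Run a command if its binary is installed; otherwise only print it (dry run).
run() {
  if command -v "$1" >/dev/null 2>&1; then
    "$@"
  else
    echo "[dry-run] $*"
  fi
}

STORAGE_ACCOUNT="workshop"   # storage account from the example above
CONTAINER="mycontainer"      # hypothetical container name

# ABFS URIs have the form abfs://<container>@<account>.dfs.core.windows.net/<path>
ADLS_URI="abfs://${CONTAINER}@${STORAGE_ACCOUNT}.dfs.core.windows.net/"

# List the root of the container; a listing without errors means the key works.
run hdfs dfs -ls "$ADLS_URI"
```

To send the traffic over TLS, use the abfss:// scheme in place of abfs://.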

A. Disable soft delete for both blobs and containers in the Azure portal, because soft delete is not yet supported for ADLS Gen2.

To disable soft delete, perform the following:

1. In the Azure portal, navigate to your storage account.

2. Locate the Data protection settings under Data management.

3. Deselect Enable soft delete for blobs and Enable soft delete for containers.
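
If you prefer the command line to the portal, the same settings can probably be changed with the Azure CLI. This is a sketch under assumptions: it presumes the `az` CLI is installed and logged in, the resource group name is a hypothetical placeholder, and the `run` helper is made up here so that the command is only printed when `az` is not available.

```shell
#!/bin/sh
# Run a command if its binary is installed; otherwise only print it (dry run).
run() {
  if command -v "$1" >/dev/null 2>&1; then
    "$@"
  else
    echo "[dry-run] $*"
  fi
}

STORAGE_ACCOUNT="workshop"           # storage account used in this document's example
RESOURCE_GROUP="my-resource-group"   # hypothetical resource group name

# Turn off soft delete for blobs and for containers on the storage account.
run az storage account blob-service-properties update \
  --account-name "$STORAGE_ACCOUNT" \
  --resource-group "$RESOURCE_GROUP" \
  --enable-delete-retention false \
  --enable-container-delete-retention false
```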

B. Avoid ending directory and file names with a period. Paths that end with a period can behave inconsistently, and the trailing period may be dropped.
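
A small check before writing data can catch such names early. The function below is an illustrative sketch, not part of ODP or Hadoop: its name and the sample paths are hypothetical, and it flags any path in which a directory or file name ends with a period.

```shell
#!/bin/sh
# Return 0 if any path component ends with a period, 1 otherwise.
# Note: the special components "." and ".." are flagged as well.
path_has_trailing_period() {
  _p="$1"
  case "$_p" in
    *://*)
      _p="${_p#*://}"   # drop the scheme, for example "abfs://"
      _p="/${_p#*/}"    # drop the authority (container@account host)
      ;;
  esac
  case "$_p" in
    *./* | *.) return 0 ;;
    *) return 1 ;;
  esac
}

# Hypothetical sample paths.
for candidate in \
  "abfs://data@workshop.dfs.core.windows.net/reports/2024." \
  "abfs://data@workshop.dfs.core.windows.net/reports/2024"; do
  if path_has_trailing_period "$candidate"; then
    echo "avoid: $candidate"
  else
    echo "ok:    $candidate"
  fi
done
```

The scheme and authority are stripped first so that periods in the storage account host name are not mistaken for part of a directory or file name.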
