Configuring ADLS Gen2 with ODP
Azure Data Lake Storage Gen2 (ADLS Gen2) is a scalable, secure, and highly available cloud storage service.
If you deploy an ODP cluster alongside Azure Data Lake Storage for data processing and analysis, this guide helps you establish connectivity between your ODP cluster and ADLS Gen2. By following these procedures, you will:
- Enable secure, optimized data transfers between your ODP cluster and ADLS Gen2.
- Integrate ADLS Gen2 capabilities into your ODP ecosystem.
The sections below describe the setup process step by step.
Configure OAuth Using Core-site.xml
To configure OAuth using core-site.xml, perform the following:
- Configure OAuth for Azure in the ODP cluster by adding the following properties to the custom core-site configuration:
| Key | Value |
|---|---|
| fs.azure.account.auth.type | OAuth |
| fs.azure.account.oauth.provider.type | org.apache.hadoop.fs.azurebfs.oauth2.ClientCredsTokenProvider |
| fs.azure.account.oauth2.client.endpoint | OAuth 2.0 token endpoint (v2) |
| fs.azure.account.oauth2.client.id | Client ID |
| fs.azure.account.oauth2.client.secret | Client Secret |
- Restart the required services.
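After restarting the services, you can confirm that the OAuth properties are in effect and that the cluster can reach your storage account. The following is a minimal sketch; the container, account, and path values are placeholders you must replace with your own:
# Should print OAuth once the custom core-site property has been distributed to this host
hdfs getconf -confKey fs.azure.account.auth.type
# List a container to confirm end-to-end connectivity (replace the placeholders)
hadoop fs -ls abfss://<container_name>@<account>.dfs.core.windows.net/<container_file_path>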
On completion of the above steps, you can configure TLS (Transport Layer Security) for ADLS as described later in this guide.
Configure OAuth Using Hadoop Credential Provider
Storing the connection settings in core-site.xml is not the most secure way to connect your Hadoop cluster to ADLS, because it keeps your client ID and secret in plain text. For enhanced security, use the Hadoop Credential Provider instead.
To implement this approach, perform the following:
- Create a password for the Hadoop Credential Provider and export it to the environment by running the following command:
export HADOOP_CREDSTORE_PASSWORD=<password>
- Save your Azure account credentials by running the following commands (you can verify the stored aliases afterwards, as sketched after this list):
hadoop credential create fs.azure.account.oauth2.client.id -provider jceks://hdfs/user/<user_name>/adls2keyfile.jceks -value <client ID>
hadoop credential create fs.azure.account.oauth2.client.secret -provider jceks://hdfs/user/<user_name>/adls2keyfile.jceks -value <client secret>
hadoop credential create fs.azure.account.oauth2.client.endpoint -provider jceks://hdfs/user/<user_name>/adls2keyfile.jceks -value <OAuth 2.0 token endpoint (v2)>
- After saving your credentials, reference the credential store on the command line whenever you submit a job, for example:
hadoop fs -Dfs.azure.account.oauth.provider.type=org.apache.hadoop.fs.azurebfs.oauth2.ClientCredsTokenProvider -Dhadoop.security.credential.provider.path=jceks://hdfs/user/<user_name>/adls2keyfile.jceks -ls abfss://<container_name>@<account>.dfs.core.windows.net/<container_file_path>
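Once the credentials are stored, you can check which aliases the credential file contains. The following is a minimal sketch; the user name and path are placeholders, and the password must match the one exported earlier:
# Export the credential store password that was used when the entries were created
export HADOOP_CREDSTORE_PASSWORD=<password>
# List the aliases stored in the credential file (secret values are not displayed)
hadoop credential list -provider jceks://hdfs/user/<user_name>/adls2keyfile.jceks
If you prefer not to pass -Dhadoop.security.credential.provider.path on every command, you can set hadoop.security.credential.provider.path to the same jceks path in the custom core-site configuration instead.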
Configure TLS
To optimize TLS (Transport Layer Security) performance for Azure Data Lake Storage Gen2, switch from the default Java TLS implementation to the native OpenSSL TLS implementation.
To switch to the native OpenSSL TLS implementation in ADLS Gen2, perform the following:
- Verify the location of the OpenSSL libraries on the hosts using the following command:
whereis libssl
- After identifying the location of the OpenSSL libraries, add their path to the HADOOP_OPTS setting. For instance, if the OpenSSL libraries reside in /usr/lib64, insert the following parameter and save your changes (see the sketch after this list):
HADOOP_OPTS="-Dorg.wildfly.openssl.path=/usr/lib64 ${HADOOP_OPTS}"
- To verify that native TLS acceleration is configured, run the following command on any host within your cluster:
hadoop fs -ls abfss://<container>@<account>.dfs.core.windows.net/<container_file_path>
- Restart all required services.
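Where HADOOP_OPTS is set depends on how your cluster is managed. The sketch below assumes you are editing the hadoop-env configuration (for example, the hadoop-env template in your cluster manager) and that the libraries were found under /usr/lib64 in the first step:
# In hadoop-env, prepend the OpenSSL library path so the native TLS provider can load the libraries
export HADOOP_OPTS="-Dorg.wildfly.openssl.path=/usr/lib64 ${HADOOP_OPTS}"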
Access ADLS from a Cluster
To access containers within your account, add the account key to the custom HDFS core-site configuration.
For example, to access containers in a storage account named workshop, configure the following parameters:
| Key | Value |
|---|---|
| fs.azure.account.auth.type.workshop.dfs.core.windows.net | SharedKey |
| fs.azure.account.key.workshop.dfs.core.windows.net | Account Key |
After setting these properties, you can access Azure Data Lake Storage (ADLS) from any host within your cluster. To verify the connection, run the following test command:
/usr/odp/current/hadoop-client/bin/hadoop fs -ls abfss://<container>@<account>.dfs.core.windows.net/<container_file_path>
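With access in place, you can also move data between HDFS and ADLS Gen2. The following is a minimal DistCp sketch; the source path, container, account, and target path are placeholders you must replace with your own values:
# Hypothetical example: copy a directory from HDFS to ADLS Gen2 with DistCp (replace the placeholders)
hadoop distcp \
  hdfs:///user/<user_name>/sample_data \
  abfss://<container_name>@<account>.dfs.core.windows.net/<target_path>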
A. Disable soft delete for both blobs and containers in the Azure portal, because soft delete is not yet supported for ADLS Gen2.
To disable these recovery settings, perform the following (an Azure CLI alternative is sketched after the steps):
1. In the Azure portal, navigate to your storage account.
2. Locate the Data protection settings under Data management.
3. Deselect Enable soft delete for blobs and Enable soft delete for containers.
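If you prefer to script this change instead of using the portal, the Azure CLI offers equivalent settings. The following is an assumed sketch; the account and resource group names are placeholders, and the flag names should be verified against your CLI version (for example, with az storage account blob-service-properties update --help):
# Assumed Azure CLI sketch: disable blob and container soft delete (verify flags for your CLI version)
az storage account blob-service-properties update \
  --account-name <account> \
  --resource-group <resource_group> \
  --enable-delete-retention false \
  --enable-container-delete-retention false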
B. Avoid ending directory and file names with a period. Paths that end with a period can behave inconsistently, and the trailing period may be silently dropped.