Microsoft | Azure Data Lake Gen2
Azure Data Lake Gen2 is Microsoft's centralized repository for data storage. You can use Data Lake to store your structured and unstructured data. Data Lake can store data in its original format, irrespective of the size of the data.
Take a look at this video, which explains the process of adding Azure Data Lake Gen2 as a data source.
Azure Data Lake in ADOC
ADOC provides data reliability capabilities for data stored in your Azure Data Lake Gen2 data source. You must create a Data Plane or use an existing Data Plane to add Azure Data Lake Gen2 as a Data Source in ADOC. Once you add Data Lake as a Data Source, you can view the details of your data stored in Data Lake in the Data Reliability tab in ADOC.
To access data in Azure Data Lake Gen 2, you'll need to use an Asset URL. The format for this URL is as follows:
- Format:
abfss://[container-name]@[storage-account-name].dfs.core.windows.net/[file-path]
- Example:
abfss://example-container@adlsgen2account.dfs.core.windows.net/sample-folder/sample-file.txt
This URL format ensures that you are correctly pointing to the specific data assets within your Azure Data Lake Storage.
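As a quick illustration, here is a small Python helper (hypothetical, not part of ADOC) that assembles an asset URL from its three parts; ADOC only needs the resulting string:

```python
def asset_url(container: str, account: str, file_path: str) -> str:
    """Assemble an ADLS Gen2 asset URL in the abfss format shown above.

    Illustrative only; ADOC expects the final string, not this helper.
    """
    return f"abfss://{container}@{account}.dfs.core.windows.net/{file_path}"

# Reproduces the example above.
print(asset_url("example-container", "adlsgen2account",
                "sample-folder/sample-file.txt"))
```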
Steps to Add Azure Data Lake as a Data Source
To add Azure Data Lake as a Data Source:
- Click Register from the left pane.
- Click Add Data Source.
- Select the Azure Data Lake Data Source. The Azure Data Lake Data Source Basic Details page is displayed.


- Enter a name for the data source in the Data Source name field.
- (Optional) Enter a description for the Data Source in the Description field.
- Enable the Data Reliability capability by switching on the toggle switch.
- Select a Data Plane from the Select Data Plane drop-down menu. To create a new Data Plane, click Setup Dataplane.
- Click Next. The Azure Data Lake Connection Details page is displayed.

- Enter your account name in the Azure Storage Account Name field.
- Enter your Azure storage account access key in the Azure Storage Account Access Key field. To learn more about Azure access keys, refer to this Microsoft document.
- Click Test Connection. If your credentials are valid, you receive a Connected message. If you get an error message, validate the credentials you entered. (An equivalent connectivity check is sketched after these steps.)
- Click Next. The Set Up Observability page is displayed.

- Provide the Asset Name, Path Expression, and File Type of your ADLS assets. To add more assets, click +. The selected assets are monitored by the Data Reliability capability of ADOC.
- Based on the file type selected, you must enter values for the required parameters (the date-partitioned parameters are illustrated in the sketch after the table):
File Type | Parameter | Description |
---|---|---|
CSV | Delimiter | The character that separates fields in a CSV file. Common delimiters include commas (,), tabs (\t), or semicolons (;). |
ORC | — | No additional parameters are required for ORC files. |
PARQUET | File Processing Strategy | Options include: Evolving Schema (no additional parameters required), Random Files, or Date Partitioned. |
PARQUET | Base Path (Random Files) | The root directory or location in the storage system where the Parquet files are stored. This is used to locate the data for random file processing. |
PARQUET | Base Path (Date Partitioned) | The root directory or location where the date-partitioned Parquet files are stored. |
PARQUET | Pattern (Date Partitioned) | A file pattern that includes a date (e.g., "file-<yyyy-MM-dd>.parquet") to identify the specific files for processing. |
PARQUET | LookBack Days (Date Partitioned) | The number of days to look back when crawling and processing date-partitioned Parquet files. |
PARQUET | TimeZone (Date Partitioned) | The time zone in which the partitioned data is recorded. |
JSON | Flattening Level | Defines how deeply nested JSON structures are flattened. Nested JSON fields are expanded based on the level specified. |
JSON | MultiLine JSON | When enabled, this toggle allows for the processing of JSON data that spans multiple lines. |
AVRO | Schema Store Type | Specifies where the AVRO schema is stored. Options can include local files, a schema registry, or other storage systems. |
Delta | — | No additional parameters are required for Delta files. |
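To make the Date Partitioned parameters concrete, the following Python sketch expands a date pattern over the look-back window. The Base Path, Pattern, LookBack Days, and TimeZone values are hypothetical, and how ADOC expands the pattern internally is not documented here; this only illustrates how the parameters relate:

```python
from datetime import datetime, timedelta
from zoneinfo import ZoneInfo

# Hypothetical values mirroring the Date Partitioned parameters in the table.
base_path = "abfss://example-container@adlsgen2account.dfs.core.windows.net/events/"
pattern = "file-<yyyy-MM-dd>.parquet"  # date placeholder as shown above
lookback_days = 3
timezone = "UTC"

# Expand the pattern into the concrete files a crawler would visit.
today = datetime.now(ZoneInfo(timezone)).date()
for offset in range(lookback_days):
    day = today - timedelta(days=offset)
    print(base_path + pattern.replace("<yyyy-MM-dd>", day.isoformat()))
```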
- The asset path expression must be a full path address. The syntax is abfss://[container-name]@[storage-account-name].dfs.core.windows.net/[file-path]. For example, abfss://example-container@adlsgen2account.dfs.core.windows.net/observability.csv.
- To include all the files in a container, use the syntax abfss://[container-name]@[storage-account-name].dfs.core.windows.net. This includes all the files stored in the container, irrespective of the file type.
- To include all the files that belong to a specific file type, use a wildcard with the file extension. For example, to include all the CSV files in the container, use abfss://[container-name]@[storage-account-name].dfs.core.windows.net/*.csv. Similarly, to include all the JSON files in the container, use abfss://[container-name]@[storage-account-name].dfs.core.windows.net/*.json. (A rough sketch of this matching follows these notes.)
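ADOC's exact wildcard semantics are not spelled out here; as a rough intuition, assuming glob-style matching, a *.csv suffix selects files like this:

```python
from fnmatch import fnmatch

# Hypothetical file listing under a container; names are illustrative only.
files = [
    "observability.csv",
    "sample-folder/sample-file.txt",
    "reports/2024/summary.csv",
    "events/log.json",
]

# With glob-style matching, "*" also crosses folder boundaries in fnmatch.
selected = [f for f in files if fnmatch(f, "*.csv")]
print(selected)  # ['observability.csv', 'reports/2024/summary.csv']
```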
- Enable Schema Drift Monitoring: For newly crawled assets, Acceldata automatically enables schema drift monitoring. These monitoring policies can be further configured per asset in the application.

Warning: Schema changes are not automatically detected unless the crawler is executed. Acceldata recommends scheduling the crawler.

- Enable Crawler Execution Schedule: Turn on this toggle switch to select a time tag and time zone to schedule the execution of crawlers for Data Reliability.
- Click Submit.
Azure Data Lake is now added to ADOC as a Data Source.
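As a point of reference, the check that Test Connection performs can be approximated outside ADOC. This is a minimal sketch, assuming the azure-storage-file-datalake Python package and placeholder credentials; it is not ADOC's actual implementation:

```python
from azure.storage.filedatalake import DataLakeServiceClient

# Placeholders; substitute your storage account name and access key.
account_name = "adlsgen2account"
account_key = "<storage-account-access-key>"

service = DataLakeServiceClient(
    account_url=f"https://{account_name}.dfs.core.windows.net",
    credential=account_key,  # shared-key auth, mirroring the Access Key field
)

# Listing containers fails fast if the name or key is invalid.
for fs in service.list_file_systems():
    print(fs.name)
```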
ADOC supports the ABFS (Azure Blob File System) protocol for Azure Data Lake Gen 2. ABFS is optimized for the Azure storage infrastructure, offering enhanced performance, security, and better integration with Azure services. While HTTPS can be used, ABFS is the recommended protocol for optimal functionality and performance in ADOC.
Why ABFS for Azure Data Lake Gen 2
The ABFS protocol is specifically designed for Azure's storage infrastructure, offering:
- Enhanced Performance: Optimized for big data analytics on Azure.
- Improved Security: Provides robust security features suitable for sensitive data.
- Seamless Integration: Designed to work efficiently with Azure services, ensuring better data management and analytics capabilities.
Delta Parquet file formats are now supported in ADOC. Users can perform Data Reliability (DR) checks on Delta Parquet files, enhancing data reliability pipelines.
ADOC now allows you to create SQL views on ADLS Data Lake Gen2 assets. This allows users to run complex aggregation and validation checks directly on these data assets, increasing the efficiency and accuracy of data reliability operations.
Azure Managed Identities, also known as Managed Service Identity (MSI), provide a secure and automated solution for ADOC users to access Azure Data Lake Storage (ADLS).
These identities reduce the need for manual credential management, improving security and streamlining the authentication process.
- System-assigned Managed Identity: Tied to the lifecycle of an Azure resource, it is automatically created and managed by Azure. Ideal for specific resources like virtual machines.
- User-assigned Managed Identity: Created independently and can be assigned to multiple Azure resources, offering flexibility and reusability.
Advantages of using Managed Identities
- Security: Reduces the risk of credential exposure by managing and rotating credentials automatically.
- Compliance: Meets strict security and compliance standards.
- Integration: Seamlessly integrates with various Azure services including ADLS.
Configuring Azure Managed Identities for ADLS Access
To configure Azure Managed Identities for ADLS access in ADOC, follow these steps:
1. Access your Azure Storage Account:
1.1. Log into the Azure portal and select the relevant storage account.
2. Configure IAM Role:
2.1. Navigate to 'Access Control (IAM)' in the storage account settings.
2.2. Click on 'Add role assignment' to initiate the process.
3. Assign the Required Role:
3.1. Choose the 'Storage Blob Data Contributor' role for comprehensive access.
3.2. Assign this role to the Azure Managed Identity designated for ADLS access.
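After completing these steps, the identity's access can be verified from a workload that runs with it. A minimal sketch, assuming the azure-identity and azure-storage-file-datalake packages (the client ID and account URL are placeholders):

```python
from azure.identity import ManagedIdentityCredential
from azure.storage.filedatalake import DataLakeServiceClient

# For a user-assigned identity, pass its client ID; omit it for system-assigned.
credential = ManagedIdentityCredential(client_id="<managed-identity-client-id>")

service = DataLakeServiceClient(
    account_url="https://adlsgen2account.dfs.core.windows.net",  # placeholder
    credential=credential,
)

# A simple list operation confirms the Storage Blob Data Contributor role works.
for fs in service.list_file_systems():
    print(fs.name)
```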
Configuring ADLS as a Data Source in ADOC with Managed Identities
To leverage Azure Managed Identities for ADLS in ADOC, follow these configuration steps:
1. Path Expression Setup:
Decide whether to use an absolute or relative path for the ADLS data source.
Examples:
- Absolute Path:
abfss://parquet@babugen2.dfs.core.windows.net/profile.parquet
- Relative Path:
abfss://parquet@babugen2.dfs.core.windows.net/test/
2. Data Source Registration in ADOC:
2.1. In ADOC, navigate to the section where you add a new data source.
2.2. Select ADLS and input the path expression as per your requirement.
Utilizing ADLS for Global Storage in ADOC
In ADOC, ADLS can be used as global storage. To set it up with Managed Identities, follow these steps:
Choose an Authentication Method:
- For Token-based Auth: Ensure the token grants write access to the ADLS blob.
- For Managed Identity: Use pod-level client credentials for Azure Active Directory authentication.
Key Configuration Fields:
- Fill in the following details in the ADOC configuration:
  - Storage Account Name, Container Name, Generation Type.
  - Azure Client ID, Tenant ID, Federated Token File, Authority Host.
- For ADLS Gen2, prioritize the storage account key if available; otherwise, utilize client credentials.
- Disable the 'Soft Delete' option in ADLS for write operations using Managed Identities.
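The Azure Client ID, Tenant ID, Federated Token File, and Authority Host fields map onto Azure workload identity settings. As a hedged sketch of how such values are typically consumed, using the azure-identity package (all values are placeholders; ADOC wires these up internally):

```python
from azure.identity import WorkloadIdentityCredential
from azure.storage.filedatalake import DataLakeServiceClient

# Placeholders for the Key Configuration Fields listed above. The Authority
# Host is usually supplied via the AZURE_AUTHORITY_HOST environment variable.
credential = WorkloadIdentityCredential(
    tenant_id="<tenant-id>",
    client_id="<azure-client-id>",
    token_file_path="<federated-token-file>",  # projected token file on the pod
)

service = DataLakeServiceClient(
    account_url="https://<storage-account-name>.dfs.core.windows.net",
    credential=credential,
)
```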