Amazon S3
Amazon S3 is AWS’s object storage service. Use ADOC to monitor and profile S3 buckets and files—providing data quality, freshness, usage, schema drift, and reliability insights.
Prerequisites
Before you connect Amazon S3 as a data source in ADOC, ensure the following:
- You have access to an AWS S3 bucket containing the data you want to monitor.
- The ADOC Data Plane is deployed and has network access to the S3 bucket.
- Authentication to AWS is configured using one of the following methods:
  - EC2 Instance Profile with the appropriate IAM role attached
  - Kubernetes IAM Roles for Service Accounts (IRSA)
  - AWS Access Key and Secret Key credentials
- Your IAM policy includes the necessary permissions to access S3, Secrets Manager (if applicable), and SQS (if you're using event-based or incremental monitoring).
Add S3 as a Data Source
Follow these steps to set up S3 in ADOC:
Step 1: Start Setup
Select Register from the left main menu.
Select Add Data Source.
Select AWS S3 from the list of data sources.
On the Data Source Details page:
- Enter a name for this data source that is unique within your tenant.
- Optionally, add a brief description to clarify its purpose.
- Enable the Data Reliability toggle and select your data plane from the drop-down list.
Select Next to proceed.
Step 2: Add Connection Details
Enter your AWS S3 connection information. Required fields vary depending on the selected Authentication Type.
Common Fields (Displayed for All Authentication Types)
Field | Description |
---|---|
AWS Region | The AWS region where your S3 bucket is hosted (e.g., us-east-2, eu-west-1). For more information, refer to the AWS Regions and Zones documentation. |
AWS S3 Authentication Type | Select the method of authentication used to access your AWS S3 bucket. Options include Access Key/Secret Key, EC2 Instance Profile, and (if applicable) IAM Roles for Service Accounts (IRSA) in Kubernetes. |
Bucket Name | Name of the S3 bucket you want to monitor. Multiple buckets can be added. |
File Monitoring Channel Type | (Optional) Choose SQS to receive file change notifications from S3 via Amazon SQS (Simple Queue Service); otherwise, select NONE. See the example notification configuration below this table. |
SQS Queue URL | The URL of the SQS queue that receives S3 event notifications. Required if File Monitoring Channel Type is set to SQS. |
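ADOC consumes S3 event notifications delivered to the SQS queue; wiring the bucket to publish those notifications happens on the AWS side. A minimal sketch of a bucket notification configuration that sends object-creation events to a queue (the queue ARN is a placeholder, and the exact event types ADOC expects are an assumption), applied with aws s3api put-bucket-notification-configuration:

```json
{
  "QueueConfigurations": [
    {
      "Id": "adoc-file-monitoring",
      "QueueArn": "arn:aws:sqs:us-east-2:<account-id>:<queue-name>",
      "Events": ["s3:ObjectCreated:*"]
    }
  ]
}
```

The SQS queue's access policy must also allow the S3 service principal to send messages to it.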
Authentication Types
Authentication Type | Description |
---|---|
AWS Access Key / Secret Key | Provide an AWS Access Key ID and Secret Access Key with sufficient permissions (e.g., s3:ListBucket, s3:GetObject). If Use Secrets Manager is enabled, select the Secrets Manager and specify the key name that contains the secret. For more details on how to view your AWS access key and secret key, refer to this AWS document. |
AWS EC2 Instance Profile | Uses the IAM role attached to the EC2 instance where the ADOC Data Plane is running. No credentials need to be entered manually. Ensure the IAM role has the necessary S3 and SQS permissions. |
AWS IAM Roles For Service Accounts | Uses IAM Roles for Service Accounts (IRSA), recommended for Kubernetes environments. The Kubernetes service account used by the ADOC Data Plane must be annotated with the appropriate IAM role, which allows the workload to assume the role automatically without manually entered credentials. No authentication fields appear in the UI, but this method must be pre-configured in your Kubernetes environment. For more information, refer to the IRSA-Based Authentication section under Additional References below. |
- Select Test Connection. If the connection succeeds, you’ll see “Connected.” If the test fails, check your credentials, region, and network access, and ensure that the ADOC Data Plane service (`ad-analysis-standalone`) is running.
- Select Next to proceed.
Step 3: Set Up Observability
Configure how ADOC will monitor your S3 bucket:
- Asset Name: Enter a logical name or label.
- Path Expression: Enter the complete path expression using the syntax `s3://<bucket-name>/<file-path>`. For example, `s3://acceldatabucket/observability.csv`.
- File Type: Select a file type and provide file-type specific parameters. ADOC supports CSV, JSON, Avro, ORC, Parquet, and Delta files, as well as compressed files (.bz2, .deflate, .gz, .lz4, .snappy). ADOC also supports profiling of zipped and KMS-encrypted files. For details, refer to the table below.
File Type | Parameter | Description |
---|---|---|
CSV | Delimiter | The character that separates fields in a CSV file. Common delimiters include commas (,), tabs (\t), or semicolons (;). |
ORC | None | No additional parameters are required for ORC files. |
PARQUET | File Processing Strategy | Options include: Evolving Schema (no additional parameters required), Random Files, or Date Partitioned. |
Base Path (Random Files) | The root directory or location in the storage system where the Parquet files are stored. This is used to locate the data for random file processing. | |
Base Path (Date Partitioned) | The root directory or location where the date-partitioned Parquet files are stored. | |
Pattern (Date Partitioned) | A file pattern that includes a date (e.g., "file-<yyyy-MM-dd>.parquet") to identify the specific files for processing. | |
LookBack Days (Date Partitioned) | The number of days to look back when crawling and processing date-partitioned Parquet files. | |
TimeZone (Date Partitioned) | The time zone in which the partitioned data is recorded. | |
JSON | Flattening Level | Defines how deeply nested JSON structures will be flattened; nested JSON fields are expanded to the specified depth (see the example below this table). |
MultiLine JSON | When enabled, this toggle allows for the processing of JSON data that spans multiple lines. | |
AVRO | Schema Store Type | Specifies where the AVRO schema is stored. Options could include local files, a schema registry, or other storage systems. |
Delta | None | No additional parameters are required for Delta files. |
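To illustrate the Flattening Level parameter, consider the record below. Assuming ADOC expands nested objects into dot-separated column names (the exact naming convention is an assumption), a flattening level of 1 would expose order.id and leave order.customer nested, while a level of 2 would also expose order.customer.name and order.customer.city:

```json
{
  "order": {
    "id": 42,
    "customer": {
      "name": "Ada",
      "city": "Austin"
    }
  }
}
```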
Configure Observability Options:
- Enable Schema Drift Monitoring to detect changes in file schemas (e.g., added, removed, or renamed columns) over time.
- Enable Crawler Execution Schedule to set up scheduled scans of your S3 bucket:
  - Choose how often the crawler runs (e.g., daily).
  - Set the execution time and time zone.
  - Add multiple execution times if needed.
- Set Notifications:
  - Notify on Crawler Failure: Choose one or more channels to receive failure alerts.
  - Notify on Success: Toggle this if you'd like to receive success notifications.
- Select Submit to save your configuration, register the data source, and begin monitoring.
You have successfully added AWS S3 as a data source. A new card for AWS S3 will appear on the Data Sources page, displaying crawler status and basic connection details.
What’s Next
Once you've successfully connected your Amazon S3 bucket as a data source in ADOC, you can:
- Profile your S3 data: Run profiling jobs to gather metrics such as row count, null percentage, file size, and freshness across supported file formats.
- Monitor data quality in real time: Enable schema drift detection and track changes to file structure, format, or volume using scheduled crawlers or SQS-based triggers.
- Apply data reliability rules and policies: Set up and enforce data quality rules—such as column-level validations, null checks, or file arrival thresholds—directly on your S3 data.
Additional References
IRSA-Based Authentication
IAM Roles for Service Accounts (IRSA) is a secure way to manage AWS access in Kubernetes without hard-coded credentials. IRSA allows each ADOC service in the Data Plane to assume a role with only the permissions it needs, following the principle of least privilege.
How IRSA Works in ADOC
The ADOC Data Plane connects to AWS services in your environment using service accounts mapped to IAM roles. These roles allow access to:
- S3 (for data ingestion)
- AWS Secrets Manager (to retrieve credentials)
- Amazon Athena (for metadata crawling)
- SQS (for event-driven ingestion)
Kubernetes Service Account Mapping
ADOC Service | Service Account | AWS Access |
---|---|---|
Analysis Service | analysis-service | AWS Secrets Manager (read) |
Analysis Standalone Service | analysis-standalone-service | S3 (read), Athena (read) |
Spark Driver / Executor | spark-scheduler | S3 (read/write), Athena (read) |
Monitors Service | torch-monitors | SQS (read, for incremental S3 ingestion) |
Crawlers | analysis-service | Athena (read) |
Annotating Service Accounts for IRSA
To map a Kubernetes service account to an IAM role:
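For example, for the Analysis Standalone Service, the annotation can be declared on the service account manifest. A minimal sketch (the namespace, account ID, and role name are placeholders):

```yaml
apiVersion: v1
kind: ServiceAccount
metadata:
  name: analysis-standalone-service
  namespace: <adoc-namespace>
  annotations:
    # IAM role the ADOC workload assumes via IRSA
    eks.amazonaws.com/role-arn: arn:aws:iam::<account-id>:role/<adoc-s3-role>
```

The same annotation can be applied in place with kubectl annotate serviceaccount. Repeat for each service account in the mapping table above, pointing each at a role scoped to its required access.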
Replace placeholders with your AWS values.
Do not restart the Crawler or Spark Driver/Executor pods during this process.
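For the annotated role to be assumable, its IAM trust policy must allow sts:AssumeRoleWithWebIdentity from the cluster's OIDC provider. A standard IRSA trust policy sketch (account ID, region, OIDC provider ID, and namespace are placeholders):

```json
{
  "Version": "2012-10-17",
  "Statement": [
    {
      "Effect": "Allow",
      "Principal": {
        "Federated": "arn:aws:iam::<account-id>:oidc-provider/oidc.eks.<region>.amazonaws.com/id/<oidc-id>"
      },
      "Action": "sts:AssumeRoleWithWebIdentity",
      "Condition": {
        "StringEquals": {
          "oidc.eks.<region>.amazonaws.com/id/<oidc-id>:sub": "system:serviceaccount:<adoc-namespace>:analysis-standalone-service"
        }
      }
    }
  ]
}
```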
Required IAM Policies
- S3 Read Access
- Secrets Manager Access
- SQS Access (for Incremental Processing)
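The exact policy documents are environment-specific; the following is a minimal sketch covering the three access areas above. The bucket, secret, and queue identifiers are placeholders, and the precise action set ADOC requires may differ, so scope the resources to your own:

```json
{
  "Version": "2012-10-17",
  "Statement": [
    {
      "Sid": "S3ReadAccess",
      "Effect": "Allow",
      "Action": ["s3:ListBucket", "s3:GetObject"],
      "Resource": [
        "arn:aws:s3:::<bucket-name>",
        "arn:aws:s3:::<bucket-name>/*"
      ]
    },
    {
      "Sid": "SecretsManagerAccess",
      "Effect": "Allow",
      "Action": ["secretsmanager:GetSecretValue"],
      "Resource": "arn:aws:secretsmanager:<region>:<account-id>:secret:<secret-name>*"
    },
    {
      "Sid": "SQSIncrementalProcessing",
      "Effect": "Allow",
      "Action": [
        "sqs:ReceiveMessage",
        "sqs:DeleteMessage",
        "sqs:GetQueueAttributes"
      ],
      "Resource": "arn:aws:sqs:<region>:<account-id>:<queue-name>"
    }
  ]
}
```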
Secrets Manager Configuration (for IRSA or EKS Pod Identity)
To retrieve secrets via ADOC’s Data Plane:
```json
[
  {
    "name": "<config-name>",
    "type": "AWS_SECRETS_MANAGER",
    "details": {
      "secretName": "<aws-secret-name>",
      "accessKey": "",
      "secretKey": "",
      "region": "<region>",
      "authType": "IAM_ROLES_FOR_SA_BASED"
    }
  }
]
```
Other authType values (for alternate methods):
- KEY_BASED
- INSTANCE_PROFILE_BASED
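For KEY_BASED, the accessKey and secretKey fields presumably carry the credentials themselves rather than being left empty; this is an assumption, so confirm the expected values with your deployment:

```json
[
  {
    "name": "<config-name>",
    "type": "AWS_SECRETS_MANAGER",
    "details": {
      "secretName": "<aws-secret-name>",
      "accessKey": "<aws-access-key-id>",
      "secretKey": "<aws-secret-access-key>",
      "region": "<region>",
      "authType": "KEY_BASED"
    }
  }
]
```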
EKS Pod Identity (Alternative to IRSA)
EKS Pod Identity enables IAM role assignment directly to pods through pod identity associations, removing the need for service account annotations and OIDC trust policies. It provides the same security benefits as IRSA but with more granularity.
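An association links a service account to a role; a sketch using the AWS CLI (cluster name, namespace, and role are placeholders, and the Amazon EKS Pod Identity Agent add-on must be installed on the cluster):

```bash
aws eks create-pod-identity-association \
  --cluster-name <cluster-name> \
  --namespace <adoc-namespace> \
  --service-account analysis-standalone-service \
  --role-arn arn:aws:iam::<account-id>:role/<adoc-s3-role>
```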
The service account to AWS access mapping is identical to the Kubernetes Service Account Mapping table above.
IAM policies for EKS Pod Identity are the same as those listed above under IRSA.