Amazon S3
Amazon S3 is AWS’s object storage service. Use ADOC to monitor and profile S3 buckets and files—providing data quality, freshness, usage, schema drift, and reliability insights.
Prerequisites
Before you connect Amazon S3 as a data source in ADOC, ensure the following:
- You have access to an AWS S3 bucket containing the data you want to monitor.
- The ADOC Data Plane is deployed and has network access to the S3 bucket.
- Authentication to AWS is configured using one of the following methods:
  - EC2 Instance Profile with the appropriate IAM role attached
  - Kubernetes IAM Roles for Service Accounts (IRSA)
  - AWS Access Key and Secret Key credentials
- Your IAM policy includes the necessary permissions to access S3, Secrets Manager (if applicable), and SQS (if you're using event-based or incremental monitoring).
Add S3 as a Data Source
Follow these steps to set up S3 in ADOC:
Step 1: Start Setup
Select Register from the left main menu.
Select Add Data Source.
Select AWS S3 from the list of data sources.
On the Data Source Details page:
- Enter a name for this data source that is unique within your tenant.
- Optionally, add a brief description to clarify its purpose.
- Enable the Data Reliability toggle and select your data plane from the drop-down list.
Select Next to proceed.
Step 2: Add Connection Details
Enter your AWS S3 connection information. Required fields vary depending on the selected Authentication Type.
Common Fields (Displayed for All Authentication Types)
Field | Description |
---|---|
AWS Region | The AWS region where your S3 bucket is hosted (e.g., us-east-2, eu-west-1). For more information, refer to the AWS Regions and Zones documentation. |
AWS S3 Authentication Type | Select the method of authentication used to access your AWS S3 bucket. Options include Access Key/Secret Key, EC2 Instance Profile, and (if applicable) IAM Roles for Service Accounts (IRSA) in Kubernetes. |
Bucket Name | Name of the S3 bucket you want to monitor. Multiple buckets can be added. |
File Monitoring Channel Type | (Optional) Choose SQS to receive file change notifications from S3 via Amazon SQS (Simple Queue Service); otherwise, select NONE. See the example notification configuration below this table. |
SQS Queue URL | The URL of the SQS queue that receives S3 event notifications. Required if File Monitoring Channel Type is set to SQS. |
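ADOC consumes S3 event notifications delivered to the SQS queue; wiring the bucket to publish those notifications happens on the AWS side. A minimal sketch of a bucket notification configuration that sends object-creation events to a queue (the queue ARN is a placeholder, and the exact event types ADOC expects are an assumption), applied with aws s3api put-bucket-notification-configuration:

```json
{
  "QueueConfigurations": [
    {
      "Id": "adoc-file-monitoring",
      "QueueArn": "arn:aws:sqs:us-east-2:<account-id>:<queue-name>",
      "Events": ["s3:ObjectCreated:*"]
    }
  ]
}
```

The SQS queue's access policy must also allow the S3 service principal to send messages to it.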
Authentication Types
Authentication Type | Description |
---|---|
AWS Access Key / Secret Key | Provide an AWS Access Key ID and Secret Access Key with sufficient permissions (e.g., s3:ListBucket, s3:GetObject). If Use Secrets Manager is enabled, select the Secrets Manager and specify the key name that contains the secret. For more details on how to view your AWS access key and secret key, refer to this AWS document. |
AWS EC2 Instance Profile | Uses the IAM role attached to the EC2 instance where the ADOC Data Plane is running. No credentials need to be entered manually. Ensure the IAM role has the necessary S3 and SQS permissions. |
AWS IAM Roles For Service Accounts | Uses IAM Roles for Service Accounts (IRSA), recommended for Kubernetes environments. The Kubernetes service account used by the ADOC Data Plane must be annotated with the appropriate IAM role, which allows the workload to assume the role automatically without manually entered credentials. No authentication fields appear in the UI, but this method must be pre-configured in your Kubernetes environment. For more information, refer to the IRSA-Based Authentication section under Additional References below. |
- Select Test Connection. If the connection succeeds, you’ll see “Connected.” If the test fails, check your credentials, region, and network access, and ensure that the ADOC Data Plane service (`ad-analysis-standalone`) is running.
- Select Next to proceed.
Step 3: Set Up Observability
Configure how ADOC will monitor your S3 bucket:
- Asset Name: Enter a logical name or label.
- Path Expression: Enter the complete path expression using the syntax `s3://<bucket-name>/<file-path>`. For example, `s3://acceldatabucket/observability.csv`.
- File Type: Select a file type and provide file-type specific parameters. ADOC supports CSV, JSON, Avro, ORC, Parquet, and Delta files, as well as compressed files (.bz2, .deflate, .gz, .lz4, .snappy). ADOC also supports profiling of zipped and KMS-encrypted files. For details, refer to the table below.
File Type | Parameter | Description |
---|---|---|
CSV | Delimiter | The character that separates fields in a CSV file. Common delimiters include commas (,), tabs (\t), or semicolons (;). |
ORC | None | No additional parameters are required for ORC files. |
PARQUET | File Processing Strategy | Options include: Evolving Schema (no additional parameters required), Random Files, or Date Partitioned. |
Base Path (Random Files) | The root directory or location in the storage system where the Parquet files are stored. This is used to locate the data for random file processing. | |
Base Path (Date Partitioned) | The root directory or location where the date-partitioned Parquet files are stored. | |
Pattern (Date Partitioned) | A file pattern that includes a date (e.g., "file-<yyyy-MM-dd>.parquet") to identify the specific files for processing. | |
LookBack Days (Date Partitioned) | The number of days to look back when crawling and processing date-partitioned Parquet files. | |
TimeZone (Date Partitioned) | The time zone in which the partitioned data is recorded. | |
JSON | Flattening Level | Defines how deeply nested JSON structures will be flattened; nested JSON fields are expanded to the specified depth (see the example below this table). |
MultiLine JSON | When enabled, this toggle allows for the processing of JSON data that spans multiple lines. | |
AVRO | Schema Store Type | Specifies where the AVRO schema is stored. Options could include local files, a schema registry, or other storage systems. |
Delta | None | No additional parameters are required for Delta files. |
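To illustrate the Flattening Level parameter, consider the record below. Assuming ADOC expands nested objects into dot-separated column names (the exact naming convention is an assumption), a flattening level of 1 would expose order.id and leave order.customer nested, while a level of 2 would also expose order.customer.name and order.customer.city:

```json
{
  "order": {
    "id": 42,
    "customer": {
      "name": "Ada",
      "city": "Austin"
    }
  }
}
```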
Configure Observability Options:
- Enable Schema Drift Monitoring to detect changes in file schemas (e.g., added, removed, or renamed columns) over time.
- Enable Crawler Execution Schedule to set up scheduled scans of your S3 bucket:
  - Choose how often the crawler runs (e.g., daily).
  - Set the execution time and time zone.
  - Add multiple execution times if needed.
- Set Notifications:
  - Notify on Crawler Failure: Choose one or more channels to receive failure alerts.
  - Notify on Success: Toggle this if you'd like to receive success notifications.
- Select Submit to save your configuration, register the data source, and begin monitoring.
You have successfully added AWS S3 as a data source. A new card for AWS S3 will appear on the Data Sources page, displaying crawler status and basic connection details.
What’s Next
Once you've successfully connected your Amazon S3 bucket as a data source in ADOC, you can:
- Profile your S3 data: Run profiling jobs to gather metrics such as row count, null percentage, file size, and freshness across supported file formats.
- Monitor data quality in real time: Enable schema drift detection and track changes to file structure, format, or volume using scheduled crawlers or SQS-based triggers.
- Apply data reliability rules and policies: Set up and enforce data quality rules—such as column-level validations, null checks, or file arrival thresholds—directly on your S3 data.
Additional References
IRSA-Based Authentication
IAM Roles for Service Accounts (IRSA) is a secure way to manage AWS access in Kubernetes without hard-coded credentials. IRSA allows each ADOC service in the Data Plane to assume a role with only the permissions it needs, following the principle of least privilege.
How IRSA Works in ADOC
The ADOC Data Plane connects to AWS services in your environment using service accounts mapped to IAM roles. These roles allow access to:
- S3 (for data ingestion)
- AWS Secrets Manager (to retrieve credentials)
- Amazon Athena (for metadata crawling)
- SQS (for event-driven ingestion)
Kubernetes Service Account Mapping
ADOC Service | Service Account | AWS Access |
---|---|---|
Analysis Service | analysis-service | AWS Secrets Manager (read) |
Analysis Standalone Service | analysis-standalone-service | S3 (read), Athena (read) |
Spark Driver / Executor | spark-scheduler | S3 (read/write), Athena (read) |
Monitors Service | torch-monitors | SQS (read, for incremental S3 ingestion) |
Crawlers | analysis-service | Athena (read) |
Annotating Service Accounts for IRSA
To map a Kubernetes service account to an IAM role:
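For example, for the Analysis Standalone Service, the annotation can be declared on the service account manifest. A minimal sketch (the namespace, account ID, and role name are placeholders):

```yaml
apiVersion: v1
kind: ServiceAccount
metadata:
  name: analysis-standalone-service
  namespace: <adoc-namespace>
  annotations:
    # IAM role the ADOC workload assumes via IRSA
    eks.amazonaws.com/role-arn: arn:aws:iam::<account-id>:role/<adoc-s3-role>
```

The same annotation can be applied in place with kubectl annotate serviceaccount. Repeat for each service account in the mapping table above, pointing each at a role scoped to its required access.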
Replace placeholders with your AWS values.
Do not restart the Crawler or Spark Driver/Executor pods during this process.
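For the annotated role to be assumable, its IAM trust policy must allow sts:AssumeRoleWithWebIdentity from the cluster's OIDC provider. A standard IRSA trust policy sketch (account ID, region, OIDC provider ID, and namespace are placeholders):

```json
{
  "Version": "2012-10-17",
  "Statement": [
    {
      "Effect": "Allow",
      "Principal": {
        "Federated": "arn:aws:iam::<account-id>:oidc-provider/oidc.eks.<region>.amazonaws.com/id/<oidc-id>"
      },
      "Action": "sts:AssumeRoleWithWebIdentity",
      "Condition": {
        "StringEquals": {
          "oidc.eks.<region>.amazonaws.com/id/<oidc-id>:sub": "system:serviceaccount:<adoc-namespace>:analysis-standalone-service"
        }
      }
    }
  ]
}
```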
Required IAM Policies
- S3 Read Access
- Secrets Manager Access
- SQS Access (for Incremental Processing)
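The exact policy documents are environment-specific; the following is a minimal sketch covering the three access areas above. The bucket, secret, and queue identifiers are placeholders, and the precise action set ADOC requires may differ, so scope the resources to your own:

```json
{
  "Version": "2012-10-17",
  "Statement": [
    {
      "Sid": "S3ReadAccess",
      "Effect": "Allow",
      "Action": ["s3:ListBucket", "s3:GetObject"],
      "Resource": [
        "arn:aws:s3:::<bucket-name>",
        "arn:aws:s3:::<bucket-name>/*"
      ]
    },
    {
      "Sid": "SecretsManagerAccess",
      "Effect": "Allow",
      "Action": ["secretsmanager:GetSecretValue"],
      "Resource": "arn:aws:secretsmanager:<region>:<account-id>:secret:<secret-name>*"
    },
    {
      "Sid": "SQSIncrementalProcessing",
      "Effect": "Allow",
      "Action": [
        "sqs:ReceiveMessage",
        "sqs:DeleteMessage",
        "sqs:GetQueueAttributes"
      ],
      "Resource": "arn:aws:sqs:<region>:<account-id>:<queue-name>"
    }
  ]
}
```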
Secrets Manager Configuration (for IRSA or EKS Pod Identity)
To retrieve secrets via ADOC’s Data Plane:
```json
[
  {
    "name": "<config-name>",
    "type": "AWS_SECRETS_MANAGER",
    "details": {
      "secretName": "<aws-secret-name>",
      "accessKey": "",
      "secretKey": "",
      "region": "<region>",
      "authType": "IAM_ROLES_FOR_SA_BASED"
    }
  }
]
```
Other authType values (for alternate methods):
- KEY_BASED
- INSTANCE_PROFILE_BASED
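For KEY_BASED, the accessKey and secretKey fields presumably carry the credentials themselves rather than being left empty; this is an assumption, so confirm the expected values with your deployment:

```json
[
  {
    "name": "<config-name>",
    "type": "AWS_SECRETS_MANAGER",
    "details": {
      "secretName": "<aws-secret-name>",
      "accessKey": "<aws-access-key-id>",
      "secretKey": "<aws-secret-access-key>",
      "region": "<region>",
      "authType": "KEY_BASED"
    }
  }
]
```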
EKS Pod Identity (Alternative to IRSA)
EKS Pod Identity enables IAM role assignment directly to pods through pod identity associations, removing the need for service account annotations and OIDC trust policies. It provides the same security benefits as IRSA but with more granularity.
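An association links a service account to a role; a sketch using the AWS CLI (cluster name, namespace, and role are placeholders, and the Amazon EKS Pod Identity Agent add-on must be installed on the cluster):

```bash
aws eks create-pod-identity-association \
  --cluster-name <cluster-name> \
  --namespace <adoc-namespace> \
  --service-account analysis-standalone-service \
  --role-arn arn:aws:iam::<account-id>:role/<adoc-s3-role>
```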
The service account to AWS access mapping is identical to the Kubernetes Service Account Mapping table above.
IAM policies for EKS Pod Identity are the same as those listed above under IRSA.