Amazon S3

Amazon S3 is AWS's object storage service. Use ADOC to monitor and profile S3 buckets and files for data quality, freshness, usage, schema drift, and reliability insights.

Prerequisites

Before you connect Amazon S3 as a data source in ADOC, ensure the following:

  • You have access to an AWS S3 bucket containing the data you want to monitor.
  • The ADOC Data Plane is deployed and has network access to the S3 bucket.
  • Authentication to AWS is configured using one of the following methods:
    • EC2 Instance Profile with the appropriate IAM role attached
    • Kubernetes IAM Roles for Service Accounts (IRSA)
    • AWS Access Key and Secret Key credentials
  • Your IAM policy includes the necessary permissions to access S3, Secrets Manager (if applicable), and SQS (if you're using event-based or incremental monitoring).

Add S3 as a Data Source

Follow these steps to set up S3 in ADOC:

Step 1: Start Setup

  1. Select Register from the left main menu.

  2. Select Add Data Source.

  3. Select AWS S3 from the list of data sources.

  4. On the Data Source Details page:

    1. Enter a name for this data source that is unique within your tenant.
    2. Optionally, add a brief description to clarify its purpose.
    3. Enable the Data Reliability toggle and select your data plane from the drop-down list.
  5. Select Next to proceed.

Step 2: Add Connection Details

Enter your AWS S3 connection information. Required fields vary depending on the selected Authentication Type.

Common Fields (Displayed for All Authentication Types)

| Field | Description |
| --- | --- |
| AWS Region | The AWS region where your S3 bucket is hosted (e.g., us-east-2, eu-west-1). For more information, refer to AWS's Regions and Zones documentation. |
| AWS S3 Authentication Type | Select the method of authentication used to access your AWS S3 bucket. Options include Access Key/Secret Key, EC2 Instance Profile, and (if applicable) IAM Roles for Service Accounts (IRSA) in Kubernetes. |
| Bucket Name | The name of the S3 bucket you want to monitor. Multiple buckets can be added. |
| File Monitoring Channel Type | (Optional) Choose SQS to receive file change notifications from S3 via Amazon SQS (Simple Queue Service); otherwise, select NONE. |
| SQS Queue URL | Required if File Monitoring Channel Type is set to SQS. The URL of the SQS queue that receives S3 event notifications (e.g., https://sqs.us-east-2.amazonaws.com/123456789012/my-queue). |
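ADOC consumes these notifications, but routing S3 events to the SQS queue is configured on the AWS side. A minimal sketch using the AWS CLI is shown below; the bucket name, queue ARN, and account ID are placeholders, and your queue's access policy must separately allow s3.amazonaws.com to send messages:

```shell
# Route object-created events from the bucket to an SQS queue.
# Bucket name and queue ARN are placeholders; adjust to your environment.
aws s3api put-bucket-notification-configuration \
  --bucket my-adoc-bucket \
  --notification-configuration '{
    "QueueConfigurations": [
      {
        "QueueArn": "arn:aws:sqs:us-east-2:123456789012:my-queue",
        "Events": ["s3:ObjectCreated:*"]
      }
    ]
  }'
```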

Authentication Types

| Authentication Type | Description |
| --- | --- |
| AWS Access Key / Secret Key | Provide an AWS Access Key ID and Secret Access Key with sufficient permissions (e.g., s3:ListBucket, s3:GetObject). If Use Secrets Manager is enabled, select the Secrets Manager and specify the key name that contains the secret. For details on viewing your AWS access key and secret key, refer to the AWS documentation. |
| AWS EC2 Instance Profile | Uses the IAM role attached to the EC2 instance where the ADOC Data Plane is running. No credentials need to be entered manually. Ensure the IAM role has the necessary S3 and SQS permissions. |
| AWS IAM Roles for Service Accounts | Uses IAM Roles for Service Accounts (IRSA), recommended for Kubernetes environments. The Kubernetes service account used by the ADOC Data Plane must be annotated with the appropriate IAM role, allowing the workload to assume the role automatically with no need to enter credentials manually. No authentication fields appear in the UI, but this method must be pre-configured in your Kubernetes environment. For more information, refer to the IRSA-Based Authentication section under Additional References below. |
  1. Select Test Connection. If successful, you’ll see “Connected.” If the test fails, check your credentials, region, and network connectivity, and ensure that the ADOC Data Plane service (ad-analysis-standalone) is running.
  2. Select Next to proceed.

Step 3: Setup Observability

Configure how ADOC will monitor your S3 bucket:

  1. Asset Name: Enter a logical name or label.
  2. Path Expression: Enter the complete path expression using the syntax s3://<bucket-name>/<file-path>. For example, s3://acceldatabucket/observability.csv.
  3. File Type: Select a file type and provide the file-type-specific parameters. ADOC supports JSON, CSV, ORC, PARQUET, AVRO, and Delta files, including compressed files (.bz2, .deflate, .gz, .lz4, .snappy). ADOC also supports profiling of zipped and KMS-encrypted files. For details, refer to the table below.
| File Type | Parameter | Description |
| --- | --- | --- |
| CSV | Delimiter | The character that separates fields in a CSV file. Common delimiters include commas (,), tabs (\t), and semicolons (;). |
| ORC | None | No additional parameters are required for ORC files. |
| PARQUET | File Processing Strategy | Options include: Evolving Schema (no additional parameters required), Random Files, or Date Partitioned. |
| PARQUET | Base Path (Random Files) | The root directory or location in the storage system where the Parquet files are stored. This is used to locate the data for random file processing. |
| PARQUET | Base Path (Date Partitioned) | The root directory or location where the date-partitioned Parquet files are stored. |
| PARQUET | Pattern (Date Partitioned) | A file pattern that includes a date (e.g., "file-<yyyy-MM-dd>.parquet") to identify the specific files for processing. |
| PARQUET | LookBack Days (Date Partitioned) | The number of days to look back when crawling and processing date-partitioned Parquet files. |
| PARQUET | TimeZone (Date Partitioned) | The time zone in which the partitioned data is recorded. |
| JSON | Flattening Level | Defines how deeply nested JSON structures will be flattened. Nested JSON fields will be expanded based on the level specified. |
| JSON | MultiLine JSON | When enabled, this toggle allows for the processing of JSON data that spans multiple lines. |
| AVRO | Schema Store Type | Specifies where the AVRO schema is stored. Options could include local files, a schema registry, or other storage systems. |
| Delta | None | No additional parameters are required for Delta files. |
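For example (illustrative values only): for date-partitioned Parquet files stored under s3://acceldatabucket/sales/ and named like sales-2024-01-15.parquet, you might set Base Path to s3://acceldatabucket/sales/, Pattern to sales-<yyyy-MM-dd>.parquet, LookBack Days to 7, and TimeZone to UTC.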

Configure Observability Options:

  1. Enable Schema Drift Monitoring to detect changes in file schemas (e.g., added, removed, or renamed columns) over time.
  2. Enable Crawler Execution Schedule to set up scheduled scans of your S3 bucket:
  • Choose how often the crawler runs (e.g., daily)
  • Set execution time and time zone
  • Add multiple execution times if needed
  3. Set Notifications:
  • Notify on Crawler Failure: Choose one or more channels to receive failure alerts.
  • Notify on Success: Toggle this if you'd like to receive success notifications.
  4. Select Submit to save your configuration and begin monitoring the AWS S3 data source.

You have successfully added AWS S3 as a data source. A new card for AWS S3 will appear on the Data Sources page, displaying crawler status and basic connection details.

What’s Next

Once you've successfully connected your Amazon S3 bucket as a data source in ADOC, you can:

  • Profile your S3 data: Run profiling jobs to gather metrics such as row count, null percentage, file size, and freshness across supported file formats.
  • Monitor data quality in real time: Enable schema drift detection and track changes to file structure, format, or volume using scheduled crawlers or SQS-based triggers.
  • Apply data reliability rules and policies: Set up and enforce data quality rules—such as column-level validations, null checks, or file arrival thresholds—directly on your S3 data.

Additional References

IRSA-Based Authentication

IAM Roles for Service Accounts (IRSA) is a secure way to manage AWS access in Kubernetes without hard-coded credentials. IRSA allows each ADOC service in the Data Plane to assume a role with only the permissions it needs, following the principle of least privilege.

How IRSA Works in ADOC

The ADOC Data Plane connects to AWS services in your environment using service accounts mapped to IAM roles. These roles allow access to:

  • S3 (for data ingestion)
  • AWS Secrets Manager (to retrieve credentials)
  • Amazon Athena (for metadata crawling)
  • SQS (for event-driven ingestion)

Kubernetes Service Account Mapping

| ADOC Service | Service Account | AWS Access |
| --- | --- | --- |
| Analysis Service | analysis-service | AWS Secrets Manager (read) |
| Analysis Standalone Service | analysis-standalone-service | S3 (read), Athena (read) |
| Spark Driver / Executor | spark-scheduler | S3 (read/write), Athena (read) |
| Monitors Service | torch-monitors | SQS (read, for incremental S3 ingestion) |
| Crawlers | analysis-service | Athena (read) |

Annotating Service Accounts for IRSA

To map a Kubernetes service account to an IAM role:

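A minimal sketch of the standard IRSA annotation is shown below for the Analysis Standalone service account; the namespace, account ID, and role name are placeholders, and the same pattern applies to each service account in the table above:

```shell
# Annotate the ADOC service account with the IAM role it should assume
# (standard IRSA annotation key; namespace, account ID, and role name are placeholders).
kubectl annotate serviceaccount analysis-standalone-service \
  --namespace <adoc-namespace> \
  eks.amazonaws.com/role-arn=arn:aws:iam::<account-id>:role/<adoc-s3-role>
```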

Replace placeholders with your AWS values.

Do not restart the Crawler or Spark Driver/Executor pods during this process.

Required IAM Policies

  1. S3 Read Access
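A minimal sketch of a read-only S3 policy, assuming a single bucket (the bucket name is a placeholder, and the exact policy ADOC requires may scope resources differently):

```json
{
  "Version": "2012-10-17",
  "Statement": [
    {
      "Sid": "ListBucket",
      "Effect": "Allow",
      "Action": ["s3:ListBucket"],
      "Resource": "arn:aws:s3:::<bucket-name>"
    },
    {
      "Sid": "ReadObjects",
      "Effect": "Allow",
      "Action": ["s3:GetObject"],
      "Resource": "arn:aws:s3:::<bucket-name>/*"
    }
  ]
}
```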
  2. Secrets Manager Access
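A similar sketch granting read access to a specific secret (the region, account ID, and secret name are placeholders):

```json
{
  "Version": "2012-10-17",
  "Statement": [
    {
      "Sid": "ReadSecret",
      "Effect": "Allow",
      "Action": [
        "secretsmanager:GetSecretValue",
        "secretsmanager:DescribeSecret"
      ],
      "Resource": "arn:aws:secretsmanager:<region>:<account-id>:secret:<secret-name>*"
    }
  ]
}
```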
  3. SQS Access (for Incremental Processing)
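A sketch of the consumer-side permissions typically needed to poll an SQS queue (the region, account ID, and queue name are placeholders):

```json
{
  "Version": "2012-10-17",
  "Statement": [
    {
      "Sid": "ConsumeQueue",
      "Effect": "Allow",
      "Action": [
        "sqs:ReceiveMessage",
        "sqs:DeleteMessage",
        "sqs:GetQueueAttributes"
      ],
      "Resource": "arn:aws:sqs:<region>:<account-id>:<queue-name>"
    }
  ]
}
```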

Secrets Manager Configuration (for IRSA or EKS Pod Identity)

To retrieve secrets via ADOC’s Data Plane:

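Based on the authType values listed below, the stored secret appears to be a JSON object whose authType field names the authentication method. The following is a hypothetical sketch only; the IRSA-specific authType value is not shown on this page and is left as a placeholder, and any additional keys your deployment requires are not reproduced here:

```json
{
  "authType": "<irsa-auth-type>"
}
```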

Other authType values (for alternate methods):

  • KEY_BASED
  • INSTANCE_PROFILE_BASED

EKS Pod Identity (Alternative to IRSA)

EKS Pod Identity enables IAM role assignment directly to pods, removing the need for annotations. It provides the same security benefits as IRSA but with more granularity.

The service account mapping for EKS Pod Identity is the same as shown above under Kubernetes Service Account Mapping.

IAM policies for EKS Pod Identity are the same as those listed above under IRSA.
