Amazon | S3
AWS S3 (Simple Storage Service) is an object storage service. S3 stores data as objects within buckets; an object consists of a file and, optionally, metadata that describes the file. To use S3, you upload your objects (files) to a bucket in your S3 account.
S3 in ADOC
ADOC provides data reliability capabilities for data stored in your S3 data source. You must create a Data Plane or use an existing one to add S3 as a Data Source in ADOC. After you add S3 as a Data Source, you can view the details of your S3 bucket usage on the Data Reliability tab in ADOC.
Steps to Add S3 as a Data Source
To add S3 as a Data source:
- Click Register from the left pane.
- Click Add Data Source.
- Select the AWS S3 Data Source. The AWS S3 Data Source Basic Details page is displayed.


- Enter a name for the data source in the Data Source name field.

- (Optional) Enter a description for the Data Source in the Description field.
- Enable the Data Reliability capability by switching on the toggle switch.
- Select a Data Plane from the Select Data Plane drop-down menu.
To create a new Data Plane, click the Setup Dataplane button on the Data Plane tab of the Data Sources page.
You must either create a Data Plane or use an existing Data Plane to enable the Data Reliability capability.
- Click Next. The AWS S3 Connection Details page is displayed.

Enter the AWS Region where your S3 data resides in the AWS Region field. For more details on how to view your AWS Region, refer to this AWS document.
AWS S3 Authentication Type: Choose from the following authentication types:
- AWS EC2 Instance Profile: This method utilizes the role assigned to the underlying EC2 instance, allowing all workloads running on the instance to assume this role.
- AWS IAM Roles For Service Accounts: Ideal for Kubernetes or containerized environments. You'll need to specify the S3 bucket name. For IAM Roles for Service Accounts (IRSA), Kubernetes service accounts must be annotated with the appropriate IAM roles as needed. Workloads using the specified Kubernetes service account will then be able to assume the annotated IAM role.
- AWS Access Key / Secret Key: Enter your AWS access key and secret key in the AWS Access Key and AWS Secret Key fields. For more details on how to view your AWS access key and secret key, refer to this AWS document; for details on how to view your AWS Region, refer to this AWS document.
Bucket Name: Enter the name of the specific AWS S3 bucket.
(Optional) Toggle the Use Secret Manager option if you want to use AWS Secrets Manager to manage your credentials.
Click Test Connection. If your credentials are valid, you receive a Connected message. If you get an error message, validate the AWS credentials you entered.
Click Next. The Observability Setup page is displayed.

- Provide the Asset Name, Path Expression, and File Type of your S3 assets. To add more assets, click +. The selected assets are monitored by the Data Reliability capability of ADOC.
- Based on the file type selected, you must enter values for the required parameters:
File Type | Parameter | Description |
---|---|---|
CSV | Delimiter | The character that separates fields in a CSV file. Common delimiters include commas (,), tabs (\t), and semicolons (;). |
ORC | None | No additional parameters are required for ORC files. |
PARQUET | File Processing Strategy | Options include Evolving Schema (no additional parameters required), Random Files, and Date Partitioned. |
PARQUET | Base Path (Random Files) | The root directory or location in the storage system where the Parquet files are stored. This is used to locate the data for random file processing. |
PARQUET | Base Path (Date Partitioned) | The root directory or location where the date-partitioned Parquet files are stored. |
PARQUET | Pattern (Date Partitioned) | A file pattern that includes a date (for example, "file-<yyyy-MM-dd>.parquet") used to identify the specific files for processing. |
PARQUET | LookBack Days (Date Partitioned) | The number of days to look back when crawling and processing date-partitioned Parquet files. |
PARQUET | TimeZone (Date Partitioned) | The time zone in which the partitioned data is recorded. |
JSON | Flattening Level | Defines how deeply nested JSON structures are flattened. Nested JSON fields are expanded based on the level specified. |
JSON | MultiLine JSON | When enabled, this toggle allows processing of JSON data that spans multiple lines. |
AVRO | Schema Store Type | Specifies where the AVRO schema is stored. Options can include local files, a schema registry, or other storage systems. |
Delta | None | No additional parameters are required for Delta files. |
- The asset path expression must be a full path. The syntax is s3://bucket name/file name. For example, s3://acceldatabucket/observability.csv.
- To include all the files in the bucket, use the syntax s3://bucket name. This includes all the files stored in the S3 bucket, irrespective of the file type.
- To include all the files that belong to a specific file type, use the syntax s3://bucket name/*file extension. For example, to include all the CSV files in the bucket, use s3://bucket name/*.csv; to include all the JSON files in the bucket, use s3://bucket name/*.json. A listing sketch at the end of this section shows how to confirm a path expression with the AWS CLI.
- Enable Schema Drift Monitoring: For newly crawled assets, Acceldata automatically enables schema drift monitoring. These monitoring policies may be further configured per asset in the application.
Warning: Schema changes are not detected automatically unless the crawler is executed. Acceldata recommends scheduling the crawler.
- Enable Crawler Execution Schedule: Turn on this toggle switch to select a time tag and time zone to schedule the execution of crawlers for Data Reliability.
- Click Submit.
S3 is now added as a Data Source. You can choose to crawl your S3 account now or later. You can navigate to the Managing Data Sources page to view the options available after adding the Data Source.
ADOC supports the file types listed in the table above (CSV, ORC, PARQUET, JSON, AVRO, and Delta). ADOC also supports profiling of zipped and KMS-encrypted files.
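To confirm that a bucket name, file path, or wildcard expression used in an asset path actually resolves to objects, you can list them from a machine that has the Data Plane's AWS credentials. The following is a minimal sketch with the AWS CLI, reusing the example bucket name acceldatabucket from above (replace it with your own bucket name):
aws s3 ls s3://acceldatabucket/                      # list every object in the bucket
aws s3 ls s3://acceldatabucket/observability.csv     # confirm that a single file exists
aws s3api list-objects-v2 --bucket acceldatabucket --query "Contents[?ends_with(Key, '.csv')].Key"   # keys matching a file extension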
IRSA-Based Authentication and Permissions
IAM Roles for Service Accounts (IRSA) is an authentication method used to securely manage permissions for Kubernetes service accounts without the need for hard-coded AWS credentials like access keys and secret keys. IRSA enables the principle of least privilege by assigning specific roles to each workload in ADOC, ensuring that each service has only the permissions it requires.
For details on how to configure IRSA for your EKS Cluster, refer to this AWS document.
IRSA in ADOC
The data plane on the ADOC platform links to AWS Services within the user environment.
Service | Description |
---|---|
Data Source Access | Secure read access to AWS data sources such as S3. |
Secret Manager Access | Secure retrieval of secret values from AWS Secrets Manager. |
How to Annotate Kubernetes Service Accounts?
By annotating a Kubernetes service account with an IAM role, you are effectively telling the Kubernetes environment that whenever this service account makes a request to an AWS service, it should use the permissions associated with the specified IAM role.
To annotate a Kubernetes service account with an IAM role, you need to use the following annotation format:
eks.amazonaws.com/role-arn: arn:aws:iam::<AWS Account ID>:role/<IAM Role Name>
Replace <AWS Account ID> with your actual AWS account ID and <IAM Role Name> with the IAM role name that has the necessary permissions.
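As an illustration, the annotation can be applied with kubectl. This is a minimal sketch, using the analysis-service account from the table below; the namespace adoc-dataplane, the account ID, and the role name adoc-analysis-service-role are placeholders to replace with your own values:
kubectl annotate serviceaccount analysis-service \
  --namespace adoc-dataplane \
  eks.amazonaws.com/role-arn=arn:aws:iam::111122223333:role/adoc-analysis-service-role \
  --overwrite
The --overwrite flag lets you update an existing annotation. Repeat the command for each service account in the table below with its own role ARN.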
Annotation Details by ADOC Data Plane Service:
Following table details the Kubernetes service accounts used by the ADOC Data Plane and the corresponding AWS services they need access to. Ensure each service account is annotated with the appropriate IAM role.
ADOC Data Plane Service | Kubernetes Service Account | AWS Services Access and Mode (Read / Write) |
---|---|---|
Analysis Service | analysis-service | AWS Secrets Manager (secret value read) |
Analysis Standalone Service | analysis-standalone-service | AWS S3 (data source + global storage read), AWS Athena (data source read) |
Spark Driver / Executor | spark-scheduler | AWS S3 (data source read + global storage write), AWS Athena (data source read) |
Monitors Service | torch-monitors | Permission for reading the SQS queue in case of Incremental File processing for S3 |
Crawlers | analysis-service | AWS Athena (data source read) |
IAM Roles and Policies for IRSA-Based Authentication
This section provides detailed IAM policies that should be created and attached to the relevant IAM roles for different types of access within the ADOC Data Plane. These policies ensure secure access to various AWS services using IRSA.
Data Source Access
All access to data within the customer's environment is read-only. The following IAM policy grants read-only access to the data source S3 bucket:
{
"Version": "2012-10-17",
"Statement": [
{
"Effect": "Allow",
"Action": [
"s3:GetObject",
"s3:ListBucket"
],
"Resource": [
"arn:aws:s3:::<data source bucket name>/*",
"arn:aws:s3:::<data source bucket name>"
]
}
]
}
Replace <data source bucket name> with the actual bucket name used in your environment.
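One way to attach this policy is as an inline policy on the IAM role that the relevant service account assumes. A hedged sketch with the AWS CLI, assuming the policy JSON above is saved locally as s3-datasource-read.json and the role is named adoc-analysis-standalone-role (both names are illustrative, not fixed by ADOC):
aws iam put-role-policy \
  --role-name adoc-analysis-standalone-role \
  --policy-name adoc-s3-datasource-read \
  --policy-document file://s3-datasource-read.json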
Secret Manager Access
To allow the Kubernetes service account to read a secret value from AWS Secrets Manager, use the following IAM policy:
{
"Version": "2012-10-17",
"Statement": [
{
"Effect": "Allow",
"Action": "secretsmanager:GetSecretValue",
"Resource": "<AWS ARN of the Secret created>"
}
]
}
Replace <AWS ARN of the Secret created> with the actual ARN of the secret in AWS Secrets Manager.
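To verify that the role can actually read the secret, you can call Secrets Manager from a session that has assumed the role (for example, from a pod using the annotated service account). A minimal check with the AWS CLI; note that it prints the secret value to the terminal:
aws secretsmanager get-secret-value --secret-id <AWS ARN of the Secret created>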
SQS Queue Permissions
When configuring incremental file processing for an S3-based data source, you need to configure SQS queue permissions for reading the file events. Use the following IAM policy:
{
"Version": "2012-10-17",
"Statement": [
{
"Effect": "Allow",
"Action": [
"sqs:ReceiveMessage"
],
"Resource": "arn:aws:sqs:region:<AWS Account ID>:<AWS SQS Queue Name>"
}
]
}
Replace <region>, <AWS Account ID>, and <AWS SQS Queue Name> with your specific values.
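To confirm the queue permission works, you can poll the queue once from a session that has assumed the role. A minimal sketch with the AWS CLI; the queue URL is built from the same placeholders as the ARN above:
aws sqs receive-message \
  --queue-url https://sqs.<region>.amazonaws.com/<AWS Account ID>/<AWS SQS Queue Name> \
  --max-number-of-messages 1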
Secret Manager Configuration
For configuring Secret Manager in Data Plane V2, the following JSON should be used for the secret-manager secret:
[
{
"name": "<Name of the config>",
"type": "AWS_SECRETS_MANAGER",
"details": {
"secretName": "<name of the secret in AWS Secret Manager>",
"accessKey": "",
"secretKey": "",
"region": "<region>",
"authType": "IAM_ROLES_FOR_SA_BASED"
}
}
]
In this configuration, name can be any name you choose, type is fixed to AWS_SECRETS_MANAGER for AWS Secrets Manager, and authType must be IAM_ROLES_FOR_SA_BASED when IRSA is used to connect to AWS Secrets Manager.
Customize the fields as per your environment's requirements. For the authType field, the other supported values are INSTANCE_PROFILE_BASED and KEY_BASED.
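If your Data Plane V2 installation stores this configuration as a Kubernetes secret named secret-manager (as the name suggests), a hedged sketch of creating it from a local file could look like the following. The namespace and the data key used inside the secret are assumptions here; check your Data Plane installation guide for the exact names it expects:
kubectl create secret generic secret-manager \
  --namespace <data plane namespace> \
  --from-file=secret-manager.json=./secret-manager.json \
  --dry-run=client -o yaml | kubectl apply -f -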
Configuration Steps
- Create IAM roles for each Kubernetes Service account listed in the table above using the IAM policies provided.
- Annotate each service account with the roles you created.
- Once annotated, restart the pods associated with each service account (a restart sketch follows these steps).
Note: Crawler and Spark Driver/Executor pods should not be restarted.
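For the restart step, a minimal sketch using kubectl; the deployment and namespace names are placeholders, since the exact workload names depend on your Data Plane installation, and the Crawler and Spark Driver/Executor pods are excluded as noted above:
kubectl rollout restart deployment/<deployment using the annotated service account> \
  --namespace <data plane namespace>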
ADOC recommends that users utilize distinct IAM roles for specific Kubernetes service accounts. This will help to uphold the Principle of Least Privilege guideline.
EKS Pod Identity-Based Authentication
EKS Pod Identity-based authentication allows Kubernetes workloads to securely authenticate with AWS services without requiring static credentials. By leveraging IAM roles assigned to pods, this method enhances security and enforces the principle of least privilege, ensuring each workload has only the necessary permissions.
EKS Pod Identity in ADOC
The ADOC platform integrates with AWS services within the user environment using EKS Pod Identity.
Service | Description |
---|---|
Data Source Access | Secure read access to AWS data sources such as S3. |
Secret Manager Access | Secure retrieval of secret values from AWS Secrets Manager. |
For details on configuring EKS Pod Identity for your EKS cluster, refer to the AWS documentation.
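With EKS Pod Identity, the link between a service account and an IAM role is created as a pod identity association rather than an annotation, and it requires the Amazon EKS Pod Identity Agent add-on on the cluster (covered in the AWS documentation referenced above). A minimal sketch with the AWS CLI, using the analysis-service account from the table below as an example; the cluster name, namespace, account ID, and role name are placeholders:
aws eks create-pod-identity-association \
  --cluster-name <EKS cluster name> \
  --namespace <data plane namespace> \
  --service-account analysis-service \
  --role-arn arn:aws:iam::<AWS Account ID>:role/<IAM Role Name>
Repeat the command for each service account in the table below with its own role ARN.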
Kubernetes Service Account Details in ADOC Data Plane
ADOC Data Plane Service | Kubernetes Service Account | AWS Services Access and Mode (Read / Write) |
---|---|---|
Analysis Service | analysis-service | AWS Secrets Manager (secret value read) |
Analysis Standalone Service | analysis-standalone-service | AWS S3 (data source + global storage read), AWS Athena (data source read) |
Spark Driver / Executor | spark-scheduler | AWS S3 (data source read + global storage write), AWS Athena (data source read) |
Monitors Service | torch-monitors | Permission for reading the SQS queue in case of Incremental File processing for S3 |
Crawlers | analysis-service | AWS Athena (data source read) |
IAM Roles and Policies for EKS Pod Identity-Based Authentication
To ensure secure access to AWS services, the following IAM policies should be attached to the relevant IAM roles associated with each Kubernetes service account.
1. Data Source Access
All data access within the customer's environment is read-only. The following IAM policy grants read-only access to an S3 bucket:
{
"Version": "2012-10-17",
"Statement": [
{
"Effect": "Allow",
"Action": [
"s3:GetObject",
"s3:ListBucket"
],
"Resource": [
"arn:aws:s3:::<data source bucket name>/*",
"arn:aws:s3:::<data source bucket name>"
]
}
]
}
2. Secret Manager Access
To allow a Kubernetes service account to read secret values from AWS Secrets Manager, use the following IAM policy:
{
"Version": "2012-10-17",
"Statement": [
{
"Effect": "Allow",
"Action": "secretsmanager:GetSecretValue",
"Resource": "<AWS ARN of the Secret created>"
}
]
}
Replace <AWS ARN of the Secret created> with the actual ARN of the secret in AWS Secrets Manager.
3. SQS Queue Permissions
For incremental file processing with S3, configure SQS queue permissions to allow reading of file events using the following IAM policy:
{
"Version": "2012-10-17",
"Statement": [
{
"Effect": "Allow",
"Action": [
"sqs:ReceiveMessage"
],
"Resource": "arn:aws:sqs:<region>:<AWS Account ID>:<AWS SQS Queue Name>"
}
]
}
Replace <region>, <AWS Account ID>, and <AWS SQS Queue Name> with your specific values.
Secret Manager Configuration for EKS Pod Identity
For configuring AWS Secrets Manager in Data Plane V2, use the following JSON configuration:
[
{
"name": "<Name of the config>",
"type": "AWS_SECRETS_MANAGER",
"details": {
"secretName": "<name of the secret in AWS Secrets Manager>",
"accessKey": "",
"secretKey": "",
"region": "<region>",
"authType": "IAM_ROLES_FOR_SA_BASED"
}
}
]