Amazon S3

AWS S3 (Simple Storage Service) is an object storage service. S3 stores data as objects in buckets; an object consists of a file and, optionally, metadata that describes the file. To use S3, you upload your objects (files) to a bucket in your S3 account.

S3 in ADOC

ADOC provides the Data Reliability capability for data stored in your S3 data source. To add S3 as a data source in ADOC, you must create a Data Plane or use an existing one. Once you add S3 as a data source, you can view the details of your S3 bucket usage in the Data Reliability tab in ADOC.

Steps to Add S3 as a Data Source

To add S3 as a Data source:

  1. Click Register from the left pane.
  2. Click Add Data Source.
  3. Select the AWS S3 Data Source. The AWS S3 Data Source Basic Details page is displayed.
  4. Enter a name for the data source in the Data Source name field.
  5. (Optional) Enter a description for the data source in the Description field.
  6. Enable the Data Reliability capability by switching on the toggle switch.
  7. Select a Data Plane from the Select Data Plane drop-down menu.

To create a new Data Plane, click the Setup Dataplane button in the Data Plane tab on the Data Sources page.

You must either create a Data Plane or use an existing Data Plane to enable the Data Reliability capability.

  8. Click Next. The AWS S3 Connection Details page is displayed.
  9. Enter the region where your AWS account is located in the AWS Region field. For more details on how to view your AWS region, refer to this AWS document.

  10. AWS S3 Authentication Type: Choose one of the following authentication types:

    • AWS EC2 Instance Profile: This method utilizes the role assigned to the underlying EC2 instance, allowing all workloads running on the instance to assume this role.
    • AWS IAM Roles For Service Accounts: Ideal for Kubernetes or containerized environments. You'll need to specify the S3 bucket name. For IAM Roles for Service Accounts (IRSA), Kubernetes service accounts must be annotated with the appropriate IAM roles as needed. Workloads using the specified Kubernetes service account will then be able to assume the annotated IAM role.
    • AWS Access Key / Secret Key: Enter your AWS access key and secret key in the AWS Access Key and AWS Secret Key fields. For more details on how to view your AWS access key and secret key, refer to this AWS document.
  11. Bucket Name: Enter the name of the specific AWS S3 bucket.

  12. (Optional) Toggle the Use Secret Manager option if you want to use AWS Secrets Manager to manage your credentials.

  13. Click Test Connection. If your credentials are valid, you receive a Connected message. If you get an error message, validate the AWS credentials you entered.

  14. Click Next. The Observability Setup page is displayed.

  15. Provide the Asset Name, Path Expression, and File Type of your S3 assets. To add more assets, click +. The selected assets are monitored by the Data Reliability capability of ADOC.
  • Based on the file type selected, you must enter values for the required parameters:

File Type | Parameter | Description
CSV | Delimiter | The character that separates fields in a CSV file. Common delimiters include commas (,), tabs (\t), or semicolons (;).
ORC | None | No additional parameters are required for ORC files.
PARQUET | File Processing Strategy | Options include: Evolving Schema (no additional parameters required), Random Files, or Date Partitioned.
PARQUET | Base Path (Random Files) | The root directory or location in the storage system where the Parquet files are stored. This is used to locate the data for random file processing.
PARQUET | Base Path (Date Partitioned) | The root directory or location where the date-partitioned Parquet files are stored.
PARQUET | Pattern (Date Partitioned) | A file pattern that includes a date (for example, "file-<yyyy-MM-dd>.parquet") to identify the specific files for processing.
PARQUET | LookBack Days (Date Partitioned) | The number of days to look back when crawling and processing date-partitioned Parquet files.
PARQUET | TimeZone (Date Partitioned) | The time zone in which the partitioned data is recorded.
JSON | Flattening Level | Defines how deeply nested JSON structures are flattened. Nested JSON fields are expanded based on the level specified.
JSON | MultiLine JSON | When enabled, this toggle allows processing of JSON data that spans multiple lines.
AVRO | Schema Store Type | Specifies where the AVRO schema is stored. Options can include local files, a schema registry, or other storage systems.
Delta | None | No additional parameters are required for Delta files.

Important

  • The asset path expression must be a full path address. The syntax is s3://<bucket name>/<file name>. For example, s3://acceldatabucket/observability.csv.
  • To include all the files in the bucket, use the syntax s3://<bucket name>. This includes all the files stored in the S3 bucket, irrespective of the file type.
  • To include all the files of a specific file type, use the syntax s3://<bucket name>/*.<file extension>. For example, to include all the CSV files in the bucket, use s3://<bucket name>/*.csv. Similarly, to include all the JSON files in the bucket, use s3://<bucket name>/*.json.
  16. Enable Schema Drift Monitoring: For newly crawled assets, Acceldata automatically enables schema drift monitoring. These monitoring policies can be further configured per asset in the application. Warning: Schema changes are not detected unless the crawler is executed. Acceldata recommends scheduling the crawler.
  17. Enable Crawler Execution Schedule: Turn on this toggle switch to select a time tag and time zone to schedule the execution of crawlers for Data Reliability.
  18. Click Submit.

S3 is now added as a Data Source. You can choose to crawl your S3 account now or later. You can navigate to the Managing Data Sources page to view the options available after adding the Data Source.

ADOC supports the following File Types:

.bz2 | .deflate | .gz | .lz4 | .snappy | JSON | CSV | ORC | PARQUET

ADOC also supports profiling of zipped and KMS-encrypted files.

IRSA-Based Authentication and Permissions

IAM Roles for Service Accounts (IRSA) is an authentication method used to securely manage permissions for Kubernetes service accounts without the need for hard-coded AWS credentials like access keys and secret keys. IRSA enables the principle of least privilege by assigning specific roles to each workload in ADOC, ensuring that each service has only the permissions it requires.

For details on how to configure IRSA for your EKS Cluster, refer to this AWS document.

IRSA in ADOC

The data plane on the ADOC platform links to AWS Services within the user environment.

Service | Description
Data Source Access | Secure read access to AWS data sources such as S3.
Secret Manager Access | Secure retrieval of secret values from AWS Secrets Manager.

How to Annotate Kubernetes Service Accounts?

By annotating a Kubernetes service account with an IAM role, you are effectively telling the Kubernetes environment that whenever this service account makes a request to an AWS service, it should use the permissions associated with the specified IAM role.

To annotate a Kubernetes service account with an IAM role, you need to use the following annotation format:

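A minimal illustrative sketch of this annotation on a ServiceAccount manifest, assuming the standard IRSA annotation key eks.amazonaws.com/role-arn (the service account name and namespace below are placeholders, not values prescribed by ADOC):

JSON
{
  "apiVersion": "v1",
  "kind": "ServiceAccount",
  "metadata": {
    "name": "analysis-service",
    "namespace": "<ADOC Data Plane namespace>",
    "annotations": {
      "eks.amazonaws.com/role-arn": "arn:aws:iam::<AWS Account ID>:role/<IAM Role Name>"
    }
  }
}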

Replace <AWS Account ID> with your actual AWS account ID and <IAM Role Name> with the IAM role name that has the necessary permissions.

Annotation Details by ADOC Data Plane Service:

The following table details the Kubernetes service accounts used by the ADOC Data Plane and the corresponding AWS services they need access to. Ensure each service account is annotated with the appropriate IAM role.

ADOC Data Plane Service | Kubernetes Service Account | AWS Services Access and Mode (Read / Write)
Analysis Service | analysis-service | AWS Secrets Manager (secret value read)
Analysis Standalone Service | analysis-standalone-service | AWS S3 (data source + global storage read), AWS Athena (data source read)
Spark Driver / Executor | spark-scheduler | AWS S3 (data source read + global storage write), AWS Athena (data source read)
Monitors Service | torch-monitors | AWS SQS (queue read, used for incremental file processing for S3)
Crawlers | analysis-service | AWS Athena (data source read)

IAM Roles and Policies for IRSA-Based Authentication

This section provides detailed IAM policies that should be created and attached to the relevant IAM roles for different types of access within the ADOC Data Plane. These policies ensure secure access to various AWS services using IRSA.

Data Source Access

All access to data within the customer's environment is read-only. The following IAM policy grants read-only access to the S3 bucket used as the data source:

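A minimal sketch of such a policy, assuming the Data Plane only needs to list the bucket and read its objects (the exact action list ADOC requires may differ):

JSON
{
  "Version": "2012-10-17",
  "Statement": [
    {
      "Effect": "Allow",
      "Action": [
        "s3:GetObject",
        "s3:ListBucket",
        "s3:GetBucketLocation"
      ],
      "Resource": [
        "arn:aws:s3:::<data source bucket name>",
        "arn:aws:s3:::<data source bucket name>/*"
      ]
    }
  ]
}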

Replace <data source bucket name> with the actual bucket name used in your environment.

Secret Manager Access

To allow the Kubernetes service account to read a secret value from AWS Secrets Manager, use the following IAM policy:

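A minimal sketch of such a policy, assuming only secretsmanager:GetSecretValue is required (ADOC may require additional secret-read actions):

JSON
{
  "Version": "2012-10-17",
  "Statement": [
    {
      "Effect": "Allow",
      "Action": "secretsmanager:GetSecretValue",
      "Resource": "<AWS ARN of the Secret created>"
    }
  ]
}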

Replace <AWS ARN of the Secret created> with the actual ARN of the secret in AWS Secrets Manager.

SQS Queue Permissions

When configuring incremental file processing for an S3-based data source, you need to configure SQS queue permissions for reading the file events. Use the following IAM policy:

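A minimal sketch of such a policy, assuming the typical actions needed to consume S3 event notifications from an SQS queue (the exact action list ADOC requires may differ):

JSON
{
  "Version": "2012-10-17",
  "Statement": [
    {
      "Effect": "Allow",
      "Action": [
        "sqs:ReceiveMessage",
        "sqs:DeleteMessage",
        "sqs:GetQueueAttributes",
        "sqs:GetQueueUrl"
      ],
      "Resource": "arn:aws:sqs:<region>:<AWS Account ID>:<AWS SQS Queue Name>"
    }
  ]
}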

Replace <region>, <AWS Account ID>, and <AWS SQS Queue Name> with your specific values.

Secret Manager Configuration

For configuring Secret Manager in Data Plane V2, the following JSON should be used for the secret-manager secret:

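The exact schema of the secret-manager secret is deployment specific; a minimal illustrative sketch, assuming the secret carries the authentication type and the AWS region (only the authType field is confirmed here; the other field and the placeholder values are assumptions):

JSON
{
  "authType": "<auth type value for IRSA>",
  "region": "<AWS Region>"
}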

Customize the fields as per your environment's requirements. For the "authType" field, the other supported values are INSTANCE_PROFILE_BASED and KEY_BASED.

Configuration Steps

  1. Create IAM roles for each Kubernetes Service account listed in the table above using the IAM policies provided.
  2. Annotate each service account with the roles you created.
  3. Once annotated, restart the pods associated with each service account. Note: Crawler and Spark Driver/Executor pods should not be restarted.

ADOC recommends using a distinct IAM role for each Kubernetes service account. This helps uphold the principle of least privilege.

EKS Pod Identity-Based Authentication

EKS Pod Identity-based authentication allows Kubernetes workloads to securely authenticate with AWS services without requiring static credentials. By leveraging IAM roles assigned to pods, this method enhances security and enforces the principle of least privilege, ensuring each workload has only the necessary permissions.

EKS Pod Identity in ADOC

The ADOC platform integrates with AWS services within the user environment using EKS Pod Identity.

Service | Description
Data Source Access | Secure read access to AWS data sources such as S3.
Secret Manager Access | Secure retrieval of secret values from AWS Secrets Manager.

For details on configuring EKS Pod Identity for your EKS cluster, refer to the AWS documentation.

Kubernetes Service Account Details in ADOC Data Plane

ADOC Data Plane Service | Kubernetes Service Account | AWS Services Access and Mode (Read / Write)
Analysis Service | analysis-service | AWS Secrets Manager (secret value read)
Analysis Standalone Service | analysis-standalone-service | AWS S3 (data source + global storage read), AWS Athena (data source read)
Spark Driver / Executor | spark-scheduler | AWS S3 (data source read + global storage write), AWS Athena (data source read)
Monitors Service | torch-monitors | AWS SQS (queue read, used for incremental file processing for S3)
Crawlers | analysis-service | AWS Athena (data source read)

IAM Roles and Policies for EKS Pod Identity-Based Authentication

To ensure secure access to AWS services, the following IAM policies should be attached to the relevant IAM roles associated with each Kubernetes service account.

1. Data Source Access

All data access within the customer's environment is read-only. The following IAM policy grants read-only access to an S3 bucket:

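A minimal sketch of such a policy, assuming the same bucket-list and object-read access as in the IRSA section (replace <data source bucket name> with the actual bucket name used in your environment):

JSON
{
  "Version": "2012-10-17",
  "Statement": [
    {
      "Effect": "Allow",
      "Action": [
        "s3:GetObject",
        "s3:ListBucket",
        "s3:GetBucketLocation"
      ],
      "Resource": [
        "arn:aws:s3:::<data source bucket name>",
        "arn:aws:s3:::<data source bucket name>/*"
      ]
    }
  ]
}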

2. Secret Manager Access

To allow a Kubernetes service account to read secret values from AWS Secrets Manager, use the following IAM policy:

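A minimal sketch of such a policy, assuming only secretsmanager:GetSecretValue is required:

JSON
{
  "Version": "2012-10-17",
  "Statement": [
    {
      "Effect": "Allow",
      "Action": "secretsmanager:GetSecretValue",
      "Resource": "<AWS ARN of the Secret created>"
    }
  ]
}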

Replace <AWS ARN of the Secret created> with the actual ARN of the secret in AWS Secrets Manager.

3. SQS Queue Permissions

For incremental file processing with S3, configure SQS queue permissions to allow reading of file events using the following IAM policy:

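A minimal sketch of such a policy, assuming the typical actions needed to consume S3 event notifications from an SQS queue (the exact action list ADOC requires may differ):

JSON
{
  "Version": "2012-10-17",
  "Statement": [
    {
      "Effect": "Allow",
      "Action": [
        "sqs:ReceiveMessage",
        "sqs:DeleteMessage",
        "sqs:GetQueueAttributes",
        "sqs:GetQueueUrl"
      ],
      "Resource": "arn:aws:sqs:<region>:<AWS Account ID>:<AWS SQS Queue Name>"
    }
  ]
}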

Replace <region>, <AWS Account ID>, and <AWS SQS Queue Name> with your specific values.

Secret Manager Configuration for EKS Pod Identity

For configuring AWS Secrets Manager in Data Plane V2, use the following JSON configuration:

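The exact schema of this configuration is deployment specific; a minimal illustrative sketch, assuming the same shape as the IRSA configuration above with the authentication type switched to an EKS Pod Identity based setting (the field names and placeholder values are assumptions):

JSON
{
  "authType": "<auth type value for EKS Pod Identity>",
  "region": "<AWS Region>"
}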