AWS IAM Roles for Databricks Pushdown Integration

To enable Good/Bad record publishing for Databricks data sources using the Pushdown engine, your Databricks workspace must have access to the external storage location (such as an S3 bucket) where the records will be written.

This guide explains how to create and configure the necessary IAM roles and S3 permissions for AWS-based Databricks workspaces.

Prerequisites

Before starting, ensure that:

  • You have an AWS account with permissions to create S3 buckets and IAM roles.
  • You know the AWS account ID associated with your Databricks workspace.
  • The Databricks workspace has Unity Catalog enabled (for managing external locations and credentials).

Step 1. Create an S3 Bucket for Good/Bad Records

  1. Log in to the AWS Management Console.
  2. Go to S3 > Create bucket.
  3. Enter a name for your bucket such as databricks-good-bad-data.
  4. Ensure the region matches the one where your Databricks cluster runs.
  5. Click Create bucket.

This bucket will store the Good and Bad data records generated during Data Quality policy execution.

Step 2. Create an IAM Role for S3 Bucket Access

  1. In the AWS Console, go to IAM > Roles > Create role.
  2. Choose Custom trust policy and paste the following JSON in the Custom Trust Policy editor:
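The policy below is a minimal sketch, assuming the standard Unity Catalog cross-account pattern from the Databricks documentation; the principal is the Databricks-operated Unity Catalog role for AWS, and 0000 is a temporary external ID that you replace in Step 6.

JSON

{
  "Version": "2012-10-17",
  "Statement": [
    {
      "Effect": "Allow",
      "Principal": {
        "AWS": "arn:aws:iam::414351767826:role/unity-catalog-prod-UCMasterRole-14S5ZJVKOTYTL"
      },
      "Action": "sts:AssumeRole",
      "Condition": {
        "StringEquals": {
          "sts:ExternalId": "0000"
        }
      }
    }
  ]
}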
  3. Click Next and skip the Permissions policy section for now.
  4. Enter a meaningful name, for example databricks-unity-catalog-good-bad-role, and then create the role.

Step 3. Grant S3 Permissions to the IAM Role

  1. Open the newly created IAM role.
  2. Go to the Permissions tab and add a new inline policy using the JSON editor.
  3. Paste the following JSON policy, replacing placeholders as needed:
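The bucket name, account ID, and role name below are placeholders; this is a minimal sketch that grants read/write access to the bucket from Step 1 and allows the role to assume itself, which the trust policy in Step 6 relies on.

JSON

{
  "Version": "2012-10-17",
  "Statement": [
    {
      "Sid": "GoodBadRecordBucketAccess",
      "Effect": "Allow",
      "Action": [
        "s3:GetObject",
        "s3:PutObject",
        "s3:DeleteObject",
        "s3:ListBucket",
        "s3:GetBucketLocation"
      ],
      "Resource": [
        "arn:aws:s3:::<BUCKET-NAME>",
        "arn:aws:s3:::<BUCKET-NAME>/*"
      ]
    },
    {
      "Sid": "AllowSelfAssumeRole",
      "Effect": "Allow",
      "Action": "sts:AssumeRole",
      "Resource": "arn:aws:iam::<AWS-ACCOUNT-ID>:role/<UNITY-CATALOG-IAM-ROLE-NAME>"
    }
  ]
}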
  4. Save the policy and verify that it appears under the role's permissions.

Step 4. Create Storage Credentials in Databricks

In the Databricks SQL editor or Notebook, create storage credentials linked to the IAM role.

Option A — Using SQL Editor

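The credential name is a placeholder, and the exact CREATE STORAGE CREDENTIAL syntax can vary between Databricks SQL versions; treat the statement below as a sketch and adjust it to your workspace.

SQL

CREATE STORAGE CREDENTIAL good_bad_storage_credential
WITH (AWS_IAM_ROLE = 'arn:aws:iam::<AWS-ACCOUNT-ID>:role/<UNITY-CATALOG-IAM-ROLE-NAME>')
COMMENT 'Credential for Good/Bad record export';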

If the statement fails with a syntax error (not all workspace versions accept it), use the Databricks Notebook approach instead.

Option B — Using Databricks Notebook (Python)

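A minimal sketch using the Databricks SDK for Python; the credential name and role ARN are placeholders, and class names can differ slightly across SDK versions (install the package with %pip install databricks-sdk if it is not already available in the notebook).

Python

from databricks.sdk import WorkspaceClient
from databricks.sdk.service.catalog import AwsIamRoleRequest

# Authenticates automatically from the notebook context
w = WorkspaceClient()

# Create a storage credential that points at the IAM role from Step 2
credential = w.storage_credentials.create(
    name="good_bad_storage_credential",
    aws_iam_role=AwsIamRoleRequest(
        role_arn="arn:aws:iam::<AWS-ACCOUNT-ID>:role/<UNITY-CATALOG-IAM-ROLE-NAME>"
    ),
    comment="Credential for Good/Bad record export",
)

# The external ID is required for the trust policy update in Step 6
print(credential.aws_iam_role.external_id)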

After successful creation, copy the External ID displayed in the “Storage Credential Created” dialog.

You will use this External ID in the next step to update the IAM trust policy.

Step 5. Create an External Location in Databricks

Once the storage credential is created, register your S3 bucket as an External Location.

Run the following SQL in Databricks:

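The location, bucket, and credential names below are placeholders; the sketch assumes the bucket from Step 1 and the storage credential created in Step 4.

SQL

CREATE EXTERNAL LOCATION IF NOT EXISTS good_bad_records_location
URL 's3://databricks-good-bad-data/'
WITH (STORAGE CREDENTIAL good_bad_storage_credential)
COMMENT 'External location for Good/Bad record export';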

If your data source uses a Personal Access Token (PAT), grant the necessary privileges:

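A sketch of the grants, assuming the external location name used above; replace the principal with the user or service principal whose PAT ADOC uses to connect.

SQL

GRANT READ FILES, WRITE FILES, CREATE EXTERNAL TABLE
ON EXTERNAL LOCATION good_bad_records_location
TO `user@example.com`;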

Step 6. Update the IAM Trust Relationship Policy

Back in AWS, edit the trust relationship for your IAM role to include the Databricks storage credential’s external ID and allow self-assumption.

  1. Go to IAM > Roles > <UNITY-CATALOG-IAM-ROLE-NAME> > Trust relationships > Edit trust policy.
  2. Replace the trust policy with the following JSON:
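A sketch of the updated policy, following the self-assuming role pattern from the Databricks Unity Catalog documentation; the first principal ARN is the Databricks-operated Unity Catalog role for AWS, and the remaining values are placeholders for your account ID, role name, and the external ID copied in Step 4.

JSON

{
  "Version": "2012-10-17",
  "Statement": [
    {
      "Effect": "Allow",
      "Principal": {
        "AWS": [
          "arn:aws:iam::414351767826:role/unity-catalog-prod-UCMasterRole-14S5ZJVKOTYTL",
          "arn:aws:iam::<AWS-ACCOUNT-ID>:role/<UNITY-CATALOG-IAM-ROLE-NAME>"
        ]
      },
      "Action": "sts:AssumeRole",
      "Condition": {
        "StringEquals": {
          "sts:ExternalId": "<STORAGE-CREDENTIAL-EXTERNAL-ID>"
        }
      }
    }
  ]
}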

Alternative (optional): If you prefer a simpler version, the following trust relationship also works:

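One possible simpler form, shown as a sketch; it trusts the Databricks account as a whole and relies on the external ID condition alone, so confirm that it meets your security requirements before using it.

JSON

{
  "Version": "2012-10-17",
  "Statement": [
    {
      "Effect": "Allow",
      "Principal": {
        "AWS": "arn:aws:iam::414351767826:root"
      },
      "Action": "sts:AssumeRole",
      "Condition": {
        "StringEquals": {
          "sts:ExternalId": "<STORAGE-CREDENTIAL-EXTERNAL-ID>"
        }
      }
    }
  ]
}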

Save the trust policy.

Step 7. Validate the Configuration

Use Databricks SQL to test the setup by running simple write and read operations on your external storage (S3).

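One way to validate, shown as a sketch; the table name and S3 path are placeholders for the bucket and external location configured above.

SQL

-- Write a small Parquet table to the external S3 location
CREATE TABLE validation_test
USING PARQUET
LOCATION 's3://databricks-good-bad-data/validation-test/'
AS SELECT 1 AS id, current_timestamp() AS checked_at;

-- Read the data back from S3
SELECT * FROM validation_test;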

If both queries execute successfully, your IAM role and external storage configuration are complete.

Notes and Limitations

  • Only Parquet format is supported for Good/Bad record export.
  • Ensure the IAM role and bucket are created in the same AWS region as your Databricks workspace.
  • ADOC does not manage lifecycle or retention of exported data.
  • External storage must remain accessible to Databricks clusters.

Next Step

Return to Databricks Integration for Data Reliability and complete the configuration to publish Good/Bad Records Using Pushdown Engine.
