Databricks

Databricks is a cloud-based big data processing platform. You can use Databricks to easily configure and deploy data processing clusters with a few clicks.

This document explains how to add Databricks as a data source in ADOC for Azure and AWS. Once you add Databricks as a data source, you can monitor your cluster health, notebook status, and costs incurred, view all this data in visual format, and create alerts on various Databricks entities.

Take a look at this video which explains the process of adding Databricks as a data source.

Steps to Add Databricks as a Data Source

Before you begin, ensure you have the following:

Common Prerequisites

  • Access to the ADOC platform.
  • A Databricks workspace.
  • Databricks Workspace ID.
  • Databricks Warehouse ID (for metadata fetching).
  • Personal Access Token for Databricks.

Azure Requirements

  • Azure account with required permissions.
  • Azure Portal access.
  • Service Principal Credentials:
    • Subscription ID
    • Tenant ID
    • Client ID
    • Client Secret
    • Resource Group
    • Managed Resource Group

AWS Requirements

  • AWS account with required permissions.
  • AWS IAM user credentials:
    • Access Key ID
    • Secret Access Key
    • Required Cost Explorer and Pricing permissions.

Azure

1. Creating a Service Principal

A Service Principal is required to enable API access and authenticate with Databricks.

  1. Go to Azure Portal: Navigate to https://portal.azure.com/ and log in.

  2. Access Microsoft Entra ID (Azure Active Directory): From the left-hand menu, select Microsoft Entra ID.

  3. Register a New Application

    1. In the Manage section, select App registrations.

    2. Click on New registration.

    3. Provide the following details:

      • Name: e.g., ADOC Databricks Integration.
      • Supported account types: Accounts in this organizational directory only.
    4. Click Register.

  4. Create a Client Secret

    1. In the application's Overview page, select Certificates & secrets from the Manage section.
    2. Click on New client secret.
    3. Enter a description and select an expiration period.
    4. Click Add.
    5. Important: Copy the Value of the client secret. This is your Azure Client Secret Value. Save it securely, as it is visible only once.

2. Creating a Custom Role and Assigning It to the Service Principal

Access Control (IAM) in Azure

  1. Navigate to Your Subscription: Select Subscriptions from the left-hand menu and choose your subscription.

  2. Access Control (IAM): In the subscription menu, click Access control (IAM).

  3. Create a Custom Role

    1. Click Add and select Add custom role.
    2. Provide a Custom role name (e.g., Cost Management Reader).
    3. In the Basics tab, enter a description if desired.
  4. Add Permissions

    1. Go to the Permissions tab.

    2. Click Add permissions.

    3. In the Add permissions pane:

      • Provider: Select Microsoft.CostManagement.
      • Permission: Search for and add the following permissions:
        • Microsoft.CostManagement/exports/read
        • Microsoft.CostManagement/query/read
    4. Click Add.

  5. Review and Create

    1. Click Review + create.
    2. Review the details and click Create.
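For reference, the two permissions added above correspond to a custom role definition similar to the following. This is an illustrative JSON sketch only; the role name and assignable scope are placeholders, and the Portal steps above remain the authoritative procedure.

```json
{
  "Name": "Cost Management Reader (ADOC)",
  "IsCustom": true,
  "Description": "Read-only access to Cost Management exports and queries.",
  "Actions": [
    "Microsoft.CostManagement/exports/read",
    "Microsoft.CostManagement/query/read"
  ],
  "NotActions": [],
  "AssignableScopes": [
    "/subscriptions/<your-subscription-id>"
  ]
}
```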

Assign the Custom Role to the Service Principal

  1. Return to Access Control (IAM): Ensure you're in the Access control (IAM) section of your subscription.

  2. Add Role Assignment: Click Add and select Add role assignment.

  3. Assign the Custom Role

    1. In the Role dropdown, select the custom role you created.
    2. Click Next.
  4. Select Members

    1. Under Members, choose User, group, or service principal.
    2. Click Select members.
    3. Search for your Service Principal by name.
    4. Select it and click Select.
  5. Review and Assign

    1. Click Review + assign.
    2. Confirm and click Assign.

3. Adding the Service Principal to the Databricks Workspace

  1. Log In to Databricks Workspace: Navigate to your Databricks workspace URL.

  2. Access Service Principals

    1. Click on the Settings icon (⚙️) in the lower-left corner.
    2. Select Admin Console.
    3. Go to the Service Principals tab.
  3. Add Service Principal

    1. Click Add Service Principal.
    2. Select Microsoft Entra ID managed.
    3. Enter the Application (client) ID of your Service Principal.
    4. Provide a Display name (can be any meaningful name).
    5. Click Add.

4. Provide Workspace Admin Access to the Service Principal

  1. Access the Admin Group

    1. In the Admin Console, select the Groups tab.
    2. Click on the admins group.
  2. Add the Service Principal to the Admin Group

    1. Click Add Members.
    2. Search for your Service Principal.
    3. Select it and click Confirm.

5. Retrieve Azure Tenant ID and Other Credentials

You will need the following information:

| Information Required | How to retrieve it |
| --- | --- |
| Azure Tenant ID | Go to Microsoft Entra ID > Overview and copy the Tenant ID. |
| Azure Subscription ID | Go to Subscriptions > Your Subscription > Overview and copy the Subscription ID. |
| Azure Client ID | This is the Application (client) ID of your Service Principal. |
| Azure Client Secret Value | The secret value you saved earlier. |
| Azure Resource Group | The resource group where your Databricks workspace is created. |
| Azure Managed Resource Group | In your Databricks workspace Overview, find the Managed Resource Group. |

AWS

1. Creating an IAM User with Cost Explorer and Pricing Permissions

To retrieve cost data from AWS, you need an IAM user with the required permissions:

  1. Log in to the AWS Management Console.
  2. Create an IAM user and attach a policy granting the required Cost Explorer and Pricing permissions (a sample policy is shown after this list).
  3. This IAM user provides the AWS Access Key ID and Secret Access Key used for cost retrieval.
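The exact policy is not reproduced here; the following is a minimal sketch of a policy granting read access to Cost Explorer and the Pricing API. The action names are standard AWS IAM actions, but confirm the exact set of permissions your ADOC deployment requires.

```json
{
  "Version": "2012-10-17",
  "Statement": [
    {
      "Sid": "CostExplorerRead",
      "Effect": "Allow",
      "Action": [
        "ce:GetCostAndUsage",
        "ce:GetCostForecast",
        "ce:GetDimensionValues",
        "ce:GetTags"
      ],
      "Resource": "*"
    },
    {
      "Sid": "PricingRead",
      "Effect": "Allow",
      "Action": [
        "pricing:GetProducts",
        "pricing:DescribeServices",
        "pricing:GetAttributeValues"
      ],
      "Resource": "*"
    }
  ]
}
```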

2. Creating a Personal Access Token in Databricks

For AWS Databricks integration, you must use a personal access token or Service Principal to connect to system tables (for billing data and metadata):

  1. Log in to your Databricks workspace.
  2. Create a Personal Access Token if you do not already have one.
  3. If you have a service principal, you can also create a personal access token from that principal.
  4. You can use a Service Principal to retrieve the Databricks cost. Please follow the documentation for creating a Service Principal.
  5. Keep this token secure, as it will be required during ADOC data source registration.

3. Configuring the ADOC Platform

Follow the steps provided in the Configuring the ADOC Platform section.

4. Setting Up Global Init Script

  • In the final step of data source registration, you can enable setting up a global init script.
  • By enabling this option, an init script is deployed from Acceldata to your Databricks workspace environment.
  • The init script contains details of agent binaries deployed on the customer’s Databricks environment for pushing spark-related compute metrics to ADOC.

5. Providing DBU Values for Estimated Cost Calculation

Provide DBU values (such as Jobs Compute, Jobs Photon Compute, DLT, All-Purpose Photon Compute) as per the Databricks contract. These values are used for cost estimation of the current day based on DBUs.
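As an illustration only (contracted rates vary by agreement): if your Jobs Compute rate is $0.15 per DBU and today's jobs have consumed 200 DBUs, the estimated Jobs Compute cost for the day is 200 × 0.15 = $30, before any cloud provider discount is applied.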

6. Configuring AWS Cloud Provider Details and Cost Retrieval

  1. Cloud Provider Cost Discount Percentage: Enter the discount percentage provided by AWS to the customer.

  2. Cloud Provider: Select AWS.

  3. Cloud Region: Provide the AWS region where your Databricks workspace is deployed. To find the region:

    1. Go to your Databricks workspace.
    2. Click on the workspace name at the top-right corner.
    3. The region will be displayed.
  4. Cost Fetch Method: Select API.

  5. AWS Access Key ID and Secret: Enter the credentials of the IAM user created in Step 1. This allows ADOC to retrieve cloud vendor billing information.

  6. Click Submit to complete the setup.

Also see Create IAM user.

Configuring the ADOC Platform

Onboarding a New Datasource or Updating an Existing One

  1. Click Register from the left pane.
  2. Click Add Data Source.
  3. Select the Databricks Data Source. The Databricks Data Source Basic Details page is displayed.
  4. Enter a name for the data source in the Data Source name field.
  5. (Optional) Enter a description for the data source in the Description field.
  6. (Optional) Enable the Compute capability by switching on the Compute toggle switch.
  7. (Optional) Enable the Data Reliability capability by switching on the Data Reliability toggle switch.

You must enable either the Compute or Data Reliability capability. You cannot add Databricks as a Data Source without enabling at least one of these capabilities.

  8. Select a Data Plane from the Select Data Plane drop-down menu.

To create a new Data Plane, click Setup Dataplane. You must either create a Data Plane or use an existing Data Plane to enable the Data Reliability capability.

  9. Click Next. The Databricks Connection Details page is displayed.

  10. Select the cloud provider and region:

    1. Cloud Provider: Choose the cloud provider where your Databricks workspace is hosted (Azure or AWS).
    2. Cloud Region: Specify the region of your Databricks deployment.
  11. Enter the name for your Databricks workspace in the Workspace Name field.

  12. Enter the URL of your Databricks account in the Databricks URL field. To learn more about workspaces, refer to this Databricks document.

  13. Paste the Warehouse ID used for querying data.

  14. Provide the Workspace ID from your Databricks environment.

  15. Enter the access token for your Databricks account in the Token field, or enable the Use Service Principal toggle and provide the details below. To learn more about tokens, refer to this Databricks document.

    1. Select the cloud provider for your Databricks workspace from the drop-down: AWS or Azure.
    2. Provide the Service Principal Client ID to connect to your Databricks workspace.
    3. Provide the Service Principal Client Secret key.
    4. Provide the Service Principal Tenant ID or Account ID for the cloud provider selected above.
  16. Enable auto-renewal if you want the system to renew the token upon expiry, and define how long the token remains valid.

The Databricks connector includes an auto-renew token feature that addresses the lack of visibility into token expiration. You can opt in to this option during data source registration. When the feature is enabled, the system generates an 'accel-token' from the provided token; to prevent unexpected expirations, a new 'accel-token' is automatically generated five days before the current one is set to expire.

If you prefer using a Service Principal instead of a token, toggle Use Service Principal and fill in the required fields below.

  17. Data source registration (in Compute) using a Service Principal is supported for both Azure and AWS cloud providers. From the drop-down menu, select the cloud provider for your Databricks workspace and enter the following details to connect to it:

    1. Service Principal Client ID
    2. Service Principal Client Secret Key
    3. Service Principal Tenant ID (Azure Tenant ID)
  18. Enable Actual Cost: Toggle this ON to fetch cloud cost details. Supported for both Azure and AWS.

Azure

| Parameter | Description |
| --- | --- |
| Azure Tenant Id | Access Tenant Properties in the Azure Portal and copy the corresponding Tenant Id. |
| Azure Subscription Id | Visit the Subscriptions section in the Azure Portal and copy the relevant Subscription Id. |
| Azure Client Id | Generate this yourself: in the Azure Portal, open Azure Active Directory, click Add > App Registration, provide a name for the application, select a tenant, and click Register. Copy the Client Id and paste it into the corresponding field of the ADOC Setup Observability page. |
| Azure Client Secret Value | Within the registered application, click the Add a certificate or secret link under Client Credentials, then click New client secret on the Certificates & secrets page. In the Add a client secret window, provide a description, select an expiry period, and click Add. Copy the secret Value and paste it into the corresponding field of the ADOC Setup Observability page. |
| Azure Resource Group | Locate the desired workspace by name in the Azure Portal and, from its Overview page, copy the Resource Group. |
| Azure Managed Resource Group | Similarly, find the workspace by name in the Azure Portal and, from its Overview page, copy the Managed Resource Group. |

Ensure you grant read permissions to the application on the Access Control (IAM) page if not already provided.

AWS

| Parameter | Description |
| --- | --- |
| Cloud Region | Enter the region where your Databricks workspace is deployed. You can find this information in the Databricks UI. |
| Cost Fetch Method | API |
| AWS Access Key ID | The AWS Access Key ID is required to retrieve cloud vendor billing information for the resources. |
| AWS Access Key Secret | The AWS Access Key Secret is required to retrieve cloud vendor billing information for the resources. |

All dashboards and visualizations pertaining to costs reflect the real-time expenses incurred from your utilization of workspace resources on the selected cloud provider.

  19. Cost Fetch Method: Choose between API or System Table to retrieve cost data.
  20. If you enabled the Data Reliability capability, enter the JDBC URL in the JDBC URL field. This field is displayed only if you enabled Data Reliability in step 7. (An example URL format is shown after these steps.)
  21. Choose the Dataplane Engine, either Spark or Pushdown, for profiling, data quality checks, and SQL operations. Note: Token-based authentication is currently the only supported method for the Pushdown engine on AWS and Azure. For more information, see Pushdown Data Engine.
  22. Click Test Connection to validate your credentials and establish a connection with your Databricks account. Test Connection validates Databricks access, the token or service principal details, and the actual cost setup.

If the connection is successful, a Connected message is displayed. If there is an error, you must verify the Databricks details provided and enter the correct details.
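The JDBC URL referenced in the earlier step can be copied from your Databricks SQL warehouse's Connection Details tab. As a rough illustration only (the exact form depends on your workspace and driver version), it generally looks like the following, where the host and warehouse ID are placeholders:

```
jdbc:databricks://<workspace-host>:443/default;transportMode=http;ssl=1;AuthMech=3;httpPath=/sql/1.0/warehouses/<warehouse-id>
```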

  23. Click Next to proceed to the Observability Set Up page.

All actions are supported with the Spark Data Engine; however, with the Pushdown Data Engine, you can perform only the following actions on your data:

  • Enforce data quality policies
  • Perform data profiling
  • Create SQL views
  • View sample data
  • Conduct Row Count reconciliation

Observability Set Up

The Observability Set Up page allows you to configure Compute and Data Reliability capabilities. These sections are active only if you enabled them on the Databricks Data Source Basic Details page.

This section includes the following panels:

Compute

Enable Global Init Script: Toggle this setting to enable a global initialization script across all Databricks clusters. This ensures that a consistent script is executed automatically at the startup of every cluster, allowing for common setup, configuration, or monitoring tasks.

Setup Global Init Script: Once enabled, use this option to define the actual contents of the initialization script. This script will run on all clusters at startup, applying the configurations or commands specified here.

Provide the following details to calculate compute costs for your Databricks workspace in ADOC. These values are needed to fetch the estimated Databricks cost for the current day.

| Parameter | Description |
| --- | --- |
| Jobs Compute Cost per DBU | The cost incurred for using compute resources in Databricks Jobs, measured per Databricks Unit (DBU). DBUs are a normalized unit of processing capability that combines CPU, memory, and I/O resources. |
| Jobs Photon Compute Cost per DBU | The cost per DBU for job workloads leveraging Databricks' Photon execution engine, which is optimized for performance. |
| Delta Live Tables Cost per DBU | The cost associated with utilizing compute resources for managing and querying Delta Live Tables in Databricks, calculated per DBU. |
| All-Purpose Photon Compute Cost per DBU | The cost for running Photon-enabled all-purpose clusters in Databricks, measured per DBU. Photon provides enhanced performance for general-purpose tasks. |
| All Purpose Cluster Cost per DBU | The cost attributed to operating an all-purpose cluster in Databricks, computed per DBU. All-purpose clusters provide general-purpose compute capacity for a wide range of tasks. |
| Cloud Provider Cost Discount Percentage | The percentage reduction in cloud provider costs when using Databricks. It reflects the amount by which cloud infrastructure expenses are discounted through efficient resource allocation and optimization. |

Enable Private S3 Bucket: This toggle enables the use of a private S3 bucket for storing data or logs instead of relying on Databricks’ default S3 storage.

Data Reliability

Enter the following details to set up Data Reliability:

  • Unity Catalog Enabled: Turn on this toggle switch to enable Unity Catalog in Databricks.
  • JDBC Records Fetch Size: Enter the number of records to be fetched from JDBC.
  • Enable Crawler Execution Schedule: Turn on this toggle switch to select a time tag and time zone to schedule the execution of crawlers for Data Reliability.

Click the Submit button.

Databricks is now added as a Data Source. You can choose to crawl your Databricks account now or later.

If you have enabled only the Compute capability, the Start Crawler & Go to Data Sources button is not displayed; only the Go to Data Sources button is available. The crawl option is not applicable to Compute-only data sources.

Authentication Mechanism: Service Principals

ADOC introduces an improved authentication process for Databricks by transitioning from personal access tokens to a more secure and managed Service Principal approach.

Improved Security: Service principals enable improved security standards, allowing for greater control over permissions and quicker revocation.

Simplified Management: Managing service principal credentials is easier, lowering the overhead associated with token rollovers.

Configuring Service Principals

  1. Establish a Service Principal: Follow the steps outlined in the Creating a Service Principal section above.
  2. Assign Roles and Permissions: Configure the service principal's roles and permissions at the workspace level to ensure access to the relevant resources. This involves creating a custom role and assigning it to the service principal as described earlier.
  3. Add the Service Principal to Databricks Workspace: Add the service principal to your Databricks workspace by following the steps in the Adding the Service Principal to the Databricks Workspace section.
  4. Provide Workspace Admin Access: Provide admin access to the service principal in your Databricks workspace as described in the Provide Workspace Admin Access to the Service Principal section.
  5. Configure in ADOC: When adding or editing a Databricks data source in ADOC, select the Use Service Principal option and provide the service principal details.

All data activities, including data crawling, data profiling, and sampling, now use the service principal for authentication, ensuring secure and consistent access.

Cloud Vendor Cost Retrieval via API Method

This section explains how to integrate Databricks with the ADOC platform and collect cloud vendor costs via the API method.

Prerequisites

Before you begin, make sure you have:

  • An account with the required permissions.
  • Access to the ADOC platform.
  • A Databricks workspace created within a resource group.

Azure

1. Creating a Service Principal: Refer to the Creating a Service Principal section above.

2. Creating a Custom Role and Assigning It to the Service Principal: Refer to the Creating a Custom Role and Assigning It to the Service Principal section above.

  • Access Control (IAM) in Azure: Ensure that the service principal has the necessary permissions to access cost management data by assigning the custom role in your subscription's Access Control (IAM) settings.
  • Assign the Custom Role to the Service Principal: Refer to the Assign the Custom Role to the Service Principal section above.

3. Configuring the ADOC Platform: After the Service Principal is set up, configure the ADOC platform to use the API method for cost retrieval.

Onboarding a New Datasource or Updating an Existing One:

  1. Open the ADOC platform and select Register from the left pane.

  2. Add or Edit a Datasource

    • For a new Datasource: Click Add Data Source.
    • To update an existing Datasource: Click the three dots in the top right corner of the Datasource and select Edit Configuration.
  3. Configure the Cost Retrieval Method: Provide the following information:

| Field | Description |
| --- | --- |
| Cost Fetch Method | Select the API method. |
| Azure Tenant ID | Enter the Tenant ID for the Service Principal. |
| Azure Subscription ID | Enter the Azure Subscription ID. |
| Azure Client ID | Enter the Client ID for the Service Principal. |
| Azure Client Secret Value | Enter the Secret value for the Service Principal. |
| Azure Resource Group | Specify the Resource Group where the Databricks workspace is created. |
| Azure Managed Resource Group | Specify the Managed Resource Group for the Databricks workspace. |

  4. Complete the Setup: Click Submit to save the configuration.

After completing these steps, your ADOC platform will be configured to retrieve cloud vendor costs via the Azure API method. This integration allows for more accurate cost tracking and management within the ADOC environment.

AWS

1. Creating an IAM User: Refer to the Creating an IAM User with Cost Explorer and Pricing Permissions section above.

2. Creating a Personal Access Token in Databricks: Refer to the Creating a Personal Access Token in Databricks section above.

3. Enabling Databricks Tags in AWS

  • Ensure that Databricks tags are enabled in AWS to allow proper cost attribution and tracking.
  • Verify that cost allocation tags are activated in AWS Billing settings.

4. Configuring the Acceldata Platform: After the IAM user is set up, configure the Acceldata platform to use the API method for cost retrieval.

Onboarding a New Datasource or Updating an Existing One:

  1. Open the Acceldata platform and select Register from the left pane.

  2. Add or Edit a Datasource

    • For a new Datasource: Click Add Data Source.
    • To update an existing datasource: Click the three dots in the top right corner of the datasource and select Edit Configuration.
  3. Configure the Cost Retrieval Method. Provide the following details:

| Field | Description |
| --- | --- |
| Cost Fetch Method | Select the API method. |
| AWS Access Key ID | Enter the IAM user's Access Key ID. |
| AWS Secret Access Key | Enter the IAM user's Secret Access Key. |
| AWS Region | Enter the AWS region where Databricks is deployed. |

  4. Complete the Setup: Click Submit to save the configuration.

After completing these steps, your Acceldata platform will be configured to retrieve cloud vendor costs via the AWS API method. This integration enables accurate cost tracking and management within the Acceldata environment.

Cost Retrieval via System Table Method

Azure

1. Creating a Service Principal: Refer to the Creating a Service Principal section above.

2. Creating a Custom Role and Assigning It to the Service Principal: Refer to the Creating a Custom Role and Assigning It to the Service Principal section above.

  • Access Control (IAM) in Azure: Ensure that the service principal has the necessary permissions to access cost management data by assigning the custom role in your subscription's Access Control (IAM) settings.
  • Assign the Custom Role to the Service Principal: Refer to the Assign the Custom Role to the Service Principal section above.

3. Adding the Service Principal to the Databricks Workspace: Refer to the Adding the Service Principal to the Databricks Workspace section above. Ensure that the service principal has workspace admin access to fetch metadata and job data from Databricks APIs.

4. Configuring the ADOC Platform (System Table Method): After the service principal is set up, configure the ADOC platform to use the System Table method for Databricks cost retrieval. In this setup, Databricks cost is fetched using system tables (list_prices, usage), while cloud vendor cost is still retrieved using Azure APIs.
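Conceptually, the System Table method derives Databricks cost by joining usage records with list prices. The query below is only an illustrative sketch of that idea, not the exact query ADOC runs; column names follow the documented schemas of system.billing.usage and system.billing.list_prices.

```sql
-- Illustrative only: approximate daily Databricks cost per SKU from system tables.
SELECT
  u.usage_date,
  u.sku_name,
  SUM(u.usage_quantity * p.pricing.default) AS estimated_cost
FROM system.billing.usage AS u
JOIN system.billing.list_prices AS p
  ON u.sku_name = p.sku_name
 AND u.usage_start_time >= p.price_start_time
 AND (p.price_end_time IS NULL OR u.usage_start_time < p.price_end_time)
GROUP BY u.usage_date, u.sku_name
ORDER BY u.usage_date;
```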

Onboarding a New Datasource or Updating an Existing One

  • Open the ADOC platform and select Register from the left pane.
  • Add or Edit a Datasource
    • For a new Datasource: Click Add Data Source.
    • To update an existing Datasource: Click the three dots in the top-right corner of the Datasource and select Edit Configuration.

Configure the Cost Retrieval Method

| Field | Description |
| --- | --- |
| Cost Fetch Method | Select System Table. |
| Azure Tenant ID | Enter the Tenant ID for the Service Principal. |
| Azure Subscription ID | Enter the Azure Subscription ID. |
| Azure Client ID | Enter the Client ID for the Service Principal. |
| Azure Client Secret Value | Enter the Secret value for the Service Principal. |
| Azure Resource Group | Specify the Resource Group where the Databricks workspace is created. |
| Azure Managed Resource Group | Specify the Managed Resource Group for the Databricks workspace. |
| Databricks Warehouse ID | Required to fetch metadata from system tables. |
| Databricks Workspace ID | Required to fetch metadata and job data via the Databricks API. |

Provide Access to Databricks System Tables: To allow the service principal to read cost data from Databricks system tables, follow these steps:

  1. Log in to your Databricks workspace.

  2. Navigate to SQL Workspace > Data Explorer > System Catalog.

  3. Identify the required tables:

    • system.billing.list_prices
    • system.billing.usage
  4. Execute SQL commands that grant the service principal read access to these tables (an illustrative example follows below).

  5. Make sure the service principal has sufficient permissions to query metadata if any access issues occur.

If your workspace enforces access control policies, you may need to add the service principal to specific Databricks groups as well.
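The exact statements are not reproduced above; the following is a minimal sketch of the grants typically required, assuming Unity Catalog GRANT syntax and using a placeholder for the service principal's application ID:

```sql
-- Illustrative only: allow the service principal to read the billing system tables.
GRANT USE CATALOG ON CATALOG system TO `<service-principal-application-id>`;
GRANT USE SCHEMA ON SCHEMA system.billing TO `<service-principal-application-id>`;
GRANT SELECT ON TABLE system.billing.list_prices TO `<service-principal-application-id>`;
GRANT SELECT ON TABLE system.billing.usage TO `<service-principal-application-id>`;
```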

Known Limitations

| Known Limitations | Details | Recommendations/Impact |
| --- | --- | --- |
| System Time Adjustment | To have an exact match in cost data against the Azure Portal, users need to change their system time to UTC. | Ensure that your system's time zone is set to UTC before comparing cost data with the Azure Portal. |
| Job Studio Page Mismatch | There may be a slight mismatch in the filter facet count between the Job Studio page and the Databricks Job runs page, due to different update frequencies. | Users may notice discrepancies in job counts when filtering data. |
| Cloud Vendor Cost Calculation Delay | Cloud vendor cost calculations in the Azure Portal can take up to 24-48 hours, causing a slight mismatch (below 0.5%) in the reported costs. | It might take up to 48 hours to get the exact cloud vendor cost as shown in the Azure Portal. |
| Initial API Data Retrieval | After enabling the API approach for the first time, it can take up to 24 hours to retrieve cost data for the last 30 days (Databricks and cloud vendor costs). | Users may experience a delay in accessing historical cost data immediately after setup. |
| All Purpose Cluster Cost Display | Costs on the All Purpose Cluster page are displayed on a daily basis. Selecting a date range of 24 hours or less will not show cost data. | Select a date range greater than 24 hours to view cost data on the All Purpose Cluster page. |

By following these steps, you can successfully configure Databricks as a data source in ADOC for both Azure and AWS environments, ensuring comprehensive observability of your compute and data reliability operations.
