Databricks Compute

The following tabs are present in Databricks Compute:

Filters

The Data Source Filter allows you to switch the Databricks data source. This enables you to view and analyze data across various sections based on the selected Databricks account or project, providing flexibility for monitoring and managing information across different data sources.

Overview

The Overview page in Databricks Compute provides a comprehensive and detailed view of your Databricks environment. It displays key information as widgets and graphs, offering insights into cluster performance, resource utilization, and potential issues across your Databricks clusters.

Overview Tab

This section helps you monitor critical metrics such as cluster states, resource consumption, and errors. By leveraging the interactive visualizations and adjustable filters, you can drill down into specific aspects of your environment, allowing you to make data-driven decisions to optimize performance and manage costs effectively.

The Overview page displays the following widgets:
Cluster States

Displays the number of clusters in different states (Pending, Running, Resizing, etc.). This widget provides an at-a-glance view of your cluster operations, helping you monitor the current status and identify any clusters that may require attention.

Pending: The number of clusters in pending state during the time period selected in the Global Calendar.

Running: The number of clusters in running state during the time period selected in the Global Calendar.

Restarting: The number of clusters which are restarting during the time period selected in the Global Calendar.

Resizing: The number of clusters which are being resized during the time period selected in the Global Calendar.

Terminating: The number of clusters which are getting terminated during the time period selected in the Global Calendar.

Terminated: The number of clusters which were terminated during the time period selected in the Global Calendar.

Databricks Users and Applications

Summarizes the users and applications active on your clusters during the selected time period. This helps you understand who is consuming cluster resources and which applications are driving usage.

This section displays two widgets described below.

Users: The number of users using clusters during the time period selected in the Global Calendar.

Applications: The number of applications used during the time period selected in the Global Calendar.

Average Core Usage Summary

This section displays three widgets which highlight usage of CPU cores. The values of the widgets depend on the time period selected in the Global Calendar.

Total Cores: The total number of available cores.

Allocated Cores: The total number of cores allocated out of the total available cores.

Used Cores: The total number of cores used out of the total number of allocated cores.
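As a rough illustration of how these three figures relate, the sketch below derives allocation and utilization percentages. This is a hypothetical helper, not ADOC's API; ADOC computes these values server-side.

```python
def core_usage_summary(total_cores, allocated_cores, used_cores):
    """Relate the three Average Core Usage Summary figures.

    Hypothetical sketch: allocation is measured against total cores,
    utilization against allocated cores, matching the widget definitions.
    """
    return {
        "allocation_pct": round(100 * allocated_cores / total_cores, 1),
        "utilization_pct": round(100 * used_cores / allocated_cores, 1),
    }

# e.g. 64 total cores, 48 allocated, 30 actually used
print(core_usage_summary(64, 48, 30))
# {'allocation_pct': 75.0, 'utilization_pct': 62.5}
```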

Average Memory Utilization Summary

This section displays three widgets which highlight usage of CPU memory. The values of the widgets depend on the time period selected in the Global Calendar.

Total Memory: The total amount of memory available.

Allocated Memory: The amount of memory allocated out of the total available memory.

Used Memory: The total amount of memory used out of the total amount of allocated memory.

Databricks Top 10 Users

This bar chart displays the top 10 users provisioning clusters on Databricks. Each bar represents a user; when you hover over a bar, you can view the number of clusters provisioned by that user. The x-axis represents users' email IDs and the y-axis represents the number of clusters provisioned.

Cluster Count by Instance Type

This chart shows the distribution of clusters across instance types. Each bar represents an instance type, and its height indicates the number of clusters using that type. This helps visualize the usage patterns of the various instance types within the system.

Active Clusters Over Time

This bar graph represents the number of active clusters during the time period selected in the Global Calendar. The x-axis represents date and time (values change as per the selection in the Global Calendar) and the y-axis represents the number of active clusters. When you hover over a bar, you can view the number of active clusters at that date and time.
Cluster Failure Over Time

Visualizes the number of cluster failures recorded over time. This graph is essential for detecting patterns or spikes in cluster failures, which may indicate underlying issues in the environment or specific workloads that require optimization.

The x-axis represents date and time (values change as per the selection in the Global Calendar) and the y-axis represents the number of cluster failures. Each bar represents a date and time, and each failure is associated with an error code. When you hover over a bar, you can view the number of failed clusters at that date and time, along with the error code for each failure. You can also filter this graph to view data for specific error codes.

Top Cluster Errors

A cluster error is considered a top error if its occurrence frequency is highest compared to other errors. This table lists those errors in two columns: the first displays the number of times an error occurred, and the second displays the associated error message.
DBU Consumed

Displays the Databricks Units (DBUs) consumed over time. Tracking DBU consumption is vital for understanding your Databricks usage and associated costs. This widget helps you monitor usage trends and identify opportunities to optimize resource allocation.

The x axis represents a date and time (values change as per the date and time selected in the Global Calendar). The y axis represents the number of DBUs consumed. Each data point represents a date and time. When you hover over a data point, you can view the number of DBUs consumed on that date and time.
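As an illustration of how such a chart buckets its data points, the sketch below aggregates hypothetical (timestamp, DBU) samples into daily totals. The input format is an assumption for illustration, not ADOC's API.

```python
from collections import defaultdict
from datetime import datetime

def dbus_per_day(samples):
    """Aggregate (ISO timestamp, DBUs) samples into daily totals,
    the way a DBU Consumed chart buckets its data points."""
    totals = defaultdict(float)
    for ts, dbus in samples:
        day = datetime.fromisoformat(ts).date().isoformat()
        totals[day] += dbus
    return dict(totals)

samples = [
    ("2024-05-01T09:00:00", 1.5),
    ("2024-05-01T17:30:00", 2.0),
    ("2024-05-02T10:00:00", 0.75),
]
print(dbus_per_day(samples))
# {'2024-05-01': 3.5, '2024-05-02': 0.75}
```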

Average CPU UsageThis trend graph represents the amount of CPU used by node of all cluster types for the selected time period. The x-axis displays the date and time (values change as per the date and time selected in the Global Calendar). The y-axis displays the amount of CPU used by the executor node or driver node.
Average Memory UsedThis trend graph represents the amount of CPU memory used by node of all cluster types for the selected time period. The x-axis displays the date and time (values change as per the date and time selected in the Global Calendar). The y-axis displays the amount of CPU memory used by the executor node or driver node.
Average Core UsageThis trend graph displays the amount of CPU core used during the time period selected in the Global Calendar. The x axis displays the date and time (values change as per the date and time selected in the Global Calendar). The y axis displays the number of cores used. Each trend line represent a memory type; available cores, allocated cores, and used cores. Each data point represents a date and time. When you hover over a data point, you can view the total number of available cores, allocated cores, and used cores on that date and time.
Average Memory UtilizationThis trend graph displays the amount of CPU memory used during the time period selected in the Global Calendar. The x axis displays the date and time (values change as per the date and time selected in the Global Calendar). The y axis displays the amount of memory used. Each trend line represent a memory type; available memory, allocated memory, and used memory. Each data point represents a date and time. When you hover over a data point, you can view the total amount of memory available, allocated, and used on that date and time.
Core Wastage Over Time

This trend graph represents the amount of wasted or unused CPU cores over a specific time period in a cluster or system. The x-axis represents the time period, and the y-axis represents the core wastage, usually as the absolute number of unused cores.

This graph is particularly useful for identifying periods of low workload or idle times when CPU cores are not fully utilized. It helps in assessing the efficiency of resource allocation and workload scheduling, allowing you to optimize resource utilization and minimize wastage.
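Conceptually, core wastage for an interval is the allocated cores minus the cores actually used. A minimal sketch, assuming per-interval samples:

```python
def core_wastage(allocated, used):
    """Per-interval wasted cores: allocated minus used (assumed definition)."""
    return [a - u for a, u in zip(allocated, used)]

# Four sample intervals: allocation vs. actual usage
allocated = [16, 16, 32, 32]
used      = [12,  4, 30,  8]
print(core_wastage(allocated, used))  # [4, 12, 2, 24]
```

Large values in the result mark idle periods worth investigating for down-sizing.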

Common ways to use these widgets:

  • Identifying Bottlenecks and Optimizing Performance: Use the Cluster States and Cluster Failures Over Time widgets to quickly identify any clusters that are not performing as expected. This information can guide you in troubleshooting and optimizing those clusters for better performance.
  • Cost Management: Leverage the DBU Consumed widget to monitor your usage costs closely. By analyzing trends in DBU consumption, you can make informed decisions on scaling resources up or down to manage costs effectively.
  • Error Resolution: The Top Cluster Errors widget allows you to quickly pinpoint the most frequent issues affecting your clusters. Resolving these errors promptly can prevent potential downtime and maintain the stability of your Databricks environment.

The Enhanced Filter on Search function offers significant filtering options along with several additional features that optimize search on the Compute page. Users can combine multiple filter conditions to efficiently refine search results, and the UI has been decluttered and optimized for usability.

  • Expanded Filterable Columns: The search dropdown now includes a broader variety of columns, allowing users to filter results based on more precise criteria such as cluster status, source, duration, and user.
  • Decluttering Mechanism: Columns that are already visible can be hidden from the filter dropdown, keeping the list tidy and manageable.
  • Contextual Filters: The system offers column-specific alternatives that adapt to the user's current view, resulting in more intuitive filter selections and a smoother navigation experience.
  • Preservation of Filters Across Navigations: Filters are now preserved across navigations, reducing the need to continually apply the same filters.
  • Primary Focus on Equality Operator: The = operator is now the primary focus of the filtering interface, which helps to expedite interactions and simplify data retrieval.

Clusters

Clusters Tab

| Widget | Description |
| --- | --- |
| Cluster Name | The name of the cluster. This column is frozen; you can view it even when you scroll right. Clicking the cluster name redirects you to the Job Studio page. |
| Cluster ID | A system-generated identifier unique to each cluster instance, used for backend tracking and reference. Clicking the Cluster ID redirects you to the past runs associated with that cluster. |
| Status | The current state of the cluster, such as Running, Terminated, Pending, or Resizing. Indicates the cluster's operational status. |
| Duration | The total amount of time the cluster has been active, measured from the start time to the end time, or to the current time if still active. |
| Total DBU Consumed | The total Databricks Unit (DBU) consumption, representing the compute resources consumed by the cluster. |
| Actual Databricks Cost | Total cost incurred from using Databricks services for the workload. |
| Actual Cloud Total Cost | Combined cost of all cloud resources consumed during the workload. |
| Actual Cloud VM Cost | Cost specifically attributed to virtual machine usage in the cloud environment. |
| Recommended Cloud VM Cost | Estimated cost if a more optimal virtual machine configuration were used. |
| Recommended Instance Type | Suggested VM instance type that could improve cost-efficiency or performance. |
| Start Time | The exact time when the cluster was initiated, helping track when the associated job or task began. |
| End Time | The time when the cluster terminated, either because the job completed or due to manual termination. If the cluster is still running, this field is empty. |
| Cluster Source | The source that initiated the cluster, such as Job, API, UI, or Pipeline. This helps track how the cluster was created. |
| User | The email ID of the user who initiated or is running the cluster, identifying who is responsible for the cluster's activities. |
| Termination Type | The method or reason for the cluster's termination, such as Success, Client Error, or User Request. |
| Termination Code | Further details about why the cluster was terminated, such as Job Finished, User Request, or specific error codes. |
| Diagnostic Reason | Detailed diagnostic information about the termination or errors encountered during the cluster's lifecycle. |
| Spark Version | The specific version of Apache Spark running on the cluster, ensuring compatibility with different jobs and tasks. |
| Worker Node Type | The type of worker nodes used in the cluster, which determines the resources allocated for executing tasks. |
| Driver Node Type | The type of driver node used in the cluster, which manages job execution and coordinates the tasks running on worker nodes. |

Cluster Details

To proceed to the details page of a particular cluster, click on the cluster name.

On the cluster Details page, you can view the following information: Past Runs chart and Past Job Runs Details table.

The Past Runs chart presents a bar graph that visualizes the count of DBUs and their associated costs on the y-axis, while the x-axis denotes the corresponding date and time when the job consumed a specific number of DBUs.

| Column Name | Description |
| --- | --- |
| Creation Time | Date and time at which the cluster was created. |
| State | The current state of the cluster or job. |
| DBU Consumed | Amount of Databricks Units consumed. |
| Start Time | Time at which the job execution began. |
| Termination Time | Time at which the job execution was completed. |
| Executor Config | The settings and specifications that determine how the cluster's executors are configured. |
| Number of Workers | The total count of worker nodes allocated across clusters for processing tasks. |
| Min Workers | The minimum number of worker nodes used for the job to run. |
| Max Workers | The maximum number of worker nodes used for the job to run. |
| Executor Memory | The memory capacity allocated to your Databricks cluster. |
| Duration | The time taken for execution of the job run. |
| Balanced Recommendation | Displays recommendations for balanced performance. |
| Cost Recommendation | Displays recommendations for cost objectives. |
| Runtime Recommendation | Displays runtime recommendations. |
| State Message | Displays the message on cluster state. |
| Username | Displays the name of the user. |

Note: Even in the absence of previous job runs for comparison, you can still access and examine the details of a single job run for any type of cluster.

Job Studio

The Job Studio page provides a comprehensive overview of all Databricks jobs, offering a detailed interface that enables users to track, monitor, and manage various jobs within their system. The intuitive layout presents job-related data in a tabular format, supported by filter options for enhanced navigation and drill-down capabilities.

All cost-related data on ADOC is displayed in US Dollars (USD) as the standard unit of measurement. Note that currency conversion is currently not supported.

For example: If your Azure account shows costs in another currency (e.g., £110), ADOC will display the numerical value in USD (e.g., $110). This applies to all cost charts, including both actual and estimated costs, which are shown exclusively in USD.

At the top of the Job Studio page, a graphical representation displays job counts over time, segmented by the following job statuses:

  • Canceled: Jobs that were intentionally halted before completion.
  • Failed: Jobs that encountered errors and could not complete.
  • Success: Jobs that successfully ran to completion without issues.

The central table displays all jobs that meet the filtering criteria, offering an in-depth view of their details.

| Field | Description |
| --- | --- |
| Job Name | The name of the job, often used to describe its function or purpose. |
| Cluster ID | A unique system-generated identifier for the cluster running the job, helping identify and track the cluster. |
| Job Status | The current status of the job: Success, Failed, or Canceled. |
| Actual Databricks Cost | The actual cost incurred by running the job, calculated based on the resources used. |
| Estimate Databricks Cost | The estimated cost for running the job, specific to the Databricks resources used. |
| Estimate Vendor Cost | Any additional costs related to third-party vendor resources used during the job. |
| Total Job Cost | The total cost associated with the job, combining Databricks and vendor estimates. |
| Start Time | The exact time when the job began execution. |
| End Time | The time the job finished or was terminated. |
| Duration | The total time the job ran, calculated from start to end time. |
| Vendor Storage Cost | The storage cost charged by external vendors for storing data related to the job. |
| Vendor Virtual Machines Cost | The total expense incurred for using virtual machines provided by the vendor. |
| Vendor Virtual Network Cost | Costs incurred for using a virtual network provided by external vendors during the job's execution. |
| Vendor Bandwidth Cost | Bandwidth costs charged by external vendors for data transfers during job execution. |
| Run Page URL | A direct link to the job's run page for more detailed information and access to logs, metrics, and performance data. |
| Cluster State | The state of the cluster used to run the job, such as Running, Terminated, or Pending. |
| Creator User | The user who created the job, shown as their registered username or email. |
| Trigger | How the job was initiated, such as PERIODIC (scheduled job) or ONE-TIME (manual trigger). |
| Runtime Engine | The type of engine running the job, typically Photon or Standard. |
| Job ID | The unique identifier for the job instance. |
| Run ID | A system-generated ID that tracks each specific run of the job. |
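The Duration field can be illustrated with a small sketch: duration runs from Start Time to End Time, or to the current time when End Time is empty because the job is still running. The timestamp format here is an assumption for illustration.

```python
from datetime import datetime, timezone

def job_duration(start_time, end_time=None):
    """Duration as shown in the Duration column: start to end, or
    start to 'now' when End Time is empty (job still running)."""
    start = datetime.fromisoformat(start_time)
    end = datetime.fromisoformat(end_time) if end_time else datetime.now(timezone.utc)
    return end - start

print(job_duration("2024-05-01T09:00:00+00:00", "2024-05-01T09:45:30+00:00"))
# 0:45:30
```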

Other Notable Features:

| Feature | Description |
| --- | --- |
| Top 20 Jobs | The Job Studio page offers pre-configured views such as Top 20 Expensive Jobs and Long Running Jobs, allowing users to quickly identify resource-intensive tasks. |
| Download Functionality | Users can export job data in multiple formats using the Download option, enabling further review, analysis, reporting, or sharing with team members outside of the ADOC platform. |
| Filter Section | Users can apply multiple filters to narrow down the list of jobs based on criteria such as status, creator, or runtime engine. These filters can be combined for more granular searches. |
| Job Review | Once the desired jobs are filtered, users can explore detailed information such as cost breakdowns, job triggers, and run times. For deeper analysis, users can access the Run Page URL. |
| Graph Analysis | The top graph visually represents job performance over time, allowing users to quickly understand trends and investigate potential performance bottlenecks or issues. |

Job Details Page

The Job Details page provides users with an in-depth view of their Databricks job's performance and operational metrics. This page offers key insights into both driver and executor performance, trends over time, resource usage, and potential areas for optimization.

Here is a breakdown of the features and data sections visible on the page:

Summary

This section provides a high-level overview of the costs associated with running the job on the selected Databricks cluster. The data presented includes the following key details:

| Widget | Description |
| --- | --- |
| Actual Databricks Cost | The total cost incurred from using Databricks resources for this specific job. In this example, the cost is $24.10. This cost is calculated based on the consumption of Databricks Units (DBUs) and other Databricks platform resources used during the job's execution. |
| Actual Vendor Cost | Any additional costs from using external or third-party vendor resources in conjunction with the job. In this case, the vendor cost is $23.54. Vendor costs can include items such as cloud storage or virtual networks provided by external vendors. |
| Total Cost | The sum of the Databricks and vendor costs, providing a comprehensive view of the total cost of running the job. In this example, the total cost is $47.64. |
| Cluster ID | The unique identifier for the specific cluster on which the job was executed. The Cluster ID is important for tracking and analyzing job performance across different clusters. |
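The Total Cost figure is simply the sum of the Databricks and vendor costs; using the example values above:

```python
def total_job_cost(databricks_cost, vendor_cost):
    """Total Cost widget: Databricks cost plus vendor cost, in USD."""
    return round(databricks_cost + vendor_cost, 2)

# Figures from the example in this section
print(total_job_cost(24.10, 23.54))  # 47.64
```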

Vendor Cost Breakdown

This section provides a detailed breakdown of costs incurred from using third-party vendor resources in conjunction with a data processing job. These costs are in addition to platform-specific costs (such as Databricks) and typically cover external infrastructure services used during the job’s execution. The data includes the following key cost components:

| Widget | Description |
| --- | --- |
| Virtual Machines Cost | The cost associated with running virtual machines provided by an external cloud vendor. These machines may be used for compute tasks, supporting services, or extensions outside the primary processing environment. |
| Storage Cost | Charges for storing data externally, such as intermediate files, logs, or outputs. Storage cost can vary based on data size, storage type (standard or premium), and duration. |
| Virtual Network Cost | The cost of using virtual network infrastructure, such as private IPs, VPC peering, or internal communication between services. These costs apply when network traffic routes through vendor-managed infrastructure. |
| Bandwidth Cost | The cost of data transferred between systems or across network boundaries, especially when large volumes of data move between the processing environment and external systems or storage. |

New Jobs: Job run details, including execution time and resource consumption metrics, are available only if the user has enabled the Databricks initialization (init) script. This ensures that the necessary monitoring and metrics collection tools are in place before jobs are executed.

Historical Data: For jobs that were executed prior to onboarding the data source or enabling the initialization script, detailed job run metrics and resource utilization data will not be available. Only jobs executed after the onboarding process or script enablement will show detailed metrics in the Job Details Page.

ADOC Recommendation: Ensure that the initialization script is configured at the time of data source onboarding to capture detailed job metrics for future analysis.
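For reference, an init script is attached through the cluster specification's init_scripts field in the Databricks Clusters API. The sketch below is illustrative only: the cluster name, runtime version, and script path are placeholders, and the actual destination and script contents come from your ADOC onboarding instructions.

```python
# Sketch of a cluster spec that attaches an init script via the Databricks
# Clusters API (clusters/create). All values below are placeholders; use the
# script path supplied during ADOC data source onboarding.
cluster_spec = {
    "cluster_name": "adoc-monitored-cluster",   # placeholder name
    "spark_version": "13.3.x-scala2.12",        # example runtime version
    "node_type_id": "Standard_DS3_v2",          # instance type from this doc
    "num_workers": 2,
    "init_scripts": [
        {"workspace": {"destination": "/Shared/adoc-init.sh"}}  # placeholder path
    ],
}
```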

Node Size Recommendations

Node Size Recommendations in Databricks Compute suggest how to size Spark executor nodes based on cost, runtime performance, and workload characteristics. These recommendations assist users in optimizing resource allocation and improving job execution efficiency.

Where do Node Size Recommendations Apply?

Static Clusters: Recommendations are offered for jobs running on static clusters where the number of workers is predefined and fixed throughout the execution. The system offers suggestions for:

  • Optimal number of cores
  • Memory per executor
  • Number of workers

Auto-Scale Clusters: For clusters with autoscaling enabled, Databricks automatically scales up or down based on resource needs. In this case, node size recommendations offer:

  • Minimum and maximum worker configurations
  • Estimated job completion time for various configurations
  • Cost estimation for each worker configuration

What are the key metrics that drive the Node Recommendations?

Node size recommendations rely on Spark Job Performance metrics such as:

| Metric | Description |
| --- | --- |
| CPU Utilization | High CPU usage indicates that more cores per executor are required, while low CPU usage leads to a recommendation for fewer cores. |
| Memory Utilization | If memory usage is high, the system suggests increasing the memory per executor. Conversely, low memory usage suggests reducing the allocated memory to avoid resource wastage. |
| Shuffle Operations | The recommendations also take into account the shuffle fetch wait time and shuffle remote bytes read, which affect the need for additional executors. |
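To make the CPU Utilization rule concrete, here is a toy heuristic in the same spirit. The thresholds and scaling factors are illustrative assumptions, not ADOC's actual model.

```python
def recommend_cores(avg_cpu_pct, current_cores):
    """Toy heuristic mirroring the CPU Utilization rule: scale cores up
    when CPU runs hot, down when it idles. Thresholds are illustrative."""
    if avg_cpu_pct > 80:
        return current_cores * 2       # CPU-bound: add cores
    if avg_cpu_pct < 30:
        return max(1, current_cores // 2)  # idle: shrink, keep at least 1
    return current_cores               # in range: keep as-is

print(recommend_cores(90, 4))  # 8
print(recommend_cores(15, 4))  # 2
print(recommend_cores(55, 4))  # 4
```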

It is important to understand the conditions under which node size recommendations are not available:

Single Node Clusters: No recommendations are made for clusters with a single node.

Jobs Without Spark Stages: If a Databricks job does not contain Spark stages, such as non-Spark jobs, no recommendations will be made.

Failed or Cancelled Jobs: Recommendations are unavailable for failed or cancelled jobs, as the Spark context required for analysis is not accessible.

All-Purpose Clusters: Jobs operating on all-purpose clusters do not receive node size recommendations, as these clusters auto-scale dynamically, making static recommendations less useful.

Driver and Executor Summary

This section provides detailed information about the driver used during the job execution. The driver is responsible for managing and orchestrating tasks across executors in a distributed computing environment like Databricks.

| Widget | Description |
| --- | --- |
| Name | The unique identifier for the driver instance. This name helps track and reference the specific driver used for the job. The driver instance is usually associated with the cluster where the job was executed. |
| User | The user account that initiated or controlled the driver. In this case, the user is root, indicating that the job was run with administrative or elevated permissions. |
| Duration | How long the driver was active during the job's execution. In this example, the driver was active for 14.10 minutes. This metric is crucial for understanding the time taken by the driver to manage the job's execution and resource distribution. |
| Max Heap Used | The maximum amount of heap memory consumed by the driver during the job. In this case, the driver used up to 5.32 GB of heap memory. Heap memory is critical for the driver's performance, as it is used for object creation, caching, and other memory-intensive tasks. |
| Instance Type | The type of virtual machine or hardware configuration used for the driver instance. The instance type specifies the resources (such as CPU and memory) allocated to the driver and impacts its overall performance and efficiency. |
| Cores | The number of CPU cores allocated to the driver. In this case, the driver is using 4 cores. More cores typically allow for better multitasking and parallel processing. |
| Memory Available | The total memory allocated to the driver. Here, the driver has 8.62 GB of memory available. This is crucial for handling data processing and managing job tasks effectively. |
| Jobs | The number of jobs processed by the executors. In this case, the executors processed 6,244 jobs. Executors are responsible for running the actual tasks associated with the job. |
| Stages | The number of stages executed by the job. This job completed 6,244 stages, which represent different phases of job execution, such as shuffling, sorting, or aggregating data. |
| Max Used Memory | The peak memory usage by the executors during job execution. In this case, the maximum memory used was 743.184 MB. Monitoring this helps ensure that executors are not running out of memory during execution, which could cause performance degradation or task failures. |
| Instance Type (executor) | The type of instance used for the executors, in this case Standard_DS3_v2. This type specifies the compute and memory configuration, impacting the executors' ability to handle tasks. |
| Cores per Instance | The number of CPU cores available for each executor instance. In this example, each executor is allocated 4 cores, allowing for concurrent processing of tasks. |
| Memory Available (per executor) | The total memory available per executor instance, which is 8.67 GB in this case. Sufficient memory is essential for efficient data processing and task execution. |
| Total Instances | The total number of executor instances used in the job. Here, there is 1 executor instance. Increasing the number of instances can improve job performance by parallelizing tasks across more resources. |

Executor Node Recommendation

The Executor Node Recommendation section provides guidance on the optimal configuration of executor nodes based on different optimization criteria such as cost, runtime, or a balanced approach. The section also offers recommendations for both Auto-Scale and Static Cluster configurations. These recommendations help users optimize job performance while managing resource usage and cost.

Optimization Types

Node size recommendations in the Executor Node Recommendation widget are provided for different optimization strategies:

  1. Cost-Optimized Recommendation: Aims to reduce resource costs while maintaining acceptable performance.
  2. Runtime-Optimized Recommendation: Focuses on minimizing job execution time, possibly at the expense of higher costs.
  3. Balanced Recommendation: Strikes a balance between cost efficiency and performance.
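A sketch of how a consumer might select among these strategies, using the example figures quoted in this section. The record shape is a hypothetical illustration, not ADOC's data model.

```python
# Hypothetical records shaped after the example figures in this section:
# instance type, estimated minutes, worker bounds, and vendor cost (USD).
RECOMMENDATIONS = {
    "cost":     {"instance": "Standard_D3_v2",  "est_minutes": 12.79,
                 "min_workers": 1, "max_workers": 5, "vendor_cost": 0.07},
    "runtime":  {"instance": "Standard_DS3_v2", "est_minutes": 12.79,
                 "min_workers": 2, "max_workers": 3, "vendor_cost": 0.14},
    "balanced": {"instance": "Standard_DS3_v2", "est_minutes": 12.79,
                 "min_workers": 2, "max_workers": 3, "vendor_cost": 0.14},
}

def pick(strategy):
    """Return the recommended auto-scale configuration for a strategy."""
    return RECOMMENDATIONS[strategy]

print(pick("cost")["vendor_cost"])  # 0.07
```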

The section contains the following widgets and sections:

Recommendations for Auto-Scale Cluster Configuration

| Section | Description |
| --- | --- |
| Recommendation | The optimization goal (e.g., Cost Optimized, Runtime Optimized, or Balanced Optimized). |
| Instance Type | The type of virtual machine instance recommended for the job. |
| Estimated Time | The expected time to complete the job with the recommended instance configuration. In this example, the estimated time is approximately 12.79 minutes for all recommendations. |
| Min Worker Count | The minimum number of workers allocated when the job starts. For the Cost Optimized recommendation, the minimum worker count is 1, while for Runtime Optimized and Balanced Optimized it is 2. |
| Max Worker Count | The maximum number of workers that can be dynamically added as the job's demand grows. In the Cost Optimized configuration, the maximum worker count is 5, while for the other configurations it is 3. |
| Vendor Cost | The estimated cost incurred by the vendor (e.g., cloud service provider) for running the job. In the Cost Optimized recommendation, the cost is $0.07, whereas for Runtime Optimized and Balanced Optimized the cost is $0.14. |

Recommendations for Static Cluster Configuration

| Section | Description |
| --- | --- |
| Recommendation | The optimization goal for the static cluster configuration (e.g., Cost Optimized, Runtime Optimized, or Balanced Optimized). |
| Instance Type | The recommended virtual machine instance type (e.g., Standard_DS3_v2). |
| Estimated Time | The estimated time to complete the job with the given instance configuration, which is 12.79 minutes for all configurations. |
| Worker Count | The fixed number of workers allocated for the entire duration of the job. For all optimization types, the recommended worker count is 2. |
| Vendor Cost | The estimated cost for the static cluster configuration, which remains constant at $0.14 for all optimization types. |

Recommendation Based on Different Instance Types with Auto-Scale

| Section | Description |
| --- | --- |
| Instance Type | Lists the different instance types that can be used for the job. |
| Estimated Time | The expected time to complete the job with the specified instance type. In this case, the estimated time for all configurations is around 12.79 minutes. |
| Min Worker Count | The minimum number of worker nodes assigned at the beginning of the job. Some instance types recommend starting with 1 worker, while others recommend 2. |
| Max Worker Count | The maximum number of workers that can be dynamically added to scale the job. For certain instances, such as Standard_DS3_v2, the maximum worker count is 3, while for others it is 5. |
| Vendor Cost | The estimated vendor cost associated with using each instance type. For the Standard_DS3_v2 instance, the cost is around $0.14, while for Standard_D3_v2 the cost is $0.07. |

Trends

The Trends section visualizes key metrics over time for the job run.

  • Executor Memory: Shows the amount of memory used by the executors.
  • Executor Cores: Displays the number of CPU cores utilized by the executors.
  • Input Bytes Read: Reflects the amount of input data read by the job.

These metrics help monitor resource consumption and job performance. Users can also use the Compare Runs option to analyze and compare these trends across different job runs for deeper insights.

Limits

The Limits section provides insights into the scalability constraints of a Spark application by analyzing three key metrics:

Wall Clock Time:

Driver Wall Clock Time: Measures the time spent by the driver in coordinating the job execution.

Executor Wall Clock Time: Shows the total time spent by executors in processing tasks.

Total Wall Clock Time: The combined time for the driver and executors, reflecting the overall job duration.

Ideal Times:

Critical Path: Represents the minimum time required for the job to complete under ideal conditions.

Ideal Application Time: The estimated optimal runtime for the application based on resource availability.

Actual Runtime: The real execution time taken by the application.

OCCH (One Core Compute Hour)

  • Displays the available and wasted compute hours for the job.
  • OCCH wasted by the Executor and Driver highlights inefficiencies in resource usage.
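
As a back-of-the-envelope illustration of the OCCH idea, one core held for one hour is 1.0 OCCH, and wasted OCCH is simply allocated core-hours minus core-hours actually spent on tasks. The run sizes and the 60% utilization figure below are assumptions for the example, not values from the product:

```python
def occh(cores: int, seconds: float) -> float:
    """One Core Compute Hours: one core held for one hour equals 1.0 OCCH."""
    return cores * seconds / 3600.0

# Hypothetical run: 8 executor cores and 1 driver core held for 30 minutes,
# with executors busy on tasks only 60% of that time.
available = occh(8, 1800) + occh(1, 1800)   # 4.5 OCCH allocated
used = occh(8, 1800 * 0.6)                  # 2.4 OCCH spent on tasks
wasted = available - used                   # 2.1 OCCH idle or coordinating
```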

Metrics

The Metrics section provides detailed insights into the performance of the Spark executors. It consolidates various performance metrics to help users analyze resource usage and job behavior.

  • Storage Memory: Shows how much memory is allocated, used, and available for both on-heap and off-heap memory, helping monitor the memory footprint of the job.
  • Schedule Information: Tracks active tasks and thread pool size over time, providing insights into task execution concurrency and thread utilization.
  • Bytes Read/Written: Displays the total amount of data read and written by the job, giving a clear view of input/output performance.
  • File System Bytes Read/Written: Highlights the bytes read and written directly from and to the filesystem, helping identify heavy I/O operations.
  • Shuffle Information: Provides details on shuffle operations, including bytes written and bytes read from both local and remote sources. Shuffle operations are critical for job performance, especially in distributed data processing.
  • Spark JVM GC and CPU Time: Visualizes the time spent on JVM Garbage Collection (GC) and CPU time, which are key indicators of system performance. High GC times may suggest inefficient memory usage, while CPU time reflects the computational load.
  • Records Read/Written: Displays the number of records read and written during job execution, helping measure data throughput.
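
Several of these metrics (GC time, shuffle bytes, records read and written) are also exposed per executor by Spark's own monitoring REST API, which can be handy for cross-checking. A minimal sketch; `base_url` and `app_id` are placeholders, and the 10% GC threshold is an illustrative rule of thumb rather than a product default:

```python
def executors_endpoint(base_url: str, app_id: str) -> str:
    """Spark monitoring REST API path listing per-executor metrics."""
    return f"{base_url}/api/v1/applications/{app_id}/executors"

def gc_fraction(total_gc_time_ms: int, total_duration_ms: int) -> float:
    """Fraction of task time spent in JVM garbage collection."""
    return total_gc_time_ms / total_duration_ms if total_duration_ms else 0.0

# An executor spending more than ~10% of task time in GC is usually worth a look.
print(gc_fraction(total_gc_time_ms=150, total_duration_ms=1000) > 0.10)  # True
```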

Spark Details Aggregate Metrics

  • Task Duration (milliseconds): Total time spent by the task, starting from its creation.
  • JVM GC Time (milliseconds): Amount of time spent in GC while this task was in progress.
  • Executor CPU Time (nanoseconds): CPU time the executor spent running this task. This includes time fetching shuffle data.
  • Executor Deserialize CPU Time (nanoseconds): CPU time taken on the executor to deserialize this task.
  • Executor Deserialize Time (milliseconds): Elapsed time spent deserializing this task.
  • Executor Runtime (milliseconds): Total time spent by the executor core running this task.
  • Peak Execution Memory (bytes): Maximum execution memory used by the task.
  • Input Bytes Read (bytes): Number of bytes read by the task (using read APIs).
  • Output Bytes Written (bytes): Number of bytes written by the task (using write APIs).
  • Disk Bytes Spilled (bytes): Size of bytes spilled to disk (can differ if compressed).
  • Memory Bytes Spilled (bytes): Number of bytes that were spilled to disk during the task.
  • Result Size (bytes): The number of bytes sent by the task back to the driver.
  • Result Serialization Time (milliseconds): Elapsed time spent serializing the task result.
  • Shuffle Read Bytes Read (bytes): Total bytes read by the task for shuffle data.
  • Shuffle Read Fetch Wait Time (milliseconds): Time spent by the task waiting for shuffle data.
  • Shuffle Read Local Blocks (number): Shuffle blocks fetched from the local machine (disk access).
  • Shuffle Read Records Read (number): Total records read by the task for shuffle data.
  • Shuffle Read Remote Blocks (number): Shuffle blocks fetched from remote machines (network access).
  • Shuffle Write Bytes Written (bytes): Total shuffle bytes written by the task.
  • Shuffle Write Records Written (number): Total shuffle records written by the task.
  • Shuffle Write Time (nanoseconds): Amount of time spent by the task writing shuffle data.
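
Two of these aggregates are common trouble signals: any non-zero Memory Bytes Spilled means a task ran out of execution memory, and a large Shuffle Read Fetch Wait Time means tasks stalled waiting on remote shuffle blocks. A small sketch of that reading (the dictionary keys and the one-second threshold are illustrative, not product field names):

```python
def flag_task(metrics: dict, fetch_wait_threshold_ms: int = 1000) -> list:
    """Return human-readable flags for a task's aggregate metrics."""
    flags = []
    if metrics.get("memory_bytes_spilled", 0) > 0:
        flags.append("spill")          # execution memory was exhausted
    if metrics.get("shuffle_read_fetch_wait_time_ms", 0) > fetch_wait_threshold_ms:
        flags.append("shuffle-wait")   # stalled fetching remote shuffle blocks
    return flags

print(flag_task({"memory_bytes_spilled": 4096,
                 "shuffle_read_fetch_wait_time_ms": 2500}))
# ['spill', 'shuffle-wait']
```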

Spark SQL Executions

This table provides details about each Spark SQL execution within the job.

  • Execution ID: A unique identifier for each SQL execution, used to track and differentiate executions.
  • Description: A brief description of the SQL query or operation being executed, which provides insight into the query or code being run (e.g., the specific pyspark.sql.functions being used).
  • Start Time: The exact timestamp when the SQL execution started, helping track when the query began processing.
  • End Time: The timestamp when the SQL execution completed, giving the total execution duration for that query.
  • Duration: The total time taken for the execution to complete, displayed in milliseconds or seconds depending on the length of the query execution.
  • State: The current status of the SQL execution, which can be Running, Completed, or other statuses depending on the progress of the query.
  • More details: A link that provides additional details about the specific SQL execution, including deeper insights into the query's performance and execution plan.
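
The Duration column is derived directly from the two timestamps: End Time minus Start Time. A minimal sketch with hypothetical ISO-8601 timestamps:

```python
from datetime import datetime

def execution_duration_ms(start: str, end: str) -> int:
    """Duration of a SQL execution, in milliseconds, from ISO-8601 timestamps."""
    delta = datetime.fromisoformat(end) - datetime.fromisoformat(start)
    return int(delta.total_seconds() * 1000)

print(execution_duration_ms("2024-05-01T10:00:00", "2024-05-01T10:00:12.500"))  # 12500
```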

Stages

The Stages section provides insights into the different stages of a Spark application. There are two available views: List and Timeline.

  • The List tab displays a detailed breakdown of each stage in a table format, providing insights into the tasks and performance of each stage.
  • The Timeline tab shows the stages of the application in a visual format, where each stage is represented as a horizontal bar.

Driver & Executor Stats

This section shows driver and executor CPU and memory usage. All executors are listed in the charts.

  • CPU Usage (Driver & Executor): These charts display the percentage of CPU usage over time for both the driver and the executors. Monitoring CPU usage helps identify whether resources are underutilized or overutilized.
  • Memory Usage (Driver & Executor): This chart shows the memory usage percentage for both the driver and the executor, helping users track how efficiently memory is being consumed.
  • Heap Usage (Driver & Executor): Displays the heap memory usage over time, which is important for identifying potential memory leaks or inefficiencies in memory allocation.
  • Core Wastage (Driver & Executor): Tracks the number of CPU cores that are being wasted during job execution. High core wastage may indicate inefficient resource allocation or over-provisioning.
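
When reading these charts, sustained extremes matter more than brief spikes. The sketch below captures the usual interpretation of average CPU usage; the 20%/80% bands are illustrative thresholds, not product defaults:

```python
def classify_cpu(avg_usage_pct: float) -> str:
    """Rough reading of average CPU usage from the Driver & Executor charts."""
    if avg_usage_pct < 20:
        return "underutilized"   # consider smaller or fewer nodes
    if avg_usage_pct > 80:
        return "overutilized"    # consider scaling up or out
    return "healthy"

print(classify_cpu(12.0))  # underutilized
```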

Other Widgets and Interactive Features

  • Compare Runs: Allows users to compare the current job run with previous runs. This comparison can help identify performance improvements or degradations over time.
  • Spark SQL Executions: If applicable, provides detailed statistics on the execution of SQL queries run during the job. This can help users analyze how efficiently SQL tasks were executed.
  • Stages: Displays detailed information about the different stages of the job, helping users track progress and identify potential bottlenecks in the pipeline.
  • Driver & Executor Summary: Users can review the performance of both the driver and executors in detail, allowing for comprehensive analysis of memory and resource usage.
  • Metric Analysis: Users can explore the different metrics that provide insights into job execution. Metrics such as Memory Usage and Heap Usage are critical for performance tuning and identifying resource constraints.
  • Trends and Stages: When available, users can use the Trends section to understand job behavior over time. The Stages section helps in identifying slow or inefficient stages of the job.
  • Comparison: Users can make use of the Compare Runs feature to evaluate how job performance has changed over time, especially after modifying configurations or code optimizations.

All Purpose Cluster

All Purpose Cluster Tab

Key Features and Sections

Clusters Name Panel

The left sidebar displays a list of all the clusters along with their corresponding total costs. Users can select a specific cluster to view more detailed information, including the total cost breakdown by date.

  • Search Bar: Users can search for a specific cluster by entering the cluster name in the search box, which filters the displayed clusters accordingly.
  • Cluster Name List: Displays the names of the clusters along with their total costs. The selected cluster is highlighted, and the total cost is updated accordingly in the graphical and tabular sections.

Graphical Representation of Total Costs

A bar chart visualizes the Total Cost for the selected cluster over time. The x-axis represents the dates, and the y-axis shows the costs in USD. This graph allows users to quickly identify cost patterns and spikes on specific dates.

  • The chart title updates based on the selected cluster.
  • Hovering over individual bars provides detailed cost information for that specific date.

Cost Breakdown Table

A detailed table below the graph breaks down the Databricks Cost, Vendor Cost, and Total Cost by date for the selected cluster. Users can sort the table by each column to view the highest or lowest costs over time.

  • Date: The date on which the cost was incurred.
  • Databricks Cost: The cost associated with Databricks resources for that day.
  • Vendor Cost: Any costs related to third-party vendors for that specific day.
  • Total Cost: The sum of the Databricks Cost and Vendor Cost for each day.
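
The last column is a straightforward per-day sum of the other two. A sketch with hypothetical rows:

```python
# Hypothetical daily rows mirroring the cost breakdown table.
rows = [
    {"date": "2024-05-01", "databricks_cost": 1.20, "vendor_cost": 0.80},
    {"date": "2024-05-02", "databricks_cost": 0.90, "vendor_cost": 0.60},
]

# Total Cost = Databricks Cost + Vendor Cost, per day.
for row in rows:
    row["total_cost"] = round(row["databricks_cost"] + row["vendor_cost"], 2)

print([r["total_cost"] for r in rows])  # [2.0, 1.5]
```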

Interaction Workflows

Cluster Selection: From the left sidebar, users can select a cluster to analyze. Upon selection, the graph and table will automatically update to reflect the costs for the chosen cluster.

Graph Analysis: The graph provides a visual summary of the total costs over time, allowing users to quickly identify cost spikes or changes in resource usage.

Detailed Cost Breakdown: Users can scroll through the detailed cost breakdown in the table, sorting by any column to identify key cost drivers on a daily basis.

Exporting Data: By using the Download button, users can easily export the cost data for further analysis or integration into external reporting tools.
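
Once exported, the sorting the table offers can be reproduced offline. A sketch using Python's standard csv module; the column names and rows are hypothetical and should be adjusted to the actual export layout:

```python
import csv
import io

# Hypothetical export matching the cost breakdown table.
exported = """date,databricks_cost,vendor_cost,total_cost
2024-05-01,1.20,0.80,2.00
2024-05-02,2.10,1.40,3.50
"""

# Sort descending by total cost to surface the biggest cost drivers first.
rows = list(csv.DictReader(io.StringIO(exported)))
rows.sort(key=lambda r: float(r["total_cost"]), reverse=True)
print(rows[0]["date"])  # 2024-05-02 -- the costliest day
```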

The Databricks Compute interface provides a comprehensive and versatile set of tools and insights that empower users to manage and optimize their Databricks environments effectively. Through various tabs—Overview, Clusters, Job Studio, and All Purpose Cluster—users gain the ability to monitor and analyze critical metrics such as cluster states, job performance, resource utilization, and associated costs.

These tabs, combined with enhanced filter functionalities and interactive data visualizations, give users the ability to make informed, data-driven decisions that improve operational efficiency, optimize resource allocation, and manage costs effectively. The intuitive interface and powerful insights make it easier to detect potential issues, resolve errors, and fine-tune both workflows and infrastructure for optimal performance.

Job Runs

Job Runs Tab

This page provides an overview of Job Runs within the Databricks environment. It includes a detailed table listing information about completed and ongoing jobs, with multiple filtering options to narrow down the data. Key sections and features of the page include:

Filters

  • Cluster Type: Allows filtering by the type of cluster used, such as job clusters or all-purpose clusters.
  • Status: Filters jobs based on their status, such as Success, Failed, Canceled, or Running.
  • Owner: Filters jobs by the user who initiated or owns the job.

Job Runs Aggregate Table

  • Cluster Name: The name of the cluster associated with the job run.
  • Cluster ID: The unique identifier for the cluster.
  • Cluster Type: Indicates the type of cluster (e.g., job_cluster or all_purpose_cluster).
  • Job Name: The name of the job or query being run.
  • Status: The completion status of the job (e.g., SUCCESS, FAILED, CANCELED).
  • Job ID: The unique identifier of the job run.
  • Duration: Specifies the time taken to complete the job.
  • DBU Consumed: Indicates the Databricks Units consumed during the job run.
  • Estimated Databricks Cost: The estimated cost incurred for Databricks usage.
  • Estimated Vendor Cost: The estimated cost incurred from the underlying cloud vendor.
  • Start Time: The time when the job run started.
  • End Time: The time when the job run ended.
  • Executor Heap Used %: The percentage of heap memory utilized by the executor during job execution.
  • CPU Used %: The percentage of CPU resources consumed by the executor.
  • Executor Memory: The total memory allocated to the executor for processing tasks.
  • Diagnostics: Additional diagnostic information or status for the job, such as errors or workload details.
  • Owner: The email address or identifier of the person who owns or initiated the job.
  • App Id: A unique identifier assigned to each application or job instance.
  • App Name: The name or reference label associated with the job or application.
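
Comparable run records can also be pulled from the Databricks Jobs API (`/api/2.1/jobs/runs/list`) for offline analysis. The sketch below only builds the request URL; the workspace host is a placeholder, and authentication (a bearer token) would still be needed to actually call it:

```python
from urllib.parse import urlencode

def runs_list_url(host: str, limit: int = 25, active_only: bool = False) -> str:
    """URL for the Databricks Jobs API endpoint that lists job runs."""
    query = urlencode({"limit": limit, "active_only": str(active_only).lower()})
    return f"https://{host}/api/2.1/jobs/runs/list?{query}"

print(runs_list_url("example.cloud.databricks.com", limit=5))
```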

DLT Pipelines

DLT Pipelines Tab

This page provides an interface for managing and monitoring Delta Live Tables (DLT) pipelines. It displays a list of pipelines along with key details such as their current state, execution mode, recent run information, and performance metrics. You can apply filters to refine the pipeline list based on specific criteria.

Filters

  • Current State: Filter by the pipeline's operational status (e.g., Idle, Running, Failed).
  • Owner: Filter by the user or team managing the pipeline (e.g., usernames, email addresses).

DLT Pipelines Aggregate Table

  • Name: The name of the Delta Live Tables (DLT) pipeline.
  • Current State: The current operational state of the pipeline (e.g., Idle).
  • Owner: The user who owns or manages the pipeline.
  • Pipeline Execution: The environment in which the pipeline is executed (e.g., Development).
  • Pipeline Mode: The mode in which the pipeline runs (e.g., Triggered).
  • Total Runs: The total number of times the pipeline has executed.
  • Last Job Run: A link or identifier for the most recent pipeline job execution.
  • Last Run State: The result of the most recent run (e.g., Failed, Completed).
  • Last Run Duration: The amount of time the last run took to complete.
  • Last Run Start Time: The date and time when the most recent run started.
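
Summarizing the Last Run State column amounts to grouping the table by one field, which is what the state filter does under the hood. A sketch with hypothetical pipelines:

```python
from collections import Counter

# Hypothetical pipeline rows mirroring the aggregate table.
pipelines = [
    {"name": "bronze_ingest", "last_run_state": "Completed"},
    {"name": "silver_clean",  "last_run_state": "Failed"},
    {"name": "gold_agg",      "last_run_state": "Completed"},
]

by_state = Counter(p["last_run_state"] for p in pipelines)
print(by_state["Completed"], by_state["Failed"])  # 2 1
```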

Clicking a pipeline name opens the Pipeline Run Details side panel, displaying key information about the pipeline, such as:

Pipeline Run Details

  • Pipeline Name: The name of the pipeline being executed.
  • Cluster ID: The unique identifier of the cluster associated with the pipeline run.
  • State: The current status of the pipeline run (e.g., FAILED, SUCCESSFUL).
  • Cause: The reason for the pipeline run state (e.g., JOB_TASK).
  • Start Time: The date and time when the pipeline run started.
  • End Time: The date and time when the pipeline run ended.
  • Is Validate Only: Indicates whether the run was for validation only (true or false).
  • Is Full Refresh: Specifies whether the pipeline run performed a full refresh (true or false).
  • Execution: Details about the execution.
  • Databricks Cost: The cost associated with Databricks resources for the pipeline run.
  • Cloud Vendor Cost: The cost incurred for cloud resources used in the pipeline run.
  • Total Cost: The combined total cost of Databricks and cloud vendor resources.