Databricks Compute

The following tabs are present in Databricks Compute:

Filters

The Data Source Filter allows you to switch the Databricks data source. This enables you to view and analyze data across various sections based on the selected Databricks account or project, providing flexibility for monitoring and managing information across different data sources.

Overview

The Overview page in Databricks Compute provides a comprehensive and detailed view of your Databricks environment. It displays key information as widgets and graphs, offering insights into cluster performance, resource utilization, and potential issues across your Databricks clusters.

Overview Tab

This section helps you monitor critical metrics such as cluster states, resource consumption, and errors. By leveraging the interactive visualizations and adjustable filters, you can drill down into specific aspects of your environment, allowing you to make data-driven decisions to optimize performance and manage costs effectively.

The Overview page displays the following widgets:
Cluster States

Displays the number of clusters in different states (Pending, Running, Resizing, etc.). This widget provides an at-a-glance view of your cluster operations, helping you monitor the current status and identify any clusters that may require attention.

Pending: The number of clusters in pending state during the time period selected in the Global Calendar.

Running: The number of clusters in running state during the time period selected in the Global Calendar.

Restarting: The number of clusters which are restarting during the time period selected in the Global Calendar.

Resizing: The number of clusters which are being resized during the time period selected in the Global Calendar.

Terminating: The number of clusters which are getting terminated during the time period selected in the Global Calendar.

Terminated: The number of clusters which were terminated during the time period selected in the Global Calendar.

Databricks Users and Applications

Summarizes the users and applications active on your clusters during the selected time period. This helps you understand who is consuming cluster resources and which applications are driving usage.

This section displays two widgets described below.

Users: The number of users using clusters during the time period selected in the Global Calendar.

Applications: The number of applications used during the time period selected in the Global Calendar.

Average Core Usage Summary

This section displays three widgets which highlight usage of CPU cores. The values of the widgets depend on the time period selected in the Global Calendar.

Total Cores: The total number of available cores.

Allocated Cores: The total number of cores allocated out of the total available cores.

Used Cores: The total number of cores used out of the total number of allocated cores.
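As a rough illustration of how these three figures relate, the sketch below derives allocation and utilization percentages. This is a hypothetical helper, not ADOC's API; ADOC computes these values server-side.

```python
def core_usage_summary(total_cores, allocated_cores, used_cores):
    """Relate the three Average Core Usage Summary figures.

    Hypothetical sketch: allocation is measured against total cores,
    utilization against allocated cores, matching the widget definitions.
    """
    return {
        "allocation_pct": round(100 * allocated_cores / total_cores, 1),
        "utilization_pct": round(100 * used_cores / allocated_cores, 1),
    }

# e.g. 64 total cores, 48 allocated, 30 actually used
print(core_usage_summary(64, 48, 30))
# {'allocation_pct': 75.0, 'utilization_pct': 62.5}
```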

Average Memory Utilization Summary

This section displays three widgets which highlight usage of CPU memory. The values of the widgets depend on the time period selected in the Global Calendar.

Total Memory: The total amount of memory available.

Allocated Memory: The amount of memory allocated out of the total available memory.

Used Memory: The total amount of memory used out of the total amount of allocated memory.

Databricks Top 10 Users

This bar chart displays the top 10 users provisioning clusters on Databricks. Each bar represents a user; when you hover over a bar, you can view the number of clusters provisioned by that user. The x-axis represents users' email IDs and the y-axis represents the number of clusters provisioned.

Cluster Count by Instance Type

This chart shows the distribution of clusters across instance types. Each bar represents an instance type, and its height indicates the number of clusters using that type. This helps visualize the usage patterns of the various instance types within the system.

Active Clusters Over Time

This bar graph represents the number of active clusters during the time period selected in the Global Calendar. The x-axis represents date and time (values change as per the selection in the Global Calendar) and the y-axis represents the number of active clusters. When you hover over a bar, you can view the number of active clusters at that date and time.
Cluster Failure Over Time

Visualizes the number of cluster failures recorded over time. This graph is essential for detecting patterns or spikes in cluster failures, which may indicate underlying issues in the environment or specific workloads that require optimization.

The x-axis represents date and time (values change as per the selection in the Global Calendar) and the y-axis represents the number of cluster failures. Each bar represents a date and time, and each failure is associated with an error code. When you hover over a bar, you can view the number of failed clusters at that date and time, along with the error code for each failure. You can also filter this graph to view data for specific error codes.

Top Cluster Errors

A cluster error is considered a top error if its occurrence frequency is highest compared to other errors. This table lists those errors in two columns: the first displays the number of times an error occurred, and the second displays the associated error message.
DBU Consumed

Displays the Databricks Units (DBUs) consumed over time. Tracking DBU consumption is vital for understanding your Databricks usage and associated costs. This widget helps you monitor usage trends and identify opportunities to optimize resource allocation.

The x axis represents a date and time (values change as per the date and time selected in the Global Calendar). The y axis represents the number of DBUs consumed. Each data point represents a date and time. When you hover over a data point, you can view the number of DBUs consumed on that date and time.
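As an illustration of how such a chart buckets its data points, the sketch below aggregates hypothetical (timestamp, DBU) samples into daily totals. The input format is an assumption for illustration, not ADOC's API.

```python
from collections import defaultdict
from datetime import datetime

def dbus_per_day(samples):
    """Aggregate (ISO timestamp, DBUs) samples into daily totals,
    the way a DBU Consumed chart buckets its data points."""
    totals = defaultdict(float)
    for ts, dbus in samples:
        day = datetime.fromisoformat(ts).date().isoformat()
        totals[day] += dbus
    return dict(totals)

samples = [
    ("2024-05-01T09:00:00", 1.5),
    ("2024-05-01T17:30:00", 2.0),
    ("2024-05-02T10:00:00", 0.75),
]
print(dbus_per_day(samples))
# {'2024-05-01': 3.5, '2024-05-02': 0.75}
```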

Average CPU UsageThis trend graph represents the amount of CPU used by node of all cluster types for the selected time period. The x-axis displays the date and time (values change as per the date and time selected in the Global Calendar). The y-axis displays the amount of CPU used by the executor node or driver node.
Average Memory UsedThis trend graph represents the amount of CPU memory used by node of all cluster types for the selected time period. The x-axis displays the date and time (values change as per the date and time selected in the Global Calendar). The y-axis displays the amount of CPU memory used by the executor node or driver node.
Average Core UsageThis trend graph displays the amount of CPU core used during the time period selected in the Global Calendar. The x axis displays the date and time (values change as per the date and time selected in the Global Calendar). The y axis displays the number of cores used. Each trend line represent a memory type; available cores, allocated cores, and used cores. Each data point represents a date and time. When you hover over a data point, you can view the total number of available cores, allocated cores, and used cores on that date and time.
Average Memory UtilizationThis trend graph displays the amount of CPU memory used during the time period selected in the Global Calendar. The x axis displays the date and time (values change as per the date and time selected in the Global Calendar). The y axis displays the amount of memory used. Each trend line represent a memory type; available memory, allocated memory, and used memory. Each data point represents a date and time. When you hover over a data point, you can view the total amount of memory available, allocated, and used on that date and time.
Core Wastage Over Time

This trend graph represents the amount of wasted or unused CPU cores over a specific time period in a cluster or system. The x-axis represents the time period, and the y-axis represents the core wastage, usually as the absolute number of unused cores.

This graph is particularly useful for identifying periods of low workload or idle times when CPU cores are not fully utilized. It helps in assessing the efficiency of resource allocation and workload scheduling, allowing you to optimize resource utilization and minimize wastage.
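Conceptually, core wastage for an interval is the allocated cores minus the cores actually used. A minimal sketch, assuming per-interval samples:

```python
def core_wastage(allocated, used):
    """Per-interval wasted cores: allocated minus used (assumed definition)."""
    return [a - u for a, u in zip(allocated, used)]

# Four sample intervals: allocation vs. actual usage
allocated = [16, 16, 32, 32]
used      = [12,  4, 30,  8]
print(core_wastage(allocated, used))  # [4, 12, 2, 24]
```

Large values in the result mark idle periods worth investigating for down-sizing.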

Common ways to use these widgets:

  • Identifying Bottlenecks and Optimizing Performance: Use the Cluster States and Cluster Failures Over Time widgets to quickly identify any clusters that are not performing as expected. This information can guide you in troubleshooting and optimizing those clusters for better performance.
  • Cost Management: Leverage the DBU Consumed widget to monitor your usage costs closely. By analyzing trends in DBU consumption, you can make informed decisions on scaling resources up or down to manage costs effectively.
  • Error Resolution: The Top Cluster Errors widget allows you to quickly pinpoint the most frequent issues affecting your clusters. Resolving these errors promptly can prevent potential downtime and maintain the stability of your Databricks environment.

The Enhanced Filter on Search function offers significant filtering options along with several additional features that optimize search on the Compute page. Users can combine multiple filter conditions to efficiently refine search results, and the UI has been decluttered and optimized for usability.

  • Expanded Filterable Columns: The search dropdown now includes a broader variety of columns, allowing users to filter results based on more precise criteria such as cluster status, source, duration, and user.
  • Decluttering Mechanism: Columns that are already visible can be hidden from the filter dropdown, keeping the list tidy and manageable.
  • Contextual Filters: The system offers column-specific alternatives that adapt to the user's current view, resulting in more intuitive filter selections and a smoother navigation experience.
  • Preservation of Filters Across Navigations: Filters are now preserved across navigations, reducing the need to continually apply the same filters.
  • Primary Focus on Equality Operator: The = operator is now the primary focus of the filtering interface, which helps to expedite interactions and simplify data retrieval.

Clusters

Clusters Tab

| Widget | Description |
| --- | --- |
| Cluster Name | The name of the cluster. This column is frozen; you can view it even when you scroll right. Clicking the cluster name redirects you to the Job Studio page. |
| Cluster ID | A system-generated identifier unique to each cluster instance, used for backend tracking and reference. Clicking the Cluster ID redirects you to the past runs associated with that cluster. |
| Status | The current state of the cluster, such as Running, Terminated, Pending, or Resizing. Indicates the cluster's operational status. |
| Duration | The total amount of time the cluster has been active, measured from the start time to the end time, or to the current time if still active. |
| Total DBU Consumed | The total Databricks Unit (DBU) consumption, representing the compute resources consumed by the cluster. |
| Actual Databricks Cost | Total cost incurred from using Databricks services for the workload. |
| Actual Cloud Total Cost | Combined cost of all cloud resources consumed during the workload. |
| Actual Cloud VM Cost | Cost specifically attributed to virtual machine usage in the cloud environment. |
| Recommended Cloud VM Cost | Estimated cost if a more optimal virtual machine configuration were used. |
| Recommended Instance Type | Suggested VM instance type that could improve cost-efficiency or performance. |
| Start Time | The exact time when the cluster was initiated, helping track when the associated job or task began. |
| End Time | The time when the cluster terminated, either because the job completed or due to manual termination. If the cluster is still running, this field is empty. |
| Cluster Source | The source that initiated the cluster, such as Job, API, UI, or Pipeline. This helps track how the cluster was created. |
| User | The email ID of the user who initiated or is running the cluster, identifying who is responsible for the cluster's activities. |
| Termination Type | The method or reason for the cluster's termination, such as Success, Client Error, or User Request. |
| Termination Code | Further details about why the cluster was terminated, such as Job Finished, User Request, or specific error codes. |
| Diagnostic Reason | Detailed diagnostic information about the termination or errors encountered during the cluster's lifecycle. |
| Spark Version | The specific version of Apache Spark running on the cluster, ensuring compatibility with different jobs and tasks. |
| Worker Node Type | The type of worker nodes used in the cluster, which determines the resources allocated for executing tasks. |
| Driver Node Type | The type of driver node used in the cluster, which manages job execution and coordinates the tasks running on worker nodes. |

Cluster Details

To proceed to the details page of a particular cluster, click on the cluster name.

On the cluster Details page, you can view the following information: Past Runs chart and Past Job Runs Details table.

The Past Runs chart presents a bar graph that visualizes the count of DBUs and their associated costs on the y-axis, while the x-axis denotes the corresponding date and time when the job consumed a specific number of DBUs.

| Column Name | Description |
| --- | --- |
| Creation Time | Date and time at which the cluster was created. |
| State | The current state of the cluster or job. |
| DBU Consumed | Amount of Databricks Units consumed. |
| Start Time | Time at which the job execution began. |
| Termination Time | Time at which the job execution was completed. |
| Executor Config | The settings and specifications that determine how the cluster's executors are configured. |
| Number of Workers | The total count of worker nodes allocated across clusters for processing tasks. |
| Min Workers | The minimum number of worker nodes used for the job to run. |
| Max Workers | The maximum number of worker nodes used for the job to run. |
| Executor Memory | The memory capacity allocated to your Databricks cluster. |
| Duration | The time taken for execution of the job run. |
| Balanced Recommendation | Displays recommendations for balanced performance. |
| Cost Recommendation | Displays recommendations for cost objectives. |
| Runtime Recommendation | Displays runtime recommendations. |
| State Message | Displays the message on cluster state. |
| Username | Displays the name of the user. |

Note: Even in the absence of previous job runs for comparison, you can still access and examine the details of a single job run for any type of cluster.

Job Studio

The Job Studio page provides a comprehensive overview of all Databricks jobs, offering a detailed interface that enables users to track, monitor, and manage various jobs within their system. The intuitive layout presents job-related data in a tabular format, supported by filter options for enhanced navigation and drill-down capabilities.

All cost-related data on ADOC is displayed in US Dollars (USD) as the standard unit of measurement. Note that currency conversion is currently not supported.

For example: If your Azure account shows costs in another currency (e.g., £110), ADOC will display the numerical value in USD (e.g., $110). This applies to all cost charts, including both actual and estimated costs, which are shown exclusively in USD.

At the top of the Job Studio page, a graphical representation displays job counts over time, segmented by the following job statuses:

  • Canceled: Jobs that were intentionally halted before completion.
  • Failed: Jobs that encountered errors and could not complete.
  • Success: Jobs that successfully ran to completion without issues.

The central table displays all jobs that meet the filtering criteria, offering an in-depth view of their details.

| Field | Description |
| --- | --- |
| Job Name | The name of the job, often used to describe its function or purpose. |
| Cluster ID | A unique system-generated identifier for the cluster running the job, helping identify and track the cluster. |
| Job Status | The current status of the job: Success, Failed, or Canceled. |
| Actual Databricks Cost | The actual cost incurred by running the job, calculated based on the resources used. |
| Estimate Databricks Cost | The estimated cost for running the job, specific to the Databricks resources used. |
| Estimate Vendor Cost | Any additional costs related to third-party vendor resources used during the job. |
| Total Job Cost | The total cost associated with the job, combining Databricks and vendor estimates. |
| Start Time | The exact time when the job began execution. |
| End Time | The time the job finished or was terminated. |
| Duration | The total time the job ran, calculated from start to end time. |
| Vendor Storage Cost | The storage cost charged by external vendors for storing data related to the job. |
| Vendor Virtual Machines Cost | The total expense incurred for using virtual machines provided by the vendor. |
| Vendor Virtual Network Cost | Costs incurred for using a virtual network provided by external vendors during the job's execution. |
| Vendor Bandwidth Cost | Bandwidth costs charged by external vendors for data transfers during job execution. |
| Run Page URL | A direct link to the job's run page for more detailed information and access to logs, metrics, and performance data. |
| Cluster State | The state of the cluster used to run the job, such as Running, Terminated, or Pending. |
| Creator User | The user who created the job, shown as their registered username or email. |
| Trigger | How the job was initiated, such as PERIODIC (scheduled job) or ONE-TIME (manual trigger). |
| Runtime Engine | The type of engine running the job, typically Photon or Standard. |
| Job ID | The unique identifier for the job instance. |
| Run ID | A system-generated ID that tracks each specific run of the job. |
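The Duration field can be illustrated with a small sketch: duration runs from Start Time to End Time, or to the current time when End Time is empty because the job is still running. The timestamp format here is an assumption for illustration.

```python
from datetime import datetime, timezone

def job_duration(start_time, end_time=None):
    """Duration as shown in the Duration column: start to end, or
    start to 'now' when End Time is empty (job still running)."""
    start = datetime.fromisoformat(start_time)
    end = datetime.fromisoformat(end_time) if end_time else datetime.now(timezone.utc)
    return end - start

print(job_duration("2024-05-01T09:00:00+00:00", "2024-05-01T09:45:30+00:00"))
# 0:45:30
```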

Other Notable Features:

| Feature | Description |
| --- | --- |
| Top 20 Jobs | The Job Studio page offers pre-configured views such as Top 20 Expensive Jobs and Long Running Jobs, allowing users to quickly identify resource-intensive tasks. |
| Download Functionality | Users can export job data in multiple formats using the Download option, enabling further review, analysis, reporting, or sharing with team members outside of the ADOC platform. |
| Filter Section | Users can apply multiple filters to narrow down the list of jobs based on criteria such as status, creator, or runtime engine. These filters can be combined for more granular searches. |
| Job Review | Once the desired jobs are filtered, users can explore detailed information such as cost breakdowns, job triggers, and run times. For deeper analysis, users can access the Run Page URL. |
| Graph Analysis | The top graph visually represents job performance over time, allowing users to quickly understand trends and investigate potential performance bottlenecks or issues. |

Job Details Page

The Job Details page provides users with an in-depth view of their Databricks job's performance and operational metrics. This page offers key insights into both driver and executor performance, trends over time, resource usage, and potential areas for optimization.

Here is a breakdown of the features and data sections visible on the page:

Summary

This section provides a high-level overview of the costs associated with running the job on the selected Databricks cluster. The data presented includes the following key details:

| Widget | Description |
| --- | --- |
| Actual Databricks Cost | The total cost incurred from using Databricks resources for this specific job. In this example, the cost is $24.10. This cost is calculated based on the consumption of Databricks Units (DBUs) and other Databricks platform resources used during the job's execution. |
| Actual Vendor Cost | Any additional costs from using external or third-party vendor resources in conjunction with the job. In this case, the vendor cost is $23.54. Vendor costs can include items such as cloud storage or virtual networks provided by external vendors. |
| Total Cost | The sum of the Databricks and vendor costs, providing a comprehensive view of the total cost of running the job. In this example, the total cost is $47.64. |
| Cluster ID | The unique identifier for the specific cluster on which the job was executed. The Cluster ID is important for tracking and analyzing job performance across different clusters. |
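The Total Cost figure is simply the sum of the Databricks and vendor costs; using the example values above:

```python
def total_job_cost(databricks_cost, vendor_cost):
    """Total Cost widget: Databricks cost plus vendor cost, in USD."""
    return round(databricks_cost + vendor_cost, 2)

# Figures from the example in this section
print(total_job_cost(24.10, 23.54))  # 47.64
```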

Vendor Cost Breakdown

This section provides a detailed breakdown of costs incurred from using third-party vendor resources in conjunction with a data processing job. These costs are in addition to platform-specific costs (such as Databricks) and typically cover external infrastructure services used during the job’s execution. The data includes the following key cost components:

| Widget | Description |
| --- | --- |
| Virtual Machines Cost | The cost associated with running virtual machines provided by an external cloud vendor. These machines may be used for compute tasks, supporting services, or extensions outside the primary processing environment. |
| Storage Cost | Charges for storing data externally, such as intermediate files, logs, or outputs. Storage cost can vary based on data size, storage type (standard or premium), and duration. |
| Virtual Network Cost | The cost of using virtual network infrastructure, such as private IPs, VPC peering, or internal communication between services. These costs apply when network traffic routes through vendor-managed infrastructure. |
| Bandwidth Cost | The cost of data transferred between systems or across network boundaries, especially when large volumes of data move between the processing environment and external systems or storage. |

New Jobs: Job run details, including execution time and resource consumption metrics, are available only if the user has enabled the Databricks initialization (init) script. This ensures that the necessary monitoring and metrics collection tools are in place before jobs are executed.

Historical Data: For jobs that were executed prior to onboarding the data source or enabling the initialization script, detailed job run metrics and resource utilization data will not be available. Only jobs executed after the onboarding process or script enablement will show detailed metrics in the Job Details Page.

ADOC Recommendation: Ensure that the initialization script is configured at the time of data source onboarding to capture detailed job metrics for future analysis.
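For reference, an init script is attached through the cluster specification's init_scripts field in the Databricks Clusters API. The sketch below is illustrative only: the cluster name, runtime version, and script path are placeholders, and the actual destination and script contents come from your ADOC onboarding instructions.

```python
# Sketch of a cluster spec that attaches an init script via the Databricks
# Clusters API (clusters/create). All values below are placeholders; use the
# script path supplied during ADOC data source onboarding.
cluster_spec = {
    "cluster_name": "adoc-monitored-cluster",   # placeholder name
    "spark_version": "13.3.x-scala2.12",        # example runtime version
    "node_type_id": "Standard_DS3_v2",          # instance type from this doc
    "num_workers": 2,
    "init_scripts": [
        {"workspace": {"destination": "/Shared/adoc-init.sh"}}  # placeholder path
    ],
}
```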

Node Size Recommendations

Node Size Recommendations in Databricks Compute suggest how to size Spark executor nodes based on cost, runtime performance, and workload characteristics. These recommendations assist users in optimizing resource allocation and improving job execution efficiency.

Where do Node Size Recommendations Apply?

Static Clusters: Recommendations are offered for jobs running on static clusters where the number of workers is predefined and fixed throughout the execution. The system offers suggestions for:

  • Optimal number of cores
  • Memory per executor
  • Number of workers

Auto-Scale Clusters: For clusters with autoscaling enabled, Databricks automatically scales up or down based on resource needs. In this case, node size recommendations offer:

  • Minimum and maximum worker configurations
  • Estimated job completion time for various configurations
  • Cost estimation for each worker configuration

What are the key metrics that drive the Node Recommendations?

Node size recommendations rely on Spark Job Performance metrics such as:

| Metric | Description |
| --- | --- |
| CPU Utilization | High CPU usage indicates that more cores per executor are required, while low CPU usage leads to a recommendation for fewer cores. |
| Memory Utilization | If memory usage is high, the system suggests increasing the memory per executor. Conversely, low memory usage suggests reducing the allocated memory to avoid resource wastage. |
| Shuffle Operations | The recommendations also take into account the shuffle fetch wait time and shuffle remote bytes read, which affect the need for additional executors. |
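To make the CPU Utilization rule concrete, here is a toy heuristic in the same spirit. The thresholds and scaling factors are illustrative assumptions, not ADOC's actual model.

```python
def recommend_cores(avg_cpu_pct, current_cores):
    """Toy heuristic mirroring the CPU Utilization rule: scale cores up
    when CPU runs hot, down when it idles. Thresholds are illustrative."""
    if avg_cpu_pct > 80:
        return current_cores * 2       # CPU-bound: add cores
    if avg_cpu_pct < 30:
        return max(1, current_cores // 2)  # idle: shrink, keep at least 1
    return current_cores               # in range: keep as-is

print(recommend_cores(90, 4))  # 8
print(recommend_cores(15, 4))  # 2
print(recommend_cores(55, 4))  # 4
```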

It is important to understand the conditions under which node size recommendations are not available:

Single Node Clusters: No recommendations are made for clusters with a single node.

Jobs Without Spark Stages: If a Databricks job does not contain Spark stages, such as non-Spark jobs, no recommendations will be made.

Failed or Cancelled Jobs: Recommendations are unavailable for failed or cancelled jobs, as the Spark context required for analysis is not accessible.

All-Purpose Clusters: Jobs operating on all-purpose clusters do not receive node size recommendations, as these clusters auto-scale dynamically, making static recommendations less useful.

Driver and Executor Summary

This section provides detailed information about the driver used during the job execution. The driver is responsible for managing and orchestrating tasks across executors in a distributed computing environment like Databricks.

| Widget | Description |
| --- | --- |
| Name | The unique identifier for the driver instance. This name helps track and reference the specific driver used for the job. The driver instance is usually associated with the cluster where the job was executed. |
| User | The user account that initiated or controlled the driver. In this case, the user is root, indicating that the job was run with administrative or elevated permissions. |
| Duration | How long the driver was active during the job's execution. In this example, the driver was active for 14.10 minutes. This metric is crucial for understanding the time taken by the driver to manage the job's execution and resource distribution. |
| Max Heap Used | The maximum amount of heap memory consumed by the driver during the job. In this case, the driver used up to 5.32 GB of heap memory. Heap memory is critical for the driver's performance, as it is used for object creation, caching, and other memory-intensive tasks. |
| Instance Type | The type of virtual machine or hardware configuration used for the driver instance. The instance type specifies the resources (such as CPU and memory) allocated to the driver and impacts its overall performance and efficiency. |
| Cores | The number of CPU cores allocated to the driver. In this case, the driver is using 4 cores. More cores typically allow for better multitasking and parallel processing. |
| Memory Available | The total memory allocated to the driver. Here, the driver has 8.62 GB of memory available. This is crucial for handling data processing and managing job tasks effectively. |
| Jobs | The number of jobs processed by the executors. In this case, the executors processed 6,244 jobs. Executors are responsible for running the actual tasks associated with the job. |
| Stages | The number of stages executed by the job. This job completed 6,244 stages, which represent different phases of job execution, such as shuffling, sorting, or aggregating data. |
| Max Used Memory | The peak memory usage by the executors during job execution. In this case, the maximum memory used was 743.184 MB. Monitoring this helps ensure that executors are not running out of memory during execution, which could cause performance degradation or task failures. |
| Instance Type (executor) | The type of instance used for the executors, in this case Standard_DS3_v2. This type specifies the compute and memory configuration, impacting the executors' ability to handle tasks. |
| Cores per Instance | The number of CPU cores available for each executor instance. In this example, each executor is allocated 4 cores, allowing for concurrent processing of tasks. |
| Memory Available (per executor) | The total memory available per executor instance, which is 8.67 GB in this case. Sufficient memory is essential for efficient data processing and task execution. |
| Total Instances | The total number of executor instances used in the job. Here, there is 1 executor instance. Increasing the number of instances can improve job performance by parallelizing tasks across more resources. |

Executor Node Recommendation

The Executor Node Recommendation section provides guidance on the optimal configuration of executor nodes based on different optimization criteria such as cost, runtime, or a balanced approach. The section also offers recommendations for both Auto-Scale and Static Cluster configurations. These recommendations help users optimize job performance while managing resource usage and cost.

Optimization Types

Node size recommendations in the Executor Node Recommendation widget are provided for different optimization strategies:

  1. Cost-Optimized Recommendation: Aims to reduce resource costs while maintaining acceptable performance.
  2. Runtime-Optimized Recommendation: Focuses on minimizing job execution time, possibly at the expense of higher costs.
  3. Balanced Recommendation: Strikes a balance between cost efficiency and performance.
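A sketch of how a consumer might select among these strategies, using the example figures quoted in this section. The record shape is a hypothetical illustration, not ADOC's data model.

```python
# Hypothetical records shaped after the example figures in this section:
# instance type, estimated minutes, worker bounds, and vendor cost (USD).
RECOMMENDATIONS = {
    "cost":     {"instance": "Standard_D3_v2",  "est_minutes": 12.79,
                 "min_workers": 1, "max_workers": 5, "vendor_cost": 0.07},
    "runtime":  {"instance": "Standard_DS3_v2", "est_minutes": 12.79,
                 "min_workers": 2, "max_workers": 3, "vendor_cost": 0.14},
    "balanced": {"instance": "Standard_DS3_v2", "est_minutes": 12.79,
                 "min_workers": 2, "max_workers": 3, "vendor_cost": 0.14},
}

def pick(strategy):
    """Return the recommended auto-scale configuration for a strategy."""
    return RECOMMENDATIONS[strategy]

print(pick("cost")["vendor_cost"])  # 0.07
```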

The section contains the following widgets and sections:

Recommendations for Auto-Scale Cluster Configuration

| Section | Description |
| --- | --- |
| Recommendation | The optimization goal (e.g., Cost Optimized, Runtime Optimized, or Balanced Optimized). |
| Instance Type | The type of virtual machine instance recommended for the job. |
| Estimated Time | The expected time to complete the job with the recommended instance configuration. In this example, the estimated time is approximately 12.79 minutes for all recommendations. |
| Min Worker Count | The minimum number of workers allocated when the job starts. For the Cost Optimized recommendation, the minimum worker count is 1, while for Runtime Optimized and Balanced Optimized it is 2. |
| Max Worker Count | The maximum number of workers that can be dynamically added as the job's demand grows. In the Cost Optimized configuration, the maximum worker count is 5, while for the other configurations it is 3. |
| Vendor Cost | The estimated cost incurred by the vendor (e.g., cloud service provider) for running the job. In the Cost Optimized recommendation, the cost is $0.07, whereas for Runtime Optimized and Balanced Optimized the cost is $0.14. |

Recommendations for Static Cluster Configuration

| Section | Description |
| --- | --- |
| Recommendation | The optimization goal for the static cluster configuration (e.g., Cost Optimized, Runtime Optimized, or Balanced Optimized). |
| Instance Type | The recommended virtual machine instance type (e.g., Standard_DS3_v2). |
| Estimated Time | The estimated time to complete the job with the given instance configuration, which is 12.79 minutes for all configurations. |
| Worker Count | The fixed number of workers allocated for the entire duration of the job. For all optimization types, the recommended worker count is 2. |
| Vendor Cost | The estimated cost for the static cluster configuration, which remains constant at $0.14 for all optimization types. |

Recommendation Based on Different Instance Types with Auto-Scale

| Section | Description |
| --- | --- |
| Instance Type | Lists the different instance types that can be used for the job. |
| Estimated Time | The expected time to complete the job with the specified instance type. In this case, the estimated time for all configurations is around 12.79 minutes. |
| Min Worker Count | The minimum number of worker nodes assigned at the beginning of the job. Some instance types recommend starting with 1 worker, while others recommend 2. |
| Max Worker Count | The maximum number of workers that can be dynamically added to scale the job. For certain instances, such as Standard_DS3_v2, the maximum worker count is 3, while for others it is 5. |
| Vendor Cost | The estimated vendor cost associated with using each instance type. For the Standard_DS3_v2 instance, the cost is around $0.14, while for Standard_D3_v2 the cost is $0.07. |

Trends

The Trends section visualizes key metrics over time for the job run.

  • Executor Memory: Shows the amount of memory used by the executors.
  • Executor Cores: Displays the number of CPU cores utilized by the executors.
  • Input Bytes Read: Reflects the amount of input data read by the job.

These metrics help monitor resource consumption and job performance. Users can also use the Compare Runs option to analyze and compare these trends across different job runs for deeper insights.

Limits

The Limits section provides insights into the scalability constraints of a Spark application by analyzing three key metrics:

Wall Clock Time:

Driver Wall Clock Time: Measures the time spent by the driver in coordinating the job execution.

Executor Wall Clock Time: Shows the total time spent by executors in processing tasks.

Total Wall Clock Time: The combined time for the driver and executors, reflecting the overall job duration.

Ideal Times:

Critical Path: Represents the minimum time required for the job to complete under ideal conditions.

Ideal Application Time: The estimated optimal runtime for the application based on resource availability.

Actual Runtime: The real execution time taken by the application.

OCCH (One Core Compute Hour)

  • Displays the available and wasted compute hours for the job.
  • OCCH wasted by the Executor and Driver highlights inefficiencies in resource usage.
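
As a back-of-the-envelope illustration of the OCCH idea, one core held for one hour is 1.0 OCCH, and wasted OCCH is simply allocated core-hours minus core-hours actually spent on tasks. The run sizes and the 60% utilization figure below are assumptions for the example, not values from the product:

```python
def occh(cores: int, seconds: float) -> float:
    """One Core Compute Hours: one core held for one hour equals 1.0 OCCH."""
    return cores * seconds / 3600.0

# Hypothetical run: 8 executor cores and 1 driver core held for 30 minutes,
# with executors busy on tasks only 60% of that time.
available = occh(8, 1800) + occh(1, 1800)   # 4.5 OCCH allocated
used = occh(8, 1800 * 0.6)                  # 2.4 OCCH spent on tasks
wasted = available - used                   # 2.1 OCCH idle or coordinating
```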

Metrics

The Metrics section provides detailed insights into the performance of the Spark executors. It consolidates various performance metrics to help users analyze resource usage and job behavior.

  • Storage Memory: Shows how much memory is allocated, used, and available for both on-heap and off-heap memory, helping monitor the memory footprint of the job.
  • Schedule Information: Tracks active tasks and thread pool size over time, providing insights into task execution concurrency and thread utilization.
  • Bytes Read/Written: Displays the total amount of data read and written by the job, giving a clear view of input/output performance.
  • File System Bytes Read/Written: Highlights the bytes read and written directly from and to the filesystem, helping identify heavy I/O operations.
  • Shuffle Information: Provides details on shuffle operations, including bytes written and bytes read from both local and remote sources. Shuffle operations are critical for job performance, especially in distributed data processing.
  • Spark JVM GC and CPU Time: Visualizes the time spent on JVM Garbage Collection (GC) and CPU time, which are key indicators of system performance. High GC times may suggest inefficient memory usage, while CPU time reflects the computational load.
  • Records Read/Written: Displays the number of records read and written during job execution, helping measure data throughput.
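
Several of these metrics (GC time, shuffle bytes, records read and written) are also exposed per executor by Spark's own monitoring REST API, which can be handy for cross-checking. A minimal sketch; `base_url` and `app_id` are placeholders, and the 10% GC threshold is an illustrative rule of thumb rather than a product default:

```python
def executors_endpoint(base_url: str, app_id: str) -> str:
    """Spark monitoring REST API path listing per-executor metrics."""
    return f"{base_url}/api/v1/applications/{app_id}/executors"

def gc_fraction(total_gc_time_ms: int, total_duration_ms: int) -> float:
    """Fraction of task time spent in JVM garbage collection."""
    return total_gc_time_ms / total_duration_ms if total_duration_ms else 0.0

# An executor spending more than ~10% of task time in GC is usually worth a look.
print(gc_fraction(total_gc_time_ms=150, total_duration_ms=1000) > 0.10)  # True
```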

Spark Details Aggregate Metrics

  • Task Duration (milliseconds): Total time spent by the task, starting from its creation.
  • JVM GC Time (milliseconds): Amount of time spent in GC while this task was in progress.
  • Executor CPU Time (nanoseconds): CPU time the executor spent running this task. This includes time fetching shuffle data.
  • Executor Deserialize CPU Time (nanoseconds): CPU time taken on the executor to deserialize this task.
  • Executor Deserialize Time (milliseconds): Elapsed time spent deserializing this task.
  • Executor Runtime (milliseconds): Total time spent by the executor core running this task.
  • Peak Execution Memory (bytes): Maximum execution memory used by the task.
  • Input Bytes Read (bytes): Number of bytes read by the task (using read APIs).
  • Output Bytes Written (bytes): Number of bytes written by the task (using write APIs).
  • Disk Bytes Spilled (bytes): Size of bytes spilled to disk (can differ if compressed).
  • Memory Bytes Spilled (bytes): Number of bytes that were spilled to disk during the task.
  • Result Size (bytes): The number of bytes sent by the task back to the driver.
  • Result Serialization Time (milliseconds): Elapsed time spent serializing the task result.
  • Shuffle Read Bytes Read (bytes): Total bytes read by the task for shuffle data.
  • Shuffle Read Fetch Wait Time (milliseconds): Time spent by the task waiting for shuffle data.
  • Shuffle Read Local Blocks (number): Shuffle blocks fetched from the local machine (disk access).
  • Shuffle Read Records Read (number): Total records read by the task for shuffle data.
  • Shuffle Read Remote Blocks (number): Shuffle blocks fetched from remote machines (network access).
  • Shuffle Write Bytes Written (bytes): Total shuffle bytes written by the task.
  • Shuffle Write Records Written (number): Total shuffle records written by the task.
  • Shuffle Write Time (nanoseconds): Amount of time spent by the task writing shuffle data.
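
Two of these aggregates are common trouble signals: any non-zero Memory Bytes Spilled means a task ran out of execution memory, and a large Shuffle Read Fetch Wait Time means tasks stalled waiting on remote shuffle blocks. A small sketch of that reading (the dictionary keys and the one-second threshold are illustrative, not product field names):

```python
def flag_task(metrics: dict, fetch_wait_threshold_ms: int = 1000) -> list:
    """Return human-readable flags for a task's aggregate metrics."""
    flags = []
    if metrics.get("memory_bytes_spilled", 0) > 0:
        flags.append("spill")          # execution memory was exhausted
    if metrics.get("shuffle_read_fetch_wait_time_ms", 0) > fetch_wait_threshold_ms:
        flags.append("shuffle-wait")   # stalled fetching remote shuffle blocks
    return flags

print(flag_task({"memory_bytes_spilled": 4096,
                 "shuffle_read_fetch_wait_time_ms": 2500}))
# ['spill', 'shuffle-wait']
```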

Spark SQL Executions

This table provides details about each Spark SQL execution within the job.

  • Execution ID: A unique identifier for each SQL execution, used to track and differentiate executions.
  • Description: A brief description of the SQL query or operation being executed, which provides insight into the query or code being run (e.g., the specific pyspark.sql.functions being used).
  • Start Time: The exact timestamp when the SQL execution started, helping track when the query began processing.
  • End Time: The timestamp when the SQL execution completed, giving the total execution duration for that query.
  • Duration: The total time taken for the execution to complete, displayed in milliseconds or seconds depending on the length of the query execution.
  • State: The current status of the SQL execution, which can be Running, Completed, or other statuses depending on the progress of the query.
  • More details: A link that provides additional details about the specific SQL execution, including deeper insights into the query's performance and execution plan.
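
The Duration column is derived directly from the two timestamps: End Time minus Start Time. A minimal sketch with hypothetical ISO-8601 timestamps:

```python
from datetime import datetime

def execution_duration_ms(start: str, end: str) -> int:
    """Duration of a SQL execution, in milliseconds, from ISO-8601 timestamps."""
    delta = datetime.fromisoformat(end) - datetime.fromisoformat(start)
    return int(delta.total_seconds() * 1000)

print(execution_duration_ms("2024-05-01T10:00:00", "2024-05-01T10:00:12.500"))  # 12500
```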

Stages

The Stages section provides insights into the different stages of a Spark application. There are two available views: List and Timeline.

  • The List tab displays a detailed breakdown of each stage in a table format, providing insights into the tasks and performance of each stage.
  • The Timeline tab shows the stages of the application in a visual format, where each stage is represented as a horizontal bar.

Driver & Executor Stats

This section shows driver and executor CPU and memory usage. All executors are listed in the charts.

  • CPU Usage (Driver & Executor): These charts display the percentage of CPU usage over time for both the driver and the executors. Monitoring CPU usage helps identify whether resources are underutilized or overutilized.
  • Memory Usage (Driver & Executor): This chart shows the memory usage percentage for both the driver and the executor, helping users track how efficiently memory is being consumed.
  • Heap Usage (Driver & Executor): Displays the heap memory usage over time, which is important for identifying potential memory leaks or inefficiencies in memory allocation.
  • Core Wastage (Driver & Executor): Tracks the number of CPU cores that are being wasted during job execution. High core wastage may indicate inefficient resource allocation or over-provisioning.
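
When reading these charts, sustained extremes matter more than brief spikes. The sketch below captures the usual interpretation of average CPU usage; the 20%/80% bands are illustrative thresholds, not product defaults:

```python
def classify_cpu(avg_usage_pct: float) -> str:
    """Rough reading of average CPU usage from the Driver & Executor charts."""
    if avg_usage_pct < 20:
        return "underutilized"   # consider smaller or fewer nodes
    if avg_usage_pct > 80:
        return "overutilized"    # consider scaling up or out
    return "healthy"

print(classify_cpu(12.0))  # underutilized
```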

Other Widgets and Interactive Features

  • Compare Runs: Allows users to compare the current job run with previous runs. This comparison can help identify performance improvements or degradations over time.
  • Spark SQL Executions: If applicable, provides detailed statistics on the execution of SQL queries run during the job. This can help users analyze how efficiently SQL tasks were executed.
  • Stages: Displays detailed information about the different stages of the job, helping users track progress and identify potential bottlenecks in the pipeline.
  • Driver & Executor Summary: Users can review the performance of both the driver and executors in detail, allowing for comprehensive analysis of memory and resource usage.
  • Metric Analysis: Users can explore the different metrics that provide insights into job execution. Metrics such as Memory Usage and Heap Usage are critical for performance tuning and identifying resource constraints.
  • Trends and Stages: When available, users can use the Trends section to understand job behavior over time. The Stages section helps in identifying slow or inefficient stages of the job.
  • Comparison: Users can make use of the Compare Runs feature to evaluate how job performance has changed over time, especially after modifying configurations or code optimizations.

All Purpose Cluster

All Purpose Cluster Tab

Key Features and Sections

Clusters Name Panel

The left sidebar displays a list of all the clusters along with their corresponding total costs. Users can select a specific cluster to view more detailed information, including the total cost breakdown by date.

  • Search Bar: Users can search for a specific cluster by entering the cluster name in the search box, which filters the displayed clusters accordingly.
  • Cluster Name List: Displays the names of the clusters along with their total costs. The selected cluster is highlighted, and the total cost is updated accordingly in the graphical and tabular sections.

Graphical Representation of Total Costs

A bar chart visualizes the Total Cost for the selected cluster over time. The x-axis represents the dates, and the y-axis shows the costs in USD. This graph allows users to quickly identify cost patterns and spikes on specific dates.

  • The chart title updates based on the selected cluster.
  • Hovering over individual bars provides detailed cost information for that specific date.

Cost Breakdown Table

A detailed table below the graph breaks down the Databricks Cost, Vendor Cost, and Total Cost by date for the selected cluster. Users can sort the table by each column to view the highest or lowest costs over time.

  • Date: The date on which the cost was incurred.
  • Databricks Cost: The cost associated with Databricks resources for that day.
  • Vendor Cost: Any costs related to third-party vendors for that specific day.
  • Total Cost: The sum of the Databricks Cost and Vendor Cost for each day.
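
The last column is a straightforward per-day sum of the other two. A sketch with hypothetical rows:

```python
# Hypothetical daily rows mirroring the cost breakdown table.
rows = [
    {"date": "2024-05-01", "databricks_cost": 1.20, "vendor_cost": 0.80},
    {"date": "2024-05-02", "databricks_cost": 0.90, "vendor_cost": 0.60},
]

# Total Cost = Databricks Cost + Vendor Cost, per day.
for row in rows:
    row["total_cost"] = round(row["databricks_cost"] + row["vendor_cost"], 2)

print([r["total_cost"] for r in rows])  # [2.0, 1.5]
```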

Interaction Workflows

Cluster Selection: From the left sidebar, users can select a cluster to analyze. Upon selection, the graph and table will automatically update to reflect the costs for the chosen cluster.

Graph Analysis: The graph provides a visual summary of the total costs over time, allowing users to quickly identify cost spikes or changes in resource usage.

Detailed Cost Breakdown: Users can scroll through the detailed cost breakdown in the table, sorting by any column to identify key cost drivers on a daily basis.

Exporting Data: By using the Download button, users can easily export the cost data for further analysis or integration into external reporting tools.
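
Once exported, the sorting the table offers can be reproduced offline. A sketch using Python's standard csv module; the column names and rows are hypothetical and should be adjusted to the actual export layout:

```python
import csv
import io

# Hypothetical export matching the cost breakdown table.
exported = """date,databricks_cost,vendor_cost,total_cost
2024-05-01,1.20,0.80,2.00
2024-05-02,2.10,1.40,3.50
"""

# Sort descending by total cost to surface the biggest cost drivers first.
rows = list(csv.DictReader(io.StringIO(exported)))
rows.sort(key=lambda r: float(r["total_cost"]), reverse=True)
print(rows[0]["date"])  # 2024-05-02 -- the costliest day
```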

The Databricks Compute interface provides a comprehensive and versatile set of tools and insights that empower users to manage and optimize their Databricks environments effectively. Through various tabs—Overview, Clusters, Job Studio, and All Purpose Cluster—users gain the ability to monitor and analyze critical metrics such as cluster states, job performance, resource utilization, and associated costs.

These tabs, combined with enhanced filter functionalities and interactive data visualizations, give users the ability to make informed, data-driven decisions that improve operational efficiency, optimize resource allocation, and manage costs effectively. The intuitive interface and powerful insights make it easier to detect potential issues, resolve errors, and fine-tune both workflows and infrastructure for optimal performance.

Job Runs

Job Runs Tab

This page provides an overview of Job Runs within the Databricks environment. It includes a detailed table listing information about completed and ongoing jobs, with multiple filtering options to narrow down the data. Key sections and features of the page include:

Filters

  • Cluster Type: Allows filtering by the type of cluster used, such as job clusters or all-purpose clusters.
  • Status: Filters jobs based on their status, such as Success, Failed, Canceled, or Running.
  • Owner: Filters jobs by the user who initiated or owns the job.

Job Runs Aggregate Table

  • Cluster Name: The name of the cluster associated with the job run.
  • Cluster ID: The unique identifier for the cluster.
  • Cluster Type: Indicates the type of cluster (e.g., job_cluster or all_purpose_cluster).
  • Job Name: The name of the job or query being run.
  • Status: The completion status of the job (e.g., SUCCESS, FAILED, CANCELED).
  • Job ID: The unique identifier of the job run.
  • Duration: Specifies the time taken to complete the job.
  • DBU Consumed: Indicates the Databricks Units consumed during the job run.
  • Estimated Databricks Cost: The estimated cost incurred for Databricks usage.
  • Estimated Vendor Cost: The estimated cost incurred from the underlying cloud vendor.
  • Start Time: The time when the job run started.
  • End Time: The time when the job run ended.
  • Executor Heap Used %: The percentage of heap memory utilized by the executor during job execution.
  • CPU Used %: The percentage of CPU resources consumed by the executor.
  • Executor Memory: The total memory allocated to the executor for processing tasks.
  • Diagnostics: Additional diagnostic information or status for the job, such as errors or workload details.
  • Owner: The email address or identifier of the person who owns or initiated the job.
  • App Id: A unique identifier assigned to each application or job instance.
  • App Name: The name or reference label associated with the job or application.
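
Comparable run records can also be pulled from the Databricks Jobs API (`/api/2.1/jobs/runs/list`) for offline analysis. The sketch below only builds the request URL; the workspace host is a placeholder, and authentication (a bearer token) would still be needed to actually call it:

```python
from urllib.parse import urlencode

def runs_list_url(host: str, limit: int = 25, active_only: bool = False) -> str:
    """URL for the Databricks Jobs API endpoint that lists job runs."""
    query = urlencode({"limit": limit, "active_only": str(active_only).lower()})
    return f"https://{host}/api/2.1/jobs/runs/list?{query}"

print(runs_list_url("example.cloud.databricks.com", limit=5))
```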

DLT Pipelines

DLT Pipelines Tab

This page provides an interface for managing and monitoring Delta Live Tables (DLT) pipelines. It displays a list of pipelines along with key details such as their current state, execution mode, recent run information, and performance metrics. You can apply filters to refine the pipeline list based on specific criteria.

Filters

  • Current State: Filter by the pipeline's operational status (e.g., Idle, Running, Failed).
  • Owner: Filter by the user or team managing the pipeline (e.g., usernames, email addresses).

DLT Pipelines Aggregate Table

  • Name: The name of the Delta Live Tables (DLT) pipeline.
  • Current State: The current operational state of the pipeline (e.g., Idle).
  • Owner: The user who owns or manages the pipeline.
  • Pipeline Execution: The environment in which the pipeline is executed (e.g., Development).
  • Pipeline Mode: The mode in which the pipeline runs (e.g., Triggered).
  • Total Runs: The total number of times the pipeline has executed.
  • Last Job Run: A link or identifier for the most recent pipeline job execution.
  • Last Run State: The result of the most recent run (e.g., Failed, Completed).
  • Last Run Duration: The amount of time the last run took to complete.
  • Last Run Start Time: The date and time when the most recent run started.
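
Summarizing the Last Run State column amounts to grouping the table by one field, which is what the state filter does under the hood. A sketch with hypothetical pipelines:

```python
from collections import Counter

# Hypothetical pipeline rows mirroring the aggregate table.
pipelines = [
    {"name": "bronze_ingest", "last_run_state": "Completed"},
    {"name": "silver_clean",  "last_run_state": "Failed"},
    {"name": "gold_agg",      "last_run_state": "Completed"},
]

by_state = Counter(p["last_run_state"] for p in pipelines)
print(by_state["Completed"], by_state["Failed"])  # 2 1
```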

Clicking a pipeline name opens the Pipeline Run Details side panel, displaying key information about the pipeline, such as:

Pipeline Run Details

  • Pipeline Name: The name of the pipeline being executed.
  • Cluster ID: The unique identifier of the cluster associated with the pipeline run.
  • State: The current status of the pipeline run (e.g., FAILED, SUCCESSFUL).
  • Cause: The reason for the pipeline run state (e.g., JOB_TASK).
  • Start Time: The date and time when the pipeline run started.
  • End Time: The date and time when the pipeline run ended.
  • Is Validate Only: Indicates whether the run was for validation only (true or false).
  • Is Full Refresh: Specifies whether the pipeline run performed a full refresh (true or false).
  • Execution: Details about the execution.
  • Databricks Cost: The cost associated with Databricks resources for the pipeline run.
  • Cloud Vendor Cost: The cost incurred for cloud resources used in the pipeline run.
  • Total Cost: The combined total cost of Databricks and cloud vendor resources.