Spark Job Details

The Spark Job Details page contains the following panels:

  • Summary Panel
  • Job Trends
  • Configurations
  • Spark Stages
  • Timeseries Information
  • Reports
  • Application Logs

Note

If you cannot view the Spark Job Details page, follow the instructions displayed on the page to gain access to the Spark job details.

Summary Panel

The top panel displays the following information:

Name | Description
User | The name of the user who ran the job.
Final Status | The status of the job. The state can be one of the following: Succeeded, Failed, Finished, Finishing, Killed, Running, or Scheduled. Click the status to view the list of jobs.
Start Time | The time at which the application started running.
Duration | The time taken to run the query.
# of Jobs | The number of jobs in the application.
# of Stages | The number of stages of the Spark job. Click the number to view the list of stages.
# Inefficient Stages | The number of inefficient stages. Click the number to view the list of inefficient stages.
# of Tasks | The total number of tasks the stages are broken into.
Avg Memory | The average memory used by the application of the selected user.
Avg VCore | The average VCore used by the application of the selected user.
Scheduling Delay | The time taken to start a task.

Job Trends

The Job Trends panel displays a chart showing the pattern of jobs running at a particular time.

The x-axis denotes the time at which the user executed a job.

The following table describes the factors displayed in the Job Trends chart:

Name | Description
Elapsed Time | The time taken to run the jobs at a particular time.
VCores | The number of VCores used to run the job.
Memory | The amount of memory used to run the job.
Input Read | The size of the input dataset.
Output Written | The size of the output written to a file format.

You can choose the metrics displayed in the Job Trends chart. To choose the metrics, perform the following:

  1. Click the Comparison metrics drop-down menu in the chart. The list of metrics is displayed.
  2. Select the metrics you want to view. The data for the selected metrics is displayed in the chart.
  3. (Optional) To add metrics, click the add icon and select the metrics from the list.
  4. (Optional) To remove selected metrics, click the remove icon corresponding to the metrics you want to remove.

You can switch between bar chart view and line chart view.

  • Click the bar chart icon to view the Job Trends as a bar chart.
  • Click the line chart icon to view the Job Trends as a line chart.

Configuration Difference

Click Compare Runs to compare different configurations of the job. Select the runs that you want to compare from the drop-down list. You can choose from up to 10 previous runs of the job. Apart from the previous 10 runs, you can also compare an execution of the job with any other execution: click the Enter App ID button and enter the application ID, followed by an underscore and then the original attempt, in the following format.

app.id_originalattempt
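For example, with a hypothetical application ID of application_1680000000000_0042 and an original attempt of 1, you would enter application_1680000000000_0042_1.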

Configurations

With Configurations, you can view the Job Configurations and Anomalous hosts.

Job Configurations

Job Configurations displays the Current Value and the Recommended Value for the following parameters:

Metric | Description
# Cores | The number of cores in the current job.
# Executors | The number of executors in the current job.
# Executor Memory | The amount of memory used by a job executor.
Driver # Cores | The number of driver cores.
Driver Memory | The amount of memory used by the driver.
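These parameters correspond to standard Spark configuration properties. The following is a minimal sketch, assuming you apply the recommended values through the usual Spark configuration keys; the values shown are hypothetical, so substitute the Recommended Value reported for your job:

  from pyspark.sql import SparkSession

  # Hypothetical values; replace with the Recommended Value shown for your job.
  spark = (
      SparkSession.builder
      .appName("tuned-job")
      .config("spark.executor.cores", "4")        # # Cores
      .config("spark.executor.instances", "10")   # # Executors
      .config("spark.executor.memory", "8g")      # # Executor Memory
      .config("spark.driver.cores", "2")          # Driver # Cores
      .config("spark.driver.memory", "4g")        # Driver Memory
      .getOrCreate()
  )

Note that spark.executor.instances typically takes effect only when dynamic allocation is disabled; with dynamic allocation enabled, Spark manages the executor count.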

Node Issues

This section displays the unhealthy nodes of this Spark application. Blacklisted nodes, nodes on which kernel errors are reported, and nodes whose disk utilization is close to full are some examples of what makes a node unhealthy. This section displays a table with two columns: the HEALTH REPORT column displays the reasons why a node is marked unhealthy, and the NODE LIST column displays the list of unhealthy nodes.

Anomalies

The Anomalies board displays system metrics for the hosts used by the Spark job within the duration of that job. A host can be impacted by the usage of CPU, Memory, Network, or Disk.

With Anomalous data, you can monitor the host performance and make predictions on memory, CPU, Network I/O, and disk usage.

To view more details about Anomalous hosts, click the host link in the Anomalies tab.

You can detect anomalies based on the following metrics.

If an anomaly exists, the associated chart is highlighted with the number of anomalies detected.

Metric | Description
CPU Usage | The processor capacity usage of the job on the host.
Memory Usage | The RAM usage of the job on the host.
Network I/O | The network status of the job on the host, displayed as Sent Bytes and Received Bytes.
Disk Usage | The host storage currently in use by the Spark job, displayed as Write Bytes and Read Bytes.

Error Details

The error information for a job that failed to complete is displayed. This panel is displayed only for a job with a final status of FAILED.

To copy the error details, perform the following:

  1. Click the copy icon. A confirmation dialog box appears.
  2. Click Ok. The error details are copied to the clipboard.

Concurrent Running Apps

This table displays the list of applications that were running in parallel with the selected application. You can view the following details in this section.

Column Name | Description
ID | The unique ID of the application that was running in parallel. You can click the ID of the application to drill down to the details page of that application.
Original Attempt | The number of attempts made by the application to execute.
Wait Time | The time for which the application was waiting for resources.
Time Taken | The time taken by the application to complete the execution.
Queue | The queue to which the application belongs.
Memory | The total memory consumed.

Spark Jobs

This section displays the complete hierarchy of a Spark application. A Spark application consists of jobs, a job can have many stages, and each stage can have tasks. In this chart, you can view the jobs, stages, and tasks of the application.

Each job has a job ID. You can expand a job to view its stages.
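The hierarchy mirrors how Spark itself breaks up work: each action submits a job, each shuffle boundary splits that job into stages, and each stage runs one task per partition. A minimal, hypothetical PySpark sketch of this relationship:

  from pyspark.sql import SparkSession

  spark = SparkSession.builder.appName("hierarchy-example").getOrCreate()
  sc = spark.sparkContext

  # One action (count) submits one job; the shuffle introduced by reduceByKey
  # splits that job into two stages, and each stage runs one task per partition.
  rdd = sc.parallelize(range(1000), 8)              # 8 partitions -> 8 tasks per stage
  pairs = rdd.map(lambda x: (x % 10, 1))            # narrow transformation, same stage
  counts = pairs.reduceByKey(lambda a, b: a + b)    # shuffle boundary -> new stage
  counts.count()                                    # action -> a job with two stages

In this chart, such an application would appear as one job that expands into its two stages, each with its own task count and timeline.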

This section has the following columns.

Field | Description
Stage ID | The ID of the stage. Click the log icon to view the logs of the job.
Job ID | The ID of the Spark job.
Task Count | The number of tasks in the stage.
Timeline | The graphical representation of the duration of the tasks.
Duration | The time taken to complete the tasks in that stage.
Blacklisted | Specifies whether the job is blacklisted. If this field displays True, the job is blacklisted.
Max Task Memory | The maximum memory occupied by tasks.
IO Percentage | The rate of input/output operations (in %).
Shuffle Write | The amount of shuffle data written.
Shuffle Read | The amount of shuffle data read.
PRatio | The ratio of parallelism in the stage. A higher PRatio is better.
Task Skew | The task skewness value; values less than -1 or greater than +1 are flagged (refer to the dashboard).
Failure Rate | The rate at which the tasks in the stage fail.
Status | The status of the stage.
Failure Reason | The reason for task failure.

You can choose to group or ungroup the data by Job ID. To group, select the Group By job Id check box. When you ungroup, you can also choose to filter out inefficient stages.

You can click the More Details button to view the tasks of a stage.

You can click the DAG tab to view the execution flow only for the specific task.

Spark Stages

Stages are the units into which a job is divided, and each stage is further broken into small tasks. You can view Spark Stages in the form of a List or a Timeline. Click More Details to view the details of a particular stage.

Timeline

The timeline shows the timeframe in which the tasks in the stage executed. The timeline also includes the total driver execution time. Besides the default view, you can sort the timeline of these tasks by Duration or Start Time. To sort, perform the following:

  1. Click SortBy in the chart. The drop-down menu is displayed.
  2. Select the type of sort you want to apply. The data corresponding to the selected sort is displayed.
  3. (Optional) Select None, to remove the applied sort.

Viewing Timeseries Information

By default, the Timeseries Information charts display information on all the stages. To view Timeseries Information of an individual stage, perform the following:

  • Select the job stage from the list. The Timeseries Information for the selected stage is displayed.
  • To view information for all the stages, select the Show all Stages checkbox. The Timeseries Information for all the stages is displayed.

Timeseries Information

Timeseries information displays timeseries metrics of the application you are currently viewing. Within the time duration, you can see the time spent by the drivers, denoted by a red box. The drivers help in running Spark applications as sets of processes on a cluster.

Note You can see the name of the application you are currently viewing, above the user name in the top panel.

Timeseries Information Charts

The following table describes the Timeseries Information charts:

Info: By moving the slider, you can specify the time range for driver execution. The data in the timeseries charts is displayed based on the selected time range.

Chart Name | Description
Schedule Information | The number of tasks running at a particular time and the number of tasks that were yet to be executed.
IO | The number of input bytes read and the number of output bytes written during the duration of the task.
Driver Memory Usage | The amount of memory consumed by the driver.
Executor Memory Usage | The amount of memory used by the executor.
GC and CPU Distribution | The amount of garbage collection (in %) and the amount of CPU used (in %) to execute jobs.
Shuffle Information | The following shuffle information: Shuffle Bytes Written, Shuffle Local Bytes Read, Shuffle Remote Bytes Read, and Shuffle Remote Bytes Read to Disk.
Storage Memory | The amount of the following types of memory: Block Disk Space Used, Block Off Heap Memory Used, Block On Heap Memory Used, Block Max Off Heap Memory, and Block Max On Heap Memory.
HDFS Information | The amount of HDFS data read and written.

By default, all the charts display aggregated executor data. To view individual executors, perform the following:

  1. Click Show Individual Executors. A drop-down menu of the executors is displayed.
  2. Select the executor from the drop-down menu. The data corresponding to the selected executor is displayed in the chart.
  3. To view the aggregated executors, click Show Aggregated Executors.

Reports

The Reports panel has the following tabs:

Efficiency Statistics

The driver versus executor time spent indicates how well the Spark program has been written and whether the right amount of parallelism is achieved.

The following tables describe the charts in the Efficiency Statistics panel:

Metrics | Description
Driver Wallclock Time | The total time taken by the driver to complete the execution.
Executor Wallclock Time | The total time taken by all the executors to complete the execution.
Total Wallclock Time | The total time taken by the driver and all the executors to complete the execution.

Metrics | Description
Critical Path Time | The shortest time the application would take with an infinite number of executors.
Ideal Application Time | The shortest time the application would take, assuming perfect parallelism and no data skew.
Actual Run Time | The actual total time taken by the application, displayed alongside the Critical Path Time and the Ideal Application Time.

One Core Compute Hour (OCCH) represents one core running for one hour in the executor. The following metrics are displayed in the chart (see the worked illustration after this list):

  • OCCH wasted by driver.
  • OCCH wasted by executor.
  • Total OCCH wasted.
  • Total OCCH available.
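As a simple illustration of how the unit accrues (hypothetical numbers, not the product's exact accounting):

  # Hypothetical figures purely to illustrate the OCCH unit.
  executors, cores_per_executor, hours_running = 4, 2, 3
  occh_available = executors * cores_per_executor * hours_running   # 24 OCCH available
  occh_used_by_tasks = 18                                           # assumed core-hours actually spent on tasks
  occh_wasted_by_executors = occh_available - occh_used_by_tasks    # 6 OCCH wasted
  print(occh_available, occh_wasted_by_executors)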

Simulation

The Simulation chart displays the Estimated Wall Clock Time and Cluster Utilization (%) for the number of executors used. You can determine the ideal number of executors for the Spark program and the effect of changing the number of executors on the overall time and utilization.
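The intuition behind such an estimate can be sketched with a simple model; this is only a hypothetical back-of-the-envelope approximation, not the product's actual simulation formula:

  # Hypothetical model: wall-clock time ~= driver time plus task work spread
  # across all cores, never faster than the critical path.
  def estimate_wall_clock(driver_secs, total_task_secs, critical_path_secs,
                          num_executors, cores_per_executor):
      parallel_secs = total_task_secs / (num_executors * cores_per_executor)
      return driver_secs + max(parallel_secs, critical_path_secs)

  # Hypothetical job: 120 s of driver work, 3600 s of total task time,
  # 300 s critical path, 4 cores per executor.
  for executors in (2, 5, 10, 20):
      secs = estimate_wall_clock(120, 3600, 300, executors, 4)
      print(f"{executors} executors -> ~{secs:.0f} s estimated wall clock")

Adding executors stops helping once the parallel portion drops below the critical path, which is why utilization falls as the executor count grows.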

YARN Diagnostics

YARN Diagnostics displays the details of the YARN application executed by the user and running in that duration. The following table describes the fields in the YARN Diagnostics tab:

Field Name | Description
Start Time | Displays the start time of the YARN application.
End Time | Displays the end time of the YARN application.
State | The state of the YARN application. The state can be one of the following: Created, Initialized, Compiled, Running, Finished, Exception, or Unknown.
Message | The diagnostic message in the YARN application.
Message Count | The number of diagnostic messages.

To view more details, click a row. The following table describes the details displayed:

Metric | Description
Time | The start time of the application.
Preempted MB | The amount of memory preempted for the application.
Preempted VCores | The number of VCores preempted for the application.
Allocated MB | The amount of memory allocated to the application.
Allocated VCores | The number of VCores allocated to the application.
Avg Mem | The average memory used by the application.
Avg VCores | The average VCores used by the application.
Running Containers | The number of containers running in the application.
Queue Usage % | The amount of queue usage (in %).
Cluster Used % | The amount of cluster usage (in %).
State | The state of the query using the YARN application. The state can be one of the following: Created, Initialized, Compiled, Running, Finished, Exception, or Unknown.
Message | The diagnostic message.

Aggregate Metrics

The Aggregate Metrics tab displays the aggregated usage of different metrics in the application. The following table describes the fields in the Aggregate Metrics tab:

Field Name | Description
Name | The name of the metric.
Sum | Displays the total sum corresponding to the metric.
Min | Displays the minimum value corresponding to the metric.
Max | Displays the maximum value corresponding to the metric.
Mean | Displays the mean value corresponding to the metric.
Description | Displays the definition of the metric.

Core Usage By Locality

The panel provides information about core usage by locality for the Spark application. The following metrics are displayed in the charts:

  • Process Local: The tasks in this locality run within the same process as the source data.
  • Node Local: The tasks in this locality run on the same machine as the source data.
  • Wastage

Missing SparkTaskEnd events might cause discrepancies in the Spark core-usage calculation.
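Process Local and Node Local correspond to Spark's standard task locality levels (PROCESS_LOCAL and NODE_LOCAL). As a hedged aside, how long the scheduler waits for a better locality level before falling back is controlled by the standard spark.locality.wait setting; a minimal sketch, with a hypothetical value:

  from pyspark.sql import SparkSession

  # Hypothetical tuning sketch: lower the locality wait so tasks fall back
  # from PROCESS_LOCAL/NODE_LOCAL to less local levels sooner (default is 3s).
  spark = (
      SparkSession.builder
      .appName("locality-wait-example")
      .config("spark.locality.wait", "1s")
      .getOrCreate()
  )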

DAG

The DAG chart displays the exact flow in which each task was executed. The DAG flow displayed in this section represents the entire application. You can view the flow in which all the tasks in the application were executed.

Why does the Spark UI take a long time to render for certain long-running Spark applications?

In Spark applications where jobs run for an extended period and involve thousands of stages and tasks, the web page may become slow to render. The UI will eventually load, but it may take approximately one minute due to the large volume of data being processed and displayed.
