Tez Query Details

The Tez Query Details page contains the following panels:

Summary Panel

The summary panel displays the following information:

Field NameDescription
UserThe name of the user that executed the job.
StateThe state of the job that can be one of the following: Created, Initialized, Compiled, Running, Finished, Exception, or Unknown.
DurationThe time taken to run the query.
Delay TimeThe YARN scheduler delay time. The difference in time taken for a query to move from Accepted state to Running state.
Start TimeThe time at which the query execution started.
End TimeThe time at which the query execution ended.
QueueThe name of the Queue in which the Query is stuck. You can
# of VerticesThe number of vertices in the query.
HDFS Data ReadThe amount of HDFS data read.
HDFS Data WrittenThe amount of HDFS data written to an output file format.
Application IDThe ID of the application of the user you are currently viewing.

The Query Trends panel displays a chart showing the pattern of jobs running at a particular time, based on the following factors.

MetricDescription
Elapsed TimeThe time taken to run the jobs at a particular time.
VCoresThe number of VCores consumed to execute the query within a timeframe.
MemoryThe amount of memory used to execute the query within a timeframe.
HDFS ReadThe amount of HDFS data read.
HDFS WriteThe amount of HDFS data written to an output file format.

Configuration Difference

Hover your mouse pointer to the Compare button on the right side of the panel and select Runs to compare the different configurations of the query. Select the runs that you want to compare from the drop-down list. You can choose from up to 10 previous runs of the query.

The following tabs are displayed in the Configuration Difference window:

Tab NameDescription
MetricDisplays the metric difference of the runs
ConfigDisplays the configuration difference of the runs.

DAG Comparison

Hover your mouse pointer to the Compare button on the right side of the panel and select the Dag option to compare the different DAGs. Select the ID's from the drop down or enter a Hive ID to compare the two DAGs next to each other.

Recommendations

The Recommendations panel displays recommendations that you can use to improve the performance of the SQL Query. If the table has small files attached to it, this information is provided in the Recommendations section.

RecommendationsRecommendation TextSample QueryQuery Pattern Type
No limit statement on final projectionNo limit clauseSELECT col1 FROM table WHERE col1 = true;
Star projectionUse qualified column namesSELECT * FROM table LIMIT 10;
Possible query errorNo where clause detectedSELECT col1 FROM table;
Table partition is unusedPartition not usedSELECT * FROM table;
Possible optimizationFunction call in predicate, use a CTE query.SELECT * FROM
Possible optimizationAggregate Function call in a Having clause, Lift in to where clause.SELECT col1 FROM GROUP BY col1 HAVING COUNT(col1) > 1
Possible query errorThis 1uery uses multiple tables, but has non joins and no where clause. This may cause a cartesian join.SELECT t1.col, t2.col FROM t1, t2
Possible order optimizationOrderby can be replaced with Sortby to use multiple reducers. Ordering will be enforced within reducers.SELECT col1, col2 FROM t1 ORDER BY col1, col2

Change conversion type or decrease threshold to improve performance

Severity: 1

For single stage using fetch operator, either modify task conversion type or threshold to improve performance.select * from call_center where cc_ call_center_id="AAAAAAAABAAAAAAA" limit 10;SELECT *, FILTER on partition columns (WHERE and HAVING clauses), LIMIT only.

Move transformations on projection

Severity: 1

select * from (

select st1,i1,sa,sb,a,b from t1

) t1 join t2 on (

st1=st2 and

i1=i2 and

case when sa = 'yes' then a when sb = 'yes' then b end = x and

st2 = 'asd1'

Avoid row_number() functions as its not vectorized

Severity: 1

Non-vectorized operator $operator identified in the $stageSQL with row_number()

Specifically typecast the columns which are going in for string comparison to be cast as string if not already string datatype

Severity: 1

Non-vectorized operator $operator identified in the $stagewhere string_col=double_ col

Cross join identified, revert and refrain from usage

Severity: 0

Cross join is identified, please check the join keys or query conditionsselect * from <table1> cross join <table2

Identify M/R non-vectorized

Severity: 1

Complex queries usually have large numbers of joins, often over 10 joins per query.

Severity: 1

More than 10 joins identified in the query

Query

The Query panel displays the SQL query along with the Join details and the details of table(s) used in the query. The following table provides description of the table details:

ColumnDescription
Table NameThe table used in the query.
Database NameThe Database to which the Table belongs to.
Filter expressionThe expression used in query filtering.
Expected SizeThe total number of rows in the table.
Expected RowsThe number of rows returned on executing the query.

Note To copy SQL, click and to beautify the SQL, click .

Click Hive Config, to view Hive configuration details for the selected query.

To view the table details click on the table name. The Tez Table Details for the table is displayed.

Tables with Small Files

If any of the table accessed by the query has small files attached to it, you can see those table names with an yellow border. The legend at the top of the Query table indicates the same.

Predicate and Group By Details

Pulse displays the predicate ranges used in the query (if any) along with the predicate type. If multiple predicates are used in a query, all of them are displayed in this column.

Also, Pulse displays the Group By clause (if applied) om the query.

YARN Diagnostics

This panel displays the following diagnostics metrics of YARN.

Column NameDescription
Start TimeThe time at which the YARN application started.
End TimeThe time at which the YARN application ended.
StateThe state of the YARN application. The state can be one of the following: Created, Initialized, Compiled, Running, Finished, Exception, or Unknown.
MessageThe diagnostic message in the YARN application.
Message CountThe number of diagnostic messages.

The following details are displayed for jobs in a YARN container.

Note: A row contains data for a minute of the selected duration.

Column NameDescription
TimeThe minute at which the job is executed.
Preempted MBThe amount of processes that need priority to run the job.
Preempted VCoresThe number of VCores that need priority to run the job.
Allocated MBThe amount of memory allocated to the query (in Mb).
Avg MemoryThe average a mount of memory used.
Avg VCoreThe average amount of VCores used.
Running ContainersThe number of containers running in the query.
Queue Usage %The amount of queue usage (in %).
Cluster Usage %The amount of cluster usage (in %).
StateDisplays the state of the cluster
MessageThe diagnostic message.

Map Reduce Stats

This tile displays the statistics of processing of large data sets on a worker node. You can monitor the statistics of following processes by elapsed time.

  • Mappers
  • Reducers

These statistics can be sorted by Duration and Start Time. To sort perform the following:

  1. Click SortBy. The drop-down list is displayed.
  2. Select Duration, if you want to filter by duration. Select Start Time if you want to filter by time. The data is sorted.
  3. (Optional) Select None, to the reset the applied sort.

To filter the data by mappers and reducers, perform the following:

  1. Click Show. The drop-down list is displayed.
  2. Select Mappers if you want to view only data related to mappers. Select Reducers if you want to view only data related to reducers.
  3. (Optional) Click All, to remove the applied filter.

Query Execution Stats

This tile displays the values and graphs of data scanned during the query execution, time taken to execute the query, and the amount of records generated. You can also monitor the statistics in the Query Execution Summary.

The Post Query Execution Stats panel displays values for the following fields:

Column NameDescription
Vertex IDDisplays the Vertex ID. Clicking on the Vertex ID directs you to the Query Plan and DAG
StatusDisplays the Tez job status, such as Succeeded, Failed, Submitted, Killed, and Error.
Total TasksThe total number of tasks in each vertex
DurationThe time taken to run the query
Failed AttemptsThe number of completed job attempts with an unexpected status value
Killed TasksThe number of duplicate copies of a task that were attempted and terminated
CPU TimeThe CPU time taken to complete the query
GC TimeTime spent in garbage collection while executing a query
Input RecordsThe number of input records
Output RecordsThe number of output records

The Query Execution Summary panel shows a time series graph with data points drawn in the timeline based on the Run Dag, Start Dag, Submit Plan, and Compile Query flows.

Query Execution Metrics

The Query Execution Metrics panel displays the following set of the metrics.

Metric TypeMetric NameDescription
TaskCommitted Heap BytesThe maximum amount of memory (in bytes) that can be used for memory management.
CPU MillisecondsThe CPU time (in ms).
GC TimeTime spent by the JVM in garbage collection while executing a query.
Input Records ProcessedThe number of input records processed.
Merge Phase TimeThe time taken to merge a query phase.
Merged Map OutputsThe number of map outputs that were merged.
Output BytesThe number of output bytes written to a file format while executing the query at a given time.
Output RecordsThe number of output records.
Physical Memory BytesThe amount of physical memory the query uses.
Shuffle Bytes To DiskThe number of shuffle records written to disk.
Shuffle Bytes To MemThe number of shuffle records written to memory.
Shuffle BytesThe number of shuffle bytes the query uses.
Shuffle Phase TimeThe time taken to shuffle a query phase.
Spilled RecordsThe number of spilled records.
Virtual Memory BytesThe amount of virtual memory used by queries.
File System MetricsHDFS Bytes WrittenThe number of HDFS bytes written to the query executor.
File Bytes WrittenThe number of file bytes written to the query executor.
HDFS Bytes ReadThe number of HDFS bytes read.
File Bytes ReadThe number of file bytes read.
DAG MetricsTotal Launched TasksThe number of tasks launched in DAG.
Rack Local TasksThe number of tasks that are local to the rack.
Data Local TasksThe number of tasks that are local to the data.
# of Succeeded TasksThe number of tasks that completed successfully.
Application Master Cpu TimeThe time taken by the CPU in application master.
Application Master GC TimeThe time taken by the GC in the application master.
HIVE MetricsHive Files CreatedThe number of HIVE files that were created.

Query DAG and Plan

The panel displays the distribution of query logic in the form of a DAG and a physical execution plan.

DAG

The Direct Acyclic Graph (DAG) is an execution graph that displays a flow diagram of the compiled Hive SQL queries. This graph is a work scheduling graph with finite elements connected in edges and vertices. The order of execution of the jobs in DAG is specified by the directions of the edges in the graph. The graph is acyclic as it has no loops or cycles.

The following image displays an example for a DAG. Click to zoom in and zoom out. Click Reset, to restore back to original size.

The DAG displays a list of nodes. Each node can either be a Map node or a Reducer node. you can find information like the node which has the maximum input record, node that has the maximum output record, the slowest node, if a specific node is vectorized or not, and so on. You can also find counter information for specific vertices. Furthermore, you can also get information if an edge connecting two nodes is a simple edge.

You can find the information on inner tree by clicking the node.

You can navigate to a specific node in the DAG plan from the Post Query Execution Stats table by clicking a Vertex ID. the selected node is highlighted for some time.

Plan

Plan is a logical representation of how Spark executes the query, where a query is broken into different logical plans.

Type to search, ESC to discard
Type to search, ESC to discard
Type to search, ESC to discard