Configure Fingerprinting of Applications

This page describes how to configure application fingerprinting in Pulse. Fingerprinting helps Pulse uniquely identify jobs for accurate analysis and resource tracking.

You can:

  • Configure fingerprinting for Spark and Hive/Tez.
  • Enable Grok to define job patterns.
  • Add rules to classify jobs based on extracted fields.
  • Set blacklist rules to identify containers or applications that can be safely terminated if a node exceeds the defined overutilization threshold.

Ensure that fingerprinting is enabled for applications. For details, see Enable Fingerprinting.

Overview

Every organization names its jobs differently. For example, you might name your Spark jobs using a format like etl-orders-acme to show the job type, dataset, and organization. Pulse uses fingerprinting to uniquely recognize these naming patterns and classify jobs automatically for better analysis and resource tracking.

  • Pulse provides default property fields (such as spark.app.name).
    • If your jobs use these properties, you can enable Grok and define a Grok pattern for each.
  • You can also add custom property fields and define corresponding Grok patterns.
  • When Grok is enabled and patterns are defined, adding at least one rule is mandatory so that Pulse can uniquely identify those jobs.

Example Scenario

  • If your job name follows this pattern: spark.app.name = etl-orders-acme
  • You can enable Grok and define a pattern such as: %{WORD:job_type}-%{WORD:dataset}-%{WORD:organization}
  • Pulse then extracts:
    • job_type = etl
    • dataset = orders
    • organization = acme

This allows Pulse to accurately identify, group, and analyze jobs across Spark and Hive/Tez, even if parts of the job name change (for example, a date suffix).
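
Conceptually, a Grok pattern expands to a regular expression with named captures. The following Go sketch emulates the extraction above with the standard regexp package; it illustrates the idea only and is not Pulse's implementation.

    package main

    import (
        "fmt"
        "regexp"
    )

    func main() {
        // In Grok, %{WORD:name} is roughly a named word capture: (?P<name>\w+).
        re := regexp.MustCompile(`^(?P<job_type>\w+)-(?P<dataset>\w+)-(?P<organization>\w+)$`)

        match := re.FindStringSubmatch("etl-orders-acme")
        if match == nil {
            fmt.Println("no match")
            return
        }

        // Collect the named captures into the field map described above.
        fields := map[string]string{}
        for i, name := range re.SubexpNames() {
            if i > 0 && name != "" {
                fields[name] = match[i]
            }
        }
        fmt.Println(fields) // map[dataset:orders job_type:etl organization:acme]
    }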

When to Use Grok Patterns

Enable Grok Pattern when:

  • Your job names include dynamic fields such as dates or environment identifiers.
  • You want Pulse to ignore changing parts of the name and treat similar jobs as one recurring job.
  • You need to extract multiple attributes (e.g., job type, organization) from a single property for classification.

Example:

A job name like spark-py-2025-11-10 changes daily because of the date suffix. Using a Grok pattern, Pulse extracts the constant part (spark-py) and drops the date, so that all daily runs are treated as the same job when analyzing historical data.
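
For instance, assuming the standard Grok date components (YEAR, MONTHNUM, and MONTHDAY), a pattern along the following lines captures the constant prefix into a job_base field while matching the trailing date without naming it:

    %{DATA:job_base}-%{YEAR}-%{MONTHNUM}-%{MONTHDAY}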

Rules

When Grok is enabled, adding at least one Rule is mandatory.

Rules define how extracted fields are used to classify or categorize jobs. They can compare values, match strings or dates, or apply numeric and custom logic for flexible job grouping.

Example:

To classify jobs whose start and end dates are the same as “simple” and all others as “complex” (sketched in code after the list):

  • Operation Type: Date Equal
  • Date 1 Field: start_dt
  • Date 2 Field: end_dt
  • Date Layout: 2006-01-02
  • Rule Output Field: job_category
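
The Go sketch below shows logic equivalent to this rule under the configuration above. It is an illustration of the documented behavior, not Pulse's implementation; the "unknown" fallback is an assumption.

    package main

    import (
        "fmt"
        "time"
    )

    // dateEqualRule emulates a "Date Equal" operation: it parses the two date
    // fields with the configured layout and returns the value written to the
    // Rule Output Field.
    func dateEqualRule(fields map[string]string, layout string) string {
        start, errStart := time.Parse(layout, fields["start_dt"])
        end, errEnd := time.Parse(layout, fields["end_dt"])
        if errStart != nil || errEnd != nil {
            return "unknown" // fallback when a date fails to parse (assumption)
        }
        if start.Equal(end) {
            return "simple"
        }
        return "complex"
    }

    func main() {
        fields := map[string]string{"start_dt": "2025-11-10", "end_dt": "2025-11-10"}
        // "2006-01-02" is the Date Layout from the example (a Go-style reference layout).
        fields["job_category"] = dateEqualRule(fields, "2006-01-02")
        fmt.Println(fields["job_category"]) // simple
    }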

You can create multiple rules to define additional classifications, such as job type, priority, or duration category.

Fields Included

You can include multiple extracted or rule-based fields for job classification, such as:

  • job_category
  • organization
  • dataset

These fields come from both the Grok patterns and the Rule output fields.

How It Helps

This configuration ensures that Pulse can:

  • Recognize recurring jobs even if their names or parameters vary slightly.
  • Maintain historical continuity in job fingerprints.
  • Provide more granular insights into job patterns and resource behavior.

Spark Configuration

You can define property keys and patterns to classify Spark jobs based on how they are created and named.

When a Spark job is created, it generates properties in several categories: Spark, Hadoop, System, and Metrics.

Spark Properties

Default Spark Properties:

  • spark.app.name
  • spark.driver.memory
  • spark.executor.instances
  • spark.executor.memory
  • spark.executor.cores
  • spark.jars
  • spark.extraListeners
  • spark.scheduler.mode
  • spark.shuffle.service.name
  • spark.sql.catalogImplementation
  • spark.sql.shuffle.partitions

To configure:

  1. For each property field, click Enable Grok.
  2. Enter the Grok Pattern that matches your job naming structure.
  3. Add at least one Rule (mandatory when Grok is enabled) to define how Pulse interprets or classifies the extracted job fields.
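
For example, to apply the naming convention from the Example Scenario to spark.app.name, the configuration could look like this (values are illustrative):

  • Property field: spark.app.name
  • Grok Pattern: %{WORD:job_type}-%{WORD:dataset}-%{WORD:organization}
  • Rule: any rule that writes a Rule Output Field such as job_category (see Rules above)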

Hadoop Properties

There are no default fields. You can add properties based on the naming patterns you use when creating jobs.

For example:

  • spark.job.name
  • hive.job.queuename

Configuration:

  1. Click Enable Grok.
  2. Enter the Grok Pattern.
  3. Add at least one Rule if Grok is enabled.

System Properties

You can add properties based on the naming patterns you use when creating jobs.

Configuration:

  1. Click Enable Grok.
  2. Enter the Grok Pattern.
  3. Add at least one Rule if Grok is enabled.

Metrics Properties

You can add properties based on the naming patterns you use when creating jobs.

For example:

  • app.runtime.duration
  • app.memory.usage

Configuration:

  1. Click Enable Grok.
  2. Enter the Grok Pattern.
  3. Add at least one Rule if Grok is enabled.
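
As a purely illustrative example of the numeric logic mentioned under Rules, a rule on app.runtime.duration could bucket jobs into a duration category. The units (seconds), threshold, and output field in the Go sketch below are assumptions, not documented Pulse options.

    package main

    import (
        "fmt"
        "strconv"
    )

    // durationCategory buckets a runtime value (assumed to be in seconds) into a label.
    func durationCategory(raw string) string {
        seconds, err := strconv.ParseFloat(raw, 64)
        if err != nil {
            return "unknown"
        }
        if seconds < 300 { // 5-minute threshold, chosen only for illustration
            return "short"
        }
        return "long"
    }

    func main() {
        fields := map[string]string{"app.runtime.duration": "120"}
        fields["duration_category"] = durationCategory(fields["app.runtime.duration"])
        fmt.Println(fields["duration_category"]) // short
    }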

Hive and Tez Configuration

Define property field names and Grok patterns for Hive and Tez so Pulse can uniquely identify all jobs running in your environment.

Tez Properties

  • Example field names: appName, queue, etc.

Hive Properties

  • Example field names: hive.query.id, hive.query.string, etc.

To configure:

  1. Click Enable Grok for each property field.
  2. Enter the Grok Pattern to extract job details for classification.
  3. Add Rules (mandatory when Grok is enabled) to define how Pulse interprets extracted job details.

Black List Rules

Set blacklist rules to identify containers or applications that can be safely terminated if a node exceeds the overutilization threshold defined in Configure Overcommitment > Overutilization Threshold.

Examples:

Blacklist Rule 1

  • Rule Key: appName_agg_name
  • Rule Value: Hive
  • → Terminates all applications whose names start with Hive.

Blacklist Rule 2

  • Rule Key: spark.app.name_agg_name
  • Rule Value: forbidden
  • → Terminates all Spark applications whose names contain forbidden.
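
The Go sketch below shows the matching implied by these two examples: a prefix match for Rule 1 and a substring match for Rule 2. It is an interpretation of the examples above, not Pulse internals, and it ignores the distinction between the two rule keys for brevity.

    package main

    import (
        "fmt"
        "strings"
    )

    // blacklisted applies the two example rules to an application name:
    // Rule 1 -> prefix match on "Hive", Rule 2 -> substring match on "forbidden".
    func blacklisted(appName string) bool {
        return strings.HasPrefix(appName, "Hive") || strings.Contains(appName, "forbidden")
    }

    func main() {
        fmt.Println(blacklisted("Hive_nightly_compaction")) // true (prefix match, Rule 1)
        fmt.Println(blacklisted("etl-forbidden-run")) // true (substring match, Rule 2)
        fmt.Println(blacklisted("etl-orders-acme")) // false
    }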

When QoS is enabled under Configure Quality of Service (QoS) and a node exceeds the defined resource utilization threshold, Pulse automatically terminates containers or applications based on your configuration and the blacklist rules you set.
