Configure YARN Optimizer Service

This page describes how to configure the optimization parameters, which control how the optimization calculation works.

Configure Quality of Service (QoS)

When you enable QoS, the Optimizer monitors all nodes for rule violations. If a node exceeds its overutilization threshold, the Optimizer kills applications one by one, according to the defined black rules, until the node stabilizes.

Kill Mode

Select how Pulse handles nodes that violate QoS rules:

  • Container – Terminates only the affected container.
  • Application – Terminates the entire application.
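The "one by one until the node stabilizes" behavior can be pictured with a small sketch. It is an illustration only, not Pulse's implementation: the `Candidate` class, the `enforce_qos` function, and the idea that each kill frees a known share of utilization are all assumptions made for the example. Depending on the Kill Mode, a candidate would be a container or a whole application.

```python
from dataclasses import dataclass


@dataclass
class Candidate:
    """A running container or application on the node (hypothetical model)."""
    name: str
    freed_pct: float  # utilization (in %) assumed to be released when it is killed


def enforce_qos(utilization_pct, threshold_pct, candidates):
    """Kill candidates one at a time, in rule order, until the node is stable.

    `candidates` is assumed to be pre-sorted by the configured kill rules.
    Returns the names that were killed and the resulting utilization.
    """
    killed = []
    for candidate in candidates:
        if utilization_pct <= threshold_pct:
            break  # node is back under the threshold; stop killing
        utilization_pct -= candidate.freed_pct
        killed.append(candidate.name)
    return killed, utilization_pct


killed, final_pct = enforce_qos(
    utilization_pct=95.0,
    threshold_pct=80.0,
    candidates=[Candidate("app-1", 10.0), Candidate("app-2", 10.0), Candidate("app-3", 10.0)],
)
print(killed, final_pct)  # ['app-1', 'app-2'] 75.0
```

In this run the third candidate survives because the node drops below the 80% threshold after the second kill.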

Configure Overcommitment

Configure resource overcommitment thresholds and behavior.

These properties define how Pulse detects node states (underutilized, optimal, or overcommitted) and manages optimization actions.

Adjust the values based on your cluster size and workload patterns.

Property | Description | Default Value
Enable Optimization | Enables or disables resource optimization across nodes. | Disabled
Overutilization Threshold | Utilization percentage above which a node is considered overutilized. Optimization stops until the node returns to a stable state. | 80%
Underutilization Margin (MB) | Minimum unused memory (in MB) required for a node to be considered underutilized. | 500
Underutilization Margin (vCores) | Minimum unused CPU cores required for a node to be considered underutilized. | 1
Underutilized Cycle Count | Number of consecutive cycles a node must remain underutilized before it is marked for optimization. | 3
Underutilized Cycle Window (seconds) | Time window (in seconds) for evaluating underutilization cycles. | 10
Max Overcommit Multiplier (Memory) | The maximum factor by which memory can be overcommitted relative to the physical capacity. Example: 1.2 = 20% overcommitment. | 1.5
Max Overcommit Multiplier (vCores) | The maximum factor by which CPU cores can be overcommitted relative to the physical capacity. | 1.5
Node History Evaluation Window (seconds) | Time window for evaluating historical node performance before making overcommitment decisions. | -
Safe Buffer Overcommit Factor (Memory) | Safety buffer that limits memory overcommitment to prevent node overloads. | -
Safe Buffer Overcommit Factor (CPU) | Safety buffer that limits CPU overcommitment to maintain performance stability. | -
Resource Manager Poll Interval (seconds) | Interval at which Pulse polls the YARN ResourceManager for node and application metrics. | 10
Threshold CV | Coefficient of variation threshold used to detect performance instability or fluctuations in node metrics. |
Employ Node Pressure | Enables node pressure analysis to dynamically adjust overcommitment based on real-time node stress. | False
Max Percentage Node Free Memory | Specifies the maximum percentage of free memory to maintain per node before overcommitment stops. | -
Employ Memory Dynamics | Enables adaptive adjustment of memory overcommitment based on usage trends and node behavior. | False
Ensure Fingerprint Seen Min Count | Sets the minimum number of times a container’s fingerprint must be observed before it is considered reliable for optimization decisions. | -
Ensure Fingerprint Seen | Ensures that only containers with previously recorded fingerprints are included in optimization to maintain data consistency and accuracy. | -
Default Context Switch Multiplier | Defines the default multiplier applied to context switch pressure when calculating overall node pressure. It adjusts the impact of frequent context switches on optimization decisions, ensuring that systems with higher thread or process activity are weighted appropriately. Example: A multiplier of 1.5 increases the contribution of context switch pressure by 50%, making the optimizer more sensitive to thread-switching overhead. | -
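To make the interplay of these properties concrete, here is a minimal sketch of how a node state could be derived from the thresholds above. The function names are hypothetical, and two points are assumptions rather than documented behavior: that overutilization is judged on the higher of memory and CPU utilization, and that both underutilization margins must be met at once. The Underutilized Cycle Window is not modeled.

```python
def classify_node(used_mb, total_mb, used_vcores, total_vcores,
                  overutil_threshold_pct=80, margin_mb=500, margin_vcores=1):
    """Return 'overutilized', 'underutilized', or 'optimal' for one polling cycle."""
    memory_pct = 100 * used_mb / total_mb
    cpu_pct = 100 * used_vcores / total_vcores
    # Assumption: the busier of the two resources decides overutilization.
    if max(memory_pct, cpu_pct) > overutil_threshold_pct:
        return "overutilized"
    # Assumption: both margins must be satisfied to count as underutilized.
    if total_mb - used_mb >= margin_mb and total_vcores - used_vcores >= margin_vcores:
        return "underutilized"
    return "optimal"


def marked_for_optimization(states, cycle_count=3):
    """True once the most recent `cycle_count` states are all 'underutilized'."""
    recent = states[-cycle_count:]
    return len(recent) == cycle_count and all(s == "underutilized" for s in recent)


def max_overcommitted_capacity(physical, multiplier=1.5):
    """Upper bound on schedulable capacity; a multiplier of 1.2 means 20% over physical."""
    return physical * multiplier


print(classify_node(8000, 16384, 4, 8))    # underutilized
print(classify_node(14000, 16384, 4, 8))   # overutilized
print(marked_for_optimization(["optimal", "underutilized", "underutilized", "underutilized"]))  # True
print(max_overcommitted_capacity(10000, 1.5))  # 15000.0
```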

Configure Pressure Weights

Configure the relative importance of different system pressure metrics—such as CPU, memory, disk, network, and context switches—that influence optimization decisions.

These weights determine how much each resource contributes to the overall node pressure score used by the optimizer.

For details on how pressure thresholds are defined and measured per node type, see Configure Node Pressure.

Option | Description | Default Value
CPU Pressure Weight | Defines how much influence CPU pressure has on the overall node pressure score. | 1
Memory Pressure Weight | Defines the importance of memory usage pressure on the node pressure calculation. | 1
Context Switch Pressure Weight | Represents the effect of frequent context switches on overall node pressure. Higher values indicate a heavier impact. | 1
Disk Pressure Weight | Specifies the impact of disk I/O pressure (read/write latency or queue) on node pressure. | 1
Network Pressure Weight | Determines the contribution of network utilization or congestion to the node pressure score. | 1

How Node Pressure Works

Node pressure represents the combined load on a system based on multiple resource types—CPU, memory, context switches, disk, and network.

Each type of pressure is multiplied by its assigned weight, and the weighted values are averaged to form the overall Node Pressure Score.

Example:

  • Suppose the system detects high CPU pressure (0.8), moderate memory pressure (0.6), and other pressures averaging around 0.4, with the CPU weight set to 2 and the remaining weights left at 1. Dividing the weighted sum by the total weight gives (2 × 0.8 + 0.6 + 3 × 0.4) / 6 ≈ 0.57.
  • Since this is below 0.75, optimization continues normally. If the score rises above 0.75, optimization pauses to prevent system overload.

The pressure values (like 0.8 or 0.6) are calculated automatically from real-time system metrics, while only the weights are configured manually in the UI.
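The example above can be reproduced with a short sketch. It assumes the score is a weighted average normalized by the sum of the weights, which is the reading that yields the 0.57 quoted above; the function and key names are illustrative, not part of the product.

```python
def node_pressure_score(pressures, weights):
    """Weighted average of per-resource pressures, each on a 0 to 1 scale."""
    total_weight = sum(weights[name] for name in pressures)
    weighted_sum = sum(pressures[name] * weights[name] for name in pressures)
    return weighted_sum / total_weight


def should_pause(score, limit=0.75):
    """Optimization pauses once the score rises above the limit."""
    return score > limit


pressures = {"cpu": 0.8, "memory": 0.6, "context_switch": 0.4, "disk": 0.4, "network": 0.4}
default_weights = dict.fromkeys(pressures, 1.0)
cpu_heavy_weights = {**default_weights, "cpu": 2.0}

print(round(node_pressure_score(pressures, default_weights), 2))    # 0.52
print(round(node_pressure_score(pressures, cpu_heavy_weights), 2))  # 0.57
print(should_pause(node_pressure_score(pressures, cpu_heavy_weights)))  # False
```

Doubling the CPU weight moves the score from 0.52 to 0.57 for the same measurements, which shows how a weight shifts the optimizer's sensitivity toward one resource.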

Optimization Impact

  • When node pressure is low, optimization continues normally.
  • When node pressure increases gradually, optimization becomes more conservative.
  • When node pressure exceeds 0.75 (on a 0–1 scale), optimization pauses to prevent overloading the system.

Example:

A node pressure score of 0.75 means the system is under 75% total stress from combined resources. Above this level, optimization pauses temporarily until the pressure drops.

Example Scenarios

Scenario | Description | Example Recommended Weighting
CPU-Intensive Node | Most pressure comes from high CPU usage (e.g., data processing or computation-heavy tasks). | Increase CPU Pressure Weight (e.g., 2) to make optimization more sensitive to CPU load.
Memory-Intensive Node | Heavy memory usage (e.g., large Spark or YARN jobs). | Increase Memory Pressure Weight (e.g., 2) so optimization reduces load sooner when memory pressure rises.
High Context Switching | Frequent task switching or excessive thread activity increases CPU overhead. | Raise Context Switch Pressure Weight slightly (e.g., 1.5) to account for system inefficiency.
Disk-Intensive Node | High I/O activity, long disk queue times. | Increase Disk Pressure Weight (e.g., 1.5) to limit optimization when storage becomes a bottleneck.
Network-Intensive Node | High network usage or data transfer load. | Increase Network Pressure Weight (e.g., 2) to prevent over-optimization under network strain.

Summary

Each weight tells the optimizer how important that type of system pressure is when deciding whether to optimize.

  • A higher weight means “pay more attention” to that resource.
  • If overall node pressure (after weighting) goes above 0.75, optimization pauses to keep the system stable.
  • For resource-specific thresholds, see Configure Node Pressure.

Configure Windows

Configure time windows and their weights to analyze the past behavior of node, CPU, and memory usage.

This helps you make informed decisions about overcommitting resources.

How It Works

  • Define time windows to specify the historical intervals for collecting metrics. Shorter windows capture recent spikes, while longer ones show sustained trends.
  • Assign weights to each window to decide how much influence it has on the final analysis. Higher weights give that time window more importance.

Example: Node (Memory) Analysis

  • Node Thriver Windows: 5m, 15m, 30m
  • Node Thriver Window Weights: 5m:0.7, 15m:0.2, 30m:0.1. This configuration prioritizes recent data (5 minutes) more heavily while still considering longer trends.

Example: CPU Analysis

  • Node CPU Windows: 5m, 15m, 30m
  • Node CPU Window Weights: 5m:0.5, 15m:0.3, 30m:0.2. This setup helps identify both short-term CPU spikes and long-term usage patterns.
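As a rough illustration of how windows and weights combine, the sketch below parses a weight string in the `5m:0.7, 15m:0.2, 30m:0.1` form shown above and computes a weighted sum. The per-window usage values and the function names are hypothetical, and treating the combined metric as a plain weighted sum of per-window utilization ratios is an assumption.

```python
def parse_window_weights(spec):
    """Parse a string such as '5m:0.7, 15m:0.2, 30m:0.1' into {'5m': 0.7, ...}."""
    weights = {}
    for item in spec.split(","):
        window, weight = item.strip().split(":")
        weights[window] = float(weight)
    return weights


def weighted_window_sum(usage_by_window, weights):
    """Combine per-window usage ratios (0 to 1) into a single weighted sum."""
    return sum(usage_by_window[window] * weight for window, weight in weights.items())


weights = parse_window_weights("5m:0.7, 15m:0.2, 30m:0.1")
usage = {"5m": 0.9, "15m": 0.6, "30m": 0.5}  # hypothetical average utilization per window
print(round(weighted_window_sum(usage, weights), 2))  # 0.8
```

Because the 5-minute window carries a weight of 0.7, a recent spike to 0.9 dominates the result even though the longer windows are calmer.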

Node Thriver Weighted Sum Overcommit Threshold: Threshold (0.0-1.0) for deciding whether a node is overcommitted, based on weighted sum analysis. When the weighted sum of node thriver metrics across all windows exceeds this value, the node is considered overcommitted or overloaded, and optimization actions may be triggered: containers or applications are killed one by one, based on the configured rules, until the node is stable.

  • Lower values (e.g., 0.7) trigger optimization actions more aggressively.
  • Higher values (e.g., 0.9) are more conservative.
  • Recommended: 0.8 for balanced optimization.
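The effect of the threshold choice can be seen with a tiny comparison. The function name and the sample weighted sum of 0.85 are illustrative assumptions.

```python
def is_overcommitted(weighted_sum, threshold=0.8):
    """A node counts as overcommitted when the weighted sum exceeds the threshold."""
    return weighted_sum > threshold


weighted_sum = 0.85  # hypothetical weighted sum across all windows
for threshold in (0.7, 0.8, 0.9):
    print(threshold, is_overcommitted(weighted_sum, threshold))
# 0.7 True
# 0.8 True
# 0.9 False
```

The same node is flagged under the aggressive and balanced settings but left alone under the conservative one.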

Configure Node Pressure

Configure pressure thresholds for different node types to help the optimizer evaluate and manage node performance under varying load conditions.

This configuration allows tuning for different node profiles, such as memory-heavy or compute-heavy nodes.

Priority Order

all → Specific node types (e.g., 16Gx8, 32Gx8) → default

  • all: Applies globally to all nodes and overrides other configurations.
  • Specific node type: Applies only to matching nodes (for example, 16Gx8).
  • default: Acts as a fallback for node types not explicitly defined.

Field | Description | Example / Default Value
Node Type | Defines the node category this configuration applies to. Use all to apply globally or specify a node type (e.g., 16Gx8). | all
Context Switches (Involuntary) Total | Sets the maximum number of involuntary context switches allowed before marking the node as pressured. | -
Context Switches (All) Total | Sets the total number of context switches (voluntary + involuntary) allowed for a node. | -
Disk Max MB/s | Specifies the disk throughput threshold (in MB/s). When this limit is exceeded, disk pressure increases. | -
Network Max MB/s | Defines the maximum network throughput in MB/s. Used to measure network-related pressure. | -
Network Max Packets/s | Sets the maximum packets per second threshold for network usage. High packet rates may indicate congestion. | -

Example Scenario

If your cluster includes node types such as 16Gx8 and 32Gx8:

  • Use all to define global thresholds.
  • Configure specific node types with tuned values (for example, higher disk throughput for data nodes).
  • Use default as a fallback for any unmatched node types.
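The priority order (all, then the specific node type, then default) amounts to a simple lookup. The sketch below is one way to express it; the function name, the `disk_max_mb_s` key, and the threshold values are hypothetical.

```python
def resolve_node_pressure_config(node_type, configs):
    """Pick the configuration for a node: 'all' first, then the exact type, then 'default'."""
    for key in ("all", node_type, "default"):
        if key in configs:
            return configs[key]
    return None  # nothing configured for this node


configs = {
    "32Gx8": {"disk_max_mb_s": 400},
    "default": {"disk_max_mb_s": 200},
}

print(resolve_node_pressure_config("32Gx8", configs))  # {'disk_max_mb_s': 400}
print(resolve_node_pressure_config("16Gx8", configs))  # {'disk_max_mb_s': 200}

# Adding an 'all' entry overrides every other configuration.
configs["all"] = {"disk_max_mb_s": 300}
print(resolve_node_pressure_config("32Gx8", configs))  # {'disk_max_mb_s': 300}
```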

In summary, the Node Pressure Configurations and Pressure Weights work together to determine when and how optimization occurs.

By setting proper thresholds per node type, the optimizer can make resource-aware decisions while maintaining system stability.

Configure Advanced Screening

This configuration defines advanced screening tests and their related parameters used by the YARN Optimizer to refine optimization decisions.

When a main option is enabled, its related sub-options become active for fine-tuning behavior.

Option | Description | Default / Example Value
Enable Advanced Screening | Enables or disables all advanced screening tests and filters. | Disabled
Enable Realtime Usage Test

Activates real-time analysis of resource usage patterns for screening.

Sub-options:

  • Realtime Usage Hard Filter: Applies a strict threshold to filter based on real-time usage data.
  • Runtime Usage Threshold Factor: Defines the threshold factor used to compare current runtime usage with historical fingerprint values.

If the current usage exceeds this threshold, the container or job is flagged as unstable and excluded from optimization.

Example: If the threshold factor is set to 1.2, and the historical average memory usage is 1 GB, then any container using more than 1.2 GB at runtime will be considered over the limit.

Default: Disabled
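The Runtime Usage Threshold Factor check reduces to one comparison. In this sketch the function name is hypothetical and the historical fingerprint is represented by a single average value in MB, which is an assumption.

```python
def passes_realtime_usage_test(current_usage_mb, fingerprint_usage_mb, threshold_factor):
    """True if current usage stays within threshold_factor times the historical usage."""
    return current_usage_mb <= fingerprint_usage_mb * threshold_factor


# Historical average of 1 GB (1024 MB) with a factor of 1.2 gives a limit of about 1228.8 MB.
print(passes_realtime_usage_test(1200, 1024, 1.2))  # True
print(passes_realtime_usage_test(1300, 1024, 1.2))  # False
```

A container that fails the check would be flagged as unstable and left out of optimization.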
Enable Age Test

Enables filtering based on the age of containers, applications, or data. When enabled, the optimizer ensures that only containers running for a minimum duration are considered for optimization.

Sub-options:

  • Age Test Hard Filter: Applies a strict check to exclude containers that do not meet the minimum age criteria.
  • Minimum Age Allowed: Sets the minimum time a container must run before it qualifies for optimization. Containers younger than this threshold are skipped to avoid premature optimization.

Example: If the Minimum Age Allowed = 3m (3 minutes), only containers that have been running for 3 minutes or longer are considered stable enough for optimization. Containers younger than 3 minutes are excluded, as they might not have sufficient runtime data for accurate assessment.

Default: Disabled
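A minimal sketch of the age check follows. The duration format is assumed to be a number followed by `s`, `m`, or `h` (only `3m` appears on this page), and the function names are hypothetical.

```python
from datetime import timedelta

_UNITS = {"s": "seconds", "m": "minutes", "h": "hours"}


def parse_duration(text):
    """Convert a string such as '3m' into a timedelta (assumed format: <int><s|m|h>)."""
    value, unit = int(text[:-1]), text[-1]
    return timedelta(**{_UNITS[unit]: value})


def passes_age_test(container_age, minimum_age="3m"):
    """True if the container has been running for at least the minimum age."""
    return container_age >= parse_duration(minimum_age)


print(passes_age_test(timedelta(minutes=5)))    # True
print(passes_age_test(timedelta(minutes=3)))    # True
print(passes_age_test(timedelta(seconds=179)))  # False
```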
Enable Trend Stability Test

Enables evaluation of stability trends across multiple application runs to identify consistent performance patterns.

Sub-option:

  • Trend Stability Hard Filter: Applies strict filtering based on trend stability results. If the 95th percentile of current metric values deviates significantly from the historical fingerprint data, the application is excluded from optimization calculations.

Default: Disabled
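The page does not say how large a deviation counts as significant, so the sketch below uses a hypothetical `tolerance` of 20% purely for illustration. It also assumes the fingerprint stores a 95th percentile value to compare against; the function names are not part of the product.

```python
import statistics


def p95(values):
    """95th percentile using linear interpolation between sample points."""
    return statistics.quantiles(values, n=100, method="inclusive")[94]


def passes_trend_stability_test(current_values, fingerprint_p95, tolerance=0.2):
    """True if the current p95 is within `tolerance` (relative) of the fingerprint p95."""
    deviation = abs(p95(current_values) - fingerprint_p95) / fingerprint_p95
    return deviation <= tolerance


samples = list(range(1, 101))  # hypothetical metric samples: 1, 2, ..., 100
print(passes_trend_stability_test(samples, fingerprint_p95=100))  # True
print(passes_trend_stability_test(samples, fingerprint_p95=50))   # False
```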
Enable Fingerprint Trust Test

Enables validation of container fingerprint trust levels for screening accuracy. When enabled, the optimizer compares current runtime data with historical fingerprint data to ensure stability before optimization.

Sub-options:

  • Fingerprint Trust Hard Filter: Applies a strict filter to exclude containers whose runtime data significantly deviates from trusted fingerprint values.
  • Memory CV Threshold: Sets the maximum allowed coefficient of variation (CV) between fingerprint and runtime memory usage. Containers exceeding this threshold are excluded from optimization.

By default, a CV value of 3 means runtime memory can vary up to three times from the fingerprinted memory before being flagged as unstable.

Example: If a container’s fingerprinted memory usage is 500 MB, the optimizer allows runtime memory usage up to about 1.5 GB (≈3×). If the current runtime usage exceeds this limit, the container is marked unstable and excluded from optimization.

  • Minimum Run Count Required: Defines the minimum number of container runs required to validate fingerprint reliability before inclusion in optimization.

Example: A container must have at least 20 successful runs to build a trusted fingerprint before being considered for optimization.

Default: Disabled

Sub-option defaults:

  • Memory CV Threshold: 3
  • Minimum Run Count Required: 20
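The two Fingerprint Trust sub-options can be combined into a single check. This sketch follows the simplified reading given above, where a Memory CV Threshold of 3 allows runtime memory up to roughly three times the fingerprinted value; it does not compute a statistical coefficient of variation, and the function name is hypothetical.

```python
def passes_fingerprint_trust_test(runtime_mb, fingerprint_mb, run_count,
                                  memory_cv_threshold=3, min_run_count=20):
    """True if the fingerprint has enough runs and runtime memory stays within the allowed multiple."""
    if run_count < min_run_count:
        return False  # fingerprint is not yet trusted
    return runtime_mb <= fingerprint_mb * memory_cv_threshold


print(passes_fingerprint_trust_test(1400, 500, run_count=25))  # True
print(passes_fingerprint_trust_test(1600, 500, run_count=25))  # False
print(passes_fingerprint_trust_test(1000, 500, run_count=10))  # False
```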