Configure YARN Optimizer Service
This page describes the optimization parameters, which control how the optimization calculation works.
Configure Quality of Service (QoS)
When you enable QoS, the Optimizer monitors all nodes for rule violations. If a node exceeds its overutilization threshold, the Optimizer terminates applications one by one, based on the defined rules, until the node stabilizes.
Kill Mode
Select how Pulse handles nodes that violate QoS rules:
- Container – Terminates only the affected container.
- Application – Terminates the entire application.
Configure Overcommitment
Configure resource overcommitment thresholds and behavior.
These properties define how Pulse detects node states (underutilized, optimal, or overcommitted) and manages optimization actions.
Adjust the values based on your cluster size and workload patterns.
| Property | Description | Default Value |
|---|---|---|
| Enable Optimization | Enables or disables resource optimization across nodes. | Disabled |
| Overutilization Threshold | Sets the utilization percentage above which a node is considered overutilized; optimization pauses until the node returns to a stable state. | 80 (in %) |
| Underutilization Margin (MB) | Minimum unused memory (in MB) required for a node to be considered underutilized. | 500 (in MB) |
| Underutilization Margin (vCores) | Minimum unused CPU cores required for a node to be considered underutilized. | 1 |
| Underutilized Cycle Count | Number of consecutive cycles a node must remain underutilized before it is marked for optimization. | 3 |
| Underutilized Cycle Window (seconds) | Time window (in seconds) for evaluating underutilization cycles. | 10 |
| Max Overcommit Multiplier (Memory) | The maximum factor by which memory can be overcommitted relative to the physical capacity. Example: 1.2 = 20% overcommitment. | 1.5 |
| Max Overcommit Multiplier (vCores) | The maximum factor by which CPU cores can be overcommitted relative to the physical capacity. | 1.5 |
| Node History Evaluation Window (seconds) | Time window for evaluating historical node performance before making overcommitment decisions. | - |
| Safe Buffer Overcommit Factor (Memory) | Safety buffer that limits memory overcommitment to prevent node overloads. | - |
| Safe Buffer Overcommit Factor (CPU) | Safety buffer that limits CPU overcommitment to maintain performance stability. | - |
| Resource Manager Poll Interval (seconds) | Interval at which Pulse polls the YARN ResourceManager for node and application metrics. | 10 seconds |
| Threshold CV | Coefficient of variation threshold used to detect performance instability or fluctuations in node metrics. | - |
| Employ Node Pressure | Enables node pressure analysis to dynamically adjust overcommitment based on real-time node stress. | False |
| Max Percentage Node Free Memory | Specifies the maximum percentage of free memory to maintain per node before overcommitment stops. | - |
| Employ Memory Dynamics | Enables adaptive adjustment of memory overcommitment based on usage trends and node behavior. | False |
| Ensure Fingerprint Seen Min Count | Sets the minimum number of times a container’s fingerprint must be observed before it is considered reliable for optimization decisions. | - |
| Ensure Fingerprint Seen | Ensures that only containers with previously recorded fingerprints are included in optimization to maintain data consistency and accuracy. | - |
| Default Context Switch Multiplier | Defines the default multiplier applied to context switch pressure when calculating overall node pressure. It adjusts the impact of frequent context switches on optimization decisions, ensuring that systems with higher thread or process activity are weighted appropriately. Example: A multiplier of 1.5 increases the contribution of context switch pressure by 50%, making the optimizer more sensitive to thread-switching overhead. | - |
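To make the interplay between these properties concrete, here is a minimal sketch of how a node's state could be derived from the overutilization threshold, underutilization margins, and memory overcommit multiplier. All names and the decision logic are illustrative assumptions, not Pulse internals.

```python
# Illustrative sketch only: how the thresholds above could classify a node.
# Function and parameter names are assumptions, not Pulse internals.

def classify_node(total_mem_mb, used_mem_mb, total_vcores, used_vcores,
                  overutilization_threshold_pct=80,   # Overutilization Threshold
                  underutil_margin_mb=500,            # Underutilization Margin (MB)
                  underutil_margin_vcores=1):         # Underutilization Margin (vCores)
    """Classify a node as overutilized, underutilized, or optimal."""
    mem_util_pct = 100.0 * used_mem_mb / total_mem_mb
    cpu_util_pct = 100.0 * used_vcores / total_vcores

    # Above the overutilization threshold, optimization pauses until the node stabilizes.
    if max(mem_util_pct, cpu_util_pct) > overutilization_threshold_pct:
        return "overutilized"

    # Enough free memory AND free vCores makes the node a candidate for
    # overcommitment (after it stays underutilized for the configured cycle count).
    free_mem_mb = total_mem_mb - used_mem_mb
    free_vcores = total_vcores - used_vcores
    if free_mem_mb >= underutil_margin_mb and free_vcores >= underutil_margin_vcores:
        return "underutilized"

    return "optimal"


def max_overcommit_memory_mb(physical_mem_mb, multiplier=1.5):
    """Max Overcommit Multiplier (Memory): 1.5 allows advertising up to 150% of physical memory."""
    return physical_mem_mb * multiplier


print(classify_node(16384, 6000, 8, 3))    # -> underutilized
print(max_overcommit_memory_mb(16384))     # -> 24576.0
```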
Configure Pressure Weights
Configure the relative importance of different system pressure metrics—such as CPU, memory, disk, network, and context switches—that influence optimization decisions.
These weights determine how much each resource contributes to the overall node pressure score used by the optimizer.
For details on how pressure thresholds are defined and measured per node type, see Configure Node Pressure.
| Option | Description | Default Value |
|---|---|---|
| CPU Pressure Weight | Defines how much influence CPU pressure has on the overall node pressure score. | 1 |
| Memory Pressure Weight | Defines the importance of memory usage pressure on the node pressure calculation. | 1 |
| Context Switch Pressure Weight | Represents the effect of frequent context switches on overall node pressure. Higher values indicate a heavier impact. | 1 |
| Disk Pressure Weight | Specifies the impact of disk I/O pressure (read/write latency or queue) on node pressure. | 1 |
| Network Pressure Weight | Determines the contribution of network utilization or congestion to the node pressure score. | 1 |
How Node Pressure Works
Node pressure represents the combined load on a system based on multiple resource types—CPU, memory, context switches, disk, and network.
Each type of pressure is multiplied by its assigned weight, and the weighted values are averaged to form the overall Node Pressure Score.
Example:
- If the system detects high CPU usage (0.8), moderate memory usage (0.6), and other pressures averaging around 0.4 with the CPU weight set to 2, the resulting node pressure might be around 0.57.
- Since this is below 0.75, optimization continues normally. If it rises above 0.75, optimization pauses to prevent system overload.
The pressure values (like 0.8 or 0.6) are calculated automatically from real-time system metrics, while only the weights are configured manually in the UI.
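Expressed as a calculation, the weighted average looks like the following minimal sketch, which reproduces the example figures above; the function name and data layout are assumptions, not the optimizer's actual code.

```python
# Illustrative weighted node pressure score, reproducing the example above.
# The pressure values come from real-time metrics; only the weights are configured.

def node_pressure_score(pressures, weights):
    """Weighted average of per-resource pressure values (each on a 0-1 scale)."""
    total_weight = sum(weights[r] for r in pressures)
    weighted_sum = sum(pressures[r] * weights[r] for r in pressures)
    return weighted_sum / total_weight


pressures = {"cpu": 0.8, "memory": 0.6, "context_switch": 0.4, "disk": 0.4, "network": 0.4}
weights   = {"cpu": 2,   "memory": 1,   "context_switch": 1,   "disk": 1,   "network": 1}

score = node_pressure_score(pressures, weights)
print(round(score, 2))                                   # -> 0.57
print("pause optimization" if score > 0.75 else "continue optimization")
```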
Optimization Impact
- When node pressure is low, optimization continues normally.
- When node pressure increases gradually, optimization becomes more conservative.
- When node pressure exceeds 0.75 (on a 0–1 scale), optimization pauses to prevent overloading the system.
Example:
A node pressure score of 0.75 means the system is under 75% total stress from combined resources. At this level, optimization stops temporarily until pressure drops.
Example Scenarios
| Scenario | Description | Example Recommended Weighting |
|---|---|---|
| CPU-Intensive Node | Most pressure comes from high CPU usage (e.g., data processing or computation-heavy tasks). | Increase CPU Pressure Weight (e.g., 2) to make optimization more sensitive to CPU load. |
| Memory-Intensive Node | Heavy memory usage (e.g., large Spark or YARN jobs). | Increase Memory Pressure Weight (e.g., 2) so optimization reduces load sooner when memory pressure rises. |
| High Context Switching | Frequent task switching or excessive thread activity increases CPU overhead. | Raise Context Switch Pressure Weight slightly (e.g., 1.5) to account for system inefficiency. |
| Disk-Intensive Node | High I/O activity, long disk queue times. | Increase Disk Pressure Weight (e.g., 1.5) to limit optimization when storage becomes a bottleneck. |
| Network-Intensive Node | High network usage or data transfer load. | Increase Network Pressure Weight (e.g., 2) to prevent over-optimization under network strain. |
Summary
Each weight tells the optimizer how important that type of system pressure is when deciding whether to optimize.
- A higher weight means “pay more attention” to that resource.
- If overall node pressure (after weighting) goes above 0.75, optimization pauses to keep the system stable.
- For resource-specific thresholds, see Configure Node Pressure.
Configure Windows
Configure time windows and their weights to analyze the past behavior of node memory and CPU usage.
This helps you make informed decisions about overcommitting resources.
How It Works
- Define time windows to specify the historical intervals for collecting metrics. Shorter windows capture recent spikes, while longer ones show sustained trends.
- Assign weights to each window to decide how much influence it has on the final analysis. Higher weights give that time window more importance.
Example: Node (Memory) Analysis
- Node Thriver Windows: 5m, 15m, 30m
- Node Thriver Window Weights: 5m:0.7, 15m:0.2, 30m:0.1. This configuration prioritizes recent data (5 minutes) more heavily while still considering longer trends.
Example: CPU Analysis
- Node CPU Windows: 5m, 15m, 30m
- Node CPU Window Weights: 5m:0.5, 15m:0.3, 30m:0.2. This setup helps identify both short-term CPU spikes and long-term usage patterns.
Node Thriver Weighted Sum Overcommit Threshold: Threshold for determining whether a node is overcommitted based on weighted sum analysis. When the weighted sum of node thriver metrics across all windows exceeds this value (0.0–1.0), the node is considered overcommitted or overloaded, and optimization actions may be triggered (terminating containers or applications one by one, based on the configured rules, until the node is stable). A worked sketch follows the list below.
- Lower values (e.g., 0.7) trigger optimization more aggressively.
- Higher values (e.g., 0.9) are more conservative.
- Recommended: 0.8 for balanced optimization.
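As a worked illustration of how window weights combine with the overcommit threshold, here is a minimal sketch; the per-window metric values and all names are assumptions for the example only.

```python
# Illustrative weighted-sum overcommit check across time windows. The per-window
# metric values are made-up example numbers; the weights and threshold mirror
# the configuration described above.

window_weights = {"5m": 0.7, "15m": 0.2, "30m": 0.1}      # Node Thriver Window Weights
window_metrics = {"5m": 0.85, "15m": 0.78, "30m": 0.70}   # example node thriver metrics (0-1)

weighted_sum = sum(window_weights[w] * window_metrics[w] for w in window_weights)
print(round(weighted_sum, 3))                              # -> 0.821

threshold = 0.8   # Node Thriver Weighted Sum Overcommit Threshold (recommended)
if weighted_sum > threshold:
    print("node considered overcommitted -> optimization actions may be triggered")
else:
    print("node within threshold -> no action")
```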
Configure Node Pressure
Configure pressure thresholds for different node types to help the optimizer evaluate and manage node performance under varying load conditions.
This configuration allows tuning for different node profiles, such as memory-heavy or compute-heavy nodes.
Priority Order
all → Specific node types (e.g., 16Gx8, 32Gx8) → default
- all: Applies globally to all nodes and overrides other configurations.
- Specific node type: Applies only to matching nodes (for example, 16Gx8).
- default: Acts as a fallback for node types not explicitly defined.
| Field | Description | Example / Default Value |
|---|---|---|
| Node Type | Defines the node category this configuration applies to. Use all to apply globally or specify a node type (e.g., 16Gx8). | all |
| Context Switches (Involuntary) Total | Sets the maximum number of involuntary context switches allowed before marking the node as pressured. | - |
| Context Switches (All) Total | Sets the total number of context switches (voluntary + involuntary) allowed for a node. | - |
| Disk Max MB/s | Specifies the disk throughput threshold (in MB/s). When this limit is exceeded, disk pressure increases. | - |
| Network Max MB/s | Defines the maximum network throughput in MB/s. Used to measure network-related pressure. | - |
| Network Max Packets/s | Sets the maximum packets per second threshold for network usage. High packet rates may indicate congestion. | - |
Example Scenario
If your cluster includes node types such as 16Gx8 and 32Gx8:
- Use all to define global thresholds.
- Configure specific node types with tuned values (for example, higher disk throughput for data nodes).
- Use default as a fallback for any unmatched node types.
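The following minimal sketch shows how that priority order could resolve which thresholds apply to a given node; the dictionary layout and field names are illustrative assumptions, not the actual configuration schema.

```python
# Illustrative lookup of node pressure thresholds by priority order:
# "all" first, then the specific node type, then "default".

def resolve_thresholds(configs, node_type):
    """Return the (key, thresholds) pair that applies to the given node type."""
    for key in ("all", node_type, "default"):
        if key in configs:
            return key, configs[key]
    return None, {}


# Cluster where only per-type and default thresholds are defined.
configs = {
    "16Gx8":   {"disk_max_mb_s": 250, "network_max_mb_s": 120},
    "default": {"disk_max_mb_s": 200, "network_max_mb_s": 100},
}

print(resolve_thresholds(configs, "16Gx8"))    # matches the specific node type
print(resolve_thresholds(configs, "32Gx8"))    # falls back to "default"

# Adding an "all" entry overrides everything.
configs["all"] = {"disk_max_mb_s": 400}
print(resolve_thresholds(configs, "16Gx8"))    # now resolves to "all"
```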
In summary, the Node Pressure Configurations and Pressure Weights work together to determine when and how optimization occurs.
By setting proper thresholds per node type, the optimizer can make resource-aware decisions while maintaining system stability.
Configure Advanced Screening
This configuration defines advanced screening tests and their related parameters used by the YARN Optimizer to refine optimization decisions.
When a main option is enabled, its related sub-options become active for fine-tuning behavior.
| Option | Description | Default / Example Value |
|---|---|---|
| Enable Advanced Screening | Enables or disables all advanced screening tests and filters. | Disabled |
| Enable Realtime Usage Test | Activates real-time analysis of resource usage patterns for screening. Sub-option: a usage threshold factor. If the current usage exceeds this threshold, the container or job is flagged as unstable and excluded from optimization. Example: If the threshold factor is set to 1.2 and the historical average memory usage is 1 GB, any container using more than 1.2 GB at runtime is considered over the limit. | Disabled |
| Enable Age Test | Enables filtering based on the age of containers, applications, or data. When enabled, the optimizer ensures that only containers running for a minimum duration are considered for optimization. Sub-option: Minimum Age Allowed. Example: If the Minimum Age Allowed = 3m (3 minutes), only containers that have been running for 3 minutes or longer are considered stable enough for optimization. Containers younger than 3 minutes are excluded, as they might not have sufficient runtime data for accurate assessment. | Disabled |
| Enable Trend Stability Test | Enables evaluation of stability trends across multiple application runs to identify consistent performance patterns. Sub-option: Trend Stability Hard Filter, which applies strict filtering based on trend stability results. If the 95th percentile of current metric values deviates significantly from the historical fingerprint data, the application is excluded from optimization calculations. | Disabled |
| Enable Fingerprint Trust Test | Enables validation of container fingerprint trust levels for screening accuracy. When enabled, the optimizer compares current runtime data with historical fingerprint data to ensure stability before optimization. Sub-options: a coefficient of variation (CV) limit and a minimum successful-run count. By default, a CV value of 3 means runtime memory can vary up to three times the fingerprinted memory before being flagged as unstable. Example: If a container's fingerprinted memory usage is 500 MB, the optimizer allows runtime memory usage up to about 1.5 GB (≈3×); if the current runtime usage exceeds this limit, the container is marked unstable and excluded from optimization. Example: A container must have at least 20 successful runs to build a trusted fingerprint before being considered for optimization. | Disabled |
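To summarize how the screening tests act as filters, here is a minimal sketch built from the example values in the table; the function names, default parameters, and decision logic are assumptions for illustration only.

```python
# Illustrative screening filters built from the example values in the table.
# A container must pass every enabled test to stay eligible for optimization.

def passes_realtime_usage(current_mb, historical_avg_mb, threshold_factor=1.2):
    """Realtime usage test: usage above factor x historical average is flagged unstable."""
    return current_mb <= historical_avg_mb * threshold_factor

def passes_age(age_seconds, minimum_age_seconds=180):
    """Age test: only containers running at least the minimum age (e.g., 3m) qualify."""
    return age_seconds >= minimum_age_seconds

def passes_fingerprint_trust(current_mb, fingerprint_mb, runs, cv_limit=3, min_runs=20):
    """Fingerprint trust test: runtime memory within cv_limit x fingerprint, with enough runs."""
    return runs >= min_runs and current_mb <= fingerprint_mb * cv_limit


eligible = (passes_realtime_usage(1100, 1000)                  # 1.1 GB vs 1 GB average, within 1.2x
            and passes_age(240)                                # running for 4 minutes
            and passes_fingerprint_trust(1400, 500, runs=25))  # under 1.5 GB (3 x 500 MB)
print("eligible for optimization" if eligible else "excluded by screening")
```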