Troubleshooting the JVM: jstack, Heap Dumps, and JFR

Overview

Java Virtual Machine (JVM) troubleshooting requires systematic data collection and analysis to identify performance bottlenecks, memory leaks, and threading issues. This page provides practical approaches for collecting diagnostic dumps and leveraging them for effective problem resolution in production environments.

Prerequisites and Setup

Before beginning troubleshooting activities, ensure your environment is properly configured for diagnostic data collection:

  • Enable core dump generation with ulimit -c unlimited and consider using the -XX:+HeapDumpOnOutOfMemoryError flag to automatically capture heap dumps during memory errors.
  • For continuous monitoring, set up Java Flight Recorder with minimal overhead using continuous recording mode.
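
As a rough sketch of the settings above, the launch snippet below raises the core file limit and assembles the heap dump flags. The dump directory /var/dumps, the variable name JVM_DIAG_OPTS, and app.jar are placeholders chosen for illustration, not values from this page.

```shell
# Allow core files for processes started from this shell (a hard limit may still cap it).
ulimit -c unlimited 2>/dev/null || echo "could not raise core file limit" >&2

# Hypothetical dump directory; it must exist and have room for a full heap dump.
DUMP_DIR="/var/dumps"

# Write a heap dump automatically when the JVM throws OutOfMemoryError.
JVM_DIAG_OPTS="-XX:+HeapDumpOnOutOfMemoryError -XX:HeapDumpPath=$DUMP_DIR"

# app.jar is a placeholder; remove the echo to actually launch the application.
echo "java $JVM_DIAG_OPTS -jar app.jar"
```

When -XX:HeapDumpPath points to a directory, the JVM picks the file name itself, so the directory only needs to be writable by the application user.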

Thread Dump Collection with jstack

Finding the Java Process

First, identify the target Java process using one of the following methods:
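
One possible way to list candidate processes is sketched below. The function name find_java_processes is made up for this example; jcmd -l and jps -l ship with the JDK, and the ps fallback assumes a Unix-like system.

```shell
# Print "<pid> <main class or jar>" for each JVM visible to the current user.
find_java_processes() {
  if command -v jcmd >/dev/null 2>&1; then
    jcmd -l
  elif command -v jps >/dev/null 2>&1; then
    jps -l
  else
    # Fallback without JDK tools; the [j] keeps grep from matching itself.
    ps -eo pid,args | grep '[j]ava'
  fi
}
```

Run the tools as the same user that owns the target JVM; otherwise the process may not be listed.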


Collecting Thread Dumps

Thread dumps provide snapshots of all active threads and their current state. The jstack utility is the traditional tool for collecting them.
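
A minimal sketch of jstack-based collection follows. Taking several dumps a few seconds apart is an assumption of this example (it makes stuck threads easier to spot by comparison), and the function name, file names, and defaults are illustrative only.

```shell
# Usage: collect_thread_dumps <pid> [count] [interval_seconds]
# The function body runs in a subshell so its variables do not leak into the caller.
collect_thread_dumps() (
  pid="$1"
  count="${2:-3}"
  interval="${3:-5}"
  i=1
  while [ "$i" -le "$count" ]; do
    # -l adds information about java.util.concurrent locks to each dump.
    jstack -l "$pid" > "threaddump-$pid-$i.txt" || exit 1
    if [ "$i" -lt "$count" ]; then sleep "$interval"; fi
    i=$((i + 1))
  done
)
```

For example, collect_thread_dumps 12345 3 5 would write three files named threaddump-12345-1.txt through threaddump-12345-3.txt to the current directory, where 12345 stands in for a real PID.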


For modern JDK versions, jcmd is the recommended alternative to jstack due to enhanced diagnostics and reduced performance overhead.
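
The jcmd equivalent can be sketched as below. Thread.print is the jcmd diagnostic command for thread dumps; the wrapper name jcmd_thread_dump and the default file name are assumptions of this example.

```shell
# Usage: jcmd_thread_dump <pid> [output_file]
jcmd_thread_dump() {
  # Thread.print is the jcmd counterpart of jstack; -l includes java.util.concurrent locks.
  jcmd "$1" Thread.print -l > "${2:-threaddump-$1.txt}"
}
```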


Alternative Collection Methods

Thread dumps can also be generated through:

  • Java VisualVM graphical interface
  • Java Mission Control (JMC)
  • Programmatically using Thread.getAllStackTraces()
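  • Sending the QUIT signal on Unix systems (kill -3 <pid>, or Ctrl+\ in the terminal where the JVM runs in the foreground); the dump is written to the JVM's standard output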

Heap Dump Collection

The jcmd utility provides the most reliable method for heap dump generation.
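
A hedged sketch of a jcmd heap dump wrapper is shown below. GC.heap_dump is the jcmd diagnostic command; the function name, the /tmp default, and the file naming scheme are assumptions made for illustration.

```shell
# Usage: jcmd_heap_dump <pid> [directory]
# The function body runs in a subshell so its variables stay local.
jcmd_heap_dump() (
  pid="$1"
  dir="${2:-/tmp}"
  # Use an absolute path: the target JVM writes the file, so relative names
  # are resolved against its working directory, not this shell's.
  file="$dir/heap-$pid-$(date +%Y%m%d-%H%M%S).hprof"
  # By default only live (reachable) objects are dumped; add -all to include unreachable ones.
  jcmd "$pid" GC.heap_dump "$file" && echo "$file"
)
```

The function prints the path of the dump on success, which makes it easy to hand the file to an analysis tool afterwards.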


Using jmap

While jmap is available, it's considered experimental and unsupported in newer JDK versions.
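
If jmap is the only tool available, a dump can be requested roughly as follows. The wrapper name jmap_heap_dump is illustrative; the -dump suboptions shown are the standard ones.

```shell
# Usage: jmap_heap_dump <pid> <absolute_output_file>
jmap_heap_dump() {
  # live     -> dump only reachable objects
  # format=b -> binary HPROF format
  # file=    -> destination file, written by the target JVM
  jmap -dump:live,format=b,file="$2" "$1"
}
```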


Using Java VisualVM

Java VisualVM provides a graphical interface for heap dump collection. Connect to the target process and use the "Heap Dump" button on the Monitor tab of the application view. The dump can be analyzed immediately or saved for later analysis.

Java Flight Recorder (JFR) Collection

Starting JFR at Application Startup

Enable JFR from application launch for comprehensive profiling.
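
The snippet below sketches a startup recording, assuming JDK 11 or later where JFR needs no extra unlock flags. The recording name, duration, output path, and app.jar are placeholders.

```shell
# Start a two-minute profiling recording as soon as the JVM launches.
JFR_START_OPTS="-XX:StartFlightRecording=name=startup,settings=profile,duration=120s,filename=/var/tmp/startup.jfr"

# app.jar is a placeholder; remove the echo to actually launch the application.
echo "java $JFR_START_OPTS -jar app.jar"
```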


Runtime JFR Collection

For running applications, use jcmd to control JFR recordings.
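
A sketch of the runtime workflow follows. JFR.start, JFR.check, JFR.dump, and JFR.stop are the jcmd diagnostic commands; the wrapper names and the recording name adhoc are assumptions of this example.

```shell
# Usage: jfr_record <pid> <seconds> <absolute_output_file>
# Starts a time-limited profiling recording that is written out when it ends.
jfr_record() {
  jcmd "$1" JFR.start name=adhoc settings=profile duration="${2}s" filename="$3"
}

# List the recordings currently known to the target JVM.
jfr_check() { jcmd "$1" JFR.check; }

# Usage: jfr_dump <pid> <absolute_output_file>
# Writes the data collected so far without stopping the recording.
jfr_dump() { jcmd "$1" JFR.dump name=adhoc filename="$2"; }

# Stop the recording started by jfr_record.
jfr_stop() { jcmd "$1" JFR.stop name=adhoc; }
```

A typical sequence would be jfr_record, then jfr_check to confirm the recording is running, and jfr_dump or jfr_stop once the problem has been reproduced.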


JFR Configuration Options

JFR supports various configuration parameters for customized data collection.

  • duration: Maximum recording duration
  • maxage: Maximum age of recorded data to keep
  • maxsize: Maximum size of recording data
  • settings: Event configuration to use: default, profile, or the path to a custom .jfc file
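
To illustrate how these parameters combine, the sketch below configures a continuous recording that keeps a bounded window of data on disk. The values 6h and 500m, the recording name, and the file path are illustrative assumptions, not recommendations from this page.

```shell
# Continuous recording with the low-overhead default settings.
# maxage/maxsize bound the data kept on disk; older data is discarded first.
# dumponexit writes whatever is retained to filename when the JVM shuts down.
JFR_CONTINUOUS_OPTS="-XX:StartFlightRecording=name=continuous,settings=default,disk=true,maxage=6h,maxsize=500m,dumponexit=true,filename=/var/tmp/continuous.jfr"

# app.jar is a placeholder; remove the echo to actually launch the application.
echo "java $JFR_CONTINUOUS_OPTS -jar app.jar"
```

No duration is set here, so the recording runs until the JVM exits or the recording is stopped explicitly.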

Troubleshooting Strategies Using Dump Analysis

Thread Dump Analysis Strategies

Identifying Deadlocks

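Thread dumps from jstack and jcmd report Java-level deadlocks automatically, in a "Found one Java-level deadlock" section near the end of the dump. To understand a reported deadlock, look at the threads in BLOCKED state and follow the lock chain to identify the circular dependency.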
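
A saved dump can be checked for a deadlock report with a small helper like the one below. It relies on the "Java-level deadlock" text that HotSpot prints in the dump; the function name and messages are assumptions of this sketch.

```shell
# Usage: report_deadlocks <thread_dump_file>
report_deadlocks() {
  if grep -q 'Java-level deadlock' "$1"; then
    # Print from the first deadlock report to the end of the dump.
    sed -n '/Java-level deadlock/,$p' "$1"
  else
    echo "no deadlock reported in $1"
  fi
}
```

The printed section lists each thread involved, the monitor it is waiting for, and the thread that currently holds that monitor.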


Analyzing Thread States

Focus on thread state distribution to identify performance issues:

  • RUNNABLE: Threads actively executing or ready to execute
  • BLOCKED: Threads waiting for monitor locks
  • WAITING: Threads waiting indefinitely for another thread
  • TIMED_WAITING: Threads waiting for a specified period
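
The state distribution can be summarized from a saved dump with a short awk pipeline. This is a sketch that assumes the HotSpot dump format, where each thread has a "java.lang.Thread.State:" line; the function name is illustrative.

```shell
# Usage: thread_state_counts <thread_dump_file>
# Prints "<count> <state>" per thread state, most frequent first.
thread_state_counts() {
  awk '/java.lang.Thread.State:/ { n[$2]++ }
       END { for (s in n) print n[s], s }' "$1" | sort -rn
}
```

A dump dominated by BLOCKED threads points toward lock contention, while a large share of RUNNABLE threads is more consistent with CPU-bound work.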

Thread Contention Analysis

Examine threads waiting on the same monitors to identify contention hotspots. Multiple threads blocked on identical resources indicate synchronization bottlenecks.
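
Contention hotspots can be ranked by counting how many threads wait for each monitor. The sketch below assumes the "- waiting to lock <address>" lines of a HotSpot thread dump; the function name is made up for this example.

```shell
# Usage: contended_monitors <thread_dump_file>
# Prints "<waiting threads> <monitor address> (a class)" sorted by waiter count.
contended_monitors() {
  awk '/- waiting to lock </ {
         sub(/.*- waiting to lock /, "")   # keep only the monitor address and class
         waiters[$0]++
       }
       END { for (m in waiters) print waiters[m], m }' "$1" | sort -rn
}
```

Searching the same dump for "- locked" followed by the top address shows which thread currently owns the contended monitor.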

Heap Dump Analysis Strategies

Memory Leak Detection

Use a heap analyzer such as Eclipse Memory Analyzer (MAT) or Java VisualVM for leak detection. In MAT, the automated analysis generates a leak suspects report with minimal user intervention:

  1. Load the heap dump in the tool
  2. Run the "Leak Suspects" report for automated analysis
  3. Review suspect objects and their reference chains
  4. Identify objects that should have been garbage collected

Object Retention Analysis

Analyze object retention patterns by examining:

  • Largest objects by size
  • Object count by class
  • Reference chains preventing garbage collection
  • Duplicate strings and arrays

Memory Usage Optimization

Focus on:

  • Classes consuming the most memory
  • Objects with excessive duplication
  • Large collections that may need optimization
  • String interning opportunities

JFR Analysis Strategies

Performance Bottleneck Identification

JFR provides comprehensive performance insights through Java Mission Control:

  • Method Profiling: Identify CPU-intensive methods through stack trace sampling. Look for methods consuming disproportionate CPU time.
  • Memory Allocation Analysis: Track object allocation patterns to identify memory pressure sources. Focus on allocation rate and object types.
  • Garbage Collection Analysis: Monitor GC frequency, duration, and impact on application performance. Excessive GC activity indicates memory tuning opportunities.

Threading and Concurrency Issues

JFR captures detailed threading information:

  • Thread contention events showing lock competition
  • Thread parking and blocking events
  • Context switching frequency and overhead
  • Thread pool utilization patterns

I/O and Network Performance

Monitor I/O operations and network activity:

  • File I/O latency and throughput
  • Network connection patterns
  • Database query performance, inferred from socket I/O events or custom application events
  • Resource utilization trends

JVM Diagnostics: Performance Impact and Pauses

Collecting JVM diagnostic dumps can pause the application and degrade performance. Each method has different overhead and pause behavior, which you need to understand before using it in production.

Thread Dump Collection Impact

  • jstack Performance Impact

Thread dump collection using jstack causes stop-the-world pauses where all application threads are suspended. This occurs because jstack requires all threads to reach a safepoint before the dump can be generated. The pause duration is typically brief (milliseconds to seconds) but can vary depending on application complexity and thread count.

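In forced mode (jstack -F on older JDKs, or jhsdb jstack on JDK 9 and later), the dump is taken through the HotSpot Serviceability Agent, which suspends the entire target process while it runs. In that mode not only are application threads stopped, but the whole process is unresponsive during dump collection. A regular jstack call attaches through the Dynamic Attach Mechanism and only needs a safepoint.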

  • jcmd Thread Dump Performance

The jcmd utility is recommended over jstack for modern JDK versions due to enhanced diagnostics and reduced performance overhead. While jcmd still requires safepoint synchronization, it generally has lower impact than legacy tools. The performance difference becomes more pronounced in high-throughput applications where even brief pauses can affect response times.

Heap Dump Collection Impact

  • Stop-the-World Behavior

Heap dump collection represents one of the most significant performance impacts among diagnostic methods. Heap dumps are stop-the-world operations that pause all application activity during collection. This pause can last from seconds to multiple minutes depending on heap size.

  • jmap vs jcmd Performance Comparison

Both jmap and jcmd cause application pauses during heap dump generation, but with different characteristics:

  • jmap: A regular jmap -dump attaches through the Dynamic Attach Mechanism, like jcmd. In forced mode (-F), and for the -heap option on older JDKs, it uses the Serviceability Agent, which suspends the entire target process
  • jcmd: Performs heap dumps in-process through the Dynamic Attach Mechanism: the AttachListener thread receives the request, and the dump runs while application threads are paused at a safepoint

  • Production Environment Risks

In production environments, heap dump collection can cause health check failures and service termination. Large heaps (100 GB or more) can result in multi-minute pauses that lead monitoring systems to restart the application. Heap dumps also need substantial disk space, up to roughly the size of the heap, and can fill a disk partition if insufficient storage is available.
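
A pre-flight check along these lines can reduce the storage risk. This sketch conservatively compares free space with the JVM's maximum heap size, read from the -XX:MaxHeapSize value printed by jcmd VM.flags; the function names are assumptions of this example.

```shell
# Maximum heap size of the target JVM in KiB (VM.flags reports it in bytes).
max_heap_kb() {
  jcmd "$1" VM.flags | tr ' ' '\n' | sed -n 's/^-XX:MaxHeapSize=//p' |
    awk '{ print int($1 / 1024) }'
}

# Free space in KiB on the filesystem that holds the given directory.
free_kb() {
  df -Pk "$1" | awk 'NR == 2 { print $4 }'
}

# Usage: can_dump_heap <pid> <dump_directory>
# Succeeds only when free space covers the maximum heap size.
can_dump_heap() {
  [ "$(free_kb "$2")" -ge "$(max_heap_kb "$1")" ]
}
```

It could be combined with a dump command as can_dump_heap 12345 /var/dumps && jcmd 12345 GC.heap_dump /var/dumps/heap.hprof, where the PID and paths are placeholders.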

Java Flight Recorder (JFR) Overhead

  • Minimal Production Impact

JFR represents the lowest overhead option among all diagnostic methods. The overhead for standard profiling recordings is less than 2 percent for most applications. Running with continuous recording generally has no measurable performance impact, making it suitable for production environments.

  • Configuration Considerations

The primary performance consideration with JFR involves Heap Statistics events, which are disabled by default. When enabled, these events trigger old generation garbage collections at recording start and end, adding pause times that may impact latency-sensitive applications.

Safepoint Pause Characteristics

  • Time-to-Safepoint (TTSP) Issues

Application pauses during diagnostic collection depend heavily on time-to-safepoint behavior. Some threads may take seconds or even minutes to reach a safepoint, especially in applications with long-running loops that the JIT compiler emits without safepoint polls. This can result in diagnostic operations taking much longer than expected.

Safepoint Operation Types

Different diagnostic operations trigger various safepoint types:

  • Thread dumps: Require global safepoints for consistent thread state capture
  • Heap dumps: Need safepoints for heap consistency during memory snapshot
  • JFR events: Most events require minimal safepoint coordination

Production Environment Recommendations

Risk Mitigation Strategies

To minimize production impact when collecting diagnostic data:

  • Pre-allocate sufficient disk space for heap dumps to prevent storage exhaustion
  • Use cloud resources with adequate memory and storage for large heap analysis
  • Schedule collection during maintenance windows when possible
  • Implement automated collection with -XX:+HeapDumpOnOutOfMemoryError for critical failures

Tool Selection Guidelines

Based on performance impact considerations:

  • JFR: Preferred for continuous production monitoring due to minimal overhead
  • jcmd: Recommended over legacy tools for better performance and enhanced features
  • jstack/jmap: Use sparingly in production due to stop-the-world behavior

Best Practices and Recommendations

Production Environment Considerations

When collecting diagnostic data in production:

  • Use continuous JFR recordings with circular buffers to minimize storage impact
  • Enable automatic heap dump generation on OutOfMemoryError
  • Maintain GC logging (-verbose:gc, or -Xlog:gc* on JDK 9 and later) for historical memory analysis
  • Archive diagnostic data regularly during maintenance windows

Performance Impact Considerations

Minimize diagnostic overhead:

  • JFR overhead is typically less than 2% for standard profiling
  • Heap dump collection pauses the application for seconds to minutes, depending on heap size
  • Thread dump collection causes a brief safepoint pause, typically milliseconds to seconds
  • Avoid enabling heap statistics in latency-sensitive environments