Troubleshooting JVM: JStack, Heap Dumps and JFR
Overview
Java Virtual Machine (JVM) troubleshooting requires systematic data collection and analysis to identify performance bottlenecks, memory leaks, and threading issues. This page provides practical approaches for collecting diagnostic dumps and leveraging them for effective problem resolution in production environments.
Prerequisites and Setup
- Before beginning troubleshooting activities, ensure your environment is properly configured for diagnostic data collection.
- Enable core dump generation with ulimit -c unlimited, and consider using the -XX:+HeapDumpOnOutOfMemoryError flag to automatically capture heap dumps on memory errors.
- For continuous monitoring, set up Java Flight Recorder in continuous recording mode, which adds minimal overhead.
Thread Dump Collection with jstack
Finding the Java Process
First, identify the target Java process using one of the following methods:
# List all Java processes
jps -l
# Alternative method using ps
ps -ef | grep java
Collecting Thread Dumps
Thread dumps provide snapshots of all active threads and their current state. The jstack utility is the primary tool used for collecting thread dumps.
# Basic thread dump collection
jstack <pid>
# Thread dump with extended lock information
jstack -l <pid>
# Force thread dump for hung processes (Linux/Solaris only)
jstack -F <pid>
For modern JDK versions, jcmd is the recommended alternative to jstack due to enhanced diagnostics and reduced performance overhead.
# Using jcmd for thread dumps
jcmd <pid> Thread.print
Alternative Collection Methods
Thread dumps can also be generated through:
- Java VisualVM graphical interface
- Java Mission Control (JMC)
- Programmatically using Thread.getAllStackTraces()
- Sending the QUIT signal (kill -3 <pid>, or Ctrl+\ in the process's terminal on Unix systems)
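The programmatic option above can be sketched with the standard ThreadMXBean API, which produces jstack-like output for the current JVM (the class name ThreadDumpExample is illustrative):

```java
import java.lang.management.ManagementFactory;
import java.lang.management.ThreadInfo;
import java.lang.management.ThreadMXBean;

public class ThreadDumpExample {

    /** Builds a jstack-like thread dump of the current JVM. */
    static String dumpThreads() {
        StringBuilder out = new StringBuilder();
        ThreadMXBean mx = ManagementFactory.getThreadMXBean();
        // true, true: include locked monitors and ownable synchronizers,
        // like jstack -l; ThreadInfo.toString() prints the thread name,
        // state, and the top stack frames
        for (ThreadInfo info : mx.dumpAllThreads(true, true)) {
            out.append(info.toString());
        }
        return out.toString();
    }

    public static void main(String[] args) {
        System.out.println(dumpThreads());
    }
}
```

This is useful for embedding a dump endpoint in an application, but for ad hoc collection the external tools above avoid adding code to the target process.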
Heap Dump Collection
Using jcmd (Recommended over jmap)
The jcmd utility provides the most reliable method for heap dump generation.
# Generate heap dump with jcmd
jcmd <pid> GC.heap_dump /path/to/heapdump.hprof
# Dump all objects, including unreachable ones (the default dumps only live objects)
jcmd <pid> GC.heap_dump -all /path/to/heapdump.hprof
Using jmap
While jmap is still available, it is considered experimental and unsupported in newer JDK versions.
# Basic heap dump with jmap
jmap -dump:format=b,file=/tmp/heapdump.hprof <pid>
# Heap dump with only reachable objects
jmap -dump:live,format=b,file=/tmp/heapdump.hprof <pid>
Using Java VisualVM
Java VisualVM provides a graphical interface for heap dump collection. Connect to the target process and use the "Heap Dump" button in the application tab. The dump can be analyzed immediately or saved for later analysis.
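Heap dumps can also be triggered from within the application itself via the HotSpotDiagnosticMXBean, which is what the tools above use under the hood on HotSpot JVMs. A minimal sketch (the class name HeapDumpExample is illustrative; requires a HotSpot-based JDK):

```java
import com.sun.management.HotSpotDiagnosticMXBean;
import java.io.IOException;
import java.lang.management.ManagementFactory;

public class HeapDumpExample {

    /** Writes an .hprof heap dump of the current JVM to the given path. */
    static void dumpHeap(String path, boolean liveOnly) throws IOException {
        HotSpotDiagnosticMXBean bean =
            ManagementFactory.getPlatformMXBean(HotSpotDiagnosticMXBean.class);
        // liveOnly=true dumps only reachable objects; the target file
        // must not already exist and must end in .hprof on recent JDKs
        bean.dumpHeap(path, liveOnly);
    }

    public static void main(String[] args) throws IOException {
        dumpHeap("heapdump.hprof", true);
    }
}
```

Note that this pauses the calling JVM just like the external tools do, so the same production cautions apply.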
Java Flight Recorder (JFR) Collection (Highly Recommended)
Starting JFR at Application Startup
Enable JFR from application launch for comprehensive profiling.
# Basic JFR recording at startup
java -XX:StartFlightRecording=filename=recording.jfr,duration=60s MyApp
# JFR with profiling settings
java -XX:StartFlightRecording=filename=recording.jfr,duration=60s,settings=profile MyApp
# Continuous recording with size and time limits
java -XX:StartFlightRecording=filename=recording.jfr,maxage=4h,maxsize=400MB MyApp
Runtime JFR Collection
For running applications, use jcmd to control JFR recordings.
# Start a JFR recording
jcmd <pid> JFR.start name=myrecording duration=120s filename=recording.jfr
# Start with profiling settings for detailed analysis
jcmd <pid> JFR.start name=detailed duration=300s filename=detailed.jfr settings=profile
# Stop a specific recording
jcmd <pid> JFR.stop name=myrecording
# Dump current recording data
jcmd <pid> JFR.dump name=myrecording filename=dump.jfr
JFR Configuration Options
JFR supports various configuration parameters for customized data collection.
- duration: Maximum recording duration
- maxage: Maximum age of recorded data to keep
- maxsize: Maximum size of recording data
- settings: Predefined configuration (default, profile, custom)
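The same options are available in-process through the jdk.jfr Recording API (JDK 11+, or 8u262+), which mirrors the CLI parameters above. A minimal sketch (the class name JfrRecordingExample is illustrative):

```java
import jdk.jfr.Recording;
import java.nio.file.Path;
import java.time.Duration;

public class JfrRecordingExample {

    /** Runs a JFR recording with limits matching the CLI options above. */
    static void record(Path output) throws Exception {
        try (Recording recording = new Recording()) {
            recording.setMaxAge(Duration.ofHours(4));  // like maxage=4h
            recording.setMaxSize(400 * 1024 * 1024L);  // like maxsize=400MB
            recording.start();
            // ... application work to be profiled happens here ...
            recording.stop();
            recording.dump(output);                    // like jcmd ... JFR.dump
        }
    }

    public static void main(String[] args) throws Exception {
        record(Path.of("recording.jfr"));
    }
}
```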
Troubleshooting Strategies Using Dump Analysis
Thread Dump Analysis Strategies
Identifying Deadlocks
Thread dumps automatically detect and report deadlocks. Look for threads in the BLOCKED state and examine the lock chain to identify circular dependencies.
"Thread-1" #10 prio=5 os_prio=0 tid=0x... nid=0x... waiting for monitor entry
java.lang.Thread.State: BLOCKED (on object monitor)
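Deadlocks can also be detected programmatically with ThreadMXBean, which is the same check the dump tools run. A minimal sketch (the class name DeadlockDetector is illustrative):

```java
import java.lang.management.ManagementFactory;
import java.lang.management.ThreadInfo;
import java.lang.management.ThreadMXBean;

public class DeadlockDetector {

    /** Returns info on deadlocked threads, or an empty array if none. */
    static ThreadInfo[] findDeadlocks() {
        ThreadMXBean mx = ManagementFactory.getThreadMXBean();
        // null means no threads are deadlocked on monitors or synchronizers
        long[] ids = mx.findDeadlockedThreads();
        return ids == null ? new ThreadInfo[0] : mx.getThreadInfo(ids, true, true);
    }

    public static void main(String[] args) {
        for (ThreadInfo info : findDeadlocks()) {
            System.out.println(info);
        }
    }
}
```

A periodic check like this can feed a health endpoint, flagging deadlocks without waiting for a manual thread dump.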
Analyzing Thread States
Focus on thread state distribution to identify performance issues:
- RUNNABLE: Threads actively executing or ready to execute
- BLOCKED: Threads waiting for monitor locks
- WAITING: Threads waiting indefinitely for another thread
- TIMED_WAITING: Threads waiting for a specified period
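A quick way to see this state distribution at runtime is to build a histogram over all live threads (the class name StateHistogram is illustrative):

```java
import java.lang.management.ManagementFactory;
import java.lang.management.ThreadInfo;
import java.lang.management.ThreadMXBean;
import java.util.EnumMap;
import java.util.Map;

public class StateHistogram {

    /** Counts live threads per state — a quick view of where time is spent. */
    static Map<Thread.State, Integer> stateHistogram() {
        ThreadMXBean mx = ManagementFactory.getThreadMXBean();
        Map<Thread.State, Integer> histogram = new EnumMap<>(Thread.State.class);
        // false, false: skip lock details — we only need the states
        for (ThreadInfo info : mx.dumpAllThreads(false, false)) {
            histogram.merge(info.getThreadState(), 1, Integer::sum);
        }
        return histogram;
    }

    public static void main(String[] args) {
        System.out.println(stateHistogram());
    }
}
```

A large BLOCKED count relative to RUNNABLE is the numeric signature of the contention problems described next.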
Thread Contention Analysis
Examine threads waiting on the same monitors to identify contention hotspots. Multiple threads blocked on identical resources indicate synchronization bottlenecks.
Heap Dump Analysis Strategies
Memory Leak Detection
Use Eclipse MAT, JMC, VisualVM, or similar tools for automated leak detection. The automated analysis feature generates leak-suspect reports with minimal user intervention:
- Load the heap dump in your chosen tool
- Click "Leak Suspects" for automated analysis
- Review suspect objects and their reference chains
- Identify objects that should have been garbage collected
Object Retention Analysis
Analyze object retention patterns by examining:
- Largest objects by size
- Object count by class
- Reference chains preventing garbage collection
- Duplicate strings and arrays
Memory Usage Optimization
Focus on:
- Classes consuming the most memory
- Objects with excessive duplication
- Large collections that may need optimization
- String interning opportunities
JFR Analysis Strategies
Performance Bottleneck Identification: JFR provides comprehensive performance insights through Java Mission Control:
- Method Profiling: Identify CPU-intensive methods through stack trace sampling. Look for methods consuming disproportionate CPU time.
- Memory Allocation Analysis: Track object allocation patterns to identify memory pressure sources. Focus on allocation rate and object types.
- Garbage Collection Analysis: Monitor GC frequency, duration, and impact on application performance. Excessive GC activity indicates memory tuning opportunities.
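Beyond the built-in events, applications can emit their own JFR events so that business operations show up alongside method profiles and GC data in JMC. A minimal sketch, assuming JDK 11+ (the event name example.OrderProcessed and the class names are hypothetical):

```java
import jdk.jfr.Event;
import jdk.jfr.Label;
import jdk.jfr.Name;

public class OrderEventExample {

    // A custom application event, recorded alongside built-in JFR events
    @Name("example.OrderProcessed")
    @Label("Order Processed")
    static class OrderProcessed extends Event {
        @Label("Order Id")
        String orderId;
    }

    static void processOrder(String id) {
        OrderProcessed event = new OrderProcessed();
        event.begin();          // start timing this operation
        // ... business logic being measured goes here ...
        event.orderId = id;
        event.commit();         // records duration and fields if JFR is active
    }

    public static void main(String[] args) {
        processOrder("A-1");
    }
}
```

When no recording is active, commit() is essentially free, so such events can stay in production code.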
Threading and Concurrency Issues
JFR captures detailed threading information:
- Thread contention events showing lock competition
- Thread parking and blocking events
- Context switching frequency and overhead
- Thread pool utilization patterns
I/O and Network Performance
Monitor I/O operations and network activity:
- File I/O latency and throughput
- Network connection patterns
- Database query performance
- Resource utilization trends
JVM Diagnostics: Performance Impact and Pauses
Collecting JVM diagnostic dumps can cause significant performance impact and application pauses. Each method has different overhead characteristics and pause behaviors that you need to understand before using it in production environments.
Thread Dump Collection Impact
- jstack Performance Impact
Thread dump collection using jstack causes stop-the-world pauses in which all application threads are suspended. This occurs because jstack requires all threads to reach a safepoint before the dump can be generated. The pause is typically brief (milliseconds to seconds) but varies with application complexity and thread count.
In forced mode (jstack -F), the utility operates through the HotSpot Serviceability Agent, which suspends the entire target process during execution. In that mode, not only are application threads stopped, but the whole process becomes unresponsive during dump collection.
- jcmd Thread Dump Performance
The jcmd utility is recommended over jstack for modern JDK versions due to enhanced diagnostics and reduced performance overhead. While jcmd still requires safepoint synchronization, it generally has lower impact than legacy tools. The difference becomes more pronounced in high-throughput applications, where even brief pauses can affect response times.
Heap Dump Collection Impact
- Stop-the-World Behavior
Heap dump collection represents one of the most significant performance impacts among diagnostic methods. Heap dumps are stop-the-world operations that pause all application activity during collection. This pause can last from seconds to multiple minutes depending on heap size.
- jmap vs jcmd Performance Comparison
Both jmap and jcmd cause application pauses during heap dump generation, but with different characteristics:
- jmap: Uses the Serviceability Agent approach, which suspends the entire target process; options such as -heap likewise cause stop-the-world pauses
- jcmd: Performs the heap dump in-process through the Dynamic Attach Mechanism, which creates an AttachListener thread to write the dump while application threads are paused at a safepoint
- Production Environment Risks
In production environments, heap dump collection can cause health check failures and service termination. Large heap sizes (100GB+) can result in multi-minute pauses that trigger monitoring systems to restart applications. Additionally, heap dumps require substantial disk space equal to the heap size, potentially filling disk partitions if insufficient storage is available.
Java Flight Recorder (JFR) Overhead
- Minimal Production Impact
JFR represents the lowest overhead option among all diagnostic methods. The overhead for standard profiling recordings is less than 2 percent for most applications. Running with continuous recording generally has no measurable performance impact, making it suitable for production environments.
- Configuration Considerations
The primary performance consideration with JFR involves Heap Statistics events, which are disabled by default. When enabled, these events trigger old generation garbage collections at recording start and end, adding pause times that may impact latency-sensitive applications.
Safepoint Pause Characteristics
- Time-to-Safepoint (TTSP) Issues
Application pauses during diagnostic collection depend heavily on time-to-safepoint behavior. Some threads may take seconds or even minutes to reach a safepoint, especially in applications with long-running counted loops or extensive native code execution. This can make diagnostic operations take much longer than expected; on JDK 9+, safepoint and TTSP times can be observed with -Xlog:safepoint.
Safepoint Operation Types
Different diagnostic operations trigger different safepoint types:
- Thread dumps: Require global safepoints for consistent thread state capture
- Heap dumps: Need safepoints for heap consistency during the memory snapshot
- JFR events: Most events require minimal safepoint coordination
Production Environment Recommendations
Risk Mitigation Strategies
To minimize production impact when collecting diagnostic data:
- Pre-allocate sufficient disk space for heap dumps to prevent storage exhaustion
- Use cloud resources with adequate memory and storage for large heap analysis
- Schedule collection during maintenance windows when possible
- Implement automated collection with -XX:+HeapDumpOnOutOfMemoryError for critical failures
Tool Selection Guidelines
Based on performance impact considerations:
- JFR: Preferred for continuous production monitoring due to minimal overhead
- jcmd: Recommended over legacy tools for better performance and enhanced features
- jstack/jmap: Use sparingly in production due to stop-the-world behavior
Best Practices and Recommendations
Production Environment Considerations
When collecting diagnostic data in production:
- Use continuous JFR recordings with circular buffers to minimize storage impact
- Enable automatic heap dump generation on OutOfMemoryError
- Maintain verbose GC logging (for example, -Xlog:gc* on JDK 9+) for historical memory analysis
- Archive diagnostic data regularly during maintenance windows
Performance Impact Considerations
Minimize diagnostic overhead:
- JFR overhead is typically less than 2% for standard profiling
- Heap dump collection can pause the application for seconds to minutes on large heaps
- Thread dump collection causes only brief safepoint pauses and has otherwise minimal impact
- Avoid enabling Heap Statistics events in latency-sensitive environments