Enabling ODP Ranger KMS High Availability (HA) and Troubleshooting
Debugging Ranger KMS when HA is Enabled
This document provides a structured approach to diagnosing and troubleshooting performance issues with Ranger KMS (Key Management Server) in a High Availability (HA) setup. It covers key configuration validation, ZooKeeper cleanup, system health checks, data collection, and advanced debugging techniques to ensure efficient encryption key management in HDFS.
Enable KMS HA
- Stop the KMS service during off-peak hours.
- Clean up any stale data in ZooKeeper:
zookeeper-client -server <zk-hostname>:2181
deleteall /zkdtsm
quit
- Set or update the following properties under Ambari → Ranger KMS → Configuration:
hadoop.kms.cache.enable=true
hadoop.kms.authentication.zk-dt-secret-manager.enable=true
hadoop.kms.authentication.signer.secret.provider=zookeeper
hadoop.kms.proxyuser.hdfs.groups=*
hadoop.kms.proxyuser.hdfs.hosts=*
hadoop.kms.proxyuser.hdfs.users=*
- Separate Note (on ZooKeeper connection issues): If KMS fails to start with the following error, add the two properties shown after it:
java.lang.NullPointerException: Zookeeper connection string cannot be null
at com.google.common.base.Preconditions.checkNotNull(Preconditions.java:895)
hadoop.kms.authentication.zk-dt-secret-manager.zkConnectionString=<zookeeper quorum>
hadoop.kms.authentication.zk-dt-secret-manager.zkAuthType=none
- Make sure to replace <zookeeper quorum> with the correct ZooKeeper hosts and ports (for example: zk1.example.com:2181,zk2.example.com:2181,zk3.example.com:2181).
- Start or restart the KMS service. Validate KMS connectivity to ZooKeeper and confirm it is fully operational (a quick availability check is sketched below).
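As a quick availability check, each KMS instance can be queried directly over the Hadoop KMS REST API. This is a minimal sketch assuming the default Ranger KMS HTTP port 9292 and a Kerberized cluster (run kinit as a user allowed to list keys first); adjust hostnames, port, protocol, and authentication for your environment:
# Each instance should return the same list of key names promptly and without errors
for kms in kms1.example.com kms2.example.com; do
  echo "Checking $kms ..."
  curl -s --negotiate -u : "http://$kms:9292/kms/v1/keys/names"
  echo
done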
Clean Up ZooKeeper Nodes
ZooKeeper facilitates coordination between KMS instances and manages distributed key metadata. Stale data in ZooKeeper can lead to inconsistent encryption key states or service failures.
Perform the following cleanup steps:
- Stop the KMS service during off-peak hours to minimize service disruption.
- Delete stale KMS-related entries from ZooKeeper using the following command in the ZooKeeper shell:
deleteall /zkdtsm
This removes outdated encryption key metadata. A scriptable form of this cleanup is sketched after this list.
- Restart a single KMS server first to verify a clean startup before bringing additional servers online.
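For a scriptable cleanup and verification, the same ZooKeeper commands can be passed non-interactively. This is a sketch assuming a ZooKeeper server at zk1.example.com:2181; substitute a host from your own quorum:
# Remove the stale KMS delegation-token tree in a single call
zookeeper-client -server zk1.example.com:2181 deleteall /zkdtsm
# After the first KMS server restarts cleanly, /zkdtsm should be recreated automatically
zookeeper-client -server zk1.example.com:2181 ls /zkdtsm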
Post-Restart Validation
After restarting the first KMS server, confirm that it is functioning correctly before adding more instances to the cluster:
- Verify the ZooKeeper state by executing these commands in the ZooKeeper shell:
get /zkdtsm/ZKDTSMRoot/ZKDTSMMasterKeyRoot
ls /zkdtsm/ZKDTSMRoot/ZKDTSMMasterKeyRoot
These commands should return valid encryption key metadata.
Missing or inconsistent entries may indicate ZooKeeper synchronization issues.
- Monitor request latency (a sketch of both checks follows this list):
- Use jstack or jcmd to inspect the active KMS server threads.
- Run a test encryption request to validate the KMS response time.
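A minimal sketch of both checks on a KMS host, assuming the Ranger KMS JVM can be located by the proc_rangerkms marker in its command line (verify the pattern with ps on your hosts):
# Capture a thread dump from the running KMS JVM
KMS_PID=$(pgrep -f proc_rangerkms)
jstack "$KMS_PID" > /tmp/kms-threads-$(date +%s).txt
# Time a simple key operation served by KMS to gauge response latency
time hadoop key list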
Application and HDFS Validation
To verify that KMS is functioning correctly within the Hadoop ecosystem:
- Trigger encryption-related operations (a quick end-to-end check is sketched after this list), such as:
- Reading/writing encrypted data in HDFS
- Listing encryption zones using:
hdfs crypto -listZones
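A minimal end-to-end check, assuming an existing encryption zone at /enc_zone and a user with access to its key (substitute a real zone path from your cluster):
hdfs crypto -listZones
echo "kms-ha-smoke-test" | hdfs dfs -put - /enc_zone/kms-ha-smoke-test.txt
hdfs dfs -cat /enc_zone/kms-ha-smoke-test.txt
hdfs dfs -rm -skipTrash /enc_zone/kms-ha-smoke-test.txt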
Monitor for performance issues, including:
- High latency in KMS responses
- Errors or warnings in NameNode logs
- Sluggish decryption operations
Collect diagnostic data:
- Run:
hdfs dfsadmin -report
This provides an overview of cluster health.
Check configuration files (the relevant properties can be confirmed as shown below):
- hdfs-site.xml (for encryption-related settings)
- core-site.xml (for KMS authentication settings)
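A quick way to confirm which KMS provider HDFS points at, assuming the standard client configuration directory /etc/hadoop/conf (adjust the path to your layout):
# core-site.xml: key provider used by HDFS clients and the NameNode
grep -A1 "hadoop.security.key.provider.path" /etc/hadoop/conf/core-site.xml
# hdfs-site.xml: older encryption-specific provider setting, if present
grep -A1 "dfs.encryption.key.provider.uri" /etc/hadoop/conf/hdfs-site.xml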
Monitor resource utilization on KMS servers (a quick snapshot sketch follows):
- CPU/memory usage (top, ps aux, htop)
- Disk I/O (iostat, iotop)
- Network activity (netstat, ss)
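A point-in-time snapshot on a KMS host, assuming the KMS listens on the default Ranger KMS port 9292 (adjust the port and sampling intervals as needed):
top -b -n 1 | head -20    # CPU and memory summary plus the busiest processes
iostat -x 5 3             # extended disk statistics, three 5-second samples
ss -tnp | grep 9292       # established connections to the KMS port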
Performance Data Collection
If performance issues persist, gather detailed data for analysis:
Capture Java Flight Recorder (JFR) dumps to analyze system behavior:
Collect 3–5 JFR dumps, each lasting 10 minutes, from:
- The active NameNode
- All KMS servers
JFR captures thread activity, garbage collection (GC) patterns, and CPU utilization; a capture sketch follows.
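A capture sketch using jcmd, assuming Flight Recorder is available in the KMS JVM (OpenJDK 8u262+ or JDK 11+) and that the process can be found via the proc_rangerkms marker (verify on your hosts):
# Start a 10-minute recording on the KMS JVM and write it to /tmp
KMS_PID=$(pgrep -f proc_rangerkms)
jcmd "$KMS_PID" JFR.start name=kms-perf duration=10m filename=/tmp/kms-perf-$(hostname -f).jfr
# Confirm the recording is running; repeat 3-5 times while the issue is reproduced
jcmd "$KMS_PID" JFR.check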
Gather logs from the following locations (a collection sketch follows):
- NameNode (hdfs-audit.log, namenode.log)
- KMS servers (/var/log/ranger/kms/ranger-kms-$(hostname -f)-kms.log, /var/log/ranger/kms/kms-audit-$(hostname -f)-kms.log)
- Ranger Admin (/var/log/ranger/admin/ranger-admin-$(hostname -f)-ranger.log)
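A bundling sketch for a KMS host, assuming the default log locations listed above (extend the list to cover NameNode and Ranger Admin hosts as needed):
tar czf /tmp/kms-ha-logs-$(hostname -f).tar.gz \
  /var/log/ranger/kms/ranger-kms-$(hostname -f)-kms.log \
  /var/log/ranger/kms/kms-audit-$(hostname -f)-kms.log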
Debugging Further Issues
If performance degradation persists:
- Enable DEBUG mode in KMS logging by adding the following line to the log4j properties file (kms-log4j.properties):
log4j.logger.org.apache.hadoop.crypto.key.kms=DEBUG
- Restart the KMS service and reproduce the issue.
- Analyze logs for potential issues, such as:
- ZooKeeper session timeouts
- Slow SQL queries affecting the KMS database
- Authentication failures or token expiration
Additional Troubleshooting Steps
Check whether the performance issue is isolated to a specific KMS server.
- If one server exhibits higher latency, compare logs and resource usage with a healthy instance.
Migrate KMS to a different host to rule out hardware or network-related bottlenecks.
Analyze MySQL database performance:
- Execute the following to detect slow queries (a fuller shell-based check is sketched after this list):
SHOW PROCESSLIST;
- Inspect mysqld.log for deadlocks or excessive query execution times.
- Verify that MySQL connection pool settings are correctly configured.
Advanced Troubleshooting Techniques
For persistent issues, consider:
Profiling KMS using Java Mission Control (JMC)
- JMC provides insight into memory usage, GC pauses, and thread contention.
Capturing network traffic between KMS and clients
- Use tcpdump or Wireshark to diagnose potential network latencies (a capture sketch follows).
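A minimal capture on a KMS host, assuming the default Ranger KMS port 9292 (change the port if your instances listen elsewhere):
# Write KMS traffic to a pcap file for later inspection in Wireshark; stop with Ctrl+C
tcpdump -i any -w /tmp/kms-traffic.pcap port 9292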
Verifying ZooKeeper quorum health
- Run:
echo ruok | nc <zk_host> <zk_port>
A healthy ZooKeeper instance should return imok (a quorum-wide check is sketched below).
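A quorum-wide check, assuming three ZooKeeper servers on port 2181 and that the four-letter-word commands (ruok, srvr) are whitelisted on ZooKeeper 3.5+ (hostnames are illustrative):
for zk in zk1.example.com zk2.example.com zk3.example.com; do
  echo -n "$zk: "
  echo srvr | nc "$zk" 2181 | grep Mode    # one server should report leader, the others follower
done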
Checking HDFS encryption key access
- Run the following to ensure that KMS serves the correct encryption keys:
hadoop key list
Troubleshooting Checklist
| Sr.No | Checkpoint | Expected Outcome | Actions if Failing |
|---|---|---|---|
| 1 | Validate KMS cache settings | Cache is disabled (hadoop.kms.cache.enable=false) | Update configuration and restart KMS. |
| 2 | Clean up ZooKeeper nodes | /zkdtsm is removed and recreated | Delete the node manually and restart KMS. |
| 3 | Verify ZooKeeper key storage | Keys exist under /zkdtsm/ZKDTSMRoot/ZKDTSMMasterKeyRoot | Check KMS logs and reinitialize keys. |
| 4 | Validate encryption operations | Encrypted files can be read and written without delays. | Check logs for authentication or database issues. |
| 5 | Monitor KMS resource usage | CPU, memory, and network activity are within normal limits. | Analyze JFR dumps and optimize KMS heap settings. |
| 6 | Analyze MySQL performance | No slow queries or connection pool issues. | Tune MySQL settings and add indexes if needed. |
| 7 | Enable debug logging and restart | Logs capture detailed request processing. | Analyze logs for timeouts and authentication failures. |
Conclusion
This guide provides a comprehensive troubleshooting framework for diagnosing and resolving performance issues in a Ranger KMS HA environment. By following a structured validation and debugging process, administrators can identify and mitigate encryption-related bottlenecks, ensuring optimal HDFS security and performance.