HDFS: Separate RPC Queues and Enable Lifeline Protocol
Separating the RPC queues for Hadoop services and clients, together with the DataNode Lifeline protocol, gives DataNodes a lightweight RPC path for reporting their health status to the NameNode. This prevents DataNodes from being spuriously marked as stale.
RPC Handler Count
The Hadoop RPC server consists of a single RPC queue per port and multiple handler (worker) threads that dequeue and process requests. If the number of handlers is insufficient, then the RPC queue starts building up and eventually overflows. You may start seeing task failures and eventually job failures and unhappy users.
It is recommended that the RPC handler count be set to 20 * log2(Cluster Size) with an upper limit of 300.
For example, for a 250-node cluster, set the handler count to 20 * log2(250) ≈ 160.
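As a starting point, this value can be applied through the same Custom hdfs-site flow described below. The standard properties are dfs.namenode.handler.count for the client RPC port and dfs.namenode.service.handler.count for the dedicated service RPC port (the values shown here are illustrative, not a sizing recommendation):
dfs.namenode.handler.count=160
dfs.namenode.service.handler.count=160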
Set the Service RPC port: The service RPC port gives the DataNodes a dedicated port for reporting their status via block reports and heartbeats. The port is also used by the ZooKeeper Failover Controllers for the periodic health checks performed by the automatic failover logic. Because the port is never used by client applications, it reduces RPC queue contention between client requests and DataNode messages. For an HA cluster, the service RPC port can be enabled with settings like the following, replacing the name-service and NameNode FQDNs appropriately.
Navigate to Ambari -> HDFS -> Configs -> Advanced -> Custom hdfs-site -> Add Property and add the following:
dfs.namenode.servicerpc-address.<name-service>.nn1=acceldata1.hadoop.local:8021
dfs.namenode.servicerpc-address.<name-service>.nn2=acceldata2.hadoop.local:8021
Restart these components: HDFS, YARN, and MapReduce.
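After the restart, a quick sanity check is that each NameNode host is listening on the new service RPC port (a sketch, assuming port 8021 as configured above):
netstat -tnlp | grep 8021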
Set the DataNode Lifeline Protocol: The Lifeline protocol is a feature added by the Apache Hadoop community (see Apache HDFS Jira [HDFS-9239](https://issues.apache.org/jira/browse/HDFS-9239)). It introduces a new lightweight RPC message that the DataNodes use to report their health to the NameNode. It was developed in response to problems seen in some overloaded clusters where the NameNode was too busy to process heartbeats and spuriously marked DataNodes as dead. For an HA cluster, the lifeline RPC port can be enabled with settings like the following, replacing the name-service and NameNode FQDNs appropriately.
- Make the following changes to enable the lifeline RPC:
dfs.namenode.lifeline.rpc-address.<name-service>.nn1=acceldata1.hadoop.local:8022
dfs.namenode.lifeline.rpc-address.<name-service>.nn2=acceldata2.hadoop.local:8022
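Optionally, the share of NameNode handler threads dedicated to lifeline messages can be tuned with the standard dfs.namenode.lifeline.handler.ratio property (a fraction of dfs.namenode.handler.count, 0.10 by default), or set as an absolute number via dfs.namenode.lifeline.handler.count; for example:
dfs.namenode.lifeline.handler.ratio=0.10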
- Restart the standby NameNode. You need to __wait until the standby NameNode exits safe mode__.
You can check the safe mode status on the standby NameNode UI.
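Safe mode status can also be checked from the command line, targeting the standby NameNode directly (a sketch, assuming the default client RPC port 8020):
hdfs dfsadmin -fs hdfs://acceldata2.hadoop.local:8020 -safemode get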
- Run the following command to manually transition a standby NameNode to the active state in an HDFS High Availability (HA) setup:
hdfs haadmin -transitionToActive --forceactive <nn1|nn2>
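You can verify which NameNode is active before and after the transition (nn1 and nn2 being the HA service IDs used in the configuration above):
hdfs haadmin -getServiceState nn1
hdfs haadmin -getServiceState nn2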
- Restart the other NameNode. You need to wait until it is out of safe mode.
- Stop the standby NameNode ZKFC controller (it may already be in the stopped state).
- Stop the active NameNode ZKFC controller (it may already be in the stopped state).
- Log in to the Active NameNode and reset the NameNode HA state.
# su - hdfs
$ hdfs zkfc -formatZK
- Log in to the Standby NameNode and reset the NameNode HA state.
# su - hdfs
$ hdfs zkfc -formatZK
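Note that hdfs zkfc -formatZK asks for confirmation if HA state already exists in ZooKeeper; for non-interactive runs, the -force flag can be supplied:
$ hdfs zkfc -formatZK -force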
- Perform a rolling restart of the JournalNodes, with a gap of 500 seconds between restarts.
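Between restarts, a quick check is that each JournalNode is listening again on its RPC port (assuming the default JournalNode RPC port 8485):
netstat -tnlp | grep 8485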
- Perform a rolling restart of the other HDFS components pending restart (except ZKFC).
These parameters should be tuned and set for better performance according to the available cores and the number of nodes in the cluster; connect with Acceldata Support for recommended changes based on your cluster sizing.