Erasure Coding for Data Durability
Data durability describes how resilient data is to loss. When data is stored in HDFS, Acceldata provides two options for data durability: Replication, which HDFS was originally built on, or Erasure Coding (EC).
Erasure coding reduces the storage footprint and increases data durability, while replication ensures rapid access.
Replication
By default, HDFS creates two additional copies of each block, resulting in three instances of the data in total. These copies are stored on separate DataNodes to guard against data loss when a node is unreachable. When the data on a node is lost or inaccessible, it is replicated from one of the remaining nodes to a new node, so that multiple copies always exist.
- The number of replications is configurable, but the default is three.
- Acceldata recommends keeping the replication factor to at least three when you have three or more DataNodes.
- A lower replication factor leaves data more vulnerable to DataNode failures, since fewer copies are spread across fewer DataNodes.
- When the data is written to an HDFS cluster that uses replication, additional copies of the data are automatically created. No additional steps are required.
Replication works with all data processing engines that Acceldata supports.
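As a rough illustration of replication's storage cost (a sketch for this document, not an HDFS API): with the default factor of 3, every logical byte consumes three bytes of raw capacity, and up to two DataNodes holding a block can fail before data is lost.

```python
def replicated_storage_bytes(logical_bytes: int, replication_factor: int = 3) -> int:
    """Raw bytes consumed when data is stored as N full copies."""
    return logical_bytes * replication_factor

def tolerated_failures(replication_factor: int = 3) -> int:
    """Copies that can be lost while at least one remains."""
    return replication_factor - 1

one_gib = 1024 ** 3
print(replicated_storage_bytes(one_gib) // one_gib)  # -> 3 (a 1 GiB file occupies 3 GiB raw)
print(tolerated_failures())                          # -> 2
```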
Erasure Coding (EC)
Erasure Coding (EC) is an alternative to replication. When an HDFS cluster uses EC, no additional direct copies of the data are generated. Instead, the data is striped into blocks and encoded to generate parity blocks. If there are any missing or corrupt blocks, HDFS uses the parity blocks to reconstruct the missing pieces in the background. This process provides a similar level of data durability to 3x replication but at a lower storage cost.
Additionally, EC is applied when data is written. This means that to use EC, you must first create a directory and configure it for EC. Then, you can either replicate existing data or write new data into this directory.
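The striping-and-parity idea above can be sketched in a few lines. This toy uses a single XOR parity cell because it makes the recovery math easy to follow; HDFS's production EC policies use Reed-Solomon codecs, which tolerate multiple simultaneous block losses.

```python
# Conceptual sketch only: single-parity XOR, not HDFS's Reed-Solomon codec.

def xor_parity(cells: list[bytes]) -> bytes:
    """Compute one parity cell as the byte-wise XOR of the given cells."""
    parity = bytearray(len(cells[0]))
    for cell in cells:
        for i, b in enumerate(cell):
            parity[i] ^= b
    return bytes(parity)

def reconstruct(surviving_cells: list[bytes], parity: bytes) -> bytes:
    """Rebuild a single missing data cell from the survivors plus the parity."""
    return xor_parity(surviving_cells + [parity])

stripe = [b"aaaa", b"bbbb", b"cccc"]   # one stripe of three data cells
parity = xor_parity(stripe)

lost = stripe[1]                        # pretend this cell's DataNode failed
recovered = reconstruct([stripe[0], stripe[2]], parity)
assert recovered == lost                # the parity cell lets HDFS-style recovery rebuild it
```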
EC supports the following data processing engines:
- Hive
- MapReduce
- Spark
With both data durability schemes, replication and EC, recovery happens in the background and requires no direct input from a user.
HDFS clusters can be configured with a single data durability scheme (3x replication or EC), or with a hybrid scheme where EC-enabled directories co-exist on a cluster with directories protected by the traditional 3x replication model. Base this decision on the temperature of the data stored in HDFS (how often the data is accessed).
- Hot data: Data that is accessed frequently must use replication.
- Cold data: Data that is accessed less frequently can take advantage of EC's storage savings.
For details about enabling Erasure Coding, access the following pages:
- Understanding Erasure Coding Policies: The EC policy determines how data is encoded and decoded. An EC policy is made up of the following parts: `codec`-`number of data blocks`-`number of parity blocks`-`cell size` (for example, `RS-6-3-1024k`).
- Comparing Replication and Erasure Coding: You must consider factors such as data temperature, I/O cost, storage cost, and file size when comparing replication and erasure coding.
- Best Practices for Rack and Node Setup for EC: In an on-premises deployment, when setting up a cluster to take advantage of EC, consider the number of racks and nodes in your setup.
- Prerequisites for Enabling Erasure Coding: Before enabling erasure coding on your data, you must consider various factors such as the type of policy to use, the type of data, and the rack or node requirements.
- Limitations of Erasure Coding: The limitations of erasure coding include non-support of XOR codecs and certain HDFS functions.
- Using Erasure Coding for Existing Data: You must set a supported EC policy for a directory and copy the existing data to the directory.
- Using Erasure Coding for New Data: You must create a new directory and then set a supported EC policy for the directory.
- Advanced Erasure Coding Configuration: You can customize the behavior of EC through the options of the `hdfs ec` subcommand.
- Erasure Coding CLI Commands: You can use the `hdfs ec` command to set erasure coding policies on directories.
- Erasure Coding Examples: You can use the `hdfs ec` command with its various options to set erasure coding policies on directories.
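The policy naming scheme described above (codec, data blocks, parity blocks, cell size) also determines the storage cost. A small sketch of how a policy name such as `RS-6-3-1024k` breaks down, and the overhead it implies, is shown below; the parsing helper is hypothetical, not part of any HDFS API.

```python
def parse_ec_policy(name: str) -> dict:
    """Split a policy name of the form codec-data-parity-cellsize,
    e.g. 'RS-6-3-1024k', and derive its storage overhead."""
    codec, data, parity, cell = name.split("-")
    data_blocks, parity_blocks = int(data), int(parity)
    return {
        "codec": codec,
        "data_blocks": data_blocks,
        "parity_blocks": parity_blocks,
        "cell_size": cell,
        # raw bytes written per logical byte stored
        "storage_overhead": (data_blocks + parity_blocks) / data_blocks,
    }

info = parse_ec_policy("RS-6-3-1024k")
print(info["storage_overhead"])  # 1.5, versus 3.0 for 3x replication
```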