Erasure Coding for Data Durability

Data durability describes how resilient data is to loss. When data is stored in HDFS, Acceldata provides two options for data durability: Replication, which HDFS was originally built on, or Erasure Coding (EC).

Erasure coding reduces the storage footprint and increases data durability, while replication ensures rapid access.

Replication

HDFS creates two additional copies of each piece of data, resulting in three instances in total. These copies are stored on separate DataNodes to guard against data loss when a node is unreachable. When the data stored on a node is lost or becomes inaccessible, it is replicated from one of the other nodes to a new node so that multiple copies always exist.

  • The replication factor is configurable; the default is three. A minimal API sketch follows this list.
  • Acceldata recommends keeping the replication factor to at least three when you have three or more DataNodes.
  • A lower replication factor makes the data more vulnerable to DataNode failures, because fewer copies are spread across fewer DataNodes.
  • When the data is written to an HDFS cluster that uses replication, additional copies of the data are automatically created. No additional steps are required.
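As a concrete illustration, the following minimal Java sketch changes the replication factor of a single file through the standard Hadoop FileSystem API. The file path and target factor here are hypothetical, chosen only for the example.

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.FileSystem;
    import org.apache.hadoop.fs.Path;

    public class SetReplication {
        public static void main(String[] args) throws Exception {
            // Picks up the cluster configuration, including the
            // default replication factor (dfs.replication, normally 3).
            FileSystem fs = FileSystem.get(new Configuration());

            // Hypothetical file path, used here only for illustration.
            Path file = new Path("/data/hot/events.log");

            // Request three replicas for this file. The NameNode schedules
            // any additional copies (or deletions) in the background.
            boolean accepted = fs.setReplication(file, (short) 3);
            System.out.println("Replication change accepted: " + accepted);
        }
    }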

Replication works with all data processing engines that Acceldata supports.

Erasure Coding (EC)

Erasure Coding (EC) is an alternative to replication. When an HDFS cluster uses EC, no additional direct copies of the data are generated. Instead, the data is striped into blocks and encoded to generate parity blocks. If there are any missing or corrupt blocks, HDFS uses the parity blocks to reconstruct the missing pieces in the background. This process provides a similar level of data durability to 3x replication but at a lower storage cost.
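To make the parity idea concrete, here is a minimal sketch of single-parity encoding and recovery, assuming a simple XOR scheme like HDFS's built-in XOR-2-1 policy. The production Reed-Solomon policies (such as RS-6-3) use a more capable codec, but the recovery principle is the same: a lost cell is rebuilt from the surviving cells plus parity.

    import java.util.Arrays;

    public class XorParitySketch {
        // Parity cell = XOR of all data cells, byte by byte.
        static byte[] parity(byte[]... cells) {
            byte[] p = new byte[cells[0].length];
            for (byte[] cell : cells)
                for (int i = 0; i < p.length; i++)
                    p[i] ^= cell[i];
            return p;
        }

        public static void main(String[] args) {
            byte[] d0 = "stripe-cell-0".getBytes();
            byte[] d1 = "stripe-cell-1".getBytes();
            byte[] p  = parity(d0, d1);       // encode: one parity cell

            // Simulate losing d1 (e.g. its DataNode dies). XORing the
            // surviving cell with the parity reproduces d1 exactly.
            byte[] recovered = parity(d0, p);
            System.out.println(new String(recovered));        // stripe-cell-1
            System.out.println(Arrays.equals(recovered, d1)); // true
        }
    }

The storage saving follows from the block counts: under RS-6-3, every 6 data blocks carry 3 parity blocks, so raw usage is 1.5x the logical data size and up to 3 lost blocks per stripe can be rebuilt, compared with 3x raw usage and tolerance for 2 lost copies under 3x replication.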

Additionally, EC is applied only when data is written. To use EC, you must first create a directory and configure it for EC; you can then copy existing data into that directory or write new data to it directly.
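The following Java sketch shows those steps against the Hadoop 3.x API: create a directory, attach an EC policy to it, and confirm the assignment. The /data/cold path is hypothetical; RS-6-3-1024k is one of the built-in policies and must already be enabled cluster-wide by an administrator (for example, with hdfs ec -enablePolicy).

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.FileSystem;
    import org.apache.hadoop.fs.Path;
    import org.apache.hadoop.hdfs.DistributedFileSystem;

    public class EnableEcOnDirectory {
        public static void main(String[] args) throws Exception {
            // Erasure coding calls live on DistributedFileSystem (Hadoop 3.x).
            DistributedFileSystem dfs =
                    (DistributedFileSystem) FileSystem.get(new Configuration());

            // Hypothetical directory for cold data.
            Path coldDir = new Path("/data/cold");
            dfs.mkdirs(coldDir);

            // Attach a built-in policy; only data written afterwards is
            // erasure coded, so existing replicated files must be copied in.
            dfs.setErasureCodingPolicy(coldDir, "RS-6-3-1024k");
            System.out.println(dfs.getErasureCodingPolicy(coldDir));
        }
    }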

EC supports the following data processing engines:

  • Hive
  • MapReduce
  • Spark

With both data durability schemes, replication and EC, recovery happens in the background and requires no direct input from a user.

HDFS clusters can be configured with a single data durability scheme (3x replication or EC), or with a hybrid data durability scheme where EC-enabled directories coexist on a cluster with other directories that are protected by the traditional 3x replication model. This decision must be based on the temperature of the data (how often the data is accessed) stored in HDFS; a verification sketch follows the list below.

  • Hot data: Data that is accessed frequently must use replication.
  • Cold data: Data that is accessed less frequently can take advantage of EC's storage savings.
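To confirm which scheme protects which directory in a hybrid cluster, the sketch below queries the EC policy of two hypothetical directories through the same Hadoop 3.x API; a null policy means the directory falls back to plain replication.

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.FileSystem;
    import org.apache.hadoop.fs.Path;
    import org.apache.hadoop.hdfs.DistributedFileSystem;
    import org.apache.hadoop.hdfs.protocol.ErasureCodingPolicy;

    public class InspectDurability {
        public static void main(String[] args) throws Exception {
            DistributedFileSystem dfs =
                    (DistributedFileSystem) FileSystem.get(new Configuration());

            // Hypothetical hot and cold directories on the same cluster.
            for (String dir : new String[] {"/data/hot", "/data/cold"}) {
                ErasureCodingPolicy policy =
                        dfs.getErasureCodingPolicy(new Path(dir));
                System.out.println(dir + " -> "
                        + (policy == null ? "replication" : policy.getName()));
            }
        }
    }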

For details about enabling Erasure Coding, see the related documentation pages.
