Data Protection Using Erasure Codes
Erasure coding is an alternative to replication for ensuring data redundancy. It is a method in which data is broken into fragments, parity data is computed from them, and the fragments are stored across different locations. Read on to learn more about it.
With more and more devices going digital, the volume of data produced has exploded. Much of this data, if not all of it, is stored for future reference and for analytics. Consider how pleased a shopper is when shown a history of relevant past purchases while trying to buy something.
Digital data is estimated to double in size every two years and is expected to reach 44 zettabytes (44 trillion gigabytes) by 2020 (source: IDC, 2014). This growth is expected to accelerate further as technologies like the Internet of Things mature.
Hence, enterprises are constantly reviewing their storage capacity and trying to scale for the future. Along with scaling, they are also trying to ensure that existing storage is used optimally and efficiently.
A significant amount of storage space is reserved for redundancy, where copies of data are stored to ensure that data remains available in case of hardware and other failures. There is always a tradeoff between the number of copies kept (which depends on the importance of the data) and the storage capacity consumed. Often, the capacity used to store replicas exceeds the capacity used to store the original files, because more than one copy is kept.
Hence, the industry has started looking for alternatives to replication for ensuring data redundancy. The need is especially pronounced in cloud storage, where the user is given the impression of infinite storage and continuous availability of data.
Erasure coding (EC) is one such option that many users are evaluating, and some have already implemented it.
Erasure coding is a method in which data is broken into fragments and some parity data is added to the actual data and stored across different locations.
Parity data is calculated using mathematical functions chosen so that the original data can be recovered if some of the fragments are lost.
A simple example of parity calculation is as follows:
- a = 2 (data)
- b = 5 (data)
- a + b = 7 (parity)
- 2a + b = 9 (parity)
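Hence, if "a = 2" is lost, it can be reconstructed from the remaining equations (for example, a = 7 - b = 2). Even if both data values, "a = 2" and "b = 5", are lost, both can be reconstructed from the two parity equations: subtracting the first from the second gives a = 9 - 7 = 2, and then b = 7 - a = 5.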
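The recovery in this example amounts to solving two linear equations in two unknowns. Below is a minimal Python sketch of that idea, assuming exactly the four stored values from the example; the names `EQUATIONS`, `encode`, and `recover` are hypothetical and chosen only for illustration. Real erasure codes use finite-field arithmetic rather than ordinary integers.

```python
from fractions import Fraction

# Each stored value is a linear combination (coef_a, coef_b) of the data a and b.
# These are the four values from the example above; the names are illustrative.
EQUATIONS = {
    "a": (1, 0),      # data
    "b": (0, 1),      # data
    "a+b": (1, 1),    # parity
    "2a+b": (2, 1),   # parity
}

def encode(a, b):
    """Compute all four stored values from the two data values."""
    return {name: ca * a + cb * b for name, (ca, cb) in EQUATIONS.items()}

def recover(surviving):
    """Recover (a, b) from any two surviving values using Cramer's rule."""
    if len(surviving) < 2:
        raise ValueError("need at least two surviving values")
    (n1, v1), (n2, v2) = list(surviving.items())[:2]
    (a1, b1), (a2, b2) = EQUATIONS[n1], EQUATIONS[n2]
    det = a1 * b2 - a2 * b1
    if det == 0:
        raise ValueError("surviving values are not independent")
    a = Fraction(v1 * b2 - v2 * b1, det)
    b = Fraction(a1 * v2 - a2 * v1, det)
    return a, b

stored = encode(2, 5)
# Both data values are lost; only the two parity values survive.
a, b = recover({"a+b": stored["a+b"], "2a+b": stored["2a+b"]})
print(a, b)
```

In this particular set of equations, every pair of stored values is independent, so any two of the four values are enough to rebuild both `a` and `b`.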
The number of data fragments and parity fragments is defined by the erasure coding configuration parameter (m:n). For instance, an EC configuration of 6:3 implies 6 data fragments and 3 parity fragments, so the loss of up to 3 of the 9 fragments can be tolerated, and the data can be reconstructed from any 6 surviving fragments. The user can choose this configuration based on the importance of the data. It is a tradeoff between the storage consumed and the number of fragment losses that can be tolerated.
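The storage side of this tradeoff can be worked out with simple arithmetic. The following Python sketch compares an m:n erasure coding configuration with plain replication; the function names `ec_profile` and `replication_profile` are hypothetical, and the numbers cover raw capacity and loss tolerance only, not CPU cost or rebuild traffic.

```python
def ec_profile(data_fragments, parity_fragments):
    """Raw storage and loss tolerance for an m:n erasure coding configuration."""
    total = data_fragments + parity_fragments
    return {
        "fragments_stored": total,
        "losses_tolerated": parity_fragments,
        # Raw bytes stored per byte of original data.
        "storage_factor": total / data_fragments,
    }

def replication_profile(copies):
    """Raw storage and loss tolerance when keeping `copies` full copies."""
    return {
        "fragments_stored": copies,
        "losses_tolerated": copies - 1,
        "storage_factor": float(copies),
    }

print("EC 6:3        ", ec_profile(6, 3))
print("3 full copies ", replication_profile(3))
```

Under these assumptions, a 6:3 configuration stores 1.5 times the original data and survives the loss of any 3 fragments, whereas keeping three full copies stores 3 times the original data and survives the loss of 2 copies.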
Erasure coding consumes less storage than mirroring (replication) but can be more CPU-intensive.
The table below provides a sample comparison between erasure coding and tri-replication:
Some of the common traditional erasure coding techniques are:
- Reed-Solomon Coding
- Cauchy Reed-Solomon Codes
- EVENODD Coding
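To give a feel for how a Reed-Solomon-style code lets data be rebuilt from any m of the m + n fragments, here is a simplified Python sketch. It is an illustration under stated assumptions, not a production implementation: it treats each data symbol as a point on a polynomial over the small prime field GF(257), whereas practical implementations typically work in GF(2^8) so that every symbol fits in one byte. The names `P`, `encode`, `decode`, and `_interpolate` are hypothetical, and `pow(den, -1, P)` requires Python 3.8 or later.

```python
P = 257  # small prime; every byte value 0..255 is a valid field element

def _interpolate(points, x):
    """Evaluate at x the unique polynomial through `points` (Lagrange, mod P)."""
    total = 0
    for i, (xi, yi) in enumerate(points):
        num, den = 1, 1
        for j, (xj, _) in enumerate(points):
            if i != j:
                num = num * (x - xj) % P
                den = den * (xi - xj) % P
        # pow(den, -1, P) is the modular inverse of den (Python 3.8+).
        total = (total + yi * num * pow(den, -1, P)) % P
    return total

def encode(data, parity_count):
    """Return the data symbols followed by parity symbols (indices 0..m+n-1)."""
    points = list(enumerate(data))  # data symbol i is the polynomial value at x = i
    m = len(data)
    # Parity symbols are extra evaluations of the same polynomial.
    # Note: a parity value can be 256, which does not fit in a byte.
    parity = [_interpolate(points, x) for x in range(m, m + parity_count)]
    return list(data) + parity

def decode(fragments, m):
    """Rebuild the m data symbols from any m surviving {index: value} pairs."""
    points = list(fragments.items())[:m]
    if len(points) < m:
        raise ValueError("too many fragments lost")
    return [_interpolate(points, x) for x in range(m)]

data = [10, 20, 30, 40, 50, 60]          # 6 data symbols
coded = encode(data, 3)                  # 6:3 configuration, 9 symbols in total
lost = {0, 3, 7}                         # lose two data symbols and one parity symbol
surviving = {i: v for i, v in enumerate(coded) if i not in lost}
print(decode(surviving, 6) == data)
```

The design choice here is that all nine symbols lie on one polynomial of degree at most 5, and any six points determine such a polynomial uniquely, which is why any six surviving fragments suffice.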
Erasure coding is more suitable for scenarios involving huge data sets, archiving, object storage, cloud, etc.
Lack of knowledge and confidence are the major barriers to the acceptance of erasure coding. People are more confident with replicas because reconstruction simply involves copying the replica, whereas with erasure coding the data is split and the user depends on mathematical calculations for reconstruction. Windows Azure, Cleversafe, and Atmos are some of the commercial products available in the market that implement erasure coding. Erasure coding is therefore picking up, especially with the boom in the cloud market.