Why Deduplication Matters for Cloud Storage
Cloud storage has opened new doors with scalability and high availability, but it's not infinite. Copies and duplicates can pile on costs, so deduplication becomes essential.
Most people assume cloud storage is cheaper than on-premise storage. After all, why wouldn’t they? You can rent object storage for $276 per TB per year or less, depending on your performance and access requirements. Enterprise storage costs between $2,500 and $4,000 per TB per year, according to analysts at Gartner and ESG.
This comparison makes sense for primary data, but what happens when you make backups or copies of data for other reasons in the cloud? Imagine that an enterprise needs to retain 3 years of monthly backups of a 100 TB data set. In the cloud, that adds up to 36 full copies, or 3.6 PB of raw backup data, and a monthly bill of roughly $83,000 at standard object storage rates. That’s about $1 million a year before you even factor in any data access or retrieval charges.
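The arithmetic behind that estimate can be sketched in a few lines. This is a rough illustration, not a billing calculator: it assumes 36 full, undeduplicated monthly copies held at steady state, decimal terabytes, and the $23 per TB per month standard object storage price quoted in this article. The function and constant names are made up for the example.

```python
# Illustrative price taken from the article; real bills vary by region and tier.
S3_STANDARD_USD_PER_TB_MONTH = 23.0


def retained_backup_tb(dataset_tb, copies):
    """Raw capacity held when `copies` full backups are retained."""
    return dataset_tb * copies


def monthly_storage_cost(total_tb, usd_per_tb_month):
    """Monthly storage charge for `total_tb` at a flat per-TB price."""
    return total_tb * usd_per_tb_month


# 3 years of monthly backups of a 100 TB data set = 36 full copies.
retained_tb = retained_backup_tb(dataset_tb=100, copies=36)
monthly_usd = monthly_storage_cost(retained_tb, S3_STANDARD_USD_PER_TB_MONTH)
annual_usd = monthly_usd * 12

print(f"{retained_tb:,.0f} TB retained")
print(f"${monthly_usd:,.0f} per month, ${annual_usd:,.0f} per year")
```

Under these assumptions the script reports 3,600 TB retained, $82,800 per month, and $993,600 per year, with no request or retrieval charges included.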
That is precisely why efficient deduplication is hugely important for both on-premise and cloud storage, especially when enterprises want to retain their secondary data (backup, archival, long-term retention) for weeks, months, and years. Cloud storage costs can add up quickly, surprising even astute IT professionals: data sets keep growing with web-scale architectures, data gets replicated, and teams then discover that it can’t be deduplicated in the cloud.
The Promise of Cloud Storage: Cheap, Scalable, Forever Available
Cloud storage is viewed as cheap, reliable and infinitely scalable, which is generally true. Object storage like AWS S3 is available at just $23/TB per month for the standard tier, or $12.50/TB per month for the Infrequent Access tier. Many modern applications can take advantage of object storage. Cloud providers also offer their own file or block options, such as AWS EBS (Elastic Block Store), which starts at $100/TB per month, prorated hourly. Third-party solutions also exist that connect traditional file or block storage to object storage as a back-end.
Even AWS EBS, at $1,200/TB per year, compares favorably to on-premise solutions that cost 2-3 times as much, and require high upfront capital expenditures. To recap, enterprises are gravitating to the cloud because the OPEX costs are significantly lower, there’s minimal up-front cost, and you pay as you go (vs. traditional storage where you have to buy far ahead of actual need).
How Cloud Storage Costs Can Get Out of Hand: Copies, Copies Everywhere
The direct cost comparison between cloud storage and traditional on-premise storage can distract from managing storage costs in the cloud, particularly as more and more data and applications move there. There are three components to cloud storage costs to consider:
- Cost for storing the primary data, either on object or block storage
- Cost for any copies, snapshots, backups, or archive copies of data
- Transfer charges for data
We’ve covered the first one. Let’s look at the other two.
Copies of data. The problem isn’t how much data you put into the cloud: uploading data is free, and storing a single copy is cheap. Costs spiral when you start making multiple copies of data, whether for backups, archives, or any other reason. Even if you don’t make explicit copies yourself, applications and databases often have built-in redundancy and replicate data on their own (in database parlance, according to a replication factor).
In the cloud, each copy you make of an object incurs the same cost as the original. Cloud providers may do some dedupe or compression behind the scenes, but this isn’t generally credited back to the customer. For example, in a consumer cloud storage service like Dropbox, if you make a copy or ten copies of a file, each copy counts against your storage quota.
For enterprises, this means data snapshots, backups, and archived data all incur additional costs. As an example, AWS EBS charges $0.05/GB per month for storing snapshots. While the snapshots are compressed and only store incremental data, they’re not deduplicated. Storing a snapshot of that 100 TB dataset could cost $60,000 per year, and that’s assuming it doesn’t grow at all.
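As a quick check on the snapshot figure, here is a minimal sketch of the calculation. It assumes the $0.05/GB per month snapshot price quoted above, decimal units (1 TB = 1,000 GB), and a single snapshot whose stored size equals the full 100 TB data set. The names are illustrative only.

```python
EBS_SNAPSHOT_USD_PER_GB_MONTH = 0.05  # price quoted in the article
GB_PER_TB = 1000                      # assumption: decimal units


def annual_snapshot_cost(snapshot_tb, usd_per_gb_month=EBS_SNAPSHOT_USD_PER_GB_MONTH):
    """Yearly charge for keeping `snapshot_tb` of snapshot data for 12 months."""
    return snapshot_tb * GB_PER_TB * usd_per_gb_month * 12


snapshot_annual_usd = annual_snapshot_cost(100)
print(f"${snapshot_annual_usd:,.0f} per year for a 100 TB snapshot")
```

This prints $60,000 per year. Any growth in the data set, or additional retained snapshots that share little data with earlier ones, pushes the figure higher.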
Data access. Public cloud providers generally charge for data transfer either between cloud regions or out of the cloud. For example, moving or copying a TB of AWS S3 data between Amazon regions costs $20, and transferring a TB of data out to the internet costs $90. Combined with GET, PUT, POST, LIST and DELETE request charges, data access costs can really add up.
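To get a feel for transfer charges at scale, the sketch below applies the per-TB prices quoted above to the 100 TB data set used elsewhere in this article. The two scenarios (copying the data set to another region and pulling it out to the internet) are hypothetical, and request charges are left out.

```python
# Per-TB transfer prices quoted in the article; actual rates vary.
INTER_REGION_USD_PER_TB = 20.0
INTERNET_EGRESS_USD_PER_TB = 90.0


def transfer_cost(data_tb, usd_per_tb):
    """One-time charge for moving `data_tb` at a flat per-TB price."""
    return data_tb * usd_per_tb


dataset_tb = 100
cross_region_usd = transfer_cost(dataset_tb, INTER_REGION_USD_PER_TB)
egress_usd = transfer_cost(dataset_tb, INTERNET_EGRESS_USD_PER_TB)

print(f"Copy {dataset_tb} TB to another region: ${cross_region_usd:,.0f}")
print(f"Transfer {dataset_tb} TB out to the internet: ${egress_usd:,.0f}")
```

Under these assumptions a single cross-region copy costs $2,000 and a single full transfer out costs $9,000, and every repeated copy or restore is charged again.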
Why Deduplication in the Cloud Matters
Cloud applications are distributed by design and are typically deployed on massively scalable non-relational databases. In these databases, much of the data is redundant before you even make a copy. Blocks and objects are often shared, and databases like MongoDB or Cassandra are commonly run with a replication factor (RF) of 3 to ensure data integrity in a distributed cluster, so you start out with three copies.
Backups or secondary copies are usually created and maintained via snapshots (for example, using EBS snapshots as noted earlier). The database architecture means that when you take a snapshot, you’re really making three copies of the data. Without any deduplication, this gets really expensive. And existing solutions, designed for on-premise legacy storage, can’t help.
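The effect of the replication factor on snapshot size can be sketched as follows. This is an idealized model: it assumes every replica's volume is snapshotted, that replicas hold identical data, and that a perfect deduplicator would keep exactly one copy. The function name is made up for the example.

```python
def snapshot_footprint_tb(logical_tb, replication_factor, deduplicated=False):
    """Capacity captured by snapshotting every replica of a cluster.

    Without deduplication, each replica's copy is stored in full.
    With (idealized) deduplication, only one logical copy is kept.
    """
    if deduplicated:
        return logical_tb
    return logical_tb * replication_factor


raw_tb = snapshot_footprint_tb(logical_tb=100, replication_factor=3)
unique_tb = snapshot_footprint_tb(logical_tb=100, replication_factor=3, deduplicated=True)

print(f"Raw snapshot data: {raw_tb} TB")
print(f"After ideal deduplication: {unique_tb} TB")
print(f"Reduction: {raw_tb / unique_tb:.0f}:1")
```

For a 100 TB logical data set at RF 3, the model gives 300 TB of raw snapshot data against 100 TB of unique data, a 3:1 reduction before any savings across successive backups are counted.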
Not Just Deduplication — Semantic Deduplication
Most deduplication technology works at the storage layer, deduplicating blocks of data. This is highly efficient on centralized SAN or NAS storage, but breaks down if the data layer is abstracted from the storage — as it is in a distributed database such as MongoDB. Deduplication in this world needs to address two fundamental issues:
- It has to work at the data layer, not the storage layer. In order to deduplicate data from a distributed cluster, the software has to understand and interpret the underlying data structure.
- It has to eliminate redundant data before it gets written to the database. Once data is written, it gets replicated within the cluster, so it needs to be deduplicated in-flight.
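The two requirements above can be illustrated with a toy example. This is a minimal sketch and not any vendor's implementation: it assumes records arrive as Python dictionaries, treats two records as identical when their canonical JSON forms match, and uses an in-memory list as a stand-in for the backup target. The class name `InFlightDeduplicator` and its methods are hypothetical.

```python
import hashlib
import json


class InFlightDeduplicator:
    """Toy data-layer deduplicator: drops repeated records before they are stored."""

    def __init__(self):
        self._seen = set()  # fingerprints of records already written
        self.store = []     # stand-in for the backup target (hypothetical)

    @staticmethod
    def fingerprint(record):
        # Canonical JSON (sorted keys, fixed separators) so that logically
        # identical records hash the same regardless of key order.
        canonical = json.dumps(record, sort_keys=True, separators=(",", ":"))
        return hashlib.sha256(canonical.encode("utf-8")).hexdigest()

    def write(self, record):
        """Store the record if it is new; return True if it was stored."""
        digest = self.fingerprint(record)
        if digest in self._seen:
            return False
        self._seen.add(digest)
        self.store.append(record)
        return True


# Three replicas (RF = 3) stream the same two logical records.
replica_streams = [
    [{"id": 1, "name": "alice"}, {"id": 2, "name": "bob"}],
    [{"name": "alice", "id": 1}, {"id": 2, "name": "bob"}],  # key order differs
    [{"id": 1, "name": "alice"}, {"name": "bob", "id": 2}],
]

dedup = InFlightDeduplicator()
written = sum(dedup.write(rec) for stream in replica_streams for rec in stream)
print(f"received 6 records, stored {written}")
```

Because the fingerprint is computed on the record's content rather than on storage blocks, the six incoming records collapse to two stored ones even though the replicas serialize keys in different orders. A block-level deduplicator looking at the replicas' on-disk files would not necessarily see them as identical.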
Published at DZone with permission of Jeannie Liou, DZone MVB.