Cassandra for LOBS

by Dan Pritchett · Dec. 18, 11

Database storage is expensive. This is especially true if you build a traditional SAN-based M+N cluster. The storage array, Fibre Channel switches, and Fibre Channel interfaces drive the cost per terabyte into the thousands of dollars quite easily. And while storage costs in general are plummeting, SAN storage costs are falling at a slower rate, widening the gap between SAN and direct-attached storage. Given the cost of SAN storage, it would be unfortunate to waste it, which is exactly what we discovered we were doing.

Our platform makes a lot of third-party service calls. Several of these are very complex conversational interfaces that generate a lot of text. We retain these API interactions so that customer service can troubleshoot customer issues. Storing this third-party API request/response text was part of our platform from the beginning, and at that time the logical place to save the data was in database CLOBs. When we recently analyzed our SAN storage, we discovered that 40% of it was consumed by these API logs. Clearly there was an opportunity to save money with a lower-cost solution.

We looked at alternatives for managing LOBs and we settled on Cassandra for a few reasons:

  • It provides reliable storage on commodity hardware.
  • The shared-nothing architecture of Cassandra brings an attractive availability model.
  • Eventual consistency wasn't a big concern. This data is stored and then not retrieved for at least several hours, if at all.
  • The ability to achieve quorum on writes is important when using commodity storage.
  • Built-in expiration makes data life cycle management straightforward.
  • The scale-out model allows storage and transactional capacity to be added easily.
  • Data center awareness provides for clean multi-data-center deployments.
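The quorum point in the list above can be made concrete with a little arithmetic. This is a minimal sketch, assuming the usual majority rule in which a quorum is more than half of the replicas; the function names are illustrative and are not part of any Cassandra API.

```python
def quorum_size(replication_factor: int) -> int:
    """Smallest number of replicas that forms a majority."""
    if replication_factor < 1:
        raise ValueError("replication factor must be at least 1")
    return replication_factor // 2 + 1


def tolerated_failures(replication_factor: int) -> int:
    """Replicas that can be unavailable while a quorum write still succeeds."""
    return replication_factor - quorum_size(replication_factor)


if __name__ == "__main__":
    for rf in (1, 2, 3, 6):
        print(rf, quorum_size(rf), tolerated_failures(rf))
```

With three replicas, a quorum write needs two acknowledgements, so one failed commodity node does not block writes.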

There were a few concepts we wanted to standardize for the storage and management of large data objects. These were:

  1. Correlate the importance of the data to the business with the cost of storage.
  2. Ensure that a life cycle was applied so that LOBs weren't kept longer than they were meaningful.
  3. Ensure that data was as space efficient as possible.

The first was a new concept for us. The cost advantage of Cassandra comes from using commodity hardware with commodity drives. This hardware can and will fail, though, so to ensure data cannot be lost there must be multiple copies. Redundancy comes at a high cost, however. For example, if the cost of storage is $150/TB and you keep 6 copies of the data (3 each in 2 data centers), then your protected cost is $900/TB. Reducing redundancy increases the risk of a loss, but some data can afford to be lost, or at least afford to be temporarily unavailable. We wanted to be able to trade off data importance against cost, so we defined 3 levels of consistency and corresponding replication values for each.
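The protected-cost arithmetic above can be sketched in a few lines. Only the first tier below (3 copies in each of 2 data centers at $150/TB) comes from the example in the text; the other two tiers, all tier names, and the `StorageTier` type are hypothetical placeholders, since the article does not say what the three levels actually were.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class StorageTier:
    name: str
    copies_per_data_center: int
    data_centers: int

    @property
    def total_copies(self) -> int:
        return self.copies_per_data_center * self.data_centers


def protected_cost_per_tb(raw_cost_per_tb: float, tier: StorageTier) -> float:
    """Cost of one usable terabyte once every replica is paid for."""
    return raw_cost_per_tb * tier.total_copies


# Hypothetical tiers: only "critical" matches the article's 6-copy example.
TIERS = [
    StorageTier("critical", copies_per_data_center=3, data_centers=2),
    StorageTier("standard", copies_per_data_center=2, data_centers=2),
    StorageTier("disposable", copies_per_data_center=1, data_centers=2),
]

if __name__ == "__main__":
    for tier in TIERS:
        cost = protected_cost_per_tb(150.0, tier)
        print(f"{tier.name}: {tier.total_copies} copies, ${cost:.0f}/TB")
```

Laying the tiers out this way makes the trade-off explicit: each copy removed lowers the protected cost by the raw cost per terabyte, at the price of a higher risk of loss or unavailability.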

We also require that every LOB is provided with an expiration date. That can be set to never, but by providing a simple way to control the meaningful life of the data, we increase the likelihood of it being purged when it is no longer useful. For example, we can retain travel service debugging information for 6 months after the trip is complete. This is trivial for the caller to set, and Cassandra will clean up the data automatically after the expiration.

Another observation was that much of the text wasn't compressed even though it was highly compressible. When we designed the LOB service, the interface included both content type and content transfer encoding. Based on the type and the transfer encoding, we transparently compress and decompress the data. Our Cassandra storage costs are about one third of the SAN costs at the highest replication level, and compression reduces the storage needed by 90%, giving us an order of magnitude improvement in storage costs for text LOBs.
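The expiration and transparent compression rules described above could look roughly like the following. This is a sketch under stated assumptions, not the actual LOB service: the function names, the set of compressible content types, the choice of gzip, and the "never expires means no TTL" convention are all illustrative. The resulting number of seconds is the kind of value that could be handed to Cassandra as a time-to-live on the stored column.

```python
import gzip
import time
from typing import Optional, Tuple

# Assumed rule set: which content types are worth compressing.
COMPRESSIBLE_TYPES = {"application/json", "application/xml"}
ALREADY_COMPRESSED_ENCODINGS = {"gzip", "deflate", "compress"}


def should_compress(content_type: str, transfer_encoding: str) -> bool:
    """Compress text-like payloads unless the caller already compressed them."""
    if transfer_encoding.strip().lower() in ALREADY_COMPRESSED_ENCODINGS:
        return False
    # Drop parameters such as "; charset=utf-8" before comparing.
    media_type = content_type.split(";")[0].strip().lower()
    return media_type.startswith("text/") or media_type in COMPRESSIBLE_TYPES


def encode_lob(payload: bytes, content_type: str,
               transfer_encoding: str = "identity") -> Tuple[bytes, bool]:
    """Return the bytes to store and a flag saying whether we compressed them."""
    if should_compress(content_type, transfer_encoding):
        return gzip.compress(payload), True
    return payload, False


def decode_lob(stored: bytes, compressed: bool) -> bytes:
    """Reverse encode_lob so callers always get back what they wrote."""
    return gzip.decompress(stored) if compressed else stored


def ttl_seconds(expires_at: Optional[float],
                now: Optional[float] = None) -> Optional[int]:
    """Seconds until expiration; None means the LOB never expires."""
    if expires_at is None:
        return None
    current = time.time() if now is None else now
    remaining = int(expires_at - current)
    if remaining <= 0:
        raise ValueError("expiration date must be in the future")
    return remaining
```

Keeping the compression flag alongside the stored bytes lets reads stay transparent: the caller writes and reads plain text, and only the service knows whether the stored form is compressed.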

Our early usage of the Cassandra-based LOB service has produced positive results. Cassandra has proven to be reliable and performant. We have even experienced a hardware failure during a peak transaction period without any impact on the platform. Based on these early results, we plan to expand our usage of both the LOB service and Cassandra.

Source: http://www.addsimplicity.com/adding_simplicity_an_engi/2011/10/cassandra-for-lobs.html
