
Cassandra 1.1 – Tuning for Frequent Column Updates


Cassandra is known for its good write performance. But there are scenarios where you might run into trouble – especially when a particular use case generates heavy disk I/O. This can be the case for columns that receive frequent updates. However, you can avoid those problems with proper configuration, or simply by upgrading to a recent Cassandra version. The good news is that the changes can be applied to an already running system, so if you are already having problems, there is still hope.

Memtables are flushed to immutable SSTables, and it is possible that a single column value ends up stored in several SSTables when it changed over a long enough period of time. This guarantees fast inserts, because data is simply appended to disk. On the other hand, unnecessary writes decrease disk performance – not only because many SSTables have to be written, but mainly because the duplicates on disk will have to be compacted later on.
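To illustrate, here is a toy Python sketch (not Cassandra code – the data structures and names are simplified assumptions) of how a frequently updated column ends up duplicated across SSTables when the memtable is flushed often:

```python
# Toy model: a "memtable" dict that is flushed to immutable
# "SSTable" snapshots every few updates. All names are invented.

def simulate_flushes(updates, flush_every):
    """Flush the in-memory memtable to a new immutable SSTable
    every `flush_every` updates; return the list of SSTables."""
    sstables, memtable = [], {}
    for i, (column, value) in enumerate(updates, start=1):
        memtable[column] = value              # in-memory overwrite, no duplicate yet
        if i % flush_every == 0:
            sstables.append(dict(memtable))   # immutable snapshot on "disk"
            memtable = {}
    if memtable:                              # final partial flush
        sstables.append(dict(memtable))
    return sstables

# One column updated 6 times, with a flush after every 2 updates:
tables = simulate_flushes([("temperature", v) for v in range(6)], flush_every=2)
copies = sum("temperature" in t for t in tables)
print(copies)  # 3 copies of the same column -> work for compaction later
```

With a larger flush threshold (e.g. `flush_every=10` in the sketch) all six updates collapse into a single flushed entry – which is exactly the effect the tuning tips below aim for.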

The idea is to tune Cassandra in a way that lets us benefit from frequent updates. This can be achieved by keeping data in memory and delaying disk flushes; new updates then simply replace existing values in memory.
This generates less disk traffic, because it reduces the number of flushed duplicates. And that is not all – it also creates a write-through cache, from which read requests will benefit. Here are some configuration tips:

  • Make sure that you have at least Cassandra 1.1 – it contains an optimization for frequently changing values (CASSANDRA-2498). In cases where a single value is stored in multiple SSTables, older Cassandra versions had to read the column value from all SSTables in order to find the most recent one. Now the SSTables are sorted by modification time, so it's enough to read the most recent value and simply ignore the remaining outdated ones.
  • Increase the thresholds for flushing memtables. Each update that hits the memtable results in one less entry in an SSTable.
  • Each read operation checks the memtables first; if the data is still there, it is simply returned – this is the fastest possible access. It's like a non-blocking write-through cache (based on a skip list).
  • Too large a memtable, on the other hand, results in a larger commit log. This is not a problem until your instance crashes – it will then need some time to start, because it has to replay the whole commit log.
  • Compaction merges SSTables together, which increases read performance, since there is less data to go through. But this process does not have high priority: when Cassandra's resources are nearly exhausted, it will skip compaction, and this can lead to data fragmentation.
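The memtable and commit log thresholds from the tips above live in cassandra.yaml. A hedged sketch for Cassandra 1.1 – treat the values as illustrative starting points, not recommendations, and verify the option names against your own cassandra.yaml:

```yaml
# cassandra.yaml (Cassandra 1.1) – example values only
# More total memtable space = later flushes = more updates
# coalesced in memory before they ever hit disk.
memtable_total_space_in_mb: 2048

# A larger commit log cap goes with larger memtables, but remember:
# everything in it has to be replayed after a crash.
commitlog_total_space_in_mb: 4096
```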


  • The row cache really makes sense for frequent reads of the same row(s), and additionally when you read most of the columns of each row.
  • With an active row cache, access to a single column of a particular row loads the whole row with all its columns into memory. Analyze your data access patterns and make sure that this is not pure overhead and that you have enough memory. It would be a real waste of resources to load a hundred columns into memory just to access a few.
  • The row cache works as write-through for data that is already in it. Data is loaded into the row cache only when it is read and was not found in a memtable; from that point on it gets updated on each write operation. A frequently changing entry without read access will not affect the row cache, because it is not there.
  • Updates to data in the row cache decrease performance, and those frequently changing columns are probably also available in a memtable anyway. The read process searches the memtable first and, in case of a hit, ignores the row cache. From this point of view the row cache makes sense if you also read other columns which do not change frequently. For example, a single row has 200 columns: 50 receive frequent updates, the remaining 150 only sporadic ones, and the read process always reads all of them. In this case the row cache makes sense – we have to refresh 50 columns on each insert, but we gain fast access to the remaining 150.
  • It might be a good idea to disable the row cache, increase the memtable size in the hope of reducing disk writes, and use the memtable as a cache.
  • Disabling the row cache does not necessarily mean additional disk seeks. Cassandra uses memory-mapped files, which means that each file access is cached by the operating system. Relying on memory-mapped files is nothing new – MongoDB does not have a cache at all; it's not needed, since the file system cache works just fine. But MongoDB has a different data structure on the hard drive, because it stores BSON documents optimized for reads, all in one place, while Cassandra might (though not always) first need to collect data from different locations.
  • The row cache would also help in a situation where a single row is spread over many SSTables. In this case putting all the data together is a CPU-intensive operation, not to mention the possible disk accesses to read each column value.
  • When the row cache is disabled, the key cache must be used. The key cache is like an index – and you definitely want to load your whole index into memory.
  • When the row cache is disabled, the key cache is enabled, and a read operation hits the key cache, we have quite a performant solution. Searching the SSTables runs fully in memory; only reading the column value itself requires disk access – and maybe not even that, since the files are memory mapped.
  • When disabling the row cache, remember to tune read-ahead. The idea is to read only the single column from disk, and not more data when it is not needed.
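The lookup order described in the bullets above (memtable first, then row cache, then key cache plus SSTables) can be sketched in Python – a toy model with invented names, not Cassandra's actual read path:

```python
# Toy model of the read path described above; all names are invented.

def read_column(row_key, column, memtable, row_cache, key_cache, sstables):
    # 1. Memtable hit: the freshest value, pure in-memory access.
    if column in memtable.get(row_key, {}):
        return memtable[row_key][column], "memtable"
    # 2. Row cache hit: the whole row is already in memory.
    if row_key in row_cache:
        return row_cache[row_key].get(column), "row-cache"
    # 3. The key cache tells us which SSTables hold the row, so the
    #    search itself stays in memory; only the value read may hit disk.
    for table_index in key_cache.get(row_key, ()):   # newest first
        row = sstables[table_index].get(row_key, {})
        if column in row:
            return row[column], "sstable"
    return None, "miss"

memtable  = {"user:1": {"score": 42}}
row_cache = {"user:2": {"name": "bob"}}
sstables  = [{"user:3": {"city": "Berlin"}}]
key_cache = {"user:3": [0]}

print(read_column("user:1", "score", memtable, row_cache, key_cache, sstables))
# a memtable hit – frequently updated columns are usually served here
```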

To summarize: run performance tests, check the Cassandra statistics (for example with nodetool cfstats and cfhistograms), and verify how many SSTables have to be searched to find the data and what the cache usage is. This might be a good entry point for changing the memtable size or tuning the caching. In my case, disabling the row cache, a large key cache and increased memtable thresholds was the right decision.


