Cassandra Usurping MySQL on Twitter
King says that Twitter needs a system that can scale as quickly as the website is growing. The number of tweets per day rose from 2 million to over 50 million in 2009 alone. The MySQL + Memcached system just isn't cutting it, King said. The costs of operation are getting too high and a more automated, highly available, and scalable data system is needed, King said.
King's team tested several other solutions before arriving at Cassandra. One option was a more automated sharded MySQL model. They also tested the best-known open source NoSQL databases including HBase, Project Voldemort, MongoDB, MemcacheDB, Redis, Cassandra, and HyperTable. King looked for low requirements, a healthy open source community, scalability, and any single points of failure. Cassandra, notably, has no single point of failure. The system also had highly scalable writes for Twitter's extremely variable write traffic. Finally, King was impressed by the Cassandra community, which is now encapsulated as a Top Level Project at Apache.
Twitter is known for adopting new technologies if they can dramatically help improve their site's architecture. The company rewrote some of its backend code in Scala not too long ago. Now Twitter plans to completely replace the current solution with Cassandra over the coming months. King says they are currently moving their largest table (the statuses table, which includes tweets and retweets) over to Cassandra architecture. King explains how Twitter is making this transition:
We have a nice system for dynamically controlling features on our site. We commonly use this to roll out new features incrementally across our user base. We use the same system for rolling out new infrastructure.
So to roll out the new data store we do this:
- Write code that can write to Cassandra in parallel to Mysql, but keep it disabled by the tool I mentioned above
- Slowly turn up the writes to Cassandra (we can do this by user groups “turn this feature on for employees only” or by percentages “turn this feature on for 1.2% of users”)
- Find a bug :)
- Turn the feature off
- Fix the bug and deploy
- GOTO #2
Eventually we get to a point where we’re doing 100% doubling of our writes and comfortable that we’re going to stay there. Then we:
- Take a backup from the mysql databases
- Run an importer that imports the data to cassandra Some side notes here about importing. We were originally trying to use the BinaryMemtable interface, but we actually found it to be too fast — it would saturate the backplane of our network. We’ve switched back to using the Thrift interface for bulk loading (and we still have to throttle it). The whole process takes about a week now. With infinite network bandwidth we could do it in about 7 hours on our current cluster.
- Once the data is imported we start turning on real read traffic to Cassandra (in parallel to the mysql traffic), again by user groups and percentages.
- Once we’re satisfied with the new system (we’re using the real production traffic with instrumentation in our application to QA the new datastore) we can start turning down traffic to the mysql databases.
A philosophical note here — our process for rolling out new major infrastructure can be summed up as “integrate first, then iterate”. We try to get new systems integrated into the application code base as early in their development as possible (but likely only activated for a small number of people). This allows us to iterate on many fronts in parallel: design, engineering, operations, etc.
--Excerpt from MyNoSQL
Cassandra makes a lot of sense for Twitter and other large sites that deal with tremendous amounts of data. Facebook, the biggest social networking site in existence, was the architect of Cassandra, so it's no surprise that other social networking sites, like DIgg, find it to be extremely helpful. With Facebook and now Twitter using the Apache Cassandra system, the project, and the NoSQL movement, have been given the the seal of approval by the two great pillars of social networking.
For more information on Cassandra and the NoSQL databases, check out this article.