How DataSift is Datamining 120K Tweets Per Second

Attention architectural gurus!  Get ready to learn about how one company puts together its amazing datamining architecture, and hopefully you'll also walk away with some ideas of your own after reading Todd Hoff's new post on High Scalability. 

His post reviews the architecture of DataSift, a real-time Twitter datamining platform.

For the TL;DR crowd, I'll try to further summarize a simple look at the stack with some brief commentary:

  • C++ for high performance components; makes sense
  • PHP for the site, API server, internal web services, and a speedy, custom job queue manager; nice traditional choice
  • Java/Scala for communication with HBase and Map/Reduce jobs; nice to see some Scala, I was waiting for the Java
  • Ruby for Chef; a great DevOps tool that's written in Ruby, so it makes sense here

Data Stores

  • MySQL Percona server on SSD; Peter Zaitsev would be proud of the Percona choice
  • HBase cluster with 400TB of storage on...
  • about 30 Hadoop nodes; HBase and Hadoop seem to be a common combo for major data operations
  • Memcached; great caching of course
  • Redis is still used for some internal queues but it's on its way out
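Memcached in front of MySQL usually implies a cache-aside arrangement. The post doesn't include code, so here is a minimal, purely illustrative Python sketch of that pattern, with a plain dict standing in for Memcached and a callback standing in for the MySQL query (every name here is made up for the sketch):

```python
import time

class CacheAside:
    """Toy cache-aside wrapper: a dict stands in for Memcached,
    fetch_fn stands in for a (slow) MySQL query."""

    def __init__(self, fetch_fn, ttl=60):
        self.fetch_fn = fetch_fn
        self.ttl = ttl
        self.store = {}  # key -> (value, expiry timestamp)

    def get(self, key):
        entry = self.store.get(key)
        if entry and entry[1] > time.time():
            return entry[0]                  # cache hit
        value = self.fetch_fn(key)           # cache miss: hit the "database"
        self.store[key] = (value, time.time() + self.ttl)
        return value

# Usage: track how often the backing store is actually queried
calls = []
def slow_lookup(key):
    calls.append(key)
    return key.upper()

cache = CacheAside(slow_lookup, ttl=60)
cache.get("tweet:42")   # miss -> queries the backing store
cache.get("tweet:42")   # hit  -> served from the cache
```

The point of the pattern is the second call: repeated reads never touch the database until the TTL expires.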

Message Queues

  • ZeroMQ (or ØMQ), fast, lightweight, broker-free messaging used at DataSift in its Pub-Sub, Push-Pull, and Req-Rep patterns; this piece is really interesting and gaining lots of traction at real-time operations like Loggly
  • Kafka - the persistent, distributed message queue built by LinkedIn; DataSift uses it for high-performance persistent queues, and the Java/Scala components process its data
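The post doesn't show any Kafka code, so as a rough illustration of what "persistent queue" means here, this toy Python sketch mimics Kafka's core idea: an append-only log on disk, with each consumer tracking its own offset so messages survive restarts and can be replayed. The API and file layout are invented for the sketch and are nothing like Kafka's real client:

```python
import os
import tempfile

class PersistentQueue:
    """Toy Kafka-style commit log: producers append messages to a file
    on disk; each consumer keeps its own byte offset, so messages
    survive restarts and can be replayed from any point."""

    def __init__(self, path):
        self.path = path
        open(path, "ab").close()          # create the log if missing

    def append(self, message):
        with open(self.path, "ab") as log:
            log.write(message.encode() + b"\n")

    def read_from(self, offset):
        """Return (messages, next_offset) starting at a byte offset."""
        with open(self.path, "rb") as log:
            log.seek(offset)
            data = log.read()
        return data.decode().splitlines(), offset + len(data)

# Usage: the log path is illustrative; a real deployment shards and
# replicates logs like this across brokers.
path = os.path.join(tempfile.mkdtemp(), "tweets.log")
q = PersistentQueue(path)
q.append("tweet 1")
q.append("tweet 2")
msgs, offset = q.read_from(0)             # replay from the beginning
```

The design choice worth noticing is that the queue never deletes anything on read; "consuming" is just advancing an offset, which is what makes replay cheap.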

CI and Deployment

  • Jenkins pulls any changed code every 5 minutes, then automatically tests it with a variety of QA tools; Jenkins is the real deal, it seems
  • All projects are built and packaged as RPMs and sent to a dev package repo; good old RPM
  • Chef, mentioned before, automates deployments and infrastructure configuration.


Monitoring

  • Lots of tools you'll hear about in the DevOps community here; first, their services emit StatsD events
  • Those events are combined with more system-level checks and added to Zenoss
  • The data is then visualized in Graphite
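StatsD's wire format really is this simple - a plain-text "name:value|type" datagram over UDP - so a sketch of event emission fits in a few lines of Python. The metric name below is made up, and a real setup would point at the StatsD daemon's actual host and port:

```python
import socket

def emit_statsd(name, value, metric_type="c", host="127.0.0.1", port=8125):
    """Fire-and-forget one StatsD metric over UDP.
    Types: "c" counter, "g" gauge, "ms" timing."""
    payload = f"{name}:{value}|{metric_type}".encode()
    sock = socket.socket(socket.AF_INET, socket.SOCK_DGRAM)
    sock.sendto(payload, (host, port))
    sock.close()
    return payload  # returned only so the sketch is easy to inspect

# e.g. count one processed tweet (hypothetical metric name):
# emit_statsd("tweets.processed", 1)
```

Because it's UDP, emitting a metric never blocks or fails the service - which is exactly why this pattern is popular for instrumenting hot paths.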

That was nice and quick, but it'd be even nicer to have an illustration of the interactions between these technologies, which is just as important as knowing the contents of the stack.  Well, DataSift has provided that too:

The full-sized version is here.

From Todd Hoff, here is what he says is the point of it all:

  • Democratization of data access. Consumers can do their own data processing and analytics. An individual should, if they wished, be able to determine which breakfast cereal gets the most tweets, process the data, make the charts, and sell it to the brands. With a platform in place that's possible whereas if you had to set up your own tweet processing infrastructure it would be nearly impossible.
  • Software as a service to data. Put in a credit card, buy an hour's or a month's worth of data. Amazon EC2 model. Charge for what you use. No huge sign-up period or fee. Can be a small or large player.
  • You don’t really need big data, you need insight. Now what do we do with the data? Tools to create insights yourself. Data is worthless. Data must be used in context.
  • The idea is to create a whole new slew of applications that you couldn't with Twitter’s API.
-- Todd Hoff

And this only scratches the surface of this excellent post.  I encourage you to flag this post for more in-depth reading later.  There's tons of good info that I didn't even touch on.

Source: http://highscalability.com/blog/2011/11/29/datasift-architecture-realtime-datamining-at-120000-tweets-p.html


Opinions expressed by DZone contributors are their own.
