When it comes to real-time big data, stream processing frameworks are an interesting alternative to MapReduce. Instead of storing data and crunching it in batches, they process the data as it arrives, which immediately makes much more sense if you’re dealing with event streams. Frameworks like Twitter’s Storm and Yahoo’s S4 allow you to scale such computations: much as with MapReduce jobs, you specify small worker processes, which the framework then deploys and monitors automatically to provide robust scalability.
So at first you may think “stream processing is basically MapReduce for events”, but there is an important difference: there is no query layer in stream processing (at least, there isn’t in Storm or S4).
By a query layer, I mean the capability to query the results of your computations. For stream processing this means, in particular, querying while the computation is still running, because you are typically consuming a never-ending stream of new events.
For example, consider the ubiquitous word count example, where you pipe in a constant stream of sentences (say, tweets): how can you query the count for a given word at a given point in time?
The answer is a bit surprising to most people I’ve talked to: there is no way to query the result (at least from within the stream processing framework). The information is there, distributed over numerous worker threads, each of which sees and processes a part of the stream, but there is no way to access it. Instead, results have to be output periodically, either to the screen or to some persistent storage.
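To make this concrete, here is a minimal sketch of one such worker in Python. The class name and structure are illustrative, not the actual Storm or S4 API: partial counts live in the worker’s local memory, and the only way to get at them is a periodic `flush` to some sink.

```python
from collections import Counter

class WordCountWorker:
    """Toy stand-in for one stream-processing worker (think: a Storm
    bolt). Hypothetical interface, not the framework's real API."""

    def __init__(self, sink):
        self.counts = Counter()  # partial counts, local to this worker
        self.sink = sink         # the screen, a log, or persistent storage

    def process(self, sentence):
        # Called once per incoming event. While the stream is running,
        # nothing outside the framework can query self.counts.
        self.counts.update(sentence.lower().split())

    def flush(self):
        # The only way out: periodically push a snapshot to the sink.
        self.sink(dict(self.counts))
```

Note that each worker sees only its shard of the stream, so the global count for a word only exists after the flushed snapshots from all workers have been merged somewhere downstream.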
Now word count is only a toy example, of course, but it also means that for real-world applications, you need some database backend to store your results. Depending on the amount of data you process and the level of aggregation you do (or don’t do), this also means that a stock MySQL setup won’t necessarily keep up with your stream processing cluster.
The same can be said of MapReduce jobs that run periodically to update some statistics, but the difference is that MapReduce (via its distributed file system) doubles as the storage solution, while stream processing needs an additional backend.
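One common way to keep a modest database in the loop is to aggregate before writing: buffer per-key increments in the worker and push them downstream in batches, so the backend sees one write per key per interval instead of one per event. A minimal sketch, with a hypothetical `db_write` callback standing in for whatever bulk update your store supports:

```python
import time

class BatchedSink:
    """Buffer per-key increments and hand them to the database in one
    batch per interval. Illustrative only; the db_write callback is a
    placeholder for your actual bulk-update mechanism."""

    def __init__(self, db_write, interval=10.0):
        self.db_write = db_write          # e.g. performs a bulk upsert
        self.interval = interval          # seconds between flushes
        self.pending = {}                 # key -> accumulated delta
        self.last_flush = time.monotonic()

    def add(self, key, delta=1):
        self.pending[key] = self.pending.get(key, 0) + delta
        if time.monotonic() - self.last_flush >= self.interval:
            self.flush()

    def flush(self):
        if self.pending:
            self.db_write(self.pending)   # one batched write, not N
            self.pending = {}
        self.last_flush = time.monotonic()
```

Whether this is enough depends entirely on your event rate and how much the aggregation shrinks the data; at some point you outgrow a single relational database regardless of batching.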
So I think stream processing is a good fit when:
- you have a high frequency event stream,
- you have to do fairly complex analyses on each event independently, and
- you aggregate heavily, so that there is a large reduction in data volume.
But it’s not a good fit when:
- you need to make a lot of persistent updates with each event, or
- you need to query results while the analysis is still ongoing.
Let me know if I’m wrong. I’d be interested in learning about some real-world experiences with scaling stream processing!