One question which pops up again and again when I talk about streamdrill is whether that cannot be done by X, where X is one of Hadoop, Spark, Go, or some other piece of Big Data infrastructure.
Of course, the reason why I find it hard to respond to that question is that the engineer in me is tempted to say “in principle, yes” which sort of questions why I put all that work to rebuild something which apparently already exists. But the truth is that there’s a huge gap between “in principle” and “in reality”, and I’d like to spell this difference out in this post.
The bottom line is that all those pieces of Big Data infrastructure which exists today provide you with a lot of pretty impressive functionality, distributed storage, scalable computing, resilience, and so on, but not in a way which solves your data analysis problems out of the box. The analogy I like is that Big Data is a lot like providing you with an engine, a transmission, some tires, a gearbox, and so on, but no car.
So let us consider an example where you have some clickstream and you want to extract some information about your users. Think, for example, recommendation, or churn prediction. So what steps are actually involved in putting together such a system?
First comes the hardware, either on the cloud or by buying or finding some spare machines, and then setting up the basic infrastructure. Nowadays, this would mean installing Linux, HDFS, the distributed filesystem of Hadoop, and YARN, the resource manager which allows you to run different kind of compute jobs on the cluster. Especially when you go for the raw Open Source version of Hadoop, this step requires a lot of manual configuration, and unless you already did this a few times, this might take a while to get to work.
Then, you need to take in the data in some way, for example, by something like Apache Kafka, which is essentially a mixture of a distributed log storage and an event transport plattform.
Next, you need to process the data, which could either be done by a system like Apache Storm, a stream processing framework which lets you distribute computing once you have it broken down to pieces of computation taking in an event at a time. Or you use Apache Spark which let’s you describe computation on a higher level with something like a functional collection API and can also be fed a stream of data.
Unfortunately, this still does nothing useful out of the box. Both Storm and Spark are just frameworks for distributed computing, meaning that they allow you to scale computation, but you need to tell them what you want to compute. So you first need to figure out what to do with your data and this involves looking at data, identifying the kind of statistical analysis which is suited to solve your problem, and so on, and probably requires a skilled data scientist to spend one to two month working on the data. There are projects like mllib which provide more advanced analytics, but again these projects don’t provide full solutions to application problems but are tools for a data scientist to work with (And they are still somewhat early stage IMHO.)
Still, there’s more work to do. One thing people are often unaware of is that Storm and Spark have no storage layer. This means that they both perform computation, but to get to the result of the computation, you have to store it somewhere and have some means to query it. This means usually to store the result in a database, something like redis, if you want the speed of a memory based data storage, or in some other way.
The resulting system probably looks a bit like this.
Lots of moving parts which need to be deployed and maintained. Contrast this with an integrated solution. To me this is difference between a bunch of parts and a car.