I recently wrote about how I’d historically been using Pig for some daily and some ad-hoc data analysis, and how I’d found Hive to be a much friendly tool for my purposes. As I mentioned then, I’m not a data analyst by any stretch of the imagination, but have occasional need to use these kinds of tools to get my job done. The title of this post (while originally a placeholder for something more accurate) is a representation of the feeling I have for these topics – only a vague idea of what is going on, but I know it has to do with Big Data (proper noun).
Since writing that post, attempting and failing to find a simple way of introducing Hive usage at work (it’s yet another tool and set of data representations to maintain and support) I’ve also been doing a bit of reading on comparable tools, and frankly Hive only scratches the surface. Having a mostly SQL-compliant interface, there is a lot of competition in this space (and this blog post from Cloudera sums up the issue very well). SQL as an interface to big data operations is desirable for the same reasons I found it useful, but it also introduces some performance implications that are not suited to traditional MapReduce-style jobs which tend to have completion times in the tens of minutes to hours rather than seconds.
Cloudera’s Impala and a few other competitors in this problem space are attempting to address this problem by combining large-scale data processing that is traditionally MapReduce’s strong-point, with very low latencies when generating results. Just a few seconds is not unusual. I haven’t investigated any of these in-depth, but I feel as a sometimes-user of Hadoop via Pig and Hive it is just as important to be abreast of these technologies as the “power users”, so that when we do have occasion to need such data analysis, it can be done with as low a barrier to entry as possible and with the maximum performance.
Spark is now an Apache project but originated in the AMPLab at UC Berkeley. My impression is that it is fairly similar to Apache Hadoop – its own parallel-computing cluster, with which you interact via native language APIs (in this case Java, Scala or Python). I’m sure it offers superior performance to Hadoop’s batch processing model, but unless you are already heavily integrating from these languages with Hadoop libraries it doesn’t offer a drastically different method of interaction.
On the other hand, there are already components built on top of the Spark framework which do allow this, for example, Shark (also from Berkeley). In this case, Shark even offers HiveQL compatibility, so if you are already using Hive there is a clear upgrade path. I haven’t tried it, but it sounds promising although being outside of the Cloudera distribution and not having first-class support on Amazon EMR makes it slightly harder to get at (although guides are available).
As already suggested, Impala was the first alternative I discovered and also is incorporated in Cloudera’s CDH distribution and available on Amazon EMR, which makes it more tempting to me for use both inside and outside of EMR. It supports ANSI SQL-92 rather than HiveQL, but coming from Pig or other non-SQL tools this may not matter to you.
Developed by Facebook, and can either use HDFS data without any additional metadata, or with the Hive metadata store using a plugin. For that reason I see it as somewhat closer to Impala, although it also lacks the wider support in MapReduce deployments like CDH and Amazon EMR just like Shark/Spark.
Not really an open source tool like the others above, but deserves a mention as it really fits in the same category. If you want to just get something up and running immediately, this is probably the easiest option.
I haven’t even begun to scratch the surface of tooling available in this part of the Big Data space, and the above are only the easiest to find amongst further open source and commercial varieties. Personally I am looking forward to the next occasion I have to analyse some data where I can really pit some of these solutions against each other and find the most efficient and easy framework for my ad-hoc data analysis needs.