In the early days of data warehousing, a raging debate pitted two architectural approaches against each other: one camp advocated Ralph Kimball’s federated data mart architecture, while the other advocated Bill Inmon’s enterprise data warehouse architecture.
The old “Kimballite” vs. “Inmonite” discussions of the 1990s are reminiscent of a similar debate going on today about the relative merits and promise of Hadoop versus conventional data warehouses built on relational databases. And I suspect the issue will be resolved in a similar fashion: people will get tired of discussing it, and both architectures will co-exist in perfect harmony, each finding its appropriate place in the corporate IT landscape.
There are compelling arguments on each side of the question. Hadoop’s free open source distributions run on low-cost commodity hardware and provide virtually unlimited storage of structured and unstructured data. However, few organizations have stable, production-ready Hadoop deployments, and the tools and technologies currently available for accessing and analyzing Hadoop data are in the early stages of maturity. There are issues with query performance, with the ability to perform real-time analytics, and with the preference of business analysts and developers to leverage their existing SQL skills.
In spite of these to-be-expected early-stage challenges, I am coming across some real-world use cases for Hadoop-based analytics. At a recent Silicon Valley Forum on Big Data, Pandora’s director of software engineering explained how they have migrated their relational data warehouse to an analytic infrastructure built on Hadoop, using Tableau as the front end to Hive for visualization and analysis.
Data warehouses represent the established technology, and they aren’t likely to go away. Nearly all medium- to large-scale enterprises have data warehouses and marts in place that took years to build and that deliver unquestioned business value. The old axiom “if it ain’t broke, don’t fix it” is hard to argue with. However, data warehouses are not designed to accommodate the increasing volumes of unstructured data from web logs, social media, mobile devices, sensors, medical equipment, industrial machines, and other sources. And there are both economic and performance limitations on the amount of data that can be stored and accessed.
The current industry debate about the relative merits of Hadoop and data warehouses is as lively as the data warehouse architecture debates of the ’90s, but perhaps a bit less controversial and passionate. Co-existence seems to be the prevailing sentiment among most practitioners, as well as among the vendors of both Hadoop distributions and traditional data warehousing technologies. Cloudera, Hortonworks, MapR, and more recent Hadoop distro vendors ranging from Intel to WanDisco are promoting side-by-side use case scenarios, while IBM, Oracle, and Teradata are incorporating Hadoop into their core offerings.
So what’s it going to take to inject more controversy and passion into the debate? In my view, new innovations that make Hadoop data more accessible, more usable, and more relevant to business users will blur the distinctions between Hadoop and the traditional data warehouse. As those lines blur, the debate will intensify. These innovations are coming to market at a fast and furious pace, forcing organizations to make architectural decisions that will fundamentally determine how effectively they can exploit Big Data. More on that in the third and final installment of this blog series. And make sure to read the first blog of this series, “Differentiation Across the Apache Hadoop Distribution Vendor Landscape.”