A few months ago, Jeff Kelly published a comprehensive article that nicely captured the current state of Big Data and Hadoop and laid out a blueprint for next-generation cloud computing.
For cloud computing nerds like us, the article was a joy to read, and the section called “Next Generation Data Warehousing” especially caught our eye. In that section, Kelly listed five characteristics of next-generation data warehouses. Here, we want to see how our platform measures up against Kelly’s criteria.
1 Massively parallel processing, or MPP, capabilities: Next Generation Data Warehouses employ massively parallel processing, or MPP, that allow for the ingest, processing and querying of data on multiple machines simultaneously. The result is significantly faster performance than traditional data warehouses that run on a single, large box and are constrained by a single choke point for data ingest.
Check. We implemented a job queue (Perfect Queue) that distributes our customers’ jobs across hundreds of machines on Amazon Web Services.
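To make the idea concrete, here is a minimal sketch of a queue handing jobs to whichever worker is free. It is not Perfect Queue's actual API: `run_jobs`, the `(job_id, payload)` tuples, and the use of threads as stand-ins for machines are all assumptions made for illustration.

```python
import queue
import threading


def run_jobs(jobs, num_workers=4):
    """Distribute (job_id, payload) jobs across workers via one shared queue."""
    pending = queue.Queue()
    for job in jobs:
        pending.put(job)

    results = {}
    lock = threading.Lock()

    def worker(worker_id):
        while True:
            try:
                job_id, payload = pending.get_nowait()
            except queue.Empty:
                return  # nothing left to do, so this worker exits
            outcome = sum(payload)  # stand-in for the real work of a job
            with lock:
                results[job_id] = (worker_id, outcome)

    threads = [threading.Thread(target=worker, args=(i,)) for i in range(num_workers)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return results


# Hypothetical jobs: each payload is just a list of numbers to add up.
jobs = [(f"job-{i}", list(range(i + 1))) for i in range(10)]
results = run_jobs(jobs)
print(len(results), "jobs finished")
```

Which worker picks up which job varies from run to run; the point is only that no worker waits on another, so adding workers adds throughput.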
2 Shared-nothing architectures: A shared-nothing architecture ensures there is no single point of failure in Next Generation Data Warehousing environments. Each node operates independently of the others so if one machine fails, the others keep running. This is particularly important in MPP environments, in which, with sometimes hundreds of machines processing data in parallel, the occasional failure of one or more machines is inevitable.
Check. We take advantage of Hadoop MapReduce running on EC2 to process our customers’ jobs.
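The shared-nothing property can be sketched in a few lines. This is a toy model, not Hadoop code: the `Node` class, its `count_words` method, and the `run` helper are hypothetical names, and a word count stands in for a real job.

```python
class Node:
    """A worker that owns its own data partition and shares no state."""

    def __init__(self, name, partition, healthy=True):
        self.name = name
        self.partition = partition
        self.healthy = healthy

    def count_words(self):
        if not self.healthy:
            raise RuntimeError(f"{self.name} is down")
        return sum(len(line.split()) for line in self.partition)


def run(nodes):
    """Run every node independently; one failure does not stop the rest."""
    totals, failed = {}, []
    for node in nodes:
        try:
            totals[node.name] = node.count_words()
        except RuntimeError:
            # A real system would typically reschedule this partition elsewhere.
            failed.append(node.name)
    return totals, failed


nodes = [
    Node("node-1", ["a b c", "d e"]),
    Node("node-2", ["f g"], healthy=False),  # simulated machine failure
    Node("node-3", ["h i j k"]),
]
totals, failed = run(nodes)
print("finished:", totals)
print("failed:", failed)
```

Because each node touches only its own partition, the failure of `node-2` leaves the results from the other two nodes intact.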
3 Columnar architectures: Rather than storing and processing data in rows, as is typical with most relational databases, most Next Generation Data Warehouses employ columnar architectures. In columnar environments, only columns that contain the necessary data to determine the “answer” to a given query are processed, rather than entire rows of data, resulting in split-second query results. This also means data does not need to be structured into neat tables as with traditional relational databases.
Check. We designed and implemented a columnar database sitting on top of Amazon S3.
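As a rough illustration of why a columnar layout helps, the sketch below pivots a handful of records into one list per column and then answers a query by reading a single column. It is a toy in-memory model, not our storage format: the sample records and the `to_columns` and `total` helpers are made up for this example.

```python
# Sample records; note the last one has no "country" field at all.
rows = [
    {"user": "alice", "country": "US", "amount": 30},
    {"user": "bob", "country": "JP", "amount": 12},
    {"user": "carol", "country": "US", "amount": 8},
    {"user": "dave", "amount": 5},
]


def to_columns(rows):
    """Pivot row records into one list per column; missing fields become None."""
    names = []
    for row in rows:
        for name in row:
            if name not in names:
                names.append(name)
    return {name: [row.get(name) for row in rows] for name in names}


def total(columns, name):
    """Sum one column without touching any of the others."""
    return sum(value for value in columns[name] if value is not None)


columns = to_columns(rows)
print(total(columns, "amount"))
```

Summing `amount` never reads the `user` or `country` values, and a record with a missing field simply leaves a gap in that column instead of breaking a fixed table schema.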
4 Advanced data compression capabilities: Advanced data compression capabilities allow Next Generation Data Warehouses to ingest and store larger volumes of data than otherwise possible and to do so with significantly fewer hardware resources than traditional databases. A warehouse with 10-to-1 compression capabilities, for example, can compress 10 terabytes of data down to 1 terabyte. Data compression, and a related technique called data encoding, are critical to scaling to massive volumes of data efficiently.
Check. We achieve a 5-10x compression ratio. Columnar data storage helps with compression considerably, but our secret sauce is a binary serializer called MessagePack. MessagePack produces compact output and is incredibly fast at serializing and deserializing data. One of our co-founders is the original author of MessagePack, and we use it extensively throughout our stack.
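For readers who want to see how a compression ratio is measured, here is a small experiment using only Python's standard library. JSON and zlib are stand-ins chosen so the example runs anywhere; they are not what our stack uses, and the synthetic records and the `compression_ratio` helper are assumptions for illustration. The numbers it prints say nothing about our production ratios.

```python
import json
import zlib

# Synthetic records with a few repetitive fields.
countries = ["US", "JP", "DE"]
rows = [{"id": i, "country": countries[i % 3], "status": "ok"} for i in range(1000)]

# Row layout: one record after another, with the keys repeated in every record.
row_bytes = json.dumps(rows).encode("utf-8")

# Column layout: each field's values are stored together.
columns = {key: [row[key] for row in rows] for key in rows[0]}
col_bytes = json.dumps(columns).encode("utf-8")


def compression_ratio(raw):
    """Original size divided by compressed size; 10.0 means 10-to-1."""
    return len(raw) / len(zlib.compress(raw))


for label, raw in (("row layout", row_bytes), ("column layout", col_bytes)):
    compressed = len(zlib.compress(raw))
    print(f"{label}: {len(raw)} -> {compressed} bytes, "
          f"ratio {compression_ratio(raw):.1f}")
```

Grouping similar values together tends to give a general-purpose compressor longer repeated runs to work with, which is the intuition behind pairing columnar storage with compression. Actual ratios depend heavily on the data.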
5 Commodity hardware: Like Hadoop clusters, most Next Generation Data Warehouses run on off-the-shelf commodity hardware (there are some exceptions to this rule, however) from Dell, IBM and others, so they can scale-out in a cost effective manner.
Check. Our data warehouse sits on top of Amazon S3 and runs its jobs on EC2, so we scale out on Amazon's infrastructure rather than on specialized appliances.
In conclusion, Treasure Data’s Cloud Data Warehouse meets Kelly’s criteria for a Next Generation Data Warehouse pretty well. We know these criteria are no silver bullet, but meeting them is definitely a vote of confidence in our product =)