Brian leads design and development of a Master Data Management (MDM) solution for the Health Market industry. The solution runs on a Big Data platform powered by Cassandra. Brian recently contributed a Refcard on Cassandra to DZone.
DZone: What do you see as the elements of a Big Data Platform?
Brian: Big Data means different things to different people. For me, there are three dimensions. Sometimes Big Data relates to the sheer volume of data. Often the data has grown large enough to require a distributed storage mechanism. For others, Big Data relates to processing time. The velocity at which the system needs to operate mandates distributed processing. Yet others relate Big Data to the flexibility to accommodate a variety of data. A complete Big Data platform addresses all three of these dimensions.
DZone: Why did you select Cassandra?
Brian: We need to support both variety and velocity. Our system processes data from thousands of disparate sources. Each source can have a different schema that can change over time. Applying fuzzy matching to connect entities across those sources, consolidating the information, and producing analytics requires a tremendous amount of processing time. Even though almost all of the NoSQL databases support the variety of data in our sources, Cassandra (and DataStax Enterprise), with Hadoop tightly integrated, seemed to best support the distributed processing we required.
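To make the fuzzy matching step concrete, here is a minimal sketch in Python. It is an illustration only, not the matching logic Brian's team uses: the function names, the 0.85 threshold, and the choice of the standard library's `difflib.SequenceMatcher` as the similarity measure are all assumptions made for the example.

```python
from difflib import SequenceMatcher


def normalize(name):
    """Lowercase and replace punctuation with spaces, then collapse whitespace."""
    cleaned = "".join(ch if ch.isalnum() else " " for ch in name.lower())
    return " ".join(cleaned.split())


def similarity(a, b):
    """Similarity in [0, 1] between two names after normalization."""
    return SequenceMatcher(None, normalize(a), normalize(b)).ratio()


def match_entities(source_a, source_b, threshold=0.85):
    """For each name in source_a, keep its best match in source_b if it
    scores at or above the (hypothetical) threshold."""
    matches = []
    for a in source_a:
        best = max(source_b, key=lambda b: similarity(a, b), default=None)
        if best is None:
            continue
        score = similarity(a, best)
        if score >= threshold:
            matches.append((a, best, score))
    return matches
```

Note that every name in one source is compared against every name in the other, so the cost grows with the product of the source sizes. With thousands of sources, that is the kind of workload that benefits from distributed processing.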
DZone: Specifically, how did you set up the distributed processing?
Brian: Initially we explored a DataStax Enterprise-like solution, where we configured Hadoop on each node that ran Cassandra. This collocated the processing with the data on which it operated. We ran MapReduce jobs across the cluster, analyzing the data in batches as it changed. This sufficed, but we were striving to react in near real-time.
We switched from Hadoop to Storm as our processing engine. We built and open-sourced a bolt that enables writes to Cassandra from a Storm Topology. Similar to Hadoop, we ran Storm on each Cassandra node. This allowed us to distribute the processing, but eliminated the need to batch process everything. Instead, we could react to changes in near real-time.
DZone: What about reporting on Big Data?
Brian: We evaluated many of the new reporting solutions that claim Big Data support, but we found that many of them relied on batch processing under the hood. Given our desire to react in near real-time, that wasn't going to cut it. Additionally, we wanted support for ad hoc queries, not just canned reports. It seems relational databases still have a leg up in this arena. To achieve what we needed, we decided to keep our tried-and-true relational databases in our Big Data platform.
To loop them in, we extended Cassandra with support for triggers. Using triggers, we could react to changes in Cassandra, reflecting the data in the relational database, without altering the client code. Essentially, this allowed us to support ad hoc queries and report on data in Cassandra.
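The trigger pattern described here can be sketched in a few lines of Python. This is a toy model under stated assumptions, not Cassandra's trigger mechanism or the team's extension: `TriggeredStore` is a hypothetical in-memory stand-in for a column family, the `providers` table and its columns are invented for the example, and SQLite plays the role of the relational database.

```python
import sqlite3


class TriggeredStore:
    """Hypothetical stand-in for a column family: row key -> {column: value}.
    Registered triggers fire after every write."""

    def __init__(self):
        self.rows = {}
        self.triggers = []

    def add_trigger(self, callback):
        self.triggers.append(callback)

    def write(self, row_key, columns):
        row = self.rows.setdefault(row_key, {})
        row.update(columns)
        # Client code only calls write(); mirroring happens behind the scenes.
        for callback in self.triggers:
            callback(row_key, dict(row))


def make_sql_mirror(conn):
    """Build a trigger that reflects each written row into a relational table."""
    conn.execute(
        "CREATE TABLE IF NOT EXISTS providers "
        "(id TEXT PRIMARY KEY, name TEXT, state TEXT)"
    )

    def mirror(row_key, row):
        conn.execute(
            "INSERT OR REPLACE INTO providers (id, name, state) VALUES (?, ?, ?)",
            (row_key, row.get("name"), row.get("state")),
        )
        conn.commit()

    return mirror


conn = sqlite3.connect(":memory:")
store = TriggeredStore()
store.add_trigger(make_sql_mirror(conn))

store.write("p1", {"name": "Acme Clinic", "state": "PA"})
store.write("p2", {"name": "Riverside Pharmacy", "state": "NJ"})
store.write("p1", {"state": "NY"})  # an update is mirrored too

# An ad hoc query against the relational copy.
by_state = conn.execute(
    "SELECT state, COUNT(*) FROM providers GROUP BY state ORDER BY state"
).fetchall()
```

The point of the design is that the writer never knows about the relational copy: adding or removing a mirror is a matter of registering a trigger, not changing client code.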
DZone: Did you have to integrate with any other persistence systems?
Brian: Yes. I think Polyglot persistence is becoming the norm. For us, it is no different. We need to be able to search unstructured data in our Big Data platform. For this, we again leveraged triggers to index the data in Cassandra with SOLR. It is still very much a work in progress, but we expect to use this same pattern to integrate with other persistence mechanisms such as Neo4j.
Also, to ease the integration of Cassandra into our Polyglot persistence architecture, we implemented Virgil, a REST layer for Cassandra. The REST layer allowed us to integrate Cassandra using the same approach we used for the other persistence layers. SOLR and Neo4j both use REST. With Virgil, we could use REST with Cassandra too.
DZone: Any challenges in deploying and operating a Big Data platform?
Brian: Cassandra is remarkably easy to set up and administer. This weighed heavily in our decision to use Cassandra. Even with that simplicity, however, we paid careful attention to DevOps along the way. We took advantage of the tooling and educational courses that DataStax provides. Additionally, we made sure our deployment was automated entirely using Puppet. In Cassandra, configuration (e.g. consistency levels) is critical to both the infrastructure and the application. Having Puppet scripts allowed us to provision environments rapidly, and allowed development and IT to co-develop the configuration across those environments.
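One reason consistency levels matter to both infrastructure and application is simple arithmetic: a read is guaranteed to overlap the most recent successful write only when the number of replicas contacted on write plus the number contacted on read exceeds the replication factor. The Python sketch below illustrates that rule. The helper names are hypothetical, and only a handful of Cassandra's consistency levels are modeled.

```python
def quorum(replication_factor):
    """A quorum is a strict majority of the replicas."""
    return replication_factor // 2 + 1


def replicas_required(level, replication_factor):
    """Number of replicas that must respond for the given consistency level."""
    required = {
        "ONE": 1,
        "TWO": 2,
        "THREE": 3,
        "QUORUM": quorum(replication_factor),
        "ALL": replication_factor,
    }
    if level not in required:
        raise ValueError(f"unsupported consistency level: {level}")
    return required[level]


def reads_see_latest_write(write_level, read_level, replication_factor):
    """True when the read and write replica sets must overlap (R + W > N)."""
    w = replicas_required(write_level, replication_factor)
    r = replicas_required(read_level, replication_factor)
    return r + w > replication_factor
```

For example, with a replication factor of 3, writing and reading at QUORUM contacts 2 + 2 replicas, which exceeds 3, so the sets overlap. Writing and reading at ONE contacts 1 + 1, which does not. The replication factor is an infrastructure setting while the levels are chosen per request by the application, which is why the two sides need to agree on them.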
DZone: We noticed a shift in the terminology Cassandra uses for its data model. Any comment?
Brian: Cassandra has used its own terminology for structures in its data model. Specifically, Cassandra uses the terms keyspace and column family to represent constructs similar to the well-known concepts of schema and table, respectively. This steepened the learning curve for people transitioning from other systems. To lower that curve, the community decided to change the terminology. It will take a while for all the touch-points to catch up (client APIs, tools, etc.), but in the end it makes sense to align terminology. It makes Cassandra more SQL-friendly and paves the way for easier assimilation into the broader ecosystem of data management tools (ETL, BI, etc.).
DZone: Thanks for taking the time, Brian.
Brian: My pleasure, DZone. Enjoy the Refcard.