A Review of Persistence Strategies
This article comes to you from the DZone Guide to Database and Persistence Management.
The Big Data and NoSQL trends are, at best, misnamed. The same technologies used to process large amounts of data are sometimes needed for moderate amounts of data that must then be analyzed in ways that, until now, were impossible. Moreover, there are "NewSQL" databases, as well as ways to run SQL against NoSQL databases, that can be used in place of traditional solutions.
Many of these types of databases are not all that new, but interest in them wasn’t high when they were first invented, or the hardware or economics were wrong. For the last 30 years or so, the tech industry assumed the relational database was the right tool for every structured persistence job. A better idea is to lump everything under “use the right tool for the right job.”
Your RDBMS isn't going away. Some tasks are definitely tabular and lend themselves well to relational algebra. However, it is possible that some of the more expensive add-ons for your RDBMS will go away. Scaling a traditional RDBMS is difficult at best. The partitioning schemes, multi-master configurations, and redundancy systems offered by Oracle, SQL Server, and DB2 are expensive and problematic, and they often fail to meet the needs of high-scale applications. That said, try joining two or three small tables and qualifying by one or two fields on Hadoop's Hive and you will be rather underwhelmed with its performance. Try putting truly tabular data into a MongoDB document and achieving transactional consistency, and it will frustrate you.
Organizations need to identify the patterns and characteristics of their systems, data, and applications and use the right data technologies for the job. Part of that is understanding the types of data systems available and their key characteristics.
KEY-VALUE STORES

To some degree, all NoSQL databases derive from key-value stores. These can be thought of as hash maps. Their capabilities are fairly limited: they just look up values by their key. Most have alternative index capabilities that allow you to look up values by other characteristics of the values, but this is relatively slow and a good indication that a K-V store may not be the best choice. "Sharding," or distributing the data across multiple nodes, is very simple. The market is crowded with K-V stores such as Aerospike, Redis, and Riak, all of which are simple to implement.
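The hash-map model and the ease of sharding can be sketched in a few lines. This is a toy illustration, not any real store's API: keys are hashed to pick a node, so every client computes the same placement without a central lookup table.

```python
import hashlib

class ShardedKVStore:
    """Toy key-value store that shards keys across nodes by hash.
    Real stores (Redis, Riak, Aerospike) add replication,
    rebalancing, and persistence on top of this idea."""

    def __init__(self, num_nodes=3):
        self.nodes = [{} for _ in range(num_nodes)]

    def _node_for(self, key):
        # Hash the key to pick a shard deterministically.
        digest = hashlib.md5(key.encode()).hexdigest()
        return self.nodes[int(digest, 16) % len(self.nodes)]

    def put(self, key, value):
        self._node_for(key)[key] = value

    def get(self, key, default=None):
        return self._node_for(key).get(key, default)

store = ShardedKVStore()
store.put("user:42", {"name": "Ada"})
print(store.get("user:42"))  # {'name': 'Ada'}
```

Note that looking up a value by anything other than its key would require scanning every node, which is exactly why secondary-index lookups in K-V stores are comparatively slow.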
COLUMN FAMILY DATABASES

Many K-V stores also operate as column family databases. Here the "value" evolves into a multidimensional array. These are again most efficient when looking up values by key. However, "secondary indexes" are often implemented with pointer tables: a hidden table keyed by the indexed field points back to the key of the original table. A lookup is then essentially two O(1) operations instead of one, which is not as good as a single lookup, but still decent. The two most popular, HBase and Cassandra, have different write semantics: HBase, which is built on top of Hadoop, offers strong write integrity, while Cassandra offers "eventual consistency." If you have a fair number of reads and writes, and a dirty read won't end your world, Cassandra will handle them concurrently and well. A good example of this is Netflix: suppose the catalog is being updated and you see a new TV show before all its seasons have been added, so for a moment you only see season one until the next time you click. Would you even notice? In practice this is unlikely to happen, because by the time you click, the data has probably been added. Obviously this is no way to do accounting, but it is perfectly fine for a lot of datasets. HBase offers more paranoid locking semantics for when you need to be sure the reads are clean. This comes at a cost in concurrency, but is still about as good as row-locking in an RDBMS. Time series data is the classic use case for column family databases, but there are plenty of other strong use cases, such as catalogs, messaging, and fraud detection.
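The pointer-table mechanism behind secondary indexes can be sketched with two plain hash maps; the table and field names here are made up for illustration. The index maps the secondary field back to the primary key, so every lookup by that field costs two constant-time hops instead of one.

```python
# Primary "table": row key -> row data (one hash lookup).
users = {
    "u1": {"email": "ada@example.com", "name": "Ada"},
    "u2": {"email": "alan@example.com", "name": "Alan"},
}

# Hidden pointer table for the secondary index:
# indexed field value -> primary key of the original row.
email_index = {row["email"]: key for key, row in users.items()}

def find_by_email(email):
    # Two constant-time lookups: index -> primary key -> row.
    key = email_index.get(email)
    return users.get(key) if key else None

print(find_by_email("alan@example.com")["name"])  # Alan
```

The index also has to be updated on every write to the indexed field, which is part of why secondary indexes in column family stores carry extra cost.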
NEWSQL DATABASES

These are the "new hot thing," with SAP's HANA being especially well marketed. The only reason you can't have a distributed RDBMS is that they didn't evolve that way. Granted, distributed joins can be heavy, but when optimized correctly they can be done more efficiently than, say, MapReduce against Hadoop's Hive (which is notoriously slow for many types of queries). Column-oriented relational databases may also perform better than a traditional RDBMS for many types of queries. Conceptually, think of turning a table sideways: make your rows columns and your columns rows. Many values in an RDBMS are repetitive and can be "compressed" with either pointers or compression algorithms that work well on repetitive data. This also makes lookups and table scans much faster, and data partitioning becomes easier. In addition to HANA, there is also Splice Machine, which is built on top of HBase and Hadoop. These databases promise high-scale distribution with column-oriented performance and good compression, while still offering transactional integrity equivalent to a traditional RDBMS.
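The "turn the table sideways" idea and why repetitive columns compress well can be shown with a small sketch. The data and the run-length encoder below are illustrative; real column stores use more sophisticated encodings (dictionary, delta, bit-packing) on the same principle.

```python
# Row-oriented: each record is stored together.
rows = [
    ("US", 2021), ("US", 2021), ("US", 2022), ("DE", 2022),
]

# Column-oriented: one list per column, so a scan over a single
# field touches only that column's data.
countries = [r[0] for r in rows]
years = [r[1] for r in rows]

def run_length_encode(column):
    """Repetitive column values collapse into (value, count) runs."""
    runs = []
    for value in column:
        if runs and runs[-1][0] == value:
            runs[-1][1] += 1
        else:
            runs.append([value, 1])
    return runs

print(run_length_encode(countries))  # [['US', 3], ['DE', 1]]
```

Four country values collapse to two runs here; on a real table with millions of rows and a handful of distinct values, the savings, and the scan speedup, are dramatic.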
DOCUMENT DATABASES

To some, these are the holy grail of NoSQL databases. They map well to modern object-oriented languages and map even better to modern web applications sending JSON to the back-end (i.e., Ajax, jQuery, Backbone, etc.). Most of these natively speak a JSON dialect. Generally the transactional semantics are different from, but comparable to, those of an RDBMS. That is, if you have a "person" in an RDBMS, their data (phone numbers, email addresses, physical addresses) will be distributed across multiple tables, and a discrete write requires a transaction. With a document database, all of those characteristics can be stored in one "document," which can be written in a single discrete operation. Document databases are not so great if you have people, money, and other things all being operated on at once in operations that must complete all or none, because they do not offer sufficient transactional integrity for that. However, they scale quite well and are great for web-based operational systems that operate on a single big entity, or for systems that don't require transactional integrity across entities. There are also those who use document databases as data warehouses for complex entities with moderate amounts of data, where joining lots of tables caused problems. MongoDB and Couchbase are typically the leaders in this sector.
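The contrast with the multi-table relational layout can be made concrete. The document below is a hypothetical "person" entity: everything that would span several RDBMS tables lives in one nested structure, so a single write replaces the whole entity atomically with respect to itself.

```python
import json

# In an RDBMS this person would span several tables (persons,
# phone_numbers, email_addresses, postal_addresses), so one logical
# update needs a multi-statement transaction. As a document, the
# entire entity is written in one discrete operation.
person = {
    "_id": "p1",
    "name": "Ada Lovelace",
    "phones": [{"type": "mobile", "number": "555-0100"}],
    "emails": ["ada@example.com"],
    "addresses": [{"city": "London", "country": "UK"}],
}

# Document stores speak JSON (or a dialect such as BSON) natively,
# so the wire format matches what the web front-end already sends.
doc = json.dumps(person)
print(json.loads(doc)["emails"])  # ['ada@example.com']
```

The limitation follows directly: atomicity stops at the document boundary, which is why cross-entity, all-or-none operations are a poor fit.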
GRAPH DATABASES

Complex relationships may call for databases where relationships are treated as first-order members of the database. Graph databases offer this, along with better transactional integrity than relational databases for the types of datasets used with graphs: relationships can be added and removed much as a row is added to a table in an RDBMS, with the same all-or-none semantics. Social networks and recommendation systems are classic use cases for graph databases, but note that there are a few different types. Some are aimed more at operational purposes (Neo4j), while others are aimed more at analytics (Apache Giraph).
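"Relationships as first-order members" can be sketched with an adjacency list; this toy class is illustrative and is not modeled on any particular graph database's API. Each edge is a discrete piece of data that can be created or deleted on its own, rather than something reconstructed through join tables.

```python
class TinyGraph:
    """Adjacency-list sketch: relationships (edges) are first-class
    and are added or removed individually, much like rows."""

    def __init__(self):
        self.edges = {}  # node -> set of (relation, other_node)

    def relate(self, a, relation, b):
        self.edges.setdefault(a, set()).add((relation, b))

    def unrelate(self, a, relation, b):
        self.edges.get(a, set()).discard((relation, b))

    def neighbors(self, a, relation):
        return {b for rel, b in self.edges.get(a, set()) if rel == relation}

g = TinyGraph()
g.relate("alice", "follows", "bob")
g.relate("alice", "follows", "carol")
g.unrelate("alice", "follows", "bob")
print(g.neighbors("alice", "follows"))  # {'carol'}
```

Traversals ("who does Alice follow, and who do they follow?") become cheap pointer chases instead of repeated self-joins, which is the core appeal for social and recommendation workloads.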
RDBMS

Some datasets are very tabular, with use cases that lend themselves well to relational algebra. They require lots of different things to happen in all-or-none transactions, but don't expand to tens of terabytes. This is the relational database's sweet spot, and it is a big one. Even things that do not fit nicely, but have been made to work after a decade of hammering, may not be worth the migration cost. This is not about paring down your toolkit; it is about adding tools so you always have the right one for the job.
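The all-or-none behavior that defines this sweet spot is easy to demonstrate with Python's built-in sqlite3 module. The account transfer below is a hypothetical example: when a business rule fails mid-transaction, the RDBMS rolls back every write, which is precisely the guarantee the document and K-V stores above do not give across entities.

```python
import sqlite3

# Debit one account and credit another; if anything fails partway,
# both writes must be undone.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE accounts (name TEXT PRIMARY KEY, balance INT)")
conn.executemany("INSERT INTO accounts VALUES (?, ?)",
                 [("alice", 100), ("bob", 0)])
conn.commit()

try:
    with conn:  # commits on success, rolls back on any exception
        conn.execute("UPDATE accounts SET balance = balance - 150 "
                     "WHERE name = 'alice'")
        # Enforce a business rule; raising here undoes the debit too.
        (balance,) = conn.execute(
            "SELECT balance FROM accounts WHERE name = 'alice'").fetchone()
        if balance < 0:
            raise ValueError("insufficient funds")
        conn.execute("UPDATE accounts SET balance = balance + 150 "
                     "WHERE name = 'bob'")
except ValueError:
    pass

# The failed transfer left both balances untouched.
print(dict(conn.execute("SELECT name, balance FROM accounts")))
# {'alice': 100, 'bob': 0}
```

The debit was executed and then rolled back automatically; no partial state survives, with no application-level cleanup code.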
WHAT ABOUT HADOOP?
Hadoop is not a database. HDFS (Hadoop Distributed File System) distributes blocks of data redundantly across nodes, similar to Red Hat's Gluster, EMC's Isilon, or Ceph. You can think of it as RAID over several servers on the network. You store files on it much as you would on any file system; you can even mount it if you like. Hadoop pairs HDFS with MapReduce, a framework for processing data across a network. MapReduce was popularized by Hadoop but is quickly giving way to DAG (directed acyclic graph) frameworks such as Spark.
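The MapReduce model itself is simple enough to sketch in ordinary Python; this is the programming model, not Hadoop's API. Each "node" maps its chunk of input into key-value pairs, the framework shuffles pairs by key, and reducers fold each group; the canonical example is counting words.

```python
from collections import defaultdict
from itertools import chain

# Map: each "node" turns its chunk of input into (key, value) pairs.
def map_words(chunk):
    return [(word, 1) for word in chunk.split()]

# Shuffle + Reduce: group the pairs by key, then fold each group.
def reduce_counts(pairs):
    grouped = defaultdict(int)
    for word, count in pairs:
        grouped[word] += count
    return dict(grouped)

chunks = ["big data big", "data tools"]  # one chunk per node
mapped = chain.from_iterable(map_words(c) for c in chunks)
print(reduce_counts(mapped))  # {'big': 2, 'data': 2, 'tools': 1}
```

In real Hadoop the map and reduce phases run on different machines with a disk-backed shuffle in between; DAG engines like Spark generalize this two-stage pipeline into arbitrary chains of stages, largely in memory.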
While the original use cases for Hadoop were pure analytics or data warehousing models where data warehousing tools were not sufficient or economically sensible, the ecosystem has since expanded to include a number of distributed computing solutions. Storm allows you to stream data, Kafka allows you to do event processing/messaging, and other add-ons like Mahout enable machine learning algorithms and predictive analytics. The use cases are as diverse as the ecosystem. Chances are if you need to analyze a large amount of data, then Hadoop is a fine solution.
PUTTING IT TOGETHER
When you understand the characteristics of the different types of storage systems offered, the characteristics of your data, and the scenarios that your data can be used for, choosing the right set of technologies is not a daunting task. The era of one big hammer (the RDBMS) to suit every use case is over.
ANDREW C. OLIVER is the president and founder of Mammoth Data, a Durham-based Big Data/NoSQL consulting company specializing in Hadoop, MongoDB, and Cassandra. Oliver has over 15 years of executive consulting experience and is the Strategic Developer blogger for InfoWorld.com.