Accessing data in a database depends on the data model implemented by that database. The data model affects the operations available and the API used by clients to operate on that data. Different data models may provide more or less raw functionality. Usually, the less functionality provided directly by the data model, the more the client application must do for itself.
The data model determines how a client application must encode its data for storage. Applications will have some natural domain model that must be mapped onto something the storage technology can support.
Until recently, the dominant data model was the relational model. This is well documented in many places today. My inention here is for this to be an exposition of various non-relational models, but I'll include a brief description of the relational model for completeness, and because it provides a common reference point to compare the other models to.
The data model supported by a data store is distinct from its consistency model, which I will cover elsewhere, separately.
The relational model provides for the storage of records that are made up of tuples. Records are stored in tables. Tables are defined by a schema, which determines what columns are in the table. Columns have a name and a type. All records within a table fit that table's definition. SQL is a query language designed to operate over tables. SQL provides syntax for finding records that meet criteria, as well as for relating records in one table to another via joins; a join finds a record in one table based on its relationship to a record in another table.
Records can be created (inserted) or deleted. Fields within a record can be updated individually.
Implementations of the relational model usually provide transactions, which provide a means to make modifications spanning multiple records atomically.
In terms of what programming languages provide, tables are like arrays or lists of records or structures. For high performance access, tables can be indexed in various ways using b-trees or hash maps.
Key-Value stores provide access to a value based on a key.
The key-value pair can be created (inserted), or deleted. The value associated with a key may be updated.
Key-value stores don't usually provide transactions.
In terms of what programming languages provide, key-value stores resemble hash tables; these have many names: HashMap (Java), hash (Perl), dict (Python), associative array (PHP), boost::unordered_map<...> (C++).
Key-value stores provide one implicit index on the key itself.
A key-value store may not sound like the most useful thing, but a lot of information can be stored in the value. It is quite common for the value to be an XML document, a JSON object, or some other serialized form. The key point here is that the storage engine is not aware of the internal structure of the value. It is up to the client application to interpet the value and
manage its contents. The value can only be written as a whole; if the client is storing a JSON object, and only wants to update one field, the entire value must be fetched, the new value substituted, and then the entire value must be written back.
The inability to fetch data by anything other than one key may appear limited, but there are workarounds. If the application requires a secondary index, the application can maintain one itself. To do this, the application manages a second collection of key-value pairs where the key is the value of another field in the first collection, and the value is the primary key in the first collection. Because there are no transactions that can be used to make sure that the secondary index is kept synchronized with the original collection, any application that does this would be wise to have a periodic syncing process to clean up after any partial changes that occur due to application crashes, bugs, or errors.
Document stores provide access to structured data, but unlike the relational model, there may not be a schema that is enforced. In essence, the application stores bags of key-value pairs. In order to operate in this environment, the application adopts some conventions about how to deal with differing bags it may retrieve, or it may take advantage of the storage engine's ability to put different documents in different collections, which the application will use to manage its data.
Unlike a relational store, document stores usually support nested structures. For example, for document stores that support XML or JSON documents, the value of a field may be something that looks like another document. Document stores can also support array or list-valued keys.
Unlike a key-value store, document stores are aware of the internal structure of the document. This allows the storage engine to support secondary indexes directly, allowing for efficient queries on any field. The ability to support nested document storage leads to query languages that can be used to search for items nested inside others; XQuery is one example of this. MongoDB supports some similar functionality by allowing the specification of JSON field paths in queries.
Column stores are like relational stores, except that they flip the data around. Instead of storing records, column stores store all the values for a column together in a stream. An index provides a means to get column values for any particular record.
Map-reduce implementations such as Hadoop are most efficient if they can stream in their data. Column stores work particularly well for that. As a result, stores like HBase and Hypertable are often used as non-relational data warehouses to feed map-reduce for analytics.
A relational-style column scalar may not be the most useful for analytics, so users often store more complex structures in columns. This manifests directly in Cassandra, which introduces the notion of "column families," which get treated as a "super-column."
Column-oriented stores support retrieving records, but this requires fetching the column values from their individual columns and re-assembling the record.
Graph databases store vertices and the edges between them. Some support adding annotations to the vertices and/or edges. This can be used to model things like social graphs (people are represented by vertices, and their relationships are the edges), or real-world objects (components are represented by vertices, and their connectedness is represented by edges). The content on IMDB is tied together by a graph: movies are related to to the actors in them, and actors are related to the movies they star in, forming a large complex graph.
The access and query languages for graph databases are the most different of the set of those discussed here. Graph database query languages are generally about finding paths in the graph based on either endpoints, or constraints on attributes of the paths between endpoints; one example is SPARQL.
Why are there so many different models? We might as well ask, "why do we need so many different kinds of data structures?" In the realm of data structures, there is a lot of literature demonstrating the performance characteristics and best-fit use cases for things like lists, trees, and hash tables. In the realm of persistent data storage, relational databases cover a lot of ground because of their genericity. Some of the non-relational stores trade away some of the features of the relational model in exchange for easier schema evolution, and for data distribution. At the same time, some of them offer different data models in order to solve certain kinds of problems more easily than they are solved with the more structured relational model.
Choosing a data model isn't easy. The multi-purpose aspects of the relational model make it appear to be a one-size-fits-all solution, but there are domains that it doesn't handle well. That's not to say the relational model can't be used, but it may not be as efficient as other representations for certain types of operations. The proliferation of non-relational stores is probably a response to that, in much the same way that people choose different data structures for different purposes. And, because no single model does everything well, users of non-relational stores often use multiple stores for different purposes. An application may keep some data in HBase for analytics purposes (i.e., to use Hadoop), while keeping other data in MongoDB in order to be able to perform geo-spatial queries. Some applications combine relational stores with non-relational because they need to use transactions to keep track of monetary exchanges, and use RDBMSes for those, but need other data models for unstructured content storage, or a database with other availability characteristics for real-time click-stream data collection.
In order to choose a data store, you should consider the data access patterns your application needs, as well as how you plan to model your domain. When choosing a data store, you may find that it is unnatural to try to use one store for all your data, so you may want to use different data stores for different sub-systems, as suggested above.
- Scalable Datastores
A comparison of various databases by Rick Cattell. The PDF analysis catgorizes databases according to their data models, and then describes each one's consistency model (I see data models and consistency models as orthogonal, so I will treat that separately in another entry). There is more discussion of newer SQL implementations that attempt to solve some of the scalability and availability problems with traditional RDBMSes. This page also has links to many database offerings and other good references on the web.
- Graph Databases, NOSQL and Neo4j
After a brief introduction to non-relational models and database consistency issues, this has a couple of extended graph database examples, and describes Neo4j, one particular graph database implementation.