A Primer on Open-Source NoSQL Databases

A beginner's guide to the different flavors of NoSQL databases, including key-value, document-oriented, graph, and column-oriented databases.

This article introduces NoSQL databases: their properties, types, and data models, and how they differ from a standard RDBMS.

1. Introduction

Relational database management systems (RDBMS) have been in use for several decades. But in the era of social media, smartphones, and the cloud, we generate large volumes of data at high velocity, and that data ranges from simple text messages to high-resolution video files. Traditional RDBMS products have struggled to cope with the velocity, volume, and variety of data that this new era demands. Many RDBMS products are also commercially licensed and often run on enterprise-class, proprietary hardware. This has paved the way for open-source NoSQL databases, whose basic properties are a dynamic schema, distributed operation, and horizontal scalability on commodity hardware.

2. Properties of NoSQL

NoSQL is commonly expanded as "Not Only SQL".  NoSQL databases are typically schema-less, distributed, and horizontally scalable on commodity hardware.  They offer a range of functions and data types for different kinds of problems, whereas in an RDBMS the "blob" used to be the only data type for storing unstructured data.

2.1 Dynamic Schema

NoSQL databases allow the schema to be flexible. New columns can be added at any time, rows may or may not have values for those columns, and data types for columns are not strictly enforced. This flexibility is handy for developers, especially when they expect frequent changes over the course of the product life cycle.
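The flexibility described above can be illustrated with a minimal in-memory sketch. It is not the API of any particular NoSQL product: the `users` table, the `insert` helper, and the `column_values` helper are hypothetical names chosen for this example, and each row is simply a Python dictionary.

```python
# Hypothetical in-memory "table": each row is a dict, so rows need not share columns.
users = []


def insert(table, **columns):
    """Store a row with whatever columns the caller supplies."""
    row = dict(columns)
    table.append(row)
    return row


def column_values(table, column, default=None):
    """Read one column across all rows, tolerating rows that lack it."""
    return [row.get(column, default) for row in table]


insert(users, id=1, name="Asha")
# A new column ("email") appears later; the earlier row is left untouched.
insert(users, id=2, name="Ben", email="ben@example.com")
# No type enforcement: "id" is a string here and an integer above.
insert(users, id="3", name="Chen", tags=["admin"])

emails = column_values(users, "email")
```

Rows that were written before the `email` column existed simply report the default value when that column is read.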

2.2 Variety of Data

NoSQL databases can store structured, semi-structured, and unstructured data.  Depending on the product, logs, image files, videos, graphs, JSON, and XML can be stored and operated on as they are, without any pre-processing.  This reduces the need for ETL (Extract, Transform, Load).

2.3 High Availability Cluster

NoSQL databases support distributed storage on commodity hardware. They achieve high availability by scaling horizontally, that is, by adding nodes to the cluster and replicating data across them. This lets NoSQL databases benefit from the elastic nature of cloud infrastructure services.

2.4 Open Source

NoSQL databases are typically open-source software.  The software is free, and most of it is free to use in commercial products.  The open-source codebases can be modified to meet business needs.  Open-source licenses vary in their terms, so users must be aware of the license agreement that applies to each product.

2.5 NoSQL — Not Only SQL

NoSQL databases do not depend only on SQL to retrieve data. They provide rich APIs for CRUD (create, read, update, delete) operations, the counterpart of DML in SQL. These APIs are developer friendly and are supported in a variety of programming languages.

3. Types of NoSQL

There are four main types of NoSQL databases: key-value databases, column-oriented databases, document-oriented databases, and graph databases.  At a very high level, most of these databases follow a structure similar to that of an RDBMS. The database server might contain many databases, and each database might contain one or more tables.  Each table, in turn, has rows and columns to store the actual data.  This hierarchy is common across most NoSQL databases, though the terminology varies.

3.1 Key Value Database

Many key-value databases are based on the ideas in the Dynamo paper published by Amazon.  A key-value database lets the user store data in a simple <key> : <value> format, where the key is used to retrieve the value from the table.

3.1.1 Data Model

The table contains many key spaces, and each key space can have many identifiers under which key-value pairs are stored.  The key space is similar to a column in a typical RDBMS, and the group of identifiers under the key space can be considered rows.

This model is suitable for building simple, highly available applications.  Since most key-value databases support in-memory storage, they can also be used to build caching mechanisms.

Key Value Store Data Model
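As a rough illustration of this data model, the following sketch keeps key spaces and their key-value pairs in nested dictionaries. `KeyValueStore` and its methods are hypothetical names for this example and do not correspond to the client API of DynamoDB, Redis, or any other product.

```python
class KeyValueStore:
    """Toy in-memory store: key space -> key -> value."""

    def __init__(self):
        self._spaces = {}

    def put(self, key_space, key, value):
        # Create the key space on first use, then store or overwrite the value.
        self._spaces.setdefault(key_space, {})[key] = value

    def get(self, key_space, key, default=None):
        # Lookups are by key only; there is no query over the values.
        return self._spaces.get(key_space, {}).get(key, default)

    def delete(self, key_space, key):
        # Return True if the key existed and was removed.
        space = self._spaces.get(key_space, {})
        if key in space:
            del space[key]
            return True
        return False


cache = KeyValueStore()
cache.put("sessions", "user:42", {"cart": [1, 2]})
session = cache.get("sessions", "user:42")
```

Because every operation goes through a single key, a store like this is a natural fit for the caching use case mentioned above.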

3.1.2 Example

  • DynamoDB

  • Redis

3.2 Column-oriented Database

Column-oriented databases are based on the Bigtable paper published by Google.  They take a different approach from a traditional RDBMS: more and more columns can be added, producing a very wide table.  Because the table can be very wide, columns can be grouped under a family name, called a "Column Family" or "Super Column".  The column family is optional in some column databases.  In keeping with the common philosophy of NoSQL databases, column values can be sparsely distributed.

3.2.1 Data Model

The table contains column families (optional).  Each column family contains many columns.  The values for columns might be sparsely distributed and are stored as key-value pairs.

Column-oriented databases are an alternative to typical data warehousing databases (e.g. Teradata), and they are suitable for OLAP-style applications.

Column Database Data Model
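The row, column family, and column hierarchy can be sketched with nested dictionaries. This is only an in-memory illustration under the assumption that a row is addressed by a row key; `ColumnFamilyTable` and its methods are hypothetical and are not the Cassandra or HBase API.

```python
from collections import defaultdict


class ColumnFamilyTable:
    """Toy wide table: row key -> column family -> column -> value."""

    def __init__(self):
        self._rows = defaultdict(lambda: defaultdict(dict))

    def put(self, row_key, family, column, value):
        # Only the cells that are written exist; nothing is stored for the rest.
        self._rows[row_key][family][column] = value

    def get_row(self, row_key):
        # Return a plain-dict copy so callers cannot mutate the table by accident.
        if row_key not in self._rows:
            return {}
        return {family: dict(cols) for family, cols in self._rows[row_key].items()}

    def scan_column(self, family, column):
        # Collect one column across all rows, skipping rows that lack it.
        return {
            row_key: families[family][column]
            for row_key, families in self._rows.items()
            if family in families and column in families[family]
        }


table = ColumnFamilyTable()
table.put("u1", "profile", "name", "Asha")
table.put("u1", "contact", "email", "asha@example.com")
table.put("u2", "profile", "name", "Ben")  # u2 has no "contact" family at all

emails = table.scan_column("contact", "email")
```

The scan touches a single column and returns a value only for rows that actually have one, which is the sparse behaviour described above.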

3.2.2 Example

  • Apache Cassandra

  • HBase

3.3 Document-oriented Database

Document-oriented databases store semi-structured data.  A document can be JSON, XML, YAML, or even a Word document.  The unit of data is called a document (similar to a row in an RDBMS).  The table that contains a group of documents is called a "Collection".

3.3.1 Data Model

The database contains many collections.  A collection contains many documents.  Each document might be a JSON document, an XML document, a YAML document, or even a Word document.

Document databases are suitable for web-based applications and applications exposing RESTful services.

Document Database Data Model
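The database, collection, and document hierarchy might be sketched as follows for JSON documents. The `Collection` class, its `insert` and `find` methods, and the `database` dictionary are hypothetical names for this illustration and are not the MongoDB or Couchbase API.

```python
import itertools
import json


class Collection:
    """Toy collection of JSON documents, each stored under a generated id."""

    def __init__(self):
        self._docs = {}
        self._ids = itertools.count(1)

    def insert(self, document):
        doc_id = next(self._ids)
        # Round-trip through JSON: rejects non-JSON data and stores an independent copy.
        self._docs[doc_id] = json.loads(json.dumps(document))
        return doc_id

    def find(self, **criteria):
        # Match documents whose top-level fields equal every given criterion.
        return [
            doc
            for doc in self._docs.values()
            if all(doc.get(field) == value for field, value in criteria.items())
        ]


# A "database" here is just a mapping from collection name to collection.
database = {"articles": Collection()}
articles = database["articles"]
articles.insert({"title": "NoSQL primer", "tags": ["nosql"], "year": 2016})
# Documents in one collection may have different fields and nested structure.
articles.insert({"title": "Graph basics", "year": 2016, "author": {"name": "Asha"}})

matches = articles.find(year=2016)
```

Since the documents are already JSON, a web application could return the results of `find` from a RESTful endpoint with little or no transformation.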

3.3.2 Example

  • MongoDB

  • Couchbase

3.4 Graph Database

A real-world graph consists of vertices and edges.  In a graph database they are called nodes and relations.  Graph databases let us store and manipulate nodes, relations, and the attributes of nodes and relations.

Graph databases work best when the graphs are directed, i.e. when the relations between nodes have a direction.

3.4.1 Data Model

The graph database is a two-dimensional representation of a graph.  The graph is similar to a table.  Each graph has Node, Node Properties, Relation, and Relation Properties as columns, and each row holds values for these columns.  The values of the properties columns can be key-value pairs. Graph databases are suitable for social media and network problems that involve complex queries with many joins.

Graph Database Data Model
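A small sketch of nodes, directed relations, and their properties is shown below. The `Graph` class, the `FOLLOWS` relation type, and the `two_hops` helper are hypothetical names for this example; real graph databases such as Neo4j expose their own query languages and APIs.

```python
class Graph:
    """Toy property graph with directed relations."""

    def __init__(self):
        self.nodes = {}      # node id -> node properties
        self.relations = []  # (start node, relation type, end node, relation properties)

    def add_node(self, node_id, **properties):
        self.nodes[node_id] = properties

    def add_relation(self, start, rel_type, end, **properties):
        # A relation must have an existing start node and end node.
        if start not in self.nodes or end not in self.nodes:
            raise KeyError("both ends of a relation must be existing nodes")
        self.relations.append((start, rel_type, end, properties))

    def neighbours(self, node_id, rel_type):
        # Follow relations of one type in their stored direction only.
        return [
            end
            for start, kind, end, _ in self.relations
            if start == node_id and kind == rel_type
        ]

    def two_hops(self, node_id, rel_type):
        # Nodes reachable in exactly two steps, excluding direct neighbours and the start.
        direct = set(self.neighbours(node_id, rel_type))
        reached = set()
        for neighbour in direct:
            reached.update(self.neighbours(neighbour, rel_type))
        return reached - direct - {node_id}


social = Graph()
for name in ("asha", "ben", "chen"):
    social.add_node(name, kind="person")
social.add_relation("asha", "FOLLOWS", "ben", since=2015)
social.add_relation("ben", "FOLLOWS", "chen", since=2016)
social.add_relation("ben", "FOLLOWS", "asha", since=2016)

suggestions = social.two_hops("asha", "FOLLOWS")
```

In a relational schema the same "followers of the people I follow" question would need a self-join on a relation table, which is the kind of join-heavy query this model is meant to simplify.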

3.4.2 Example

  • Neo4j

  • OrientDB

  • HyperGraphDB

  • GraphBase

  • InfiniteGraph

4. Possible Problem Areas

The following are important areas to consider when choosing a NoSQL database for a given problem statement.

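4.1 ACID Transactions

Many NoSQL databases do not support full ACID transactions, or support them only in limited form. Examples have included MongoDB, Couchbase, and Cassandra, although transaction support varies by product and version, so check the documentation of the release you plan to use.  [Note: To learn more about ACID transactions, refer to the appendix below.]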

4.2 Proprietary APIs / SQL Support

Some NoSQL databases do not support Structured Query Language; they offer only an API.  There is no common standard for these APIs.  Every database implements its APIs in its own way, so there is an overhead in learning each one and in developing a separate adapter layer for each database.  Of the NoSQL databases that do offer SQL, some do not support all standard SQL features.

4.3 No JOIN Operator

Due to the nature of the schema or data model, not all NoSQL databases support JOIN operations by default, whereas in an RDBMS the JOIN operation is a core feature.  The query language in Couchbase supports join operations.  In HBase, joins can be achieved by integrating with Hive.  MongoDB has no SQL-style JOIN operator, although since version 3.2 its aggregation framework provides a $lookup stage that performs a left outer join between collections.
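When the database offers no join, the application typically has to combine the data itself. The sketch below shows one way to do that in application code, assuming the two data sets have already been fetched; the `customers` and `orders` data and the `join_orders_with_customers` helper are made up for this illustration.

```python
# Hypothetical data, as it might come back from two separate lookups.
customers = {
    "c1": {"name": "Asha"},
    "c2": {"name": "Ben"},
}
orders = [
    {"order_id": 1, "customer_id": "c1", "total": 30},
    {"order_id": 2, "customer_id": "c2", "total": 12},
    {"order_id": 3, "customer_id": "c9", "total": 5},  # no matching customer
]


def join_orders_with_customers(orders, customers):
    """Application-side inner join of orders to customers on customer_id."""
    joined = []
    for order in orders:
        customer = customers.get(order["customer_id"])
        if customer is None:
            continue  # inner-join semantics: drop orders without a customer
        joined.append({**order, "customer_name": customer["name"]})
    return joined


report = join_orders_with_customers(orders, customers)
```

The cost of this approach is extra round trips and extra code in every application that needs the combined view, which is why join support is worth checking before choosing a database.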

4.4 Leeway of the CAP Theorem

Most NoSQL databases take the leeway suggested by the CAP theorem and support only two of the three properties: Consistency, Availability, and Partition tolerance.  They do not support all three at once. [Note: Please refer to the appendix to learn more about the CAP theorem.]

5. Summary

NoSQL databases address problems where an RDBMS falls short, in both functional and non-functional areas.  In this article we have seen the basic properties, generic data models, types, and features of NoSQL databases.  To go further, pick any one NoSQL database and get hands-on with it.

Appendix A Theories behind Databases

A.1 ACID Transactions

ACID is an acronym for Atomicity, Consistency, Isolation, and Durability.  These four properties describe the guarantees a database makes for its transactions:

A.1.1 Atomicity

Atomicity means that database transactions must be atomic in nature. It is also called the all-or-nothing rule. Databases must ensure that a single failure results in the rollback of the entire transaction, up to the commit point. The transaction is committed only if all the operations in it succeed.
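The all-or-nothing rule can be sketched with a working copy that replaces the real data only when every step succeeds. This is a simplified illustration, not how any specific database implements transactions; the `Transaction` context manager and the `transfer` function are hypothetical.

```python
import copy


class Transaction:
    """Apply changes to a working copy; publish them only if no step fails."""

    def __init__(self, store):
        self._store = store

    def __enter__(self):
        self._working = copy.deepcopy(self._store)
        return self._working

    def __exit__(self, exc_type, exc, traceback):
        if exc_type is None:
            # Commit point: every operation succeeded, so publish the changes.
            self._store.clear()
            self._store.update(self._working)
        # On failure the working copy is discarded, which acts as the rollback.
        return False  # never swallow the exception


def transfer(accounts, source, target, amount):
    with Transaction(accounts) as working:
        working[target] += amount  # first operation: credit the target
        if working[source] < amount:
            raise ValueError("insufficient funds")  # failure after a partial change
        working[source] -= amount  # second operation: debit the source


accounts = {"a": 100, "b": 50}
transfer(accounts, "a", "b", 30)
```

A transfer that fails after the credit step leaves `accounts` exactly as it was, because the partial change only ever existed in the discarded working copy.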

A.1.2 Consistency

Databases must ensure that only valid data is stored. In an RDBMS, this is largely about enforcing the schema. In NoSQL, consistency varies depending on the type of database. For example, in a graph database such as Neo4j, consistency ensures that a relationship has a start node and an end node. MongoDB automatically creates a unique _id for each document, by default a 12-byte ObjectId that is usually written as a 24-character hexadecimal string.

A.1.3 Isolation

Databases allow multiple transactions to run in parallel. For example, when read and write operations happen in parallel, the read does not see the write until the write transaction is committed. Until then, the read operation sees only the previously committed data.
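The read behaviour described above can be sketched by keeping uncommitted changes in a private copy. This single-threaded toy ignores locking and concurrent writers, which real databases must handle; `IsolatedStore` and its methods are hypothetical names.

```python
class IsolatedStore:
    """Readers see only committed data; writers work on a private copy."""

    def __init__(self, data):
        self._committed = dict(data)

    def begin_write(self):
        # The writer receives its own copy, invisible to readers.
        return dict(self._committed)

    def commit(self, pending):
        # Publishing the copy makes all of the writer's changes visible at once.
        self._committed = dict(pending)

    def read(self, key):
        return self._committed.get(key)


store = IsolatedStore({"stock": 10})
pending = store.begin_write()
pending["stock"] = 9
seen_before_commit = store.read("stock")  # still the previously committed value
store.commit(pending)
seen_after_commit = store.read("stock")
```

The read issued between `begin_write` and `commit` returns the old value, and the new value becomes visible only after the commit.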

A.1.4 Durability

Databases must ensure that committed transactions are persisted to storage. Appropriate transaction and commit logs must be in place to ensure that the data is written to disk.
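One common way to provide this guarantee is an append-only commit log that is flushed to disk before a commit is acknowledged. The sketch below assumes a JSON-lines log format invented for this example; the `commit` and `recover` functions and the `commit.log` file name are hypothetical.

```python
import json
import os
import tempfile


def commit(log_path, key, value):
    """Append one committed change and force it to disk before returning."""
    record = {"key": key, "value": value}
    with open(log_path, "a", encoding="utf-8") as log:
        log.write(json.dumps(record) + "\n")
        log.flush()             # push Python's buffer to the operating system
        os.fsync(log.fileno())  # ask the operating system to write it to the device


def recover(log_path):
    """Rebuild the state after a restart by replaying the log in order."""
    state = {}
    if not os.path.exists(log_path):
        return state
    with open(log_path, encoding="utf-8") as log:
        for line in log:
            record = json.loads(line)
            state[record["key"]] = record["value"]
    return state


with tempfile.TemporaryDirectory() as directory:
    log_path = os.path.join(directory, "commit.log")
    commit(log_path, "balance", 100)
    commit(log_path, "balance", 70)
    recovered = recover(log_path)  # later records override earlier ones
```

Replaying the log yields the last committed value for each key, so the in-memory state can be lost without losing committed data.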

A.2 Brewer's CAP-Theorem

The CAP theorem states that any networked shared-data system can have at most two of three desirable properties: Consistency, Availability, and Partition tolerance.

A.2.1 Consistency

In a distributed database system, all the nodes must see the same data at the same time.

A.2.2 Availability

The database system must be available to service every request it receives. In other words, the DBMS must be a highly available system.

A.2.3 Partition Tolerance

The database system must continue to operate despite arbitrary partitioning due to network failures.
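The trade-off among the three properties can be illustrated with a toy cluster of two replicas. During a network partition, a system that favours consistency rejects writes, while one that favours availability accepts them and lets the replicas diverge. The `TwoNodeCluster` class and its "CP" and "AP" modes are hypothetical simplifications, not a model of any specific database.

```python
class Replica:
    def __init__(self):
        self.data = {}


class TwoNodeCluster:
    """Two replicas; the client always sends writes through replica `a`."""

    def __init__(self, mode):
        if mode not in ("CP", "AP"):
            raise ValueError("mode must be 'CP' or 'AP'")
        self.mode = mode
        self.a = Replica()
        self.b = Replica()
        self.partitioned = False  # True while the replicas cannot reach each other

    def write(self, key, value):
        if self.partitioned:
            if self.mode == "CP":
                # Favour consistency: refuse the write, giving up availability.
                raise RuntimeError("unavailable: peer replica unreachable")
            # Favour availability: accept locally, so the replicas now disagree.
            self.a.data[key] = value
            return
        # Healthy network: the write reaches both replicas.
        self.a.data[key] = value
        self.b.data[key] = value

    def consistent(self):
        return self.a.data == self.b.data


ap_cluster = TwoNodeCluster("AP")
ap_cluster.write("x", 1)
ap_cluster.partitioned = True
ap_cluster.write("x", 2)  # accepted, but only replica `a` sees it
diverged = not ap_cluster.consistent()
```

A cluster created with `TwoNodeCluster("CP")` raises an error for the same write during the partition and keeps both replicas identical.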


Opinions expressed by DZone contributors are their own.
