Over a million developers have joined DZone.
{{announcement.body}}
{{announcement.title}}

Data Modeling Guidelines for NoSQL JSON Document Databases

DZone's Guide to

Data Modeling Guidelines for NoSQL JSON Document Databases

Learn how document database data modeling is different from traditional relational schema modeling and get some guidelines for document database data modeling.

· Database Zone ·
Free Resource

MariaDB TX, proven in production and driven by the community, is a complete database solution for any and every enterprise — a modern database for modern applications.

In this blog post, I'll discuss how NoSQL data modeling is different from traditional relational schema data modeling, and I'll also provide you with some guidelines for document database data modeling.

Document databases, such as MapR-DB, are sometimes called "schema-less" — but this is a misnomer. Document databases don't require the same predefined structure as a relational database, but you do have to define the facets of how you plan to organize your data. Typically, with a NoSQL data store, you want to aggregate your data so that the data can quickly be read together, instead of using joins. A properly designed data model can make all the difference in how your application performs. We have an anecdote at MapR where one of our solution architects worked with a customer, and in a one-hour conversation about schema design, was able to improve access performance by a factor of 1,000x. These concepts matter.

Why NoSQL?

Simply put, the motivation behind NoSQL is data volume, velocity, and/or variety. MapR-DB provides for data variety with two different data models:

  1. MapR-DB as a wide column database with an Apache HBase API.
  2. MapR-DB as a document database with an Open JSON API.

MapR-DB JSON is different than other document data stores in that the row key design is the same for both models, and both can store data (columns or documents) with different access patterns in a different column family with the same row key.

Relational vs. NoSQL Data Modeling

In relational design, the focus and effort are around describing the entity and its relation to other entities — the queries and indexes are designed later. With a relational database, you normalize your schema, which eliminates redundant data and makes storage efficient. Then, queries with joins bring the data back together again. However, joins cause bottlenecks on read, with data distributed across a cluster, and this model does not scale horizontally. With MapR-DB, a table is automatically partitioned across a cluster by key range, and each server is the source for a subset of a table (called a tablet). MapR-DB has a query-first schema design in which queries should be identified first, then the row key should be designed to distribute the data evenly and also to give a meaningful primary index to query by. The row document (JSON) or columns (HBase) should be designed to group data together that will be read together. With MapR-DB, you de-normalize your schema to store in one row or document what would be multiple tables with indexes in a relational world. Grouping the data by key range provides for fast reads and writes by row key.

NoSQL Data Modeling Process

It is useful to start off with Entity Relationship modeling in order to define the entities, relationships, and attributes in your application:

  • Entities: Main objects in your application
  • Attributes: Properties of the objects in your application
  • Relationships: Connections between entities, i.e. 1-1, 1-many, many-many

The E-R model can be used with your query and data access patterns to define the physical model so that the data that is read together are stored together.

As a modeling example, we will use a social application similar to Reddit (Note: I do not know how Reddit is really implemented). Here are the use cases:

  • Users can post URLs to articles by category (like news, sports...):
  • Users can then make comments on posts:

Some of the query requirements are:

  • Display the posts by category and date (most recent first)
  • Display the comments by post
  • Display the posts by user ID

Logical Model Example

This is an E-R diagram for our example social application:

The Entities are:

  • User, Post, Comment, Category

The relations are:

  • A User makes a post
  • A Post has comments
  • A Post belongs to a category

Relational Model Example

This is the relational model for the example social application:

  • Users are stored in the user table.
  • The posted URL is stored in the Post table with a foreign key to the user that posted it, and a foreign key to the category for the post.
  • Comments about a post are stored in the comments table with a foreign key to the post and a foreign key to the user that commented.

Normalization

In a relational database, you normalize the schema to eliminate redundancy by putting repeating information into a table of its own. In this example below, we have an order table, which has a one-to-many relationship with an order items table. The order items table has a foreign key with the ID of the corresponding order.

Denormalization

In a denormalized datastore, you store in one table what would be multiple indexes in a relational world. Denormalization can be thought of as a replacement for joins. Often, with NoSQL, you de-normalize or duplicate data so that data is accessed and stored together.

Parent-Child Relationship-Embedded Entity

Here is an example of denormalization of the SALES_ITEM schema in a Document database:

{
    "_id": "123",
    "date": "10/10/2017",
    “ship_status”:”backordered”
    "orderitems": [
        {
            "itemid": "4348",
            "price": 10.00
        },
        {
            "itemid": "5648",
            "price": 15.00
        }]
}

If your tables exist in a one-to-many relationship, it's possible to model it as a single document. In this example, the order and related line items are stored together and can be read together with a find on the row key (_id). This makes the reads a lot faster than joining tables together.

Note: The maximum default row size is 32MB and optimal size is between 50-100KB. If the embedded entities are really long, then they could be bucketed by row key, or you could just store the ID to the embedded entity table (which would require your application to query that table also).

Document Model Example

This is the document model for the example social application:

There are two tables in the document model compared to four in the relational:

  • User details are stored in the User table.
  • Posted URLs are stored in the Post table:
    • The row key is composed of the category and a reverse timestamp so that posts will be grouped by category with the most recent first.
    • There is a secondary index on the posted by attribute, to query by who submitted the URL.
    • Comments are embedded in the post table.

Composite Row Key Design

Row keys are the primary index for MapR-DB (MapR-DB JSON 6.0 also has secondary indexes). Data is automatically distributed, as it is written by sorted row key range. You can include multiple data elements in a "composite" row key, which can be useful for grouping rows together for finding by key range. For example, if you wanted to group posts by category and date, you could use a row key like "SPORTS_20131012" (if you want the most recent first, use a reverse timestamp). If you wanted to group restaurants by location, you could use a row key like "TN_NASHVL_PANCAKEPANTRY".

Another option is to add a hash prefix to the row key in order to get good distribution and still have a secondary grouping.

Generic Data, Event Data, and Entity-Attribute-Value

Generic data is often expressed as name value or entity attribute value. In a relational database, this is complicated to represent because every row represents an instance of a similar object. JSON allows easy variation across records. Here is an example of clinical patient event data:

patientid-timestamp, Temperature , "102"
patientid-timestamp, Coughing, "True"
patientid-timestamp, Heart Rate, "98"

This is the document model for the clinical patient event data:

The row key is the patient ID plus a time stamp. The variable event type and measurement are put into name-value pairs.

Tree, Adjacency List, Graph Data

Here is an example of a tree, or adjacency list:

Here is a document model for the tree shown above (there are multiple ways to represent trees):

{
    "_id": "USA",
    “type”:”state”,
    "children": ["TN",”FL]
    "parent": null
}
{
    "_id": "TN",
    “type”:”state”,
    "children": ["Nashville”,”Memphis”]
    "parent": "USA”
}
{
    "_id": "FL",
    “type”:”state”,
    "children": ["Miami”,”Jacksonville”]
    "parent": "USA”
}
{
    "_id": "Nashville",
    “type”:”city”,
    "children": []
    "parent": "TN”
}

Each document is a tree node, with the row key equal to the node ID. The parent field stores the parent node ID. The children field stores an array of children node IDs. A secondary index on the parent and children fields allows to quickly find the parent or children nodes.

Inheritance Mapping

In modern object-oriented programming models, different object types can be related, for instance, by extending the same base type. In object-oriented design, these objects are considered instances of the same base type, as well as instances of their respective subtypes. It is useful to store objects in a single database table to simplify comparisons and calculations over multiple objects. But we also need to allow objects of each subtype to store their respective attributes, which may not apply to the base type or to other subtypes. This does not match a relational model but is very easy to do with a document model. Here is an example of object inheritance for store products (bike, pedal, and jersey are all types of store products):

In this online store example, the type of product is a prefix in the row key. Some of the name-value pairs are different and may be missing depending on the type of product. This allows to model different product types in the same table and to find a group of products easily by product type.

In this blog post, you learned how document database data modeling is different from traditional relational schema modeling, and you also got some guidelines for document database data modeling.

MariaDB AX is an open source database for modern analytics: distributed, columnar and easy to use.

Topics:
database ,nosql ,data modeling ,json ,document database ,tutorial ,relational database

Published at DZone with permission of

Opinions expressed by DZone contributors are their own.

{{ parent.title || parent.header.title}}

{{ parent.tldr }}

{{ parent.urlSource.name }}