{{announcement.body}}
{{announcement.title}}
refcard cover
Refcard #357

NoSQL Migration Essentials

Moving Out of a Relational Database

Need help with your NoSQL migration? Look no further than our "NoSQL Migration Essentials" Refcard. We walk through the primary steps for moving out of a relational database, plus important design principles to understand and consider in your migration process.

Readers will review key concepts that range from denormalizing and modeling data to defining access patterns, designing primary keys and indexes, and creating an entity relationship diagram — all demonstrated with a simple site application example. As a bonus, readers can use the included JSON structure at the end to interact with a NoSQL playground.

Free PDF for Easy Reference

Brought to You By

Couchbase
refcard cover

Written By

author avatar Matthew Groves
Product Marketing Manager, Couchbase
author avatar Frank Eaves
Senior Software Engineer, N/A
Section 1

Introduction

The need to transition from an SQL (relational) to a NoSQL (non-relational) data solution happens usually under a positive business environment. Your customers' requirements are causing you to reevaluate your data solution in order to help them achieve their business objectives. So congratulations, you have come to the right place. When you're finished reading this Refcard, you'll have a solid understanding of the foundations of this transition and serious paths to consider when going about planning and implementing your migration from an SQL to a NoSQL database. 

Section 2

About SQL and NoSQL

The primary engineering incentive driving the design principles for SQL is minimizing memory and disk usage, while that of NoSQL is improving scalability. Understanding the reasons behind each of these motivations will help set a foundation on which to base a migration from SQL to NoSQL. 

SQL Incentives (What Were They Thinking?) 

Relational Software Inc., now Oracle, announced the first implementation of SQL in 1979. At the time, one megabyte of storage cost $47,000 (inflation adjusted). Given that storage was so expensive, engineering design was focused on the efficient use of this scarce resource. Not repeating data was the main challenge to limiting the amount of storage used. References to data in other tables and constraints to assist with data integrity are the guiding principles within the SQL database community. 

This understanding and logic promulgated throughout the engineering community as well as into academia. Students of computer science are taught these principles and take that knowledge base with them into the professional workplace. 

NoSQL Incentives (How Did We Get Here?) 

From 1979–2021 (42 years), the number of people using networked computers (the Internet) grew exponentially. During this time, the cost of storage fell to $0.023 per gigabyte. The motivations of 1979's data engineers have changed drastically. Internet hosts went from single digits in the late 60s to just over one billion in 2021. The modern incentive of data engineering is scalability, while cost of storage is now basically an afterthought. The foremost guiding principle within the NoSQL database community is using keys and indexes on independent data to create efficient data access and as a result, achieve speed and scalability. 

Section 3

Key Concepts in NoSQL Migration

Why is database migration such a critical topic of conversation? It is essential to first understand how the issue ties directly to the success of today's modern business. Since the introduction of SQL, the importance of data has grown exponentially, and in tandem, so has the need to learn and apply the concepts of SQL databases. Academia and industry both found new and compelling uses for data. This naturally led colleges to teach relational principles. As graduates entered the workforce, relational databases were used in systems development, which resulted in many businesses running legacy software systems that are now struggling to scale and meet the needs of the modern Internet customer. 

Relying on legacy systems can result in lagging behind the smaller start-ups using modern solutions and cloud-based services, including NoSQL. These start-ups can scale their offerings far beyond what the legacy systems can offer. The loss of market share for these companies has opened a new market for industry solutions that help businesses with legacy systems migrate their data infrastructure to NoSQL and educate engineering teams about efficient implementation of it with the hopes of maintaining and regaining lost market share. 

Making this leap isn’t as hard as it might seem. Next, we'll cover the basics (and a comparison) of SQL and NoSQL data modeling. 

NoSQL Data Modeling 

The fundamental principles around NoSQL data modeling are the same as those used for SQL. The difference comes down to denormalization vs. normalization of data and the minor shift in the shared approaches that this causes. Below is a Venn diagram showing a high-level view of data modeling for SQL and NoSQL. 

NoSQL vs. SQL Venn diagram

The process of creating an entity relationship diagram, understanding your application’s access patterns, and then optimizing data throughput by using indexes is the same for both SQL and NoSQL. So let’s now focus on understanding the differences between normalization and denormalization.

The process of creating an entity relationship diagram, understanding your application’s access patterns, and then optimizing data throughput by using indexes is the same for both SQL and NoSQL. So let’s focus on understanding the differences between normalization and denormalization. 

Denormalizing Data 

What Is Normalization? …Just Be Like Everyone Else  

The objectives of normalization are to reduce data redundancy and the amount of data storage used, as well as to improve data integrity. In his paper, "Further Normalization of the Data Base Relational Model," Edgar Codd defined four forms of data normalization: 

Form 

Purpose 

1NF 

(First Normal Form) 

Remove unnecessary insertions, updates, and deletion dependencies from the data model — all values must be atomic 

2NF 

Optimize your data model so when new data types are required, restructuring the schema is unneeded — non-key values must have a clear connection to the primary key 

3NF 

Remove transitive dependencies within the data model 

4NF 

Identify statistics that could change in the future; make changes to the data model relationships so querying for these statistics is neutral 

Following this process results in a set of tables with well-defined constraints that meet the objectives of reduced data redundancy and keeping data integrity intact. 

What Is Denormalization? — Only Exceptional Data Need Apply! 

The objective of denormalization is to design all access patterns in the most efficient way possible. Data access patterns are a main focus, and careful analysis is needed to attain that efficiency. This is done by combining or nesting data into one structure, which makes the reads and writes faster and avoids the overhead of joins.  

Some people think that NoSQL is schema-less; however, this isn't true. All applications have an inherent logical data structure, which still exists in a NoSQL database — it is just stored implicitly. By utilizing this capability, joins can be reduced, allowing the application to retrieve all necessary information using a single lookup. 

Before jumping into the example, let's first review the possible variations of NoSQL data models. 

NoSQL Models

For SQL data solutions, a table is the core data model. However, there are multiple models for NoSQL data solutions to choose from. This mainly depends on the NoSQL database you are using — the four most popular types: 

  1. Key-value pair  
  2. Column-oriented 
  3. Graph-based 
  4. Document-oriented 

Many NoSQL databases support two or more of these models. These are known as "multi-model" databases. 

Quick note: In a later section, a simple music site is used to demonstrate denormalization. The data used in the models below — band and song names — are taken from that example. 

Key-Value Pair 

For key-value pair NoSQL databases, the model consists of a key and value: 

Primary Key 

Value 

Band Name 

Coldplay 

 But for more complex data structures, the value can be a JSON object, which is greater in depth and complexity. Typically, text is used as the visualization model for JSON structures. 

 
Primary Key: Band Name 
Value: 
{ 
   "Band Name": "Coldplay", 
   "Year": 2002, 
   "Origin": "Sheffield, England" 
   } 


Column-Oriented 

For column-oriented NoSQL databases, tables (or "column families", since they are fundamentally different from relational tables) can be used to show the column data structures. Tables can have multiple header rows — one for each data structure: 

Primary Key 

Attributes 

Band Name 

Year 

Origin 

Band Members 

Arctic Monkeys 

2002 

Sheffield, England 

Alex Turner, Jamie Cook, Nick O'Malley, 

Matt Helders 

Song Name 

Year 

Peak Chart Pos 

 

When the Sun Goes Down 

2018 

1 

 


This model shows two data structures, Band Name and Song Name. 

Graph-Based  

For graph-based NoSQL databases, the model consists of nodes and edges. Data structures are put into nodes, and relationships are represented by the edges that join one node to another. For example, visualizing a band, the songs that the band performs, and the members of the band might be modeled this way: 

 

Document-Oriented 

Document-oriented NoSQL databases have a primary key and a JSON value.  

 
Primary Key:  Arctic Monkeys 
Value: 
{ 
   "Band Name": "Arctic Monkeys", 
   "Year": 2002, 
   "Origin": "Sheffield, England", 
   "Band Members": [ 
      "Alex Turner", 
      "Jamie Cook", 
      "Nick O'Malley", 
      "Matt Helders" 
      ], 
   "Genres": [ 
      "Indie rock", 
      "garage rock", 
      "post-punk revival", 
      "psychedelic rock", 
      "alternative rock" 
      ] 
   } 

 

Because all SQL models use tables, describing the migration process from SQL to NoSQL with tables may be a more applicable depiction that resonates with those coming from an SQL background. Therefore, in the following sections, a table model will be used. 

With the base knowledge about NoSQL databases, let's dive into how denormalization works. 

Moving Data Into a NoSQL Database 

To demonstrate the core concept of denormalization, we'll walk through the design of a data solution from the beginning. However, most likely you will have gone through the initial phase(s) in building your current SQL data solution.   

When creating any data solution, whether SQL or NoSQL, you should consider following these steps: 

  1. Understand your application – Use design tools and processes that are best suited to illustrate your problem domain and its design. For this example, I use wire diagrams. 
  2. Build an entity relationship diagram – Capture the data structures and the relationships between them. 
  3. Define your access patterns – The data that should be grouped together for optimal data delivery based on the application's needs. 
  4. Design your primary keys and indexes – Analyze the best way to design the primary keys and indexes for optimal data delivery. 

For this example, our problem domain is a simple music site where users are presented with a list of songs and associated data, as well as links to more information about each song and band.  

1: Understand Your Application 

The functionality this application will provide is allowing users to choose a song or band they want to read more details about. Below is the Home screen of our simple music site.  

Functionality: Present a list of songs with links to song and band details. 

Users can select the song name to navigate to song details or the band name to navigate to band details.  

The data model for the Home screen: 

Primary Key 

Attributes 

Home Screen 

Song Name 

Band Name 

Year 

Peak Chart Pos 

Yes 

When the Sun Goes Down 

Arctic Monkeys 

2018 

1 

Yes 

Yellow 

Coldplay 

2000 

2 

Note that accessing this data structure will be done using the Home Screen column, and therefore, indexing this column may be necessary for improved performance.  

Functionality: Allow users to view song details. 


The model for the Song Details screen: 

Primary Key 

Attributes 

Band Name - Song Name 

Album 

Year 

Peak Chart Pos 

Lyrics 

Arctic Monkeys - When the Sun Goes Down 

Whatever People Say I Am, That's What I'm Not 

2018 

1 

So who's that girl there? 

I wonder what went wrong 

So that she had to roam the streets 

Coldplay - Yellow 

Parachutes 

2000 

2 

Look at the stars 

Look how they shine for you 

And everything you do 

Note that if your primary key must be a number, then hashing the string value into a number will work as well. 

Functionality: Allow users to view band details. 

The model for the Band Details screen: 

Primary Key 

Attributes 

Band Name - Details 

Year 

Origin 

Band Members 

Genres 

Arctic Monkeys - Details 

2002 

Sheffield, England 

Alex Turner, Jamie Cook, Nick O'Malley, 

Matt Helders 

Indie rock, 

garage rock, 

post-punk revival, 

psychedelic rock, 

alternative rock 

Coldplay - Details 

1996 

London, England 

Chris Martin, 

Jonny Buckland, 

Guy Berryman, 

Will Champion, 

Phil Harvey 

Alternative rock, 

pop rock, 

post-Britpop, 

pop 

 
Note that the Band Members and Genres columns are both arrays, representing a one-to-many embedded relationship.  

2: Build an Entity Relationship Diagram 

The entity relationship diagram looks like this: 

There are four entities: Band NameSong NameGenre, and Band Member. And there are three relationships, PerformBelongsTo, and PartOf — all of which are one-to-many. 

3: Define the Access Patterns 

Access patterns are the data patterns/groupings needed for the application. In our music site, we have Song Name and Band Name, and the access patterns for each of the wire diagrams are: 

  • Home screen – Displays the list of songs 
  • Song Details screen – Displays song details 
  • Band Details screen – Displays band details 

 To better understand these access patterns, let’s look at a composite data model for all wire diagrams (data sourced from Tables 4-6): 

Primary Key 

Attributes 

Home Screen 

Song Name 

Band Name 

Year 

Peak Chart Pos 

Yes 

When the Sun Goes Down 

Arctic Monkeys 

2018 

1 

Yes 

Yellow 

Coldplay 

2000 

2 

Band Name - Song Name 

Album 

Peak Chart Pos 

Year 

Lyrics 

Arctic Monkeys - When the Sun Goes Down 

Whatever People Say I Am, That's What I'm Not 

1 

2018 

So who's that girl there? 

I wonder what went wrong 

So that she had to roam the streets 

Coldplay - Yellow 

Parachutes 

2 

2000 

Look at the stars 

Look how they shine for you 

And everything you do 

Band Name - Details 

Band Members 

Origin 

Year 

Genres 

Arctic Monkeys - Details 

Alex Turner, Jamie Cook, Nick O'Malley, 

Matt Helders 

Sheffield, England 

2002 

Indie rock, 

garage rock, 

post-punk revival, 

psychedelic rock, 

alternative rock 

Coldplay - Details 

Chris Martin, 

Jonny Buckland, 

Guy Berryman, 

Will Champion, 

Phil Harvey 

London, England 

1996 

Alternative rock, 

pop rock, 

post-Britpop, 

pop 

 

The beneficial part about denormalizing data this way is that when queried using a primary key, the application retrieves the data needed to satisfy the functionality without resorting to joins. The downside of denormalizing your data is that some data may be duplicated (i.e., Band NameYearPeak Chart Pos). In a fully normalized SQL database, data isn't duplicated, and data integrity is maintained by the database through table constraints. 

To meet scalability needs, a NoSQL database has duplicated data and moves the responsibility of data integrity to the application code. When data needs to be updated in multiple places, it is the application’s responsibility to keep all data structures up to date.  

While not possible in every NoSQL database, many document databases have recently introduced JOIN capability. With JOIN functionality, you can design your model to take advantage of whichever approach is best. You can denormalize to maximize performance, and you can normalize to maximize data integrity. 

4: Design Your Primary Keys and Indexes 

In the composite table above, there are separate primary keys for each data structure used. By querying this single table, you can extract different schemas or groups of data. The Home screen’s primary key is the Home Screen column. When the Home Screen column is queried, it will look for data that has a Home Screen attribute set to yes. Therefore, an index should be created on the Home Screen attribute, so that look up will be as fast as possible. The Song Details primary key uses a string that consists of the Band Name and Song Name concatenated together. The Band Details primary key uses a string consisting of the Band Name followed by - Details. These primary keys are designed this way to give the application the data it needs in the quickest manner possible. 

Common Migration Paths 

In most cases, legacy solutions are unable to meet organizations’ scalability objectives. And the engineering teams supporting those legacy applications may not be aware of NoSQL database principles; however, they have a tremendous amount of application and domain knowledge. In short, companies in these situations have two primary routes for migrating to a NoSQL database.  

Direct Port 

Every business' situation is unique, and the migration approach will inevitably be unique as well. A direct port will be easiest if your current data solution has a clear decoupling between the application logic and data. If the application data layer is designed in a manner that allows pulling out one database architecture and replacing it with another, then it's just a matter of implementing the data layer interface using NoSQL. However, if your engineering team needs to create a clear decoupling of the application logic and data, that work will need to be completed first. Then they can implement the data layer interface using NoSQL. 

These tasks may be done in parallel with one team coming up to speed on NoSQL— and following the four steps for creating a new data solution discussed above — while the other team does the architectural work of decoupling the application logic and data. Then when both teams are ready, the application can be switched from an SQL database to a NoSQL database. 

Piecemeal 

In cases when a legacy application must remain running to keep the business operational, switching the entire data layer over at once isn't reasonable. Depending on the size of the data layer and the number of engineering teams involved, migrating piecemeal may be an option. Oftentimes, the areas where the application struggles under the current SQL data layer is known. Focus your engineering efforts there first, porting the critical code over in a piecemeal fashion. 

SQL and NoSQL Working Together

If your application’s data layer involves using multiple databases, you can choose to port over one database to a NoSQL solution while leaving the other alone. This approach works well in a cloud-native environment where multiple microservices are running. Teams should start with the service that is under the most stress and then move to the next service. When a monolith is involved, a careful process needs to be followed where you would wall off the part that has the most pressing issues with scaling. After decoupling this part, it starts to look like a microservice, and you can then focus on the data layer porting process. Once complete, the next piece of the monolith can be addressed, repeating the process. 

Convert to NoSQL While Still Using SQL Queries 

The market has recognized the need for businesses to migrate to a NoSQL solution. And those currently using an SQL solution also need a bridge that enables them to move incrementally into a full NoSQL database. This method offers three things:  

  1. Businesses can address their scalability objectives. 
  2. Engineering teams have time to come up to speed on NoSQL principles. 
  3. Businesses have more options for addressing and evolving their data solution going forward. 

There are solutions available that allow you to convert your SQL schema over to a NoSQL database while continuing to use SQL queries. There are also NoSQL databases that are implementing the "SQL++" standard: adding denormalized capabilities to the familiar query language. This approach may help teams ramp up to NoSQL faster, building on their existing experience and code base. (With the advent of SQL++, the term "NoSQL" can be defined as "Not Only SQL": SQL is still an option for interacting with data, but it's not the only one).  

After, the business can operate as normal and only address the areas where scalability has become a critical issue, leaving the rest of their data solution as is. As the business and engineering teams grow more familiar with NoSQL databases and key principles, leaders can make more informed decisions about how to design the business' overall data architecture. 

Section 4

Conclusion

There are areas where a NoSQL database shouldn’t be used, and another type of database should be considered. For example, an engineering team uses custom queries to discover and explore relationships within their data, which provides the metrics needed for business intelligence efforts. Put simply, probing such data and performing statistical analyses that do complex database queries will execute poorly in a NoSQL database. NoSQL has been designed specifically to address database scalability at the expense of the memory used, the database’s ability to support custom queries, and data integrity — just as the SQL database was designed to minimize memory use through the process of normalization. As SQL database technology evolved, so did its capabilities to handle efficient custom queries and support data integrity. 

The basic principles surrounding NoSQL databases aren’t that different from SQL. The key difference is in normalizing your data versus denormalizing your data. By walking through an application design step by step, you saw how data structures can be organized and then put into a denormalized structure. Finally, I highly encourage you to visit https://couchbase.live/ and interact with the JSON data to truly experience a NoSQL database (see Additional Resources).   

The lessons learned from the evolution of SQL are being applied to NoSQL in an effort to improve and strengthen its weaker areas, as well as to continue innovating simpler ways for organizations to migrate from SQL to NoSQL. 

Additional Resources 

 
CREATE COLLECTION tutorial._default.bands; 
CREATE COLLECTION tutorial._default.songs; 
CREATE COLLECTION tutorial._default.band_details; 
 
INSERT INTO tutorial._default.bands (KEY, VALUE) 
VALUES 
("Arctic Monkeys", { 
"Home Screen": "Yes", 
"Song Name": "When the Sun Goes Down", 
"Year": 2018, 
"Peak Chart Pos": 1 
}), 
("Coldplay", { 
"Home Screen": "Yes", 
"Song Name": "Yellow", 
"Year": 2000, 
"Peak Chart Pos": 2 
}); 
 
INSERT INTO tutorial._default.songs (KEY, VALUE) 
VALUES 
("Arctic Monkeys - When the Sun Goes Down", { 
"Album": "Whatever People Say I Am, That's What I'm Not", 
"Peak Chart Pos": 1, 
"Year": 2018, 
"Lyrics": "So who's that girl there? I wonder what went wrong So that she had to roam the streets" 
}), 
("Coldplay - Yellow", { 
"Album": "Parachutes", 
"Peak Chart Pos": 2, 
"Year": 2000, 
"Lyrics": "Look at the stars Look how they shine for you  And everything you do" 
}); 
 
INSERT INTO tutorial._default.band_details (KEY, VALUE) 
VALUES 
("Arctic Monkeys", { 
    "Band Members": [ 
      "Alex Turner", 
      "Jamie Cook", 
      "Nick O'Malley", 
      "Matt Helders" 
    ], 
    "Origin": "Sheffield, England", 
    "Year": 2002, 
    "Genres": [ 
      "indie rock", 
      "garage rock", 
      "post-punk revival", 
      "psychedelic rock", 
      "alternative rock" 
    ] 
  }), 
("Coldplay", { 
    "Band Members": [ 
      "Chris Martin", 
      "Jonny Buckland", 
      "Guy Berryman", 
      "Will Champion", 
      "Phil Harvey" 
    ], 
    "Origin": "Sheffield, England", 
    "Year": 2002, 
    "Genres": [
      "Alternative rock", 
      "pop rock", 
      "post-Britpop", 
      "pop" 
    ] 
}); 

 

Note: This is the same data in the composite table (Table 7) expressed as a JSON structure. To query this data, you will need minimum indexes: 

  • CREATE PRIMARY INDEX ON tutorial._default.bands;  
  • CREATE PRIMARY INDEX ON tutorial._default.songs; 
  •  CREATE PRIMARY INDEX ON tutorial._default.band_details; 

(More detailed indexes will be necessary later to improve performance).  

And to query, you can try the following: 

  • For Home Screen data: SELECT META(b).id AS BandName, b.* FROM tutorial._default.bands b WHERE b.`Home Screen` = 'Yes'; 
  • For Song Details data: SELECT META(b).id AS SongName, b.* FROM tutorial._default.songs b WHERE META(b).id = 'Coldplay - Yellow' 
  • For Band Details data: SELECT META(b).id AS BandName, b.* FROM tutorial._default.band_details b WHERE META(b).id = 'Coldplay'; 

As an alternative to embedded and/or duplicating data, consider a SQL++ query using a join in NoSQL: 

  • SELECT META(b).id AS BandName, d.Origin FROM tutorial._default.bands b INNER JOIN tutorial._default.band_details d ON META(b).id = META(d).id 

{{ parent.title || parent.header.title}}

{{ parent.tldr }}

{{ parent.urlSource.name }}