Migrate Your Data Model From SQL to NoSQL
A deep dive on migrating from SQL to NoSQL databases: running a discovery assessment, choosing a migration strategy, and identifying data dependencies.
This article was authored by AWS Sr. Developer Advocate Mohammed Fazalullah Qudrath and AWS Solutions Architect Almas Shaikh, and published with permission.
All organizations need to modernize their IT infrastructure at some point. Regardless of its size, after operating for a few years, it’s natural for your organization’s infrastructure to include a blend of older and newer technologies, which could present an opportunity for growth.
The SQL database is one of the most pervasive business tools and has been the de facto database standard for many years. However, as the volume of data being generated continues to increase exponentially, and more specialized, purpose-built NoSQL databases emerge, organizations have been looking for ways to migrate data away from SQL.
Difference Between SQL and NoSQL Databases
SQL databases are relational, meaning they use a structured query language (SQL) for defining and manipulating the data. They are based on a table structure, with data organized into rows and columns, and they use schemas to define the structure of each table.
On the other hand, NoSQL databases are non-relational and can store and retrieve data in various ways, such as document, key-value, wide-column, or graph formats. They are schema-less, meaning they can handle unstructured and semi-structured data with flexibility and scalability.
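To make the contrast concrete, here is a small sketch of the same hypothetical order data in both shapes: normalized rows that a SQL database would join at query time, and a single denormalized document that a NoSQL document store would return in one read.

```python
# Relational (SQL): normalized across tables, joined at query time.
customers = [{"customer_id": 1, "name": "Alice"}]
orders = [{"order_id": 100, "customer_id": 1, "total": "42.50"}]
order_items = [
    {"order_id": 100, "sku": "A-1", "qty": 2},
    {"order_id": 100, "sku": "B-7", "qty": 1},
]

# Document (NoSQL): denormalized into one item, fetched in a single read.
order_document = {
    "order_id": 100,
    "customer": {"customer_id": 1, "name": "Alice"},
    "items": [
        {"sku": "A-1", "qty": 2},
        {"sku": "B-7", "qty": 1},
    ],
    "total": "42.50",
}
```

The trade-off is visible even at this tiny scale: the document duplicates customer data, but a single fetch replaces a three-way join.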
One of the main advantages of NoSQL databases is that they scale horizontally more easily. As the volume of data in an SQL database increases, at some point users will notice that accessing and manipulating information takes longer and longer. At this point, we often have customers asking us how they should transition from SQL to NoSQL, and what best practices will let them execute the migration as quickly as possible.
What Does the Modernization Journey Look Like From SQL to NoSQL?
We have collected data on thousands of database migrations to help accelerate your database modernization journey, and hopefully make it more enjoyable. We're going to share some of our experiences to help you avoid common errors, manage failures, and, when necessary, roll back and remediate.
Modernization Starts With Discovery Assessment
All database migration projects should start with discovering exactly how the existing SQL database is being used. This is a comprehensive evaluation of your current SQL database system, including its structure, the data it contains, and the business requirements it serves. This assessment should involve a detailed analysis of your data model, including tables, relationships, indexes, views, stored procedures, and other database objects. Other relevant technical information includes server names, what is connecting to the database, what is mission-critical, and why.
During this phase, it's also important to understand the workload characteristics of your SQL database, such as transaction rates, query complexity, concurrency requirements, and data growth trends. This information will help you choose the right NoSQL database that can handle your workload requirements and data characteristics.
Other important factors include how often data will be queried, how often it will change, and how big the payload will be. The discovery and assessment phase is vital because it will determine which modernization approach to take, and the most appropriate mechanisms to use.
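As a starting point for that inventory, a script along these lines can pull tables, columns, and approximate row counts from a relational source. This is a minimal sketch assuming a PostgreSQL database reachable via psycopg2; the DSN and schema name are placeholders, and other engines expose similar views through information_schema.

```python
import psycopg2

conn = psycopg2.connect("dbname=appdb user=readonly")  # placeholder DSN

with conn, conn.cursor() as cur:
    # Tables and their column counts from the standard information_schema.
    cur.execute("""
        SELECT table_name, COUNT(*) AS column_count
        FROM information_schema.columns
        WHERE table_schema = 'public'
        GROUP BY table_name
        ORDER BY table_name
    """)
    for table, column_count in cur.fetchall():
        print(f"{table}: {column_count} columns")

    # Approximate row counts from planner statistics (cheap, not exact).
    cur.execute("""
        SELECT relname, reltuples::bigint
        FROM pg_class
        WHERE relkind = 'r' AND relnamespace = 'public'::regnamespace
    """)
    for table, approx_rows in cur.fetchall():
        print(f"{table}: ~{approx_rows} rows")
```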
The 7 R's of Migration
These are the seven R's of migration, each representing a strategy for moving from SQL to NoSQL.
- Rehosting involves moving your SQL database to a different host without making changes to its structure or code. This is often referred to as "lift-and-shift."
- Replatforming involves moving to a new platform with minimal changes to the database structure or code.
- Repurchasing involves moving to a different product, such as switching from a commercial SQL database to an open-source NoSQL database.
- Refactoring involves rearchitecting and rewriting the database to take advantage of the features and capabilities of NoSQL databases.
- Relocating involves moving the database to a different physical location, such as from on-premises to the cloud.
- Retaining involves keeping the SQL database as it is, perhaps because it's too complex or risky to migrate.
- Retiring involves decommissioning the SQL database because it's no longer needed.
When you have collected the technical details of the database, you can make an informed decision on how to proceed. Maybe the database isn’t even worth saving and should be retired. At the other end of the scale, perhaps it’s worth investing time and money to completely refactor it.
Refactoring a database means changing its design, structure, or performance characteristics without changing its external behavior or functionality. Refactoring adds the most business value and can significantly improve the performance, scalability, and agility of a database, but it is also the most difficult and time-consuming choice. Because of these challenges, only a small percentage of customers choose to refactor their database. A simple lift-and-shift to the cloud may address some of the issues, but it overlooks the potential business benefits that could be realized by adopting a serverless infrastructure with microservices.
The rest of this article focuses on how developers can rearchitect a SQL database into a NoSQL database, along with the AWS options that support each step.
Finding Dependencies in Data and Choosing a Mechanism
Mechanisms are ways to detect changes being made to the source database and then replicate those changes on the target (NoSQL) database. There are different mechanisms to choose from, but to discover the most appropriate mechanism for your project, you first need to know exactly what is happening inside the database. To avoid errors and inconsistencies, it's important to understand all of the SQL database's dependencies (such as foreign keys, tables, views, and functions) that could be affected when something in the database is changed.
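One practical way to surface those dependencies is to query the catalog for foreign-key relationships. The sketch below assumes the same hypothetical PostgreSQL source as earlier; adapt the query for your engine.

```python
import psycopg2

conn = psycopg2.connect("dbname=appdb user=readonly")  # placeholder DSN

with conn, conn.cursor() as cur:
    # Foreign keys tell you which tables must move (or be denormalized)
    # together when you carve out a slice of the data model.
    cur.execute("""
        SELECT tc.table_name   AS child_table,
               kcu.column_name AS fk_column,
               ccu.table_name  AS parent_table
        FROM information_schema.table_constraints tc
        JOIN information_schema.key_column_usage kcu
          ON tc.constraint_name = kcu.constraint_name
        JOIN information_schema.constraint_column_usage ccu
          ON tc.constraint_name = ccu.constraint_name
        WHERE tc.constraint_type = 'FOREIGN KEY'
    """)
    for child, column, parent in cur.fetchall():
        print(f"{child}.{column} -> {parent}")
```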
The migration mechanism chosen could be a continuous migration where specific sets of data are moved gradually using a change data capture approach, or a dual-write mechanism where data is written to both the SQL and NoSQL databases simultaneously.
In a continuous migration, you might start by migrating non-critical data or data that has fewer dependencies. This allows you to test the migration process and troubleshoot issues without affecting critical operations.
In a dual-write setup, you need to ensure data consistency between the SQL and NoSQL databases. This might involve using a transactional mechanism that is performed at the application level (and requires changes to the code) to ensure that writes to both databases are successful or using a reconciliation process to resolve any inconsistencies.
Change Data Capture (CDC)
SQL databases maintain transaction logs, append-only records of every data change made in the database. The logs can be used to easily replicate those changes elsewhere.
Change Data Capture (CDC) is a design pattern that captures changes in the SQL database so that the target NoSQL database can be updated accordingly. This is typically implemented using database triggers or log-mining techniques.
Depending on the tools used, you first need to define the source and target endpoints so data can securely flow between them. Then a replication instance is created, along with a mapping that describes the tables to be replicated and their target.
This task can be made easier with other tools, such as the AWS Database Migration Service (DMS), where you simply define source and target endpoints and describe what needs to be moved. DMS is commonly used for migrating between different SQL databases, but it can also migrate from SQL to NoSQL targets such as Amazon DynamoDB.
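A hedged sketch of that DMS flow with boto3 might look like the following. Every identifier, hostname, and ARN is a placeholder for your own resources, and in practice you would also create (or reference) a replication instance first.

```python
import json
import boto3

dms = boto3.client("dms")

# Source endpoint: a hypothetical MySQL database.
source = dms.create_endpoint(
    EndpointIdentifier="sql-source",
    EndpointType="source",
    EngineName="mysql",
    ServerName="db.example.internal",
    Port=3306,
    Username="migrator",
    Password="REDACTED",
    DatabaseName="appdb",
)

# Target endpoint: DynamoDB, written via a service access role.
target = dms.create_endpoint(
    EndpointIdentifier="nosql-target",
    EndpointType="target",
    EngineName="dynamodb",
    DynamoDbSettings={
        "ServiceAccessRoleArn": "arn:aws:iam::123456789012:role/dms-dynamodb"
    },
)

# Table mappings select what to replicate; here, every table in "appdb".
table_mappings = {
    "rules": [{
        "rule-type": "selection",
        "rule-id": "1",
        "rule-name": "include-appdb",
        "object-locator": {"schema-name": "appdb", "table-name": "%"},
        "rule-action": "include",
    }]
}

dms.create_replication_task(
    ReplicationTaskIdentifier="sql-to-dynamodb",
    SourceEndpointArn=source["Endpoint"]["EndpointArn"],
    TargetEndpointArn=target["Endpoint"]["EndpointArn"],
    ReplicationInstanceArn="arn:aws:dms:us-east-1:123456789012:rep:EXAMPLE",
    MigrationType="full-load-and-cdc",  # initial copy, then ongoing CDC
    TableMappings=json.dumps(table_mappings),
)
```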
You can also stream the data through Amazon Kinesis and then use AWS Lambda functions to import, validate, and transform the data as required.
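For example, a Lambda handler on the stream could look roughly like this; the table name and the order_id field are assumptions about your payload shape, not part of any prescribed schema.

```python
import base64
import json
from decimal import Decimal

import boto3

table = boto3.resource("dynamodb").Table("orders")  # placeholder table

def handler(event, context):
    for record in event["Records"]:
        raw = base64.b64decode(record["kinesis"]["data"])
        # parse_float=Decimal because DynamoDB rejects Python floats.
        payload = json.loads(raw, parse_float=Decimal)

        # Validate: skip events missing the partition key.
        if "order_id" not in payload:
            print(f"skipping malformed event: {payload}")
            continue

        # Transform: coerce the key to a string to match the table schema.
        payload["order_id"] = str(payload["order_id"])
        table.put_item(Item=payload)
```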
The easiest part of the migration is dealing with cold data, meaning rarely accessed data, which usually makes up the majority of any database. Cold data can be safely copied over to the new system.
Advantages:
- Doesn’t require code changes (unlike the dual-write method), so it is faster to implement
- Enables atomic changes, which means you can rely on the transaction logs
- Supported by many database engines
- Can move large volumes of data
Disadvantages:
- Requires additional infrastructure and security
- Difficult to sample and segment data—problems moving just one customer/product/region
- Requires extensive database technical knowledge
- Limited customization
Dual Writes
Dual writes involve writing data to both the SQL and NoSQL databases at the same time. This approach allows for easy data segmentation and requires less infrastructure since it takes place at the application level, where the code is modified to add an additional write. That means when changes occur, they are replicated on the new database, while the original database remains the source of truth until cutover.
In a dual-write setup, you need to handle scenarios where a write to one database succeeds but the write to the other database fails. This might involve using a distributed transaction mechanism or a compensation logic to roll back changes in case of failures.
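One simple pattern is to treat the SQL write as the anchor: perform the NoSQL write inside the same code path and roll the SQL transaction back if the mirror fails. The sketch below illustrates the idea with placeholder connection details and table names; note that a failure after the put_item but before the commit can still leave a gap, which is why a reconciliation process remains useful.

```python
import boto3
import psycopg2

pg = psycopg2.connect("dbname=appdb user=app")        # placeholder DSN
orders = boto3.resource("dynamodb").Table("orders")   # placeholder table

def create_order(order_id: str, total: str) -> None:
    cur = pg.cursor()
    try:
        # Primary write: SQL stays the source of truth until cutover.
        cur.execute(
            "INSERT INTO orders (order_id, total) VALUES (%s, %s)",
            (order_id, total),
        )
        # Mirror write: replicate to the NoSQL side before committing.
        orders.put_item(Item={"order_id": order_id, "total": total})
        pg.commit()  # both writes succeeded
    except Exception:
        # Compensation: undo the SQL insert if the mirror write fails.
        pg.rollback()
        raise
    finally:
        cur.close()
```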
An advantage of dual writes is that they enable segmentation, so it's possible to migrate just one customer, product, or region at a time, in order to build confidence that everything is working before progressing further. Using this method, if there is an error, it's easy to read the original database to ensure the values are correct.
When you have confidence that the new database is functioning as expected, the read priority can be changed so data is extracted from the new source. If that goes smoothly, then both the read and write priorities can be changed so that only the new database is being used.
Advantages:
- Easy to sample and segment data
- Easy to customize
- No additional infrastructure required
- Does not require database access
Disadvantages/Challenges:
- Data inconsistency—problems can occur if transactions fail or the database is rolled back
- Requires coding—so added complexity and requires failure handling
- Increased use of compute/memory/bandwidth resources because of duplication/copying
Testing and Validation
Testing and validation are crucial after migration to ensure data consistency and correctness. This can involve comparing the number of rows, values, attribute types, and other data elements in the SQL and NoSQL databases.
You can start with simple checks like comparing the count of records in the SQL and NoSQL databases. Then, you can move to more complex validations like comparing the values of specific fields, checking the data types, and validating the relationships between different data elements.
You can also use techniques like randomized sampling, where you randomly select a small percentage of records for comparison. This allows you to validate the migration without having to compare every single record, which can be prohibitively time-consuming and expensive for large databases with millions or billions of records.
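A minimal validation sketch combining both checks might look like this, again with placeholder table, key, and column names; for very large tables you would paginate the count and sample on the database side rather than fetching every row.

```python
import random
from decimal import Decimal

import boto3
import psycopg2

pg = psycopg2.connect("dbname=appdb user=readonly")  # placeholder DSN
table = boto3.resource("dynamodb").Table("orders")   # placeholder table

cur = pg.cursor()

# 1. Cheap first pass: compare record counts.
cur.execute("SELECT COUNT(*) FROM orders")
sql_count = cur.fetchone()[0]
nosql_count = table.scan(Select="COUNT")["Count"]  # paginate beyond 1 MB
print(f"SQL rows: {sql_count}, NoSQL items: {nosql_count}")

# 2. Randomized sampling: spot-check a small subset of records.
cur.execute("SELECT order_id, total FROM orders")
sample = random.sample(cur.fetchall(), k=min(100, sql_count))
mismatches = 0
for order_id, total in sample:
    item = table.get_item(Key={"order_id": str(order_id)}).get("Item")
    # Compare numerically so "42.5" and "42.50" are treated as equal.
    if item is None or Decimal(str(item["total"])) != Decimal(str(total)):
        mismatches += 1
print(f"{mismatches} mismatches in {len(sample)} sampled records")
```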
Also, consider adding a validation step and a monitoring process. This could be an AWS Lambda function that checks the types, values, and structure of the data to make sure the inputs are correct and generates a report.
In addition to data validation, you should also test the performance of the NoSQL database to ensure that it can handle your workload requirements. This might involve running stress tests, load tests, and other performance tests.
Finally, you should monitor the NoSQL database after the migration to detect any issues and ensure that it's operating correctly. This might involve setting up alerts for any anomalies or errors, monitoring the performance metrics, and regularly checking the data for consistency and correctness.
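As one concrete example of such an alert, you could put a CloudWatch alarm on DynamoDB write throttling so post-cutover anomalies surface quickly; the table name and SNS topic ARN below are placeholders.

```python
import boto3

cloudwatch = boto3.client("cloudwatch")

# Alarm if any write throttling is observed for five consecutive minutes.
cloudwatch.put_metric_alarm(
    AlarmName="orders-write-throttles",
    Namespace="AWS/DynamoDB",
    MetricName="WriteThrottleEvents",
    Dimensions=[{"Name": "TableName", "Value": "orders"}],
    Statistic="Sum",
    Period=60,
    EvaluationPeriods=5,
    Threshold=0,
    ComparisonOperator="GreaterThanThreshold",
    AlarmActions=["arn:aws:sns:us-east-1:123456789012:ops-alerts"],
)
```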
Summary
Modernization projects should always start with the customer—understanding their data structure, types of APIs, number of read/writes, and the volume of data being transferred.
The next step is to decide which mechanism is most appropriate for your use case: CDC or dual writes. There is no single correct answer; both have advantages and disadvantages, and it all depends on your use case.
Aim to migrate small batches at a time. One customer, one product, one region. And as you gain confidence in the data, expand.
Add validation and testing wherever possible. You will experience failures and errors, so make it as simple as possible to understand why they occurred.
Remember, SQL databases are popular for a good reason: they work well, and there are some features you will miss when you migrate to NoSQL, such as the enforced schema.
In conclusion, you are always going to have a mix of new and old technologies, and it's vital to take a step back, evaluate the systems you already use, and then consider how to take advantage of them in a quickly changing world. Don't run away from legacy systems; there are ways to improve them, but always plan ahead.
If you are interested, watch the on-demand session on this topic presented at AWS re:Invent 2022.