Relational DB Migration to S3 Data Lake Via AWS DMS, Part I
This article discusses the challenges faced during relational database migration to AWS using DMS, including source data, logging, and network bandwidth issues.
AWS Database Migration Service (DMS) is a cloud service that migrates relational databases, NoSQL databases, data warehouses, and other types of data stores into the AWS Cloud, or between cloud and on-premises setups, efficiently and securely. DMS supports many source and target databases, including Oracle, MS SQL Server, MySQL, PostgreSQL, Amazon Aurora, AWS RDS, Amazon Redshift, and Amazon S3.
Observations During the Data Migration
We designed and built an AWS S3 data lake and a data warehouse in Amazon Redshift, with on-premises data sources in the Oracle, MS SQL Server, MySQL, and PostgreSQL relational databases, plus MongoDB. We used AWS DMS for the initial full load and for daily incremental data transfer from these sources into AWS S3.
In this series of posts, I explain the various challenges we faced during the actual data migration from different relational databases.
1. Modified Date Not Populated Properly at the Source
AWS DMS handles both the full load and change data capture (CDC) from source databases. DMS captures changed records based on the transaction logs, but a properly maintained modified-date column on the source helps to apply deduplication logic on the target and extract the latest version of each row in S3.
If a modified date is not available for a table, or it is not updated reliably, AWS DMS transformation rules provide an option to add a new column while extracting data from the source database. Here, the AR_H_CHANGE_SEQ header column helps: it populates the new column with a unique, incrementing value from the source database, composed of a timestamp and an auto-incrementing number.
The code example below adds a new column, DMS_CHANGE_SEQ, to the target, populated with this unique incrementing number from the source. It is a 35-digit unique number: the first 16 digits are the timestamp, and the next 19 digits are the record ID number incremented by the database.
{
    "rule-type": "transformation",
    "rule-id": "2",
    "rule-name": "2",
    "rule-target": "column",
    "object-locator": {
        "schema-name": "%",
        "table-name": "%"
    },
    "rule-action": "add-column",
    "value": "DMS_CHANGE_SEQ",
    "expression": "$AR_H_CHANGE_SEQ",
    "data-type": {
        "type": "string",
        "length": 100
    }
}
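To illustrate the deduplication step this column enables, here is a minimal sketch in Python with pandas. The file name, primary-key column (order_id), and table layout are hypothetical; in practice, this logic would typically run in Athena, Glue, or Spark over the S3 files.

import pandas as pd

# Load the raw DMS output for one table (hypothetical file and columns).
# Each row carries the DMS_CHANGE_SEQ column added by the transformation rule.
df = pd.read_csv("orders_cdc.csv", dtype={"DMS_CHANGE_SEQ": str})

# Keep only the latest version of each row. Because DMS_CHANGE_SEQ is a
# fixed-width 35-digit number, lexicographic string order matches numeric order.
latest = (
    df.sort_values("DMS_CHANGE_SEQ")
      .drop_duplicates(subset=["order_id"], keep="last")
)

latest.to_parquet("orders_deduped.parquet", index=False)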
2. Enabling Supplemental Logging for Oracle as a Source
To capture ongoing changes with Oracle as a source database, AWS DMS requires at least minimal supplemental logging to be enabled on the source. Supplemental logging writes additional information and column values into the redo logs so that changes can be identified at the source.
Supplemental logging can be enabled for primary keys, unique keys, sets of columns, or all columns. Supplemental logging of all columns captures every column of the tables in the source database, which makes it possible to overwrite complete records in the target AWS S3 layer.
Supplemental logging of all columns increases the redo log size, since every column of the table is written to the logs. Redo and archive log storage and retention need to be sized accordingly to accommodate this additional information.
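As a sketch, the statements below enable database-level minimal supplemental logging and all-column logging for a single table. The connection details and the sales.orders table name are placeholders, and a DBA would normally run these statements directly in SQL*Plus or a similar client.

import oracledb  # python-oracledb driver (an assumption; any Oracle SQL client works)

# Placeholder credentials; supplemental logging changes require elevated privileges.
conn = oracledb.connect(user="admin", password="secret", dsn="dbhost:1521/ORCLPDB")
cur = conn.cursor()

# Database-level minimal supplemental logging (the DMS minimum for Oracle CDC).
cur.execute("ALTER DATABASE ADD SUPPLEMENTAL LOG DATA")

# Table-level logging of all columns, so complete records can be
# rebuilt and overwritten in the S3 target (hypothetical table name).
cur.execute("ALTER TABLE sales.orders ADD SUPPLEMENTAL LOG DATA (ALL) COLUMNS")

# Verify the database-level settings.
cur.execute("SELECT supplemental_log_data_min, supplemental_log_data_all FROM v$database")
print(cur.fetchone())
conn.close()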
3. Network Bandwidth Between Source and Target Databases
The initial full load from the on-premises sources for Oracle, MS SQL Server, etc., worked fine, and so did change data capture most of the time.
Transaction volume was moderate for most of the day throughout the month, except during the daily end-of-business-day processes, post-midnight reconciliation, and month-end activities. We observed that DMS migration tasks went out of sync or failed during these windows.
We reviewed the source, target, and replication instance metrics and logs, and observed the following:
- CDCIncomingChanges – the total number of change events at a point in time that are waiting to be applied to the target. This climbed from zero to thousands during the early-morning reconciliation activities.
- CDCLatencySource – the gap, in seconds, between the last event captured from the source endpoint and the current system timestamp of the AWS DMS instance. This climbed from zero to several thousand seconds, up to 10-12K, during the daily post-midnight reconciliation activities, and reached about 40K seconds during month-end activities.
Upon further log analysis and review of other metrics, we observed the following:
The AWS DMS metric NetworkReceiveThroughput shows the incoming traffic on the DMS replication instance, covering both customer database and DMS traffic. It helps identify network-related issues, if any, between the source database and the replication instance.
We observed that the network receive throughput peaked at about 30 MB/s (roughly 240 Mb/s), constrained by the VPN connection between the source and AWS, which was also shared with other applications.
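As a sketch of how such metrics can be pulled for review, the boto3 snippet below reads NetworkReceiveThroughput for a replication instance from CloudWatch. The region and instance identifier are placeholders.

import boto3
from datetime import datetime, timedelta, timezone

# Placeholder region and replication instance identifier.
cloudwatch = boto3.client("cloudwatch", region_name="us-east-1")

response = cloudwatch.get_metric_statistics(
    Namespace="AWS/DMS",
    MetricName="NetworkReceiveThroughput",
    Dimensions=[
        {"Name": "ReplicationInstanceIdentifier", "Value": "my-dms-instance"}
    ],
    StartTime=datetime.now(timezone.utc) - timedelta(hours=6),
    EndTime=datetime.now(timezone.utc),
    Period=300,  # 5-minute datapoints
    Statistics=["Average", "Maximum"],
)

for point in sorted(response["Datapoints"], key=lambda p: p["Timestamp"]):
    # NetworkReceiveThroughput is reported in bytes/second.
    print(point["Timestamp"], point["Average"], point["Maximum"])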
The takeaway from this issue is that network connectivity between the source and target databases is critical for successful data migration. You should ensure sufficient bandwidth between on-premises (or other cloud) source databases and the AWS environment before the actual data migration. For example, a 1 TB full load over a 250 Mb/s link takes roughly 9 hours at line rate, before any CDC traffic is considered.
- A VPN tunnel such as AWS Site-to-Site VPN or Oracle Cloud Infrastructure (OCI) Site-to-Site VPN can provide throughput of up to 1.25 Gbps. This is sufficient for migrating small tables or tables with light DML traffic.
- For large data migrations with heavy transactions per second on the tables, you should consider AWS Direct Connect. It provides a dedicated private connection with bandwidth options such as 1 Gbps and 10 Gbps.
Conclusion
This is Part I of a multi-part series on relational database migration challenges with AWS DMS and the solutions we implemented. Most of the challenges described in this series can occur during any database migration process, and these solutions can serve as a reference.