Data Replication Automation
At a glance, it appears to be a simple task of exporting the records from the source database and inserting them into the target database. However, there are some issues.
Join the DZone community and get the full member experience.
Join For FreeThis article discusses a technology of data replication automation where specific records get copied from a source database into a target database based on Oracle/MySQL DB.
The Use Case
Consider the following use case. One of the services a QA group provides is replicating data environments — for example, copying specific records (which are uncovering some unseen business logic defects) from a staging environment into a testing environment to reproduce and debug an issue. Replication automation saves the effort of manually recreating data in the testing environment, which is usually complex, time-consuming, and error-prone. Additionally, the results can be unfaithful to the situation being examined in the source environment.
The Issues
At a glance, it appears to be a simple task of exporting the records from the source database and inserting them into the target database. This approach, however, presents some issues to address:
Referential integrity is broken. A record may reference another record, called the parent, through some foreign keys. To copy a record into a database, its parent must be copied first.
Business relations get lost. A record can reference other records (its parents) and at the same time can be referenced by yet some other records called the children. These references constitute the business structure and relations. Therefore, from a business perspective, to copy a record, its parents and children must also be copied.
To facilitate the discussion that follows, let's define:
The family of a record to be a union of the record's parents and children.
The data scenario of seed records to be a union of the below records:
Some records known to the user, called the seeds.
The family of each of the record in #1.
The family of each of the record in #2, and so forth.
With these definitions, the use case of data replication now reduces to: Given the seed records, export the data scenario from a source database and copy them into a target database in an order that complies with referential integrity.
The Solution
The key points in replication automation include:
How to find the parent/child relations. This can be resolved by the
information_schema.key_column_usage
table available in the Oracle/MySQL DBMS. In particular, the columnstable_name
,column_name
,referenced_table_name
, andreferenced_column_name
implement the references.How to find the full definition of a record. This can be resolved by the
information_schema.columns
table, again available in the DBMS. In particular, the columnstable_name
,column_name
, anddata_type
provide an adequate profile of a record.How to build an order of inserting records that complies with referential integrity. This can be resolved by a deque. A deque is a double-ended queue in which both the insertion and removal of elements can be performed at both end points. In our case, we impose the discipline of:
Inserting parents at the front-end.
Inserting children at the back-end.
Retrieving/removing records at the front-end.
The following code snippet outlines the solution to the replication automation, with implementation notes further below.
public void repliateDataScenario() {
Set<Repord> doneRecords = new HashSet<Record>();
// load replication queue with seed records.
// reqQueue is an instance of a Deque class, RecQueue<Record>;
// the Record class houses a table name and a list of primary/foreign keys of the table;
// the seed records are represented by SELECT...FROM...WHERE... statements, which
// RepQueue.init() uses to retrieve the records and loads in the queue.
repQueue.init();
for (; !repQueue.isEmpty(); ) {
Record nextRecord = repQueue.peek();
// if already exported, discard front record and move on to the next
if (doneRecords.contains(nextRecord)) {
repQueue.pollFirst();
continue;
}
// nextRecord is not exported; get together the family if hasn't
if (!nextRecord.hasGotTogetherFamily()) {
// the references [table_name, column_name, referenced_table_name, referenced_column_name]
// are implemented by java Map; Record.queryParents() uses the maps to retrieve
// the record's parents, and insert them in front of the record in the queue.
repQueue.insertFrontEnd(nextRecord.queryParents());
// Record.queryChildren() uses the maps to retrieve the record's children, and
// insert them at back of the record in the queue.
repQueue.insertBackEnd(nextRecord.queryChildren());
// mark done of querying family
nextRecord.hasGotTogetherFamily(true);
continue;
}
// otherwise export the record, register it in doneRecords set,
// remove it from queue, and move on to the next
nextRecord.export();
doneRecords.add(nextRecord);
repQueue.pollFirst();
}
}
Implementation Notes
- At Line 35,
Record.export()
retrieves the full record using theinformation_schema.columns
meta information, then exports the record as anINSERT INTO...VALUES(...)
statement, and writes it to a script file; which upon completion can be executed in the target database to actually replicate the data scenario. Lines 24 and 27 implement an
insert_my_parents
>insert_myself
>insert_my_children
linear order for each of the records. The referential integrity is thus honored during the insertion to the target database.
An Example
Consider below a simple data scenario of five records: 2 has one child (1), 3 has two children (1 and 5), 1 has two parents (2 and 3) and two children (4 and 5), 4 has one parent (1), and 5 has two parents (1 and 3).
Supposing 1 is the chosen seed record, the following list shows the states of doneRecords
, repQueue
, and the output through the loop of the above code snippet. In the list, 1
denotes a record 1 in repQueue
that has not queried its family; and 1*
denotes a record 1 that has queried its family.
doneRecords {}; repQueue <1>; output <>
doneRecords {}, repQueue <2,3,1*,4,5>; output <>
doneRecords {}, repQueue <2*,3,1*,4,5,1>; output <>
doneRecords {2}, repQueue <3,1*,4,5,1>; output <2>
doneRecords {2}, repQueue <3*,1*,4,5,1,1,5>; output <2>
doneRecords {2,3}, repQueue <1*,4,5,1,1,5>; output <2,3>
doneRecords {2,3,1}, repQueue <4,5,1,1,5>; output <2,3,1>
doneRecords {2,3,1}, repQueue <1,4*,5,1,1,5>; output <2,3,1>
doneRecords {2,3,1}, repQueue <4*,5,1,1,5>; output <2,3,1>
doneRecords {2,3,1,4}; repQueue <5,1,1,5>; output <2,3,1,4>
doneRecords {2,3,1,4}; repQueue <1,3,5*,1,1,5>; output <2,3,1,4>
doneRecords {2,3,1,4}; repQueue <3,5*,1,1,5>; output <2,3,1,4>
doneRecords {2,3,1,4}; repQueue <5*,1,1,5>; output <2,3,1,4>
doneRecords {2,3,1,4,5}; repQueue <1,1,5>; output <2,3,1,4,5>
doneRecords {2,3,1,4,5}; repQueue <1,5>; output <2,3,1,4,5>
doneRecords {2,3,1,4,5}; repQueue <5>; output <2,3,1,4,5>
doneRecords {2,3,1,4,5}; repQueue <>; output <2,3,1,4,5>
It can be seen that for any record R in the data scenario, the output achieves the linear order of insertion:
R's parents > R > R's children.
It can be shown that no matter how the seed records are chosen, the above property is achieved.
Opinions expressed by DZone contributors are their own.
Comments