Over a million developers have joined DZone.
{{announcement.body}}
{{announcement.title}}

Data Replication Automation

DZone's Guide to

Data Replication Automation

At a glance, it appears to be a simple task of exporting the records from the source database and inserting them into the target database. However, there are some issues.

· Database Zone
Free Resource

Navigating today's database scaling options can be a nightmare. Explore the compromises involved in both traditional and new architectures.

This article discusses a technology of data replication automation where specific records get copied from a source database into a target database based on Oracle/MySQL DB.

The Use Case

Consider the following use case. One of the services a QA group provides is replicating data environments — for example, copying specific records (which are uncovering some unseen business logic defects) from a staging environment into a testing environment to reproduce and debug an issue. Replication automation saves the effort of manually recreating data in the testing environment, which is usually complex, time-consuming, and error-prone. Additionally, the results can be unfaithful to the situation being examined in the source environment.

The Issues

At a glance, it appears to be a simple task of exporting the records from the source database and inserting them into the target database. This approach, however, presents some issues to address:

  • Referential integrity is broken. A record may reference another record, called the parent, through some foreign keys. To copy a record into a database, its parent must be copied first.

  • Business relations get lost. A record can reference other records (its parents) and at the same time can be referenced by yet some other records called the children. These references constitute the business structure and relations. Therefore, from a business perspective, to copy a record, its parents and children must also be copied.

To facilitate the discussion that follows, let's define:

  • The family of a record to be a union of the record's parents and children.

  • The data scenario of seed records to be a union of the below records: 

    1. Some records known to the user, called the seeds.

    2. The family of each of the record in #1.

    3. The family of each of the record in #2, and so forth.

With these definitions, the use case of data replication now reduces to: Given the seed records, export the data scenario from a source database and copy them into a target database in an order that complies with referential integrity.

The Solution

The key points in replication automation include:

  1. How to find the parent/child relations. This can be resolved by the information_schema.key_column_usage table available in the Oracle/MySQL DBMS. In particular, the columns table_name, column_name, referenced_table_name, and referenced_column_name implement the references.

  2. How to find the full definition of a record. This can be resolved by the information_schema.columns table, again available in the DBMS. In particular, the columns table_namecolumn_name, and data_type provide an adequate profile of a record.

  3. How to build an order of inserting records that complies with referential integrity. This can be resolved by a deque. A deque is a double-ended queue in which both the insertion and removal of elements can be performed at both end points. In our case, we impose the discipline of:

    • Inserting parents at the front-end.

    • Inserting children at the back-end.

    • Retrieving/removing records at the front-end.

The following code snippet outlines the solution to the replication automation, with implementation notes further below.

public void repliateDataScenario() {
  Set<Repord> doneRecords = new HashSet<Record>();

  // load replication queue with seed records.
  // reqQueue is an instance of a Deque class, RecQueue<Record>;
  // the Record class houses a table name and a list of primary/foreign keys of the table;
  // the seed records are represented by SELECT...FROM...WHERE... statements, which 
  // RepQueue.init() uses to retrieve the records and loads in the queue.
  repQueue.init();

  for (; !repQueue.isEmpty(); ) {
    Record nextRecord = repQueue.peek();
    // if already exported, discard front record and move on to the next
    if (doneRecords.contains(nextRecord)) {
      repQueue.pollFirst();
      continue;
    }

    // nextRecord is not exported; get together the family if hasn't
    if (!nextRecord.hasGotTogetherFamily()) {
      // the references [table_name, column_name, referenced_table_name, referenced_column_name]
      // are implemented by java Map; Record.queryParents() uses the maps to retrieve 
      // the record's parents, and insert them in front of the record in the queue.
      repQueue.insertFrontEnd(nextRecord.queryParents());
      // Record.queryChildren() uses the maps to retrieve the record's children, and
      // insert them at back of the record in the queue.
      repQueue.insertBackEnd(nextRecord.queryChildren());
      // mark done of querying family
      nextRecord.hasGotTogetherFamily(true);
      continue;
    }

    // otherwise export the record, register it in doneRecords set,
    // remove it from queue, and move on to the next
    nextRecord.export();
    doneRecords.add(nextRecord);
    repQueue.pollFirst();
  }
}

Implementation Notes

  • At Line 35, Record.export() retrieves the full record using the information_schema.columns meta information, then exports the record as an INSERT INTO...VALUES(...)  statement, and writes it to a script file; which upon completion can be executed in the target database to actually replicate the data scenario.
  • Lines 24 and 27 implement an insert_my_parents > insert_myself > insert_my_children linear order for each of the records. The referential integrity is thus honored during the insertion to the target database.

An Example

Consider below a simple data scenario of five records: 2 has one child (1), 3 has two children (1 and 5), 1 has two parents (2 and 3) and two children (4 and 5), 4 has one parent (1), and 5 has two parents (1 and 3).

Image title

Supposing 1 is the chosen seed record, the following list shows the states of doneRecordsrepQueue, and the output through the loop of the above code snippet. In the list, 1 denotes a record 1 in repQueue that has not queried its family; and 1* denotes a record 1 that has queried its family.

doneRecords {}; repQueue <1>; output <>
doneRecords {}, repQueue <2,3,1*,4,5>; output <>
doneRecords {}, repQueue <2*,3,1*,4,5,1>; output <>
doneRecords {2}, repQueue <3,1*,4,5,1>; output <2>
doneRecords {2}, repQueue <3*,1*,4,5,1,1,5>; output <2>
doneRecords {2,3}, repQueue <1*,4,5,1,1,5>; output <2,3>
doneRecords {2,3,1}, repQueue <4,5,1,1,5>; output <2,3,1>
doneRecords {2,3,1}, repQueue <1,4*,5,1,1,5>; output <2,3,1>
doneRecords {2,3,1}, repQueue <4*,5,1,1,5>; output <2,3,1>
doneRecords {2,3,1,4}; repQueue <5,1,1,5>; output <2,3,1,4>
doneRecords {2,3,1,4}; repQueue <1,3,5*,1,1,5>; output <2,3,1,4>
doneRecords {2,3,1,4}; repQueue <3,5*,1,1,5>; output <2,3,1,4>
doneRecords {2,3,1,4}; repQueue <5*,1,1,5>; output <2,3,1,4>
doneRecords {2,3,1,4,5}; repQueue <1,1,5>; output <2,3,1,4,5>
doneRecords {2,3,1,4,5}; repQueue <1,5>; output <2,3,1,4,5>
doneRecords {2,3,1,4,5}; repQueue <5>; output <2,3,1,4,5>
doneRecords {2,3,1,4,5}; repQueue <>; output <2,3,1,4,5>

It can be seen that for any record R in the data scenario, the output achieves the linear order of insertion:

  • R's parents > R > R's children.

It can be shown that no matter how the seed records are chosen, the above property is achieved.

Planning for disaster doesn't have to actually be a disaster. Understand your options for deploying a database across multiple data centers - without the headache.

Topics:
database cloning ,oracle ,mysql ,database ,tutorial ,data replication ,automation

Opinions expressed by DZone contributors are their own.

{{ parent.title || parent.header.title}}

{{ parent.tldr }}

{{ parent.urlSource.name }}