A New Way to Compare and Reorganize Data
A New Way to Compare and Reorganize Data
Learn how esProc compares data touches the essential meaning of data differences, which widely exist in each system and between systems
Join the DZone community and get the full member experience.Join For Free
Hortonworks Sandbox for HDP and HDF is your chance to get started on learning, developing, testing and trying out new features. Each download comes preconfigured with interactive tutorials, sample data and developments from the Apache community.
In real-world daily business routines, it is common that data that comes from different sources is of the same structure. Sometimes, each set of data is independent and there isn’t any overlapping, like the sales data each branch office exports from their own database. Other times, data overlaps heavily. In a common complete business process, it is most probable that all systems and sections input data based on their store of data. To compare the overlapped data and find and correct the possible corrupt data, we need the auto-reconciliation operation.
Generally, such a reconciliation system will traverse records of a certain set of data by loop and match them with the records one-by-one in another set of data. The coding logic is highly repeatable. But as reconciliation processes use different keywords and because of the change of the query goal, we often need to rewrite a script, leading to cost-inefficient development and maintenance.
esProc makes reconciliation processes intuitive and convenient by offering truly set-oriented operations. To match and reorganize data, two functions —
merge() — will suffice.
To illustrate this, we use a simple case of merging two datasets of sales data.
The following two files — old.csv and new.csv — contain the predictive sales data and the actual sales data, respectively. Both consist of the username, selling date, sales value, and sales count. The business goal is to find and analyze the newly added data, the deleted data, and the updated data by using
date as the keywords, which are also called logical primary keys.
The second and third lines in new.csv are the newly added records, which are probably the newly-closed deals. The fourth line of records has been modified, which possibly means a change of the sale price. The third line in old.csv is the deleted record, which may be a cancelled order.
Here, we omit the code written in the conventional logic for matching and take a look at how
esProc deals with this:
Yes, that’s it. No loop traversal. No optimization of query algorithm. And if the matching keywords are changed, we just need to change a few lines. Now, let’s go over the code:
A1,B1: Import the files respectively, using the comma as the separator to generate two raw datasets.
esProcsupports importing data from files of other formats or from the data table in a database.
A2,B2: Use the
sort()function to sort each data set by the keywords, or the logical primary keys, to generate a new data set. The new sets will be of use in the subsequent steps in the
A3: Find the newly added records, which are those in which both the keywords
dateare in set B2 but not in set A2. This is the difference operation of the set operations, which is specified by the function option
@d. We also have
@ufor union operations and
@ifor intersection operation. We get a new set named “new”.
A4: Find the deleted records, which are those in which both the keywords
dateare in set A2 but not in set B2. Note that the order of A2 and B2 in the bracket before the
merge()function is different. We get a new set named “delete”.
A5: Like A3, this also performs a difference operation between B2 and A2, with all fields used as the keywords. The result is all the changed records, including updated ones and newly added ones. We get a new set “diff”.
A6: Remove the newly added records, which are the “new” set, from the “diff” set to get the updated records. Below is the “update” set.
With the introduction of set operations, the newly added, deleted, and updated acquire an intuitive meaning. That is the differences between the old and the new sets. These differences can be conveniently represented through corresponding set operations.
The final result can be viewed from the IDE where the computation is performed or output to a file using the file handling function or returned to a Java program or a reporting tool via JDBC. In the above script, B6 shows a simple method of returning the reconciliation result as result set to another system module. Below is the example of the result set used in a Java program.
// establish a connection via esProc JDBC Class.forName("com.esproc.jdbc.InternalDriver"); con= DriverManager.getConnection("jdbc:esproc:local://"); // call esProc script, whose name is test and that can accept parameters st =(com.esproc.jdbc.InternalCStatement)con.prepareCall("call test()"); st.execute();//execute esProc stored procedure ResultSet set = st.getResultSet();//get the result
esProc compares data touches the essential meaning of “data differences,” which are actually the differences of sets. In fact, data differences widely exist in each system and between systems, ranging from the huge account data in bank systems and network operator systems to the duplication detection and edition comparison in personal file systems. But after generating the to-be-compared datasets and defining the keywords for matching, we can reorganize data by taking good advantage of the set operations.
Opinions expressed by DZone contributors are their own.