Over a million developers have joined DZone.
{{announcement.body}}
{{announcement.title}}

A New Way to Compare and Reorganize Data

DZone's Guide to

A New Way to Compare and Reorganize Data

Learn how esProc compares data touches the essential meaning of data differences, which widely exist in each system and between systems

· Big Data Zone ·
Free Resource

Hortonworks Sandbox for HDP and HDF is your chance to get started on learning, developing, testing and trying out new features. Each download comes preconfigured with interactive tutorials, sample data and developments from the Apache community.

In real-world daily business routines, it is common that data that comes from different sources is of the same structure. Sometimes, each set of data is independent and there isn’t any overlapping, like the sales data each branch office exports from their own database. Other times, data overlaps heavily. In a common complete business process, it is most probable that all systems and sections input data based on their store of data. To compare the overlapped data and find and correct the possible corrupt data, we need the auto-reconciliation operation.

Generally, such a reconciliation system will traverse records of a certain set of data by loop and match them with the records one-by-one in another set of data. The coding logic is highly repeatable. But as reconciliation processes use different keywords and because of the change of the query goal, we often need to rewrite a script, leading to cost-inefficient development and maintenance.

esProc makes reconciliation processes intuitive and convenient by offering truly set-oriented operations. To match and reorganize data, two functions — sort() and merge() — will suffice.

To illustrate this, we use a simple case of merging two datasets of sales data.

The following two files — old.csv and new.csv — contain the predictive sales data and the actual sales data, respectively. Both consist of the username, selling date, sales value, and sales count. The business goal is to find and analyze the newly added data, the deleted data, and the updated data by using userName and date as the keywords, which are also called logical primary keys.

The second and third lines in new.csv are the newly added records, which are probably the newly-closed deals. The fourth line of records has been modified, which possibly means a change of the sale price. The third line in old.csv is the deleted record, which may be a cancelled order.

old.csv

new.csv

1

2

3

4

5

6

7

8

9

userName,date,saleValue,saleCount

Rachel,2015-03-01,4500,9

Rachel,2015-03-03,8700,4

Tom,2015-03-02,3000,8

Tom,2015-03-03,5000,7

Tom,2015-03-04,6000,12

John,2015-03-02,4000,3

John,2015-03-02,4300,9

John,2015-03-04,4800,4

userName,date,saleValue,saleCount

Rachel,2015-03-01,4500,9

Rachel,2015-03-02,5000,5

Ashley,2015-03-01,6000,5

Rachel,2015-03-03,11700,4

Tom,2015-03-03,5000,7

Tom,2015-03-04,6000,12

John,2015-03-02,4000,3

John,2015-03-02,4300,9

John,2015-03-04,4800,4

Here, we omit the code written in the conventional logic for matching and take a look at how esProc deals with this:

A B
1

("d:\\old.csv").import@t(;",")

("d:\\new.csv").import@t(;",")

2

sort(userName,date)

.sort(userName,date)

3

=[B2,A2].merge@d(userName,date)

4

=[A2,B2].merge@d(userName,date)

5

=[B2,A2].merge@d(userName,date,saleValue,saleCount)

6

=[diff,new].merge@d(userName,date)

result update

Yes, that’s it. No loop traversal. No optimization of query algorithm. And if the matching keywords are changed, we just need to change a few lines. Now, let’s go over the code:

  • A1,B1: Import the files respectively, using the comma as the separator to generate two raw datasets. esProc supports importing data from files of other formats or from the data table in a database.

  • A2,B2: Use the sort() function to sort each data set by the keywords, or the logical primary keys, to generate a new data set. The new sets will be of use in the subsequent steps in the merge() function.

  • A3: Find the newly added records, which are those in which both the keywords userName and date are in set B2 but not in set A2. This is the difference operation of the set operations, which is specified by the function option @d. We also have @u for union operations and @i for intersection operation. We get a new set named “new”.

  • A4: Find the deleted records, which are those in which both the keywordsuserName and date  are in set A2 but not in set B2. Note that the order of A2 and B2 in the bracket before the merge() function is different. We get a new set named “delete”.

  • A5: Like A3, this also performs a difference operation between B2 and A2, with all fields used as the keywords. The result is all the changed records, including updated ones and newly added ones. We get a new set “diff”.

  • A6: Remove the newly added records, which are the “new” set, from the “diff” set to get the updated records. Below is the “update” set.

With the introduction of set operations, the newly added, deleted, and updated acquire an intuitive meaning. That is the differences between the old and the new sets. These differences can be conveniently represented through corresponding set operations.

The final result can be viewed from the IDE where the computation is performed or output to a file using the file handling function or returned to a Java program or a reporting tool via JDBC. In the above script, B6 shows a simple method of returning the reconciliation result as result set to another system module. Below is the example of the result set used in a Java program.

// establish a connection via esProc JDBC
Class.forName("com.esproc.jdbc.InternalDriver");
con= DriverManager.getConnection("jdbc:esproc:local://");
// call esProc script, whose name is test and that can accept parameters
st =(com.esproc.jdbc.InternalCStatement)con.prepareCall("call test()");
st.execute();//execute esProc stored procedure
ResultSet set = st.getResultSet();//get the result

The way esProc compares data touches the essential meaning of “data differences,” which are actually the differences of sets. In fact, data differences widely exist in each system and between systems, ranging from the huge account data in bank systems and network operator systems to the duplication detection and edition comparison in personal file systems. But after generating the to-be-compared datasets and defining the keywords for matching, we can reorganize data by taking good advantage of the set operations.

Hortonworks Community Connection (HCC) is an online collaboration destination for developers, DevOps, customers and partners to get answers to questions, collaborate on technical articles and share code examples from GitHub.  Join the discussion.

Topics:
big data ,data analytics ,data computation ,esproc

Opinions expressed by DZone contributors are their own.

{{ parent.title || parent.header.title}}

{{ parent.tldr }}

{{ parent.urlSource.name }}