Merge Databases With Different Schema and Duplicate Entries
Removing duplicate entries from merged database tables can be anything but routine — and the source of performance woes.
Join the DZone community and get the full member experience.
Join For FreeCombining tables frequently results in duplicate entries that can be removed in several ways. The trick is knowing which way is best for a given situation. Often the only way to determine the best approach is by testing several and comparing their effect on database performance.
It is one of the most common operations in database management: Merge two tables that use different schema while also removing duplicate entries. Yet there are as many approaches to this problem as there are types of database tables. There are also as many potential glitches.
Here's a look at three ways to address the situation in SQL and MySQL.
All the News That's Fit to Merge
Combining multiple tables with similar values often creates duplicate entries. Several methods are available for eliminating duplicate values in SQL, but it can be tricky to determine which is best in a given situation.
In a StackOverflow post from October 2012, a number of approaches were proposed for removing duplicates from joined tables. The first was to convert an inner query to a common table expression (CTE):
A common table expression for an inner join often has a lower impact on performance than using the DISTINCT
keyword to eliminate duplicates. Source: StackOverflow.
The second approach was to use the DISTINCT
keyword, which one poster claims performs better in some cases. Also suggested was the use of the string_agg
function and the group by clause.
Getting Up Close and Personal With the UNION Clause
One of the basic elements in the SQL toolbox is the UNION
operator, which checks for duplicates and returns only distinct rows, and also stores data from both tables without duplicates:
Insert rows from a second table when their values don't match those of the joined table or create a new table that doesn't affect either of the original tables. Source: StackOverflow
Alternatively, you can use the SELECT INTO
command to create a new table from the contents of two separate tables in a way that removes duplicates:
The SELECT INTO
command creates a new table from the content of two others and removes duplicates in the original tables. Source: StackOverflow.
Combining Multiple Gigabyte-Size Tables Without a Performance Hit
It isn't unusual for database tables to become massive. Imagine merging a dozen tables with a total of nearly 10 million separate records and more than 3GB. The first suggestion on StackOverflow was to create a new table with a unique constraint on the set of columns that establish a row's uniqueness, then to use INSERT IGNORE INTO ... SELECT FROM
to move rows from the old table to the new one, and finally to truncate the old tables and use INSERT INTO ... SELECT FROM
to return the rows to the original table.
Another proposed solution was to create a specific view that combines the results of the 12 tables and then to filter the results by querying on the view you just created.
Published at DZone with permission of Darren Perucci, DZone MVB. See the original article here.
Opinions expressed by DZone contributors are their own.
Comments