DZone
Thanks for visiting DZone today,
Edit Profile
  • Manage Email Subscriptions
  • How to Post to DZone
  • Article Submission Guidelines
Sign Out View Profile
  • Post an Article
  • Manage My Drafts
Over 2 million developers have joined DZone.
Log In / Join
Refcards Trend Reports
Events Video Library
Refcards
Trend Reports

Events

View Events Video Library

Zones

Culture and Methodologies Agile Career Development Methodologies Team Management
Data Engineering AI/ML Big Data Data Databases IoT
Software Design and Architecture Cloud Architecture Containers Integration Microservices Performance Security
Coding Frameworks Java JavaScript Languages Tools
Testing, Deployment, and Maintenance Deployment DevOps and CI/CD Maintenance Monitoring and Observability Testing, Tools, and Frameworks
Culture and Methodologies
Agile Career Development Methodologies Team Management
Data Engineering
AI/ML Big Data Data Databases IoT
Software Design and Architecture
Cloud Architecture Containers Integration Microservices Performance Security
Coding
Frameworks Java JavaScript Languages Tools
Testing, Deployment, and Maintenance
Deployment DevOps and CI/CD Maintenance Monitoring and Observability Testing, Tools, and Frameworks

Last call! Secure your stack and shape the future! Help dev teams across the globe navigate their software supply chain security challenges.

Modernize your data layer. Learn how to design cloud-native database architectures to meet the evolving demands of AI and GenAI workloads.

Releasing software shouldn't be stressful or risky. Learn how to leverage progressive delivery techniques to ensure safer deployments.

Avoid machine learning mistakes and boost model performance! Discover key ML patterns, anti-patterns, data strategies, and more.

Related

  • Advanced Maintenance of a Multi-Database Citus Cluster With Flyway
  • Migrating from Sakila-MySQL to Couchbase - Part 3: Stored Procedures
  • Running Streaming ETL Pipelines with Apache Flink on Zeppelin Notebooks
  • How to Restore a Transaction Log Backup in SQL Server

Trending

  • Enhancing Security With ZTNA in Hybrid and Multi-Cloud Deployments
  • Understanding and Mitigating IP Spoofing Attacks
  • It’s Not About Control — It’s About Collaboration Between Architecture and Security
  • Using Python Libraries in Java
  1. DZone
  2. Data Engineering
  3. Databases
  4. ETL vs Pure SQL

ETL vs Pure SQL

Which one will come out on top? You should definitely check out the results so you can decide which one to use for your next project.

By 
Mateusz Komendołowicz user avatar
Mateusz Komendołowicz
·
Dec. 30, 16 · Opinion
Likes (5)
Comment
Save
Tweet
Share
6.9K Views

Join the DZone community and get the full member experience.

Join For Free

In this post I would like to compare the efficiency of upsert operation in ETL tools and SQL. This is one of the most commonly used operations when building ETLs and integrating systems. Developers should always ask themselves whether not to use custom written MERGE instead of creating jobs in selected ELT tool.

As an ETL example, I will use Talend Open Studio 6.1. An SQL example will be MERGE operation in SQL Server. We will start with building two tables, which will be used for upsert. Please note, that I will populate sourceTable with 1 000 000 rows and destinationTable with 500 000 rows using modified script from http://blogbi.pl/generating-milions-of-rows-in-sql-server/ 

CREATE TABLE sourceTable (     
  id int PRIMARY KEY,     
  number int );
CREATE TABLE destinationTable (     
  id int PRIMARY KEY,     
  number int ); 
-- mentioned script here 
  SELECT COUNT(*) as [#sourceCount] FROM dbo.sourceTable; 
SELECT COUNT(*) as [#destinationCount] FROM dbo.destinationTable;

Talend Open Studio upsert operation implementation

First, we will build job in Talend. Normally you can use three components: tMSSqlInput (for  sourceTable), tMSSqlOutput (for destinationTable) and obviously tMap. Then you would switch component action for data for (Insert or Update) for tMSSqlOutput. However this approach creates a flow, which transfers data with speed of dozens rows per second, so it is extremely poor. So, let me show you how to build a well-performing Talend upsert job. 

talend upsert job

 I have used sourceTable and destinationTable as inputs for tMap. Then I have selected inner join between them to catch which rows are new and which are already in destination table. Based on that I will filter them and insert/update them.

 With such an approach the performance is better than built-in insert/update in Talend. It works in ~ 20 seconds.

SQL Server upsert operation implementation 

Now let’s check out how MERGE works in comparison to Talend.

The SQL code for MERGE:

CREATE PROCEDURE dbo.MergeExample AS BEGIN     
MERGE dbo.destinationTable AS target       
USING dbo.sourceTable AS source      
ON target.id = source.id    
WHEN MATCHED THEN           
UPDATE SET number = source.number     
WHEN NOT MATCHED THEN          
INSERT (id, number)           
VALUES (source.id, source.number); END;

In Talend you can just use TMSSqlRow with SQL Query EXEC dbo.MergeExample.

Performance: ~2 s.

So SQL is 10 times faster than Talend in this case.  

Summary

This simple theoretical test shows the same what practice taught me. Pure SQL is much faster than any ETL tool in case of many operations. With more data and more complicated transformations the advantage of SQL will be even bigger. I have seen cases where pure well written SQL is even 500 times faster than ETL tools.

Also, please note that when you have a lot of transformations, it is nearly always easier to write Table -> View – > Merge flow in SQL, then create complex jobs in ETL tool.

Therefore I would like to suggest you consider this option while developing your ETLs.

Extract, transform, load sql Database

Published at DZone with permission of Mateusz Komendołowicz, DZone MVB. See the original article here.

Opinions expressed by DZone contributors are their own.

Related

  • Advanced Maintenance of a Multi-Database Citus Cluster With Flyway
  • Migrating from Sakila-MySQL to Couchbase - Part 3: Stored Procedures
  • Running Streaming ETL Pipelines with Apache Flink on Zeppelin Notebooks
  • How to Restore a Transaction Log Backup in SQL Server

Partner Resources

×

Comments

The likes didn't load as expected. Please refresh the page and try again.

ABOUT US

  • About DZone
  • Support and feedback
  • Community research
  • Sitemap

ADVERTISE

  • Advertise with DZone

CONTRIBUTE ON DZONE

  • Article Submission Guidelines
  • Become a Contributor
  • Core Program
  • Visit the Writers' Zone

LEGAL

  • Terms of Service
  • Privacy Policy

CONTACT US

  • 3343 Perimeter Hill Drive
  • Suite 100
  • Nashville, TN 37211
  • support@dzone.com

Let's be friends: