
ETL vs ELT: The Difference is in the How

A technical comparison between ETL and ELT. What happens when the 'transform' and 'load' steps are switched around in the process?

By Roi Avinoam · Jan. 11, 17 · Tutorial

For the last couple of decades, ETL (extract, transform, load) has been the traditional approach for data warehousing and analytics. The ELT (extract, load, transform) approach changes the old paradigm. But what’s actually happening when the “T” and the “L” are switched?

ETL and ELT solve the same need:

Billions of data points and events need to be collected, processed, and analyzed by businesses. The data needs to be clean, manageable, and ready to analyze. It needs to be enriched, molded, and transformed to make it meaningful.

But the “how” is what’s different, and it’s what leads to new possibilities in many modern data projects. There are differences in how the raw data is managed, when the processing is done, and how the analysis is performed.

In this article, we’ll demonstrate the technological differences between ETL and ELT, show data engineering and analysis examples of the two approaches, and summarize 10 pros and cons.

The Technological Differences

Let’s first align on the three stages (E, T, and L), with a minimal code sketch after the list:

  • Extraction: retrieving raw data from an unstructured data pool and migrating it into a temporary staging data repository.
  • Transformation: structuring, enriching, and converting the raw data to match the target system.
  • Loading: loading the structured data into a data warehouse, where it can be analyzed and used by business intelligence (BI) tools.
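
Taken together, the three stages form a simple pipeline. Here’s a minimal, self-contained Python sketch of the classic E-then-T-then-L ordering (all function names and data are illustrative, not from any particular tool):

def extract(source):
    # E: pull the raw records out of the source system into a staging area.
    return list(source)

def transform(rows):
    # T: structure and convert the raw rows to match the target schema.
    return [{"ip": r["ip"], "amount": float(r["amount"])} for r in rows]

def load(warehouse, rows):
    # L: write the structured rows into the warehouse.
    warehouse.extend(rows)

warehouse = []
load(warehouse, transform(extract([{"ip": "2.248.0.0", "amount": "100"}])))
print(warehouse)  # [{'ip': '2.248.0.0', 'amount': 100.0}]

The key point for the rest of this article is simply the ordering: in ETL, transform() runs before load(); in ELT, the two calls swap.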

ETL vs. ELT

ETL requires management of the raw data, including the extraction of the required information and running the right transformations to ultimately serve the business needs. Each stage (extraction, transformation, and loading) requires interaction by data engineers and developers, as well as dealing with the capacity limitations of traditional data warehouses. Using ETL, analysts and other BI users have become accustomed to waiting, since simple access to the information is not available until the whole ETL process has been completed.

[Graphic: how ETL works (extract, transform, load)]

In the ELT approach, after you’ve extracted your data, you immediately start the loading phase, moving all the data sources into a single, centralized data repository. With today’s cloud infrastructure technologies, systems can support large storage and scalable compute, so a large, expanding data pool with fast processing is virtually always available for maintaining all the extracted raw data.

[Graphic: how ELT works (extract, load, transform)]

In this way, the ELT approach provides a modern alternative to ETL. However, it’s still evolving: the frameworks and tools that support the ELT process are not always fully developed to facilitate the loading and processing of large amounts of data. The upside is very promising: unlimited access to all of your data at any time, and savings in effort and time for developers, BI users, and analysts.

A Hands-On Example

Here’s an example that illustrates the technological differences between ETL and ELT and drills down into the details.

Our demonstration will use two data tables: one for purchases and another for currencies, as shown below:

Purchases table:

ip               amount   currency
2.248.0.0        100      euro
2.248.0.0        200      gbp
72.229.28.185    300      usd

Currencies table:

currency   rate
euro       1.12
gbp        1.3
usd        1
To understand the fundamentals, we’ll look at how this sample is processed in ETL and in ELT. For each approach, we’ll show how to calculate a single summary table from these two tables: the average purchase per country (based on the IP address provided). For example, the two purchases from the Swedish IP convert to 100 × 1.12 = 112 and 200 × 1.3 = 260 USD, so the average for Sweden is (112 + 260) / 2 = 186.

ETL: Data Transformation on Extracted Data

In the ETL process, the transformation stage applies a series of rules or functions to the extracted data to create the table that will be loaded.

Here’s some code to demonstrate the preliminary data transformation process for ETL:

def transform(data):
    countries = {}
    for d in data:
        # map the IP address to its country
        country = ip2country(d["ip"])
        # convert the purchase to a common currency using the rates lookup
        amount = d["amount"] * currencies[d["currency"]]  # where do we keep these?
        acc = countries.setdefault(country, {"sum": 0, "count": 0})
        acc["sum"] += amount
        acc["count"] += 1

    # compute the averages
    output = []
    for k, acc in countries.items():
        output.append({
            "country": k,
            "amount": acc["sum"] / acc["count"]
        })
    return output

Using this script, we map each IP address to its related country, derive a new calculated “amount” value by multiplying each purchase by the matching rate from the currencies table, group the rows by country, and compute the average purchase amount per country.
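
To see the function run end to end, here’s a minimal sketch that supplies the two lookups it depends on (ip2country and currencies) as hypothetical in-memory stand-ins, along with the sample purchases:

# Hypothetical stand-ins for the external lookups the ETL job must keep around.
IP_TO_COUNTRY = {"2.248.0.0": "sweden", "72.229.28.185": "usa"}
currencies = {"euro": 1.12, "gbp": 1.3, "usd": 1}

def ip2country(ip):
    return IP_TO_COUNTRY.get(ip, "unknown")

purchases = [
    {"ip": "2.248.0.0", "amount": 100, "currency": "euro"},
    {"ip": "2.248.0.0", "amount": 200, "currency": "gbp"},
    {"ip": "72.229.28.185", "amount": 300, "currency": "usd"},
]

print(transform(purchases))
# [{'country': 'sweden', 'amount': 186.0}, {'country': 'usa', 'amount': 300.0}]

Note that the lookup tables have to travel with the transformation code; that’s the “where do we keep these?” question in the script above.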

This data transformation results in a new table with the average amount per country:

Average amount per country:

country   amount
usa       300
sweden    186

ELT: Data Transformation at Query Runtime

In contrast to ETL, with ELT all of the data is already loaded and can be used at any point in time.

Therefore, the transformation is done at query runtime:

SELECT ip2country(ip) AS country,
       AVG(amount * currencies.rate) AS amount
FROM purchases
NATURAL JOIN currencies
GROUP BY 1;

In the query, we map each IP address to its country, join the two tables on their common currency column, multiply each amount from the purchases table by the matching rate from the currencies table, and average the result per country.

This results in exactly the same output table as the ETL process above. However, in this case, since all the raw data has been loaded, we can easily continue running other queries in the same environment to test and identify the best possible data transformations that match the business requirements.
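
To see the load-first flow end to end, here’s a minimal, self-contained sketch using Python’s built-in sqlite3 module as a stand-in warehouse (SQLite is only for illustration; a real ELT setup would use a scalable cloud warehouse such as Redshift or BigQuery, and ip2country here is a hypothetical user-defined function):

import sqlite3

# Hypothetical IP-to-country lookup, registered as a SQL function.
IP_TO_COUNTRY = {"2.248.0.0": "sweden", "72.229.28.185": "usa"}

conn = sqlite3.connect(":memory:")
conn.create_function("ip2country", 1, lambda ip: IP_TO_COUNTRY.get(ip, "unknown"))

# "L" before "T": load the raw tables as-is, with no up-front shaping.
conn.executescript("""
    CREATE TABLE purchases (ip TEXT, amount REAL, currency TEXT);
    INSERT INTO purchases VALUES
        ('2.248.0.0', 100, 'euro'),
        ('2.248.0.0', 200, 'gbp'),
        ('72.229.28.185', 300, 'usd');
    CREATE TABLE currencies (currency TEXT, rate REAL);
    INSERT INTO currencies VALUES ('euro', 1.12), ('gbp', 1.3), ('usd', 1);
""")

# "T" happens at query time, directly against the raw data.
rows = conn.execute("""
    SELECT ip2country(ip) AS country, AVG(amount * rate) AS amount
    FROM purchases NATURAL JOIN currencies
    GROUP BY 1 ORDER BY 1
""").fetchall()
print(rows)  # [('sweden', 186.0), ('usa', 300.0)]

Because the raw tables stay in the repository, trying a different transformation is just another query; nothing has to be re-extracted or re-loaded.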

The Bottom Line of This Hands-On Example

ELT is more efficient than ETL in terms of development effort. In addition, ELT is much more flexible: users can run new transformations and test and enhance queries directly on the raw data as required, without the time and complexity that we’ve become used to with ETL.

Managing Data Warehouses and Data Lakes

According to Gartner, the data management and data integration needs of businesses today span both small and big data, structured and unstructured. Here’s what they suggest needs to change in the way of working:

“The traditional BI team needs to continue developing clear best practices, with well understood business objectives… there is a second mode of BI which is more fluid and … highly iterative with unforeseen data discovery and is allowed to fail fast.”

This type of conversation has created a lot of talk in the industry about data warehouses vs. data lakes. The data lake concept is a new way of thinking about big data: a store for unstructured data made for infinite scaling, using tools like Hadoop, and suited to implementing the second mode of BI work that Gartner describes. And although enterprises still use data warehouses to support a traditional paradigm such as ETL, scalable modern data warehouses such as Redshift and BigQuery can be used to implement the modern ELT paradigm, with all its inherent benefits mentioned above.

IBM talks about five things that modern big data projects require, showing the need for new data concepts like the data lake. They’re the five V’s:

  1. Volume: the volume of (raw) data
  2. Variety: the variety of data (e.g., structured, unstructured, semi-structured)
  3. Velocity: the speed of data processing, consumption, or analytics
  4. Veracity: the level of trust in the data
  5. (Value): the value behind the data

ETL continues to be a good match when dealing with legacy data warehouses: looking at smaller subsets of data and moving them into the warehouse. But it’s hard for ETL to provide a solution for the five V’s as you go down the list: how do you deal with the volume? The unstructured data? The velocity?

The ELT approach opens the opportunity to work in a more fluid, iterative BI environment thanks to its efficiency and flexibility. ELT supports the data warehouse and extends to the data lake concept, enabling the incorporation of unstructured data into the BI solution.

Summarizing 10 Pros and Cons

To summarize the two approaches, we’ve grouped the differences into 10 criteria:

1. Time: Load

ETL: Uses a staging area and system, which adds extra time to load the data.

ELT: All in one system; the data is loaded only once.

2. Time: Transformation

ETL: You need to wait, especially for big data sizes; as the data grows, transformation time increases.

ELT: All in one system; speed is not dependent on data size.

3. Time: Maintenance

ETL: High maintenance; you choose which data to load and transform, and you must do it all again if the data is deleted or you want to enhance the main data repository.

ELT: Low maintenance; all of the data is always available.

4. Implementation Complexity

ETL: At an early stage, requires less space, and the result is clean.

ELT: Requires in-depth knowledge of the tools and expert design of the main, large repository.

5. Analysis and Processing Style

ETL: Based on multiple scripts to create the views; deleting a view means deleting data.

ELT: Creating ad hoc views; low cost for building and maintaining.

6. Data Limitations or Restrictions in Supply

ETL: Limited by presuming and choosing the data a priori.

ELT: Limited only by hardware (effectively not at all) and by the data retention policy.

7. Data Warehouse Support

ETL: The prevalent legacy model, used for on-premises, relational, structured data.

ELT: Tailored to scalable cloud infrastructure that supports structured and unstructured big data sources.

8. Data Lake Support

ETL: Not part of the approach.

ELT: Enables use of a data lake, with support for unstructured data.

9. Usability

ETL: Fixed tables, fixed timelines, used mainly by IT.

ELT: Ad hoc, agile, and flexible; usable by everyone from the developer to the citizen integrator.

10. Cost-Effectiveness

ETL: Not cost-effective for small and medium businesses.

ELT: Scalable and available to businesses of all sizes using online SaaS solutions.

Final Thoughts

ETL is outdated. It helped cope with the limitations of traditional, rigid data center infrastructures, which, thanks to the cloud, are no longer a barrier today. In organizations with large data sets of even just a few terabytes, load time can take hours, depending on the complexity of the transformation rules.

ELT is an important part of the future of data warehousing. With ELT, businesses of any size can capitalize on current technologies. By analyzing larger pools of data with more agility and less maintenance, businesses gain key insights that create a real competitive advantage and help them excel.

Want to share your thoughts? Follow the conversation on Twitter.


Published at DZone with permission of Roi Avinoam, DZone MVB. See the original article here.

Opinions expressed by DZone contributors are their own.
