Data Integration and ETL for Dummies (Like Me)

How to understand modern data integration and ETL from a non-technical perspective.

By David Molot
Mar. 04, 21 · Analysis

In early 2020, I was introduced to the idea of data integration through a friend who was working in the industry. Yes, I know. Extremely late. All I knew about it was that I could have my data in one (virtual) place and then have it magically appear in another (virtual) place. I had no clue how it was done or how important it was to modern businesses.

To give you some background, my past work experience is not in any kind of technical space. It is in business development and marketing for non-technical products. I probably should have been more aware of the technical world around me, but for now, you must forgive me for my ignorance.

A couple of days after my friend and I had our initial discussion about the data integration space, he explained to me the importance of being able to extract, transform, and load (ETL) data. I was curious about the inner workings of the process, so I decided I wanted to learn more about ETL and data integration.

My goal with this article is to help you learn the basics that I believe are the most important in understanding what data integration is. As a non-technical professional, I had no clue where to start other than to look up “What is data integration?” So let's start there.

What Is Data Integration?

In general, it is a bridge for data. Data integration allows data that sits in separate places (often referred to as sources) to be brought together into a unified view: a single place where all the data you have gathered from the various sources can be seen together.

This doesn’t mean the data no longer exists in the original source you pulled it from. A more accurate way of picturing it is that someone copies the data from the original source and sends that copy across the bridge. This is similar to the idea of data replication, except that data replication is just copying data into different places for security and safety purposes.

So in reality, data integration is not moving the original data around, but rather replicating it and putting the copy of the original data somewhere else (that somewhere else is usually referred to as a target) for purposes we will get into later in the article.

Another important piece of the data integration space is the transformation of data. It is essential because, without it, differences between sources can prevent your data integration pipeline (a pipeline is the data integration process from start to finish, from importing the data to exporting it) from being able to join all the data.

The problem is that data from different sources often come in different formats which cannot be combined into a unified view unless some or all of the data is transformed to the same format. I will touch more on this later in the article.
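To make the bridge a little more concrete, here is a minimal Python sketch of that idea (the file names, field names, and SQLite target are all invented for illustration): copies are pulled from two sources, reshaped into one common format, and written to a single target while the originals stay where they are.

import csv, json, sqlite3

# Extract: copy records out of two separate sources (the originals stay untouched).
with open("web_analytics.csv", newline="") as f:    # hypothetical CSV source
    analytics_rows = list(csv.DictReader(f))
with open("billing_export.json") as f:              # hypothetical JSON source
    billing_rows = json.load(f)

# Transform: reshape both copies into one common format.
unified = [
    {"source": "analytics", "date": r["date"], "value": float(r["visits"])}
    for r in analytics_rows
] + [
    {"source": "billing", "date": r["invoice_date"], "value": float(r["total"])}
    for r in billing_rows
]

# Load: write the unified copies into a single target (the unified view).
target = sqlite3.connect("unified_view.db")
target.execute("CREATE TABLE IF NOT EXISTS metrics (source TEXT, date TEXT, value REAL)")
target.executemany("INSERT INTO metrics VALUES (:source, :date, :value)", unified)
target.commit()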

Who Needs Data Integrated?

There are two major use cases for data integration.

Business Analysts

The first is business analysts who want to be able to check all their business intelligence in one place. Instead of going to Google Analytics to check how many people visited their website, Mixpanel to check what people clicked inside their website, Chargebee to check analytics on their subscription billing, and Mailchimp to check how their cold email campaign is doing, data integration allows these analysts to see it all in one place. Let’s look at an example.

Picture a simple dashboard with two graphs: the first shows monthly recurring revenue and the second shows the company's churn. Using a data integration pipeline, the user of this dashboard connects their business accounts to populate both graphs and can now view data from multiple sources in a simple unified view. Starting to see why this can be useful?

Developers

The second use case is for developers who are building products that require some sort of input of data. Try to put yourself in this scenario.

You are a developer building an accounting product that helps accountants recognize inconsistencies with their (online) bookkeeping. To do this, you need to be able to access the data of the bank account they are working with, their ERP system (an ERP system is software that helps manage main business processes), and their billing system.

To bring all of this data together, the developer would have to set up a data integration pipeline that pulls from the bank account, the ERP system, and the billing system. Once they have done that, they can load the data into their product and check for any inconsistencies in the bookkeeping. 
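What "check for any inconsistencies" might look like once the copies have landed is sketched below in Python. The record shapes and numbers are invented for illustration, and a real product would of course do far more than compare two lists.

# Hypothetical records already pulled through the pipeline from two of the sources.
bank_payments = [
    {"invoice_id": "INV-101", "amount": 250.00},
    {"invoice_id": "INV-102", "amount": 90.00},
]
billing_invoices = [
    {"invoice_id": "INV-101", "amount": 250.00},
    {"invoice_id": "INV-102", "amount": 95.00},   # does not match the bank
    {"invoice_id": "INV-103", "amount": 40.00},   # never shows up in the bank account
]

received = {p["invoice_id"]: p["amount"] for p in bank_payments}

# Flag anything the billing system and the bank account disagree about.
for invoice in billing_invoices:
    paid = received.get(invoice["invoice_id"])
    if paid is None:
        print(f"{invoice['invoice_id']}: invoiced, but no matching bank payment")
    elif abs(paid - invoice["amount"]) > 0.01:
        print(f"{invoice['invoice_id']}: invoiced {invoice['amount']:.2f}, bank shows {paid:.2f}")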

As you can now see, data integration is a valuable and important part of almost any company’s business processes these days. We will touch more on why it is so important later in this article.

How Does Data Integration Happen?

This is probably the most abstract thing that I’ll touch on in this article. As I mentioned before, you have to imagine data integration as a virtual bridge between a source and a target.

Analysts or developers will choose from three options when it comes to building their data integration pipeline. The first is having their own team build integrations. This is a time-consuming process: one integration (meaning an integration with one specific source) can take a small team up to two weeks to build, and that is not even the biggest expense.

Once built, these integrations need to be maintained. This is because, without getting too technical, there are some moving pieces that change on a consistent basis, requiring the specific integration to be updated regularly. 

The second option is hiring a team of professionals to build integrations for you. While this is a hands-off, custom solution, it is typically expensive. The team of professionals would be doing the exact same work your own team would do if you built it yourself, but (obviously) it takes all of that work off your team's plate.

The third option is using a third-party tool to help a company build its pipeline. This is a hybrid between the two previous options. You are still using your own team to set up the integrations, but many of the necessary resources are abstracted away by the software platform you decide to use. You do not have to build the integration or build the target; usually you just define the transformation layer (this means that you are deciding the format you want your data to be in).
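What "just define the transformation layer" tends to look like in practice is sketched below. This is a generic illustration, not hotglue's or Fivetran's actual API, and the function and field names are invented; the point is that the platform hands your code the records it extracted, and you only describe the shape you want them in.

# Generic sketch of a user-defined transformation layer.
# A real platform would call something like this with the records it extracted;
# the function name and record fields are invented for illustration.
def transform(records):
    transformed = []
    for record in records:
        transformed.append({
            "customer_id": str(record["id"]),
            "email": record["email"].strip().lower(),
            "signup_date": record["created_at"][:10],   # keep just the YYYY-MM-DD part
        })
    return transformed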

Something to note with these tools is that some are made for developers (like hotglue) and others are made for analysts (like Fivetran). The tools for developers often handle more complex processes, as they allow developers to take a more granular approach, while the analyst tools are great for non-technical professionals looking to set up a simple data integration pipeline.

In the past few years, many companies have been trending towards using these types of tools as they are becoming increasingly efficient, customizable, and affordable.

The reason some companies elect to handle their own data integration pipeline is the complexity their integrations may require. Typically, third-party solutions tend to be a bit more rigid and don’t allow companies to customize their pipeline on a granular level.

More specifically, there are a number of distinct processes housed under the term data integration. The most popular one is…

ETL (Extract, Transform, Load)

ETL is the most popular process within data integration. It is used when someone wants to bring data from multiple sources together but also needs the format of the data to change.

For example, if someone had an Excel or CSV file that they wanted to move to a specific database, but all the other data sources were in JSON format (a very popular data interchange format), that file would have to be transformed into JSON in order to be combined with the others.

Once the file is transformed, it is sent to the target defined for this specific pipeline where it is put together with all the other information that was pulled through. 
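As a rough sketch of that CSV-to-JSON case in Python (the file names and the JSON-lines target are invented for illustration):

import csv, json

# Extract: read the CSV file that doesn't match the other sources.
with open("customers.csv", newline="") as f:        # hypothetical source file
    rows = list(csv.DictReader(f))

# Transform: each CSV row becomes a JSON object, like the rest of the data.
records = [json.dumps(row) for row in rows]

# Load: append the converted records to the target the other sources feed into.
with open("unified_records.jsonl", "a") as target:  # hypothetical target file
    target.write("\n".join(records) + "\n")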

Sometimes the process is reordered to ELT (Extract, Load, Transform) for different use cases, where the raw data is loaded into the target first and transformed there, but each step still serves the same function.
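A quick sketch of that reordered ELT version, again with invented table and column names: the raw rows are loaded into the target exactly as extracted, and the transformation happens afterwards inside the target itself.

import sqlite3

conn = sqlite3.connect("warehouse.db")              # hypothetical target

# Load first: the raw rows go in exactly as they were extracted.
conn.execute("CREATE TABLE IF NOT EXISTS raw_orders (id TEXT, total TEXT, created_at TEXT)")
conn.executemany(
    "INSERT INTO raw_orders VALUES (?, ?, ?)",
    [("A1", "19.99", "2021-03-04T10:15:00"),
     ("A2", "5.50", "2021-03-04T11:02:00")],
)

# Transform afterwards, inside the target, with plain SQL.
conn.execute("""
    CREATE TABLE IF NOT EXISTS orders AS
    SELECT id, CAST(total AS REAL) AS total, DATE(created_at) AS order_date
    FROM raw_orders
""")
conn.commit()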

Why Is Data Integration Important?

Companies that want to remain competitive need to embrace the benefits that data will give them, despite the challenges that arise when integrating it. Having an efficient data integration pipeline allows these companies to utilize humongous datasets. It helps these companies improve their value to customers, from deepening their business intelligence to improving their real-time information delivery.

Conclusion

Let’s quickly do a review of some of the vocabulary we learned today.

  • Source: Where data originally sits and will be pulled from
  • Unified view: The singular record that is made up of the data brought in from various sources
  • Data replication: The process of creating copies of data and storing them in different locations  —  usually for safety and security measures
  • Target: Where you want the data to end up
  • Transform/transformation: Changing the formatting of data
  • Pipeline: A data integration process from start (initially getting the data) to end (where the data is outputted)
  • ERP: Enterprise resource planning is software that helps companies manage main business processes like accounting
  • ETL: Stands for Extract, Transform, Load — the most popular data integration framework used today
  • CSV: Comma-Separated Values, a popular plain-text file format for tabular data
  • JSON: JavaScript Object Notation, a popular data interchange format

This is just the beginning of data integration. The topic is quite large and this barely scratches the surface, but hopefully this article helped get you started. Cheers to no longer being a data integration dummy (maybe)!

Thank you so much for reading. 

Tags: Data integration, Extract, transform, load

Opinions expressed by DZone contributors are their own.
