How to Write ETL Operations in Python

Clean and transform raw data into an ingestible format using Python.

By Hassan Syyid · Feb. 10, 21 · Tutorial

In this article, you’ll learn how to work with Excel/CSV files in a Python environment to clean and transform raw data into a more ingestible format. This is typically useful for data integration.

This example will touch on many common ETL operations such as filter, reduce, explode, and flatten.

Notes

The code for these examples is available publicly on GitHub here, along with descriptions that mirror the information I'll walk you through.

These samples rely on two open source Python packages:

  • pandas: a widely used open-source data analysis and manipulation tool. More info on their site and PyPI.
  • gluestick: a small open-source Python package containing utility functions for ETL, maintained by the hotglue team. More info on PyPI and GitHub.

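If you'd like to run the samples locally, both packages can be installed from PyPI (a minimal setup sketch; use whatever environment manager you prefer):

Shell

pip install pandas gluestick
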
Without further ado, let's dive in!

Introduction

This example leverages sample Quickbooks data from the Quickbooks Sandbox environment, and was initially created in a hotglue environment, a lightweight data integration tool for startups.

Feel free to follow along with the Jupyter Notebook on GitHub!

Step 1: Read the Data

Let's start by reading the data.

This example is built on a hotglue environment with data coming from Quickbooks. In hotglue, the data is placed in the local sync-output folder in a CSV format. We will use the gluestick package to read the raw data in the input folder into a dictionary of pandas dataframes using the read_csv_folder function.

By specifying index_cols={'Invoice': 'DocNumber'}, the Invoice dataframe will use the DocNumber column as its index. By specifying converters, we can use ast to parse the JSON data in the Line and CustomField columns.

Python

import ast
import gluestick as gs
import pandas as pd

# standard directory for hotglue
ROOT_DIR = "./sync-output"

# Read input data
input_data = gs.read_csv_folder(ROOT_DIR,
  index_cols={'Invoice': 'DocNumber'},
  converters={
    'Invoice': {
      'Line': ast.literal_eval,
      'CustomField': ast.literal_eval,
      'Categories': ast.literal_eval
    }
  }
)

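Since read_csv_folder returns a dictionary of pandas dataframes keyed by entity name, a quick sanity check (a minimal sketch, assuming the read above succeeded) is to list which entities were loaded:

Python

# The dictionary keys are the entity names; 'Invoice' should be among them
print(input_data.keys())
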

Take a Peek

Let's take a look at what data we're working with. For simplicity, I've selected the columns I'd like to work with and saved them to input_df. Typically, in hotglue, you can configure this using a field map, but I've done it manually here.

Python

input_df = input_data['Invoice'][['Id', 'CustomerRef__value', 'CustomerRef__name', 'MetaData__LastUpdatedTime', 'MetaData__CreateTime', 'CurrencyRef__name', 'CurrencyRef__value', 'Line', 'CustomField']]


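To actually take that peek, head() previews the first few rows (assuming you're following along in the Jupyter Notebook, as in the original):

Python

# Preview the selected columns
input_df.head()
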

Step 2: Rename Columns

Let's clean up the data by renaming the columns to more readable names.

Plain Text

CustomerRef__value -> CustomerId
CustomerRef__name -> Customer
MetaData__LastUpdatedTime -> LastUpdated
MetaData__CreateTime -> CreatedOn
CurrencyRef__name -> Currency
CurrencyRef__value -> CurrencyCode


Python

# Let's clean up the names of these columns
invoices = input_df.pipe(lambda x: x.rename(columns={'CustomerRef__value': 'CustomerId', 'CustomerRef__name': 'Customer',
                                                     'MetaData__LastUpdatedTime': 'LastUpdated',
                                                     'MetaData__CreateTime': 'CreatedOn', 'CurrencyRef__name': 'Currency',
                                                     'CurrencyRef__value': 'CurrencyCode'}))
invoices.head()

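As a side note, the .pipe(lambda ...) wrapper just keeps the transformation chainable with the later steps; calling rename directly gives the same result (a minimal equivalent sketch):

Python

# Equivalent without pipe: rename returns a new dataframe with the renamed columns
invoices = input_df.rename(columns={'CustomerRef__value': 'CustomerId',
                                    'CustomerRef__name': 'Customer',
                                    'MetaData__LastUpdatedTime': 'LastUpdated',
                                    'MetaData__CreateTime': 'CreatedOn',
                                    'CurrencyRef__name': 'Currency',
                                    'CurrencyRef__value': 'CurrencyCode'})
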


Step 3: Extract Information

The Line column is actually a serialized JSON object provided by Quickbooks with several useful elements in it. We'll need to start by flattening the JSON and then exploding into unique columns so we can work with the data.

Again, we'll use the gluestick package to accomplish this. The explode_json_to_rows function handles the flattening and exploding in one step. To avoid exploding too many levels of this object, we'll specify max_level=1. 

Here is a snippet from one entry to give you an idea.

JSON

[{
    'Id': '1',
    'LineNum': '1',
    'Amount': 275.0,
    'DetailType': 'SalesItemLineDetail',
    'SalesItemLineDetail': {
        'ItemRef': {
            'value': '5',
            'name': 'Rock Fountain'
        },
        'ItemAccountRef': {
            'value': '79',
            'name': 'Sales of Product Income'
        },
        'TaxCodeRef': {
            'value': 'TAX',
            'name': None
        }
    },
    'SubTotalLineDetail': None,
    'DiscountLineDetail': None
}]


Python

# Let's explode the Line column now
invoices = invoices.pipe(gs.explode_json_to_rows, "Line", max_level=1)
invoices.head()

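After the explode, each invoice line item gets its own row, and the flattened keys show up as dotted column names (such as Line.DetailType, used in the next step). A quick way to confirm which columns were produced (a minimal sketch):

Python

# List the new columns flattened out of the Line objects
print([col for col in invoices.columns if col.startswith('Line.')])
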

Step 4: Filter Rows

For our purposes, we only want to work with rows with a Line.DetailType of SalesItemLineDetail (we don't need sub-total lines). This is a common ETL operation known as filtering and is accomplished easily with pandas:

Python

# We filter for only SalesItemLineDetail
invoices = invoices.pipe(lambda x: x[x['Line.DetailType'] == 'SalesItemLineDetail'])
invoices.head()


Step 5: More Exploding

Look at some of the entries from the Line column we exploded. You'll notice they are name-value pairs in JSON.

Let's use gluestick again to explode these into new columns via the json_tuple_to_cols function. We'll need to specify lookup_keys — in our case, the key_prop=name and value_prop=value.

Python

# Specify lookup keys
qb_lookup_keys = {'key_prop': 'name', 'value_prop': 'value'}

# Explode these into new columns
invoices = (invoices.pipe(gs.json_tuple_to_cols, 'Line.SalesItemLineDetail.ItemRef',
                          col_config={'cols': {'key_prop': 'Item', 'value_prop': 'Item Id'},
                                      'look_up': qb_lookup_keys})
                    .pipe(gs.json_tuple_to_cols, 'Line.SalesItemLineDetail.ItemAccountRef',
                          col_config={'cols': {'key_prop': 'Item Ref', 'value_prop': 'Item Ref Id'},
                                      'look_up': qb_lookup_keys}))
invoices[['Id', 'Item', 'Item Id', 'Item Ref', 'Item Ref Id']].head()


Step 6: Some More Exploding

Take a look at the CustomField column. Below is an example of an entry:

Plain Text

[{'DefinitionId': '1', 'Name': 'Crew #', 'Type': 'StringType', 'StringValue': '102'}]


You can see this is JSON-encoded data specifying one custom field: Crew # with a value of 102.

To explode this, we'll need to reduce this as we only care about the Name and StringValue. We can use gluestick's explode_json_to_cols function with an array_to_dict_reducer to accomplish this.

Python

# Grab the string value of entries
invoices = invoices.pipe(gs.explode_json_to_cols, 'CustomField', reducer=gs.array_to_dict_reducer('Name', 'StringValue'))
invoices[['Id', 'CustomField.Crew #']].head()


Conclusion

Our final dataframe has one row per sales line item, with readable column names and the item references and Crew # custom field broken out into their own columns. In this sample, we went through several basic ETL operations using a real-world example, all with basic Python tools.

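If you want to round out the "load" part of ETL, a minimal sketch (the output folder and CSV target here are assumptions, not part of the original walkthrough) is to write the transformed dataframe back out with pandas:

Python

import os

# Hypothetical output location; point this at your actual destination
OUTPUT_DIR = "./etl-output"
os.makedirs(OUTPUT_DIR, exist_ok=True)

# Persist the transformed invoices for downstream loading
invoices.to_csv(os.path.join(OUTPUT_DIR, "invoices.csv"))
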
Feel free to check out the open source hotglue recipes for more samples in the future. Thanks for reading!


Published at DZone with permission of Hassan Syyid. See the original article here.

Opinions expressed by DZone contributors are their own.
