Accelerated Extract-Load-Transform Data Pipelines
Join the DZone community and get the full member experience.Join For Free
As a columnar database with both strong CPU and GPU performance, the OmniSci platform is well suited for Extract-Load-Transform (ELT) pipelines (as well as the data science workloads we more frequently demonstrate). In this blog post, I’ll demonstrate an example ELT workflow, along with some helpful tips when merging various files with drifting data schemas. If you’re not familiar with the two major data processing workflows, the next section briefly outlines the history and reasoning for ETL-vs-ELT; if you’re just interested in the mechanics of doing ELT in OmniSci, you can skip to the “Baywheels Bikeshare Data” section.
A Brief History of ETL vs. ELT for Loading Data
From the first computerized databases in the 1960s, the Extract-Transform-Load (ETL) data processing methodology has been an integral part of running a data-driven business. Historically, storing and processing data was too expensive to be accumulating data without knowing what you were going to do with it, so a process, such as the following. would occur each day:
- Transfer yesterday’s data from FTP to a staging area.
- Read each data file, keeping only the fields deemed “useful” by a business process and loading that data to a staging database environment.
- A series of stored procedures or views transforms the raw data, enriching and/or summarizing the data and storing in another table.
- The data is loaded from the staging table to the production database table.
- The production database table indices are re-calculated.
- The source data is retained for a short period in case the pipeline fails; otherwise, it is migrated to cold storage or deleted
While the ETL method of data processing can be effective when business goals are well-understood, it has two major flaws: inflexibility and latency.
You may also like: The State of ETL: Traditional to Cloud.
From a flexibility standpoint, once you complete the ETL process, making changes requires developer time (and might not be possible at all depending on how integrated other systems are to the database). If your business rules or business environment changes rapidly, it will be difficult to near impossible to keep the production database up-to-date. If you were able to get a developer to make changes to the ETL process, re-processing the data would take days or weeks at best (assuming the source data was available to begin with).
Fast-forwarding to the early-to-mid-2010s, the Hadoop project showed that storing data on commodity hardware could be cost-effective, and the concept of “data lakes” started to take shape. Businesses began storing data well in advance of knowing what they wanted to do with it.
However, having huge volumes of data in a data lake doesn’t fit well with the ETL view of the world; rather, with newer technologies like distributed clusters of machines, columnar databases, and GPUs, Extract-Load-Transform often makes more sense.
Instead of building elaborate processing pipelines with pre-defined business rules and aggregations, data scientists need to take a bit more time upfront to create their dataset, but with that extra work comes the flexibility to create the exact transformations desired at whatever data fidelity is required.
Baywheels Bikeshare Data
I’ve previously written about analyzing bikeshare data in OmniSci; in the year since I’ve written that post, the bikeshare program rebranded from FordGoBike to Baywheels, but luckily, they still provide the aggregated system-level dataset on a monthly basis for all of their rides. The data is provided in an S3 bucket, and appears to have a very similar size and format; however, once you start trying to load the data, you’ll notice that unfortunately the files have different schemas.
Create Table and Copy From S3 Into OmniSci
For the 2017 FordGoBike trips, we can use a
create table statement, then use
copy from via omnisql to load the data:
However, once you attempt to load the January 2018 file, you’ll get the following message:
When you see a message such as this one, where 0 records are loaded and 100% of the records are rejected, the first question to ask is “Has the schema changed?” In this case, an additional column
bike_share_for_all_trip has been added to the end of the January 2018 file, and since OmniSci isn’t expecting the extra column, it rejects every record.
Adding a Column Using ALTER TABLE
To handle the additional column, we can use
ALTER TABLE with
ADD COLUMN to modify the table schema:
This is as close to a “free” operation as you can get in OmniSci;
ALTER TABLE....ADD COLUMN only changes the table metadata; it doesn’t copy data or do any other processing. Without running any additional commands, the values in the
bike_share_for_all_trip column will be NULL, which is appropriate since we don’t know if the 2017 rides would’ve qualified for this program.
Going through a series of
COPY FROM commands, we can load the 201801 - 201905 monthly files, until the following happens:
Like before, we get a message that 0 records are loaded, all rejected, so we know that something is wrong. And again after inspecting the file, we see a new column
rental_access_method. This column has the strange property where it is only populated when
bike_share_for_all_trip is not populated, which might give us a clue to its meaning. As it stands, it seems like a mistake rather than a new data element, so rather than use
ALTER TABLE again, I’m going to make a staging table to do some additional processing.
Loading Data to Staging Table, Validate With Immerse
Both the 201906 and 201907 files have an extra column that is only populated when `bike_share_for_all_trip` is blank, so I’ll load both of these months into a separate staging table:
Just to up the level of difficulty, you’ll notice the second
COPY FROM statement has an extra argument to set the delimiter to
; ...always great to see that rare “SEMI-COLON separated comma-separated-values” file in the wild!
With the data loaded into a staging table, I can now use OmniSci Immerse to validate a hunch: if the extra column has the same distribution as the
bike_share_for_all_trip column, then I’m going to assume that the extra column is actually used to derive the
bike_share_for_all_tripwith nulls removed (remember, we have nulls because this column didn’t exist in 2017). The middle chart shows the distribution of the
bike_share_for_all_tripcolumn in the staging table (without nulls), and the chart on the right shows the
rental_access_methodcolumn in the staging table (without nulls).
While the distributions aren’t exactly equal, they look pretty close to me, and the fact that the two columns have the same cardinality (two values and NULL) and one is populated when the other isn’t, I’m comfortable merging the two columns together.
Inserting Into an OmniSci Table With a Query
The final step of this ELT process is to insert the data from the staging table into the main table:
Using a case statement, we can re-code the values from the
rental_access_method column to
No, as well as conditionally use those values only when
NULL. Within that same operation, we can use
INSERT INTO to insert that data into the main
baywheels_tripdata table, and then for database cleanliness we can drop the staging table.
ELT: One More Tool in the Data Science Toolbox
In this post I’ve demonstrated that you can do all the data pre-processing for a publicly-available dataset without leaving the OmniSci platform. With the high-performance, columnar nature of OmniSci, complemented by modern GPU hardware, you can remove the need to build elaborate data processing pipelines before analyzing data. Instead, an ELT workflow can help your business be more agile in its data science explorations.
Published at DZone with permission of Randy Zwitch. See the original article here.
Opinions expressed by DZone contributors are their own.