Accelerated Extract-Load-Transform Data Pipelines


In this article, we discuss how to perform accelerated Extract-Load-Transform (ELT) processing on data pipelines with OmniSciDB.


As a columnar database with both strong CPU and GPU performance, the OmniSci platform is well suited for Extract-Load-Transform (ELT) pipelines (as well as the data science workloads we more frequently demonstrate). In this blog post, I’ll demonstrate an example ELT workflow, along with some helpful tips for merging files with drifting data schemas. If you’re not familiar with the two major data processing workflows, the next section briefly outlines the history of, and reasoning behind, ETL vs. ELT; if you’re just interested in the mechanics of doing ELT in OmniSci, you can skip to the “Baywheels Bikeshare Data” section.

A Brief History of ETL vs. ELT for Loading Data

Since the first computerized databases in the 1960s, the Extract-Transform-Load (ETL) data processing methodology has been an integral part of running a data-driven business. Historically, storing and processing data was too expensive to justify accumulating data without knowing what you were going to do with it, so a process such as the following would occur each day:

  1. Transfer yesterday’s data from FTP to a staging area.
  2. Read each data file, keeping only the fields deemed “useful” by a business process, and load that data into a staging database environment.
  3. Run a series of stored procedures or views to transform the raw data, enriching and/or summarizing it and storing the result in another table.
  4. Load the data from the staging table into the production database table.
  5. Re-calculate the production database table indices.
  6. Retain the source data for a short period in case the pipeline fails; otherwise, migrate it to cold storage or delete it.

While the ETL method of data processing can be effective when business goals are well-understood, it has two major flaws: inflexibility and latency. 

From a flexibility standpoint, once you complete the ETL process, making changes requires developer time (and might not be possible at all, depending on how tightly other systems are integrated with the database). If your business rules or business environment changes rapidly, it will be difficult, if not impossible, to keep the production database up-to-date. And even if you could get a developer to make changes to the ETL process, re-processing the data would take days or weeks at best (assuming the source data was still available to begin with).

Fast-forwarding to the early-to-mid-2010s, the Hadoop project showed that storing data on commodity hardware could be cost-effective, and the concept of “data lakes” started to take shape. Businesses began storing data well in advance of knowing what they wanted to do with it.

However, having huge volumes of data in a data lake doesn’t fit well with the ETL view of the world; rather, with newer technologies like distributed clusters of machines, columnar databases, and GPUs, Extract-Load-Transform often makes more sense.

Instead of building elaborate processing pipelines with pre-defined business rules and aggregations, data scientists need to take a bit more time upfront to create their dataset, but with that extra work comes the flexibility to create the exact transformations desired at whatever data fidelity is required.

Baywheels Bikeshare Data

I’ve previously written about analyzing bikeshare data in OmniSci; in the year since I wrote that post, the bikeshare program rebranded from FordGoBike to Baywheels, but luckily, it still provides the aggregated system-level dataset on a monthly basis for all of its rides. The data is provided in an S3 bucket and appears to have a very similar size and format; however, once you start trying to load the data, you’ll notice that, unfortunately, the files have different schemas.

Create Table and Copy From S3 Into OmniSci

For the 2017 FordGoBike trips, we can use a CREATE TABLE statement, then use COPY FROM via omnisql to load the data:

SQL

--define table for 2017 schema
CREATE TABLE baywheels_tripdata(
    duration_sec INTEGER,
    start_time TIMESTAMP,
    end_time TIMESTAMP,
    start_station_id SMALLINT,
    start_station_name TEXT ENCODING DICT,
    start_station_latitude FLOAT,
    start_station_longitude FLOAT,
    end_station_id SMALLINT,
    end_station_name TEXT ENCODING DICT,
    end_station_latitude FLOAT,
    end_station_longitude FLOAT,
    bike_id SMALLINT,
    user_type TEXT ENCODING DICT,
    member_birth_year SMALLINT,
    member_gender TEXT ENCODING DICT
);

--load data for 2017
copy baywheels_tripdata from 's3://baywheels-data/2017-fordgobike-tripdata.csv';
However, once you attempt to load the January 2018 file, you’ll get the following message:

Shell

omnisql> copy baywheels_tripdata from 's3://baywheels-data/201801-fordgobike-tripdata.csv.zip';

Result

Loaded: 0 recs, Rejected: 94802 recs in 5.752000 secs
When you see a message such as this one, where 0 records are loaded and 100% of the records are rejected, the first question to ask is “Has the schema changed?” In this case, an additional column bike_share_for_all_trip has been added to the end of the January 2018 file, and since OmniSci isn’t expecting the extra column, it rejects every record.
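
One quick way to confirm the schema drift is to compare the file’s header row against the table definition. Here is a rough sketch; it assumes the bucket is reachable over its public HTTPS endpoint and that curl and unzip are available locally:

Shell

# download the January 2018 file and print only its CSV header row
# (s3://baywheels-data accessed via its assumed public HTTPS endpoint)
curl -sO https://s3.amazonaws.com/baywheels-data/201801-fordgobike-tripdata.csv.zip
unzip -p 201801-fordgobike-tripdata.csv.zip | head -1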

Adding a Column Using ALTER TABLE

To handle the additional column, we can use ALTER TABLE with ADD COLUMN to modify the table schema:

SQL

--add column bike_share_for_all_trip, which starts with Jan 2018 file
ALTER TABLE baywheels_tripdata
    ADD COLUMN bike_share_for_all_trip TEXT ENCODING DICT;
This is as close to a “free” operation as you can get in OmniSci; ALTER TABLE ... ADD COLUMN only changes the table metadata; it doesn’t copy data or do any other processing. Without running any additional commands, the values in the bike_share_for_all_trip column will be NULL, which is appropriate since we don’t know if the 2017 rides would’ve qualified for this program.
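
Since the new column is metadata-only, a quick sanity check (a small sketch using the table defined above, not part of the original pipeline) is to confirm that every pre-existing 2017 row now returns NULL for it:

SQL

--confirm the pre-existing rows show NULL for the newly added column
SELECT COUNT(*) AS null_rows
FROM baywheels_tripdata
WHERE bike_share_for_all_trip IS NULL;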

Going through a series of COPY FROM commands, we can load the 201801 - 201905 monthly files, until the following happens:

Shell

omnisql> copy baywheels_tripdata from 's3://baywheels-data/201905-baywheels-tripdata.csv.zip';

Result

Loaded: 182163 recs, Rejected: 0 recs in 2.440000 secs

omnisql> copy baywheels_tripdata from 's3://baywheels-data/201906-baywheels-tripdata.csv.zip';

Result

Loader truncated due to reject count.  Processed : 0 recs, Rejected: 191772 recs in 8.323000 secs
As before, we get a message that 0 records were loaded and all were rejected, so we know something is wrong. After inspecting the file, we again find a new column, rental_access_method. This column has the strange property of only being populated when bike_share_for_all_trip is not populated, which might give us a clue to its meaning. As it stands, it seems like a mistake rather than a new data element, so rather than use ALTER TABLE again, I’m going to make a staging table to do some additional processing.

Loading Data Into a Staging Table and Validating With Immerse

Both the 201906 and 201907 files have an extra column that is only populated when `bike_share_for_all_trip` is blank, so I’ll load both of these months into a separate staging table:

SQL

--201906 and 201907 files have extra blank column
CREATE TABLE baywheels_tripdata_extracol(
    duration_sec INTEGER,
    start_time TIMESTAMP,
    end_time TIMESTAMP,
    start_station_id SMALLINT,
    start_station_name TEXT ENCODING DICT,
    start_station_latitude FLOAT,
    start_station_longitude FLOAT,
    end_station_id SMALLINT,
    end_station_name TEXT ENCODING DICT,
    end_station_latitude FLOAT,
    end_station_longitude FLOAT,
    bike_id SMALLINT,
    user_type TEXT ENCODING DICT,
    member_birth_year SMALLINT,
    member_gender TEXT ENCODING DICT,
    bike_share_for_all_trip TEXT ENCODING DICT,
    extracol TEXT ENCODING DICT
);

copy baywheels_tripdata_extracol from 's3://baywheels-data/201906-baywheels-tripdata.csv.zip';
copy baywheels_tripdata_extracol from 's3://baywheels-data/201907-baywheels-tripdata.csv.zip' with (delimiter=';');
Just to up the level of difficulty, you’ll notice the second COPY FROM statement has an extra argument to set the delimiter to ';'...always great to see that rare “SEMI-COLON separated comma-separated-values” file in the wild!

With the data loaded into a staging table, I can now use OmniSci Immerse to validate a hunch: if the extra column has the same distribution as the bike_share_for_all_trip column, then I’m going to assume that the extra column is actually used to derive the bike_share_for_all_trip column:

 
Comparison of cardinality between suspected similar columns

The chart on the left is the distribution of bike_share_for_all_trip with nulls removed (remember, we have nulls because this column didn’t exist in 2017). The middle chart shows the distribution of the bike_share_for_all_trip column in the staging table (without nulls), and the chart on the right shows the rental_access_method column in the staging table (without nulls).

While the distributions aren’t exactly equal, they look pretty close to me; given that the two columns have the same cardinality (two values and NULL) and one is populated exactly when the other isn’t, I’m comfortable merging the two columns together.
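
If you prefer to check this in omnisql rather than Immerse, a couple of GROUP BY queries against the staging table defined above (a small sketch of the same cardinality comparison) show the two distributions directly:

SQL

--distribution of the original column in the staging table
SELECT bike_share_for_all_trip, COUNT(*) AS n
FROM baywheels_tripdata_extracol
GROUP BY bike_share_for_all_trip;

--distribution of the extra column loaded from rental_access_method
SELECT extracol, COUNT(*) AS n
FROM baywheels_tripdata_extracol
GROUP BY extracol;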

Inserting Into an OmniSci Table With a Query

The final step of this ELT process is to insert the data from the staging table into the main table:

SQL

insert into baywheels_tripdata
select
    duration_sec,
    start_time,
    end_time,
    start_station_id,
    start_station_name,
    start_station_latitude,
    start_station_longitude,
    end_station_id,
    end_station_name,
    end_station_latitude,
    end_station_longitude,
    bike_id,
    user_type,
    member_birth_year,
    member_gender,
    case
        when bike_share_for_all_trip is not null then bike_share_for_all_trip
        when extracol = 'app' then 'No'
        when extracol = 'clipper' then 'Yes'
    end as bike_share_for_all_trip
from baywheels_tripdata_extracol;

--drop table with extracol, since it's been corrected and loaded into main table
drop table baywheels_tripdata_extracol;
Using a CASE statement, we can re-code the values from the rental_access_method column (loaded into the staging table as extracol) to Yes or No, using those values only when bike_share_for_all_trip is NULL. Within that same operation, we use INSERT INTO to insert the data into the main baywheels_tripdata table, and then, for database cleanliness, we drop the staging table.
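
As a final check (a small sketch, not part of the original pipeline), a GROUP BY on the main table should now show only Yes, No, and the NULLs carried over from 2017:

SQL

--sanity-check the merged column after inserting the staging rows
SELECT bike_share_for_all_trip, COUNT(*) AS n
FROM baywheels_tripdata
GROUP BY bike_share_for_all_trip;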

ELT: One More Tool in the Data Science Toolbox

In this post I’ve demonstrated that you can do all the data pre-processing for a publicly-available dataset without leaving the OmniSci platform. With the high-performance, columnar nature of OmniSci, complemented by modern GPU hardware, you can remove the need to build elaborate data processing pipelines before analyzing data. Instead, an ELT workflow can help your business be more agile in its data science explorations.


