
1.3 Billion NYC Taxi Rows Into MemSQL Cloud


Get an illustration of rapid data ingest with MemSQL using real-world data from the New York City Taxi and Limousine Commission (NYC TLC) yellow taxi trip data set.


Experience teaches us that when loading data into a database, in whatever form (normalized, denormalized, schema-less, hierarchical, key-value, document, etc.), the devil is always in the data load.

In this tutorial, we’ll illustrate rapid data ingest with MemSQL using real-world data from the New York City Taxi and Limousine Commission (NYC TLC) yellow taxi trip data set, which is over 200GB of CSV files. We’ll break the tutorial into several sections:

  • Pre-configuration requirements for everything you need to complete the tutorial, including AWS S3 and MemSQL Cloud.
  • Downloading, compressing, and uploading the yellow taxi data to AWS S3.
  • Loading the data with MemSQL Pipelines in about 24 minutes.

Pre-Configuration Requirements

To complete this tutorial, you can use a Mac OS, Linux, or Windows machine, and you’ll need:

  • A Bash-compatible terminal.
  • An AWS account and S3 bucket with read and write privileges.
  • The AWS Command Line Interface (CLI).
  • A MySQL-compatible client to execute SQL statements.
  • A running MemSQL cluster.

AWS S3

First, log into your AWS account. To get started with AWS S3, review Getting Started with Amazon Simple Storage Service.

We will store approximately 44GB of compressed data versus 200GB of raw data, which typically costs about $4.50 a month. After loading the data into MemSQL, you can delete the S3 data and terminate the AWS account to avoid additional charges.

AWS Command Line Interface (CLI)

The AWS CLI lets you script S3 operations from the shell, such as creating a folder, listing folder contents, and uploading files. Follow the User Guide for installing the AWS CLI.
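Once installed, a quick sanity check from your terminal confirms the CLI is on your PATH (the exact version string will vary with your install):

```shell
# verify the AWS CLI is installed; print a hint if it is missing
if command -v aws >/dev/null 2>&1; then
  aws --version
else
  echo "aws CLI not found; install it first"
fi
```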

AWS Credentials

To upload data to S3 using the AWS CLI and to extract files from S3 with MemSQL Pipelines, you’ll need AWS credentials for an IAM user. You can generate an AWS Access Key ID and AWS Secret Access Key for an IAM user by following this guide.

S3 Bucket

Log into the AWS Console and create a new S3 bucket in the region for your account. If you are using MemSQL Cloud, create the S3 bucket in the us-east-1 region to avoid ingress and egress data charges. The NYC taxi data set includes yellow taxi, green vehicle, and FHV vehicle data. We’ll only use the yellow taxi data in our initial data load, as it is by far the largest data set. The yellow taxi schema changes over time, so we’ll create yellow taxi sub-folders to demarcate the schema changes. In your S3 bucket, create the following folder hierarchy:

|- nyc-taxi
|--- yellow_tripdata
|----- 200901-201412
|----- 201501-201606 
|----- 201607-201612
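You can create the hierarchy in the console, or script it with the AWS CLI: `aws s3api put-object` with a key ending in a trailing slash creates an empty object that the console renders as a folder. A sketch (`my-s3-bucket` is a placeholder; the actual upload call is commented out so the loop is safe to dry-run):

```shell
BUCKET="my-s3-bucket"   # placeholder: substitute your bucket name
for sub in 200901-201412 201501-201606 201607-201612; do
  KEY="nyc-taxi/yellow_tripdata/${sub}/"
  echo "creating ${KEY}"
  # uncomment to actually create the folder key in S3:
  # aws s3api put-object --bucket "$BUCKET" --key "$KEY"
done
```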

MySQL Compatible Client

MemSQL uses the open-source MySQL protocol, so to execute Data Definition Language (DDL) and Data Manipulation Language (DML) statements, you’ll need a MySQL-compatible client. Here are some popular, free clients:

  • MySQL Workbench
  • Sequel Pro
  • MySQL client for Linux (APT repository, Debian packages):
    • sudo apt-get update
    • sudo apt-get install mysql-client
  • MySQL client for Linux (RPM; Red Hat Enterprise Linux/Oracle Linux/CentOS 5 systems):
    • sudo yum install mysql-community-client mysql-5.*

MemSQL Cluster

A MemSQL cluster consists of at least one master aggregator node and one leaf node. Although this tutorial is primarily for a cluster in MemSQL Cloud, it is easy to adapt it to any MemSQL cluster environment.

MemSQL Cloud

To experience a turnkey enterprise version of MemSQL, sign up for MemSQL Cloud (free credits up to $300 are available to new customers). You’ll get the full power and speed of MemSQL, but as a managed service that features optimum performance and world-class support. Once you’ve signed up, get started by reviewing the MemSQL Cloud Overview. You’ll need to set up the Security Groups for your own virtual private cloud, then create a cluster based on resource tiers.

You can complete this tutorial using the Small, Medium, or Large cluster sizes. If you have a need for speed, choose Large for a 4-unit MemSQL Cloud cluster and expect the NYC taxi data set to ingest completely in under 24 minutes for a cost of about five dollars. Depending on the location of your S3 bucket, your mileage may vary because of the variable nature of AWS S3 network speeds.

You can spin up a cluster in minutes and terminate it at any time. To avoid using up all your free credits or incurring additional charges, spin up a cluster only when you are ready to load the data, and terminate it after completing this tutorial.

MemSQL On-Premises or Self-Managed Cloud

MemSQL runs natively on 64-bit Linux operating systems. Download and install MemSQL either on-premises or in your own self-managed cloud. For Windows or Mac OS, you can use the MemSQL Quickstart Docker image, which is a standalone edition of MemSQL. If using Windows 10, consider installing Bash using the Windows Subsystem for Linux Beta. This beta feature will install Ubuntu on Windows, and you can run MemSQL experimentally in this environment. Very cool.

NYC TLC Trip Data

The NYC TLC data files are as small as 300MB and as large as 2.5GB. With eight years of monthly data, yellow taxi information accounts for about 90% of all data. Many yellow taxi cab files are approximately 2.4GB. When compressed, the files range from 300MB to almost 600MB.

Download, Compress, and Upload the Data

Here’s a rudimentary Bash script that does the work of downloading each file, compressing it, uploading the compressed file to the proper folder in S3, and then deleting both the raw CSV and tar.gz files from the machine where the script runs. The script requires the AWS CLI.

#!/bin/bash
# Change S3_BUCKET to your S3 bucket
export S3_BUCKET="s3://my-s3-bucket"
# Change S3_BASE_FOLDER to your base folder, but should be nyc-taxi if you are following the tutorial
export S3_BASE_FOLDER="nyc-taxi"
# Change AWS_ACCESS_KEY_ID to your key
export AWS_ACCESS_KEY_ID=MY_ACCESS_KEY
# Change AWS_SECRET_ACCESS_KEY to your secret key
export AWS_SECRET_ACCESS_KEY=MY_SECRET_KEY
# Change AWS_DEFAULT_PROFILE to your AWS profile user
export AWS_DEFAULT_PROFILE=my-aws-profile-user
# MemSQL Cloud is us-east-1; changing can result in AWS egress and ingress charges
export AWS_DEFAULT_REGION=us-east-1
aws configure --profile $AWS_DEFAULT_PROFILE 
# do not change URL_ROOT
URL_ROOT="https://s3.amazonaws.com/nyc-tlc/trip+data/"
# modify if needed for smaller subsets or add green and fhv for additional cab_types
MONTH_ORDINALS=("01" "02" "03" "04" "05" "06" "07" "08" "09" "10" "11" "12")
YEAR_ORDINALS=("2009" "2010" "2011" "2012" "2013" "2014" "2015" "2016") 
CAB_TYPES=("yellow")
# leave as empty
FILE_NAME=""
S3_FOLDER=""
S3_SUBFOLDER=""
for name in ${CAB_TYPES[@]}
  do
    if [ $name == "yellow" ]; then
      S3_FOLDER="yellow_tripdata"
      YEARS=${YEAR_ORDINALS[@]}
    fi
    for yy in ${YEARS[@]}
      do
        MONTHS=${MONTH_ORDINALS[@]}
        for mm in ${MONTHS[@]}
        do
          FILE_NAME=${name}_tripdata_${yy}-${mm}
          # get the csv file 
          curl -S -O "${URL_ROOT}${FILE_NAME}.csv" && echo "done! curl ${FILE_NAME}.csv" &
          wait
          # tarball the file
          tar -cvzf "${FILE_NAME}.tar.gz" "${FILE_NAME}.csv"  && echo "done! tar ${FILE_NAME}.tar.gz" &
          wait
          # upload to AWS S3 the gz file
          if [[ $name == "yellow" && ( $yy == "2015" || ( $yy == "2016" && $mm < "07" ) ) ]]; then
            S3_SUBFOLDER="201501-201606"
          elif [[ $name == "yellow" && $yy == "2016" ]]; then
            S3_SUBFOLDER="201607-201612"
          else
            S3_SUBFOLDER="200901-201412"
          fi
          if [ $name == "yellow" ]; then
            aws s3 cp ${FILE_NAME}.tar.gz ${S3_BUCKET}/${S3_BASE_FOLDER}/${S3_FOLDER}/${S3_SUBFOLDER}/ --profile $AWS_DEFAULT_PROFILE && echo "done! aws s3 cp ${FILE_NAME}.tar.gz" &  
          fi
          wait
          # rm the csv files
          rm -f "${FILE_NAME}.csv" && echo "done! rm -f ${FILE_NAME}.csv" &
          wait
          #rm the gz files
          rm -f "${FILE_NAME}.tar.gz"  && echo "done! rm -f ${FILE_NAME}.tar.gz" &
          wait
        done
      done
  done

In order for the script to work with your S3 bucket and folder, specify values for these variables:

  • S3_BUCKET 
  • S3_BASE_FOLDER 
  • AWS_ACCESS_KEY_ID 
  • AWS_SECRET_ACCESS_KEY 

Save your changes and name the file nyc_tlc_taxi_files.sh. In the terminal where you are going to run the shell script, make it executable: chmod 777 nyc_tlc_taxi_files.sh.

Next, create an AWS profile using the AWS CLI:

AWS_DEFAULT_PROFILE=my-profile
aws configure --profile $AWS_DEFAULT_PROFILE

At the prompts, specify your values for the AWS Access Key ID and AWS Secret Access Key. For Default region name, specify us-east-1. For Default output format, specify JSON.

From a Bash shell, run the nyc_tlc_taxi_files.sh script. There is no need to specify the AWS profile settings again, so press Enter at each prompt. The script outputs the successful processing of each file:

HAL2001:shellscripts Seth$ ./nyc_tlc_taxi_files.sh
AWS Access Key ID [****************7G6Q]:
AWS Secret Access Key [****************EmaN]:
Default region name [us-east-1]:
Default output format [json]:
  % Total    % Received % Xferd  Average Speed   Time    Time     Time  Current
                                 Dload  Upload   Total   Spent    Left  Speed
100 2420M  100 2420M    0     0  4398k      0  0:09:23  0:09:23 --:--:-- 5915k
done! curl yellow_tripdata_2009-01.csv
a yellow_tripdata_2009-01.csv
done! tar yellow_tripdata_2009-01.tar.gz
upload: ./yellow_tripdata_2009-01.tar.gz to s3://my-s3-bucket/my-nyc-taxi-folder/yellow_tripdata/200901-201412/yellow_tripdata_2009-01.tar.gz
done! aws s3 cp yellow_tripdata_2009-01.tar.gz
done! rm -f yellow_tripdata_2009-01.csv
done! rm -f yellow_tripdata_2009-01.tar.gz

Downloading, compressing, and uploading the 96 yellow taxi monthly data files using your local machine will take around six to eleven hours, depending on your connection speeds. So, if you are following along on your laptop, you may want to run the Bash shell script during the night, keeping your machine plugged in with sleep disabled.

To Compress or Not to Compress?

With AWS S3, you ultimately pay for storage size and for data transmission. Depending on the region, it costs on average about five dollars a month to store 200GB. Compression gets that 200GB down to under 50GB. Because a MemSQL Pipeline for S3 can process compressed files, you will save money by compressing large data sets on AWS S3.
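Some back-of-the-envelope math, assuming S3 Standard storage in us-east-1 at roughly $0.023 per GB-month (verify current pricing for your region):

```shell
# approximate monthly S3 Standard storage cost at an assumed ~$0.023/GB-month
awk 'BEGIN {
  rate = 0.023                      # USD per GB-month (assumption; check current pricing)
  printf "raw 200GB:       $%.2f/month\n", 200 * rate
  printf "compressed 44GB: $%.2f/month\n", 44 * rate
}'
```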

Load the NYC Yellow Taxi Trip Data

Once the Bash script completes and all the NYC taxi trip data is in your S3 bucket folders, we can load it with MemSQL S3 Pipelines.

MemSQL Database and Tables

To get started, we’ll create a database and two tables. Because the yellow taxi data has several schema changes over eight years as described in the data dictionary, we’ll create a staging table to handle the various data formats. After the initial S3 Pipeline data ingest, we’ll insert the staging data into a destination table.

Using a MySQL compatible client, connect to the DDL endpoint for your MemSQL Cloud cluster. If not using MemSQL Cloud, connect to your Master Aggregator to run DDL commands. To create the database and tables, execute the following SQL:

DROP DATABASE IF EXISTS nyc_taxi;
CREATE DATABASE IF NOT EXISTS nyc_taxi;

USE nyc_taxi;

DROP TABLE IF EXISTS yellow_trips_staging;
CREATE TABLE IF NOT EXISTS yellow_trips_staging (
  vendor_id varchar(3) NOT NULL DEFAULT '0',
  pickup_datetime datetime NOT NULL,
  dropoff_datetime datetime NOT NULL,
  pickup_longitude DOUBLE NOT NULL DEFAULT 0,
  pickup_latitude DOUBLE NOT NULL DEFAULT 0,
  passenger_count tinyint NOT NULL DEFAULT 0,
  trip_distance decimal(6,2) NOT NULL DEFAULT 0,
  rate_code tinyint NOT NULL DEFAULT 0,
  store_and_fwd_flag varchar(1) NOT NULL DEFAULT 'N',
  dropoff_longitude DOUBLE NOT NULL DEFAULT 0,
  dropoff_latitude DOUBLE NOT NULL DEFAULT 0,
  PULocationID smallint(3) NOT NULL DEFAULT 0,
  DOLocationID smallint(3) NOT NULL DEFAULT 0,
  payment_type tinyint NOT NULL DEFAULT 0,
  fare_amount decimal(6,2) NOT NULL DEFAULT 0,
  surcharge decimal(6,2) NOT NULL DEFAULT 0, 
  extra decimal(6,2) NOT NULL DEFAULT 0, 
  mta_tax decimal(6,2) NOT NULL DEFAULT 0,
  tip_amount decimal(6,2) NOT NULL DEFAULT 0,
  tolls_amount decimal(6,2) NOT NULL DEFAULT 0,
  improvement_surcharge decimal(6,2) NOT NULL DEFAULT 0,
  total_amount decimal(7,2) NOT NULL DEFAULT 0,
  colA varchar(1), 
  colB varchar(1), 
  key(pickup_datetime, dropoff_datetime) USING CLUSTERED COLUMNSTORE
);

DROP TABLE IF EXISTS yellow_trips;
CREATE TABLE IF NOT EXISTS yellow_trips (
  vendor_id varchar(3) NOT NULL DEFAULT '0',
  pickup_datetime datetime NOT NULL,
  pickup_year smallint(4) NOT NULL DEFAULT 0,
  pickup_month tinyint(2) NOT NULL DEFAULT 0,
  pickup_day tinyint(2) NOT NULL DEFAULT 0,
  pickup_week tinyint(2) NOT NULL DEFAULT 0,
  pickup_time TIME NOT NULL DEFAULT '00:00:00',
  dropoff_datetime datetime NOT NULL,
  dropoff_year smallint(4) NOT NULL DEFAULT 0,
  dropoff_month tinyint(2) NOT NULL DEFAULT 0,
  dropoff_day tinyint(2) NOT NULL DEFAULT 0,
  dropoff_week tinyint(2) NOT NULL DEFAULT 0,
  dropoff_time TIME NOT NULL DEFAULT '00:00:00',
  pickup_pt geographypoint  not null default 'POINT(0 0)',
  dropoff_pt geographypoint not null default 'POINT(0 0)',
  PULocationID smallint(3) NOT NULL DEFAULT 0,
  DOLocationID smallint(3) NOT NULL DEFAULT 0,
  passenger_count tinyint(2) NOT NULL DEFAULT 0,
  trip_distance decimal(6,2) NOT NULL DEFAULT 0,
  rate_code tinyint NOT NULL DEFAULT 0,
  store_and_fwd_flag varchar(1) NOT NULL DEFAULT 'N',
  payment_type tinyint NOT NULL DEFAULT 0,
  fare_amount decimal(6,2) NOT NULL DEFAULT 0,
  surcharge decimal(6,2) NOT NULL DEFAULT 0,
  extra decimal(6,2) NOT NULL DEFAULT 0, 
  mta_tax decimal(6,2) NOT NULL DEFAULT 0,
  tip_amount decimal(6,2) NOT NULL DEFAULT 0,
  tolls_amount decimal(6,2) NOT NULL DEFAULT 0,
  improvement_surcharge decimal(6,2) NOT NULL DEFAULT 0, 
  total_amount decimal(7,2) NOT NULL DEFAULT 0,
  shard key(pickup_datetime, dropoff_datetime),
  key(pickup_datetime, dropoff_datetime) USING CLUSTERED COLUMNSTORE
);

MemSQL Pipelines

MemSQL Pipelines are a native construct in MemSQL, declared ever so simply, but ever so powerfully, with a CREATE PIPELINE statement. The pipeline syntax lets us extract one or more gzip-compressed (or uncompressed) source files from an S3 bucket folder, then load the data into a destination database table. In other words, we’ll end up with three MemSQL Pipelines to match our three S3 folders.

Create Pipelines

You’ll need to specify your AWS_SECRET_ACCESS_KEY and AWS_ACCESS_KEY_ID credentials in the following CREATE PIPELINE statements. In the statements, we specify to skip all CSV parsing errors. The three pipelines will load data rapidly and in parallel into the staging table:

USE nyc_taxi;
DROP PIPELINE IF EXISTS pipeline_yellow_200901_201412;
CREATE PIPELINE IF NOT EXISTS pipeline_yellow_200901_201412 AS
  -- IMPORTANT: CHANGE my-s3-bucket 
  LOAD DATA S3 "my-s3-bucket/nyc-taxi/yellow_tripdata/200901-201412/" 
  -- IMPORTANT: CHANGE my_aws_secret_access_key and my_aws_access_key_id 
  CREDENTIALS '{"aws_secret_access_key": "my_aws_secret_access_key", "aws_access_key_id": "my_aws_access_key_id"}' 
  SKIP ALL ERRORS
  INTO TABLE yellow_trips_staging
  FIELDS TERMINATED BY ','
  (vendor_id,pickup_datetime,dropoff_datetime,passenger_count,trip_distance,pickup_longitude,pickup_latitude,rate_code,store_and_fwd_flag,dropoff_longitude,dropoff_latitude,payment_type,fare_amount,surcharge,mta_tax,tip_amount,tolls_amount,total_amount);

DROP PIPELINE IF EXISTS pipeline_yellow_201501_201606;
CREATE PIPELINE IF NOT EXISTS pipeline_yellow_201501_201606 AS
  -- IMPORTANT: CHANGE my-s3-bucket 
  LOAD DATA S3 "my-s3-bucket/nyc-taxi/yellow_tripdata/201501-201606/" 
  -- IMPORTANT: CHANGE my_aws_secret_access_key and my_aws_access_key_id 
  CREDENTIALS '{"aws_secret_access_key": "my_aws_secret_access_key", "aws_access_key_id": "my_aws_access_key_id"}'
  SKIP ALL ERRORS
  INTO TABLE yellow_trips_staging
  FIELDS TERMINATED BY ','
  (vendor_id,pickup_datetime,dropoff_datetime,passenger_count,trip_distance,pickup_longitude,pickup_latitude,rate_code,store_and_fwd_flag,dropoff_longitude,dropoff_latitude,payment_type,fare_amount,extra,mta_tax,tip_amount,tolls_amount,improvement_surcharge,total_amount);

DROP PIPELINE IF EXISTS pipeline_yellow_201607_201612;
CREATE PIPELINE IF NOT EXISTS pipeline_yellow_201607_201612 AS
  -- IMPORTANT: CHANGE my-s3-bucket 
  LOAD DATA S3 "my-s3-bucket/nyc-taxi/yellow_tripdata/201607-201612/" 
  -- IMPORTANT: CHANGE my_aws_secret_access_key and my_aws_access_key_id 
  CREDENTIALS '{"aws_secret_access_key": "my_aws_secret_access_key", "aws_access_key_id": "my_aws_access_key_id"}'
  SKIP ALL ERRORS
  INTO TABLE yellow_trips_staging
  FIELDS TERMINATED BY ','
  (vendor_id,pickup_datetime,dropoff_datetime,passenger_count,trip_distance,rate_code,store_and_fwd_flag,PULocationID,DOLocationID,payment_type,fare_amount,extra,mta_tax,tip_amount,tolls_amount,improvement_surcharge,total_amount,@,@);

Now, we can start the pipelines with:

USE nyc_taxi;
START PIPELINE pipeline_yellow_200901_201412;
START PIPELINE pipeline_yellow_201501_201606;
START PIPELINE pipeline_yellow_201607_201612;

Verify that your pipelines are running:

USE nyc_taxi;
SHOW PIPELINES;

You can view the files in the batch as identified by BATCH_ID in the following:

USE nyc_taxi;
SELECT * FROM information_schema.PIPELINES_FILES WHERE DATABASE_NAME = 'nyc_taxi';

Parallel Loading With MemSQL Pipelines

To determine the number of files in a pipeline batch, MemSQL uses the database partition count: SHOW PARTITIONS ON nyc_taxi;.

The number of distinct master partition ordinals equals the total number of database partitions.

Load Skew

Smaller files in a pipeline batch will load faster than larger files, and the next pipeline batch begins only when all files in the current batch have loaded successfully in parallel. For the NYC yellow taxi trip data set, the monthly files vary considerably in size (compressed, from roughly 300MB to 600MB), so there will be a degree of load skew. The pipeline with the most files to process is pipeline_yellow_200901_201412: it processes six years of monthly data, which is 72 files. A database with 32 partitions will create three pipeline batches for a pipeline with 72 files:

SELECT PIPELINE_NAME, BATCH_ID, count(*) as File_Count
FROM information_schema.PIPELINES_FILES 
WHERE DATABASE_NAME = 'nyc_taxi' 
GROUP BY PIPELINE_NAME, BATCH_ID 
ORDER BY PIPELINE_NAME, BATCH_ID;
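The three-batch figure follows from ceiling division of the file count by the partition count:

```shell
# number of pipeline batches = ceil(files / partitions)
awk 'BEGIN {
  files = 72; partitions = 32
  batches = int((files + partitions - 1) / partitions)
  print batches " batches: two full batches of 32 files, then one of 8"
}'
```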

To view the overall batch time in minutes, the total number of rows, and rows per second, execute the following query:

SELECT Round((SUM(v1.BATCH_TIME) / 60),2) as Total_Minutes, FORMAT(SUM(v1.BATCH_ROWS_WRITTEN),0) as Total_Rows
  ,FORMAT(SUM(v1.BATCH_ROWS_WRITTEN)/ (SUM(v1.BATCH_TIME)),3) as Rows_Per_Sec
FROM (
  SELECT DISTINCT t1.BATCH_ID, t1.BATCH_TIME, t1.BATCH_ROWS_WRITTEN
  FROM INFORMATION_SCHEMA.PIPELINES_BATCHES as t1
  WHERE t1.BATCH_STATE = 'Succeeded' AND t1.DATABASE_NAME = 'nyc_taxi'
 ) as v1;

The results for loading all the yellow taxi data into a large-sized cluster in MemSQL Cloud are:

  • Total_Minutes = 23.94
  • Total_Rows = 1,308,985,065
  • Rows_Per_Sec = 911,305.258
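As a sanity check, the throughput figure is consistent with the totals (small differences come from Total_Minutes being rounded to two decimals):

```shell
# rows per second = total rows / (total minutes * 60)
awk 'BEGIN { printf "%.0f rows/sec\n", 1308985065 / (23.94 * 60) }'
```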

Now that the load is complete, we can stop the pipelines.

USE nyc_taxi;
STOP PIPELINE pipeline_yellow_200901_201412;
STOP PIPELINE pipeline_yellow_201501_201606;
STOP PIPELINE pipeline_yellow_201607_201612;
SHOW PIPELINES;

From Staging to Destination

With 1.3 billion rows, there will be a few load errors. We can view the files with the most errors using the following query:

SELECT v1.BATCH_SOURCE_PARTITION_ID, v1.ERROR_CODE, COUNT(v1.ERROR_ID) as Total_Errors
FROM (
  SELECT BATCH_SOURCE_PARTITION_ID,ERROR_CODE,ERROR_ID
  FROM information_schema.PIPELINES_ERRORS
  ) as v1
GROUP BY v1.BATCH_SOURCE_PARTITION_ID, v1.ERROR_CODE
ORDER BY total_errors DESC;

There are two files with around 650 row-related errors out of 1.308 billion rows. Of course, there are also a few issues with the data itself: rows with 0000-00-00 00:00:00 datetimes and rows with meaningless longitudes and latitudes. We can clean these up with some basic DELETE and UPDATE statements that take less than 20 seconds to complete.
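The coordinate cleanup amounts to zeroing any point outside the valid WGS84 ranges of ±180 longitude and ±90 latitude (the SQL additionally treats exact ±180/±90 values as placeholders). The same check, expressed as a quick awk filter over hypothetical lon,lat pairs:

```shell
# zero out coordinate pairs outside valid WGS84 bounds, mirroring the SQL cleanup
printf '%s\n' "-200.50,95.00" "-73.98,40.75" | awk -F, '{
  lon = $1 + 0; lat = $2 + 0
  if (lon < -180 || lon > 180 || lat < -90 || lat > 90) { lon = 0; lat = 0 }
  printf "%.2f,%.2f\n", lon, lat
}'
# first pair is out of range and is zeroed; second (a midtown Manhattan point) passes through
```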

USE nyc_taxi;
DELETE FROM yellow_trips_staging WHERE pickup_datetime = '0000-00-00 00:00:00' AND dropoff_datetime = '0000-00-00 00:00:00';
UPDATE yellow_trips_staging SET pickup_longitude = 0.0, pickup_latitude = 0.0 WHERE (pickup_longitude < -180.00 OR pickup_latitude > 90.00);
UPDATE yellow_trips_staging SET pickup_longitude = 0.0, pickup_latitude = 0.0 WHERE (pickup_longitude > 180.00 OR pickup_latitude < -90.00);
UPDATE yellow_trips_staging SET pickup_longitude = 0.0, pickup_latitude = 0.0 WHERE (pickup_longitude = 180.00 OR pickup_latitude = 90.00);
UPDATE yellow_trips_staging SET pickup_longitude = 0.0, pickup_latitude = 0.0 WHERE (pickup_longitude = -180.00 OR pickup_latitude = -90.00);
UPDATE yellow_trips_staging SET dropoff_longitude = 0.0, dropoff_latitude = 0.0 WHERE (dropoff_longitude < -180.00 OR dropoff_latitude > 90.00);
UPDATE yellow_trips_staging SET dropoff_longitude = 0.0, dropoff_latitude = 0.0 WHERE (dropoff_longitude > 180.00 OR dropoff_latitude < -90.00);
UPDATE yellow_trips_staging SET dropoff_longitude = 0.0, dropoff_latitude = 0.0 WHERE (dropoff_longitude = 180.00 OR dropoff_latitude = 90.00);
UPDATE yellow_trips_staging SET dropoff_longitude = 0.0, dropoff_latitude = 0.0 WHERE (dropoff_longitude = -180.00 OR dropoff_latitude = -90.00);

With the data cleaned up, we are now ready to copy it into our destination table. As you may recall, the DDL for the yellow_trips table specifies the pickup and dropoff points as the MemSQL geographypoint data type, enabling us to write fast geospatial queries moving forward.

USE nyc_taxi;
INSERT INTO yellow_trips
SELECT vendor_id, pickup_datetime,year(pickup_datetime) as pickup_year, month(pickup_datetime) as pickup_month, day(pickup_datetime) as pickup_day
, week(pickup_datetime) as pickup_week, time(pickup_datetime) as pickup_time, dropoff_datetime, year(dropoff_datetime) as dropoff_year
, month(dropoff_datetime) as dropoff_month, day(dropoff_datetime) as dropoff_day, week(dropoff_datetime) as dropoff_week, time(dropoff_datetime) as dropoff_time
, GEOGRAPHY_POINT(pickup_longitude , pickup_latitude) as pickup_pt, GEOGRAPHY_POINT(dropoff_longitude, dropoff_latitude) as dropoff_pt
, PULocationID, DOLocationID, passenger_count, trip_distance, rate_code, store_and_fwd_flag, payment_type, fare_amount, surcharge, extra, mta_tax
, tip_amount , tolls_amount, improvement_surcharge, total_amount 
FROM yellow_trips_staging;

Finally, we’ll clean up the staging table and update the statistics for the yellow_trips table.

USE nyc_taxi;
TRUNCATE TABLE yellow_trips_staging;
ANALYZE TABLE yellow_trips;
OPTIMIZE TABLE yellow_trips;

Now, we’re ready to start analyzing the data!

Data Science Inspiration

For those looking for inspiration about what and how to analyze the NYC yellow taxi data, take a look at Todd W. Schneider’s now-famous blog post Analyzing 1.1 Billion NYC Taxi and Uber Trips, with a Vengeance. As Todd writes, “taken as a whole, the detailed trip-level data is more than just a vast list of taxi pickup and drop off coordinates: it’s a story of New York.”

Next Steps

The MemSQL story is one of relentless innovation and brilliant engineering.

Just five years ago, MemSQL posted “Loading half a billion records in 40 minutes”. That blog details how to chunk up manufactured data into 2000 equally sized files in order to load 500 million rows into MemSQL on a machine with 512GB memory and 64 cores.

Five years later, you can spin up a database cluster in MemSQL Cloud in minutes, and in about half the time load 1.3 billion rows of real-world data with a few simple CREATE PIPELINE statements. Try it yourself and sign up for MemSQL Cloud. New customers receive a $300 free credit.

To preview the next chapter in the amazing MemSQL story that includes an 80X improvement in columnstore query execution and AI similarity matching with SIMD functions, sign up for the MemSQL 6 Beta.



Published at DZone with permission of Seth Luersen, DZone MVB. See the original article here.

