Fast Data Pipeline Design: Updating Per-Event Decisions by Swapping Tables

Learn how to close the data loop to take knowledge from a big data system and apply this knowledge online to the real-time decision engine.

By John Piekos · Jul. 12, 17 · Tutorial

Fast data pipelines, a term used often by VoltDB, are a modern breed of application — one that combines streaming, or “fast data,” tightly with big data.

First, a quick high-level summary of the fast data pipeline architecture:

[Figure: fast data pipeline architecture]

The first thing to notice is that there is a tight coupling of fast and big, although they are separate systems. They have to be, at least at scale. The database system designed to work with millions of event decisions per second is wholly different from the system designed to hold petabytes of data and generate machine learning (ML) models.

There are a number of critical requirements to get the most out of a fast data pipeline. These include the ability to:

  • Ingest/interact with the data feed in real-time.
  • Make decisions on each event in the feed in real-time.
  • Provide visibility into fast-moving data with real-time analytics.
  • Seamlessly integrate into the systems designed to store big data.
  • Deliver analytic results (mined “knowledge”) from the big data systems quickly back to the decision engine, closing the data loop. This mined knowledge can then be used to inform per-event decisions.

Hundreds of fast data pipeline applications have been built and deployed using VoltDB as the fast operational database (the glue) between fast and big. These applications provide real-time decisioning engines in financial fraud detection, digital ad tech optimization, electric smart grids, mobile gaming, and IoT, among other industries.

This blog is going to drill into how to implement a specific portion of this fast data pipeline, namely the last bullet: the ability to close the data loop, taking knowledge from a big data system and applying it, online, to the real-time decision engine (VoltDB).

Closing the Data Loop

“Per-event decisioning” means that an action is computed for each incoming event (each transaction). Usually, some set of facts informs the decision, often computed from historical data. These “facts” could be captured in machine learning models or consist of a set of generated rules to be executed on each incoming event. Or, these facts could be represented as rows in a database table and used to filter and generate optimized decisions for each event. This blog post will focus on the latter: storing and updating facts represented in database tables.

When storing facts in database tables, each row corresponds to some bit of intelligence for a particular value or set of values. For example, the facts might be a pricing table for airline flights, where each row corresponds to a route and service level. Or the values might be a list of demographic segmentation buckets (median income, marital status, etc.) for browser cookies or device IDs, used to serve up demographic-specific ads.

Fact tables are application-specific, can be simple or sophisticated, and are often computed from a historical “big data” set held in Spark, Hadoop, a commercial data warehouse, or the like. Fact tables can be quite large and are often recomputed frequently, perhaps weekly, daily, or even hourly.
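
To make the airline-pricing example concrete, here is a minimal Python sketch of per-event decisioning against an in-memory fact table. The routes, service levels, and prices are invented for illustration; in a real pipeline, the fact rows would live in a VoltDB table and be computed offline by the big data system.

```python
# A tiny in-memory stand-in for a fact table: each "row" keys a
# (route, service_level) pair to a price computed offline.
FACTS = {
    ("BOS-SFO", "economy"): 349.00,
    ("BOS-SFO", "business"): 1249.00,
    ("BOS-LAX", "economy"): 329.00,
}

def price_for_event(route, service_level, default=499.00):
    """Per-event decision: look up the fact row for this event,
    falling back to a default when no fact exists."""
    return FACTS.get((route, service_level), default)

print(price_for_event("BOS-SFO", "economy"))   # 349.0
print(price_for_event("JFK-SEA", "economy"))   # 499.0 (no fact row)
```

Each incoming event (a booking request, say) triggers one lookup, so the decision cost stays constant no matter how often the facts are recomputed upstream.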

It is often important that the set of facts changes atomically. In other words, if airline prices are changing for tens of thousands of flights, all the prices should change at once, instantly. It is unacceptable for some transactions to reference older prices and others newer prices during the time it takes to load millions of rows of new data. This is challenging with large fact tables, as transactionally changing millions of values can be a slow, blocking operation. Locking a table, thus blocking ongoing operations, is unacceptable when your application is processing hundreds of thousands of transactions per second.
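
A tiny Python sketch makes the atomicity problem visible (the prices are hypothetical): with an in-place, row-by-row reload, a transaction that arrives mid-load observes a mix of old and new facts.

```python
old_prices = {"BOS-SFO": 349, "BOS-LAX": 329}
new_prices = {"BOS-SFO": 399, "BOS-LAX": 379}

# Naive in-place reload: update one route at a time.
live = dict(old_prices)
live["BOS-SFO"] = new_prices["BOS-SFO"]

# A transaction arriving *here*, mid-reload, sees inconsistent facts:
snapshot = dict(live)
print(snapshot)  # {'BOS-SFO': 399, 'BOS-LAX': 329} -> new and old mixed
```

Scale this from two rows to millions and the inconsistency window stretches from microseconds to minutes, which is exactly what the table-swap approach below avoids.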

VoltDB solves this challenge in a very simple and efficient manner: it can transactionally swap two tables in a single operation. This works as follows:

  1. Create an exact copy of your fact table schema, giving it a different name — perhaps Facts_Table and Facts_Table_2.
  2. Make sure the schemas are indeed identical (and neither is the source of a view).
  3. While your application is running (and consulting rows in Facts_Table to make decisions), populate Facts_Table_2 with the new set of data that you wish future transactions to consult. This table can be populated as slowly (or as quickly) as you like, perhaps over the course of a day.
  4. When Facts_Table_2 is populated and you are ready to make it “live” in your application, call the VoltDB system procedure @SwapTables. This operation essentially switches the data between the two tables by swapping internal memory pointers, so it typically executes in sub-millisecond to single-digit-millisecond time.
  5. At this point, all the data that was in Facts_Table_2 is now in Facts_Table, and the old data in Facts_Table now resides in Facts_Table_2. You may consider truncating Facts_Table_2 in preparation for your next refresh of facts (and to reduce your memory footprint).
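
The five steps above can be sketched in miniature in Python. The swap is a single reference exchange, the in-memory analogue of @SwapTables exchanging internal pointers; this illustrates the semantics only and is not VoltDB code.

```python
# Live fact table consulted by the decision path, and a staging
# copy with the same "schema" (here, just a list of rows).
facts_table = [("row", "old-1"), ("row", "old-2")]
facts_table_2 = []

# Step 3: populate the staging table at leisure while the
# application keeps reading facts_table.
facts_table_2.extend([("row", "new-1"), ("row", "new-2"), ("row", "new-3")])

# Step 4: the "swap" is a single reference exchange, so a reader
# sees either the entire old set or the entire new set, never a mix.
facts_table, facts_table_2 = facts_table_2, facts_table

# Step 5: truncate the staging side to reclaim memory before the
# next refresh cycle.
facts_table_2.clear()

print(facts_table)  # the three new rows are now live
```

Because the exchange is one atomic step, there is no window in which a transaction can observe a blend of old and new facts, regardless of how large the tables are.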

Let’s look at a contrived example using the VoltDB Voter sample application, a simple simulation of an American Idol voting system. Assume that each day you feature different contestants for whom callers can vote. Voting needs to occur 24×7, and the contestant list changes every day at midnight. We don’t want any downtime — no maintenance window, for example — when changing our contestant list.

Here’s what we need to do to the Voter sample to effect this behavior:

  1. First, we create an exact copy of our CONTESTANTS table, calling it CONTESTANTS_2:
    -- contestants_2 table holds the next day's contestant numbers
    -- (for voting) and names
    CREATE TABLE contestants_2
    (
      contestant_number integer     NOT NULL
    , contestant_name   varchar(50) NOT NULL
    , CONSTRAINT PK_contestants_2 PRIMARY KEY
      (
        contestant_number
      )
    );
  2. The schemas are identical, and this table is not the source of a materialized view.
  3. The voter application pre-loads the CONTESTANTS table at the start of the benchmark with the following contestants:
    1> select * from contestants;
    CONTESTANT_NUMBER CONTESTANT_NAME
    ------------------ ----------------
                     1 Edwina Burnam
                     2 Tabatha Gehling
                     3 Kelly Clauss
                     4 Jessie Alloway
                     5 Alana Bregman
                     6 Jessie Eichman
    
    $ cat contestants_2.csv
    1, Tom Brady
    2, Matt Ryan
    3, Aaron Rodgers
    4, Drew Brees
    5, Andrew Luck
    6, Kirk Cousins
    
    $ csvloader contestants_2 -f contestants_2.csv
    Read 6 rows from file and successfully inserted 6 rows (final)
    Elapsed time: 0.905 seconds
    $ sqlcmd
    SQL Command :: localhost:21212
    1> select * from contestants_2;
    CONTESTANT_NUMBER CONTESTANT_NAME
    ------------------ ----------------
                     1 Tom Brady
                     2 Matt Ryan
                     3 Aaron Rodgers
                     4 Drew Brees
                     5 Andrew Luck
                     6 Kirk Cousins

    (Returned 6 rows in 0.01s)
  4. Now that we have the new contestants (fact table) loaded and staged, when we’re ready (at midnight!), we’ll swap the two tables, making the new set of contestants immediately available for voting without interrupting the application. We’ll do this by calling the @SwapTables system procedure as follows:
    $ sqlcmd
     SQL Command :: localhost:21212
     1> exec @SwapTables contestants_2 contestants;
     modified_tuples
     ----------------
     12
    
    (Returned 1 rows in 0.02s)
     2> select * from contestants;
     CONTESTANT_NUMBER  CONTESTANT_NAME
     ------------------ ----------------
                      6 Kirk Cousins
                      5 Andrew Luck
                      4 Drew Brees
                      3 Aaron Rodgers
                      2 Matt Ryan
                      1 Tom Brady
    
    (Returned 6 rows in 0.01s)
  5. Finally, we’ll truncate the CONTESTANTS_2 table, initializing it once again so it’s ready to be loaded with the next day’s contestants:
    $ sqlcmd
     SQL Command :: localhost:21212
     1> truncate table contestants_2;
     (Returned 6 rows in 0.03s)
     2> select * from contestants_2;
     CONTESTANT_NUMBER  CONTESTANT_NAME
     ------------------ ----------------
    
    (Returned 0 rows in 0.00s)
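
The load/swap/truncate cycle above is straightforward to script as a nightly job. Here is a pure-Python simulation of the daily rotation; plain dicts stand in for the VoltDB tables, and no client calls are made.

```python
def rotate_contestants(live, staging, next_day_rows):
    """Simulate the nightly cycle: load staging, swap, truncate.
    Returns the new (live, staging) pair."""
    # Load tomorrow's contestants into the staging table
    # (the analogue of `csvloader contestants_2 -f ...`).
    staging.update(next_day_rows)
    # Swap: the analogue of `exec @SwapTables contestants_2 contestants`.
    live, staging = staging, live
    # Truncate the staging side so it is ready for the next refresh.
    staging.clear()
    return live, staging

contestants = {1: "Edwina Burnam", 2: "Tabatha Gehling"}
contestants_2 = {}
contestants, contestants_2 = rotate_contestants(
    contestants, contestants_2, {1: "Tom Brady", 2: "Matt Ryan"})
print(contestants)    # {1: 'Tom Brady', 2: 'Matt Ryan'}
print(contestants_2)  # {}
```

Run this once per day at midnight (from cron, a scheduler, or an application timer) and voting continues uninterrupted across the changeover.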

Note that steps 3-5 — loading, swapping, and truncating the new fact table — can all be done in an automated fashion, not manually as I have demonstrated with this simple example. Running the Voter sample and arbitrarily invoking @SwapTables during the middle of the run yielded the following results:

A total of 15,294,976 votes were received during the benchmark...
 - 15,142,056 Accepted
 -    152,857 Rejected (Invalid Contestant)
 -         63 Rejected (Maximum Vote Count Reached)
 -          0 Failed   (Transaction Error)

Contestant Name      Votes Received
 Tom Brady           4,472,147
 Kirk Cousins        3,036,647
 Andrew Luck         2,193,442
 Matt Ryan           1,986,615
 Drew Brees          1,963,903
 Aaron Rodgers       1,937,391

 The Winner is: Tom Brady

Apologies to those not New England-based! As you might have guessed, VoltDB’s headquarters are based just outside of Boston, Massachusetts!


Published at DZone with permission of John Piekos, DZone MVB. See the original article here.
