How to Update Hive Tables the Easy Way

Learn simple ways to manage data in your Apache Hive tables using the new functions made available in HDP 2.6.

By Carter Shanklin · Aug. 23, 17 · Tutorial


This is Part 1 of a 2-part series on how to update Hive tables the easy way. Stay tuned for the next part, coming soon!

Historically, keeping data up-to-date in Apache Hive required custom application development that is complex, non-performant, and difficult to maintain. HDP 2.6 radically simplifies data maintenance with the introduction of SQL MERGE in Hive, complementing existing INSERT, UPDATE, and DELETE capabilities.

This blog shows how to solve common data management problems, including:

  • Upserting to synchronize Hive data with a source RDBMS.
  • Updating the partition where data lives in Hive.
  • Selectively masking or purging data in Hive.

In a later blog, we’ll show how to manage slowly-changing dimensions (SCDs) with Hive.

The Basics: SQL MERGE, UPDATE and DELETE

These SQL features are the foundation for keeping data up-to-date in Hadoop, so let’s take a quick look at them.

MERGE was standardized in SQL:2003 and is a powerful SQL statement that allows inserting, updating, and deleting data in a single statement. It makes it easy to keep two systems consistent. Let’s look at the SQL specification for MERGE (slightly simplified):

MERGE INTO <target table>
 USING <table reference>
ON <search condition>
 <merge when clause>...
WHEN MATCHED [ AND <search condition> ]
THEN <merge update or delete specification>
WHEN NOT MATCHED [ AND <search condition> ]
THEN <merge insert specification>

MERGE is powerful because you can specify as many WHEN MATCHED/WHEN NOT MATCHED clauses as you want.

In this blog, we’ll also use the more familiar UPDATE statement, which looks like this:

UPDATE <target table>
 SET <set clause list>
 [ WHERE <search condition> ]

UPDATE comes into play when you don’t need to mix INSERTs and UPDATEs together in the same statement.
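DELETE rounds out the trio named in this section’s title. Its simplified SQL form, which Hive likewise supports on transactional tables, looks like this:

```sql
DELETE FROM <target table>
 [ WHERE <search condition> ]
```

Without a WHERE clause, DELETE removes every row in the table, so in practice it is almost always paired with a search condition.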

Get Ready to Keep Data Fresh

With HDP 2.6, there are two things you need to do to allow your tables to be updated.

First, you need to configure your system to allow Hive transactions. In Ambari, this just means toggling the ACID Transactions setting on.

ACID Transactions
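If you manage configuration by hand rather than through Ambari, the toggle corresponds roughly to a handful of hive-site properties. The settings below are the commonly documented ones for HDP-era Hive ACID; treat the exact values as a sketch and confirm them against your platform’s documentation:

```sql
-- Approximate hand-set equivalents of Ambari's "ACID Transactions" toggle.
set hive.support.concurrency = true;
set hive.txn.manager = org.apache.hadoop.hive.ql.lockmgr.DbTxnManager;
-- Compactor settings belong on the metastore side and typically
-- cannot be changed per-session:
set hive.compactor.initiator.on = true;
set hive.compactor.worker.threads = 1;
```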

Second, your table must be a transactional table. That means the table must be clustered, be stored as ORCFile data, and have a table property that says "transactional" = "true". Here’s an example:

create table customer_partitioned
 (id int, name string, email string, state string)
 partitioned by (signup date)
 clustered by (id) into 2 buckets stored as orc
 tblproperties("transactional"="true");

Use Case 1: Hive Upserts

Suppose you have a source database that you want to load into Hadoop to run large-scale analytics. Records in the source RDBMS are constantly being added and modified, and there’s no log to help you understand which records have changed. Instead, to keep things simple, you just do a full dump every 24 hours and update the Hadoop side to make it a mirror image of the source side.

Let’s create our managed table as follows:

create table customer_partitioned
(id int, name string, email string, state string)
 partitioned by (signup date)
 clustered by (id) into 2 buckets stored as orc
 tblproperties("transactional"="true");

Suppose our source data at Time = 1 looks like this:

And the refreshed load at Time = 2 looks like this:
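The MERGE statement below assumes the refreshed dump has been landed in a staging table named all_updates whose columns mirror the managed table. One hypothetical way to define and load it (the file path, delimiter, and format here are illustrative, not from the original setup):

```sql
-- Hypothetical staging table for the nightly full dump.
-- Plain text is fine here; only the merge target needs to be ORC/transactional.
create table all_updates
  (id int, name string, email string, state string, signup date)
  row format delimited fields terminated by ','
  stored as textfile;

load data inpath '/staging/customer_dump.csv' into table all_updates;
```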

Upsert combines updates and inserts into one operation, so you don’t need to worry about whether records already exist in the target table. MERGE is designed with this use case in mind, and the basic usage is extremely simple:

merge into customer_partitioned
 using all_updates on customer_partitioned.id = all_updates.id
 when matched then update set
   email=all_updates.email,
   state=all_updates.state
 when not matched then insert
   values(all_updates.id, all_updates.name, all_updates.email,
   all_updates.state, all_updates.signup);

Notice that we use both when matched and when not matched conditions to manage updates and inserts, respectively. After the merge process, the managed table is identical to the staged table at T = 2, and all records are in their respective partitions.
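A quick sanity check after the merge is to list the table’s partitions and spot-check a few rows, for example:

```sql
-- List the partitions the merge created or touched.
show partitions customer_partitioned;

-- Spot-check that each row landed in the partition matching its signup date.
select id, name, signup from customer_partitioned;
```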

Use Case 2: Update Hive Partitions

A common strategy in Hive is to partition data by date. This simplifies data loads and improves performance. Regardless of your partitioning strategy, you will occasionally have data in the wrong partition. For example, suppose customer data is supplied by a third party and includes a customer signup date. If the provider had a software bug and needed to change customer signup dates, suddenly records are in the wrong partition and need to be cleaned up.

Suppose our initial data looks like this:

And our second load looks like this:

Notice that ID 2 has the wrong Signup date at T = 1 and is in the wrong partition in the Hive table. This needs to be updated somehow so that ID 2 is removed from partition 2017-01-08 and added to 2017-01-10.
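A plain UPDATE can’t fix this, because Hive does not allow partitioning (or bucketing) columns in a SET clause. Something like the following sketch would simply be rejected:

```sql
-- Rejected by Hive: signup is a partitioning column and can't be updated.
update customer_partitioned
   set signup = date '2017-01-10'
 where id = 2;
```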

Before MERGE, it was nearly impossible to manage these partition key changes. Hive’s MERGE statement doesn’t natively support updating the partition key, but here’s a trick that makes it easy anyway: we introduce a delete marker, which we set any time the partition keys don’t match, and UNION this with a second query that produces an extra row on the fly for each of these non-matching records. The code makes the process clearer:

merge into customer_partitioned
 using (
   -- Updates with matching partitions, or net new records.
   select
     case
       when all_updates.signup <> customer_partitioned.signup then 1
       else 0
     end as delete_flag,
     all_updates.id as match_key,
     all_updates.*
   from all_updates left join customer_partitioned
     on all_updates.id = customer_partitioned.id

   union all

   -- Produce new records when partitions don’t match.
   select 0, null, all_updates.*
   from all_updates, customer_partitioned
   where all_updates.id = customer_partitioned.id
     and all_updates.signup <> customer_partitioned.signup
 ) sub
 on customer_partitioned.id = sub.match_key
 when matched and delete_flag = 1 then delete
 when matched and delete_flag = 0 then
   update set email = sub.email, state = sub.state
 when not matched then
   insert values (sub.id, sub.name, sub.email, sub.state, sub.signup);

After the MERGE process, the managed table and source table are synchronized even though records needed to change partitions. Best of all, this was done in a single operation with full atomicity and isolation.

Use Case 3: Mask or Purge Hive Data

Let’s say that your security office comes to you one day and says that all data from a particular customer needs to be masked or purged. In the past, you could have spent hours or days rewriting affected partitions.

Suppose our Contacts table looks like this:

And we created our table as follows:

create table contacts
 (id int, name string, customer string, phone string)
 clustered by (id) into 2 buckets stored as orc 
tblproperties("transactional"="true");

The security office approaches us, asking for various actions:

  1. Example: Mask all contact phone numbers from customer MaxLeads. Here, we use Hive’s built-in masking capabilities (as of HDP 2.5 or Apache Hive 2.1):

    update contacts set phone = mask(phone) where customer = 'MaxLeads';
  2. Example: Purge all records from customer LeadMax.

    delete from contacts where customer = 'LeadMax';
  3. Example: Purge records matching a given list of keys. Suppose the security office gave us a CSV with certain keys and asked us to delete records matching those keys. We load the security office’s CSV into a table and get the list of keys using a subquery.

    delete from contacts where id in ( select id from purge_list );
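The purge_list table in the last example holds the keys from the security office’s CSV. A minimal sketch of creating and loading it (the path and delimiter are assumptions for illustration):

```sql
-- Hypothetical table holding the keys to purge, loaded from the CSV.
create table purge_list (id int)
  row format delimited fields terminated by ','
  stored as textfile;

load data inpath '/incoming/purge_keys.csv' into table purge_list;
```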

Conclusion

Hive’s MERGE and ACID transactions make data management in Hive simple, powerful, and compatible with existing EDW platforms that have been in use for many years. Stay tuned for the next blog in this series, where we will show how to manage slowly-changing dimensions in Hive.


Published at DZone with permission of Carter Shanklin, DZone MVB. See the original article here.

