
Leveraging Your Db2 Skills With Big Data

How can you leverage your Db2 skills with Big Data?

By Jim Wankowski · Sep. 10, 2019 · Opinion


The Challenge

The idea of the traditional data center being centered on relational database technology is quickly evolving. Many data sources exist today that did not exist as little as five years ago. Active machine sensors on machinery, autos, and aircraft, medical sensors, and RFID tags, along with social media and web click-through activity, are creating tremendous volumes of mostly unstructured data that cannot practically be stored or analyzed in traditional RDBMSs.

These new data sources are pushing companies to explore Big Data and the Hadoop architecture, which in turn creates a new set of problems for corporate IT. Hadoop development and administration can be complicated and time-consuming. Writing the complex MapReduce programs needed to mine this data is a very specialized skill, so companies must either invest in training their existing personnel or hire people who specialize in MapReduce programming and administration. This is the very reason many enterprises have been hesitant to invest in Big Data applications.


Leveraging Existing Skills

A solution to this problem is to leverage existing SQL skills to analyze Hadoop data. Apache Hive was the original answer: it provides an open-source SQL interface to Hadoop, allowing a person with basic SQL skills to run analytics on Hadoop data using a SQL dialect called HiveQL, without the complexity of MapReduce.
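
As a minimal sketch of what that looks like, here is a HiveQL aggregation; the web_clicks table and its columns are invented for illustration:

-- hypothetical hiveql query; hive compiles this into mapreduce jobs
select page_url, count(*) as clicks
from web_clicks
where click_date >= '2019-01-01'
group by page_url
order by clicks desc
limit 10;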

Apache Hive

So, what exactly is Hive? Apache Hive is a data warehouse system for Hadoop. Hive is not a database; it is a metastore that holds the table structure definitions you provide when you create a Hive table. This repository is known as the HCatalog and is itself a relational database, typically MySQL, PostgreSQL, or Oracle.
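
You can inspect what the metastore has recorded for a table with a describe statement (the table name below is hypothetical):

-- prints the column definitions, storage format, and location
-- that hive recorded in the metastore for this table
describe formatted myschema.users;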

It is important to understand that Hive does not bring OLTP-style capability to Hadoop queries. HiveQL is translated into MapReduce jobs under the covers, so it is really best suited for long-running, batch-type queries, given the performance limitations of this process.
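
If you want to see this translation for yourself, Hive lets you prefix a query with explain to print the plan of stages it will run; the query below is again hypothetical:

-- shows the stages (typically mapreduce jobs) hive will execute
explain
select office_id, count(*)
from myschema.users
group by office_id;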

Hive is essentially three things:

  • A MapReduce execution engine
  • A storage model
  • A metastore


Hive tables can also be partitioned to help improve performance.

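As an illustrative sketch, a date-partitioned Hive table might be defined as follows (the table and columns are hypothetical); a query that filters on the partition column then scans only the matching partitions:

create table web_clicks
(
 user_id int,
 page_url string,
 click_ts timestamp
)
partitioned by (click_date string)
row format delimited
 fields terminated by '|'
stored as textfile;

-- a predicate such as "where click_date = '2019-09-01'" prunes the
-- scan to a single partition instead of reading the whole table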

Db2 Big SQL

IBM’s Db2 Big SQL takes Hive's SQL capability to a higher level. Big SQL is based on the Db2 MPP (massively parallel processing) architecture and sits on top of Hive. It replaces MapReduce with MPP processing, which is much faster and provides high concurrency, enabling something much closer to an OLTP experience.

For a person with a Db2 background, Big SQL makes Hadoop data easily accessible. Its SQL is fully ANSI compliant, and the syntax for DDL and DML is nearly identical to native Db2 as well.

Here is an example of a create table statement:

create hadoop table users
(
 id int not null primary key,
 office_id int null,
 fname varchar(30) not null,
 lname varchar(30) not null
)
row format delimited
 fields terminated by '|'
stored as textfile;


Notice the use of “not null” and “primary key” in the definitions. This syntax is unique to Big SQL. These constraints are not actually enforced in Hadoop, but because Big SQL is Db2 at its core, the information is fed to the optimizer, and the query will undergo rewrite and optimization very similar to Db2. Defining referential integrity (RI) relationships on these tables allows the optimizer to be more intelligent about join order, just like Db2. Once this DDL is executed, the metadata is stored in the Hive HCatalog just the same as for native Hive tables.
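
As a sketch of what such an RI definition might look like, the following declares a foreign key as an informational constraint using Db2-style syntax; the offices table is hypothetical, and the constraint is advisory to the optimizer rather than enforced:

-- informational constraint: not enforced against the data,
-- but the optimizer can exploit it when planning joins
alter table users
 add constraint fk_office
 foreign key (office_id) references offices (id)
 not enforced
 enable query optimization;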

Now suppose you want to create a view. Here again, the syntax is identical to Db2:

create view my_users as
select fname, lname from myschema.users where id > 100;


Other features that should look familiar to you:

  • “Native Tables” with full transactional support on the Head Node:
    • Row-oriented, traditional Db2 tables
    • BLU Columnar, In-memory tables (on Head Node Only)
    • Materialized Query Tables
  • GET SNAPSHOT/snapshot table functions
  • RUNSTATS command (Db2) → ANALYZE command (Big SQL); see the example after this list
  • Row and Column Security
  • Federation/Fluid Query
  • Views
  • SQL PL Stored Procedures & UDFs
  • Workload Manager
  • System Temporary Table Spaces to support sort overflows
  • User Temporary Table Spaces for Declared Global Temporary Tables
  • HADR for Head Node
  • Oracle PL/SQL support
  • Declared Global Temporary Tables
  • Time Travel Queries
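
For instance, where you would run RUNSTATS against a native Db2 table, in Big SQL you gather table and column statistics with the ANALYZE command; the table name here is hypothetical:

-- big sql equivalent of runstats: collects statistics
-- the optimizer uses for access-path selection
analyze table myschema.users
 compute statistics for all columns;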

Conclusion

Currently, there are myriad SQL engines available for Hadoop, and different engines solve different problems. Most likely, no single SQL engine will address all of your modern data warehousing needs or use cases; depending on how your organization plans to use Hadoop, you will probably use a combination of them. For long-running batch queries, you may want native Hive; for simple ad hoc queries, Apache Spark SQL; and for complex BI-type queries, Big SQL fits the bill.

So whether your company is just starting to dabble with Hadoop or already runs full-blown production clusters, these SQL engines can help you leverage your Db2 skills to start working with Hadoop. Hopefully, this short article will help you kick-start your exploration of Big Data!


Published at DZone with permission of Jim Wankowski.

Opinions expressed by DZone contributors are their own.
