Data Analysis Using Apache Hive and Apache Pig

Learn about loading and storing data using Hive, an open-source data warehouse system, and Pig, which can be used for the ETL data pipeline and iterative processing.

By Rathnadevi Manivannan · Aug. 18, 17 · Tutorial


Apache Hive, an open-source data warehouse system, is used with Apache Pig to load and transform unstructured, structured, or semi-structured data for analysis and better business insights. Pig, an ETL scripting language, is used to import and export data into Apache Hive and to process large datasets. Pig can be used for ETL data pipelines and iterative processing.

In this post, let's discuss loading and storing data in Hive with a Pig relation using HCatalog.

Prerequisites

Download and configure the following:

  • Hadoop 2.6.5
  • Hive 1.2.1
  • Apache Pig 0.17.0-SNAPSHOT

Use Case

In this post, let's discuss the following use case:

  • Loading unstructured data into Hive.
  • Processing, transforming, and analyzing data in Pig.
  • Loading structured data into a different table in Hive using Pig.

Data Description

Two cricket data files with Indian Premier League (IPL) data from 2008 to 2016 are used as the data source. The files are as follows:

  • matches.csv: Provides details about each match played.
  • deliveries.csv: Provides details about consolidated deliveries of all the matches.

These files are extracted and loaded into Hive. The data is further processed, transformed, and analyzed to get the winner of each season and the top five batsmen with the most runs in each season and across all seasons.

Synopsis

  • Create database and database tables in Hive.
  • Import data into Hive tables.
  • Call Hive SQL in Shell script.
  • View database architecture.
  • Load and store Hive data into Pig relation.
  • Call Pig script in Shell Script.
  • Apply pivot concept in Hive SQL.
  • View output.

Creating Database and Database Tables in Hive

To create the database and database tables in Hive, save the creation queries as a SQL file named database_table_creation.sql.
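The original listing did not survive extraction. A minimal sketch of what database_table_creation.sql might contain is shown below; the column lists are assumptions based on the data description, and the real IPL files carry more columns than this:

```sql
-- Create the database and switch to it
CREATE DATABASE IF NOT EXISTS ipl_stats;
USE ipl_stats;

-- One row per match played (column list assumed)
CREATE TABLE IF NOT EXISTS matches (
    id         INT,
    season     INT,
    city       STRING,
    match_date STRING,
    team1      STRING,
    team2      STRING,
    winner     STRING
)
ROW FORMAT DELIMITED FIELDS TERMINATED BY ','
STORED AS TEXTFILE
TBLPROPERTIES ('skip.header.line.count' = '1');

-- One row per ball bowled (column list assumed)
CREATE TABLE IF NOT EXISTS deliveries (
    match_id     INT,
    inning       INT,
    batting_team STRING,
    batsman      STRING,
    bowler       STRING,
    batsman_runs INT,
    total_runs   INT
)
ROW FORMAT DELIMITED FIELDS TERMINATED BY ','
STORED AS TEXTFILE
TBLPROPERTIES ('skip.header.line.count' = '1');
```

The skip.header.line.count property tells Hive to ignore the CSV header row on load.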

Importing Data Into Hive Tables

To load data from both CSV files into Hive, save the load queries as a SQL file named data_loading.sql.
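The original listing is missing here as well; a sketch of what data_loading.sql might look like, with the local file paths being assumptions:

```sql
USE ipl_stats;

-- Paths are assumptions; point them at wherever the CSV files live locally
LOAD DATA LOCAL INPATH '/tmp/ipl/matches.csv'    OVERWRITE INTO TABLE matches;
LOAD DATA LOCAL INPATH '/tmp/ipl/deliveries.csv' OVERWRITE INTO TABLE deliveries;
```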

Calling Hive SQL in Shell Script

To automatically create the database and tables and to import data into Hive, call both SQL files (database_table_creation.sql and data_loading.sql) from a Shell script.
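The script itself did not survive extraction; a minimal sketch, assuming the hive client is on the PATH and both SQL files sit in the working directory:

```shell
#!/bin/bash
# Run both Hive SQL files in order: first create the schema, then load the data.
hive -f database_table_creation.sql
hive -f data_loading.sql
```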

Viewing Database Architecture

Once the scripts run, the ipl_stats database contains the matches and deliveries tables.
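The created schema can be inspected from the Hive shell, for example:

```sql
USE ipl_stats;
SHOW TABLES;        -- lists the deliveries and matches tables
DESCRIBE matches;   -- column names and types
DESCRIBE deliveries;
```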

The raw matches.csv file is loaded into the Hive table ipl_stats.matches, and the raw deliveries.csv file into ipl_stats.deliveries.
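The loaded tables can be previewed directly from Hive:

```sql
SELECT * FROM ipl_stats.matches    LIMIT 5;
SELECT * FROM ipl_stats.deliveries LIMIT 5;
```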

Loading and Storing Hive Data Into Pig Relation

To load and store data from Hive into a Pig relation and to perform data processing and transformation, save the processing script as a Pig file named most_run.pig.
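The original Pig script is lost. The sketch below shows the shape it likely had: the HCatLoader and HCatStorer class names are the real HCatalog adapters for Pig, but the relation logic and column names are assumptions based on the data description:

```pig
-- Read both Hive tables through HCatalog
matches    = LOAD 'ipl_stats.matches'    USING org.apache.hive.hcatalog.pig.HCatLoader();
deliveries = LOAD 'ipl_stats.deliveries' USING org.apache.hive.hcatalog.pig.HCatLoader();

-- Attach the season to every delivery (column names assumed)
joined  = JOIN deliveries BY match_id, matches BY id;

-- Total runs per batsman per season
grouped = GROUP joined BY (matches::season, deliveries::batsman);
runs    = FOREACH grouped GENERATE
              FLATTEN(group) AS (season, batsman),
              SUM(joined.deliveries::batsman_runs) AS runs;

-- Write the result back into a pre-created Hive table
STORE runs INTO 'ipl_stats.most_run' USING org.apache.hive.hcatalog.pig.HCatStorer();
```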

Note: Create the target Hive table before calling the Pig file. To create the table that receives the processed data, save the creation query as a SQL file named most_run.sql.
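A sketch of what most_run.sql might contain, with the schema assumed to match the Pig output (HCatStorer requires the target table to exist with a matching schema):

```sql
USE ipl_stats;

-- Target table for the Pig output
CREATE TABLE IF NOT EXISTS most_run (
    season  INT,
    batsman STRING,
    runs    BIGINT
)
STORED AS ORC;
```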

Calling Pig Script in Shell Script

To automate the ETL process, call both files (most_run.sql, then most_run.pig) from a Shell script.
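A minimal sketch of that script, assuming the hive and pig clients are on the PATH; the -useHCatalog flag makes the HCatalog loader and storer classes available to Pig:

```shell
#!/bin/bash
# Create the target table first, then run the Pig job with HCatalog support.
hive -f most_run.sql
pig -useHCatalog most_run.pig
```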

The Pig script loads the processed season-wise run data into the most_run table in Hive.

Applying Pivot Concept in Hive SQL

As the data loaded into Hive is in rows, the SQL pivot concept is used to convert rows into columns for more data clarity and for gaining better insights. The user-defined aggregation function (UDAF) technique is used to perform pivot in Hive. In this use case, the pivot concept is applied to season and run rows alone.

To use the collect UDAF, add the Brickhouse JAR file to the Hive classpath.
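Brickhouse's collect UDAF folds (key, value) pairs into a map, which is what makes the pivot possible. A hedged sketch of registering and applying it; the JAR path is an assumption, while the class name is Brickhouse's actual CollectUDAF:

```sql
ADD JAR /path/to/brickhouse.jar;  -- path is an assumption
CREATE TEMPORARY FUNCTION collect AS 'brickhouse.udf.collect.CollectUDAF';

-- Pivot: one row per batsman, seasons folded into a map keyed by year
SELECT batsman,
       collect(season, runs) AS runs_by_season
FROM   ipl_stats.most_run
GROUP  BY batsman;
```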

Before the pivot, the top five batsmen's data holds one row per batsman per season; after the pivot, each batsman's runs across all seasons sit in a single row, one column per season.

Viewing Output

Let's view the winners of each season, the top five batsmen by runs scored, and their year-wise runs.

Viewing Winners of a Season

To view the winners of each season, use a Hive SQL query over the matches table.
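The original query is lost; one plausible sketch, assuming each season's final is its last match by date and that match_date sorts correctly as a string:

```sql
-- Winner of the last match of each season (assumed to be the final)
SELECT m.season, m.winner
FROM   ipl_stats.matches m
JOIN   (SELECT season, MAX(match_date) AS final_date
        FROM   ipl_stats.matches
        GROUP  BY season) f
  ON   m.season = f.season AND m.match_date = f.final_date
ORDER  BY m.season;
```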

Viewing Top 5 Most Run Scored Batsmen

To view the top five batsmen by runs scored, use a Hive SQL query over the aggregated run data.
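A sketch of such a query against the most_run table populated by the Pig script:

```sql
-- Top five run scorers across all seasons
SELECT   batsman, SUM(runs) AS total_runs
FROM     ipl_stats.most_run
GROUP BY batsman
ORDER BY total_runs DESC
LIMIT 5;
```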

The top five run scorers were also charted graphically using MS Excel.

Viewing Year-Wise Runs of Top 5 Batsmen

To view the year-wise runs of the top five batsmen, use a Hive SQL query over the pivoted data.
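A sketch, assuming the pivoted result of the collect UDAF has been saved as a table (pivoted_most_run is an assumed name) with a runs_by_season map column keyed by season:

```sql
-- Year-wise runs per batsman, one column per season (a sample of seasons shown)
SELECT batsman,
       runs_by_season[2008] AS runs_2008,
       runs_by_season[2009] AS runs_2009,
       runs_by_season[2016] AS runs_2016
FROM   pivoted_most_run
ORDER  BY batsman;
```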

The year-wise runs of the top five batsmen were also charted using MS Excel.

Reference

  • Sample Data Files and Code in GitHub

Published at DZone with permission of Rathnadevi Manivannan.
