Data Analysis Using Apache Hive and Apache Pig
Learn about loading and storing data using Hive, an open-source data warehouse system, and Pig, which can be used for the ETL data pipeline and iterative processing.
Apache Hive, an open-source data warehouse system, is used with Apache Pig for loading and transforming unstructured, structured, or semi-structured data for data analysis and better business insights. Pig, whose Pig Latin scripting language is a standard for ETL work, is used to export and import data into Apache Hive and to process large datasets. Pig can be used for the ETL data pipeline and for iterative processing.
In this blog, let's discuss loading and storing data in Hive with Pig Relation using HCatalog.
Download and configure Hadoop, Hive (with HCatalog), and Pig before starting.
In this blog, let's discuss the below use case:
- Loading unstructured data into Hive.
- Processing, transforming, and analyzing data in Pig.
- Loading structured data into a different table in Hive using Pig.
Two cricket data files with Indian Premier League data from 2008 to 2016 are used as the data source. The files are as follows:
matches.csv: Provides details about each match played.
deliveries.csv: Provides details about consolidated deliveries of all the matches.
These files are extracted and loaded into Hive. The data is further processed, transformed, and analyzed to get the winner of each season and the top five batsmen with the most runs in each season and overall.
- Create database and database tables in Hive.
- Import data into Hive tables.
- Call Hive SQL in Shell script.
- View database architecture.
- Load and store Hive data into Pig relation.
- Call Pig script in Shell Script.
- Apply pivot concept in Hive SQL.
- View output.
Creating Database and Database Tables in Hive
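The original DDL was not preserved, so the following is a sketch of what the schema might look like. The database name (ipl_stats) and table name (matches) are taken from the article; the column names follow the public IPL dataset and are assumptions:

```sql
-- Illustrative schema; column names follow the public IPL matches.csv and are assumptions.
CREATE DATABASE IF NOT EXISTS ipl_stats;
USE ipl_stats;

CREATE TABLE IF NOT EXISTS matches (
  id INT, season INT, city STRING, match_date STRING,
  team1 STRING, team2 STRING, toss_winner STRING, toss_decision STRING,
  result STRING, winner STRING, win_by_runs INT, win_by_wickets INT,
  player_of_match STRING, venue STRING
)
ROW FORMAT DELIMITED FIELDS TERMINATED BY ','
STORED AS TEXTFILE
TBLPROPERTIES ('skip.header.line.count'='1');
```

A similar table would be created for deliveries.csv with per-ball columns such as match_id, batsman, and batsman_runs.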
Importing Data Into Hive Tables
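With the tables in place, the CSV files can be imported with Hive's LOAD DATA statement. The local paths below are illustrative, not from the article:

```sql
-- Load the extracted CSV files from the local file system (paths are illustrative).
LOAD DATA LOCAL INPATH '/tmp/ipl/matches.csv'
OVERWRITE INTO TABLE ipl_stats.matches;

LOAD DATA LOCAL INPATH '/tmp/ipl/deliveries.csv'
OVERWRITE INTO TABLE ipl_stats.deliveries;
```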
Calling Hive SQL in Shell Script
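The original shell script was not preserved; a minimal wrapper that runs a HiveQL file non-interactively might look like the following (load_data.sql is a hypothetical file holding the statements above):

```shell
#!/bin/bash
# Illustrative wrapper: execute a HiveQL file with the Hive CLI.
# load_data.sql is a hypothetical file containing the DDL and LOAD DATA statements.
hive -f load_data.sql
```

For one-off statements, `hive -e "USE ipl_stats; SHOW TABLES;"` works the same way.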
Viewing Database Architecture
The matches.csv file loaded into the Hive schema (ipl_stats.matches) is as follows:
Loading and Storing Hive Data Into Pig Relation
To load and store data from Hive into a Pig relation and to perform data processing and transformation, save the below script as a Pig file.
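The article's Pig script was not preserved. The sketch below, assuming HCatalog integration and the public IPL dataset's column names (match_id, batsman, and batsman_runs are assumptions), shows the general shape of such a script: load both Hive tables, aggregate runs per batsman per season, and store the result back to Hive:

```pig
-- Illustrative Pig script (e.g. saved as transform.pig); run with: pig -useHCatalog transform.pig
matches    = LOAD 'ipl_stats.matches' USING org.apache.hive.hcatalog.pig.HCatLoader();
deliveries = LOAD 'ipl_stats.deliveries' USING org.apache.hive.hcatalog.pig.HCatLoader();

-- Join deliveries to matches to pick up the season of each delivery.
joined  = JOIN deliveries BY match_id, matches BY id;
runs    = FOREACH joined GENERATE matches::season        AS season,
                                  deliveries::batsman      AS batsman,
                                  deliveries::batsman_runs AS batsman_runs;

-- Total runs per batsman per season.
grouped = GROUP runs BY (season, batsman);
totals  = FOREACH grouped GENERATE FLATTEN(group) AS (season, batsman),
                                   SUM(runs.batsman_runs) AS total_runs;

-- Store the processed relation into a pre-created Hive table.
STORE totals INTO 'ipl_stats.batsman_runs' USING org.apache.hive.hcatalog.pig.HCatStorer();
```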
Note: Create the target Hive table before calling the Pig file. To write the processed data back into Hive, save the below script as a SQL file.
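A sketch of that SQL file, assuming the target table and columns named in the Pig sketch above (ipl_stats.batsman_runs is an assumed name, since the original was not preserved):

```sql
-- Illustrative target table; it must exist before HCatStorer writes to it.
CREATE TABLE IF NOT EXISTS ipl_stats.batsman_runs (
  season     INT,
  batsman    STRING,
  total_runs BIGINT
)
STORED AS ORC;
```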
Calling Pig Script in Shell Script
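As with the Hive wrapper, the shell script itself was not preserved; a minimal version would enable HCatalog support and run the Pig file (transform.pig is the hypothetical name used above):

```shell
#!/bin/bash
# Illustrative wrapper: run the Pig script with HCatalog support enabled.
pig -useHCatalog transform.pig
```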
The data loaded into Hive using Pig script is as follows:
Applying Pivot Concept in Hive SQL
As the data loaded into Hive is in rows, the SQL pivot concept is used to convert rows into columns for more data clarity and better insights. The user-defined aggregation function (UDAF) technique is used to perform the pivot in Hive. In this use case, the pivot concept is applied to the run rows alone.
To use the collect UDAF, add the Brickhouse JAR file to the Hive class path.
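Registering Brickhouse's collect and pivoting with it might look like the following. The JAR path, table name, and season keys are illustrative; the UDAF class name is Brickhouse's actual CollectUDAF, which returns a map when called with a key and a value:

```sql
-- Register the Brickhouse collect UDAF (JAR path is illustrative).
ADD JAR /path/to/brickhouse.jar;
CREATE TEMPORARY FUNCTION collect AS 'brickhouse.udf.collect.CollectUDAF';

-- Pivot season rows into columns per batsman (seasons shown are illustrative).
SELECT batsman,
       runs_map[2014] AS runs_2014,
       runs_map[2015] AS runs_2015,
       runs_map[2016] AS runs_2016
FROM (
  SELECT batsman, collect(season, total_runs) AS runs_map
  FROM ipl_stats.batsman_runs
  GROUP BY batsman
) t;
```

Each season key in the inner map becomes its own column in the outer SELECT, which is the row-to-column conversion described above.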
The top five run-scoring batsmen for each season, before applying the pivot, are shown as follows:
The top five run-scoring batsmen for each season, after applying the pivot, are shown as follows:
Let's view the winners of each season, the top five run-scoring batsmen, and their year-wise runs.
Viewing Winners of a Season
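The article's query was not preserved; one hedged formulation is to treat the winner of each season's last match (the final) as the season winner. Dates are assumed to be stored as ISO-formatted strings, so MAX compares them correctly:

```sql
-- Sketch: the season winner is the winner of that season's final (last match by date).
SELECT m.season, m.winner
FROM ipl_stats.matches m
JOIN (SELECT season, MAX(match_date) AS final_date
      FROM ipl_stats.matches
      GROUP BY season) f
  ON m.season = f.season AND m.match_date = f.final_date;
```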
Viewing Top 5 Most Run Scored Batsmen
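Again as a sketch, using the assumed ipl_stats.batsman_runs table from the Pig step, the overall top five run scorers can be retrieved with a simple aggregation:

```sql
-- Sketch: top five run scorers across all seasons.
SELECT batsman, SUM(total_runs) AS runs
FROM ipl_stats.batsman_runs
GROUP BY batsman
ORDER BY runs DESC
LIMIT 5;
```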
The top five run-scoring batsmen are shown graphically using MS Excel as follows:
Viewing Year-Wise Runs of Top 5 Batsmen
The year-wise runs of the top five batsmen are shown graphically using MS Excel as follows:
Published at DZone with permission of Rathnadevi Manivannan. See the original article here.