Example of ETL Application Using Apache Spark and Hive

In this article, we'll read a sample data set with Spark on HDFS (Hadoop File System), do a simple analytical operation, then write to a table that we'll make in Hive.

By Emrah Mete · Apr. 24, 2019 · Tutorial

Hello everyone!

In this article, I will read a sample data set with Spark on HDFS (Hadoop File System), do a simple analytical operation, then write to a table that I will create in Hive.

The amount of data produced grows every day, and analyzing the data we collect has become just as important as collecting it. Technologies that enable analytical work on big data are a focus area for nearly every company; as a result, most companies now run big data environments and do their data analysis there.

Hadoop is the best-known and most widely used big data environment in the industry. The Hadoop ecosystem contains solutions for several different needs, such as data storage, data processing, data querying, and visualization. Each solution and library in the ecosystem is a technology deep enough to require its own expertise. You can see the different technologies in this ecosystem and their usage areas in the picture below.

Big data technologies

Now let's move on to building the sample application. In the example, we will first send the data from our Linux file system to HDFS, the storage layer of the Hadoop ecosystem (Extract). Then we will read the data back with Spark, apply a simple transformation (Transform), and write the result to Hive (Load). Hive is a layer that lets us query the data stored in the Hadoop environment; with it, we can easily query the data in our big data environment using SQL.

First, let's take a look at the data set we will work with: the Sample Sales Dataset on Kaggle. The following table shows its column names and data types.

Column Name Data Type
ORDERNUMBER Number
QUANTITYORDERED Number
PRICEEACH Number
ORDERLINENUMBER Number
SALES Number
ORDERDATE Date
STATUS String
QTR_ID Number
MONTH_ID Number
YEAR_ID Number
PRODUCTLINE String
MSRP Number
PRODUCTCODE String
CUSTOMERNAME String
PHONE Number
ADDRESSLINE1 String
ADDRESSLINE2 String
CITY String
STATE String
POSTALCODE Number
COUNTRY String
TERRITORY String
CONTACTLASTNAME String
CONTACTFIRSTNAME String
DEALSIZE String

This data set contains sample sales transactions. We will apply a few transformations to it, build a small summary report, and insert the result into a Hive table.
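
In the steps below, we read the file as plain text and keep every field as a string. If you would rather load the CSV with the column types above, Spark 2.x can also read it through the DataFrame reader. The following is only a sketch of that alternative: it assumes the PySpark 2.x shell (where the spark session object already exists) and that the file has been copied to the samplesales directory we create next.

# Optional alternative (sketch): load the CSV with Spark's DataFrame reader,
# letting Spark parse the header line and infer column types.
# Assumes the PySpark 2.x shell, where spark is created automatically,
# and that the file sits in the samplesales directory used below.
dfSalesTyped = (spark.read
                .option("header", "true")
                .option("inferSchema", "true")
                .csv("samplesales/sales_sample_data.csv"))

dfSalesTyped.printSchema()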

First, copy the sample data that we downloaded from Kaggle to HDFS. For this, let's create a directory in HDFS.

hadoop fs -mkdir samplesales

We created the directory in HDFS; now let's copy the sample data from our local directory into it and list the directory to confirm the copy.

hadoop fs -copyFromLocal sales_sample_data.csv samplesales

hadoop fs -ls samplesales/

We've written the data to HDFS, so we can start the PySpark shell and begin processing the data with Spark. (The Spark installation path and the resource settings in the command below will depend on your environment.)

/user/spark-2.1.0-bin-hadoop2.7/bin/pyspark --master yarn-client --num-executors 4 --driver-memory 2g --executor-memory 4g

Spark started successfully. Let's import the libraries that we will use at this stage.

from pyspark import SparkContext, SparkConf
from pyspark.sql import SQLContext
from pyspark.sql import Row
from pyspark.sql import HiveContext
from pyspark.sql.functions import *

# sc is the SparkContext created automatically by the PySpark shell
hive_context = HiveContext(sc)
sqlContext = SQLContext(sc)

Now, let's create the line format in which we will keep our data.

RecSales = Row('ordernumber','quantityordered','priceeach','orderlinenumber','sales','orderdate','status','qtr_id','month_id','year_id','productline','msrp','productcode','customername','phone','addressline1','addressline2','city','state','postalcode','country','territory','contactlastname','contactfirstname','dealsize')

Now it's time to read the data we put in HDFS and load it into a DataFrame. Since the columns are separated by the "," delimiter, we have to split each line on it when loading the data. (The path passed to textFile should point to the directory we created earlier; adjust it if your HDFS home directory is different.) Once the data is in the DataFrame, we give it a name and register it as a temporary table; we will use this name later when writing SQL against hive_context or sqlContext.

dataSales = sc.textFile("/user/samplesales/")
header = dataSales.first()
# Drop the header line before parsing
dataSales = dataSales.filter(lambda line: line != header)
# Split each line on "," and map the fields onto our Row format
recSales = dataSales.map(lambda l: l.split(","))
dataSales = recSales.map(lambda l: RecSales(l[0],l[1],l[2],l[3],l[4],l[5],l[6],l[7],l[8],l[9],l[10],l[11],l[12],l[13],l[14],l[15],l[16],l[17],l[18],l[19],l[20],l[21],l[22],l[23],l[24]))
# Create a DataFrame and register it as the temporary table "sales"
dfRecSales = hive_context.createDataFrame(dataSales)
dfRecSales.registerTempTable("sales")
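
At this point every column is held as a string, since we split each line ourselves. A quick optional sanity check before querying:

# Optional sanity check: inspect the schema and a few rows of the DataFrame
dfRecSales.printSchema()
dfRecSales.show(5)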

We have successfully read the data from HDFS and loaded it into a DataFrame. Now let's run a few simple queries on the data with Spark SQL.

hive_context.sql("select count(*) from sales").collect()

hive_context.sql("select * from sales").head()

hive_context.sql("select ordernumber,priceeach  from sales").head(2)

Now let's group the sales by territory, and write the results to a Hive table.

dfterritory = hive_context.sql("select territory, sum(priceeach) total from sales group by territory")
dfterritory.registerTempTable("sumterr")
dfterritory.show()
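
The same aggregation can also be expressed with the DataFrame API instead of SQL. The sketch below should produce the same result; priceeach is cast to a numeric type explicitly because every column was loaded as a string:

# Optional: the same territory totals via the DataFrame API
from pyspark.sql import functions as F

dfTerritoryApi = (dfRecSales
                  .groupBy("territory")
                  .agg(F.sum(F.col("priceeach").cast("double")).alias("total")))
dfTerritoryApi.show()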

Hive Table

Let's create a Hive table and write the result.

hive_context.sql("CREATE TABLE IF NOT EXISTS territory_sum (territory String, total INT)")

hive_context.sql("insert into territory_sum select * from sumterr")

Finally, check the data written to Hive.

hive_context.sql("select * from territory_sum").show()

hive_context.sql("select count(*) from territory_sum").show()

hive_context.sql("select count(*) from sumterr").show()



