Data Processing With Apache Spark
A step-by-step tutorial on how to use Apache Spark with SQL Server to process data efficiently from several different types of datafiles.
Spark has emerged as a favorite for analytics, especially for workloads that involve massive volumes of data, where it delivers high performance compared to conventional database engines. Spark SQL allows users to express their complex business requirements to Spark in the familiar language of SQL.
So, in this blog, we will see how you can process data with Apache Spark. And what better way to establish Spark's capabilities than to put it through its paces with the Hadoop-DS benchmark and compare performance, throughput, and SQL compatibility against SQL Server?
Before we begin, ensure that the following test environment is available:
- SQL Server: 32 GB RAM with Windows Server 2012 R2.
- Hadoop cluster: two machines, each with 8 GB RAM, running Ubuntu.
For the purpose of this demo, we will use AdventureWorks2016DW data.
The query uses the FactInternetSales fact table together with the Customer, Product, Product Subcategory, and Product Category dimension tables.
We will compare the performance of three data processing engines: SQL Server, Spark with CSV files as datafiles, and Spark with Parquet files as datafiles.
We will use the following query to process the data:
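The exact query is not reproduced here, so the following is only a representative sketch against the standard AdventureWorksDW schema: it joins the internet sales fact table to the customer, product, subcategory, and category dimensions and aggregates sales by category and gender.

SELECT pc.EnglishProductCategoryName,
       c.Gender,
       SUM(f.SalesAmount)   AS TotalSales,
       SUM(f.OrderQuantity) AS TotalQuantity
FROM FactInternetSales f
JOIN DimCustomer c            ON f.CustomerKey = c.CustomerKey
JOIN DimProduct p             ON f.ProductKey = p.ProductKey
JOIN DimProductSubcategory ps ON p.ProductSubcategoryKey = ps.ProductSubcategoryKey
JOIN DimProductCategory pc    ON ps.ProductCategoryKey = pc.ProductCategoryKey
GROUP BY pc.EnglishProductCategoryName, c.Gender
ORDER BY TotalSales DESC;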
Let’s measure the performance of each processing engine:
Running the query on our SQL Server instance (32 GB RAM, Windows Server 2012 R2), it takes around 2.33 minutes to execute and return the data.
The following is the screenshot of this query:
Spark With CSV Datafiles
Now, let’s export the same dataset to CSV and move it to HDFS.
The following are screenshots of HDFS with the CSV file as an input source.
Now that we have moved the files for the input tables to HDFS as CSV files, we can start the Spark shell and create DataFrames for each source file.
Run the following commands to create the SQLContext:
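The original commands are not preserved on this page, so the snippets that follow are minimal sketches: they assume Spark 2.x running in the Spark shell (where the SparkContext sc is already defined), CSV exports placed under /user/nituser/ in HDFS, and schemas trimmed to just the columns the example query needs.

import org.apache.spark.sql.SQLContext
import org.apache.spark.sql.types._

// The shell usually provides a SQLContext already; creating one from sc also works.
val sqlContext = new SQLContext(sc)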
Run the following command to create Fact Schema:
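A sketch of the fact schema, keeping only the FactInternetSales columns referenced by the example query:

val factSchema = StructType(Array(
  StructField("ProductKey", IntegerType, true),
  StructField("OrderDateKey", IntegerType, true),
  StructField("CustomerKey", IntegerType, true),
  StructField("OrderQuantity", IntegerType, true),
  StructField("SalesAmount", DoubleType, true)
))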
Run the following command to create a DataFrame for Sales with Fact Schema:
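Assuming the export was copied to /user/nituser/sales.csv with a header row:

val sales = sqlContext.read
  .option("header", "true")   // first row contains column names
  .schema(factSchema)
  .csv("/user/nituser/sales.csv")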
Run the following command to create a Customer schema:
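Again trimmed to the columns the query touches:

val customerSchema = StructType(Array(
  StructField("CustomerKey", IntegerType, true),
  StructField("FirstName", StringType, true),
  StructField("LastName", StringType, true),
  StructField("Gender", StringType, true)
))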
Run the following command to create a Customer DataFrame with the Customer schema:
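Assuming the customer export sits at /user/nituser/customer.csv:

val customer = sqlContext.read
  .option("header", "true")
  .schema(customerSchema)
  .csv("/user/nituser/customer.csv")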
Now create the product schema with the following command:
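A sketch, with ProductSubcategoryKey included so each product can be joined up to its category:

val productSchema = StructType(Array(
  StructField("ProductKey", IntegerType, true),
  StructField("ProductSubcategoryKey", IntegerType, true),
  StructField("EnglishProductName", StringType, true)
))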
Create the Product DataFrame with the Product schema:
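Assuming /user/nituser/product.csv:

val product = sqlContext.read
  .option("header", "true")
  .schema(productSchema)
  .csv("/user/nituser/product.csv")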
Now create the Product Category schema using the following command:
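A sketch of the category schema:

val productCategorySchema = StructType(Array(
  StructField("ProductCategoryKey", IntegerType, true),
  StructField("EnglishProductCategoryName", StringType, true)
))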
Now create the Product Category dataframe with the Product Category Schema:
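Assuming /user/nituser/productcategory.csv:

val productCategory = sqlContext.read
  .option("header", "true")
  .schema(productCategorySchema)
  .csv("/user/nituser/productcategory.csv")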
Now create the Product Sub Category schema using the following command:
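A sketch of the subcategory schema, which carries the key linking products to categories:

val productSubCategorySchema = StructType(Array(
  StructField("ProductSubcategoryKey", IntegerType, true),
  StructField("ProductCategoryKey", IntegerType, true),
  StructField("EnglishProductSubcategoryName", StringType, true)
))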
And create the productSubCategory DataFrame using the following command:
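Assuming /user/nituser/productsubcategory.csv:

val productSubCategory = sqlContext.read
  .option("header", "true")
  .schema(productSubCategorySchema)
  .csv("/user/nituser/productsubcategory.csv")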
Now create temporary views of each dataframe that we have created so far:
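Registering each DataFrame under the same table name used in SQL Server lets the query run unchanged; for example:

sales.createOrReplaceTempView("FactInternetSales")
customer.createOrReplaceTempView("DimCustomer")
product.createOrReplaceTempView("DimProduct")
productCategory.createOrReplaceTempView("DimProductCategory")
productSubCategory.createOrReplaceTempView("DimProductSubcategory")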
And run the same query which we ran in SQL Server:
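With the views in place, the illustrative query from the SQL Server step can be submitted through Spark SQL:

val result = sqlContext.sql("""
  SELECT pc.EnglishProductCategoryName, c.Gender,
         SUM(f.SalesAmount) AS TotalSales, SUM(f.OrderQuantity) AS TotalQuantity
  FROM FactInternetSales f
  JOIN DimCustomer c ON f.CustomerKey = c.CustomerKey
  JOIN DimProduct p ON f.ProductKey = p.ProductKey
  JOIN DimProductSubcategory ps ON p.ProductSubcategoryKey = ps.ProductSubcategoryKey
  JOIN DimProductCategory pc ON ps.ProductCategoryKey = pc.ProductCategoryKey
  GROUP BY pc.EnglishProductCategoryName, c.Gender
  ORDER BY TotalSales DESC
""")
result.show()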
It took around 3 minutes to execute and return the result set.
Spark With a Parquet File for a Fact Table
Now, let’s convert the FactInternetSaleNew file to a Parquet file and save it to HDFS using the following command:
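One way to do this, reusing the sales DataFrame built from the CSV file above, is simply to write it back out in Parquet format:

sales.write.parquet("/user/nituser/sales.parquet")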
Create a DataFrame on top of the Parquet file using the following command:
val sales = sqlContext.read.parquet("/user/nituser/sales.parquet")
And create a temp view using the sales dataframe:
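Re-registering the view so the query now reads from the Parquet-backed DataFrame:

sales.createOrReplaceTempView("FactInternetSales")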
Now, we will run the same query which we used in Step 2:
It will return the same result set in less than 20 seconds.
We can conclude that Spark running on commodity hardware performs very similarly to the high-end SQL Server machine.
However, Spark outshines other engines when it works with efficient, column-oriented, compressed storage formats such as Parquet.
So, we need to choose the processing engine and storage format based on business requirements, while understanding how to harness such a highly efficient processing engine to get the required performance.