Parsing and Querying CSVs With Apache Spark
Apache Spark is at the center of Big Data analytics, and this post provides the spark to begin your Big Data journey. Read on to understand how to ingest a CSV data file into Apache Spark.
In this post, I will be sharing how to parse and query CSV data using Apache Spark. Querying CSV data is very easy with the Spark CSV library. For that, we will be using SQLContext. With SQLContext, we can query the data the way we would in any database language. We can perform all the operations on the data, like SELECT, and also write the data into a new file.
Alright, enough talking now, let’s see how we can do it!
We will set up an SBT project first and then start by adding the required dependencies to build.sbt.
For this project, we will require 3 major dependencies:
libraryDependencies += "org.apache.spark" %% "spark-core" % "1.6.1"
//for spark 1.6
libraryDependencies += "org.apache.spark" %% "spark-sql" % "1.6.1"
// for spark sqlcontext
libraryDependencies += "com.databricks" % "spark-csv_2.11" % "1.5.0"
// this is the show stealer,this is the dependency
required to parse and query csv data
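Putting these together, a minimal build.sbt might look like the sketch below. The project name and Scala version here are my assumptions; just make sure the Scala version matches the _2.11 suffix of the spark-csv artifact.
name := "spark-csv-demo" // hypothetical project name
scalaVersion := "2.11.8" // must match the _2.11 suffix of spark-csv
libraryDependencies ++= Seq(
  "org.apache.spark" %% "spark-core" % "1.6.1",
  "org.apache.spark" %% "spark-sql" % "1.6.1",
  "com.databricks" % "spark-csv_2.11" % "1.5.0"
)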
Alright, now we are done setting up the project and adding dependencies. We will be writing the code in steps so that it will be easier for you to understand each and every line.
val sparkConf = new SparkConf().setAppName("simpleReading").
setMaster("local[2]") //set spark configuration
val sparkContext = new SparkContext(sparkConf) // make spark context
val sqlContext = new SQLContext(sparkContext) // make sql context
val df = sqlContext.read.
format("com.databricks.spark.csv").
option("header","true").
option("inferSchema","true").
load("employee.csv") //load data from a file
val selectedCity = df.select("city") // select the "city" column from the file
selectedCity.write.format("com.databricks.spark.csv").
  option("header", "true").save("city.csv")
// save the selected data into a new CSV (without an explicit format, Spark
// writes Parquet; the output path must also differ from the input file)
val selectNameAndAge = df.select("name", "age") // select particular columns
selectNameAndAge.show() // show the selected columns
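Selections can also be combined with filters. For example, here is a hypothetical query assuming age was inferred as a numeric column:
val adults = df.filter(df("age") > 30) // keep only rows where age is greater than 30
adults.show()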
df.registerTempTable("my_table")
// registers the DataFrame as a temporary table named my_table
// (registerTempTable returns Unit, so there is nothing to assign to a val)
val usingSQL = sqlContext.sql("select * from my_table")
//select all the csv file's data in temp table
usingSQL.show() //show the temporary table using show function
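Once the table is registered, anything the Spark SQL dialect supports can be run against it. For instance, a hypothetical aggregation assuming a city column:
val byCity = sqlContext.sql("select city, count(*) as employees from my_table group by city")
byCity.show() // number of employees per city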
Final Code
import org.apache.spark.sql.SQLContext
import org.apache.spark.{SparkConf, SparkContext}
/**
* Created by deepak on 7/1/17.
*/
object SparkTest extends App {
val sparkConf = new SparkConf().setAppName("simpleReading").
setMaster("local[2]")
//set spark configuration
val sparkContext = new SparkContext(sparkConf)
// make spark context
val sqlContext = new SQLContext(sparkContext) // make sql context
val df = sqlContext.read.
format("com.databricks.spark.csv").
option("header", "true").
option("inferSchema", "true").load("employee.csv")
//load data from a file
val selectedCity = df.select("city")
selectedCity.write.format("com.databricks.spark.csv").
  option("header", "true").save("city.csv")
// save the selected data in a new CSV (not over the input file)
val selectNameAndAge = df.select("name", "age") // select particular columns
selectNameAndAge.show()
// show the name and age columns
df.registerTempTable("my_table")
// registers the DataFrame as a temporary table
val usingSQL = sqlContext
.sql("select * from my_table")
//show all the csv file's data in temp table
usingSQL.show()
}
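For reference, a minimal employee.csv to run this against could look like the following. This is hypothetical sample data; any file with name, age, and city columns will do.
name,age,city
Alice,34,Delhi
Bob,28,Mumbai
Carol,45,Delhi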
You can find the mini project with all the code here. If you face any challenges, do let me know in the comments. If you enjoyed this post, I'd be very grateful if you'd help it spread.
Keep smiling, Keep coding!