
Parsing and Querying CSVs With Apache Spark

Apache Spark is at the center of Big Data analytics, and this post provides the spark to begin your Big Data journey. Read on to learn how to ingest a CSV data file into Apache Spark.

By Deepak Mehra · Jan. 12, 17 · Tutorial

In this post, I will be sharing how to parse and query CSV data using Apache Spark. Querying CSV data is very easy with the Spark CSV library. For that, we will be using SQLContext. With SQLContext, we can query the data just like we would in any database, performing the usual operations (such as SELECT) and writing the results out to a new file.
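The examples below read a file named employee.csv. Its contents aren't shown here, so assume a hypothetical sample with the name, age, and city columns that the code selects later:

name,age,city
Alice,34,Delhi
Bob,17,Mumbai
Carol,45,Bangalore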

Alright, enough talking. Let's see how we can do it!

We will set up an SBT project first and then add the required dependencies to build.sbt.

For this project, we will require three major dependencies:

libraryDependencies += "org.apache.spark" %% "spark-core" % "1.6.1"
// Spark core for Spark 1.6
libraryDependencies += "org.apache.spark" %% "spark-sql" % "1.6.1"
// Spark SQL, which provides SQLContext
libraryDependencies += "com.databricks" % "spark-csv_2.11" % "1.5.0"
// the star of the show: the dependency that parses and queries CSV data
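For reference, a minimal complete build.sbt might look like the following sketch. The project name and scalaVersion are assumptions; the Scala version must match the _2.11 suffix on the spark-csv artifact:

name := "spark-csv-demo" // hypothetical project name
version := "1.0"
scalaVersion := "2.11.8" // assumed; must match the _2.11 artifact suffix above

libraryDependencies ++= Seq(
  "org.apache.spark" %% "spark-core" % "1.6.1",
  "org.apache.spark" %% "spark-sql" % "1.6.1",
  "com.databricks" % "spark-csv_2.11" % "1.5.0"
)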

Alright, now we are done setting up the project and adding the dependencies. We will write the code in steps so that each line is easy to follow.

val sparkConf = new SparkConf().
  setAppName("simpleReading").
  setMaster("local[2]") // set the Spark configuration; local[2] runs locally with two worker threads


val sparkContext = new SparkContext(sparkConf) // create the Spark context

val sqlContext = new SQLContext(sparkContext) // create the SQL context


val df = sqlContext.read.
    format("com.databricks.spark.csv").
    option("header", "true").      // the first row contains column names
    option("inferSchema", "true"). // infer column types from the data
    load("employee.csv")           // load the data from the file


val selectedCity = df.select("city")
selectedCity.write.
  format("com.databricks.spark.csv").
  option("header", "true").
  save("cities.csv")
// select the "city" column and save it to a new CSV. The write must specify
// the CSV format (the default output format is Parquet), and the output path
// must not already exist, so use a new path rather than the input file.
// Spark writes a directory of part files, not a single CSV file.


val selectNameAndAge = df.select("name", "age") // select particular columns
selectNameAndAge.show() // show the selected columns
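The DataFrame API can also filter rows without any SQL; for example, still assuming an age column:

selectNameAndAge.filter("age > 30").show() // keep only rows where age is greater than 30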


df.registerTempTable("my_table")
// register the DataFrame as a temporary table named my_table
// (registerTempTable returns Unit, so there is nothing to assign to a val)
val usingSQL = sqlContext.sql("select * from my_table")
// select all of the CSV file's data from the temp table
usingSQL.show() // display the query result
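Any SQL you would run against a regular database table works on the temp table, too. For instance, a more selective query (again assuming the hypothetical name, city, and age columns):

val adults = sqlContext.sql("select name, city from my_table where age >= 18")
adults.show() // only rows where age is at least 18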

Final Code

import org.apache.spark.sql.SQLContext
import org.apache.spark.{SparkConf, SparkContext}

/**
 * Created by deepak on 7/1/17.
 */
object SparkTest extends App {
  val sparkConf = new SparkConf().
    setAppName("simpleReading").
    setMaster("local[2]") // set the Spark configuration
  val sparkContext = new SparkContext(sparkConf) // create the Spark context
  val sqlContext = new SQLContext(sparkContext)  // create the SQL context

  val df = sqlContext.read.
    format("com.databricks.spark.csv").
    option("header", "true").
    option("inferSchema", "true").
    load("employee.csv") // load the data from the file

  val selectedCity = df.select("city")
  selectedCity.write.
    format("com.databricks.spark.csv").
    option("header", "true").
    save("cities.csv") // save the "city" column to a new CSV (a directory of part files; the path must not already exist)

  val selectNameAndAge = df.select("name", "age") // select particular columns
  selectNameAndAge.show() // show the selected columns

  df.registerTempTable("my_table") // register a temporary table
  val usingSQL = sqlContext.sql("select * from my_table") // query all of the CSV file's data from the temp table
  usingSQL.show()
}
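Since the master is set to local[2], you should be able to run this directly with sbt run from the project root; no Spark cluster or installation is needed.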

You can find the mini project with all the code at this link: Click here. If you face any challenges, do let me know in the comments. If you enjoyed this post, I'd be very grateful if you'd help it spread.

Keep smiling, Keep coding!


Published at DZone with permission of Deepak Mehra, DZone MVB. See the original article here.
