Apache Spark: Handle Null Timestamp While Reading CSV in Spark 2.0.0

When Spark 2.0.0 tries to read a CSV, it throws an error whenever it gets null values for the timestamp field. Learn how to solve that issue.

By Rishi Khandelwal · Jun. 04, 17 · Big Data Zone · Tutorial
In this blog, I will discuss a problem that I faced recently. Keep in mind that this problem is specific to Spark version 2.0.0; I did not encounter it in other versions.

Problem: My Spark code was reading a CSV file that had one timestamp column, and that column could contain null values. Whenever Spark hit a null value in the timestamp field while reading the CSV, it threw an error. I needed a solution that could handle null timestamp fields.

The code that produced the error is below:

import org.apache.spark.sql.SparkSession

object CsvReader extends App {

  // Local session for this proof of concept
  val sparkSession = SparkSession.builder()
    .master("local")
    .appName("POC")
    .getOrCreate()

  // Read the CSV and let Spark infer the column types
  val df = sparkSession.read
    .format("com.databricks.spark.csv")
    .option("header", "true")
    .option("inferSchema", "true")
    .load("test.csv")

  df.printSchema()
  df.show()
}

Note that the above code infers the schema while reading the CSV file.

Solution: To solve the above problem, we can take the following approach:

  1. Provide a custom schema where the timestamp field must be read as a string type.
  2. Cast the timestamp field explicitly.

This approach solves the null timestamp field issue. There is one caveat: we must know in advance which CSV field holds the timestamp, as well as the schema of the whole CSV file. Only then can we cast that field from string to timestamp explicitly while keeping the original schema for the rest of the file.

In my case, the input is a CSV file named test.csv.

The schema of the CSV file is:

  • ID: String
  • PHONE: Integer
  • BIRTH_DT: Timestamp

The solution code is as follows:

import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.{col, unix_timestamp}
import org.apache.spark.sql.types._

object CsvReader extends App {

  val sparkSession = SparkSession.builder()
    .master("local")
    .appName("POC")
    .getOrCreate()

  val schema = StructType(List(
    StructField("ID", StringType),
    StructField("PHONE", IntegerType),
    StructField("BIRTH_DT", StringType)
  ))

  val df = sparkSession.read
    .format("com.databricks.spark.csv")
    .schema(schema)
    .option("header", "true")
    .load("test.csv")

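  // BIRTH_DT was read as a string; parse it with the expected pattern and
  // cast the result to a timestamp. Null inputs stay null.
  val columnName = "BIRTH_DT"
  val updatedDF = df.withColumn(
    columnName,
    unix_timestamp(col(columnName), "yyyy-MM-dd HH:mm:ss").cast("timestamp")
  )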

  updatedDF.printSchema()
  updatedDF.show()
}
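
To check the cast without needing the CSV file, the same expression can be tried on a small in-memory DataFrame. The following is only a sketch: the object name NullTimestampCheck, the helper castToTimestamp, and the two sample rows are made up for illustration and are not part of my original code. The helper also folds over a list of column names, on the assumption that a file may have more than one timestamp column.

```scala
import org.apache.spark.sql.{DataFrame, SparkSession}
import org.apache.spark.sql.functions.{col, unix_timestamp}

object NullTimestampCheck {

  // Hypothetical helper: applies the string-to-timestamp cast from the
  // solution above to every column named in `columns`.
  def castToTimestamp(df: DataFrame, columns: Seq[String], pattern: String): DataFrame =
    columns.foldLeft(df) { (acc, name) =>
      acc.withColumn(name, unix_timestamp(col(name), pattern).cast("timestamp"))
    }

  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .master("local")
      .appName("NullTimestampCheck")
      .getOrCreate()
    import spark.implicits._

    // Made-up rows standing in for test.csv; the second BIRTH_DT is missing.
    val rows: Seq[(String, Int, Option[String])] = Seq(
      ("1", 12345, Some("2017-06-04 10:15:30")),
      ("2", 67890, None)
    )
    val raw = rows.toDF("ID", "PHONE", "BIRTH_DT")

    val casted = castToTimestamp(raw, Seq("BIRTH_DT"), "yyyy-MM-dd HH:mm:ss")

    // BIRTH_DT should now be reported as a timestamp column, and the
    // missing value should still be null rather than causing an error.
    casted.printSchema()
    casted.show()

    spark.stop()
  }
}
```

If the timestamps in your file use a different layout, pass the matching pattern as the third argument instead of "yyyy-MM-dd HH:mm:ss".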

That’s it. I hope this blog is helpful to you!

Published at DZone with permission of Rishi Khandelwal, DZone MVB. See the original article here.
