
Create a Spark DataFrame: Read and Parse Multiple (Small) Files

We take a look at how to read and parse UTF-16 encoded files in Apache Spark using the Scala language, rather than relying on the default UTF-8 encoding.


This is a very specific use case we faced: loading many small, UTF-16 encoded CSV files in bulk. After reading data from such a source, you may see junk/special characters in the output of DF.show(), which can lead to unexpected results.

So, below, I've shown how to read a file in Spark using an encoding other than the default UTF-8.
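To see why decoding with the wrong charset produces junk characters, here is a minimal JVM-only sketch (no Spark required); the sample string is an assumption used purely for illustration:

```scala
import java.nio.charset.StandardCharsets

object Utf16Demo extends App {
  // Hypothetical sample content; stands in for a small CSV file's bytes
  val original = "id,name\n1,Alice"

  // Bytes as they would arrive from a UTF-16 encoded source file
  val utf16Bytes = original.getBytes(StandardCharsets.UTF_16)

  // Decoding with the correct charset round-trips cleanly
  val decodedRight = new String(utf16Bytes, StandardCharsets.UTF_16)
  assert(decodedRight == original)

  // Decoding the same bytes as UTF-8 yields the junk/special
  // characters described above
  val decodedWrong = new String(utf16Bytes, StandardCharsets.UTF_8)
  assert(decodedWrong != original)
}
```

This is exactly the decoding that the `new String(bytes, StandardCharsets.UTF_16)` call performs in the solution below, just applied to each file's byte content.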

Solution: We used the code below to solve this; here, folderPath can be any HDFS location:

import java.nio.charset.StandardCharsets

val rdd_file = sparkSession.sparkContext.binaryFiles(folderPath, 2)
  .mapValues(content => new String(content.toArray(), StandardCharsets.UTF_16))

// rdd_file is created as a paired RDD: RDD[(String, String)]
// the first String is the file name
// the second String is the String content of the file

// From rdd_file we can apply a map transformation to each file to get an RDD[List[Employee]]

val rddOfListPojo = rdd_file.map { x =>
  val fileData = x._2
  // parse fileData to get a List[Employee] w.r.t. business logic
  // (the parsing logic is elided in the original article)
  parseEmployees(fileData)
}

val rddPojo = rddOfListPojo.flatMap(x => x)

import sparkSession.implicits._ // required for toDS()
val dsCombined = rddPojo.toDS()

Note: As shown above, we have created a combined Dataset, dsCombined, from the small UTF-16 encoded files.
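The parsing step inside the map transformation is left to business logic above. As a point of reference, for a simple CSV layout it could look like the following sketch; the Employee fields and the column layout (id, name, salary) are assumptions, not from the article:

```scala
// Hypothetical POJO; field names and types are assumptions
case class Employee(id: Int, name: String, salary: Double)

object EmployeeParser {
  // Turn one file's full String content into a List[Employee],
  // skipping the header line and any blank lines
  def parse(fileData: String): List[Employee] =
    fileData.split("\n").toList
      .map(_.trim)
      .filter(line => line.nonEmpty && !line.startsWith("id,"))
      .map { line =>
        val Array(id, name, salary) = line.split(",").map(_.trim)
        Employee(id.toInt, name, salary.toDouble)
      }
}
```

With a parser like this in scope, the map step becomes `rdd_file.map { case (_, fileData) => EmployeeParser.parse(fileData) }`, yielding the RDD[List[Employee]] described above.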


