Create a Spark DataFrame: Read and Parse Multiple (Small) Files
We take a look at how to work with data sets made up of UTF-16 encoded files in Apache Spark using the Scala language.
This is a very specific use case we faced while loading small UTF-16 encoded CSV files in bulk. After reading data from the source location, one might see junk/special characters when calling DF.show(), which may lead to unexpected results.
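To illustrate the problem, here is roughly what the problematic default read looks like (a minimal sketch; sparkSession is assumed to be an existing SparkSession and folderPath the source directory of UTF-16 files):

// Hypothetical illustration only: the default CSV reader assumes UTF-8
val dfDefault = sparkSession.read.option("header", "true").csv(folderPath)
dfDefault.show() // UTF-16 bytes decoded as UTF-8 show up as junk/special characters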
So, below, I've shown how to read files in Spark that are not in the default UTF-8 encoding.
Solution: We used the code below to solve this; here, folderPath can be any HDFS location.
import java.nio.charset.StandardCharsets

// decode each small file's binary content as UTF-16
val rdd_file = sparkSession.sparkContext.binaryFiles(folderPath, 2)
  .mapValues(content => new String(content.toArray(), StandardCharsets.UTF_16))
// rdd_file is a paired RDD:
// val rdd_file: RDD[(String, String)] -> the first String is the file name,
// the second String is the file's content
// From rdd_file we can apply a map transformation on each file to get an RDD[List[Employee]]
val rddOfListPojo = rdd_file.map { x =>
  val fileData = x._2
  // parse fileData into a List[Employee] (listOfEmployee) according to the business logic
  listOfEmployee
}
// flatten RDD[List[Employee]] into RDD[Employee] and convert to a Dataset
import sparkSession.implicits._ // needed for .toDS()
val rddPojo = rddOfListPojo.flatMap(x => x)
val dsCombined = rddPojo.toDS()
Note: We have created a combined data set called dsCombined
(as mentioned above) from the small UTF-16 encoded files.
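To make the end-to-end flow concrete, here is a minimal, self-contained sketch of the same approach. The Employee case class, the CSV column layout (id, name, department), and the parseEmployees helper are hypothetical placeholders for the actual business logic.

import java.nio.charset.StandardCharsets
import org.apache.spark.sql.SparkSession

// Hypothetical schema for this example; adjust to the real file layout
case class Employee(id: Int, name: String, department: String)

object Utf16CsvToDataset {
  def main(args: Array[String]): Unit = {
    val sparkSession = SparkSession.builder()
      .appName("utf16-small-files")
      .master("local[*]") // local mode for the sketch; use the cluster master in practice
      .getOrCreate()
    import sparkSession.implicits._

    val folderPath = "hdfs:///data/employees" // any HDFS (or local) directory of small UTF-16 CSV files

    // (fileName, UTF-16 decoded content) pairs
    val rdd_file = sparkSession.sparkContext.binaryFiles(folderPath, 2)
      .mapValues(content => new String(content.toArray(), StandardCharsets.UTF_16))

    // Hypothetical parser: one Employee per non-empty, comma-separated line (assumes no header row)
    def parseEmployees(fileData: String): List[Employee] =
      fileData.split("\r?\n").toList
        .map(_.trim)
        .filter(_.nonEmpty)
        .map { line =>
          val cols = line.split(",").map(_.trim)
          Employee(cols(0).toInt, cols(1), cols(2))
        }

    val dsCombined = rdd_file
      .map { case (_, fileData) => parseEmployees(fileData) }
      .flatMap(identity)
      .toDS()

    dsCombined.show()
    sparkSession.stop()
  }
}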