DZone
Thanks for visiting DZone today,
Edit Profile
  • Manage Email Subscriptions
  • How to Post to DZone
  • Article Submission Guidelines
Sign Out View Profile
  • Post an Article
  • Manage My Drafts
Over 2 million developers have joined DZone.
Log In / Join
Refcards Trend Reports
Events Video Library
Refcards
Trend Reports

Events

View Events Video Library

Related

  • Boost Your Spark Jobs: How Photon Accelerates Apache Spark Performance
  • Apache Spark 3 to Apache Spark 4 Migration: What Breaks, What Improves, What's Mandatory
  • Hadoop on AmpereOne Reference Architecture
  • End-to-End Streaming Optimization: Kafka to Delta With Exactly-Once Guarantees

Trending

  • Exactly-Once Processing: Myth vs Reality
  • Stateless JWT Auth Microservice Architecture With Spring Boot 3 and Redis Sentinel
  • The Invisible OOMKill: Why Your Java Pod Keeps Restarting in Kubernetes
  • What Nobody Tells You About Multimodal Data Pipelines for AI Training
  1. DZone
  2. Data Engineering
  3. Big Data
  4. How to Process Nasty Fixed Width Files Using Apache Spark

How to Process Nasty Fixed Width Files Using Apache Spark

Fixed width files can be a pain. Learn how to get around the issues they present using Scala and Spark.

By 
Subhasish Guha user avatar
Subhasish Guha
·
Apr. 19, 19 · Tutorial
Likes (6)
Comment
Save
Tweet
Share
30.3K Views

Join the DZone community and get the full member experience.

Join For Free

A fixed width file is a very common flat file format when working with SAP, Mainframe, and Web Logs. Converting the data into a dataframe using metadata is always a challenge for Spark Developers. This particular article talks about all kinds of typical scenarios that a developer might face while working with a fixed witdth file. This solution is generic to any fixed width file and very easy to implement. This also takes care of the Tail Safe Stack as the RDD gets into the foldLeft operator.

Let's check the source file first and then the metadata file:

 10597    Subhasish Guha                                    [email protected]                 27           TXNPUES                    Yes          5007.10
191675    Ritam Mukherjee                                   [email protected]                         29 SINGLE    OUNHQEX XUFQONY            No 01/14/20133172.53   43537.00  
          Saurav Agarwal                                    [email protected]                 30 MARRIED                                           100000.00 7000.00
 80495    Priyanshu Kumar                                   [email protected]                45 MARRIED   XDZANTV                    No 11/21/2012          

The source file has multiple issues:

  1. The end field does not have all the spaces.

  2. The last two fields are not present.

  3. In between fields, a few things are not present.

The metadata file looks like below:

col_name,size
account_id,10
name,50
mail,40
age,3
status,10
desc,27
cflow,3
process_date,10
debit_amt,10
credit_amt,10

Now please look at the generic code which could load the data in a dataframe:

import org.apache.spark.sql.SparkSession
import org.apache.spark.sql._
import org.apache.spark.sql.functions._
import org.apache.spark.sql.types._
import org.apache.spark.sql.SQLContext
import org.apache.spark.sql.DataFrame
import org.apache.spark.sql.functions.{lit,col,concat_ws,split,struct}
import org.apache.spark.sql.catalyst.expressions.Explode
import org.apache.spark.sql.types.{StructType,StructField,StringType};
import org.apache.spark.rdd.RDD
import org.apache.spark.sql.AnalysisException
import org.apache.spark.sql.types.StructType
import org.apache.spark.sql.types.StructField
import org.apache.spark.sql.types.StringType
import org.apache.spark.sql.Row
import org.apache.spark.sql.types.LongType
import org.apache.spark.sql.functions.{lower, upper}
import org.apache.spark.sql.functions.col
import org.apache.spark.sql.functions.{input_file_name, regexp_extract}
import org.apache.spark.sql.functions._
import org.apache.spark.SparkContext
import org.apache.spark.SparkConf
import scala.reflect.ClassTag

object FixedWidth {


def lsplit(pos: List[Int], str: String): Row = {

val (rest, result) = pos.foldLeft((str, List[String]())) {
case ((s, res),curr) =>
if(s.length()<=curr)
{
val split=s.substring(0).trim()
val rest=""
(rest, split :: res)
}
else if(s.length()>curr)
{
val split=s.substring(0, curr).trim()
val rest=s.substring(curr)
(rest, split :: res)
}
else
{
val split=""
val rest=""
(rest, split :: res)
}
}
Row.fromSeq(result.reverse)
}



def main(args: Array[String]): Unit = {
System.setProperty("hadoop.home.dir", "C://winutils//")
val spark = SparkSession.builder().master("yarn").config("spark.sql.warehouse.dir", "file:///C://Personalized//").getOrCreate()
val rdd= spark.sparkContext.textFile("path of your file")
val metadata=spark.read.option("header", "true").csv("path of your Metadata File")
val header=metadata.select("col_name").rdd.map(x=>x.getString(0).trim()).collect()
val sizeOfColumn=metadata.select("size").rdd.map(x=>x.getString(0).trim()).collect().map(_.toInt).toList
val fields = header.map(fieldName => StructField(fieldName, StringType, nullable = true)) 
val schema = StructType(fields)
val df=spark.createDataFrame(rdd.map { x => lsplit(sizeOfColumn,x) }, schema)
df.show(false)
}
}

The output of this code looks like what I've got below. I hope this helps all the developers who are handling this kind of file and facing some problems. This particular code will handle almost all possible discripencies which we face.

+----------+---------------+------------------------+---+-------+---------------+-----+------------+---------+----------+
|account_id|name           |mail                    |age|status |desc           |cflow|process_date|debit_amt|credit_amt|
+----------+---------------+------------------------+---+-------+---------------+-----+------------+---------+----------+
|10597     |Subhasish Guha |[email protected] |27 |       |TXNPUES        |Yes  |            |5007.10  |          |
|191675    |Ritam Mukherjee|[email protected]         |29 |SINGLE |OUNHQEX XUFQONY|No   |01/14/2013  |3172.53  |43537.00  |
|          |Saurav Agarwal |[email protected] |30 |MARRIED|               |     |            |100000.00|7000.00   |
|80495     |Priyanshu Kumar|[email protected]|45 |MARRIED|XDZANTV        |No   |11/21/2012  |         |          |
+----------+---------------+------------------------+---+-------+---------------+-----+------------+---------+----------+
Apache Spark

Opinions expressed by DZone contributors are their own.

Related

  • Boost Your Spark Jobs: How Photon Accelerates Apache Spark Performance
  • Apache Spark 3 to Apache Spark 4 Migration: What Breaks, What Improves, What's Mandatory
  • Hadoop on AmpereOne Reference Architecture
  • End-to-End Streaming Optimization: Kafka to Delta With Exactly-Once Guarantees

Partner Resources

×

Comments

The likes didn't load as expected. Please refresh the page and try again.

  • RSS
  • X
  • Facebook

ABOUT US

  • About DZone
  • Support and feedback
  • Community research

ADVERTISE

  • Advertise with DZone

CONTRIBUTE ON DZONE

  • Article Submission Guidelines
  • Become a Contributor
  • Core Program
  • Visit the Writers' Zone

LEGAL

  • Terms of Service
  • Privacy Policy

CONTACT US

  • 3343 Perimeter Hill Drive
  • Suite 215
  • Nashville, TN 37211
  • [email protected]

Let's be friends:

  • RSS
  • X
  • Facebook