Ingesting Fixed-Width Mainframe Files Into Delta Lake: The Details Nobody Writes Down
Process mainframe fixed-width files by transcoding EBCDIC, extracting fields with Spark, decoding packed decimals, and validating data before loading to Delta Lake.
Join the DZone community and get the full member experience.
Join For FreeFixed-width files from mainframe systems are one of those topics where there's very little written about the specifics and a lot of institutional knowledge living in people's heads or in COBOL copybooks that haven't been updated since 2003. If you're new to a financial organization and someone hands you a layout spec for a mainframe extract, this is for you.
What Makes Mainframe Fixed-Width Files Different
Unlike CSV, there are no delimiters. Each field occupies a fixed byte position in the record. A customer ID always occupies bytes 1-10, an account balance bytes 11-22. To read the file correctly, you need the layout spec: a document that maps field names to byte positions and data types. If the layout spec is wrong or out of date, you will read valid-looking garbage.
The encoding is often EBCDIC rather than ASCII or UTF-8. EBCDIC is IBM's character encoding, used by most mainframe systems, and it encodes characters differently from ASCII. A raw EBCDIC file opened in a text editor will look like random symbols. You need to transcode it to UTF-8 before processing.
Numeric fields often use packed decimal (COMP-3 in COBOL) encoding: a compact binary encoding for fixed-precision decimals, not the same as any standard binary integer or float format. Packed decimal needs to be decoded specifically; treating it as a string or a regular integer will produce wrong values without any error.
The Processing Pipeline
Our pipeline for mainframe fixed-width ingestion has four stages.
Stage 1 is transcoding. We receive the files as EBCDIC encoded binary. A preprocessing step using Python's codecs library converts EBCDIC to UTF-8 and writes the converted file to a staging area in ADLS Gen2. This is not a Spark step; it runs as a single-process Python job because Spark's text processing assumes UTF-8, and trying to do the transcoding inside Spark adds complexity that isn't necessary.
# Stage 1: EBCDIC to UTF-8 transcoding
import codecs
def transcode_ebcdic(input_path: str, output_path: str) -> int:
"""Transcode EBCDIC file to UTF-8, return record count."""
record_count = 0
with open(input_path, 'rb') as src, open(output_path, 'w', encoding='utf-8') as dst:
for line in src:
decoded = line.decode('cp500') # IBM EBCDIC code page 500
dst.write(decoded)
record_count += 1
return record_count
Stage 2 is field extraction. We use Spark with a custom schema definition that maps each field to its byte offset, length, and data type. The schema comes from a YAML config file maintained alongside the pipeline code, the machine-readable version of the layout spec. The extraction logic uses Spark's substring function to pull each field by position and casts it to the appropriate type. For packed decimal fields, we have a custom UDF that implements the packed decimal decoding algorithm.
# layout-spec.yaml - Customer master file layout
file:
name: customer_master
record_length: 250
encoding: EBCDIC
source_system: mainframe-core-banking
fields:
- name: customer_id
offset: 0
length: 10
type: string
- name: customer_name
offset: 10
length: 40
type: string
- name: account_balance
offset: 50
length: 8
type: packed_decimal
scale: 2
signed: true
- name: open_date
offset: 58
length: 8
type: date
format: "YYYYMMDD"
- name: branch_code
offset: 66
length: 5
type: string
- name: status_flag
offset: 71
length: 1
type: string
# Stage 2: Field extraction using layout spec
from pyspark.sql import functions as F
def extract_fields(df, layout_config):
"""Extract fixed-width fields based on YAML layout spec."""
for field in layout_config['fields']:
col_name = field['name']
offset = field['offset'] + 1 # Spark substring is 1-indexed
length = field['length']
if field['type'] == 'packed_decimal':
df = df.withColumn(col_name,
decode_packed_decimal(
F.substring(F.col('raw_record'), offset, length),
F.lit(field.get('scale', 0)),
F.lit(field.get('signed', False))
))
elif field['type'] == 'date':
df = df.withColumn(col_name,
F.to_date(
F.trim(F.substring(F.col('raw_record'), offset, length)),
field['format']
))
else:
df = df.withColumn(col_name,
F.trim(F.substring(F.col('raw_record'), offset, length)))
return df
Stage 3 is validation and dead-lettering. We check for records where the total length doesn't match expectations, where mandatory fields are blank, and where numeric fields are outside valid ranges. Records that fail validation go to a dead-letter table with the raw record content and the validation failure reason. This has caught mainframe data quality issues that the upstream team wasn't aware of on three separate occasions.
Stage 4 is writing to the Delta Lake. Standard Delta Lake append with partition by business date. Nothing special here.
The Packed Decimal UDF
Packed decimal is worth explaining in detail because I've seen it handled incorrectly more than once. In COMP-3 encoding, each byte holds two BCD digits. The last nibble (half-byte) of the last byte is the sign: 0xC for positive, 0xD for negative, 0xF for unsigned. To decode: read each full byte, extract the two-digit nibbles, read the sign nibble from the last byte, assemble the digits into an integer, and apply the sign.
The precision (scale) of the decimal is defined in the COBOL copybook, not in the data itself. A field defined as PIC S9(7)V99 has 7 digits before the decimal point and 2 after. The decimal point is implied, not stored. You need the copybook to know where to place it.
Our UDF takes the raw bytes, the number of implied decimal places, and a boolean for whether the field is signed, and returns a Decimal. We test it against known values from the mainframe team before going live with any new file layout.
# Packed decimal (COMP-3) UDF for Spark
from pyspark.sql.functions import udf
from pyspark.sql.types import DecimalType
from decimal import Decimal
@udf(returnType=DecimalType(15, 2))
def decode_packed_decimal(raw_bytes, scale, signed):
"""Decode COMP-3 packed decimal from raw byte string."""
if raw_bytes is None:
return None
digits = []
for i, byte_val in enumerate(raw_bytes):
high = (byte_val >> 4) & 0x0F
low = byte_val & 0x0F
if i < len(raw_bytes) - 1:
digits.extend([high, low])
else:
digits.append(high) # last high nibble is final digit
sign_nibble = low # last low nibble is sign
value = int(''.join(str(d) for d in digits))
if signed and sign_nibble == 0x0D:
value = -value
return Decimal(value) / Decimal(10 ** scale)
Keeping the Layout Spec in Sync
The most persistent operational challenge is layout spec drift. The mainframe team occasionally adds or removes fields from an extract, or changes a field's byte length, without notifying the data engineering team. When this happens, our stage 2 extraction silently reads garbage for every field at or after the changed position.
We addressed this with a checksum approach. We compute a hash of the layout spec YAML and store it alongside each pipeline run in the run metadata table. We also compute a statistical fingerprint of the incoming file: record count, average record length, and distribution of a few stable fields. If the file fingerprint deviates significantly from the historical baseline, we alert before processing starts. It has caught three layout changes before they caused silent data corruption.
One Thing I Wish Someone Had Told Me Earlier
Get the actual binary file from the mainframe team to test with, not a 'sample' that's been opened in Notepad and re-saved. Transcoding a file through a text editor mangles EBCDIC encoding and packed decimal fields in ways that produce test files that don't represent actual production data. Every time I've been given a 'cleaned up' sample for testing, I've hit unexpected behavior in production that the sample didn't expose. Insist on raw binary extracts from the source system for testing.
Opinions expressed by DZone contributors are their own.
Comments