Ingesting Fixed-Width Mainframe Files Into Delta Lake: The Details Nobody Writes Down

Process mainframe fixed-width files by transcoding EBCDIC, extracting fields with Spark, decoding packed decimals, and validating data before loading to Delta Lake.

May. 27, 26 · Analysis

Likes (0)

Comment

Save

2.6K Views

Fixed-width files from mainframe systems are one of those topics where there's very little written about the specifics and a lot of institutional knowledge living in people's heads or in COBOL copybooks that haven't been updated since 2003. If you're new to a financial organization and someone hands you a layout spec for a mainframe extract, this is for you.

What Makes Mainframe Fixed-Width Files Different

Unlike CSV, there are no delimiters. Each field occupies a fixed byte position in the record. A customer ID always occupies bytes 1-10, an account balance bytes 11-22. To read the file correctly, you need the layout spec: a document that maps field names to byte positions and data types. If the layout spec is wrong or out of date, you will read valid-looking garbage.

The encoding is often EBCDIC rather than ASCII or UTF-8. EBCDIC is IBM's character encoding, used by most mainframe systems, and it encodes characters differently from ASCII. A raw EBCDIC file opened in a text editor will look like random symbols. You need to transcode it to UTF-8 before processing.

Numeric fields often use packed decimal (COMP-3 in COBOL) encoding: a compact binary encoding for fixed-precision decimals, not the same as any standard binary integer or float format. Packed decimal needs to be decoded specifically; treating it as a string or a regular integer will produce wrong values without any error.

The Processing Pipeline

Our pipeline for mainframe fixed-width ingestion has four stages.

Stage 1 is transcoding. We receive the files as EBCDIC encoded binary. A preprocessing step using Python's codecs library converts EBCDIC to UTF-8 and writes the converted file to a staging area in ADLS Gen2. This is not a Spark step; it runs as a single-process Python job because Spark's text processing assumes UTF-8, and trying to do the transcoding inside Spark adds complexity that isn't necessary.

    Python
   
 

   # Stage 1: EBCDIC to UTF-8 transcoding
import codecs

def transcode_ebcdic(input_path: str, output_path: str) -> int:
    """Transcode EBCDIC file to UTF-8, return record count."""
    record_count = 0
    with open(input_path, 'rb') as src, open(output_path, 'w', encoding='utf-8') as dst:
        for line in src:
            decoded = line.decode('cp500')  # IBM EBCDIC code page 500
            dst.write(decoded)
            record_count += 1
    return record_count
  

Stage 2 is field extraction. We use Spark with a custom schema definition that maps each field to its byte offset, length, and data type. The schema comes from a YAML config file maintained alongside the pipeline code, the machine-readable version of the layout spec. The extraction logic uses Spark's substring function to pull each field by position and casts it to the appropriate type. For packed decimal fields, we have a custom UDF that implements the packed decimal decoding algorithm.

    Python
   
 

   # layout-spec.yaml - Customer master file layout
file:
  name: customer_master
  record_length: 250
  encoding: EBCDIC
  source_system: mainframe-core-banking
fields:
  - name: customer_id
    offset: 0
    length: 10
    type: string
  - name: customer_name
    offset: 10
    length: 40
    type: string
  - name: account_balance
    offset: 50
    length: 8
    type: packed_decimal
    scale: 2
    signed: true
  - name: open_date
    offset: 58
    length: 8
    type: date
    format: "YYYYMMDD"
  - name: branch_code
    offset: 66
    length: 5
    type: string
  - name: status_flag
    offset: 71
    length: 1
    type: string
# Stage 2: Field extraction using layout spec
from pyspark.sql import functions as F

def extract_fields(df, layout_config):
    """Extract fixed-width fields based on YAML layout spec."""
    for field in layout_config['fields']:
        col_name = field['name']
        offset = field['offset'] + 1  # Spark substring is 1-indexed
        length = field['length']

        if field['type'] == 'packed_decimal':
            df = df.withColumn(col_name,
                decode_packed_decimal(
                    F.substring(F.col('raw_record'), offset, length),
                    F.lit(field.get('scale', 0)),
                    F.lit(field.get('signed', False))
                ))
        elif field['type'] == 'date':
            df = df.withColumn(col_name,
                F.to_date(
                    F.trim(F.substring(F.col('raw_record'), offset, length)),
                    field['format']
                ))
        else:
            df = df.withColumn(col_name,
                F.trim(F.substring(F.col('raw_record'), offset, length)))
    return df
  

Stage 3 is validation and dead-lettering. We check for records where the total length doesn't match expectations, where mandatory fields are blank, and where numeric fields are outside valid ranges. Records that fail validation go to a dead-letter table with the raw record content and the validation failure reason. This has caught mainframe data quality issues that the upstream team wasn't aware of on three separate occasions.

Stage 4 is writing to the Delta Lake. Standard Delta Lake append with partition by business date. Nothing special here.

The Packed Decimal UDF

Packed decimal is worth explaining in detail because I've seen it handled incorrectly more than once. In COMP-3 encoding, each byte holds two BCD digits. The last nibble (half-byte) of the last byte is the sign: 0xC for positive, 0xD for negative, 0xF for unsigned. To decode: read each full byte, extract the two-digit nibbles, read the sign nibble from the last byte, assemble the digits into an integer, and apply the sign.

The precision (scale) of the decimal is defined in the COBOL copybook, not in the data itself. A field defined as PIC S9(7)V99 has 7 digits before the decimal point and 2 after. The decimal point is implied, not stored. You need the copybook to know where to place it.

Our UDF takes the raw bytes, the number of implied decimal places, and a boolean for whether the field is signed, and returns a Decimal. We test it against known values from the mainframe team before going live with any new file layout.

    Python
   
 

   # Packed decimal (COMP-3) UDF for Spark
from pyspark.sql.functions import udf
from pyspark.sql.types import DecimalType
from decimal import Decimal

@udf(returnType=DecimalType(15, 2))
def decode_packed_decimal(raw_bytes, scale, signed):
    """Decode COMP-3 packed decimal from raw byte string."""
    if raw_bytes is None:
        return None
    digits = []
    for i, byte_val in enumerate(raw_bytes):
        high = (byte_val >> 4) & 0x0F
        low = byte_val & 0x0F
        if i < len(raw_bytes) - 1:
            digits.extend([high, low])
        else:
            digits.append(high)  # last high nibble is final digit
            sign_nibble = low    # last low nibble is sign
    value = int(''.join(str(d) for d in digits))
    if signed and sign_nibble == 0x0D:
        value = -value
    return Decimal(value) / Decimal(10 ** scale)

  

Keeping the Layout Spec in Sync

The most persistent operational challenge is layout spec drift. The mainframe team occasionally adds or removes fields from an extract, or changes a field's byte length, without notifying the data engineering team. When this happens, our stage 2 extraction silently reads garbage for every field at or after the changed position.

We addressed this with a checksum approach. We compute a hash of the layout spec YAML and store it alongside each pipeline run in the run metadata table. We also compute a statistical fingerprint of the incoming file: record count, average record length, and distribution of a few stable fields. If the file fingerprint deviates significantly from the historical baseline, we alert before processing starts. It has caught three layout changes before they caused silent data corruption.

One Thing I Wish Someone Had Told Me Earlier

Get the actual binary file from the mainframe team to test with, not a 'sample' that's been opened in Notepad and re-saved. Transcoding a file through a text editor mangles EBCDIC encoding and packed decimal fields in ways that produce test files that don't represent actual production data. Every time I've been given a 'cleaned up' sample for testing, I've hit unexpected behavior in production that the sample didn't expose. Insist on raw binary extracts from the source system for testing.

DELTA (taxonomy) Data Types

Opinions expressed by DZone contributors are their own.

Related

Trending