Wait, What Format Is That? A Cross-Domain Guide for Everyone

Tech leaders can’t afford to say, 'what format is that?' Decode data formats across domains, make sharper decisions, and lead without getting lost in translation.

By Syamanthaka B · Aug. 27, 25 · Tutorial

Are you an engineering or technology leader who ends up looking up “what’s that file format?” in the middle of a meeting, while everyone else throws around file format jargon?

Are you an architect who has switched domains only to discover an entire jungle of unfamiliar file formats that you now need to integrate into the solution you are building?

Are you a data engineer who jumps across projects, has to build data pipelines and keeps encountering an overwhelming number of file types?

Are you a budding software engineer, in your first job, trying to make sense of how to read a particular type of file?

If you’ve answered yes to any of these, this article is for you. 

Once upon a clock cycle, file types were simple: a Word file with a .doc extension, an Excel workbook with an .xls extension and an occasional text file with a .txt extension. Then, one fine kernel tick later, the floodgates opened and a flurry of file formats poured in. Domain-specific, human-readable, machine-readable and now even AI-readable file types and extensions exist.

Knowing how each of these works, and how to work with them, is essential to making better sense of your data and turning it into something useful. One small glitch in the abstraction can derail the entire pipeline. 

Every industry, every domain, has its own specifications, rules and regulations and evolving standards. 

Well, to be fair, there are a gazillion file types out there. But in this article, I humbly try to explain a select few, the ones worth understanding across roles and domains. 

Section 1: Generic Data Formats

Before we dive deep into domain specific formats, let us look at some universal ones. 

The generic formats form the backbone of almost every tech stack. Whether you are building APIs, microservices or running analytics models, you will encounter them. Some of them may seem simple, but they still call for thoughtful design so that the way they are used does not hurt performance, readability or ease of integration.

CSV

Stands for Comma-Separated Values. You’ve seen this file many times, it's everywhere. You’ve most probably even opened it in Excel.

Format – Each row is a line, and values are separated by commas. An optional first row holds the column headers.

Uses – reports, data exchange between systems.

Benefits 

  • Lightweight
  • Easy to read and write in most programming languages
  • Works across platforms

Potential issues  

  • Column order matters, because a CSV carries no schema of its own.
  • The delimiter (,) can appear inside the data. Such values must be enclosed in double quotes so that the parser does not split them.
  • Encoding inconsistencies – we have all seen those weird characters in a CSV at least a few times. They appear when the file is read with a different encoding from the one it was saved in, e.g. saved as UTF-8 but opened as Windows-1252.

Architect’s tip – CSVs are great for easy integrations or one-time data exchanges. But if versioning or a fixed structure is expected, it's best not to use CSV. If you must, add metadata and validations as required.

Sl number,name,classification,weight per unit,color,cities produced
1,Apple,Fruit,40,red,"Paris,New York,Mumbai"
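The quoting and encoding pitfalls above can be sketched with Python's standard csv module. This is a minimal illustration rather than a full ingestion routine: the helper name read_csv_rows, the sample contents and the choice of UTF-8 are assumptions made for the example.

```python
import csv
import os
import tempfile


def read_csv_rows(path: str, encoding: str = "utf-8") -> list[dict[str, str]]:
    """Read a CSV file into a list of dicts keyed by the header row."""
    # newline="" lets the csv module handle line endings inside quoted values.
    # Naming the encoding explicitly avoids relying on the platform default.
    with open(path, newline="", encoding=encoding) as handle:
        return list(csv.DictReader(handle))


# Hypothetical sample: the last value contains commas, so it is quoted.
SAMPLE = (
    "sl_number,name,classification,weight_per_unit,color,cities_produced\n"
    '1,Apple,Fruit,40,red,"Paris,New York,Mumbai"\n'
)

# Write the sample to a temporary file so the reader can be exercised end to end.
with tempfile.NamedTemporaryFile(
    "w", suffix=".csv", delete=False, encoding="utf-8", newline=""
) as tmp:
    tmp.write(SAMPLE)
    sample_path = tmp.name

rows = read_csv_rows(sample_path)
os.remove(sample_path)

# The quoted field arrives as a single value; splitting it is an application choice.
cities = rows[0]["cities_produced"].split(",")
print(rows[0]["name"], cities)
```

Note that every value comes back as a string. Converting weight_per_unit to a number is exactly the kind of validation a schema-less format leaves to you.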

JSON

Stands for JavaScript Object Notation. It is structured and meant to be as human-readable as it is machine-readable. It is the most widely preferred format for frontend-to-backend communication in web applications, and the preferred payload format for REST APIs, as you will have noticed if you have inspected an API in a tool like Postman. Data is organized as key-value pairs, which can be nested.

Format – A record is an object enclosed in curly braces {} that holds key-value pairs. Multiple records are typically listed as objects inside an array []. Objects can be nested inside other objects, and values can be strings, numbers, booleans, null, arrays or further objects.

Uses – config files, API payloads, NoSQL document databases

Benefits  

  • Human and machine readable
  • Supported by virtually every programming language
  • Easy to parse in Python, JavaScript and more

Potential issues 

  • With many levels of nesting, it becomes “not-so” human-readable after all.
  • The schema is not strictly enforced, which may pose challenges in some use cases.
  • Keys are repeated in every record, so the file gets bulkier and slower to parse as the number of records grows.

Architect’s tip – Very flexible, but the program needs error handling around parsing. Validating the input with JSON Schema or an OpenAPI definition is highly recommended.

JSON

{
    "Sl_number": 1,
    "Name": "Apple",
    "Classification": "Fruit",
    "Weight_per_unit": 40,
    "Color": "Red",
    "Cities_produced": ["Paris", "New York", "Mumbai"]
}
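To make the tip about parsing errors and validation concrete, here is a small sketch that uses only Python's standard json module. The hand-written REQUIRED_FIELDS check is a stand-in for a real JSON Schema validator, and the function name parse_record and the field list are assumptions made for this example.

```python
import json

# Hypothetical contract for the fruit record: key name -> accepted Python type(s).
REQUIRED_FIELDS = {
    "Sl_number": int,
    "Name": str,
    "Classification": str,
    "Weight_per_unit": (int, float),
    "Cities_produced": list,
}


def parse_record(payload: str) -> dict:
    """Parse a JSON object and check required keys and basic types."""
    try:
        record = json.loads(payload)
    except json.JSONDecodeError as exc:
        # Report where the parser gave up instead of letting the error leak as-is.
        raise ValueError(
            f"Malformed JSON at line {exc.lineno}, column {exc.colno}: {exc.msg}"
        ) from exc

    if not isinstance(record, dict):
        raise ValueError("Expected a JSON object at the top level")

    for key, expected_type in REQUIRED_FIELDS.items():
        if key not in record:
            raise ValueError(f"Missing required field: {key}")
        if not isinstance(record[key], expected_type):
            raise ValueError(f"Field {key} has an unexpected type")
    return record


good = parse_record(
    '{"Sl_number": 1, "Name": "Apple", "Classification": "Fruit", '
    '"Weight_per_unit": 40, "Cities_produced": ["Paris", "New York", "Mumbai"]}'
)
print(good["Cities_produced"])

try:
    parse_record('{"Sl_number": 1, "Name": Apple}')  # string value is not quoted
except ValueError as err:
    print("Rejected:", err)
```

In a real service, a schema validator would likely replace the manual loop, but the shape stays the same: parse defensively, validate, and only then hand the record to business logic.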


YAML 

The name is quite unique and trendsetting, to be honest! YAML originally stood for Yet Another Markup Language and is now officially the recursive acronym “YAML Ain’t Markup Language”. It is primarily used for configuration and, like JSON, is human-readable. Tools like GitHub Actions and Kubernetes use it to set up configurations easily. One could say it addresses some of the challenges that JSON poses and simplifies the format.

Format – Key-value pairs in an indentation-based structure. There are no curly braces or commas to manage, but the caveat is that it is sensitive to whitespace.

Uses – configuration settings in Kubernetes, CI/CD pipelines and Infrastructure as Code tools such as Ansible.

Benefits 

  • Most readable of all the formats mentioned above
  • Comments are possible, making it super versatile.
  • Suitable for configurations that need to be structured. 

Potential issues 

  • Whitespace and indentation sensitivity: a single misplaced space can break the whole file.
  • With too many nesting layers, it can get extremely cumbersome to validate.

Architect’s tip – Ahem, it's more suited to humans than machines. It is best avoided for data; use it for config and config-like use cases. It is a good idea to add a YAML linter to your IDE so that whitespace trouble is caught early.

YAML

Fridge_routine_post_shopping:
  Classification: "Decide if fruit or veg"
  Shelf_selection:
    Fruit: "Door shelf"
    Vegetables: "Fresh Draw"
  Cooking:
    Fruit: "Not required"
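The whitespace trouble mentioned in the tip can be caught with even a very small check. The sketch below is a rough heuristic in plain Python, not a replacement for a real YAML linter: it assumes a two-space indentation convention, and the function name find_indent_problems is made up for this example. It will also misreport multi-line block scalars, whose content can legitimately be indented differently.

```python
def find_indent_problems(text: str, indent_width: int = 2) -> list[tuple[int, str]]:
    """Return (line_number, message) pairs for suspicious leading whitespace."""
    problems = []
    for number, line in enumerate(text.splitlines(), start=1):
        stripped = line.lstrip(" \t")
        if not stripped or stripped.startswith("#"):
            continue  # blank lines and comment-only lines carry no structure
        leading = line[: len(line) - len(stripped)]
        if "\t" in leading:
            # YAML does not allow tab characters for indentation.
            problems.append((number, "tab used for indentation"))
        elif len(leading) % indent_width != 0:
            problems.append(
                (number, f"indent of {len(leading)} is not a multiple of {indent_width}")
            )
    return problems


GOOD = "fridge:\n  shelf:\n    fruit: door\n"
BAD = "fridge:\n   shelf:\n\tfruit: door\n"

print(find_indent_problems(GOOD))
for line_number, message in find_indent_problems(BAD):
    print(f"line {line_number}: {message}")
```

A check like this fits naturally in a pre-commit hook or CI step, so a broken config never reaches the cluster.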


A Ready Reference for Generic File Formats

| File Format | Human Readable | Machine Readable | Use Cases | Beware Of |
| --- | --- | --- | --- | --- |
| CSV | Low | High | Reports, cross-platform data exchange | Separators in values |
| JSON | Medium | High | APIs, configs, NoSQL data | File bulk as records are added |
| YAML | High | High | Configs, IaC | Whitespace, indentation |

Section 2: Healthcare Formats

HL7 

Picture this. You go to a hospital for an illness. They put you through a set of tests. Maybe you are an in-patient for a few days and some medications are administered. Now maybe you didn’t quite like the treatment there and want to go to some other hospital or clinic. You go elsewhere, only to find that they want to repeat all the tests all over again. And you are thinking, “if only there was a way to transfer all those records here somehow”. You’d think just handing over those documents might suffice, but the new hospital maintains its records in a different way, and your data is not compatible with its format unless someone spends a considerable amount of time converting it. Well, that’s exactly where HL7 steps in.

So, in a sentence, HL7 (Health Level Seven) is a messaging standard that makes the transfer of medical records across departments or institutions seamless. It has evolved through several versions, which is what the v(n) indicates.

Format  

V2 – The most commonly used version. This is a delimiter-based messaging format.

V3 – XML-based. In practice, v2 remains the more widely used of the two.

FHIR – pronounced “fire” – stands for Fast Healthcare Interoperability Resources and is the modern HL7 standard. It exposes data as resources over RESTful APIs, encoded as JSON or XML, and is best suited to web applications and APIs.
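To give a feel for what “delimiter-based” means in v2, here is a small Python sketch that splits a message into segments, fields and components. The sample message is invented, the helper names parse_segments and get_field are assumptions, and the code hard-codes the conventional delimiters (| for fields, ^ for components) instead of reading them from the MSH segment as a real parser would.

```python
# Hypothetical two-segment message; segments are separated by carriage returns.
SAMPLE_MESSAGE = "\r".join([
    r"MSH|^~\&|SENDAPP|SENDFAC|RECVAPP|RECVFAC|20250827120000||ADT^A01|MSG00001|P|2.5",
    "PID|1||123456^^^HOSP^MR||DOE^JANE||19800101|F",
])


def parse_segments(message: str) -> dict[str, list[str]]:
    """Split an HL7 v2 message into segments keyed by segment ID.

    Only the first occurrence of each segment is kept, which is enough for
    this illustration but not for repeating segments such as OBX.
    """
    segments: dict[str, list[str]] = {}
    for raw in message.replace("\n", "\r").split("\r"):
        if raw.strip():
            fields = raw.split("|")
            segments.setdefault(fields[0], fields)
    return segments


def get_field(segments: dict[str, list[str]], segment_id: str, field_number: int) -> str:
    """Return field N of a segment using HL7's 1-based field numbering."""
    fields = segments[segment_id]
    # In MSH the field separator itself counts as MSH-1, so the list produced
    # by split("|") is shifted by one compared with every other segment.
    index = field_number - 1 if segment_id == "MSH" else field_number
    return fields[index] if index < len(fields) else ""


segments = parse_segments(SAMPLE_MESSAGE)
message_type = get_field(segments, "MSH", 9).split("^")       # message code and trigger event
family, given = get_field(segments, "PID", 5).split("^")[:2]  # patient name components
print(message_type, family, given)
```

Even this toy version shows why raw v2 is hard to debug: meaning depends entirely on position, so an extra or missing delimiter silently shifts every field after it.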

Potential issues  

  • Difficult to debug
  • Vendor-specific implementations lead to inconsistencies.
  • Validations and schemas are not rigid.

Architect’s tip – Don’t feed raw HL7 messages into the core system; handle them in the integration layer instead. Build in validations at various stages.

Section 3: Finance Formats

The banking and finance industry runs on digital transactions, where money is transferred from one account to another. There would be chaos if each bank had its own format for the files that process and acknowledge these transfers. That is where standardized messaging formats play a significant role.

The formats define details like sender, receiver, amount, currency and purpose of the transaction. This information gets packaged, sent and then interpreted. Standardization ensures that systems across the world are able to communicate, making it seamless for users to transact irrespective of where they are. Because the industry is regulated, standardization also safeguards the interests of the user while accommodating different jurisdictions, time zones and currencies.

SWIFT MT

SWIFT stands for Society for Worldwide Interbank Financial Telecommunication and MT stands for Message Type. This is a structured, text-based set of formats used in interbank communication and money transfers. It has been around since the late 1970s. 

Format

Each field in this kind of file is introduced by a tag written between colons, which is terser and less self-describing than a regular XML tag. For instance, :50K: is the tag for the name and address of the sender!

A message code has the form MTnxx, where nxx is a 3-digit number. MT, of course, stands for Message Type. The n is the category, a digit from 0 to 9, and the two digits xx identify the specific message type within that category.
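The MTnxx convention is easy to express in code. The following Python sketch splits a code into its category digit and the two-digit type. The function name split_mt_code is an assumption, and variants written with a suffix, such as “MT202 COV”, are deliberately not handled here.

```python
import re

# "MT", one category digit, then two digits for the type within that category.
MT_CODE = re.compile(r"^MT(\d)(\d{2})$")


def split_mt_code(code: str) -> tuple[int, str]:
    """Return (category, type_within_category) for a code such as 'MT103'."""
    match = MT_CODE.match(code.strip().upper())
    if match is None:
        raise ValueError(f"Not a valid MT code: {code!r}")
    return int(match.group(1)), match.group(2)


category, message_type = split_mt_code("MT103")
print(category, message_type)
```

Routing on the category digit first, and on the full code second, is one simple way to keep a message dispatcher readable.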

MT Categories

| Category | Description |
| --- | --- |
| 0 | System Messages |
| 1 | Customer Payments and Cheques |
| 2 | Financial Institution Transfers |
| 3 | Treasury Markets (Forex, Derivatives) |
| 4 | Collection and Cash Letters |
| 5 | Securities Markets |
| 6 | Precious Metals and Syndications |
| 7 | Documentary Credits and Guarantees |
| 8 | Travellers Cheques |
| 9 | Cash Management and Customer Status |


Common Tags

| Tag | Field Name | Description |
| --- | --- | --- |
| :20: | Transaction Reference Number | Unique ID for the transaction |
| :23B: | Bank Operation Code | Type of operation (e.g., CRED for credit transfer) |
| :32A: | Value Date / Currency / Amount | Date, currency code, and transaction amount (e.g., 240731EUR1000,00) |
| :33B: | Currency / Original Ordered Amount | Used when currency of ordering differs from transfer |
| :50A: | Ordering Customer (Account) | Sender's account number and bank identifier (uses BIC) |
| :50K: | Ordering Customer | Name and address of the sender (used in place of 50A if BIC is not used) |
| :52A: | Ordering Institution | Institution that initiated the transfer |
| :53A: | Sender's Correspondent | Sender’s correspondent bank (if any) |
| :54A: | Receiver's Correspondent | Receiver’s correspondent bank (if any) |
| :56A: | Intermediary Institution | Another intermediary bank used |
| :57A: | Account With Institution | Beneficiary's bank |
| :59: | Beneficiary Customer | Account number and name of recipient |
| :70: | Remittance Information | Purpose or reason for the payment |
| :71A: | Details of Charges | Who pays charges (OUR, SHA, BEN) |
| :72: | Sender to Receiver Information | Free-text field for additional info |
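To show how these tags look to a program, here is a rough Python sketch that reads tag lines from the text portion of a message and unpacks :32A: into its three parts. The sample block and the helper names parse_tags and parse_32a are invented for illustration. The sketch ignores the surrounding message blocks and keeps only the last value when a tag repeats, so treat it as a starting point rather than a parser.

```python
import re
from datetime import date, datetime
from decimal import Decimal

# A tag is two digits plus an optional letter, wrapped in colons at line start.
TAG_LINE = re.compile(r"^:(\d{2}[A-Z]?):(.*)$")

# Hypothetical field lines, loosely modelled on a customer credit transfer.
SAMPLE_BLOCK = """:20:REF20240731001
:23B:CRED
:32A:240731EUR1000,00
:50K:/12345678
JANE DOE
1 SAMPLE STREET
:59:/87654321
JOHN ROE
:71A:SHA"""


def parse_tags(block: str) -> dict[str, str]:
    """Map each tag (e.g. '32A') to its value, joining continuation lines."""
    fields: dict[str, str] = {}
    current = None
    for line in block.splitlines():
        match = TAG_LINE.match(line)
        if match:
            current = match.group(1)
            fields[current] = match.group(2)
        elif current is not None and line.strip():
            # Lines without a tag belong to the previous field (e.g. an address).
            fields[current] += "\n" + line
    return fields


def parse_32a(value: str) -> tuple[date, str, Decimal]:
    """Split :32A: into value date (YYMMDD), currency code and amount."""
    value_date = datetime.strptime(value[:6], "%y%m%d").date()
    currency = value[6:9]
    amount = Decimal(value[9:].replace(",", "."))  # the decimal mark is a comma
    return value_date, currency, amount


fields = parse_tags(SAMPLE_BLOCK)
value_date, currency, amount = parse_32a(fields["32A"])
print(value_date, currency, amount)
print(fields["50K"].splitlines())
```

Using Decimal rather than float for the amount is a deliberate choice: money should not pick up binary rounding errors on its way through a pipeline.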


Popular Messages

| MT Code | Name | Purpose |
| --- | --- | --- |
| MT103 | Single Customer Credit Transfer | Standard customer wire transfer (person-to-person or B2B) |
| MT202 | General Financial Institution Transfer | Bank-to-bank payments (no customer involved) |
| MT202 COV | Cover Payment | Supplement to MT103, carries the funds between banks |
| MT101 | Request for Transfer | Customer initiates a transfer from their account via a bank |
| MT940 | Customer Statement Message | End-of-day bank statement |
| MT950 | Statement Message | Statement of bank’s own accounts |
| MT199 | Free Format Message | Used for general correspondence or additional information |
| MT999 | Free Format Message | General free-text correspondence not tied to a specific category (informal use) |


Potential Issues

  • Very rigid structure: you need to know the tags, categories etc. shown in the tables above. This makes it difficult to extend with custom data or even metadata.
  • The codes are not self-explanatory. For example, there is no way to know that :59: means “Beneficiary Customer” without a lookup reference!
  • Interpretation can vary from bank to bank, which can defeat the purpose of a standard. Institutions may also be on different versions.
  • Difficult to parse programmatically.

Architect’s View

When designing a solution, always build in a normalization layer or a rule engine; this is where field-level validations live. Decouple the format from the business logic. Keep the parsing logic flexible enough to absorb schema changes. Building correlation logic into the orchestration engine also makes for a more robust solution. Remember to build solid tests and validations and, of course, the security layer.
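One way to picture “decouple the format from the business logic” is a small adapter that turns parsed tag values into a format-neutral object. Everything in this Python sketch is an assumption made for illustration: the Payment model, the ValidationError type, the set of tags treated as required and the exceeds_limit rule. A real normalization layer would carry far more fields and rules.

```python
from dataclasses import dataclass
from decimal import Decimal


@dataclass(frozen=True)
class Payment:
    """Format-neutral view of a payment (hypothetical internal model)."""
    reference: str
    currency: str
    amount: Decimal
    beneficiary: str
    charges: str


class ValidationError(ValueError):
    """Raised when an incoming message fails a field-level check."""


def payment_from_mt103(fields: dict[str, str]) -> Payment:
    """Adapter: turn already-parsed MT103 tag values into a Payment."""
    # Tags this sketch treats as required; adjust to the rules you must follow.
    for tag in ("20", "32A", "59", "71A"):
        if not fields.get(tag):
            raise ValidationError(f"Required tag :{tag}: is missing")
    value = fields["32A"]  # layout: YYMMDD + currency code + amount
    amount = Decimal(value[9:].replace(",", "."))
    if amount <= 0:
        raise ValidationError("Amount must be positive")
    return Payment(
        reference=fields["20"],
        currency=value[6:9],
        amount=amount,
        beneficiary=fields["59"],
        charges=fields["71A"],
    )


def exceeds_limit(payment: Payment, limit: Decimal) -> bool:
    """Business rule that only sees the neutral model, never raw tags."""
    return payment.amount > limit


payment = payment_from_mt103({
    "20": "REF20240731001",
    "32A": "240731EUR1000,00",
    "59": "/87654321\nJOHN ROE",
    "71A": "SHA",
})
print(payment.currency, payment.amount, exceeds_limit(payment, Decimal("500")))
```

With this split, supporting another message format means writing one more adapter that returns a Payment, while rules such as exceeds_limit stay untouched.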

ISO 20022

This is a global standard for financial messaging, introduced to address the limitations of MT and similar older formats. Messages are XML-based, which makes them more machine-readable, well-structured and significantly easier to extend. The format supports richer data and more detailed transaction information, all of which leads to better analytics.

Format

XML tags define the data clearly. E.g. <CreDtTm> stands for creation date and time!
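Because the messages are plain XML, a standard XML library is enough to get started. The Python sketch below reads the group header of a heavily trimmed, made-up pain.001 fragment with xml.etree.ElementTree, including the CreDtTm (creation date-time) element. The helper name read_group_header is an assumption, and the code expects the document to declare a namespace on its root element.

```python
import xml.etree.ElementTree as ET
from datetime import datetime

# Hypothetical, heavily trimmed fragment; real messages carry many more elements.
SAMPLE_XML = """
<Document xmlns="urn:iso:std:iso:20022:tech:xsd:pain.001.001.09">
  <CstmrCdtTrfInitn>
    <GrpHdr>
      <MsgId>MSG-0001</MsgId>
      <CreDtTm>2025-08-27T12:00:00</CreDtTm>
      <NbOfTxs>1</NbOfTxs>
    </GrpHdr>
  </CstmrCdtTrfInitn>
</Document>
"""


def read_group_header(xml_text: str) -> dict:
    """Extract a few group header values from a namespaced ISO 20022 document."""
    root = ET.fromstring(xml_text.strip())
    # A namespaced root tag looks like '{namespace-uri}Document'. Reading the
    # URI from the tag keeps the code independent of the message version.
    namespace = root.tag[1:root.tag.index("}")]
    ns = {"m": namespace}
    header = root.find(".//m:GrpHdr", ns)
    if header is None:
        raise ValueError("No GrpHdr element found")
    return {
        "namespace": namespace,
        "message_id": header.findtext("m:MsgId", namespaces=ns),
        "created": datetime.fromisoformat(header.findtext("m:CreDtTm", namespaces=ns)),
        "transactions": int(header.findtext("m:NbOfTxs", namespaces=ns)),
    }


header = read_group_header(SAMPLE_XML)
print(header["message_id"], header["created"], header["transactions"])
```

Deriving the namespace from the root element, instead of hard-coding it, is one small way to soften the impact of the frequent version changes.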

Message Types

| Message Type | Description | Business Area |
| --- | --- | --- |
| acmt.001 | Account Opening | Account Management |
| caaa.001 | Authorisation Request | Cards & ATM |
| catm.001 | ATM Initialization | Cards & ATM |
| camt.029 | Resolution of Investigation | Cash Management |
| camt.052 | Bank to Customer Account Report | Cash Management |
| camt.053 | Bank to Customer Statement | Cash Management |
| camt.054 | Debit/Credit Notification | Cash Management |
| camt.060 | Account Reporting Request | Cash Management |
| acmt.023 | KYC Update | Compliance |
| auth.001 | Regulatory Reporting | Compliance |
| fxtr.001 | FX Trade Instruction | FX & Derivatives |
| trea.001 | Treasury Trade Capture | FX & Derivatives |
| pacs.002 | FI to FI Payment Status | Payments |
| pacs.004 | Payment Return | Payments |
| pacs.008 | FI to FI Customer Credit Transfer | Payments |
| pacs.009 | FI Credit Transfer | Payments |
| pain.001 | Customer Credit Transfer Initiation | Payments |
| pain.002 | Payment Status Report | Payments |
| pain.008 | Customer Direct Debit Initiation | Payments |
| seev.001 | Meeting Notification | Securities |
| semt.002 | Securities Balance Accounting Report | Securities |
| sese.023 | Securities Settlement Instruction | Securities |
| setr.001 | Subscription Order | Securities |
| tsmt.001 | Baseline Creation | Trade Services |
| tsmt.025 | Status Advice | Trade Services |
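The short names in the table are the first two parts of a longer identifier. In full, an ISO 20022 message identifier such as pacs.008.001.08 has four parts: business area, message functionality, variant and version. The Python sketch below splits one apart. The names MessageId and parse_message_id are assumptions, and the BUSINESS_AREAS lookup is intentionally partial.

```python
import re
from typing import NamedTuple

# Four letters, then three digits, three digits and two digits, dot-separated.
MESSAGE_ID = re.compile(r"^([a-z]{4})\.(\d{3})\.(\d{3})\.(\d{2})$")

# Expansions for a few business-area codes; the list is intentionally partial.
BUSINESS_AREAS = {
    "acmt": "Account Management",
    "camt": "Cash Management",
    "pacs": "Payments Clearing and Settlement",
    "pain": "Payments Initiation",
}


class MessageId(NamedTuple):
    business_area: str
    functionality: str
    variant: str
    version: str


def parse_message_id(identifier: str) -> MessageId:
    """Split a full identifier such as 'pacs.008.001.08' into its four parts."""
    match = MESSAGE_ID.match(identifier.strip().lower())
    if match is None:
        raise ValueError(f"Not a full ISO 20022 message identifier: {identifier!r}")
    return MessageId(*match.groups())


parsed = parse_message_id("pacs.008.001.08")
area = BUSINESS_AREAS.get(parsed.business_area, "unknown")
print(parsed, area)
```

Keeping the version as a separate field makes it straightforward to route different versions of the same message to different parsing modules.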


Benefits

  • Richer data
  • Better interoperability (especially cross-border transactions)
  • Easy to extend with support for custom fields that may be local or industry specific.
  • Supports auditability, traceability and compliance, which is especially useful for KYC (Know Your Customer) and AML (Anti-Money Laundering) checks.

Potential Issues

  • XML schema can get very complex over time.
  • Steep Learning Curve
  • May not work with legacy systems.
  • Constantly changing versions.

Architect’s View

It is recommended to use a rule engine to keep business rules flexible. Converting messages to a normalized internal representation such as JSON can help tame the schema complexity. The best option is to abstract the parsing logic and split it into modules, which is especially useful when optional fields or newer tags come into play. Keep auditability as a fundamental part of your design.

Wrapping It All Up

File formats are not just plain old syntax. They bundle together a domain’s logic, expectations and regulations. That holds whether you are working with plain old CSV or the modern ISO 20022, mapping HL7 in healthcare or decoding MT tags in finance. Understanding the format itself elevates you from being just a coder or architect: it makes you a “speaker” of the language of the domain.

Remember, one doesn’t need to master all the file types out there. But learn to read a format and understand the “why” behind it before you start working with it; that is all there is to it.

This article may not have covered every complex jungle out there, but hopefully, I’ve pointed you in the right direction, to explore further. 
