Wait, What Format Is That? A Cross-Domain Guide for Everyone

Tech leaders can’t afford to say, 'what format is that?' Decode data formats across domains, make sharper decisions, and lead without getting lost in translation.

Syamanthaka B

Aug. 27, 25 · Tutorial

Are you an engineering or technology leader who ends up looking up “what’s that file format” in the middle of a meeting, while file-format jargon flies around the room?

Are you an architect who has switched domains, only to discover an entire jungle of unfamiliar file formats that you now need to integrate into the solution you are building?

Are you a data engineer who jumps across projects building data pipelines, and keeps encountering an overwhelmingly large number of file types?

Are you a budding software engineer in your first job, trying to make sense of how to read a particular type of file?

If you’ve answered yes to any of these, this article is for you.

Once upon a clock cycle, file types were simple: a Word file with a .doc extension, an Excel file with an .xls extension and an occasional text file with a .txt extension. Then, one fine kernel tick later, the floodgates opened and a flurry of file formats poured in. Domain-specific, human-readable, machine-readable and now even AI-readable file types and extensions exist.

Knowing how each of these works, and how to work with them, is essential to making better sense of your data and turning it into something useful. One small glitch in the abstraction can derail the entire pipeline.

Every industry, every domain, has its own specifications, rules and regulations and evolving standards.

Well, to be fair, there are a gazillion file types out there. But in this article, I humbly try to explain a select few, the ones worth understanding across roles and domains.

Section 1: Generic Data Formats

Before we dive deep into domain specific formats, let us look at some universal ones.

These generic formats form the backbone of almost every tech stack. Whether you are building APIs or microservices, or running analytics models, you will encounter them. Some of them may seem simple, but they still call for thoughtful design; otherwise they end up hurting performance, readability and ease of integration.

CSV

Stands for Comma-Separated Values. You’ve seen this file many times; it's everywhere. You’ve most probably even opened one in Excel.

Format – Each row is a line, values are separated by commas. Can have an optional first row to depict headers.

Uses – reports, data exchange between systems.

Benefits

Light weight
Easy to read and write in most programming languages
Works across platforms

Potential issues

Column order is very important as it does not have a schema per se.
The delimiter (,) can appear inside the data. This has to be handled explicitly by enclosing the value in double quotes.
Encoding inconsistencies – we’ve all seen those weird characters in a CSV at least a few times. This happens when the encoding used to save the file differs from the one used to read it, e.g. a file saved as UTF-8 and opened as Windows-1252.
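
The encoding mismatch is easy to reproduce. Here is a minimal sketch, assuming a file that was written as UTF-8 and then read as Windows-1252 (the sample text is made up for illustration):

```python
# Text with non-ASCII characters, written out as UTF-8 bytes.
original = "café,Zürich"
raw_bytes = original.encode("utf-8")

# Reading the same bytes with the wrong codec produces the familiar garbage.
misread = raw_bytes.decode("cp1252")

# Reading with the codec the file was actually written in restores the text.
correct = raw_bytes.decode("utf-8")

print(misread)
print(correct)
```

The practical takeaway is to state the encoding explicitly on both the writing and the reading side, rather than relying on platform defaults.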

Architect’s tip – CSVs are great for easy integrations or one-time data exchanges. But if versioning or structure is expected, it's best not to use CSV. Alternatively, if you must, make sure to add metadata and validations as required.

Sl number,name,classification,weight per unit,color,cities produced
1,Apple,Fruit,40,red,"Paris,New York,Mumbai"
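
To see how the quoting and the "no schema" issues play out in practice, here is a small sketch using Python's standard csv module. The sample data mirrors the example above; the EXPECTED_HEADERS list and the read_rows helper are illustrative names, not part of any standard.

```python
import csv
import io

# Sample mirroring the example above. The quoted last field contains
# commas that must not be treated as separators.
SAMPLE = (
    "Sl number,name,classification,weight per unit,color,cities produced\n"
    '1,Apple,Fruit,40,red,"Paris,New York,Mumbai"\n'
)

# Hypothetical contract for this file: CSV has no schema of its own,
# so the reader checks the header row itself.
EXPECTED_HEADERS = [
    "Sl number", "name", "classification",
    "weight per unit", "color", "cities produced",
]


def read_rows(text):
    """Parse CSV text into dicts keyed by header, after checking the headers."""
    reader = csv.DictReader(io.StringIO(text))
    if reader.fieldnames != EXPECTED_HEADERS:
        raise ValueError(f"unexpected headers: {reader.fieldnames}")
    return list(reader)


rows = read_rows(SAMPLE)
cities = rows[0]["cities produced"].split(",")
print(rows[0]["name"], cities)
```

Reading rows by header name instead of by position makes the code less sensitive to column order, and the header check fails fast when the file layout changes.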

JSON

Stands for JavaScript Object Notation. It is structured, and is meant to be as human readable as it is machine readable. It is the format most widely used between frontend and backend modules in web-based applications. It is also the preferred format for REST APIs, as you will have noticed if you have inspected an API in a tool like Postman. Data is typically nested into key-value pairs.

Format – an object begins and ends with curly braces {}, and everything within them is one record. Multiple records are held in an array enclosed in square brackets [], each record in its own {}. Nested structures within a record are supported. Values can be strings, numbers, booleans, null, arrays or other objects.

Uses – config files, API payloads, NoSQL database files

Benefits

Human and machine readable
Readable by all programming languages
Easy to parse in Python, JavaScript and more

Potential issues

If there are many levels of nesting, becomes “not-so” human readable after all
Schema is also not strictly enforced, which may pose challenges in some use cases.
The more the number of rows, the slower and bulkier the file tends to get.

Architect’s tip – Very flexible, but the program will need to have parsing error handling built in. Use of OpenAPI or JSON Schema to validate the input parameters is highly recommended.

    {
      "Sl_number": 1,
      "Name": "apple",
      "Classification": "Fruit",
      "Weight_per_unit": 40,
      "Color": "Red",
      "Cities_produced": ["Paris", "New York", "Mumbai"]
    }
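
The architect's tip about error handling and validation can be sketched with the standard json module. The REQUIRED mapping and the load_item helper below are hypothetical; the hand-rolled type check is only a stand-in for a real JSON Schema or OpenAPI validator.

```python
import json

# Hypothetical contract for the record shown above: key -> expected type.
REQUIRED = {
    "Sl_number": int,
    "Name": str,
    "Classification": str,
    "Weight_per_unit": (int, float),
    "Color": str,
    "Cities_produced": list,
}


def load_item(text):
    """Parse one JSON record, turning parse and shape problems into ValueError."""
    try:
        item = json.loads(text)
    except json.JSONDecodeError as exc:
        raise ValueError(
            f"invalid JSON at line {exc.lineno}, column {exc.colno}: {exc.msg}"
        ) from exc
    if not isinstance(item, dict):
        raise ValueError("expected a JSON object")
    for key, expected_type in REQUIRED.items():
        if key not in item:
            raise ValueError(f"missing key: {key}")
        if not isinstance(item[key], expected_type):
            raise ValueError(f"wrong type for {key}: {type(item[key]).__name__}")
    return item


valid = """{"Sl_number": 1, "Name": "apple", "Classification": "Fruit",
            "Weight_per_unit": 40, "Color": "Red",
            "Cities_produced": ["Paris", "New York", "Mumbai"]}"""
item = load_item(valid)
print(item["Cities_produced"])
```

Unquoted keys or values, such as {Name: apple}, are rejected by the parser, which is why the example record needs the quotes.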

YAML

The name is quite unique and trendsetting, to be honest! YAML originally stood for Yet Another Markup Language; the official expansion today is the recursive “YAML Ain’t Markup Language”. It is primarily used for configuration. It is also human readable, like JSON. Tools like GitHub Actions, Kubernetes etc. use it to set up configurations easily. One can say it addresses some of the challenges that JSON poses and simplifies the format.

Format – Key-value pairs that use indentation-based structure. Although there is no curly brace or comma, the caveat is that it is sensitive to white spaces.

Uses – configuration settings in Kubernetes, CI/CD pipelines, and Infrastructure as Code tools such as Ansible or AWS CloudFormation.

Benefits

Most readable of all the formats mentioned above
Comments are possible, making it super versatile.
Suitable for configurations that need to be structured.

Potential issues

White space sensitivity, indentation sensitivity can potentially break the whole thing.
When there are too many nesting layers, can get extremely cumbersome to validate.

Architect’s tip – Ahem, it's more suited to humans than machines. Best avoided for data. Use it for config and config-like use cases. It is a good idea to add a YAML linter to your IDE so that whitespace trouble is caught early.

    Fridge_routine_post_shopping:
      Classification: "Decide if fruit or veg"
      Shelf_selection:
        Fruit: "Door shelf"
        Vegetables: "Fresh Draw"
      Cooking:
        Fruit: "Not required"
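
To give a feel for what a linter guards against, here is a deliberately tiny sketch of an indentation check in plain Python. It is not a YAML parser. The lint_indentation function is a hypothetical helper, and the "multiple of two spaces" rule is an assumed team convention; YAML itself only forbids tabs for indentation.

```python
def lint_indentation(text: str, indent: int = 2) -> list[str]:
    """Report lines indented with tabs or with an odd number of spaces."""
    problems = []
    for lineno, line in enumerate(text.splitlines(), start=1):
        stripped = line.lstrip()
        # Blank lines and comment-only lines carry no structure.
        if not stripped or stripped.startswith("#"):
            continue
        leading = line[: len(line) - len(stripped)]
        if "\t" in leading:
            problems.append(f"line {lineno}: tab used for indentation")
        elif len(leading) % indent != 0:
            problems.append(
                f"line {lineno}: {len(leading)} spaces is not a multiple of {indent}"
            )
    return problems


GOOD = "fridge:\n  shelf: door\n  cooking:\n    fruit: not required\n"
BAD = "fridge:\n   shelf: door\n\tcooking: none\n"

print(lint_indentation(GOOD))
print(lint_indentation(BAD))
```

A real project would lean on an established linter and a proper YAML parser; the point here is only that these mistakes are mechanical and cheap to catch before deployment.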

A Ready Reference for Generic File Formats.

File Format	Human Readable	Machine Readable	Use cases	Beware of
CSV	Low	High	Reports, cross-platform sharing	Separators in values
JSON	Medium	High	APIs, configs, NoSQL data	File bulk with rows being added
YAML	High	High	Configs, IaC	White spaces, indentations

Section 2: Healthcare Formats

HL7

Picture this. You go to a hospital for an illness. They put you through a set of tests. Maybe you are an in-patient for a few days and some medications are administered. Now maybe you didn’t quite like the treatment there and want to go to some other hospital or clinic. You go elsewhere, only to find that they want to repeat all the tests all over again. And you are thinking, “if only there was a way to transfer all those records here somehow”. You’d think just handing over those documents might suffice, but the new hospital maintains its records in a different way, and your data is not compatible with its format unless someone spends a considerable amount of time on it. Well, that’s where HL7 steps in to do exactly that.

So, in a sentence, HL7 is a messaging standard that makes transfer of medical records across departments or institutions seamless. Of course, it has incremental versions with each improvement and that is what the v(n) would indicate.

Format

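V2 – The most commonly used version. This is a delimiter-based messaging format: each segment sits on its own line, fields are separated by pipes (|) and components by carets (^).

V3 – XML-based. In practice, v2 remains the more widely used of the two.

FHIR (Fast Healthcare Interoperability Resources) – pronounced “fire” – is the modern HL7 standard. It exposes resources over RESTful APIs, in JSON or XML, and is best suited to web applications, APIs etc.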
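
To make the delimiter-based v2 structure concrete, here is a minimal sketch that splits a message into segments, fields and components. The message itself is made up for illustration, and parse_segments and get_field are hypothetical helpers; a production system would normally use a dedicated HL7 library and handle repetitions, escapes and custom delimiters.

```python
# Hypothetical lab-result (ORU) message. Segments are separated by carriage
# returns, fields by "|" and components by "^".
MESSAGE = "\r".join([
    r"MSH|^~\&|LAB|HOSP_A|EHR|HOSP_B|202501011200||ORU^R01|MSG0001|P|2.5",
    "PID|1||12345^^^HOSP_A||Doe^Jane",
    "OBX|1|NM|GLU^Glucose||98|mg/dL",
])


def parse_segments(message):
    """Group the fields of each segment under its three-letter segment name."""
    segments = {}
    for line in message.split("\r"):
        if line:
            fields = line.split("|")
            segments.setdefault(fields[0], []).append(fields)
    return segments


def get_field(fields, number):
    """Return field `number` (1-based, as in PID-5) from a split segment."""
    if fields[0] == "MSH":
        # In MSH the field separator itself counts as MSH-1,
        # so every later field is shifted by one position.
        if number == 1:
            return "|"
        index = number - 1
    else:
        index = number
    return fields[index] if index < len(fields) else ""


segments = parse_segments(MESSAGE)
message_type = get_field(segments["MSH"][0], 9)            # MSH-9
family_name, given_name = get_field(segments["PID"][0], 5).split("^")  # PID-5
result = get_field(segments["OBX"][0], 5)                  # OBX-5
units = get_field(segments["OBX"][0], 6)                   # OBX-6
print(message_type, family_name, given_name, result, units)
```

Even in this toy form, the MSH numbering quirk shows why raw messages are hard to debug by eye.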

Potential issues

Difficult to debug
There can be vendor-specific implementations, which lead to inconsistencies.
Validations and schemas are not rigid.

Architect’s tip – Don’t read raw HL7 messages into the core system; handle them in the integration layer instead. Try to build in validations at various stages.

Section 3: Finance Formats

The Banking and finance industry involves digital transactions, where money is transferred from one account to another. But there would be only chaos if each bank had its own format for the files that process and acknowledge these transfers. That is where standardized messaging formats play a significant role.

The formats define details like sender, receiver, amount, currency and purpose of the transaction. This gets packaged, sent and then interpreted. Standardization ensures that systems across the world are able to communicate, making it seamless for users to transact irrespective of where they are. Because the industry is regulated, standardization also safeguards the interests of the user while respecting differences in jurisdictions, time zones and currencies.

SWIFT MT

SWIFT stands for Society for Worldwide Interbank Financial Telecommunication and MT stands for Message Type. This is a structured, text-based set of formats used in interbank communication and money transfers. It has been around since the late 1970s.

Format

A tag defines the field in this kind of file, but it is slightly more cryptic than a regular XML tag. For instance, :50K: is the tag associated with the name and address of the sender!

A message code has the format MTnxx, where nxx is a 3-digit number. MT of course stands for Message Type. The n is a category from 0 to 9, and the 2-digit xx identifies a specific message type within that category.

MT Categories

Category	Description
0	System Messages
1	Customer Payments and Cheques
2	Financial Institution Transfers
3	Treasury Markets (Forex, Derivatives)
4	Collection and Cash Letters
5	Securities Markets
6	Precious Metals and Syndications
7	Documentary Credits and Guarantees
8	Travellers Cheques
9	Cash Management and Customer Reporting
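
The MTnxx naming rule can be sketched as a small lookup. The describe_mt function is a hypothetical helper; the category labels follow the table above, with category 0 taken as System Messages. Suffixed variants such as "MT202 COV" are deliberately not handled in this sketch.

```python
import re

# Category digit -> description.
MT_CATEGORIES = {
    0: "System Messages",
    1: "Customer Payments and Cheques",
    2: "Financial Institution Transfers",
    3: "Treasury Markets (Forex, Derivatives)",
    4: "Collection and Cash Letters",
    5: "Securities Markets",
    6: "Precious Metals and Syndications",
    7: "Documentary Credits and Guarantees",
    8: "Travellers Cheques",
    9: "Cash Management and Customer Reporting",
}


def describe_mt(code):
    """Split a code such as 'MT103' into (category, type digits, category name)."""
    match = re.fullmatch(r"MT\s?(\d)(\d{2})", code.strip().upper())
    if match is None:
        raise ValueError(f"not a plain MTnxx code: {code!r}")
    category = int(match.group(1))
    return category, match.group(2), MT_CATEGORIES[category]


print(describe_mt("MT103"))
print(describe_mt("mt940"))
```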

Common Tags

Tag	Field Name	Description
:20:	Transaction Reference Number	Unique ID for the transaction
:23B:	Bank Operation Code	Type of operation (e.g., CRED for credit transfer)
:32A:	Value Date / Currency / Amount	Date, currency code, and transaction amount (e.g., 240731EUR1000,00)
:33B:	Currency / Original Ordered Amount	Used when currency of ordering differs from transfer
:50A:	Ordering Customer (Account)	Sender's account number and bank identifier (uses BIC)
:50K:	Ordering Customer	Name and address of the sender (used in place of 50A if BIC is not used)
:52A:	Ordering Institution	Institution that initiated the transfer
:53A:	Sender's Correspondent	Sender’s correspondent bank (if any)
:54A:	Receiver's Correspondent	Receiver’s correspondent bank (if any)
:56A:	Intermediary Institution	Another intermediary bank used
:57A:	Account With Institution	Beneficiary's bank
:59:	Beneficiary Customer	Account number and name of recipient
:70:	Remittance Information	Purpose or reason for the payment
:71A:	Details of Charges	Who pays charges (OUR, SHA, BEN)
:72:	Sender to Receiver Information	Free-text field for additional info
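
Here is a minimal sketch of how the tagged text of a message body could be split into fields and how :32A: breaks down into date, currency and amount. The sample body is invented, it covers only a handful of tags, and the helper names (parse_mt_fields, parse_32a, TAG_NAMES) are hypothetical. Real messages also carry header and trailer blocks that this sketch ignores.

```python
import re
from datetime import date, datetime
from decimal import Decimal

# Invented body text of a customer credit transfer, for illustration only.
BODY = (
    ":20:REF12345\n"
    ":23B:CRED\n"
    ":32A:240731EUR1000,00\n"
    ":50K:/12345678\nJOHN DOE\n1 MAIN STREET\n"
    ":59:/87654321\nJANE ROE\n"
    ":71A:SHA"
)

# A tag is two digits plus an optional letter, between colons, at line start.
TAG_RE = re.compile(r"^:(\d{2}[A-Z]?):", re.MULTILINE)

# Small lookup taken from the table above; the tags are not self-describing.
TAG_NAMES = {
    "20": "Transaction Reference Number",
    "32A": "Value Date / Currency / Amount",
    "50K": "Ordering Customer",
    "59": "Beneficiary Customer",
    "71A": "Details of Charges",
}


def parse_mt_fields(body: str) -> dict[str, str]:
    """Map each tag to its (possibly multi-line) value. Repeated tags overwrite."""
    matches = list(TAG_RE.finditer(body))
    fields = {}
    for i, match in enumerate(matches):
        end = matches[i + 1].start() if i + 1 < len(matches) else len(body)
        fields[match.group(1)] = body[match.end():end].strip()
    return fields


def parse_32a(value: str) -> tuple[date, str, Decimal]:
    """Split YYMMDD + currency code + amount (comma as decimal separator)."""
    match = re.fullmatch(r"(\d{6})([A-Z]{3})([\d,]+)", value)
    if match is None:
        raise ValueError(f"malformed 32A value: {value!r}")
    value_date = datetime.strptime(match.group(1), "%y%m%d").date()
    amount = Decimal(match.group(3).replace(",", "."))
    return value_date, match.group(2), amount


fields = parse_mt_fields(BODY)
for tag, value in fields.items():
    print(tag, TAG_NAMES.get(tag, "(not in lookup)"), repr(value))
print(parse_32a(fields["32A"]))
```

Using Decimal rather than float for the amount avoids rounding surprises when money is involved.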

Popular Messages

MT Code	Name	Purpose
MT103	Single Customer Credit Transfer	Standard customer wire transfer (person-to-person or B2B)
MT202	General Financial Institution Transfer	Bank-to-bank payments (no customer involved)
MT202 COV	Cover Payment	Supplement to MT103, carries the funds between banks
MT101	Request for Transfer	Customer initiates a transfer from their account via a bank
MT940	Customer Statement Message	End-of-day bank statement
MT950	Statement Message	Statement of bank’s own accounts
MT199	Free Format Message	Used for general correspondence or additional information
MT999	Free Format Message	Free-text message, typically used for informal bank-to-bank correspondence

Potential Issues

Very rigid structure; you need to know the tags, categories etc. shown in the tables above. This makes it difficult to extend with custom data or even metadata.
The codes are not self-explanatory. For example, there is no way to know that :59: means “Beneficiary Customer” unless one has a lookup reference!
Interpretation may vary from bank to bank, which can potentially defeat the purpose. Institutions may also be on different versions.
Difficult to parse programmatically.

Architect’s View

When designing a solution, always build in a normalization layer or a rule engine. This comes in handy for field-level validations. Decouple the format from the business logic. Keep the parsing logic flexible enough to absorb schema changes. Building correlation logic into the orchestration engine also makes for a more robust solution. Remember to build solid tests and validations and, of course, the security layer.
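
One way to picture the normalization layer is a format-neutral record that business logic works on, with a thin adapter per message format. Everything below is a hypothetical sketch: the Payment class, the from_mt103 adapter, the validation rules and the is_high_value threshold are assumptions made for illustration, and the adapter expects fields that have already been split by tag.

```python
from dataclasses import dataclass
from decimal import Decimal


@dataclass(frozen=True)
class Payment:
    """Format-neutral record; business logic never sees MT tags."""
    reference: str
    currency: str
    amount: Decimal
    debtor: str
    creditor: str


def _party_name(value: str) -> str:
    # Lines starting with "/" carry the account; the first other line is the name.
    names = [line for line in value.splitlines() if not line.startswith("/")]
    return names[0] if names else ""


def from_mt103(fields: dict[str, str]) -> Payment:
    """Adapter: the only place that knows about MT tag numbers."""
    value = fields["32A"]  # YYMMDD + currency + amount with a decimal comma
    return Payment(
        reference=fields["20"],
        currency=value[6:9],
        amount=Decimal(value[9:].replace(",", ".")),
        debtor=_party_name(fields.get("50K", "")),
        creditor=_party_name(fields.get("59", "")),
    )


def validate(payment: Payment) -> list[str]:
    """Field-level checks (illustrative rules, not a complete rule set)."""
    errors = []
    if not payment.reference or len(payment.reference) > 16:
        errors.append("reference must be 1 to 16 characters")
    if not (len(payment.currency) == 3 and payment.currency.isalpha()):
        errors.append("currency must be a 3-letter code")
    if payment.amount <= 0:
        errors.append("amount must be positive")
    return errors


def is_high_value(payment: Payment, threshold: Decimal = Decimal("10000")) -> bool:
    """Business rule expressed purely against the normalized record."""
    return payment.amount >= threshold


payment = from_mt103({
    "20": "REF12345",
    "32A": "240731EUR1000,00",
    "50K": "/12345678\nJOHN DOE",
    "59": "/87654321\nJANE ROE",
})
print(payment, validate(payment), is_high_value(payment))
```

With this split, supporting another format means writing another adapter, while validate and is_high_value stay untouched.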

ISO 20022

This is a global standard for financial messaging, introduced to address the limitations of MT and similar older formats. It uses XML, which makes messages more machine-readable, better structured and much easier to extend. The format supports richer data and more detailed transaction information, which in turn leads to better analytics.

Format

XML tags define the data clearly. E.g., <CreDtTm> is the creation date and time of the message!
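
As a rough illustration, the fragment below is a heavily trimmed, hand-written imitation of a pacs.008 message. It is not schema-valid and is only meant to show how the tags read and how the XML namespace affects parsing. CreDtTm carries the creation date and time, and IntrBkSttlmAmt the interbank settlement amount with its currency attribute.

```python
import xml.etree.ElementTree as ET
from datetime import datetime
from decimal import Decimal

# Trimmed, illustrative fragment; a real message has many more mandatory elements.
SAMPLE = """<Document xmlns="urn:iso:std:iso:20022:tech:xsd:pacs.008.001.08">
  <FIToFICstmrCdtTrf>
    <GrpHdr>
      <MsgId>MSG-0001</MsgId>
      <CreDtTm>2024-07-31T10:15:00</CreDtTm>
      <NbOfTxs>1</NbOfTxs>
    </GrpHdr>
    <CdtTrfTxInf>
      <IntrBkSttlmAmt Ccy="EUR">1000.00</IntrBkSttlmAmt>
    </CdtTrfTxInf>
  </FIToFICstmrCdtTrf>
</Document>"""

# Every element lives in the message's namespace, so lookups must name it.
NS = {"p": "urn:iso:std:iso:20022:tech:xsd:pacs.008.001.08"}

root = ET.fromstring(SAMPLE)
message_id = root.findtext(".//p:MsgId", namespaces=NS)
created = datetime.fromisoformat(root.findtext(".//p:CreDtTm", namespaces=NS))
amount_element = root.find(".//p:IntrBkSttlmAmt", NS)
amount = Decimal(amount_element.text)
currency = amount_element.get("Ccy")

print(message_id, created, amount, currency)
```

Because the namespace embeds the message version, a version bump changes it; that is one reason to keep it in a single place rather than scattered through the parsing code.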

Message Types

Message Type	Description	Business Area
acmt.001	Account Opening	Account Management
caaa.001	Authorisation Request	Cards & ATM
catm.001	ATM Initialization	Cards & ATM
camt.029	Resolution of Investigation	Cash Management
camt.052	Bank to Customer Account Report	Cash Management
camt.053	Bank to Customer Statement	Cash Management
camt.054	Debit/Credit Notification	Cash Management
camt.060	Account Reporting Request	Cash Management
acmt.023	KYC Update	Compliance
auth.001	Regulatory Reporting	Compliance
fxtr.001	FX Trade Instruction	FX & Derivatives
trea.001	Treasury Trade Capture	FX & Derivatives
pacs.002	FI to FI Payment Status	Payments
pacs.004	Payment Return	Payments
pacs.008	FI to FI Customer Credit Transfer	Payments
pacs.009	FI Credit Transfer	Payments
pain.001	Customer Credit Transfer Initiation	Payments
pain.002	Payment Status Report	Payments
pain.008	Customer Direct Debit Initiation	Payments
seev.001	Meeting Notification	Securities
semt.002	Securities Balance Accounting Report	Securities
sese.023	Securities Settlement Instruction	Securities
setr.001	Subscription Order	Securities
tsmt.001	Baseline Creation	Trade Services
tsmt.025	Status Advice	Trade Services
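
The identifiers in the table are the short form of a longer, dot-separated name: a four-letter business area code, a three-digit message number and, in the full form, a variant and a version (for example pacs.008.001.08). A small sketch of decoding that structure follows. The parse_message_id helper is hypothetical, and the BUSINESS_AREAS lookup covers only a few codes, using the standard's own area names rather than the broader groupings in the table.

```python
import re

# Partial lookup of business area codes (illustrative subset).
BUSINESS_AREAS = {
    "acmt": "Account Management",
    "camt": "Cash Management",
    "pacs": "Payments Clearing and Settlement",
    "pain": "Payments Initiation",
    "seev": "Securities Events",
    "semt": "Securities Management",
    "sese": "Securities Settlement",
    "setr": "Securities Trade",
}

ID_RE = re.compile(r"([a-z]{4})\.(\d{3})(?:\.(\d{3})\.(\d{2}))?")


def parse_message_id(identifier: str) -> dict:
    """Split 'pacs.008' or 'pacs.008.001.08' into its named parts."""
    match = ID_RE.fullmatch(identifier.strip().lower())
    if match is None:
        raise ValueError(f"unrecognised message identifier: {identifier!r}")
    area, message, variant, version = match.groups()
    return {
        "business_area": area,
        "area_name": BUSINESS_AREAS.get(area, "unknown"),
        "message": message,
        "variant": variant,   # None when the short form is used
        "version": version,   # None when the short form is used
    }


print(parse_message_id("pacs.008.001.08"))
print(parse_message_id("camt.053"))
```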

Benefits

Richer data
Better interoperability (especially cross-border transactions)
Easy to extend with support for custom fields that may be local or industry specific.
Supports auditability, traceability and compliance. Especially useful for KYC (Know Your Customer) and AML (Anti-Money Laundering) checks.

Potential Issues

XML schema can get very complex over time.
Steep Learning Curve
May not work with legacy systems.
Constantly changing versions.

Architect’s View

It is recommended to use a rule engine to keep business rules flexible. Converting messages to JSON internally can help tame the schema complexity. The best option is to abstract the parsing logic and split it into modules. This is especially useful when optional fields, newer tags etc. come into play. Keep auditability as a fundamental part of your design.
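
As one possible reading of the JSON suggestion, the sketch below flattens an XML message into plain dictionaries that can then be serialized as JSON. The element_to_dict function is a hypothetical helper and the sample document is invented. It drops namespaces and ignores attributes on non-leaf elements, so treat it as a starting point rather than a faithful converter.

```python
import json
import xml.etree.ElementTree as ET


def local_name(tag: str) -> str:
    """Strip the '{namespace}' prefix that ElementTree puts on tag names."""
    return tag.rsplit("}", 1)[-1]


def element_to_dict(element):
    """Convert an element's content into nested dicts, lists and strings."""
    children = list(element)
    if not children:
        text = (element.text or "").strip()
        # Keep attributes of leaf elements (e.g. a currency code) next to the value.
        return {"value": text, **element.attrib} if element.attrib else text
    result = {}
    for child in children:
        key = local_name(child.tag)
        value = element_to_dict(child)
        if key in result:
            # Repeated tags (optional or multiple entries) become a list.
            if not isinstance(result[key], list):
                result[key] = [result[key]]
            result[key].append(value)
        else:
            result[key] = value
    return result


XML = (
    '<Document xmlns="urn:example">'
    "<GrpHdr><MsgId>M1</MsgId></GrpHdr>"
    '<Tx><Amt Ccy="EUR">10.00</Amt></Tx>'
    '<Tx><Amt Ccy="USD">5.00</Amt></Tx>'
    "</Document>"
)

normalized = element_to_dict(ET.fromstring(XML))
print(json.dumps(normalized, indent=2))
```

Downstream modules can then work against the dictionary keys, which keeps the XML specifics confined to one module.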

Wrapping It All Up

File formats are not just plain old syntax. They are a domain’s logic, expectations and regulations, all bundled together. That holds whether you are working with plain old CSV or the modern ISO 20022, whether you are mapping HL7 in healthcare or decoding MT tags in finance. Understanding the format itself elevates you from being just a coder or architect: it makes you a “speaker” of the domain’s language.

Remember, one doesn’t need to master all the file types out there. But learn to read a format, understand the “why” behind it before you start working with it, and that is all there is to it.

This article may not have covered every complex jungle out there, but hopefully, I’ve pointed you in the right direction, to explore further.

Opinions expressed by DZone contributors are their own.
