Wait, What Format Is That? A Cross-Domain Guide for Everyone
Tech leaders can’t afford to say, 'what format is that?' Decode data formats across domains, make sharper decisions, and lead without getting lost in translation.
Are you an engineering or technology leader who ends up looking up “what’s that file format” in the middle of a meeting where file-format jargon is flying around?
Are you an Architect who has switched domains only to discover that there is an entire jungle of file formats that you are unfamiliar with, and now need to integrate into the solution you are building?
Are you a Data Engineer who jumps across projects and has to build data pipelines, but encounters an overwhelmingly large number of file types?
Are you a budding Software Engineer who, in your first job, is trying to make sense of how to read a particular type of file?
If you’ve answered yes to any of these, this article is for you.
Once upon a clock cycle, file types were simple: a Word file with a .doc extension, an Excel file with an .xls extension and an occasional text file with a .txt extension. Then, one fine kernel tick later, the floodgates opened and a flurry of file formats poured in. Domain-specific, human-readable, machine-readable and now even AI-readable file types and extensions exist.
Knowing how each of these works, and how to work with them, is essential to making better sense of your data and turning it into something useful. One small glitch in the abstraction can derail the entire pipeline.
Every industry, every domain, has its own specifications, rules and regulations and evolving standards.
Well, to be fair, there are a gazillion file types out there. But in this article, I humbly try to explain a select few, the ones worth understanding across roles and domains.
Section 1: Generic Data Formats
Before we dive deep into domain specific formats, let us look at some universal ones.
The generic formats form the backbone of almost every tech stack. Whether you are building APIs, microservices or running analytics models, you will encounter them. Some of them may seem simple, but choosing and using them still calls for thoughtful design, because a careless choice can hurt performance, readability and ease of integration.
CSV
Stands for Comma-Separated Values. You’ve seen this file many times, it's everywhere. You’ve most probably even opened it in Excel.
Format – Each row is a line, values are separated by commas. Can have an optional first row to depict headers.
Uses – reports, data exchange between systems.
Benefits
- Light weight
- Easy to read and write in most programming languages
- Works across platforms
Potential issues
- Column order is very important, as the format carries no schema of its own.
- The delimiter (,) can appear in the data itself. This has to be handled explicitly by enclosing the value in double quotes.
- Encoding inconsistencies: most of us have seen weird characters in a CSV at least a few times. This happens when the encoding used to save the file differs from the one used to read it, e.g. a file saved as UTF-8 and opened as Windows-1252.
Architect’s tip – CSVs are great for easy integrations or one-time data exchanges. If versioning or a strict structure is expected, it's best not to use CSV. Alternatively, if you must, add metadata and validations as required.
Sl number,name,classification,weight per unit,color,cities produced
1,Apple,Fruit,40,red,"Paris,New York,Mumbai"
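The quoting and encoding pitfalls described for CSV can be sketched with Python's standard csv module. This is a minimal sketch rather than a full ingestion routine: the sample data mirrors the example row, and the encoding mismatch is simulated in memory instead of being read from a real file.

```python
import csv
import io

# Sample data mirroring the example row; the last value contains commas,
# so it is wrapped in double quotes.
raw = (
    "Sl number,name,classification,weight per unit,color,cities produced\n"
    '1,Apple,Fruit,40,red,"Paris,New York,Mumbai"\n'
)

# DictReader treats the first row as the header and honours the quoting rules.
rows = list(csv.DictReader(io.StringIO(raw)))
cities = rows[0]["cities produced"].split(",")

# A naive split on commas breaks the quoted value into extra columns.
naive_columns = raw.splitlines()[1].split(",")

# Encoding mismatch: bytes written as UTF-8 but decoded as Windows-1252.
garbled = "café".encode("utf-8").decode("cp1252")

print(cities)
print(len(naive_columns), "columns from the naive split instead of 6")
print(garbled)
```

The design point is simply to let a CSV-aware parser deal with quoting, and to state the encoding explicitly on both the writing and the reading side.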
JSON
Stands for JavaScript Object Notation. It is structured and is meant to be as readable for humans as for machines. It is the most widely used format for frontend-to-backend communication in web applications and the preferred payload format for REST APIs, as you will have noticed if you have inspected an API in a tool like Postman. Data is typically organized as key-value pairs, which can be nested.
Format – An object begins and ends with curly braces {} and holds comma-separated key-value pairs; keys are strings in double quotes. A list of records is written as an array in square brackets [], with one {} object per record. Objects and arrays can be nested, and values can be strings, numbers, booleans or null.
Uses – config files, API payloads, NoSQL database files
Benefits
- Human and machine readable
- Readable by all programming languages
- Easy to parse in Python, JavaScript and more
Potential issues
- With many levels of nesting, it becomes “not-so” human readable after all.
- A schema is not enforced by the format itself, which may pose challenges in some use cases.
- Keys are repeated in every record, so the file gets bulkier and slower to parse as the number of records grows.
Architect’s tip – Very flexible, but the program needs error handling around parsing. Validating inputs with OpenAPI or JSON Schema is highly recommended.
{
  "Sl_number": 1,
  "Name": "Apple",
  "Classification": "Fruit",
  "Weight_per_unit": 40,
  "Color": "Red",
  "Cities_produced": ["Paris", "New York", "Mumbai"]
}
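The advice about parsing errors and validation can be illustrated with the standard json module. This is a minimal sketch: the REQUIRED_FIELDS table and the parse_item helper are hypothetical, and the hand-rolled type check only stands in for a real JSON Schema or OpenAPI validator.

```python
import json

# Hypothetical required fields and their expected Python types.
REQUIRED_FIELDS = {
    "Sl_number": int,
    "Name": str,
    "Weight_per_unit": int,
    "Cities_produced": list,
}

def parse_item(text):
    """Return (item, errors); item is None when the text is not usable JSON."""
    try:
        item = json.loads(text)
    except json.JSONDecodeError as exc:
        return None, [f"invalid JSON: {exc.msg} (line {exc.lineno})"]
    if not isinstance(item, dict):
        return None, ["top-level value must be an object"]
    errors = []
    for field, expected in REQUIRED_FIELDS.items():
        if field not in item:
            errors.append(f"missing field: {field}")
        elif not isinstance(item[field], expected):
            errors.append(f"wrong type for {field}")
    return item, errors

good = '{"Sl_number": 1, "Name": "Apple", "Weight_per_unit": 40, "Cities_produced": ["Paris", "Mumbai"]}'
bad = "{Sl_number: 1, Name: apple}"  # unquoted keys are not valid JSON

item, errors = parse_item(good)
_, bad_errors = parse_item(bad)
print(errors)
print(bad_errors)
```

Returning a list of problems instead of raising on the first one makes it easier to report every issue in a payload back to the caller in one go.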
YAML
The name is quite unique and trendsetting, to be honest! YAML originally stood for Yet Another Markup Language, and the official expansion today is the recursive YAML Ain’t Markup Language. It is primarily used for configuration and, like JSON, is human readable. Tools like GitHub Actions and Kubernetes use it for setting up configurations easily. One could say it addresses some of the challenges that JSON poses and simplifies the format.
Format – Key-value pairs in an indentation-based structure. Curly braces and commas are not required; the caveat is that it is sensitive to whitespace.
Uses – configuration settings in Kubernetes, CI/CD pipelines and infrastructure-as-code tools such as Ansible or CloudFormation (Terraform itself uses HCL rather than YAML).
Benefits
- Most readable of all the formats mentioned above
- Comments are possible, making it super versatile.
- Suitable for configurations that need to be structured.
Potential issues
- Sensitivity to whitespace and indentation: one misplaced space can break the whole file.
- With too many nesting layers, it gets extremely cumbersome to validate.
Architect’s tip – Ahem, it's more suited to humans than machines. It is best avoided for data; use it for config and config-like use cases. It is a good idea to add linting to your IDE to avoid whitespace trouble.
Fridge_routine_post_shopping:
  Classification: "Decide if fruit or veg"
  Shelf_selection:
    Fruit: "Door shelf"
    Vegetables: "Fresh drawer"
  Cooking:
    Fruit: "Not required"
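To make the whitespace concern concrete, here is a rough sketch of the kind of check a linter performs. The check_indentation helper and its two-space rule are assumptions made for illustration; this is not a YAML parser and not a substitute for a real linter. The only hard rule it relies on is that YAML does not allow tab characters for indentation.

```python
def check_indentation(text, step=2):
    """Return a list of (line_number, problem) for risky leading whitespace."""
    problems = []
    for number, line in enumerate(text.splitlines(), start=1):
        if not line.strip() or line.lstrip().startswith("#"):
            continue  # blank lines and comments do not affect structure
        indent = line[: len(line) - len(line.lstrip())]
        if "\t" in indent:
            problems.append((number, "tab used for indentation"))
        elif len(indent) % step != 0:
            # House-style assumption: every level is indented by `step` spaces.
            problems.append((number, f"indent is not a multiple of {step}"))
    return problems

sample = (
    "Fridge_routine_post_shopping:\n"
    "  Shelf_selection:\n"
    "     Fruit: Door shelf\n"    # five spaces
    "\tCooking: Not required\n"   # tab
)

issues = check_indentation(sample)
for line_number, problem in issues:
    print(f"line {line_number}: {problem}")
```

In practice a dedicated linter wired into the IDE or the CI pipeline covers far more cases; the sketch only shows why such a check catches mistakes early.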
A Ready Reference for Generic File Formats

| File Format | Human Readable | Machine Readable | Use cases | Beware of |
|---|---|---|---|---|
| CSV | Low | High | Reports, cross-platform data exchange | Separators in values |
| JSON | Medium | High | APIs, configs, NoSQL documents | File bulk as records are added |
| YAML | High | High | Configs, IaC | Whitespace, indentation |
Section 2: Healthcare Formats
HL7
Picture this. You go to a hospital for an illness. They put you through a set of tests. Maybe you are an in-patient for a few days and some medications are administered. Now suppose you didn’t quite like the treatment there and want to go to another hospital or clinic. You go elsewhere, only to find that they want to repeat all the tests. And you think, “if only there was a way to transfer all those records here somehow”. You’d think handing over the documents would suffice, but the new hospital maintains its records in a different way, and your data is not compatible with its format unless someone spends a considerable amount of time converting it. That is the gap HL7 is meant to fill.
In a sentence, HL7 (Health Level Seven) is a messaging standard that makes the transfer of medical records across departments or institutions seamless. It has evolved through several versions, which is what the v(n) indicates.
Format
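V2 – The most commonly used version. This is a delimiter-based messaging format: a message is a series of segments, fields are separated by the pipe character (|) and components by the caret (^).
V3 – XML based. In practice, v2 remains far more widely used.
FHIR (Fast Healthcare Interoperability Resources), pronounced “fire”, is the modern HL7 standard. It exposes resources over RESTful APIs in JSON or XML and is best suited to web applications, APIs etc.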
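To give a feel for the delimiter-based v2 format, here is a minimal sketch that splits a message into segments, fields and components. The message is made up for illustration, and parse_segments is a hypothetical helper, not a substitute for a proper HL7 library: it ignores repetitions, escape sequences and the special handling that the MSH header needs.

```python
# Made-up HL7 v2 message for illustration; segments end with a carriage return.
message = "\r".join([
    r"MSH|^~\&|LAB|HOSP_A|EHR|HOSP_B|202401011200||ORU^R01|MSG0001|P|2.5",
    "PID|1||12345^^^HOSP_A||DOE^JANE||19800101|F",
    "OBX|1|NM|GLU^Glucose||98|mg/dL|70-110|N|||F",
])

def parse_segments(raw):
    """Group fields by segment name; each field is split into components."""
    segments = {}
    for line in filter(None, raw.split("\r")):
        fields = line.split("|")
        # parsed[n - 1] holds field n for every segment except MSH, whose
        # first field is the separator itself and needs special handling.
        parsed = [field.split("^") for field in fields[1:]]
        segments.setdefault(fields[0], []).append(parsed)
    return segments

segments = parse_segments(message)

pid = segments["PID"][0]
family_name, given_name = pid[4][0], pid[4][1]            # PID-5: patient name

obx = segments["OBX"][0]
test_name, value, unit = obx[2][1], obx[4][0], obx[5][0]  # OBX-3, OBX-5, OBX-6

print(family_name, given_name)
print(test_name, value, unit)
```

Even this toy version shows why raw messages are hard to debug by eye: the meaning of a value depends entirely on its position within the segment.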
Potential issues
- Difficult to debug
- Vendor-specific implementations exist, which leads to inconsistencies.
- Validations and schemas are not rigid.
Architect’s tip – Don’t feed raw HL7 messages into the core system; parse and translate them in the integration layer instead. Build in validations at various stages.
Section 3: Finance Formats
The Banking and finance industry involves digital transactions, where money is transferred from one account to another. But there would be only chaos if each bank had its own format for the files that process and acknowledge these transfers. That is where standardized messaging formats play a significant role.
The formats define details like sender, receiver, amount, currency and purpose of the transaction. This information is packaged, sent and then interpreted on the other side. Standardization ensures that systems across the world can communicate, making it seamless for users to transact irrespective of where they are. Because the industry is regulated, the standards also help safeguard the user's interests across jurisdictions, time zones and currencies.
SWIFT MT
SWIFT stands for Society for Worldwide Interbank Financial Telecommunication and MT stands for Message Type. This is a structured, text-based set of formats used in interbank communication and money transfers. It has been around since the late 1970s.
Format
Each field is identified by a tag, though these tags are terser than, say, an XML tag. For instance, :50K: is the tag for the name and address of the sender.
A message code has the form MTnxx, where nxx is a 3-digit number and MT stands for Message Type. The n denotes a category with a value from 0 to 9, and the two digits xx identify a specific message type within that category.
MT Categories
| Category | Description |
|---|---|
| 0 | System Messages |
| 1 | Customer Payments and Cheques |
| 2 | Financial Institution Transfers |
| 3 | Treasury Markets (Forex, Derivatives) |
| 4 | Collections and Cash Letters |
| 5 | Securities Markets |
| 6 | Precious Metals and Syndications |
| 7 | Documentary Credits and Guarantees |
| 8 | Travellers Cheques |
| 9 | Cash Management and Customer Reporting |
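The MTnxx naming rule can be sketched as a small lookup. The describe_mt helper is hypothetical and only handles plain three-digit codes; variants written with a suffix, such as MT202 COV, would need extra handling.

```python
import re

# Category digit to description for SWIFT MT message codes.
MT_CATEGORIES = {
    "0": "System Messages",
    "1": "Customer Payments and Cheques",
    "2": "Financial Institution Transfers",
    "3": "Treasury Markets (Forex, Derivatives)",
    "4": "Collections and Cash Letters",
    "5": "Securities Markets",
    "6": "Precious Metals and Syndications",
    "7": "Documentary Credits and Guarantees",
    "8": "Travellers Cheques",
    "9": "Cash Management and Customer Reporting",
}

def describe_mt(code):
    """Split a code such as 'MT103' into category digit, type and description."""
    match = re.fullmatch(r"MT(\d)(\d{2})", code.strip().upper())
    if match is None:
        raise ValueError(f"not an MT message code: {code!r}")
    category, message_type = match.groups()
    return category, message_type, MT_CATEGORIES[category]

print(describe_mt("MT103"))
print(describe_mt("mt940"))
```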
Common Tags
| Tag | Field Name | Description |
|---|---|---|
| :20: | Transaction Reference Number | Unique ID for the transaction |
| :23B: | Bank Operation Code | Type of operation (e.g., CRED for credit transfer) |
| :32A: | Value Date / Currency / Amount | Date, currency code, and transaction amount (e.g., 240731EUR1000,00) |
| :33B: | Currency / Original Ordered Amount | Used when currency of ordering differs from transfer |
| :50A: | Ordering Customer (Account) | Sender's account number and bank identifier (uses BIC) |
| :50K: | Ordering Customer | Name and address of the sender (used in place of 50A if BIC is not used) |
| :52A: | Ordering Institution | Institution that initiated the transfer |
| :53A: | Sender's Correspondent | Sender’s correspondent bank (if any) |
| :54A: | Receiver's Correspondent | Receiver’s correspondent bank (if any) |
| :56A: | Intermediary Institution | Another intermediary bank used |
| :57A: | Account With Institution | Beneficiary's bank |
| :59: | Beneficiary Customer | Account number and name of recipient |
| :70: | Remittance Information | Purpose or reason for the payment |
| :71A: | Details of Charges | Who pays charges (OUR, SHA, BEN) |
| :72: | Sender to Receiver Information | Free-text field for additional info |
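Here is a minimal sketch of how tagged fields might be pulled out of the text portion of an MT message, including decoding the :32A: value. The text block is made up and only loosely modelled on an MT103, and split_fields and parse_32a are hypothetical helpers: header blocks, repeated tags and field-level validation are all ignored.

```python
import re
from decimal import Decimal
from datetime import datetime

# Made-up text block loosely modelled on an MT103; not a complete message.
block = """:20:REF123456
:23B:CRED
:32A:240731EUR1000,00
:50K:/12345678
JANE DOE
1 MAIN STREET
:59:/87654321
JOHN SMITH
:71A:SHA"""

def split_fields(text):
    """Map each tag to its value, keeping continuation lines with their tag."""
    fields = {}
    tag = None
    for line in text.splitlines():
        match = re.match(r":(\d{2}[A-Z]?):(.*)", line)
        if match:
            tag, value = match.groups()
            fields[tag] = value  # a repeated tag would overwrite the earlier one
        elif tag is not None:
            fields[tag] += "\n" + line
    return fields

def parse_32a(value):
    """Decode YYMMDD + currency + amount, where the amount uses a decimal comma."""
    match = re.fullmatch(r"(\d{6})([A-Z]{3})([\d,]+)", value)
    if match is None:
        raise ValueError(f"unexpected :32A: value: {value!r}")
    date_text, currency, amount_text = match.groups()
    value_date = datetime.strptime(date_text, "%y%m%d").date()
    return value_date, currency, Decimal(amount_text.replace(",", "."))

fields = split_fields(block)
value_date, currency, amount = parse_32a(fields["32A"])

print(fields["59"].splitlines())
print(value_date, currency, amount)
```

Using Decimal rather than float for the amount avoids rounding surprises, which matters when the values are money.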
Popular Messages
| MT Code | Name | Purpose |
|---|---|---|
| MT103 | Single Customer Credit Transfer | Standard customer wire transfer (person-to-person or B2B) |
| MT202 | General Financial Institution Transfer | Bank-to-bank payments (no customer involved) |
| MT202 COV | Cover Payment | Supplement to MT103, carries the funds between banks |
| MT101 | Request for Transfer | Customer initiates a transfer from their account via a bank |
| MT940 | Customer Statement Message | End-of-day bank statement |
| MT950 | Statement Message | Statement of bank’s own accounts |
| MT199 | Free Format Message | Used for general correspondence or additional information |
| MT999 | Free Format Message | Free-text correspondence, typically unauthenticated (informal use) |
Potential Issues
- Very rigid structure: you need to know the tags, categories etc. shown in the tables above. This makes it difficult to extend with custom data or even metadata.
- The codes are not self-describing. For example, there is no way to know that :59: means “Beneficiary Customer” without a lookup reference.
- Interpretation can vary from bank to bank, which can defeat the purpose of a standard. Institutions may also be on different versions.
- Difficult to parse programmatically.
Architect’s View
When designing a solution, always build in a normalization layer or a rule engine; it pays off for field-level validations. Decouple the format from the business logic. Keep the parsing logic flexible enough to absorb schema changes. Building correlation logic into the orchestration engine also makes the solution more robust. Remember to build solid tests and validations and, of course, the security layer.
ISO 20022
This is a global standard for financial messaging, introduced to address the limitations of the MT format and similar older formats. It uses XML, which makes messages more machine readable and well structured, and significantly eases extensibility. The format supports richer data and more detailed transaction information, all of which leads to better analytics.
Format
XML tags define the data clearly. E.g. <CreDtTm> holds the creation date and time of the message.
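Reading such tags is straightforward with a standard XML parser. The sketch below uses Python's xml.etree.ElementTree on a heavily trimmed, made-up fragment in the style of a pain.001 message; a real message carries many more mandatory elements and should be validated against the official XSD.

```python
import xml.etree.ElementTree as ET

# Heavily trimmed, made-up fragment; a real pain.001 has many more elements.
xml_text = """<Document xmlns="urn:iso:std:iso:20022:tech:xsd:pain.001.001.03">
  <CstmrCdtTrfInitn>
    <GrpHdr>
      <MsgId>MSG-0001</MsgId>
      <CreDtTm>2024-07-31T10:15:00</CreDtTm>
      <NbOfTxs>1</NbOfTxs>
    </GrpHdr>
  </CstmrCdtTrfInitn>
</Document>"""

# ISO 20022 documents are namespaced, so every lookup needs the prefix mapping.
NS = {"iso": "urn:iso:std:iso:20022:tech:xsd:pain.001.001.03"}

root = ET.fromstring(xml_text)
header = root.find("iso:CstmrCdtTrfInitn/iso:GrpHdr", NS)

message_id = header.findtext("iso:MsgId", namespaces=NS)
created = header.findtext("iso:CreDtTm", namespaces=NS)   # creation date and time
transaction_count = int(header.findtext("iso:NbOfTxs", namespaces=NS))

print(message_id, created, transaction_count)
```

Forgetting the namespace is the most common stumbling block: without it, the lookups silently return None.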
Message Types
| Message Type | Description | Business Area |
|---|---|---|
| acmt.001 | Account Opening | Account Management |
| caaa.001 | Authorisation Request | Cards & ATM |
| catm.001 | ATM Initialization | Cards & ATM |
| camt.029 | Resolution of Investigation | Cash Management |
| camt.052 | Bank to Customer Account Report | Cash Management |
| camt.053 | Bank to Customer Statement | Cash Management |
| camt.054 | Debit/Credit Notification | Cash Management |
| camt.060 | Account Reporting Request | Cash Management |
| acmt.023 | KYC Update | Compliance |
| auth.001 | Regulatory Reporting | Compliance |
| fxtr.001 | FX Trade Instruction | FX & Derivatives |
| trea.001 | Treasury Trade Capture | FX & Derivatives |
| pacs.002 | FI to FI Payment Status | Payments |
| pacs.004 | Payment Return | Payments |
| pacs.008 | FI to FI Customer Credit Transfer | Payments |
| pacs.009 | FI Credit Transfer | Payments |
| pain.001 | Customer Credit Transfer Initiation | Payments |
| pain.002 | Payment Status Report | Payments |
| pain.008 | Customer Direct Debit Initiation | Payments |
| seev.001 | Meeting Notification | Securities |
| semt.002 | Securities Balance Accounting Report | Securities |
| sese.023 | Securities Settlement Instruction | Securities |
| setr.001 | Subscription Order | Securities |
| tsmt.001 | Baseline Creation | Trade Services |
| tsmt.025 | Status Advice | Trade Services |
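Message identifiers follow a regular pattern: a four-letter business area and a three-digit message number, optionally followed by a variant and a version, as in pacs.008.001.08. The sketch below splits such an identifier; split_identifier is a hypothetical helper, and the prefix lookup is a small illustrative subset rather than the full catalogue.

```python
import re

# A small, illustrative subset of business-area prefixes.
BUSINESS_AREAS = {
    "pacs": "Payments Clearing and Settlement",
    "pain": "Payments Initiation",
    "camt": "Cash Management",
    "acmt": "Account Management",
}

def split_identifier(identifier):
    """Split 'pacs.008.001.08' (or the short form 'pacs.008') into its parts."""
    match = re.fullmatch(r"([a-z]{4})\.(\d{3})(?:\.(\d{3})\.(\d{2}))?", identifier)
    if match is None:
        raise ValueError(f"not an ISO 20022 message identifier: {identifier!r}")
    area, message, variant, version = match.groups()
    return {
        "business_area": BUSINESS_AREAS.get(area, area),  # unknown prefix kept as-is
        "message": message,
        "variant": variant,   # None for the short form
        "version": version,   # None for the short form
    }

full = split_identifier("pacs.008.001.08")
short = split_identifier("camt.053")
print(full)
print(short)
```

Keeping the version as a separate field is handy for routing, since two institutions may exchange the same message type in different versions.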
Benefits
- Richer data
- Better interoperability (especially cross-border transactions)
- Easy to extend with support for custom fields that may be local or industry specific.
- Supports auditability, traceability and compliance, which is especially useful for KYC (Know Your Customer) and AML (Anti-Money Laundering) checks.
Potential Issues
- XML schema can get very complex over time.
- Steep Learning Curve
- May not work with legacy systems.
- Constantly changing versions.
Architect’s View
It is recommended to use a rule engine to keep business rules flexible. Normalizing messages into a simpler internal representation such as JSON can help tame the schema complexity. The best option is to abstract the parsing logic and split it into modules. This is especially useful when optional fields, newer tags etc. come into play. Keep auditability as a fundamental part of your design.
Wrapping It All Up
File formats are not just plain old syntax. They bundle together a domain’s logic, expectations and regulation. This holds whether you are working with plain old CSV or the modern ISO 20022, and whether you are mapping HL7 in healthcare or decoding MT tags in finance. Understanding the format itself elevates you from being just a coder or architect. It makes you a “speaker” of the domain’s language.
Remember, one doesn’t need to master all the file types out there. But learn to read a format, understand the “why” behind it before you start working with it, and that is all there is to it.
This article may not have covered every complex jungle out there, but hopefully, I’ve pointed you in the right direction, to explore further.