Generating Schema-Valid Synthetic ISO 20022 Messages for Privacy-Preserving Fraud Detection
Leverage a schema-aware federated approach to generate synthetic ISO 20022 payments data with strict personal information privacy and XSD compliance.
Modern fraud detection systems depend on machine learning models trained on large volumes of payment transaction data. The challenge is that real payment messages, especially ISO 20022 formats such as pacs.008 and pacs.009, contain highly sensitive financial and customer information that cannot be freely shared across institutions.
This creates a structural limitation. Fraud patterns often emerge only when data is analyzed across multiple financial institutions, yet regulatory, privacy, and competitive constraints prevent raw transaction data from leaving institutional boundaries.
One practical solution is synthetic data generation: producing artificial payment messages that preserve statistical and behavioral characteristics of real transactions without exposing any sensitive or personally identifiable information. When designed correctly, synthetic ISO 20022 messages can be safely shared and used to train stronger fraud detection models.
This article presents a schema-aware, engineering-focused approach for generating synthetic pacs.008 and pacs.009 messages, inspired by architectural principles described in a recent patent on privacy-preserving, collaborative synthetic data generation. The focus here is not on payment business processes, but on how developers can implement such a system, validate it, and integrate it into machine learning workflows.
The Engineering Problem
ISO 20022 payment messages are not simple records. They are deeply nested XML documents governed by strict XSD schemas, mandatory and optional fields, and non-trivial inter-field constraints. Common challenges include:
- Deeply nested structures with repeating elements
- Mandatory validation against official XSDs
- Strong dependencies between fields (amounts, currencies, settlement methods)
- Temporal and sequencing behavior that matters for fraud detection
At the same time, fraud models require behavioral realism, not just syntactic correctness. A valuable synthetic dataset must preserve signals such as transaction velocity, amount clustering, routing patterns, and time-of-day effects.
High-Level Architecture
Each participating financial institution operates a local synthetic data generator trained on its internal transaction patterns. No raw transactions are shared externally. Instead, institutions generate schema-valid synthetic ISO 20022 messages that can be shared for analytics or model training.
At a high level:
- Real transaction data remains internal
- Local generators learn statistical distributions and correlations
- Synthetic pacs.008 and pacs.009 messages are generated
- Messages are validated against official XSDs
- Only synthetic outputs are shared
This approach enables collaboration while maintaining strict data privacy boundaries.
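As a rough illustration of this flow, the sketch below fits a per-currency amount profile on internal data and then samples synthetic values from it. The class name `LocalGenerator`, its `fit` and `sample` methods, and the choice of a Gaussian per currency are assumptions made for this example, not part of the architecture described above.

```python
import random
import statistics
from collections import defaultdict


class LocalGenerator:
    """Hypothetical per-institution generator: keeps aggregates, never raw rows."""

    def __init__(self, seed=None):
        self._rng = random.Random(seed)
        self.profiles = {}  # currency -> (mean, std, share of traffic)

    def fit(self, transactions):
        """Learn per-currency amount statistics from (currency, amount) pairs."""
        by_ccy = defaultdict(list)
        for currency, amount in transactions:
            by_ccy[currency].append(amount)
        total = sum(len(amounts) for amounts in by_ccy.values())
        for currency, amounts in by_ccy.items():
            mean = statistics.fmean(amounts)
            std = statistics.pstdev(amounts)
            self.profiles[currency] = (mean, std, len(amounts) / total)

    def sample(self, n):
        """Draw n synthetic (currency, amount) pairs from the learned profiles."""
        currencies = list(self.profiles)
        weights = [self.profiles[c][2] for c in currencies]
        out = []
        for currency in self._rng.choices(currencies, weights=weights, k=n):
            mean, std, _ = self.profiles[currency]
            # Floor at 0.01 so a synthetic amount is never zero or negative
            amount = max(0.01, self._rng.gauss(mean, std))
            out.append((currency, round(amount, 2)))
        return out


# Toy stand-in for internal data; real transactions never leave the institution
internal = [("USD", 2400.0), ("USD", 2600.0), ("EUR", 2000.0), ("EUR", 2200.0)]
generator = LocalGenerator(seed=7)
generator.fit(internal)
synthetic = generator.sample(5)
```

Only `synthetic` would be passed to the message builders and shared; `internal` and the fitted profiles stay inside the institution in this sketch.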
Implementation Details
1. Parsing ISO 20022 XSD Schemas
The foundation of any ISO 20022 generator is the official XSD schema. Hard-coding message templates is brittle and error-prone. Instead, the generator should derive structure directly from the schema.
from lxml import etree

def load_schema(xsd_path):
    # Parsing from the path (rather than from raw bytes) keeps the base URL,
    # so any relative xs:import / xs:include references can be resolved
    return etree.XMLSchema(etree.parse(xsd_path))

pacs008_schema = load_schema("pacs.008.001.08.xsd")
pacs009_schema = load_schema("pacs.009.001.08.xsd")
This allows the generator to remain resilient to schema evolution and supports multiple message versions with minimal changes.
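To make "deriving structure from the schema" concrete, here is a minimal sketch that reads the ordered children of one complex type, together with whether each is mandatory. It uses only the standard library and a toy XSD fragment written in the style of an ISO 20022 schema (not the official file). The function name `describe_sequence` is hypothetical, and the sketch handles plain `xs:sequence` content only, not `xs:choice` or nested groups.

```python
import xml.etree.ElementTree as ET

XS = "{http://www.w3.org/2001/XMLSchema}"

# Toy fragment in the style of an ISO 20022 XSD; not the official schema
TOY_XSD = """<xs:schema xmlns:xs="http://www.w3.org/2001/XMLSchema">
  <xs:complexType name="PaymentIdentification">
    <xs:sequence>
      <xs:element name="InstrId" type="xs:string" minOccurs="0"/>
      <xs:element name="EndToEndId" type="xs:string"/>
      <xs:element name="TxId" type="xs:string" minOccurs="0"/>
    </xs:sequence>
  </xs:complexType>
</xs:schema>"""


def describe_sequence(xsd_text, type_name):
    """Return (name, type, mandatory) for each child of a complexType, in schema order."""
    root = ET.fromstring(xsd_text)
    for complex_type in root.iter(f"{XS}complexType"):
        if complex_type.get("name") != type_name:
            continue
        sequence = complex_type.find(f"{XS}sequence")
        if sequence is None:
            return []
        return [
            # minOccurs defaults to 1 in XSD, so a missing attribute means mandatory
            (el.get("name"), el.get("type"), el.get("minOccurs", "1") != "0")
            for el in sequence.findall(f"{XS}element")
        ]
    raise KeyError(type_name)


fields = describe_sequence(TOY_XSD, "PaymentIdentification")
```

A generator driven by such a description can emit children in schema order and skip optional ones at random, instead of relying on a hand-maintained template.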
2. Constraint-Aware Field Generation
ISO 20022 messages contain interdependent fields. Generating fields independently leads to invalid or unrealistic payment messages. Examples of common business constraints include:
- pacs.009 omits the customer elements present in pacs.008, because it carries FI-to-FI payments
- The settlement date should not precede the message creation time
- The settlement amount must carry the settlement currency
- Charge bearer values must come from the allowed enumeration
A robust generator explicitly enforces these business rules:
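import random

def generate_settlement_amount(currency):
    # (mean, standard deviation) per currency; illustrative values
    profiles = {
        "USD": (2500, 900),
        "EUR": (2100, 800)
    }
    mean, std = profiles.get(currency, (1500, 600))
    # Floor at 0.01 so the Gaussian never yields a zero or negative amount
    value = max(0.01, random.gauss(mean, std))
    return round(value, 2)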
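Sampling from fitted distributions rather than uniformly helps preserve fraud-relevant behavior such as amount clustering. A Gaussian rarely produces extreme values, so reproducing high-value outliers realistically calls for a heavier-tailed choice such as a log-normal.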
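The cross-field rules listed in this section can be enforced by generating dependent fields together rather than one at a time. The sketch below is one way to do that under stated assumptions: the function names, the dictionary keys, and the 0 to 2 day settlement lag are hypothetical, and weekends and bank holidays are ignored. The four charge bearer codes are the values of the ISO 20022 `ChargeBearerType1Code` enumeration.

```python
import random
from datetime import datetime, timedelta

# Values of the ISO 20022 ChargeBearerType1Code enumeration
CHARGE_BEARERS = ("DEBT", "CRED", "SHAR", "SLEV")


def generate_dependent_fields(created_at, currency, rng=random):
    """Generate fields jointly so the cross-field rules hold by construction."""
    # Settlement date is the creation date plus 0-2 days, so it never precedes it
    settlement_date = created_at.date() + timedelta(days=rng.randint(0, 2))
    return {
        "timestamp": created_at.isoformat(timespec="seconds"),
        "settlement_date": settlement_date.isoformat(),
        "currency": currency,
        "charge_bearer": rng.choice(CHARGE_BEARERS),
    }


def check_constraints(fields):
    """Return a list of violated business rules (empty means consistent)."""
    problems = []
    created = datetime.fromisoformat(fields["timestamp"])
    # ISO dates (YYYY-MM-DD) compare correctly as plain strings
    if fields["settlement_date"] < created.date().isoformat():
        problems.append("settlement date precedes creation time")
    if fields["charge_bearer"] not in CHARGE_BEARERS:
        problems.append("charge bearer outside allowed enumeration")
    return problems


fields = generate_dependent_fields(datetime(2024, 3, 1, 14, 30), "EUR", random.Random(1))
violations = check_constraints(fields)
```

Keeping a separate `check_constraints` step is a deliberate choice in this sketch: XSD validation catches structural errors, but rules such as date ordering are not expressible in the schema and need their own check.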
3. Preserving Fraud-Relevant Signals
Synthetic data is only useful for fraud detection if it preserves behavioral patterns, not just schema validity. Important signals include:
- Transaction bursts and velocity
- Time-of-day and day-of-week effects
- Repeated routing paths
- Consistent rounding behavior
For example, timestamps should be generated within correlated windows around a shared base time rather than as independent random values.
import random
from datetime import datetime, timedelta

def generate_timestamp(base_time, jitter_minutes=15):
    # Jitter around a shared base time keeps related transactions close together
    offset = random.randint(-jitter_minutes, jitter_minutes)
    return (base_time + timedelta(minutes=offset)).isoformat()
This enables the creation of synthetic transaction sequences, which are far more valuable for machine learning than isolated records.
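A sequence-level sketch of the same idea follows: it picks a start hour from a time-of-day profile and then emits a short burst of strictly increasing timestamps, which approximates the velocity signal. The hourly weights, the function name `generate_burst`, and the 90-second maximum gap are illustrative assumptions, not values taken from real traffic.

```python
import random
from datetime import datetime, timedelta

# Hypothetical hourly weights: business hours carry most of the traffic
HOUR_WEIGHTS = [1] * 8 + [6] * 10 + [2] * 6  # 24 entries, 00:00-23:00


def generate_burst(day, size, rng, max_gap_seconds=90):
    """Return `size` increasing ISO timestamps that start at a weighted random hour."""
    hour = rng.choices(range(24), weights=HOUR_WEIGHTS, k=1)[0]
    current = day.replace(hour=hour, minute=rng.randint(0, 59), second=0, microsecond=0)
    stamps = []
    for _ in range(size):
        stamps.append(current)
        # Short positive gaps keep the burst ordered and tightly clustered
        current += timedelta(seconds=rng.randint(1, max_gap_seconds))
    return [stamp.isoformat() for stamp in stamps]


burst = generate_burst(datetime(2024, 3, 1), size=5, rng=random.Random(42))
```

Each timestamp in `burst` can feed one message's `CreDtTm`, so that a downstream model sees the same short inter-arrival gaps it would see in a real burst.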
4. Message Assembly
Once fields are generated, the message is assembled using builder components that mirror production payment engines. To maintain scalability and reuse across multiple message types, we implement a modular builder pattern. This ensures that the intricate Agent and Party blocks can be generated consistently across both pacs.008 and pacs.009:
from lxml import etree

PACS008_NS = "urn:iso:std:iso:20022:tech:xsd:pacs.008.001.08"
PACS009_NS = "urn:iso:std:iso:20022:tech:xsd:pacs.009.001.08"

def new_document(ns):
    # Declare the default namespace through nsmap. Passing xmlns= as a keyword
    # only sets a plain attribute, so the in-memory elements stay unqualified
    # and fail validation against the namespaced XSD
    return etree.Element(f"{{{ns}}}Document", nsmap={None: ns})

def sub(parent, tag, text=None, **attrib):
    """Append a child element in the parent's namespace, optionally with text"""
    ns = etree.QName(parent).namespace
    child = etree.SubElement(parent, f"{{{ns}}}{tag}", **attrib)
    if text is not None:
        child.text = text
    return child

def add_agent_block(parent, element_name, bic, name):
    """Helper to build Financial Institution blocks (e.g., DbtrAgt, CdtrAgt)"""
    fin_instn = sub(sub(parent, element_name), "FinInstnId")
    sub(fin_instn, "BICFI", bic)
    sub(fin_instn, "Nm", name)

def add_group_header(body, data):
    """Group Header (GrpHdr), shared by pacs.008 and pacs.009"""
    grp_hdr = sub(body, "GrpHdr")
    sub(grp_hdr, "MsgId", data["msg_id"])
    sub(grp_hdr, "CreDtTm", data["timestamp"])
    sub(grp_hdr, "NbOfTxs", "1")
    sttlm_inf = sub(grp_hdr, "SttlmInf")
    sub(sttlm_inf, "SttlmMtd", data["settlement_method"])  # e.g., 'CLRG' or 'INDA'

def build_pacs008_message(data):
    root = new_document(PACS008_NS)
    body = sub(root, "FIToFICstmrCdtTrf")
    # 1. Group Header (GrpHdr)
    add_group_header(body, data)
    # 2. Credit Transfer Transaction Information (CdtTrfTxInf)
    # Children are appended in the order the XSD sequence requires
    tx_inf = sub(body, "CdtTrfTxInf")
    # Payment ID
    pmt_id = sub(tx_inf, "PmtId")
    sub(pmt_id, "EndToEndId", data["e2e_id"])
    sub(pmt_id, "TxId", data["tx_id"])
    # Amount & Currency (the currency is an XML attribute)
    sub(tx_inf, "IntrBkSttlmAmt", f"{data['amount']:.2f}", Ccy=data["currency"])
    # 3. Charge bearer (mandatory in pacs.008) and Charges Information (ChrgsInf)
    # Unusual charge distributions can be a signal of money laundering
    sub(tx_inf, "ChrgBr", data["charge_bearer"])  # assumed key, e.g., 'SHAR'
    chrgs = sub(tx_inf, "ChrgsInf")
    sub(chrgs, "Amt", f"{data['charge_amount']:.2f}", Ccy=data["currency"])
    add_agent_block(chrgs, "Agt", data["dbtr_bic"], data["dbtr_bank_name"])
    # 4. Debtor side, then creditor side (the actual people/entities and their agents)
    sub(sub(tx_inf, "Dbtr"), "Nm", data["dbtr_name"])
    add_agent_block(tx_inf, "DbtrAgt", data["dbtr_bic"], data["dbtr_bank_name"])
    add_agent_block(tx_inf, "CdtrAgt", data["cdtr_bic"], data["cdtr_bank_name"])
    sub(sub(tx_inf, "Cdtr"), "Nm", data["cdtr_name"])
    return root

def build_pacs009_message(data):
    # Notice we reuse the same 'data' dictionary but
    # extract different synthetic features
    root = new_document(PACS009_NS)
    body = sub(root, "FICdtTrf")
    # Reuse the same synthetic MsgId and Amount from the pacs.008
    # to maintain "Inter-message Integrity"
    add_group_header(body, data)
    # pacs.009 focuses on the FI-to-FI settlement
    tx_inf = sub(body, "CdtTrfTxInf")
    sub(sub(tx_inf, "PmtId"), "EndToEndId", data["e2e_id"])
    sub(tx_inf, "IntrBkSttlmAmt", f"{data['amount']:.2f}", Ccy=data["currency"])
    # We reuse our helper function to keep the bank identities consistent
    add_agent_block(tx_inf, "InstgAgt", data["dbtr_bic"], data["dbtr_bank_name"])
    add_agent_block(tx_inf, "InstdAgt", data["cdtr_bic"], data["cdtr_bank_name"])
    # In pacs.009 the mandatory Dbtr and Cdtr are themselves financial institutions
    add_agent_block(tx_inf, "Dbtr", data["dbtr_bic"], data["dbtr_bank_name"])
    add_agent_block(tx_inf, "Cdtr", data["cdtr_bic"], data["cdtr_bank_name"])
    return root
In practice, separate builders are typically used for group headers, settlement details, party identification, and agent blocks. This modularity allows reuse across pacs.008 and pacs.009.
5. Schema Validation Loop
Every generated message must be validated against the official ISO 20022 XSD before being released.
def validate_message(xml_element, schema):
    xml_doc = etree.ElementTree(xml_element)
    return schema.validate(xml_doc)
A typical pattern is:
- Generate message.
- Validate against XSD.
- Log validation failures.
- Regenerate only the failing subtree.
This feedback loop improves quality while keeping the generation efficient.
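The loop above can be sketched independently of any XML library by treating a message as a set of named sections. Everything here is an assumption for illustration: `generate_valid`, the section-builder dictionary, and a `validate` callable that returns the names of failing sections. In a real pipeline that callable would wrap the XSD validator and map each schema error back to the section that produced it.

```python
import random


def generate_valid(section_builders, validate, rng, max_attempts=5):
    """Build all sections once, then rebuild only the sections that fail validation.

    section_builders: dict mapping a section name to a function rng -> value
    validate: function message -> list of failing section names (empty if valid)
    """
    message = {name: build(rng) for name, build in section_builders.items()}
    failure_log = []
    for attempt in range(1, max_attempts + 1):
        failing = validate(message)
        if not failing:
            return message, failure_log
        # Log the failure, then regenerate just the failing sections
        failure_log.append((attempt, list(failing)))
        for name in failing:
            message[name] = section_builders[name](rng)
    raise ValueError(f"still invalid after {max_attempts} attempts: {failure_log}")


# Toy usage: the amount builder sometimes yields an invalid non-positive value
builders = {
    "msg_id": lambda rng: f"MSG{rng.randint(0, 999999):06d}",
    "amount": lambda rng: round(rng.uniform(-50, 500), 2),
}


def toy_validate(message):
    return [] if message["amount"] > 0 else ["amount"]


message, failures = generate_valid(builders, toy_validate, random.Random(3), max_attempts=20)
```

Sections that already passed, such as `msg_id` here, are left untouched between attempts, which is what keeps regeneration cheap. The returned `failures` list doubles as the validation log.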
Why This Matters for Fraud Detection
By enforcing schema correctness, preserving behavioral signals, and validating every message, financial institutions can share synthetic ISO 20022 data that is:
- Suitable for supervised and unsupervised ML
- Safe to exchange across organizational boundaries
- Representative of real payment flows
This enables fraud detection models to learn cross-institution patterns that would otherwise remain hidden in siloed datasets.
Final Wrap-Up and Next Steps
As ISO 20022 adoption accelerates, the need for realistic, privacy-safe payment data will only grow. Synthetic pacs.008 and pacs.009 generation offers a practical way to support testing, analytics, and fraud model training without exposing sensitive transaction data.
Looking ahead, this approach can be extended by:
- Sharing learned distributions or model parameters rather than data
- Training federated fraud detection models across institutions
- Iteratively refining synthetic generators based on ML performance
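As a small illustration of the first extension, institutions could exchange per-currency summaries of the form (count, mean, standard deviation) and pool them without moving any records. The sketch below shows the arithmetic using population statistics. The function name `pool_profiles` and the example figures are hypothetical, and the sketch makes no claim about whether such summaries satisfy a given privacy requirement.

```python
import math


def pool_profiles(profiles):
    """Combine per-institution (count, mean, std) summaries into one (count, mean, std).

    Uses population statistics: the pooled variance adds each institution's own
    variance to the squared distance between its mean and the pooled mean.
    """
    total = sum(n for n, _, _ in profiles)
    mean = sum(n * m for n, m, _ in profiles) / total
    variance = sum(n * (s ** 2 + (m - mean) ** 2) for n, m, s in profiles) / total
    return total, mean, math.sqrt(variance)


# Hypothetical summaries shared by two institutions for one currency
shared = [(1000, 2500.0, 900.0), (3000, 2100.0, 800.0)]
count, mean, std = pool_profiles(shared)
```

The pooled result matches what would be computed over the combined raw data, so a shared generator can be parameterized from it directly.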
While this article focused on implementation mechanics, the same principles apply across ISO 20022 message types and regulated data domains. When engineered correctly, synthetic data can serve as a safe foundation for building smarter, more resilient fraud detection systems.