Generating Schema-Valid Synthetic ISO 20022 Messages for Privacy-Preserving Fraud Detection

Use a schema-aware, federated approach to generate synthetic ISO 20022 payment data that protects personal information and stays XSD-compliant.

Jan. 28, 26 · Analysis

Modern fraud detection systems depend on machine learning models trained on large volumes of payment transaction data. The challenge is that real payment messages — especially ISO 20022 formats such as pacs.008 and pacs.009 — contain highly sensitive financial and customer information that cannot be freely shared across institutions.

This creates a structural limitation. Fraud patterns often emerge only when data is analyzed across multiple financial institutions, yet regulatory, privacy, and competitive constraints prevent raw transaction data from leaving institutional boundaries.

One practical solution is synthetic data generation: producing artificial payment messages that preserve statistical and behavioral characteristics of real transactions without exposing any sensitive or personally identifiable information. When designed correctly, synthetic ISO 20022 messages can be safely shared and used to train stronger fraud detection models.

This article presents a schema-aware, engineering-focused approach for generating synthetic pacs.008 and pacs.009 messages, inspired by architectural principles described in a recent patent on privacy-preserving, collaborative synthetic data generation. The focus here is not on payment business processes, but on how developers can implement such a system, validate it, and integrate it into machine learning workflows.

The Engineering Problem

ISO 20022 payment messages are not simple records. They are deeply nested XML documents governed by strict XSD schemas, mandatory and optional fields, and non-trivial inter-field constraints. Common challenges include:

Deeply nested structures with repeating elements
Mandatory validation against official XSDs
Strong dependencies between fields (amounts, currencies, settlement methods)
Temporal and sequencing behavior that matters for fraud detection

At the same time, fraud models require behavioral realism, not just syntactic correctness. A valuable synthetic dataset must preserve signals such as transaction velocity, amount clustering, routing patterns, and time-of-day effects.

High-Level Architecture

Each participating financial institution operates a local synthetic data generator trained on its internal transaction patterns. No raw transactions are shared externally. Instead, institutions generate schema-valid synthetic ISO 20022 messages that can be shared for analytics or model training.

At a high level:

Real transaction data remains internal
Local generators learn statistical distributions and correlations
Synthetic pacs.008 and pacs.009 messages are generated
Messages are validated against official XSDs
Only synthetic outputs are shared

This approach enables collaboration while maintaining strict data privacy boundaries.
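As a minimal sketch of the "learn locally, share only synthetic output" idea, the class below fits a log-normal distribution to internal amounts and then samples new ones. The class name `LocalAmountGenerator`, the choice of a log-normal, and the sample amounts are all assumptions made for illustration. A single fitted distribution like this does not capture correlations and gives no formal privacy guarantee on its own.

```python
import math
import random
import statistics


class LocalAmountGenerator:
    """Hypothetical per-institution generator that keeps only fitted parameters."""

    def fit(self, real_amounts):
        # Fit a log-normal by taking the mean and stdev of log(amount).
        logs = [math.log(a) for a in real_amounts]
        self.mu = statistics.mean(logs)
        self.sigma = statistics.stdev(logs)
        return self

    def sample(self, n, seed=None):
        # Draw fresh amounts from the fitted distribution; no real row is copied.
        rng = random.Random(seed)
        return [round(rng.lognormvariate(self.mu, self.sigma), 2) for _ in range(n)]


# Illustrative internal data (made up); it never leaves this process.
internal_amounts = [120.50, 980.00, 15000.00, 43.10, 2500.00, 310.75]
generator = LocalAmountGenerator().fit(internal_amounts)
synthetic_amounts = generator.sample(1000, seed=7)
```

Only `synthetic_amounts` would be passed on to the message builders and shared; the real amounts are used once, inside `fit`.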

Implementation Details

1. Parsing ISO 20022 XSD Schemas

The foundation of any ISO 20022 generator is the official XSD schema. Hard-coding message templates is brittle and error-prone. Instead, the generator should derive structure directly from the schema.

    Python
   
from lxml import etree

def load_schema(xsd_path):
    with open(xsd_path, 'rb') as f:
        schema_root = etree.XML(f.read())
    return etree.XMLSchema(schema_root)

pacs008_schema = load_schema("pacs.008.001.08.xsd")
pacs009_schema = load_schema("pacs.009.001.08.xsd")

This allows the generator to remain resilient to schema evolution and supports multiple message versions with minimal changes.
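To illustrate what deriving structure from the schema can look like, the sketch below reads an XSD as ordinary XML and lists the child elements of a complex type, in sequence order, along with whether each is mandatory. The toy schema and the helper name `list_children` are assumptions for this example. Real ISO 20022 XSDs are much larger and also use choices and nested types that this does not handle.

```python
import xml.etree.ElementTree as ET

XS = {"xs": "http://www.w3.org/2001/XMLSchema"}

# Toy schema standing in for a real ISO 20022 XSD (an assumption for this sketch).
TOY_XSD = """<xs:schema xmlns:xs="http://www.w3.org/2001/XMLSchema">
  <xs:complexType name="GroupHeader">
    <xs:sequence>
      <xs:element name="MsgId" type="xs:string"/>
      <xs:element name="CreDtTm" type="xs:dateTime"/>
      <xs:element name="CtrlSum" type="xs:decimal" minOccurs="0"/>
    </xs:sequence>
  </xs:complexType>
</xs:schema>"""


def list_children(xsd_root, type_name):
    """Return (element name, is_mandatory) pairs in schema sequence order."""
    path = f"xs:complexType[@name='{type_name}']/xs:sequence/xs:element"
    return [
        # minOccurs defaults to 1 in XSD, so only an explicit "0" is optional.
        (el.get("name"), el.get("minOccurs", "1") != "0")
        for el in xsd_root.findall(path, XS)
    ]


children = list_children(ET.fromstring(TOY_XSD), "GroupHeader")
print(children)
# [('MsgId', True), ('CreDtTm', True), ('CtrlSum', False)]
```

A generator can iterate over such a list to emit elements in the order the schema demands, instead of relying on a hand-written template.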

2. Constraint-Aware Field Generation

ISO 20022 messages contain interdependent fields. Generating fields independently leads to invalid or unrealistic payment messages. Examples of common business constraints include:

pacs.009 omits the customer elements present in pacs.008, because it carries FI-to-FI payments
The settlement date should not precede the message creation time
The settlement amount must be expressed in the settlement currency
Charge bearer values must come from the allowed enumeration (for example DEBT, CRED, SHAR, SLEV)

A robust generator explicitly enforces these business rules:

    Python
   
import random

def generate_settlement_amount(currency):
    profiles = {
        "USD": (2500, 900),
        "EUR": (2100, 800)
    }
    mean, std = profiles.get(currency, (1500, 600))
    value = max(0.01, random.gauss(mean, std))
    return round(value, 2)

  

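Sampling from fitted distributions rather than uniformly helps preserve fraud-relevant behavior such as amount clustering. The Gaussian profiles above are a simple starting point; when high-value outliers matter, a heavy-tailed distribution such as a log-normal is usually a closer fit.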
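The inter-field constraints listed earlier can also be enforced by construction, by drawing dependent fields together instead of one at a time. The following is a sketch under stated assumptions: the function name, the 0 to 2 day settlement lag, the log-normal parameters, and the small currency table are illustrative choices, not values taken from any real profile.

```python
import random
from datetime import datetime, timedelta

CHARGE_BEARERS = ("DEBT", "CRED", "SHAR", "SLEV")   # allowed ChrgBr codes
CURRENCY_DECIMALS = {"USD": 2, "EUR": 2, "JPY": 0}  # ISO 4217 minor units


def generate_constrained_fields(currency, created_at, rng=random):
    """Draw dependent fields together so the constraints hold by construction."""
    # Settlement date is the creation date plus 0-2 days, so it never precedes it.
    settlement_date = created_at.date() + timedelta(days=rng.randint(0, 2))
    # Log-normal amounts are strictly positive with a long right tail.
    decimals = CURRENCY_DECIMALS.get(currency, 2)
    amount = round(rng.lognormvariate(7.5, 1.0), decimals)
    return {
        "currency": currency,
        "amount": f"{amount:.{decimals}f}",
        "created_at": created_at.isoformat(),
        "settlement_date": settlement_date.isoformat(),
        "charge_bearer": rng.choice(CHARGE_BEARERS),
    }


fields = generate_constrained_fields(
    "JPY", datetime(2026, 1, 28, 23, 30), rng=random.Random(1)
)
```

Passing a seeded `random.Random` instance as `rng` keeps runs reproducible, which makes validation failures easier to replay.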

3. Preserving Fraud-Relevant Signals

Synthetic data is only useful for fraud detection if it preserves behavioral patterns, not just schema validity. Important signals include:

Transaction bursts and velocity
Time-of-day and day-of-week effects
Repeated routing paths
Consistent rounding behavior

For example, timestamps should be drawn from windows anchored to a shared base time rather than generated as independent random values.

    Python
   
import random
from datetime import datetime, timedelta

def generate_timestamp(base_time, jitter_minutes=15):
    offset = random.randint(-jitter_minutes, jitter_minutes)
    return (base_time + timedelta(minutes=offset)).isoformat()

This enables the creation of synthetic transaction sequences, which are far more valuable for machine learning than isolated records.
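One way to turn single timestamps into a sequence is to model the gaps between transactions. The sketch below assumes exponentially distributed gaps, which produce clusters of closely spaced events. The function name `generate_burst` and the 20-second mean gap are hypothetical choices for illustration.

```python
import random
from datetime import datetime, timedelta


def generate_burst(base_time, count, mean_gap_seconds=20.0, rng=random):
    """Return `count` ISO timestamps separated by exponentially distributed gaps."""
    timestamps = []
    current = base_time
    for _ in range(count):
        # expovariate takes the rate (1 / mean gap), so short gaps dominate.
        current += timedelta(seconds=rng.expovariate(1.0 / mean_gap_seconds))
        timestamps.append(current.isoformat(timespec="seconds"))
    return timestamps


burst = generate_burst(datetime(2026, 1, 28, 2, 15), count=5, rng=random.Random(42))
```

Each timestamp in `burst` can then feed one generated message, so the velocity signal is preserved across the sequence rather than within a single record.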

4. Message Assembly

Once fields are generated, the message is assembled using builder components that mirror production payment engines. To maintain scalability and reuse across multiple message types, we implement a modular builder pattern. This ensures that the intricate Agent and Party blocks can be generated consistently across both pacs.008 and pacs.009:

    Python
   
 

from lxml import etree

def child(parent, tag, text=None, **attrs):
    """Add a sub-element in the parent's namespace and optionally set its text."""
    ns = etree.QName(parent).namespace
    elem = etree.SubElement(parent, f"{{{ns}}}{tag}" if ns else tag, **attrs)
    if text is not None:
        elem.text = text
    return elem

def new_document(namespace):
    # A plain xmlns attribute does not place lxml elements in the namespace,
    # so in-memory XSD validation would fail. Qualify the tag and use nsmap.
    return etree.Element(f"{{{namespace}}}Document", nsmap={None: namespace})

def add_agent_block(parent, element_name, bic, name):
    """Helper to build Financial Institution blocks (e.g., DbtrAgt, CdtrAgt)"""
    agent = child(parent, element_name)
    fin_instn = child(agent, "FinInstnId")
    child(fin_instn, "BICFI", bic)
    child(fin_instn, "Nm", name)

def build_pacs008_message(data):
    root = new_document("urn:iso:std:iso:20022:tech:xsd:pacs.008.001.08")
    body = child(root, "FIToFICstmrCdtTrf")

    # 1. Group Header (GrpHdr)
    grp_hdr = child(body, "GrpHdr")
    child(grp_hdr, "MsgId", data["msg_id"])
    child(grp_hdr, "CreDtTm", data["timestamp"])
    child(grp_hdr, "NbOfTxs", "1")
    sttlm_inf = child(grp_hdr, "SttlmInf")
    child(sttlm_inf, "SttlmMtd", data["settlement_method"])  # e.g., 'CLRG' or 'INDA'

    # 2. Credit Transfer Transaction Information (CdtTrfTxInf)
    # Children must follow the XSD sequence order, or validation fails.
    tx_inf = child(body, "CdtTrfTxInf")

    # Payment ID
    pmt_id = child(tx_inf, "PmtId")
    child(pmt_id, "EndToEndId", data["e2e_id"])
    child(pmt_id, "TxId", data["tx_id"])

    # Amount & Currency (Uses XML Attributes)
    child(tx_inf, "IntrBkSttlmAmt", f"{data['amount']:.2f}", Ccy=data["currency"])

    # Charge bearer is mandatory in the XSD; 'charge_bearer' is an added,
    # assumed key in `data` (e.g., 'SHAR').
    child(tx_inf, "ChrgBr", data["charge_bearer"])

    # 3. Charges Information (ChrgsInf) - High relevance for Fraud Detection
    # Unusual charge distributions can be a signal of money laundering
    chrgs = child(tx_inf, "ChrgsInf")
    child(chrgs, "Amt", f"{data['charge_amount']:.2f}", Ccy=data["currency"])
    add_agent_block(chrgs, "Agt", data["dbtr_bic"], data["dbtr_bank_name"])

    # 4. Parties and agents, in schema order: Dbtr, DbtrAgt, CdtrAgt, Cdtr
    dbtr = child(tx_inf, "Dbtr")
    child(dbtr, "Nm", data["dbtr_name"])
    add_agent_block(tx_inf, "DbtrAgt", data["dbtr_bic"], data["dbtr_bank_name"])
    add_agent_block(tx_inf, "CdtrAgt", data["cdtr_bic"], data["cdtr_bank_name"])
    cdtr = child(tx_inf, "Cdtr")
    child(cdtr, "Nm", data["cdtr_name"])

    return root

def build_pacs009_message(data):
    # Notice we reuse the same 'data' dictionary but
    # extract different synthetic features
    root = new_document("urn:iso:std:iso:20022:tech:xsd:pacs.009.001.08")
    body = child(root, "FICdtTrf")

    # Reuse the same synthetic MsgId and Amount from the pacs.008
    # to maintain "Inter-message Integrity"
    grp_hdr = child(body, "GrpHdr")
    child(grp_hdr, "MsgId", data["msg_id"])
    child(grp_hdr, "CreDtTm", data["timestamp"])
    child(grp_hdr, "NbOfTxs", "1")
    sttlm_inf = child(grp_hdr, "SttlmInf")
    child(sttlm_inf, "SttlmMtd", data["settlement_method"])

    # pacs.009 focuses on the FI-to-FI settlement
    tx_inf = child(body, "CdtTrfTxInf")
    pmt_id = child(tx_inf, "PmtId")
    child(pmt_id, "EndToEndId", data["e2e_id"])
    child(pmt_id, "TxId", data["tx_id"])
    child(tx_inf, "IntrBkSttlmAmt", f"{data['amount']:.2f}", Ccy=data["currency"])

    # We reuse our helper function to keep the bank identities consistent
    add_agent_block(tx_inf, "InstgAgt", data["dbtr_bic"], data["dbtr_bank_name"])
    add_agent_block(tx_inf, "InstdAgt", data["cdtr_bic"], data["cdtr_bank_name"])

    # In pacs.009 the debtor and creditor are themselves financial institutions.
    # Reusing the agent BICs here is a simplifying assumption.
    add_agent_block(tx_inf, "Dbtr", data["dbtr_bic"], data["dbtr_bank_name"])
    add_agent_block(tx_inf, "Cdtr", data["cdtr_bic"], data["cdtr_bank_name"])

    return root
  

In practice, separate builders are typically used for group headers, settlement details, party identification, and agent blocks. This modularity allows reuse across pacs.008 and pacs.009.

5. Schema Validation Loop

Every generated message must be validated against the official ISO 20022 XSD before being released.

    Python
   
def validate_message(xml_element, schema):
    # Returns True or False; details of any failure are in schema.error_log.
    xml_doc = etree.ElementTree(xml_element)
    return schema.validate(xml_doc)

A typical pattern is:

Generate message.
Validate against XSD.
Log validation failures.
Regenerate only the failing subtree.

This feedback loop improves quality while keeping the generation efficient.
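The loop can be sketched as follows. This is a toy under stated assumptions: the inline schema stands in for an official ISO 20022 XSD, `generate_candidate` is a hypothetical generator that sometimes produces an invalid amount, and the whole message is regenerated on failure rather than only the failing subtree.

```python
import random
from lxml import etree

# Toy schema standing in for an official ISO 20022 XSD (assumption for this sketch).
TOY_XSD = b"""<xs:schema xmlns:xs="http://www.w3.org/2001/XMLSchema">
  <xs:element name="Pmt">
    <xs:complexType>
      <xs:sequence>
        <xs:element name="Amt">
          <xs:simpleType>
            <xs:restriction base="xs:decimal">
              <xs:minExclusive value="0"/>
            </xs:restriction>
          </xs:simpleType>
        </xs:element>
      </xs:sequence>
    </xs:complexType>
  </xs:element>
</xs:schema>"""
schema = etree.XMLSchema(etree.XML(TOY_XSD))


def generate_candidate(rng):
    """Hypothetical generator that sometimes emits a non-positive amount."""
    root = etree.Element("Pmt")
    etree.SubElement(root, "Amt").text = f"{rng.gauss(100, 150):.2f}"
    return root


def generate_valid(rng, max_attempts=20):
    """Generate, validate, record failures, and retry until the schema accepts."""
    failures = []
    for _ in range(max_attempts):
        candidate = generate_candidate(rng)
        if schema.validate(candidate):
            return candidate, failures
        # error_log holds the messages from the most recent validate() call.
        failures.extend(str(err.message) for err in schema.error_log)
    raise RuntimeError(f"no valid message after {max_attempts} attempts")


message, failures = generate_valid(random.Random(3))
```

The collected `failures` are worth keeping: a recurring error message usually points to a generator rule that needs fixing, not to bad luck in sampling.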

Why This Matters for Fraud Detection

By enforcing schema correctness, preserving behavioral signals, and validating every message, financial institutions can share synthetic ISO 20022 data that is:

Suitable for supervised and unsupervised ML
Safe to exchange across organizational boundaries
Representative of real payment flows

This enables fraud detection models to learn cross-institution patterns that would otherwise remain hidden in siloed datasets.

Final Wrap-Up and Next Steps

As ISO 20022 adoption accelerates, the need for realistic, privacy-safe payment data will only grow. Synthetic pacs.008 and pacs.009 generation offers a practical way to support testing, analytics, and fraud model training without exposing sensitive transaction data.

Looking ahead, this approach can be extended by:

Sharing learned distributions or model parameters rather than data
Training federated fraud detection models across institutions
Iteratively refining synthetic generators based on ML performance
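As a small illustration of the first extension, sharing parameters instead of data, the sketch below pools per-institution summary statistics into one combined mean and variance. The function name and the two example institutions are hypothetical. The variances are treated as population variances, and real deployments would likely add noise or secure aggregation on top.

```python
def pool_gaussian_stats(local_stats):
    """Combine per-institution (count, mean, variance) tuples into pooled values."""
    total = sum(n for n, _, _ in local_stats)
    pooled_mean = sum(n * mean for n, mean, _ in local_stats) / total
    # Law of total variance: within-group variance plus between-group spread.
    pooled_var = sum(
        n * (var + (mean - pooled_mean) ** 2) for n, mean, var in local_stats
    ) / total
    return total, pooled_mean, pooled_var


# Two hypothetical institutions reporting only summary statistics.
stats = [(100, 2500.0, 900.0 ** 2), (300, 2100.0, 800.0 ** 2)]
total, mean, var = pool_gaussian_stats(stats)
print(total, mean, var)
# 400 2200.0 712500.0
```

No transaction record crosses an institutional boundary here; only three numbers per participant are exchanged.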

While this article focused on implementation mechanics, the same principles apply across ISO 20022 message types and regulated data domains. When engineered correctly, synthetic data can serve as a safe foundation for building smarter, more resilient fraud detection systems.

Opinions expressed by DZone contributors are their own.
