Generating Schema-Valid Synthetic ISO 20022 Messages for Privacy-Preserving Fraud Detection

Leverage a schema-aware federated approach to generate synthetic ISO 20022 payments data with strict personal information privacy and XSD compliance.

By Senthilnathan Dhanasekaran · Jan. 28, 26 · Analysis

Modern fraud detection systems depend on machine learning models trained on large volumes of payment transaction data. The challenge is that real payment messages — especially ISO 20022 formats such as pacs.008 and pacs.009 — contain highly sensitive financial and customer information that cannot be freely shared across institutions.

This creates a structural limitation. Fraud patterns often emerge only when data is analyzed across multiple financial institutions, yet regulatory, privacy, and competitive constraints prevent raw transaction data from leaving institutional boundaries.

One practical solution is synthetic data generation: producing artificial payment messages that preserve statistical and behavioral characteristics of real transactions without exposing any sensitive or personally identifiable information. When designed correctly, synthetic ISO 20022 messages can be safely shared and used to train stronger fraud detection models.

This article presents a schema-aware, engineering-focused approach for generating synthetic pacs.008 and pacs.009 messages, inspired by architectural principles described in a recent patent on privacy-preserving, collaborative synthetic data generation. The focus here is not on payment business processes, but on how developers can implement such a system, validate it, and integrate it into machine learning workflows.

The Engineering Problem

ISO 20022 payment messages are not simple records. They are deeply nested XML documents governed by strict XSD schemas, mandatory and optional fields, and non-trivial inter-field constraints. Common challenges include:

  • Deeply nested structures with repeating elements
  • Mandatory validation against official XSDs
  • Strong dependencies between fields (amounts, currencies, settlement methods)
  • Temporal and sequencing behavior that matters for fraud detection

At the same time, fraud models require behavioral realism, not just syntactic correctness. A valuable synthetic dataset must preserve signals such as transaction velocity, amount clustering, routing patterns, and time-of-day effects.

High-Level Architecture

Each participating financial institution operates a local synthetic data generator trained on its internal transaction patterns. No raw transactions are shared externally. Instead, institutions generate schema-valid synthetic ISO 20022 messages that can be shared for analytics or model training.

At a high level:

  1. Real transaction data remains internal
  2. Local generators learn statistical distributions and correlations
  3. Synthetic pacs.008 and pacs.009 messages are generated
  4. Messages are validated against official XSDs
  5. Only synthetic outputs are shared

This approach enables collaboration while maintaining strict data privacy boundaries.
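As a rough illustration of step 2, the sketch below reduces internal transactions to per-currency amount statistics, so that only aggregates feed the generator. The function name `learn_amount_profiles`, the `(currency, amount)` input shape, and the two-observation minimum are assumptions made for this example, not part of any standard or of the patent mentioned above.

```python
from collections import defaultdict
from statistics import mean, stdev

def learn_amount_profiles(transactions):
    """Return {currency: (mean, std)} from (currency, amount) pairs.

    Only these aggregates leave the function; the raw rows stay local.
    """
    by_currency = defaultdict(list)
    for currency, amount in transactions:
        by_currency[currency].append(amount)
    return {
        ccy: (round(mean(vals), 2), round(stdev(vals), 2))
        for ccy, vals in by_currency.items()
        if len(vals) >= 2  # stdev needs at least two observations
    }

# Made-up internal data, for illustration only
internal = [("USD", 100.0), ("USD", 300.0), ("EUR", 50.0), ("EUR", 70.0), ("GBP", 10.0)]
profiles = learn_amount_profiles(internal)
```

A production system would likely require a much larger minimum group size, since aggregates over a handful of transactions can still reveal information about individual payments.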

Implementation Details

1. Parsing ISO 20022 XSD Schemas

The foundation of any ISO 20022 generator is the official XSD schema. Hard-coding message templates is brittle and error-prone. Instead, the generator should derive structure directly from the schema.

Python
 
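from lxml import etree

def load_schema(xsd_path):
    """Compile an ISO 20022 XSD into a reusable lxml validator."""
    # Parsing from the path (rather than from raw bytes) keeps the base URL,
    # so any relative xs:import / xs:include in the schema can be resolved
    return etree.XMLSchema(etree.parse(xsd_path))

# Paths assume the official XSD files sit in the working directory
pacs008_schema = load_schema("pacs.008.001.08.xsd")
pacs009_schema = load_schema("pacs.009.001.08.xsd")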


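Compiling the schema once yields a reusable validator. Deriving element names, order, and cardinality from the same XSD, rather than from templates, keeps the generator resilient to schema evolution and lets it support multiple message versions with minimal changes.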
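To make "deriving structure from the schema" more concrete, here is a minimal sketch that reads the child elements of a named complex type straight from an XSD. The inline `TOY_XSD` is a hand-written miniature for illustration only, not the real pacs.008 schema, and `describe_type` is a hypothetical helper name. It also assumes a flat `xs:sequence`; real ISO 20022 types additionally use `xs:choice` and nested type references.

```python
from lxml import etree

XS = "http://www.w3.org/2001/XMLSchema"

# Toy schema for illustration only; it is NOT the real pacs.008 XSD
TOY_XSD = b"""<xs:schema xmlns:xs="http://www.w3.org/2001/XMLSchema">
  <xs:complexType name="GroupHeader">
    <xs:sequence>
      <xs:element name="MsgId" type="xs:string"/>
      <xs:element name="CreDtTm" type="xs:dateTime"/>
      <xs:element name="BtchBookg" type="xs:boolean" minOccurs="0"/>
      <xs:element name="NbOfTxs" type="xs:string"/>
    </xs:sequence>
  </xs:complexType>
</xs:schema>"""

def describe_type(schema_root, type_name):
    """List (name, type, mandatory) for each child of a complexType, in sequence order."""
    path = (
        f"{{{XS}}}complexType[@name='{type_name}']"
        f"/{{{XS}}}sequence/{{{XS}}}element"
    )
    return [
        # minOccurs defaults to 1 in XSD, so a missing attribute means mandatory
        (el.get("name"), el.get("type"), el.get("minOccurs", "1") != "0")
        for el in schema_root.findall(path)
    ]

fields = describe_type(etree.fromstring(TOY_XSD), "GroupHeader")
```

A generator can iterate over such a list to emit children in schema order and to decide which optional elements to include.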

2. Constraint-Aware Field Generation

ISO 20022 messages contain interdependent fields. Generating fields independently leads to invalid or unrealistic payment messages. Examples of common business constraints include:

  • pacs.009 omits the customer elements present in pacs.008, because it carries FI-to-FI payments
  • The settlement date must not precede the message creation time
  • The settlement amount must be expressed in the settlement currency
  • Charge bearer values must come from the schema's allowed enumeration

A robust generator explicitly enforces these business rules:

Python
 
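import random

def generate_settlement_amount(currency):
    # Illustrative (mean, standard deviation) per currency; in practice these
    # would be learned from the institution's own transactions
    profiles = {
        "USD": (2500, 900),
        "EUR": (2100, 800)
    }
    # Unknown currencies fall back to a generic profile
    mean, std = profiles.get(currency, (1500, 600))
    # Clamp to the smallest positive amount, since a Gaussian can go negative
    value = max(0.01, random.gauss(mean, std))
    return round(value, 2)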


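Using distributions rather than uniform randomness helps preserve fraud-relevant behavior such as amount clustering. A Gaussian has thin tails, though, so reproducing high-value outliers generally calls for a heavier-tailed distribution such as a log-normal.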
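The constraints listed earlier can be enforced by generating dependent fields together instead of one at a time. The following is a minimal sketch under stated assumptions: the function name, the dictionary keys, the log-normal parameters, and the zero-to-two-day settlement window are all illustrative choices. The charge bearer codes are the values of the ISO 20022 `ChargeBearerType1Code` enumeration, and the decimal places follow ISO 4217.

```python
import random
from datetime import datetime, timedelta

CHARGE_BEARERS = ("DEBT", "CRED", "SHAR", "SLEV")  # ChargeBearerType1Code values
MINOR_UNITS = {"USD": 2, "EUR": 2, "JPY": 0}       # ISO 4217 decimal places

def generate_constrained_fields(currency, created_at, rng=random):
    """Generate interdependent fields together so the constraints hold by construction."""
    decimals = MINOR_UNITS.get(currency, 2)
    # Log-normal is an assumed choice: always positive, with a long right tail
    amount = max(round(rng.lognormvariate(7.5, 0.8), decimals), 10 ** -decimals)
    # Settle on the creation date or up to two days later, never earlier
    settlement_date = created_at.date() + timedelta(days=rng.randint(0, 2))
    return {
        "currency": currency,
        "amount": f"{amount:.{decimals}f}",
        "timestamp": created_at.isoformat(),
        "settlement_date": settlement_date.isoformat(),
        "charge_bearer": rng.choice(CHARGE_BEARERS),
    }

rng = random.Random(0)  # seeded for reproducible output
jpy = generate_constrained_fields("JPY", datetime(2026, 1, 28, 23, 30), rng)
usd = generate_constrained_fields("USD", datetime(2026, 1, 28, 9, 0), rng)
```

Passing a seeded `random.Random` instance makes a generated dataset reproducible, which helps when a validation failure needs to be replayed.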

3. Preserving Fraud-Relevant Signals

Synthetic data is only useful for fraud detection if it preserves behavioral patterns, not just schema validity. Important signals include:

  • Transaction bursts and velocity
  • Time-of-day and day-of-week effects
  • Repeated routing paths
  • Consistent rounding behavior

For example, timestamps should be generated using correlated windows rather than random values.

Python
 
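import random
from datetime import datetime, timedelta

def generate_timestamp(base_time, jitter_minutes=15):
    """Return an ISO 8601 timestamp within +/- jitter_minutes of base_time."""
    # Anchoring on base_time keeps related transactions close together in time
    offset = random.randint(-jitter_minutes, jitter_minutes)
    return (base_time + timedelta(minutes=offset)).isoformat()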


This enables the creation of synthetic transaction sequences, which are far more valuable for machine learning than isolated records.
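One way to produce such sequences is to draw the gaps between consecutive transactions from an exponential distribution, which yields the irregular clustering typical of bursts. This is a sketch rather than a calibrated model: `generate_burst` and its `mean_gap_seconds` parameter are hypothetical, and a real generator would fit the gap distribution to observed velocity.

```python
import random
from datetime import datetime, timedelta

def generate_burst(start, count, mean_gap_seconds=20.0, rng=random):
    """Return `count` ISO timestamps whose gaps are exponentially distributed."""
    timestamps, current = [], start
    for _ in range(count):
        timestamps.append(current.isoformat(timespec="seconds"))
        # expovariate takes the rate (1 / mean), and its result is never negative
        current += timedelta(seconds=rng.expovariate(1.0 / mean_gap_seconds))
    return timestamps

burst = generate_burst(datetime(2026, 1, 28, 2, 0, 0), count=5, rng=random.Random(7))
```

Because every gap is non-negative, the timestamps come out in chronological order, so the sequence can be used directly as the `CreDtTm` values of consecutive messages.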

4. Message Assembly

Once fields are generated, the message is assembled using builder components that mirror production payment engines. To maintain scalability and reuse across multiple message types, we implement a modular builder pattern. This ensures that the intricate Agent and Party blocks can be generated consistently across both pacs.008 and pacs.009:

Python
 
from lxml import etree

PACS008_NS = "urn:iso:std:iso:20022:tech:xsd:pacs.008.001.08"
PACS009_NS = "urn:iso:std:iso:20022:tech:xsd:pacs.009.001.08"

def new_document(namespace):
    # Declare the namespace through nsmap and qualify every tag, so the
    # in-memory tree (not just its serialization) is in the ISO 20022 namespace
    return etree.Element(f"{{{namespace}}}Document", nsmap={None: namespace})

def sub(parent, tag, text=None, **attrs):
    """Append a child in the parent's namespace; XML attributes stay unqualified."""
    ns = etree.QName(parent).namespace
    child = etree.SubElement(parent, f"{{{ns}}}{tag}", **attrs)
    if text is not None:
        child.text = text
    return child

def add_agent_block(parent, element_name, bic, name):
    """Helper to build Financial Institution blocks (e.g., DbtrAgt, CdtrAgt)"""
    fin_instn = sub(sub(parent, element_name), "FinInstnId")
    sub(fin_instn, "BICFI", bic)
    sub(fin_instn, "Nm", name)

def add_group_header(body, data):
    """Group Header (GrpHdr), shared by pacs.008 and pacs.009"""
    grp_hdr = sub(body, "GrpHdr")
    sub(grp_hdr, "MsgId", data["msg_id"])
    sub(grp_hdr, "CreDtTm", data["timestamp"])
    sub(grp_hdr, "NbOfTxs", "1")
    sttlm_inf = sub(grp_hdr, "SttlmInf")
    sub(sttlm_inf, "SttlmMtd", data["settlement_method"])  # e.g., 'CLRG' or 'INDA'

def add_payment_id_and_amount(tx_inf, data):
    """Payment ID followed by the interbank settlement amount"""
    pmt_id = sub(tx_inf, "PmtId")
    sub(pmt_id, "EndToEndId", data["e2e_id"])
    sub(pmt_id, "TxId", data["tx_id"])
    # Amount & Currency (the currency is an XML attribute)
    sub(tx_inf, "IntrBkSttlmAmt", f"{data['amount']:.2f}", Ccy=data["currency"])

def build_pacs008_message(data):
    root = new_document(PACS008_NS)
    body = sub(root, "FIToFICstmrCdtTrf")
    add_group_header(body, data)

    # Credit Transfer Transaction Information (CdtTrfTxInf).
    # XSD sequences are ordered, so children are appended in schema order;
    # verify the order against the XSD version in use.
    tx_inf = sub(body, "CdtTrfTxInf")
    add_payment_id_and_amount(tx_inf, data)

    # Charge bearer is mandatory in pacs.008. "charge_bearer" is an assumed
    # key in `data`, defaulting to SHAR.
    sub(tx_inf, "ChrgBr", data.get("charge_bearer", "SHAR"))

    # Charges Information (ChrgsInf) - high relevance for fraud detection:
    # unusual charge distributions can be a signal of money laundering
    chrgs = sub(tx_inf, "ChrgsInf")
    sub(chrgs, "Amt", f"{data['charge_amount']:.2f}", Ccy=data["currency"])
    add_agent_block(chrgs, "Agt", data["dbtr_bic"], data["dbtr_bank_name"])

    # Debtor, the two agents (financial institutions), then creditor
    sub(sub(tx_inf, "Dbtr"), "Nm", data["dbtr_name"])
    add_agent_block(tx_inf, "DbtrAgt", data["dbtr_bic"], data["dbtr_bank_name"])
    add_agent_block(tx_inf, "CdtrAgt", data["cdtr_bic"], data["cdtr_bank_name"])
    sub(sub(tx_inf, "Cdtr"), "Nm", data["cdtr_name"])

    return root

def build_pacs009_message(data):
    # Reuse the same 'data' dictionary (MsgId, amount, bank identities) as the
    # pacs.008 to maintain "inter-message integrity"
    root = new_document(PACS009_NS)
    body = sub(root, "FICdtTrf")
    add_group_header(body, data)

    # pacs.009 focuses on the FI-to-FI settlement
    tx_inf = sub(body, "CdtTrfTxInf")
    add_payment_id_and_amount(tx_inf, data)

    # Reuse the helper to keep the bank identities consistent
    add_agent_block(tx_inf, "InstgAgt", data["dbtr_bic"], data["dbtr_bank_name"])
    add_agent_block(tx_inf, "InstdAgt", data["cdtr_bic"], data["cdtr_bank_name"])

    # In pacs.009 the debtor and creditor are themselves financial institutions.
    # Mapping them to the same two banks is a simplification for this sketch.
    add_agent_block(tx_inf, "Dbtr", data["dbtr_bic"], data["dbtr_bank_name"])
    add_agent_block(tx_inf, "Cdtr", data["cdtr_bic"], data["cdtr_bank_name"])

    return root


In practice, separate builders are typically used for group headers, settlement details, party identification, and agent blocks. This modularity allows reuse across pacs.008 and pacs.009.

5. Schema Validation Loop

Every generated message must be validated against the official ISO 20022 XSD before being released.

Python
 
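def validate_message(xml_element, schema):
    """Return True if the element tree conforms to the compiled XSD."""
    xml_doc = etree.ElementTree(xml_element)
    # After a failed validation, schema.error_log lists the violations
    return schema.validate(xml_doc)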


A typical pattern is:

  1. Generate message.
  2. Validate against XSD.
  3. Log validation failures.
  4. Regenerate only the failing subtree.

This feedback loop improves quality while keeping generation efficient.
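The loop can be sketched as follows. For brevity this version regenerates the whole message on failure rather than only the failing subtree, and it validates against a toy one-element schema instead of the official XSD. `generate_valid` and `toy_generator` are hypothetical names introduced for this example.

```python
from lxml import etree

# Toy schema for illustration only; a real run would load the official XSD
TOY_SCHEMA = etree.XMLSchema(etree.fromstring(b"""
<xs:schema xmlns:xs="http://www.w3.org/2001/XMLSchema">
  <xs:element name="Amt">
    <xs:simpleType>
      <xs:restriction base="xs:decimal"><xs:minExclusive value="0"/></xs:restriction>
    </xs:simpleType>
  </xs:element>
</xs:schema>"""))

def generate_valid(generate, schema, max_attempts=3):
    """Call generate(attempt) until the result validates; return (element, failures)."""
    failures = []
    for attempt in range(max_attempts):
        candidate = generate(attempt)
        if schema.validate(candidate):
            return candidate, failures
        # error_log holds the violations from the most recent validate() call
        failures.extend(str(err.message) for err in schema.error_log)
    raise ValueError(f"no valid message after {max_attempts} attempts: {failures}")

def toy_generator(attempt):
    # Hypothetical generator: the first attempt is deliberately invalid
    amt = etree.Element("Amt")
    amt.text = "-5.00" if attempt == 0 else "12.50"
    return amt

message, failures = generate_valid(toy_generator, TOY_SCHEMA)
```

The collected `failures` are worth keeping: a field that fails repeatedly usually points to a constraint the generator has not modeled yet.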

Why This Matters for Fraud Detection

By enforcing schema correctness, preserving behavioral signals, and validating every message, financial institutions can share synthetic ISO 20022 data that is:

  • Suitable for supervised and unsupervised ML
  • Safe to exchange across organizational boundaries
  • Representative of real payment flows

This enables fraud detection models to learn cross-institution patterns that would otherwise remain hidden in siloed datasets.
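Whether a synthetic dataset is representative can be checked numerically before it is shared. One simple option, offered here as an assumption rather than as part of the approach described above, is the two-sample Kolmogorov-Smirnov statistic: the largest gap between the empirical distributions of a real and a synthetic feature, where 0 means identical and 1 means fully separated. The sketch below computes it in plain Python on made-up amounts.

```python
def ks_statistic(real, synthetic):
    """Largest absolute gap between the two empirical CDFs (0 = identical, 1 = disjoint)."""
    a, b = sorted(real), sorted(synthetic)
    i = j = 0
    gap = 0.0
    while i < len(a) and j < len(b):
        x = min(a[i], b[j])
        # Advance both samples past x, then compare the fractions covered so far
        while i < len(a) and a[i] <= x:
            i += 1
        while j < len(b) and b[j] <= x:
            j += 1
        gap = max(gap, abs(i / len(a) - j / len(b)))
    return gap

# Made-up amounts, for illustration only
same = ks_statistic([10.0, 20.0, 30.0], [10.0, 20.0, 30.0])
disjoint = ks_statistic([1.0, 2.0, 3.0], [10.0, 11.0, 12.0])
```

Running the check inside the institution, on features such as amounts or inter-arrival gaps, keeps the real data local while still giving a measurable fidelity signal.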

Final Wrap-Up and Next Steps

As ISO 20022 adoption accelerates, the need for realistic, privacy-safe payment data will only grow. Synthetic pacs.008 and pacs.009 generation offers a practical way to support testing, analytics, and fraud model training without exposing sensitive transaction data.

Looking ahead, this approach can be extended by:

  • Sharing learned distributions or model parameters rather than data
  • Training federated fraud detection models across institutions
  • Iteratively refining synthetic generators based on ML performance

While this article focused on implementation mechanics, the same principles apply across ISO 20022 message types and regulated data domains. When engineered correctly, synthetic data can serve as a safe foundation for building smarter, more resilient fraud detection systems.
