Streaming vs In-Memory DataWeave: Designing for 1M+ Records Without Crashing
MuleSoft’s default in-memory DataWeave can’t handle million-record files. Streaming solves this by processing data efficiently without OutOfMemory errors.
The Real Problem With Scaling DataWeave
MuleSoft is built to handle enterprise integrations — but most developers test with small payloads. Everything looks fine in dev, until one day a real file with 1 million records hits your flow. Suddenly, your worker crashes with an OutOfMemoryError, and the job fails halfway through.
The truth is, DataWeave by default works in-memory. That’s acceptable for small datasets, but in production, we often deal with:
- Banking: daily ACH transaction exports, sometimes hundreds of MBs.
- Healthcare: claims data with millions of rows and deeply nested fields.
- Retail: product catalogs or clickstream logs from thousands of stores.
If you’re not designing for streaming, your flow will eventually hit the wall.
In-Memory vs. Streaming — What’s Actually Happening?
In-Memory (Default Behavior)
- Mule loads the entire payload into memory before transformations.
- Fast when the payload is small (<50k records).
- Breaks down once files grow into hundreds of MBs or GBs.
Think of it like opening a giant Excel file. It works fine with a few thousand rows, but try opening 2 million, and Excel freezes.
If a file is 1 GB, Mule will attempt to hold that 1 GB in memory, plus the transformed copy. That’s a recipe for a crash.
Streaming (The Right Way for Big Data)
- Mule reads the file record by record (or in small chunks).
- Each record is transformed and discarded before moving to the next.
- Memory usage stays flat and predictable.
Think of it as a conveyor belt — records come in, get processed, and move out. You never hold the entire dataset at once.
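%dw 2.0
output application/json
// Read the file with the reader's streaming property enabled.
var file = readUrl("classpath://data.json", "application/json", {streaming: true})
---
// Keep only the fields needed downstream.
file pluck ((value, key) -> {
    id: value.id,
    amount: value.amount
})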
This approach safely scales to millions of records.
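Reading the input as a stream is only half of the picture, because the output can still be built up in memory before it is handed on. The sketch below adds the `deferred=true` writer property, which asks DataWeave to produce the output as a stream as it is consumed. It assumes `data.json` holds a top-level JSON array, which is why `map` is used here; substitute the expression from the example above if your input has a different shape.

```dataweave
%dw 2.0
// deferred=true: generate the output as a stream and defer execution
// until a downstream component consumes it.
output application/json deferred=true
// Assumption: data.json is a top-level JSON array of records.
var records = readUrl("classpath://data.json", "application/json", {streaming: true})
---
records map ((record) -> {
    id: record.id,
    amount: record.amount
})
```

One consequence worth testing for: with deferred output, a bad record may only raise an error when the consumer reads that part of the stream, not when the transform component runs.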
Why Use pluck Instead of map?
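Both map and pluck transform collections, but they operate on different inputs, which affects how you use them with streamed data.
- map – iterates over an array and returns a new array of results. If that result is materialized instead of being written out as a stream, 1M records means 1M transformed objects on the heap.
- pluck – iterates over an object's key-value pairs and returns an array built from them.
In the flows described here, pluck was the safer option for massive CSV or JSON datasets. Either way, confirm the behavior with production-scale data, because memory use also depends on how the input is read and how the output is consumed.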
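To make the two functions concrete, here is a minimal sketch on small inline data. As DataWeave documents them, `map` takes an array and `pluck` takes an object and exposes each key alongside its value. The variable names and sample values (`rows`, `balances`) are hypothetical.

```dataweave
%dw 2.0
output application/json
// Hypothetical sample data: an array of rows and a keyed object.
var rows = [
    {id: 1, amount: 10.5},
    {id: 2, amount: 20.0}
]
var balances = {checking: 100, savings: 250}
---
{
    // map: one output element per array element.
    fromArray: rows map ((row, index) -> {
        id: row.id,
        amount: row.amount
    }),
    // pluck: one output element per key-value pair of the object.
    fromObject: balances pluck ((value, key, index) -> {
        account: key as String,
        balance: value
    })
}
```

Both branches return arrays of two objects. Which function applies to a real payload depends on whether the reader hands you an array (typical for CSV rows or a top-level JSON array) or an object.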
Performance Comparison: 1.2M Records on 0.2 vCore Worker
In-Memory
- Memory usage: Scales with file size, spikes near heap limits.
- Processing time: ~8–10 minutes.
- Notes: Prone to crashes on large payloads.
Streaming (streaming: true)
- Memory usage: Flat, ~300MB steady.
- Processing time: ~10–12 minutes.
- Notes: Stable, no crashes.
Takeaway
In-memory seems faster at a small scale, but streaming wins in real-world production because stability always beats small performance gains.
Real Case: Processing ACH Transactions
At a credit union, we processed 1.2M ACH transactions daily from a Symitar core system.
- In-memory: A 0.2 vCore worker (500MB heap) crashed midway.
- Streaming: The same worker processed the entire dataset without issues.
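%dw 2.0
output application/json
// Read the CSV export with the reader's streaming property enabled.
var file = readUrl("classpath://ACH_Transactions.csv", "application/csv", {streaming: true})
---
file pluck ((txn, i) -> {
    transactionId: txn.txnId,
    amount: txn.amount,
    // Parse the MM/dd/yyyy text into a Date, then write it back as a String.
    date: (txn.date as Date {format: "MM/dd/yyyy"}) as String
})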
This ran smoothly in production — no heap errors, no worker restarts.
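The date handling in that script is easy to misread, so here is a small standalone sketch of the same conversion. The sample value `raw` is hypothetical. The first field mirrors the script above and relies on the default String form of a Date; the second spells out the output pattern explicitly, which is a reasonable choice if downstream systems expect a fixed format.

```dataweave
%dw 2.0
output application/json
// Hypothetical sample value in the source system's MM/dd/yyyy layout.
var raw = "03/15/2024"
---
{
    // Same conversion as in the ACH script: parse, then default String form.
    defaultForm: (raw as Date {format: "MM/dd/yyyy"}) as String,
    // Explicit output pattern, so the result does not depend on defaults.
    explicitForm: (raw as Date {format: "MM/dd/yyyy"}) as String {format: "yyyy-MM-dd"}
}
```

Both fields should come out as 2024-03-15, since the default String form of a Date is the ISO layout.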
Common Mistakes That Break Streaming
Even seasoned developers sometimes disable streaming unintentionally:
- Assigning the payload to a variable (var bigData = payload) forces in-memory storage.
- Using map instead of pluck for huge collections.
- Forgetting to test with production-scale data and relying only on 1k-row samples.
- Adding deep nesting in transformations without early filters, which multiplies memory use.
Avoiding these mistakes can save hours of debugging and prevent costly production outages.
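A related habit that helps keep a script stream-friendly is to walk the data once, in order, instead of referring to it several times. The sketch below computes a record count and a total in a single `reduce` pass. The inline `txns` array is hypothetical sample data standing in for a large streamed payload, and the field names follow the ACH example.

```dataweave
%dw 2.0
output application/json
// Hypothetical inline sample standing in for a large streamed payload.
var txns = [
    {txnId: "T1", amount: 125.50},
    {txnId: "T2", amount: 40.00},
    {txnId: "T3", amount: 9.50}
]
---
// One sequential pass produces both values, instead of calling
// sizeOf(txns) and sum(txns.amount), which would each walk the data.
txns reduce ((txn, acc = {count: 0, total: 0}) -> {
    count: acc.count + 1,
    total: acc.total + txn.amount
})
```

For this sample the result is a single object with a count of 3 and a total of 175. On a real stream, the point is that the accumulator stays small no matter how many records pass through.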
Why This Matters for Enterprises
At enterprise scale, memory errors aren’t just technical headaches — they turn into business risks.
In banking, a failed ACH batch can delay payroll for thousands of employees. In healthcare, a rejected claims file can stall reimbursements for hospitals. Every retry not only burns time but also wastes compute resources, driving up CloudHub costs.
By enabling streaming from the start, enterprises minimize downtime, avoid SLA penalties, and reduce operational costs. This small design decision translates into measurable savings and customer trust.
Performance Tuning Tips for Large Data Sets
Streaming alone isn’t a silver bullet. To get the most out of it:
- Right-size workers: A 0.2 vCore can handle millions of rows with streaming, but complex joins may need 1 vCore.
- Filter early: Shrink datasets upfront before expensive transformations.
- Batch where possible: Split very large files into smaller chunks for parallelism.
- Use smart logging: Log samples, not every record, to prevent disk bloat.
- Property-driven configs: Parameterize {streaming: true} to avoid accidental overrides.
- Automated testing: Add performance tests with large payloads in CI/CD pipelines to catch scaling issues early.
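The "filter early" tip can be sketched as follows. The `status` field and its `SETTLED` value are assumptions made for illustration; the idea is simply that records are discarded before the more expensive mapping step runs, so fewer objects are ever built.

```dataweave
%dw 2.0
output application/json
// Hypothetical sample records; the status field is an assumption.
var txns = [
    {txnId: "T1", amount: 125.50, status: "SETTLED"},
    {txnId: "T2", amount: 40.00, status: "PENDING"},
    {txnId: "T3", amount: 9.50, status: "SETTLED"}
]
---
txns
    // Drop unwanted records first ...
    filter ((txn) -> txn.status == "SETTLED")
    // ... then reshape only what is left.
    map ((txn) -> {
        transactionId: txn.txnId,
        amount: txn.amount
    })
```

Here only T1 and T3 reach the mapping step. The same ordering matters more as the per-record transformation gets heavier, for example with lookups or nested structures.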
Future-Proofing With Event-Driven Architectures
Streaming aligns perfectly with modern event-driven patterns. Whether MuleSoft consumes from Kafka, Azure Event Hub, or AWS Kinesis, the principle is the same: don’t load everything into memory.
As organizations adopt real-time analytics and data pipelines, MuleSoft’s streaming capabilities become the bridge between batch-oriented flat files and event-driven systems. Building with streaming today sets the stage for tomorrow’s real-time enterprise integrations.
Wrapping Up
MuleSoft’s DataWeave is powerful, but its default in-memory mode wasn’t designed for 1M+ record datasets. To build resilient, production-ready flows:
- Enable streaming for large files.
- Prefer pluck over map.
- Avoid holding payloads in variables.
- Test with real-world datasets, not just dev samples.
It’s a small design choice that makes the difference between a flow that fails in production and one that scales effortlessly to millions of records every single day.