Ten Years of Beam: From Google's Dataflow Paper to 4 Trillion Events at LinkedIn

Apache Beam turns ten. From Google's 2015 Dataflow paper to 4 trillion daily events at LinkedIn — what it got right, where it falls short, and what comes next.

Abgar Simonean

May. 14, 26 · Analysis

Likes (0)

Comment

Save

1.7K Views

In August 2015, a team of engineers at Google published a paper with a title so long it barely fits on a conference slide: "The Dataflow Model: A Practical Approach to Balancing Correctness, Latency, and Cost in Massive-Scale, Unbounded, Out-of-Order Data Processing." The opening line was:

We as a field must stop trying to groom unbounded datasets into finite pools of information that eventually become complete.

Ten years later, the programming model born from that paper — Apache Beam — processes 4 trillion events daily at LinkedIn alone, powers fraud detection at Transmit Security, runs the cybersecurity backbone at Palo Alto Networks, and handles a large chunk of Google Cloud's data infrastructure through Dataflow.

But Beam's story is not a straight line from academic paper to industry dominance. It is better described as a story of ideas that were ahead of their time, engineering trade-offs that still generate debate, and an abstraction layer whose costs and benefits became fully clear years after its inception.

The Lineage: MapReduce, FlumeJava, MillWheel

Beam did not appear from nothing. It descends from three internal Google systems, each solving a different piece of the data processing puzzle.

MapReduce, introduced in a 2004 paper, described the mental model: Split work across machines, process in parallel, and combine the results. Hadoop took that idea open-source and launched a decade of big data infrastructure. But MapReduce was batch-only, so it assumed your data had a beginning and an end.

FlumeJava (2010) raised the abstraction. Instead of thinking in terms of maps and reducing steps, engineers described pipelines of transformations on collections. The system handled optimization and parallelization, so engineers had more focus on the domain problem at hand, and thus it made batch pipelines readable and composable.

MillWheel (2013) tackled streaming. It processed events one at a time, maintained state, and handled exactly-once semantics at Google's scale, but it was a separate system with a separate programming model. If you wanted to run your pipeline logic in both batch and streaming, you would maintain two codebases.

This was a problem: two codebases meant two mental models and, inevitably, two sets of bugs. The 2015 Dataflow paper proposed the fix: Treat batch as a special case of streaming, not the other way around. Bounded data is just unbounded data that happens to end. This sounds obvious in retrospect, but at the time, it was a big shift.

The Donation and the Incubator

In January 2016, Google and partners — Cloudera contributed a Spark runner, dataArtisans (now Ververica) contributed a Flink runner, and Talend joined the effort — donated the Cloud Dataflow SDKs to the Apache Software Foundation. The project entered the Apache Incubator under the name Beam, a portmanteau of Batch and strEAM.

The incubation was fast. By December 2016, Beam graduated to a top-level Apache project. The numbers from the graduation assessment tell a story: out of roughly 22 major modules in the codebase, at least 10 had been developed from scratch by the community with minimal Google contribution. No single organization held more than 50% of unique monthly contributors. A perfect example of open source done right.

The first stable release, version 2.0.0, came in May 2017. At that point, Beam was in production use at Google Cloud, PayPal, and Talend. Five runners were officially supported. The programming model had proven itself inside Google for over a decade; now it had the opportunity to prove itself everywhere else.

What Beam Got Right

Three core design decisions have held up over the past ten years. They are worth examining because they explain why Beam survived in a market crowded with alternatives.

Batch Is a Special Case of Streaming

The Dataflow paper's central insight was that the same four questions apply to all data processing: What results are being computed? Where in event time are results grouped? When in processing time are results materialized? How do refinements of results relate?

This framework — what, where, when, how — turned out to be general enough to express everything from a simple MapReduce job to complex session-windowed streaming aggregations. It meant LinkedIn could write one pipeline and run it in batch mode on Spark for backfills and in streaming mode on Samza for real-time processing. When they did this, their backfill duration dropped from seven hours to 25 minutes, and memory consumption was cut in half.

Runner Abstraction

Beam pipelines do not execute directly. They compile to a runner — Dataflow, Flink, Spark, Samza, or others — which handles the actual distributed execution. At the time, this was a controversial choice - it meant Beam is always an abstraction over something else, and abstractions have overhead. But in retrospect, the trade-off has aged well.

Ricardo, Switzerland's largest online marketplace, built Beam pipelines on a self-managed Flink cluster in their data center. When they migrated to Google Cloud, they switched to the Dataflow runner without rewriting pipeline code. It saved them months of engineering work.

Palo Alto Networks runs its cybersecurity platform on Beam with both the Dataflow runner (on GCP) and Flink (on AWS). In their own words:

"With the right abstraction we have the flexibility to run workloads where needed. Thanks to Beam, we are not locked to any vendor."

Windowing and watermarks as First-Class Concepts

Most streaming frameworks bolted on windowing support after the fact. Conveniently fixed windows, sliding windows, session windows, and custom window functions are all part of the Beam core model. Watermarks — heuristic estimates of how far behind your data might be — are a foundational mechanism.

In practice, this matters a lot. For example, at LinkedIn, the anti-abuse platform uses Beam's windowing to aggregate user activity signals in real-time, reducing the time to label abusive behavior from days to minutes. At Palo Alto Networks, sub-second windowing over hundreds of billions of security events per day makes the difference between catching an intrusion and missing it.

The GCP Angle: Where Beam and Dataflow Reinforce Each Other

Beam's relationship with Google Cloud Platform deserves specific examination because it illustrates both the strengths and the tensions of the project.

Dataflow is the only fully managed, serverless runner for Beam. With Dataflow, you do not provision clusters, nor do you tune executor memory. You write a Beam pipeline, pass --runner=DataflowRunner in your options, and the service handles autoscaling, fault tolerance, and monitoring. For teams already invested in GCP — using Pub/Sub for messaging, BigQuery for analytics, Cloud Storage for data lakes — the integration is seamless.

Google recently introduced Managed I/O for Dataflow, which automatically upgrades your Beam I/O connectors to the latest vetted version during job submission. If a critical bug fix lands in the Beam Kafka connector, Dataflow will pick it up without you changing a line of code, as of writing this blog post no self-managed Flink or Spark cluster can offer this.

The pattern I've seen work especially well in my experience: Pub/Sub → Dataflow (Beam) → BigQuery.

You can read from BigQuery in batch mode for historical backfills using ReadFromBigQuery with a SQL query, or read from Pub/Sub in streaming mode for real-time ingestion. Google published a codelab in 2025 showing Beam pipelines running Gemini model inference through Dataflow's RunInference API, with results written to BigQuery. The data processing layer and the ML inference layer are the same pipeline.

There is, however, tension here: the more you depend on Managed I/O and Dataflow-specific optimizations, the less portable your pipeline becomes in practice. You are using an abstraction layer designed for portability while building on features unique to one runner. This is not necessarily wrong; it might be the right engineering choice for your team, but you should make it with open eyes.

What Beam Got Wrong, or at Least Has Not Fixed

I believe that honesty about a technology's weaknesses is more useful than cheerleading, and Beam has real gaps.

Performance Overhead

The runner abstraction adds a translation layer between your code and execution. Benchmarks published by Beside the Park in September 2025 measured Java on Beam's Portable Runner at up to 2x slower than Classic Runners. The Portable Runner enables cross-language pipelines — a Python transform talking to a Java transform in the same pipeline, but if your entire pipeline is Java, you are paying for portability you do not use. Classic Runners (available for JVM languages) perform better, but the gap between Beam-on-Flink and native Flink is still nonzero.

Debugging Complexity

When a Beam pipeline fails on Dataflow, you are debugging through two layers: Beam's SDK-level logic and the runner's execution translation. When something goes wrong with BigQuery writes, for example, errors surface through Beam's FailedRows side output — a well-designed pattern, but one that adds indirection. When it is 2 AM, and your pipeline is stuck, every layer between you and the root cause adds minutes and is not fun in general.

Ecosystem Size Relative to Spark

Spark has a vastly larger community, more Stack Overflow answers, more blog posts, more hiring candidates, and more mature notebook-based tooling (Jupyter, Databricks). If you Google a Beam error message, you might find three relevant results. If you Google a Spark error message, you will find thirty. Now, obviously, with the introduction of LLM tools, this is not as pressing a problem as it was in 2016, for example, but this still matters for engineering teams making technology choices. A tool is only as good as the team's ability to debug and maintain it.

Beam YAML Is Promising But Unproven for Complex Workloads

Beam YAML, the no-code SDK that went stable in version 2.52, lets engineers define pipelines declaratively in YAML configuration files instead of writing SDK code. It just gained Iceberg support in March 2026. The concept is: most production pipelines are not clever, and they do not need 500 lines of Java. But the Beam blog itself acknowledged that YAML "has gained little adoption for complex ML tasks."

The Production Evidence at Scale

Here is what Beam runs today, based on published case studies:

LinkedIn: 4 trillion events daily, 3,000+ pipelines across multiple data centers. Unified streaming and batch processing through Samza and Spark runners. 2x cost optimization with anti-abuse labeling accelerated from days to minutes.

Palo Alto Networks: Hundreds of billions of security events per day. 30,000 Dataflow jobs. 15 million events per second. 4 petabytes of daily data volume. Processing costs reduced by more than 60%.

Booking.com: 1M+ queries monthly for ad bidding and performance analytics. 2 PB+ of analytical data scanned. 36x processing acceleration. 4x faster time-to-market.

Credit Karma: 5-10 TB processed daily at 5K events per second. 20,000+ ML features managed. Pipeline uptime jumped from 80% to 99%.

What the Next Decade Needs

If Beam is going to remain relevant for the next ten years, there are specific problems the community needs to address.

Close the performance gap with native runners. The abstraction tax is real, and in an era where cloud compute bills are under constant scrutiny, a 2x overhead is a hard sell for performance-sensitive workloads. The Portability Framework needs to improve, or the community needs to invest more in engine-specific optimizations within the runner implementations.

Make state management competitive with Flink. Flink's built-in state management — with fine-grained checkpointing and queryable state — is ahead of what Beam offers natively. Beam delegates state handling to the runner, which means state behavior varies depending on your execution engine. For stateful streaming applications, this inconsistency is a friction point.

Invest in Beam YAML for the 80% use case. Most data pipelines are not LinkedIn-scale streaming systems; they are extract-transform-load jobs that read from one place, apply some business rules, and write to another. If Beam YAML can become the standard way to express those pipelines — with full Managed I/O support on Dataflow and good integration with Iceberg and Kafka — it could expand Beam's reach far beyond the current community of JVM and Python SDK users.

Build better tooling for debugging and observability. The gap between Beam's pipeline abstraction and the runner's execution reality is where engineers lose hours. Better error messages, better tracing through the SDK → runner → execution boundary, and better integration with standard observability stacks (OpenTelemetry, Prometheus) would lower the operational cost of running Beam in production. On a more personal note, seeing improvements to the DirectRunner would go a long way. In my experience, the DirectRunner is where most engineers first encounter Beam, and it is also where the gap between "works locally" and "works on Dataflow" is most painful. A DirectRunner that more faithfully simulates distributed execution semantics, even at the cost of being slower, would catch entire categories of bugs before they reach a staging environment.

Conclusion

Apache Beam is not the right tool for every data pipeline. If your workload is batch-only and your team already knows Spark, switching to Beam for theoretical portability you may never exercise is a bad trade. If you need the absolute lowest latency in a streaming system and you know Flink well, native Flink will outperform Beam-on-Flink.

But for a specific and growing set of problems — unified batch and streaming with the same code, genuine multi-runner portability during cloud migrations, serverless execution on GCP via Dataflow, ML inference embedded in data pipelines — Beam is the strongest option available.

Ten years ago, a team at Google argued that unbounded data processing needed a new foundation. The model they proposed has survived contact with reality at a scale few other frameworks can claim.

Beam Summit 2026 is happening June 22–23 in New York City. If the next decade is anything like the last, the conversations there will shape how we process data for years to come.

Big data Data processing Event

Opinions expressed by DZone contributors are their own.

Related

Trending