Serverless Spark Isn't Always the Answer: A Case Study
Processing 500M+ records with 100 concurrent users under a 5-minute SLA demands smart architecture. We evaluate seven compute models and why hybrid approaches often win.
Join the DZone community and get the full member experience.
Join For FreeProcessing billions of records with strict latency requirements isn't a "pick your favorite database" problem. It's an architectural decision that will define system scalability, team velocity, and operational budgets for years to come.
The challenge involves multiple competing constraints:
- 500M+ input records from multiple data sources
- Complex multi-stage calculations with business rules
- Sub-5-minute processing time (p90)
- Support for 10-100 concurrent executions
- Real-time progress tracking and error recovery
- An operational overhead budget of less than 3 hours per week
The solution needs to deliver today while avoiding tomorrow's technical debt.
The Evaluation and Trade-offs
When evaluating compute engines on AWS, the options are overwhelming. Redshift, EMR, Glue, Athena, Lambda, and that's before considering third-party options or managed platforms. Here's a deep evaluation of seven distinct approaches and their real-world trade-offs.
AWS EMR Serverless
Represents the industry-standard approach to big-data processing without the overhead of cluster management. The performance case is compelling — with pre-warmed worker pools, EMR Serverless can process 500M records in 3-4 minutes. Each execution runs as an isolated Spark application, providing natural concurrency. Scaling from 1 to 100+ workers happens automatically, and the Spark UI provides stage-by-stage progress tracking, making debugging tractable.
But the hidden costs emerge at scale. Pre-warming economics don't favor sporadic workloads. Avoiding cold starts requires keeping worker pools warm, which means paying for idle capacity between executions. For usage patterns where users run jobs throughout the day (not continuously), there's a choice: pay for idle workers or accept 2-3 minute cold starts on every execution. Neither optimizes for both cost and user experience.
Cost variance creates budget uncertainty. EMR Serverless pricing is $0.052624 per vCPU per hour and $0.0057785 per GB per hour. A single job using 20 executors with 4 vCPU and 16GB each for 4 minutes costs approximately $3-4. With 100 concurrent users running jobs throughout the day, monthly costs can reach $15,000 - $ 30,000, with significant day-to-day variance depending on usage patterns. Finance teams struggle with this unpredictability.
AWS imposes soft limits on the number of concurrent EMR Serverless applications per account (default: 100). High-concurrency workloads hit this ceiling and require an increase in limits. More problematically, all applications compete for the same account-level resources, creating subtle resource contention that's difficult to debug. Each execution generates its own Spark UI, CloudWatch log streams, and metrics. When failures occur, troubleshooting requires hunting through dozens of Spark UIs to identify which stage failed in which job.
Perhaps most critically, if 80% of data volume can be pre-aggregated, distributed computing processes what could fit on a single large instance. Spark excels at distributing work across hundreds of nodes for terabyte-scale data. For 20-50M pre-aggregated records, you're paying for distributed coordination overhead without receiving the benefits. Warm pools reduce cold start from 2-3 minutes to 30-60 seconds, but that's still significant latency on every user request. When users click "Run" and see a spinner, every second counts. EMR Serverless excels at continuously processing massive datasets requiring distributed computing. For workloads where heavy lifting happens in batch preprocessing and real-time queries hit pre-aggregated data, it's over-engineering.
AWS Glue ETL
Offers Spark's power wrapped in AWS abstractions that handle operational complexity. No cluster management, native Glue Catalog integration, automatic schema detection, and job bookmarks for incremental processing. The value proposition is clear: get Spark's power without the operational burden. Glue handles the infrastructure, you write the transformations.
The problem: every Glue job has 1-2 minutes of initialization overhead depending on network traffic. For batch processing running hourly or daily, this is acceptable. For on-demand processing where a user clicks a button and waits, that's 20-40% of the latency budget consumed before processing a single record. Two minutes of unavoidable overhead on every user request makes Glue unsuitable for synchronous, user-facing workloads.
However, for batch processing, Glue is genuinely excellent. Event-driven Glue jobs triggered by S3 uploads or EventBridge events are fast, reliable, and fully serverless. The cold start matters less when jobs run 5-10 times daily based on actual data updates rather than 100+ times based on user requests. This observation becomes critical for hybrid architectures.
DuckDB
Offers an entirely different approach — an embedded analytical database that speaks pure SQL, runs in-process, requires zero infrastructure, and delivers vectorized columnar performance rivaling Spark for single-node workloads. The appeal is immediate. No cluster management — it's a single binary embedded in applications. No cold start — it initializes in milliseconds. Runs on Lambda, ECS, EC2, or laptops. For datasets that fit in memory on a single large instance, DuckDB delivers impressive performance. Sub-second queries on tens of millions of records. Vectorized execution leveraging modern CPU instructions. Direct Parquet reading from S3 without data loading.
The constraint: DuckDB cannot distribute computation across multiple machines. For 500M+ records with complex joins, very large instances are required (16+ vCPU, 64GB+ RAM). Even then, memory-intensive operations might exceed available capacity. The architectural question becomes: can the problem be structured so DuckDB's single-node limitation doesn't matter?
StarRocks
Represents the high-end option — a massively parallel processing (MPP) analytical database designed for real-time OLAP workloads with sub-second query performance. It offers MySQL-compatible SQL, horizontal scaling with the addition of backend nodes, and the capability to handle both batch loads and streaming updates. Performance benchmarks show queries that take minutes in Redshift completing in seconds in StarRocks.
The operational reality is daunting. StarRocks requires a distributed cluster with frontend nodes (query coordination), backend nodes (storage and compute), and metadata management. StarRocks isn't "deploy and forget" — it requires dedicated database administration, including 24/7 cluster monitoring, node failure handling, upgrade management, backup strategies, and performance tuning. The choice is either running the cluster continuously ($2,000+ monthly) or accepting 8-10 minute cold starts when spinning it up on-demand — which violates sub-5-minute latency requirements.
Between setup, development, testing, and operational hardening, expect 5-6 months to production. The learning curve for StarRocks-specific concepts is steep, and the community—while growing—is smaller than the Spark ecosystem. For sub-second queries on multi-billion-row tables with high-concurrency OLAP workloads, StarRocks justifies the investment. But for workloads where 3-5 minute response times suffice, it's over-engineering.
dbt
ccupies a different category—it's a transformation framework, not a compute engine. dbt organizes and orchestrates SQL transformations but runs them on an underlying database or data warehouse. dbt offers SQL-based transformations with Jinja templating, modular and testable components, automatic data lineage and documentation, and Git-based version control.
For analytics teams managing complex transformation logic, dbt brings software engineering best practices to the data warehouse. But it doesn't solve compute architecture problems. If the underlying database has concurrency issues, dbt won't fix them. If query performance is slow, dbt won't make it faste— it organizes slow queries more effectively. Excellent for transformation-layer organization, but not a solution to compute-architecture challenges.
The Insight That Changed Everything
After weeks of benchmarking and architecture discussions, a realization reframed the entire problem: not all processing needs to happen in real-time.
The problem was initially framed as monolithic computation — user clicks button, the system process 500M records, then returns results. But analyzing data sources revealed a pattern: reference and mapping tables (10M records) updated weekly with low volatility. Core metrics and aggregations (200M records) updated daily with medium volatility. while user-specific inputs and parameters (5M records) provided on-demand with high volatility.
Traditional architectures reprocess reference tables and core metrics on every user request, even though they only change once per day or week. It's like recalculating a mortgage payment from scratch every time you check your bank balance — technically correct, but wildly inefficient.
The key insight: separate batch preprocessing from real-time computation. Pre-aggregate slow-changing data into optimized "Tier 2" tables. The batch layer runs event-driven jobs when source data actually changes — 1-10 times per day instead of 100+ times. The real-time layer loads these pre-aggregated tables and only processes user-specific inputs. This architectural split transforms technology weaknesses into complementary strengths.
The Hybrid Architecture: Best of Both Worlds
The winning architecture combines two technologies, each handling what it does best.
AWS Glue ETL handles event-driven preprocessing of slow-changing data. Glue's 1-2 minute cold start doesn't matter when jobs run 5-10 times per day based on actual data updates. Its Spark engine efficiently handles 500M+ records. Job bookmarks enable incremental processing. EventBridge triggers ensure processing happens exactly when needed.
DuckDB on ECS Fargate handles on-demand user requests. DuckDB's single-node limitation doesn't matter because it processes pre-aggregated data that fits comfortably in memory. Sub-second cold start eliminates user wait time. Pure SQL reduces complexity. ECS Fargate provides serverless container execution with multi-AZ deployment for high availability.
The data flow is straightforward: source data updates trigger EventBridge events, which launch Glue ETL jobs for batch processing. These jobs write Tier 2 aggregates to S3 in Parquet format. When users make requests through API Gateway, ECS Fargate tasks running DuckDB load these pre-aggregated tables, execute business logic, and write results back to S3 for BI dashboards. Each layer handles what it's optimized for — Glue processes massive datasets with complex transformations in batch mode, while DuckDB executes fast SQL queries on pre-aggregated data for instant user feedback.
Glue handles three types of event-driven preprocessing. When mapping tables or dimensional data updates weekly, EventBridge events trigger Glue jobs that apply deduplication logic, cleanse data, and write optimized Tier 2 aggregates to S3 in Parquet format, partitioned by key dimensions. When upstream metrics complete processing daily, another Glue job aggregates data by key dimensions—regions, categories, time periods—and writes pre-aggregated results to S3, reducing 200M+ raw records to 20-50M aggregated records. When users upload input files on demand, a Glue job validates, cleanses, and explodes hierarchical data into granular records ready for real-time processing. The output is a set of optimized Tier 2 aggregate tables in S3, stored in columnar Parquet format with intelligent partitioning, forming the foundation for fast real-time queries.
When users click "Run" in the UI, the system triggers an ECS Fargate task running DuckDB. The process is remarkably simple—initialize a DuckDB connection, load pre-aggregated Tier 2 tables directly from S3 using read_parquet, execute business logic in pure SQL joining reference tables with metrics and user inputs, and write results back to S3. The entire process completes in under 2 minutes. DuckDB's vectorized execution handles tens of millions of pre-aggregated records in memory with straightforward SQL—no Spark complexity, no distributed systems concepts.
Why This Combination Works
The hybrid approach solved problems that single-technology solutions couldn't. Pre-aggregation reduces real-time processing from 500M to 20-50M records. DuckDB handles that in under 2 minutes, delivering total user-perceived latency under 5 minutes at p90, often much faster. Glue jobs run independently with native AWS auto-scaling. DuckDB instances on ECS scale horizontally—each user request gets its own container. No resource contention, no queue management, supporting 100+ concurrent executions without custom orchestration.
Glue jobs run 5-10x per day instead of 100+x. DuckDB on Fargate spins up on demand, bills per second, and shuts down after processing. No 24/7 clusters means estimated monthly costs under $500 for moderate workloads, compared to $15,000-30,000 for pure EMR Serverless approaches. Both Glue and ECS Fargate are fully serverless—no clusters to manage, no nodes to patch, no capacity planning. CloudWatch provides unified monitoring, making the operational overhead target of less than 3 hours per week achievable.
Glue Console shows job progress and metrics. DuckDB logs capture query execution details. CloudWatch ties it together. When failures occur, it's clear which stage failed and why. Partial results persist in S3 after each Glue job stage. Glue job bookmarks enable incremental processing for future requirements. The modular design means components can be swapped — replace DuckDB with StarRocks if query complexity grows, or switch to EMR if Glue becomes limiting.
When NOT to Use This Approach
This hybrid architecture isn't universally applicable. If data changes constantly in real time — streaming data that updates every second — pre-aggregation doesn't help. Consider Flink, Kinesis Analytics, or real-time OLAP databases like StarRocks or ClickHouse instead. If you need sub-second query response times and DuckDB on pre-aggregated data delivers 1-2 minute response times, you need a different architecture — probably an MPP database with hot data in memory.
If you have deep Spark expertise and want full control over tuning every partition and optimizing every shuffle, EMR gives you that control. The hybrid approach trades some performance ceiling for operational simplicity. If you're processing small datasets with fewer than 10M records, this architecture is over-engineered. A well-optimized Redshift query or even a Lambda function with pandas might be more straightforward and cheaper.
If you can't separate batch processing from real-time processing — if all data must be processed in real-time with no pre-aggregation — this approach won't work. You'll need a pure real-time architecture. If business logic requires very complex analytical SQL with complex window functions or recursive queries, while DuckDB supports these, an MPP warehouse might be more appropriate.
The Decision Framework
When facing similar architectural choices, start by analyzing data volatility. What percentage of data changes hourly vs daily vs weekly? Can you separate slow-changing reference data from fast-changing transactional data? If more than 80% of data can be pre-aggregated, batch plus real-time hybrid makes sense.
Define your real SLA. What's the actual latency requirement at p50, p90, p99? Is it a hard requirement or a goal? Can users tolerate 2-5 minutes, or do you need a sub-second response? Assess team capabilities. What does the team know today — SQL, Python, Spark, distributed systems? How much time can you invest in learning new technologies? What's the operational overhead tolerance?
Calculate cost at scale. What's the expected query volume — 10 per day, 100 per day, 1000 per day? What's the data growth trajectory over two years? What's the budget for compute — hundreds, thousands, tens of thousands monthly? Consider operational complexity. How many engineers will maintain this? What's the on-call burden tolerance? Do you need 24/7 availability or business-hours-only?
The decision tree is straightforward. If you can pre-aggregate more than 80% of the data, proceed. If not, consider EMR Serverless, StarRocks, or Snowflake. If you need less than a 1-second response time, consider StarRocks or ClickHouse if you have 6 months and a DBA team. If response times of 2-5 minutes are acceptable, the hybrid Glue plus DuckDB approach is likely optimal.
The Bottom Line
If you're processing hundreds of millions of records with strict latency requirements, resist the urge to find a single "perfect" technology. Instead, ask what data changes frequently versus rarely. Separate batch preprocessing from real-time computation — this decision alone can reduce real-time processing volume by 10x. Ask what the team already knows. Learning curves have a real cost. A "technically inferior" solution that ships in 4 weeks often beats a "technically superior" solution that takes 6 months.
Ask what the operational overhead budget is. Complexity is expensive. A solution that requires three hours per week to maintain is fundamentally different from one that requires a dedicated DBA team. Ask what the actual SLA is. Don't over-optimize. If users are happy with 3-minute response times, don't build for 300ms. Save the complexity budget for when you actually need it.
The hybrid Glue plus DuckDB approach works when there's clear separation between slow-changing aggregates and fast-changing user inputs, when operational overhead budgets are constrained, and when 5-minute SLAs are achievable without heroic engineering. Your situation will be different. But the framework — analyze data volatility, match technology to access patterns, optimize for team skills, minimize operational complexity — applies universally.
Opinions expressed by DZone contributors are their own.
Comments