The Data Warehouse Concurrency Playbook: Surviving the "Super Bowl" Moment
Classify requests (dashboards vs exploration/jobs), cap and prioritize concurrency, and fall back to cache/rollups so critical dashboards stay responsive during spikes.
Join the DZone community and get the full member experience.
Join For FreeIt was a normal Tuesday until someone dropped a real-time dashboard link into a big team group. A few people opened it, and then a few hundred did. Within minutes, a slack pattern appeared: queries timing out, dashboards spinning, and the inevitable 'Is the data broken?'.
The confusing part here is that the CPU wasn't paged, the warehouse didn't look obviously maxed out, and nothing was 'red.' Yet the platform was unusable. That's what concurrency incidents look like in data: not a clean failure but a slow collapse into queues and retries.
This article is a practical playbook to make spikes boring. When demand explodes, the system should degrade intentionally, keeping the most important BI experiences alive.
Why Warehouses Melt Down Under Concurrency
When concurrency explodes, four things usually happen at once:
- Queues form everywhere: Even if you have enough compute, shared bottlenecks start to dominate: contention on resources, compilation, storage and network IO, and metadata calls.
- Mixed workloads: Executive dashboards compete with scheduled jobs, notebooks, and bulk reports in the same pool.
- Retry storms: timeouts cause automatic retries, which create a second wave of load.
- One click becomes many queries: A dashboard isn't one query. It's often 10 to 40 queries multiplied by 300 viewers, and you are suddenly pushing thousands of queries.
Without traffic rules, the warehouse becomes first-come, first-served, which under stress becomes the noisiest workload wins.
The 'Super Bowl Standard' for BI Platforms
In distributed systems, there's a concept I love: when you expect a huge surge, you don't rely only on reactive scaling. You decide what must work, what can degrade, and what should pause. You make the behavior predictable.
For data platforms, the 'must work' path is usually
- Incident dashboards
- Tier-0 operational/executive dashboards
- Critical refresh jobs (only the ones that keep Tier-0 accurate)
Exploration can slow down. Background jobs can pause. Exports can wait.
The goal isn't for everything to work perfectly, but the goal is to keep the right things working.
The Concurrency Playbook
The playbook consists of four parts:
- Classify queries
- Control admission
- Prioritize fairly
- Shed load gracefully
Step 1: Classify Queries
Most warehouses don't die because of one bad query. They die because the system treats all queries as equal.
So, the first step is labelling: every query should fall into many classes. Here is the practical set you can use:
The Signature Table: Query Class -> Limits -> Fallbacks
Class A: Tier-0 Dashboards (Must Stay Up)
Some examples are to check orders/minute, today's revenue, and incident health
- Limits: Reserved concurrency, retries off, short timeouts, highest priority
- Fallbacks: Precomputed rollups/materialized views, cached results with 'as-of timestamp'
Class B: Standard Dashboards (Should Mostly Work)
Some examples are team reports and weekly org KPI dashboards
- Limits: Limited retries, concurrency cap, small queue allowed, medium priority
- Fallbacks: Reduced dimensions, cached results for recent windows, top-N outputs
Class C: Ad Hoc Exploration (Allowed to Slow)
Some examples are adhoc cohort slicing and analyst notebooks
- Limits: Strict concurrency cap, fail fast, low priority during spikes, short queue
- Fallbacks: Forces filters, sampling, async execution
Class D: Background Jobs (Can Pause)
Some examples are transforms, non-critical exports, and scheduled refreshes
- Limits: Shifted off-peak, throttled by default, separate pool if possible
- Fallbacks: Run later, backfill, skip non-critical
Step 2: Admission Control
This steps answers: should this query start now? The minimum controls that work are:
- Queue limits
- Concurrency caps by tenant/team
- Start-time budget (if it can't start soon, fail fast or degrade)
- Concurrency caps by class
A simple policy here is:
- Class A: admit immediately unless system is in hard outage
- Class B: admit if queue less than threshold and warehouse health is good
- Class C: admit only if spare capacity exists, otherwise queue briefly then fail fast with guidance
- Class D/E: only admit in off-peak windows or if explicitly authorized
- Class F: sandbox only
This is how you prevent the worst failure mode which is the warehouse becoming slow for everyone.
Step 3: Prioritization
Once you have queues, ordering matters. Two rules are:
- Priority across classes: A>B>C>D>E>F
- Fairness within a class: don't let one dashboard or one team consume the whole lane.
Fairness is what prevents a single popular dashboard from starving everything else.
Step 4: Load Shedding
Load shedding is not denying everything. It is a controlled degradation strategy. Good load shedding options for BI are:
- Sampling for exploration queries
- Pre-aggregated rollups (swap to a smaller table or to a materialized view)
- Async execution
- Reduced fidelity (fewer dimensions, top N only, coarser time buckets)
- Fail fast with guidance (tell the user what to change)
- Cached results with an explicit as-of timestamp
Note: Load shedding should never violate guidance. If a user is not allowed to see raw data, do not degrade into exposing it. The fallback must be policy aware.
Sample Policy Config
# concurrency_policy.yaml (example)
classes:
A_tier0:
priority: 100
max_running: 60
max_queue: 200
start_deadline_ms: 2000
timeout_ms: 8000
retries: 0
fallback: cache_or_rollup
B_standard:
priority: 70
max_running: 250
max_queue: 800
start_deadline_ms: 8000
timeout_ms: 20000
retries: 1
fallback: cache_recent_or_reduce_dims
C_explore:
priority: 30
max_running: 40
max_queue: 100
start_deadline_ms: 1500
timeout_ms: 12000
retries: 0
fallback: sample_or_async
D_background:
priority: 20
max_running: 25
max_queue: 100
start_deadline_ms: 30000
timeout_ms: 60000
retries: 1
fallback: defer_to_window
E_bulk_extract:
priority: 10
max_running: 5
max_queue: 20
start_deadline_ms: 0
timeout_ms: 90000
retries: 0
fallback: require_approval_or_offpeak
tenants:
default:
max_running_per_tenant: 40
exec_dashboards:
max_running_per_tenant: 80
global:
hard_reject_when_unhealthy: true
unhealthy_signals:
- queue_depth_p99_gt: 5000
- compilation_latency_p95_gt_ms: 5000
- retry_rate_gt: 0.05
Sample Admission And Fallback Logic
def handle_request(req):
cls = classify(req) # A/B/C/D/E/F
tenant = req.tenant_id
if unhealthy() and policy.global.hard_reject_when_unhealthy:
if cls == "A_tier0":
# Tier-0 still gets a shot, but we try the safest path first
return serve_fallback(req, cls, reason="unhealthy_fast_path")
return reject(req, reason="warehouse_unhealthy")
if running_count(cls) >= policy[cls].max_running:
if queue_count(cls) >= policy[cls].max_queue:
return serve_fallback(req, cls, reason="queue_full")
enqueue(req, cls)
if not started_within(req, policy[cls].start_deadline_ms):
dequeue(req)
return serve_fallback(req, cls, reason="start_deadline_exceeded")
# Admitted
result = execute(req, timeout=policy[cls].timeout_ms, retries=policy[cls].retries)
if result.timed_out or result.over_budget:
return serve_fallback(req, cls, reason="timeout_or_budget")
return result
What 'Good' Looks Like During A Spike
When 300 people open the same dashboard, here is how it works:
- It is class A or B, so it runs in a protected lane
- Rollups/caching absorb repeated refreshes
- Exploration (class C) slows, samples, or becomes async
- Background jobs (class D) pause temporarily
- Bulk exports (class E) move off-peak
- Unknown clients (class F) are sandboxed
The result is Tier-0 stays usable, the platform stays alive, and on-call isn't fighting retry storms.
What to Measure
- Queue depth over time
- Top dashboards by fanout
- P95/p99 latency by class
- Bytes scanned/cost by class
- Retry rate
- Admitted vs. rejected vs. shed counts
Common Failure Modes
- Retry storms:
- Cause: timeouts trigger auto retries; load doubles
- Fix: retries for Class A; fail fast for Class C; capped retries elsewhere
- Unknown clients/backdoor load
- Cause: misconfigured tools or bots hammer the warehouse
- Fix: default Class F sandbox, registration, and quotas
- Dashboard bombs:
- Cause: one dashboard triggers many queries and hundreds of viewers amplify it
- Fix: caching or rollups, class A/B priority lanes, per dashboard caps
- Background jobs complete with humans:
- Cause: Scheduled refreshes saturate shared resources during peak
- Fix: Class D throttling, off-peak windows, and keep the lights on subset
Conclusion
Concurrent surges are not rare. Successful platforms attract them. The question is whether your warehouse behaves like a panicked crowd or a managed stadium.
With query classes, admission control, prioritization, and load shedding, you can keep tier-0 alive under extreme concurrency and turn the 'Super Bowl Moment' from an outage into an operating mode.
Opinions expressed by DZone contributors are their own.
Comments