The Data Warehouse Concurrency Playbook: Surviving the "Super Bowl" Moment

Classify requests (dashboards vs exploration/jobs), cap and prioritize concurrency, and fall back to cache/rollups so critical dashboards stay responsive during spikes.

Anusha Kovi

CORE ·

May. 08, 26 · Analysis

Likes (0)

Comment

Save

2.2K Views

It was a normal Tuesday until someone dropped a real-time dashboard link into a big team group. A few people opened it, and then a few hundred did. Within minutes, a slack pattern appeared: queries timing out, dashboards spinning, and the inevitable 'Is the data broken?'.

The confusing part here is that the CPU wasn't paged, the warehouse didn't look obviously maxed out, and nothing was 'red.' Yet the platform was unusable. That's what concurrency incidents look like in data: not a clean failure but a slow collapse into queues and retries.

This article is a practical playbook to make spikes boring. When demand explodes, the system should degrade intentionally, keeping the most important BI experiences alive.

Why Warehouses Melt Down Under Concurrency

When concurrency explodes, four things usually happen at once:

Queues form everywhere: Even if you have enough compute, shared bottlenecks start to dominate: contention on resources, compilation, storage and network IO, and metadata calls.
Mixed workloads: Executive dashboards compete with scheduled jobs, notebooks, and bulk reports in the same pool.
Retry storms: timeouts cause automatic retries, which create a second wave of load.
One click becomes many queries: A dashboard isn't one query. It's often 10 to 40 queries multiplied by 300 viewers, and you are suddenly pushing thousands of queries.

Without traffic rules, the warehouse becomes first-come, first-served, which under stress becomes the noisiest workload wins.

The 'Super Bowl Standard' for BI Platforms

In distributed systems, there's a concept I love: when you expect a huge surge, you don't rely only on reactive scaling. You decide what must work, what can degrade, and what should pause. You make the behavior predictable.

For data platforms, the 'must work' path is usually

Incident dashboards
Tier-0 operational/executive dashboards
Critical refresh jobs (only the ones that keep Tier-0 accurate)

Exploration can slow down. Background jobs can pause. Exports can wait.

The goal isn't for everything to work perfectly, but the goal is to keep the right things working.

The Concurrency Playbook

The playbook consists of four parts:

Classify queries
Control admission
Prioritize fairly
Shed load gracefully

Step 1: Classify Queries

Most warehouses don't die because of one bad query. They die because the system treats all queries as equal.

So, the first step is labelling: every query should fall into many classes. Here is the practical set you can use:

The Signature Table: Query Class -> Limits -> Fallbacks

Class A: Tier-0 Dashboards (Must Stay Up)

Some examples are to check orders/minute, today's revenue, and incident health

Limits: Reserved concurrency, retries off, short timeouts, highest priority
Fallbacks: Precomputed rollups/materialized views, cached results with 'as-of timestamp'

Class B: Standard Dashboards (Should Mostly Work)

Some examples are team reports and weekly org KPI dashboards

Limits: Limited retries, concurrency cap, small queue allowed, medium priority
Fallbacks: Reduced dimensions, cached results for recent windows, top-N outputs

Class C: Ad Hoc Exploration (Allowed to Slow)

Some examples are adhoc cohort slicing and analyst notebooks

Limits: Strict concurrency cap, fail fast, low priority during spikes, short queue
Fallbacks: Forces filters, sampling, async execution

Class D: Background Jobs (Can Pause)

Some examples are transforms, non-critical exports, and scheduled refreshes

Limits: Shifted off-peak, throttled by default, separate pool if possible
Fallbacks: Run later, backfill, skip non-critical

Step 2: Admission Control

This steps answers: should this query start now? The minimum controls that work are:

Queue limits
Concurrency caps by tenant/team
Start-time budget (if it can't start soon, fail fast or degrade)
Concurrency caps by class

A simple policy here is:

Class A: admit immediately unless system is in hard outage
Class B: admit if queue less than threshold and warehouse health is good
Class C: admit only if spare capacity exists, otherwise queue briefly then fail fast with guidance
Class D/E: only admit in off-peak windows or if explicitly authorized
Class F: sandbox only

This is how you prevent the worst failure mode which is the warehouse becoming slow for everyone.

Step 3: Prioritization

Once you have queues, ordering matters. Two rules are:

Priority across classes: A>B>C>D>E>F
Fairness within a class: don't let one dashboard or one team consume the whole lane.

Fairness is what prevents a single popular dashboard from starving everything else.

Step 4: Load Shedding

Load shedding is not denying everything. It is a controlled degradation strategy. Good load shedding options for BI are:

Sampling for exploration queries
Pre-aggregated rollups (swap to a smaller table or to a materialized view)
Async execution
Reduced fidelity (fewer dimensions, top N only, coarser time buckets)
Fail fast with guidance (tell the user what to change)
Cached results with an explicit as-of timestamp

Note: Load shedding should never violate guidance. If a user is not allowed to see raw data, do not degrade into exposing it. The fallback must be policy aware.

Sample Policy Config

    YAML
   
 

   # concurrency_policy.yaml (example)
classes:
  A_tier0:
    priority: 100
    max_running: 60
    max_queue: 200
    start_deadline_ms: 2000
    timeout_ms: 8000
    retries: 0
    fallback: cache_or_rollup

  B_standard:
    priority: 70
    max_running: 250
    max_queue: 800
    start_deadline_ms: 8000
    timeout_ms: 20000
    retries: 1
    fallback: cache_recent_or_reduce_dims

  C_explore:
    priority: 30
    max_running: 40
    max_queue: 100
    start_deadline_ms: 1500
    timeout_ms: 12000
    retries: 0
    fallback: sample_or_async

  D_background:
    priority: 20
    max_running: 25
    max_queue: 100
    start_deadline_ms: 30000
    timeout_ms: 60000
    retries: 1
    fallback: defer_to_window

  E_bulk_extract:
    priority: 10
    max_running: 5
    max_queue: 20
    start_deadline_ms: 0
    timeout_ms: 90000
    retries: 0
    fallback: require_approval_or_offpeak

tenants:
  default:
    max_running_per_tenant: 40
  exec_dashboards:
    max_running_per_tenant: 80

global:
  hard_reject_when_unhealthy: true
  unhealthy_signals:
    - queue_depth_p99_gt: 5000
    - compilation_latency_p95_gt_ms: 5000
    - retry_rate_gt: 0.05
  

Sample Admission And Fallback Logic

    Python
   
 

   def handle_request(req):
    cls = classify(req)              # A/B/C/D/E/F
    tenant = req.tenant_id

    if unhealthy() and policy.global.hard_reject_when_unhealthy:
        if cls == "A_tier0":
            # Tier-0 still gets a shot, but we try the safest path first
            return serve_fallback(req, cls, reason="unhealthy_fast_path")
        return reject(req, reason="warehouse_unhealthy")

    if running_count(cls) >= policy[cls].max_running:
        if queue_count(cls) >= policy[cls].max_queue:
            return serve_fallback(req, cls, reason="queue_full")
        enqueue(req, cls)
        if not started_within(req, policy[cls].start_deadline_ms):
            dequeue(req)
            return serve_fallback(req, cls, reason="start_deadline_exceeded")

    # Admitted
    result = execute(req, timeout=policy[cls].timeout_ms, retries=policy[cls].retries)

    if result.timed_out or result.over_budget:
        return serve_fallback(req, cls, reason="timeout_or_budget")

    return result
  

What 'Good' Looks Like During A Spike

When 300 people open the same dashboard, here is how it works:

It is class A or B, so it runs in a protected lane
Rollups/caching absorb repeated refreshes
Exploration (class C) slows, samples, or becomes async
Background jobs (class D) pause temporarily
Bulk exports (class E) move off-peak
Unknown clients (class F) are sandboxed

The result is Tier-0 stays usable, the platform stays alive, and on-call isn't fighting retry storms.

What to Measure

Queue depth over time
Top dashboards by fanout
P95/p99 latency by class
Bytes scanned/cost by class
Retry rate
Admitted vs. rejected vs. shed counts

Common Failure Modes

Retry storms:
- Cause: timeouts trigger auto retries; load doubles
- Fix: retries for Class A; fail fast for Class C; capped retries elsewhere
Unknown clients/backdoor load
- Cause: misconfigured tools or bots hammer the warehouse
- Fix: default Class F sandbox, registration, and quotas
Dashboard bombs:
- Cause: one dashboard triggers many queries and hundreds of viewers amplify it
- Fix: caching or rollups, class A/B priority lanes, per dashboard caps
Background jobs complete with humans:
- Cause: Scheduled refreshes saturate shared resources during peak
- Fix: Class D throttling, off-peak windows, and keep the lights on subset

Conclusion

Concurrent surges are not rare. Successful platforms attract them. The question is whether your warehouse behaves like a panicked crowd or a managed stadium.

With query classes, admission control, prioritization, and load shedding, you can keep tier-0 alive under extreme concurrency and turn the 'Super Bowl Moment' from an outage into an operating mode.

Data warehouse Data (computing) systems

Opinions expressed by DZone contributors are their own.

Related

Trending