DZone
Thanks for visiting DZone today,
Edit Profile
  • Manage Email Subscriptions
  • How to Post to DZone
  • Article Submission Guidelines
Sign Out View Profile
  • Post an Article
  • Manage My Drafts
Over 2 million developers have joined DZone.
Log In / Join
Refcards Trend Reports
Events Video Library
Refcards
Trend Reports

Events

View Events Video Library

Related

  • Stop Running Two Data Systems for One Agent Query
  • Scaling Cloud Data Automation: A Practical Guide to Open Table Formats
  • Why SAP S/4HANA Landscape Design Impacts Cloud TCO More Than Compute Costs
  • Beyond SOLID: Embracing CUPID for Modern Software Craftsmanship

Trending

  • The Hidden Cost of Overprivileged Tokens: Designing Messaging Platforms That Assume Compromise
  • LLM Agents and Getting Started with Them
  • Designing API-First EMR Architectures in .NET: Enabling Modular Growth in Compliance-Driven Systems
  • S3 Vectors: How to Build a RAG Without a Vector Database
  1. DZone
  2. Data Engineering
  3. Data
  4. The Data Warehouse Concurrency Playbook: Surviving the "Super Bowl" Moment

The Data Warehouse Concurrency Playbook: Surviving the "Super Bowl" Moment

Classify requests (dashboards vs exploration/jobs), cap and prioritize concurrency, and fall back to cache/rollups so critical dashboards stay responsive during spikes.

By 
Anusha Kovi user avatar
Anusha Kovi
DZone Core CORE ·
May. 08, 26 · Analysis
Likes (0)
Comment
Save
Tweet
Share
1.9K Views

Join the DZone community and get the full member experience.

Join For Free

It was a normal Tuesday until someone dropped a real-time dashboard link into a big team group. A few people opened it, and then a few hundred did. Within minutes, a slack pattern appeared: queries timing out, dashboards spinning, and the inevitable 'Is the data broken?'.

The confusing part here is that the CPU wasn't paged, the warehouse didn't look obviously maxed out, and nothing was 'red.' Yet the platform was unusable. That's what concurrency incidents look like in data: not a clean failure but a slow collapse into queues and retries.

This article is a practical playbook to make spikes boring. When demand explodes, the system should degrade intentionally, keeping the most important BI experiences alive.

Why Warehouses Melt Down Under Concurrency

When concurrency explodes, four things usually happen at once:

  1. Queues form everywhere: Even if you have enough compute, shared bottlenecks start to dominate: contention on resources, compilation, storage and network IO, and metadata calls.
  2. Mixed workloads: Executive dashboards compete with scheduled jobs, notebooks, and bulk reports in the same pool.
  3. Retry storms: timeouts cause automatic retries, which create a second wave of load.
  4. One click becomes many queries: A dashboard isn't one query. It's often 10 to 40 queries multiplied by 300 viewers, and you are suddenly pushing thousands of queries.

Without traffic rules, the warehouse becomes first-come, first-served, which under stress becomes the noisiest workload wins. 

The 'Super Bowl Standard' for BI Platforms

In distributed systems, there's a concept I love: when you expect a huge surge, you don't rely only on reactive scaling. You decide what must work, what can degrade, and what should pause. You make the behavior predictable.

For data platforms, the 'must work' path is usually

  • Incident dashboards
  • Tier-0 operational/executive dashboards
  • Critical refresh jobs (only the ones that keep Tier-0 accurate)

Exploration can slow down. Background jobs can pause. Exports can wait.

The goal isn't for everything to work perfectly, but the goal is to keep the right things working.

The Concurrency Playbook

The playbook consists of four parts:

  1. Classify queries 
  2. Control admission
  3. Prioritize fairly
  4. Shed load gracefully

Step 1: Classify Queries

Most warehouses don't die because of one bad query. They die because the system treats all queries as equal. 

So, the first step is labelling: every query should fall into many classes. Here is the practical set you can use:

The Signature Table: Query Class -> Limits -> Fallbacks

Class A: Tier-0 Dashboards (Must Stay Up)

Some examples are to check orders/minute, today's revenue, and incident health

  • Limits: Reserved concurrency, retries off, short timeouts, highest priority
  • Fallbacks: Precomputed rollups/materialized views, cached results with 'as-of timestamp'

Class B: Standard Dashboards (Should Mostly Work)

Some examples are team reports and weekly org KPI dashboards

  • Limits: Limited retries, concurrency cap, small queue allowed, medium priority
  • Fallbacks: Reduced dimensions, cached results for recent windows, top-N outputs

Class C: Ad Hoc Exploration (Allowed to Slow)

Some examples are adhoc cohort slicing and analyst notebooks

  • Limits: Strict concurrency cap, fail fast, low priority during spikes, short queue
  • Fallbacks: Forces filters, sampling, async execution

Class D: Background Jobs (Can Pause)

Some examples are transforms, non-critical exports, and scheduled refreshes

  • Limits: Shifted off-peak, throttled by default, separate pool if possible
  • Fallbacks: Run later, backfill, skip non-critical

Step 2: Admission Control

This steps answers: should this query start now? The minimum controls that work are:

  • Queue limits
  • Concurrency caps by tenant/team
  • Start-time budget (if it can't start soon, fail fast or degrade)
  • Concurrency caps by class

A simple policy here is:

  • Class A: admit immediately unless system is in hard outage
  • Class B: admit if queue less than threshold and warehouse health is good
  • Class C: admit only if spare capacity exists, otherwise queue briefly then fail fast with guidance
  • Class D/E: only admit in off-peak windows or if explicitly authorized
  • Class F: sandbox only

This is how you prevent the worst failure mode which is the warehouse becoming slow for everyone.

Step 3: Prioritization

Once you have queues, ordering matters. Two rules are:

  1. Priority across classes: A>B>C>D>E>F
  2. Fairness within a class: don't let one dashboard or one team consume the whole lane.

Fairness is what prevents a single popular dashboard from starving everything else.

Step 4: Load Shedding 

Load shedding is not denying everything. It is a controlled degradation strategy. Good load shedding options for BI are:

  1. Sampling for exploration queries
  2. Pre-aggregated rollups (swap to a smaller table or to a materialized view)
  3. Async execution 
  4. Reduced fidelity (fewer dimensions, top N only, coarser time buckets)
  5. Fail fast with guidance (tell the user what to change)
  6. Cached results with an explicit as-of timestamp

Note: Load shedding should never violate guidance. If a user is not allowed to see raw data, do not degrade into exposing it. The fallback must be policy aware.

Sample Policy Config

YAML
 
# concurrency_policy.yaml (example)
classes:
  A_tier0:
    priority: 100
    max_running: 60
    max_queue: 200
    start_deadline_ms: 2000
    timeout_ms: 8000
    retries: 0
    fallback: cache_or_rollup

  B_standard:
    priority: 70
    max_running: 250
    max_queue: 800
    start_deadline_ms: 8000
    timeout_ms: 20000
    retries: 1
    fallback: cache_recent_or_reduce_dims

  C_explore:
    priority: 30
    max_running: 40
    max_queue: 100
    start_deadline_ms: 1500
    timeout_ms: 12000
    retries: 0
    fallback: sample_or_async

  D_background:
    priority: 20
    max_running: 25
    max_queue: 100
    start_deadline_ms: 30000
    timeout_ms: 60000
    retries: 1
    fallback: defer_to_window

  E_bulk_extract:
    priority: 10
    max_running: 5
    max_queue: 20
    start_deadline_ms: 0
    timeout_ms: 90000
    retries: 0
    fallback: require_approval_or_offpeak

tenants:
  default:
    max_running_per_tenant: 40
  exec_dashboards:
    max_running_per_tenant: 80

global:
  hard_reject_when_unhealthy: true
  unhealthy_signals:
    - queue_depth_p99_gt: 5000
    - compilation_latency_p95_gt_ms: 5000
    - retry_rate_gt: 0.05


Sample Admission And Fallback Logic

Python
 
def handle_request(req):
    cls = classify(req)              # A/B/C/D/E/F
    tenant = req.tenant_id

    if unhealthy() and policy.global.hard_reject_when_unhealthy:
        if cls == "A_tier0":
            # Tier-0 still gets a shot, but we try the safest path first
            return serve_fallback(req, cls, reason="unhealthy_fast_path")
        return reject(req, reason="warehouse_unhealthy")

    if running_count(cls) >= policy[cls].max_running:
        if queue_count(cls) >= policy[cls].max_queue:
            return serve_fallback(req, cls, reason="queue_full")
        enqueue(req, cls)
        if not started_within(req, policy[cls].start_deadline_ms):
            dequeue(req)
            return serve_fallback(req, cls, reason="start_deadline_exceeded")

    # Admitted
    result = execute(req, timeout=policy[cls].timeout_ms, retries=policy[cls].retries)

    if result.timed_out or result.over_budget:
        return serve_fallback(req, cls, reason="timeout_or_budget")

    return result


What 'Good' Looks Like During A Spike

When 300 people open the same dashboard, here is how it works:

  • It is class A or B, so it runs in a protected lane
  • Rollups/caching absorb repeated refreshes
  • Exploration (class C) slows, samples, or becomes async
  • Background jobs (class D) pause temporarily
  • Bulk exports (class E) move off-peak
  • Unknown clients (class F) are sandboxed

The result is Tier-0 stays usable, the platform stays alive, and on-call isn't fighting retry storms.

What to Measure

  • Queue depth over time
  • Top dashboards by fanout
  • P95/p99 latency by class 
  • Bytes scanned/cost by class
  • Retry rate
  • Admitted vs. rejected vs. shed counts

Common Failure Modes

  1. Retry storms:
    • Cause: timeouts trigger auto retries; load doubles
    • Fix: retries for Class A; fail fast for Class C; capped retries elsewhere
  2. Unknown clients/backdoor load
    • Cause: misconfigured tools or bots hammer the warehouse
    • Fix: default Class F sandbox, registration, and quotas
  3. Dashboard bombs: 
    • Cause: one dashboard triggers many queries and hundreds of viewers amplify it
    • Fix: caching or rollups, class A/B priority lanes, per dashboard caps
  4. Background jobs complete with humans:
    • Cause: Scheduled refreshes saturate shared resources during peak
    • Fix: Class D throttling, off-peak windows, and keep the lights on subset

Conclusion

Concurrent surges are not rare. Successful platforms attract them. The question is whether your warehouse behaves like a panicked crowd or a managed stadium. 

With query classes, admission control, prioritization, and load shedding, you can keep tier-0 alive under extreme concurrency and turn the 'Super Bowl Moment' from an outage into an operating mode.

Data warehouse Data (computing) systems

Opinions expressed by DZone contributors are their own.

Related

  • Stop Running Two Data Systems for One Agent Query
  • Scaling Cloud Data Automation: A Practical Guide to Open Table Formats
  • Why SAP S/4HANA Landscape Design Impacts Cloud TCO More Than Compute Costs
  • Beyond SOLID: Embracing CUPID for Modern Software Craftsmanship

Partner Resources

×

Comments

The likes didn't load as expected. Please refresh the page and try again.

  • RSS
  • X
  • Facebook

ABOUT US

  • About DZone
  • Support and feedback
  • Community research

ADVERTISE

  • Advertise with DZone

CONTRIBUTE ON DZONE

  • Article Submission Guidelines
  • Become a Contributor
  • Core Program
  • Visit the Writers' Zone

LEGAL

  • Terms of Service
  • Privacy Policy

CONTACT US

  • 3343 Perimeter Hill Drive
  • Suite 215
  • Nashville, TN 37211
  • [email protected]

Let's be friends:

  • RSS
  • X
  • Facebook