DZone
Thanks for visiting DZone today,
Edit Profile
  • Manage Email Subscriptions
  • How to Post to DZone
  • Article Submission Guidelines
Sign Out View Profile
  • Post an Article
  • Manage My Drafts
Over 2 million developers have joined DZone.
Log In / Join
Refcards Trend Reports
Events Video Library
Refcards
Trend Reports

Events

View Events Video Library

Related

  • AI Paradigm Shift: Analytics Without SQL
  • Beyond Partitioning and Z-Order: A Deep Dive into Liquid Clustering for Unity Catalog Managed Tables
  • One Query, Four GPUs: Tracing a Distributed Training Stall Across Nodes
  • Why We Chose Iceberg Over Delta After Evaluating Both at Scale

Trending

  • How to Set Up and Run PostgreSQL Change Data Capture
  • Throughput vs Goodput: The Performance Metric You Are Probably Ignoring in LLM Testing
  • Key Takeaways From Integrating a RAG Application With LangSmith
  • Your AI Agent Tests Are Passing, But Your Agent Is Still Broken
  1. DZone
  2. Data Engineering
  3. Databases
  4. Detecting Plan Regression in SQL Server Using Query Store

Detecting Plan Regression in SQL Server Using Query Store

A practical weighted baseline approach for detecting regressions by comparing recent workload performance against historical execution patterns.

By 
Deepesh Dhake user avatar
Deepesh Dhake
·
Jun. 09, 26 · Analysis
Likes (0)
Comment
Save
Tweet
Share
51 Views

Join the DZone community and get the full member experience.

Join For Free

Weighted Baseline Regression Pattern for Query Store

Plan regression in SQL Server usually shows up quietly. A query that has been stable for a long time suddenly becomes slower, even though there were no deployments, schema changes, or obvious infrastructure problems. In many cases, the issue is tied to an execution plan change caused by statistics updates, parameter sensitivity, changing data distribution, or normal optimizer behavior.

Problem

A common starting point is comparing "yesterday vs today" using average duration. While quick, this approach often does not reflect the full behavior of the workload

Query Store data is already summarized. By the time you hit sys.query_store_runtime_stats, SQL Server has grouped executions into time windows.

So a few things get lost pretty easily:

  • A couple of bad executions don’t stand out
  • Low traffic queries can look more important than they are
  • Plan changes get mixed with normal variation

There are cases where the average duration appears stable, but individual executions show spikes of 20-30x. You won't see that unless you dig deeper.

Another thing — multiple plans don’t automatically mean trouble. This is normal behavior in SQL Server. Statistics updates, parameter sniffing, and memory pressure can all introduce plan changes without indicating a regression. The key question is whether performance actually degraded after a plan change.

Detection Pattern

The approach is to compare recent performance against a historical baseline, while filtering out low-frequency noise. 

Detection pattern

Architecture View

Start by looking at older data as your baseline:

MS SQL
 
SELECT
    q.query_id,
    p.plan_id,
    SUM(rs.avg_duration * rs.count_executions) * 1.0/ NULLIF(SUM(rs.count_executions), 0) AS baseline_duration
FROM sys.query_store_runtime_stats rs
JOIN sys.query_store_plan p ON rs.plan_id = p.plan_id
JOIN sys.query_store_query q ON p.query_id = q.query_id
WHERE rs.last_execution_time >= DATEADD(day, -7, getdate())
  AND rs.last_execution_time < DATEADD(day, -1, getdate())
GROUP BY q.query_id, p.plan_id


Then compare that with what’s happening now:

MS SQL
 
SELECT
    q.query_id,
    p.plan_id,
    SUM(rs.avg_duration * rs.count_executions) * 1.0/ NULLIF(SUM(rs.count_executions), 0) AS current_duration,
    SUM(rs.count_executions) AS exec_count
FROM sys.query_store_runtime_stats rs
JOIN sys.query_store_plan p ON rs.plan_id = p.plan_id
JOIN sys.query_store_query q ON p.query_id = q.query_id
WHERE rs.last_execution_time >= DATEADD(day, -1, getdate())
GROUP BY q.query_id, p.plan_id;


And then just combine the two queries and look for queries that got significantly slower:

MS SQL
 
WITH baseline AS
(
 		SELECT
        q.query_id,
        p.plan_id,
        SUM(rs.avg_duration * rs.count_executions) * 1.0/ NULLIF(SUM(rs.count_executions), 0) AS baseline_duration,
        SUM(rs.count_executions) AS baseline_exec_count,
        SUM(rs.avg_duration * rs.count_executions) AS baseline_total_duration
    FROM sys.query_store_runtime_stats rs
    JOIN sys.query_store_plan p
        ON rs.plan_id = p.plan_id
    JOIN sys.query_store_query q
        ON p.query_id = q.query_id
    WHERE rs.last_execution_time >= DATEADD(day, -7, getdate())
      AND rs.last_execution_time < DATEADD(day, -1, getdate())
    GROUP BY q.query_id, p.plan_id
),
current_perf AS
(
    SELECT
        q.query_id,
        p.plan_id,
        SUM(rs.avg_duration * rs.count_executions) * 1.0/ NULLIF(SUM(rs.count_executions), 0) AS current_duration,
        SUM(rs.count_executions) AS current_exec_count,
        SUM(rs.avg_duration * rs.count_executions) AS current_total_duration
    FROM sys.query_store_runtime_stats rs
    JOIN sys.query_store_plan p
        ON rs.plan_id = p.plan_id
    JOIN sys.query_store_query q
        ON p.query_id = q.query_id
    WHERE rs.last_execution_time >= DATEADD(day, -1, getdate())
    GROUP BY q.query_id, p.plan_id
)
SELECT
    c.query_id,
    c.plan_id,
    b.baseline_duration,
    c.current_duration,
    CAST(c.current_duration * 1.0 / NULLIF(b.baseline_duration, 0) AS DECIMAL(10,2)) AS slowdown_factor,
    b.baseline_exec_count,
    c.current_exec_count,
    b.baseline_total_duration,
    c.current_total_duration
FROM current_perf c
JOIN baseline b
    ON c.query_id = b.query_id
   AND c.plan_id = b.plan_id
WHERE
    c.current_duration > b.baseline_duration * 1.5
    AND c.current_exec_count > 20
    AND c.current_total_duration > 100000
ORDER BY slowdown_factor DESC;


Check the Plans

Once a query is identified, the next step is to verify whether the execution plan has changed.

MS SQL
 
SELECT
    q.query_id,
    p.plan_id,
    p.last_execution_time
FROM sys.query_store_plan p
JOIN sys.query_store_query q
    ON p.query_id = q.query_id
ORDER BY p.last_execution_time DESC;


If a new plan appears around the same time the performance degradation began, it is often the root cause.

Real-World Example

I've seen this in reporting systems. Query runs fine for weeks, around 150 ms.

Stats update happens, nothing else changes, and suddenly it’s over a second.

Same query, just a different plan. Query Store showed a join strategy change and worse estimates.

Forcing the old plan fixed it immediately. Then we went back and checked why stats pushed the optimizer in that direction.

MS SQL
 
EXEC sp_query_store_force_plan
    @query_id = 42,
    @plan_id = 10;


Conclusion

You don't really need a complex setup for this.

If you understand baseline vs current, and you actually look at execution plans instead of just averages, most regressions are pretty easy to spot.

Query Store already has everything. The key is to avoid overinterpreting simple averages and focus on meaningful workload changes.

sql

Opinions expressed by DZone contributors are their own.

Related

  • AI Paradigm Shift: Analytics Without SQL
  • Beyond Partitioning and Z-Order: A Deep Dive into Liquid Clustering for Unity Catalog Managed Tables
  • One Query, Four GPUs: Tracing a Distributed Training Stall Across Nodes
  • Why We Chose Iceberg Over Delta After Evaluating Both at Scale

Partner Resources

×

Comments

The likes didn't load as expected. Please refresh the page and try again.

  • RSS
  • X
  • Facebook

ABOUT US

  • About DZone
  • Support and feedback
  • Community research

ADVERTISE

  • Advertise with DZone

CONTRIBUTE ON DZONE

  • Article Submission Guidelines
  • Become a Contributor
  • Core Program
  • Visit the Writers' Zone

LEGAL

  • Terms of Service
  • Privacy Policy

CONTACT US

  • 3343 Perimeter Hill Drive
  • Suite 215
  • Nashville, TN 37211
  • [email protected]

Let's be friends:

  • RSS
  • X
  • Facebook