I recently wrote an introductory post about the importance of statistics. I just received a reinforcement on how important they are during my own work.
I hit a weird problem while I was setting up a query to illustrate a point (blog to be published next week). Let’s take the basis of the problem and explain it. I wanted data with distribution skew, so I ran this query to find out if there was a wide disparity between the top and bottom of the range:
SELECT i.BillToCustomerID, COUNT(i.BillToCustomerID) AS TestCount FROM Sales.Invoices AS i GROUP BY i.BillToCustomerID ORDER BY TestCount ASC;
Sure enough, the bottom of the range returned three (3) rows and the top returned 21,551. If I then run a query to retrieve just a few rows like this:
SELECT * FROM Sales.Invoices AS i WHERE i.BillToCustomerID = 1048;
I get the following execution plan:
I’m happy because this is the plan I expected. With this plan in hand, I don’t bother looking at anything else.
Creating a Problem
I expand out the query initially as follows:
SELECT i.InvoiceID, il.InvoiceLineID, si.StockItemName FROM Sales.Invoices AS i JOIN Sales.InvoiceLines AS il ON il.InvoiceID = i.InvoiceID JOIN Warehouse.StockItems AS si ON si.StockItemID = il.StockItemID WHERE i.BillToCustomerID = 1048;
The execution plan now looks like this:
Frankly, I’m puzzled. Why on earth did we go from a key lookup operation to a scan on the Invoices table? I rebuild the query a couple of times and it keeps going to a scan. Finally, I pause a moment and look at the row estimate (you know, like I should have done the first moment I was puzzled):
258 rows? Wait, that’s wrong. The number of rows for this value is three. Why on earth would it be showing 258? There’s no reason. I haven’t done any kinds of calculations on the columns. I double check the structures. No hidden views or constraints, or anything that would explain why the estimate was so wrong. However, it’s clear that the estimate of 258.181 is causing the loops join and key lookup to go away in favor of a hash join and scan when I add complexity to the row estimate needed by the optimizer.
After thinking about it a while, I finally ran DBCC SHOW_STATISTICS:
Note the highest point on the histogram, 1047. Yet I’m passing in 1048.
So, What’s Happening?
While the number of rows for 1048 was the lowest, at 3, unfortunately it seems that the 1048 values were added to the table after the statistics for the index had been updated. Instead of using something from the histogram, my value fell outside the values in the histogram. When the value is outside histogram the Cardinality Estimator uses the average value across the entire histogram, 258.181 (at least for any database that’s in SQL Server 2014 or greater and not running in a compatibility mode), as the row estimate.
I then change the query to use the value 1047, the execution plan then changed to look like this:
The new plan reflects the behavior I was going for when I was setting up the test. The row estimates are now accurate, and small, therefore I get a key lookup operation instead of a scan.
Statistics drive the decisions made by the optimizer. The very first moment you’re looking at an execution plan and you’re seeing a scan where you thought, for sure, you should have seen a seek, check the row estimates (OK, not the first moment, it could be a coding issue, structural issue, etc.). It could be that your statistics are off. I just received my own reminder to pay more attention to the row estimates and the statistics.
I love playing with statistics and execution plans and queries. As a result, I also like teaching how to do this stuff. If you’re interested, I’m putting on a class in Rhode Island, December 2016. Sign up here.