
Reinforcing the Importance of Statistics on Row Estimate

Statistics drive the decisions the optimizer makes. See how to take that into account so your queries behave the way you expect.

By Grant Fritchey · Nov. 04, 16 · Database Zone · Tutorial

I recently wrote an introductory post about the importance of statistics. I just received a reinforcement of how important they are during my own work.

Bad Estimate

I hit a weird problem while I was setting up a query to illustrate a point (blog post to be published next week). Let's take the basis of the problem and explain it. I wanted data with a skewed distribution, so I ran this query to find out whether there was a wide disparity between the top and bottom of the range:

SELECT i.BillToCustomerID,
   COUNT(i.BillToCustomerID) AS TestCount
FROM Sales.Invoices AS i
GROUP BY i.BillToCustomerID
ORDER BY TestCount ASC;

Sure enough, the bottom of the range returned three (3) rows and the top returned 21,551. If I then run a query to retrieve just a few rows, like this:

SELECT *
FROM Sales.Invoices AS i
WHERE i.BillToCustomerID = 1048;

I get the following execution plan:

[Image: simple execution plan showing an index seek with a key lookup]

I'm happy because this is the plan I expected. With this plan in hand, I don't bother looking at anything else.

Creating a Problem

I initially expand the query as follows:

SELECT i.InvoiceID,
   il.InvoiceLineID,
   si.StockItemName
FROM Sales.Invoices AS i
JOIN Sales.InvoiceLines AS il
   ON il.InvoiceID = i.InvoiceID
JOIN Warehouse.StockItems AS si
   ON si.StockItemID = il.StockItemID
WHERE i.BillToCustomerID = 1048;

The execution plan now looks like this:

[Image: more complex execution plan showing a scan on the Invoices table and a hash join]

Frankly, I'm puzzled. Why on earth did we go from a key lookup operation to a scan on the Invoices table? I rebuild the query a couple of times and it keeps going to a scan. Finally, I pause a moment and look at the row estimate (you know, like I should have done the first moment I was puzzled):

[Image: operator properties showing an estimated number of rows of 258.181]

258 rows? Wait, that's wrong. The number of rows for this value is three. Why on earth would it be showing 258? There's no reason. I haven't done any kind of calculation on the columns. I double-check the structures: no hidden views or constraints, or anything else that would explain why the estimate was so wrong. However, it's clear that the estimate of 258.181 is what causes the nested loops join and key lookup to go away in favor of a hash join and scan once I add complexity to the query the optimizer has to estimate.

After thinking about it a while, I finally ran DBCC SHOW_STATISTICS:

[Image: DBCC SHOW_STATISTICS output showing the histogram for the index]

Note the highest point on the histogram: 1047. Yet I'm passing in 1048.
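
For reference, the histogram above can be pulled directly. This is a minimal sketch assuming the WideWorldImporters sample database; the statistics name FK_Sales_Invoices_BillToCustomerID is my assumption, so check sys.stats for the actual name on your copy:

-- List the statistics objects on the table to find the right name.
SELECT s.name AS StatsName
FROM sys.stats AS s
WHERE s.object_id = OBJECT_ID('Sales.Invoices');

-- Show the header, density vector, and histogram for that statistics object.
DBCC SHOW_STATISTICS ('Sales.Invoices', FK_Sales_Invoices_BillToCustomerID);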

So, What's Happening?

While the number of rows for 1048 was the lowest, at 3, it seems the 1048 values were unfortunately added to the table after the statistics for the index had last been updated. Instead of matching a step in the histogram, my value fell outside the range the histogram covers. When the value is outside the histogram, the cardinality estimator uses the average number of rows per value across the entire histogram, 258.181, as the row estimate (at least for any database that's on SQL Server 2014 or greater and not running in a lower compatibility level).
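
To sanity-check where 258.181 comes from, here's a rough sketch (my own approximation, not the optimizer's internal math): compute the average number of rows per distinct BillToCustomerID, which should land near that estimate if the out-of-histogram guess really is the column's density multiplied by the table's row count:

-- Average rows per distinct BillToCustomerID; the same density shows up in
-- the "All density" column of DBCC SHOW_STATISTICS ... WITH DENSITY_VECTOR.
SELECT COUNT(*) * 1.0 / COUNT(DISTINCT i.BillToCustomerID) AS AvgRowsPerValue
FROM Sales.Invoices AS i;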

I then changed the query to use the value 1047, and the execution plan changed to look like this:

[Image: execution plan back to an index seek with a key lookup]

The new plan reflects the behavior I was going for when I set up the test. The row estimates are now accurate, and small, so I get a key lookup operation instead of a scan.
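
If I had wanted the original 1048 query to get an accurate estimate rather than switching values, one option (just a sketch of the obvious fix, since stale statistics are the root cause here) would be to refresh the statistics so the histogram picks up the newer rows:

-- Rebuild the table's statistics with a full scan so values added since the
-- last update (such as 1048) are represented in the histogram.
UPDATE STATISTICS Sales.Invoices WITH FULLSCAN;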

Conclusion

Statistics drive the decisions made by the optimizer. The very first moment you're looking at an execution plan and seeing a scan where you thought, for sure, you should have seen a seek, check the row estimates (OK, maybe not the very first moment; it could be a coding issue, a structural issue, etc.). It could be that your statistics are off. I just received my own reminder to pay more attention to the row estimates and the statistics.
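
As a starting point for that check, here's a small sketch (assuming a version of SQL Server where sys.dm_db_stats_properties is available) showing when each statistics object on the table was last updated and how many modifications have piled up since:

-- Statistics on Sales.Invoices with last update time and modifications since;
-- a large modification_counter against an old last_updated suggests the row
-- estimates may be working from stale data.
SELECT s.name AS StatsName,
       sp.last_updated,
       sp.rows,
       sp.modification_counter
FROM sys.stats AS s
CROSS APPLY sys.dm_db_stats_properties(s.object_id, s.stats_id) AS sp
WHERE s.object_id = OBJECT_ID('Sales.Invoices');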

I love playing with statistics and execution plans and queries. As a result, I also like teaching how to do this stuff. If you're interested, I'm putting on a class in Rhode Island in December 2016. Sign up here.

SQL · Database · Statistics

Published at DZone with permission of Grant Fritchey, DZone MVB. See the original article here.

Opinions expressed by DZone contributors are their own.

Popular on DZone

  • Practice on Pushing Messages to Devices of Different Manufacturers
  • Anypoint CLI Commands in MuleSoft
  • How To Use Open Source Cadence for Polling
  • Caching Across Layers in Software Architecture

Comments

Database Partner Resources

X

ABOUT US

  • About DZone
  • Send feedback
  • Careers
  • Sitemap

ADVERTISE

  • Advertise with DZone

CONTRIBUTE ON DZONE

  • Article Submission Guidelines
  • MVB Program
  • Become a Contributor
  • Visit the Writers' Zone

LEGAL

  • Terms of Service
  • Privacy Policy

CONTACT US

  • 600 Park Offices Drive
  • Suite 300
  • Durham, NC 27709
  • support@dzone.com
  • +1 (919) 678-0300

Let's be friends:

DZone.com is powered by 

AnswerHub logo