Determining the Cost Threshold for Parallelism

If you want to know when your parallel processing should begin, there's a bit of math involved, but working out your standard deviations can make your choices much easier.

Grant Fritchey · Mar. 01, 17 · Tutorial

In the previous post, I showed how you can get full listings of your execution plan costs. Knowing the values you're dealing with for the estimated costs on your execution plans can help you determine what the cost threshold on your system should be. However, we don't want to just take the average and use that. You need to understand the data you're looking at. Let's explore this just a little using R.

Mean, Median, Range, and Standard Deviation

I've used the queries in the previous blog post to generate a full listing of costs for my plans. With that, I can start to query the information. Here's how I could use R to begin to explore the data:

library("rodbc", lib.loc="~/r/win-library/3.2")

query <- "select * from dbo.querycost;"

dbhandle <-
  odbcdriverconnect(
    'driver={sql server};server=win-3srg45gbf97\\dojo;database=adventureworks2014;trusted_connection=true'
  )
data <- sqlquery(dbhandle,query)

##data

mean(data$estimatedcost)

median(sort(data$estimatedcost))

maxcost = max(data$estimatedcost)
mincost = min(data$estimatedcost)
costrange = maxcost - mincost
costrange

sd(data$estimatedcost)


The mean function is going to get me my average value, which, in this case, is 0.8755985. If I just accept the average as a starting point for determining my cost threshold for parallelism, I guess I can just leave it at the default value of 5 and feel quite comfortable. This is further supported by the median value of 0.0544886 from my data. However, let's check out the costrange value. Knowing an average (the mean), or even the median (literally, the middle number of the set), doesn't give you an indication of just how spread out the data is. My costrange, the max minus the min, comes out to 165.567. In other words, there is a pretty wide variation in costs and suddenly, I'm less convinced that I know what my cost threshold should be.

The next value that matters is the standard deviation. This gives you an idea of how spread out your data is. I'm not going to get into explaining the math behind it. My standard deviation value is 8.301819. With this, I know that a pretty healthy chunk of all my values fall below an estimated cost of roughly 9, since a single standard deviation of 8.301819 on top of my average value of 0.8755985 comes to about 9.18.

With this knowledge, I can start to make informed choices. I'm not relying simply on an average. I can begin to think through the process using statistics.
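
As a rough sketch of that reasoning, you can turn the mean and standard deviation into candidate threshold values directly in R. The variable names here are my own, not from the original queries, and assume the same data frame as above:

avgcost <- mean(data$estimatedcost)   # roughly 0.88 for this data set
sdcost <- sd(data$estimatedcost)      # roughly 8.30 for this data set

# candidate thresholds at one and two standard deviations above the mean
avgcost + sdcost
avgcost + 2 * sdcost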

Just to help out with the thought process, let's plot the values too.

Histogram

My first thought for any kind of data is statistics, so let's see what a histogram would look like. This is really easy to do using R:

hist(data$estimatedcost)


The output looks like this:

[Figure: histogram of the estimated plan costs]

Clearly, this doesn't give me enough to work on. Most of my data, nearly 1,500 distinct values, is at one end of the distribution, and all the rest is elsewhere. I can't use this to judge any kind of decision around my cost threshold.
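
If you still want more detail from a histogram, one option (a sketch, not part of the original workflow) is to increase the number of bins or to look at the costs on a log scale, which spreads out the heavily skewed low end:

# more bins exposes a little more structure at the low end
hist(data$estimatedcost, breaks = 100)

# log1p handles any zero-cost plans without producing -Inf
hist(log1p(data$estimatedcost), breaks = 50)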

Scatter Plot

The histogram isn't telling me enough, so let's try throwing the data into a scatter plot. Again, this is silly easy in R:

plot(data$estimatedcost)


The output is a lot more useful:

[Figure: scatter plot of the estimated plan costs]

Now I can begin to visually see what the standard deviation value was telling me. The vast majority of my costs are well below two standard deviations, or approximately 16. However, let's clean up the data just a little bit and make this as clear as we can.
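
Before doing that, as a quick aside, you could redraw the scatter plot with reference lines at one and two standard deviations above the mean to make the comparison explicit. This is just a sketch layered on top of the plot above:

plot(data$estimatedcost)
# reference lines at the mean plus one and two standard deviations
abline(h = mean(data$estimatedcost) + sd(data$estimatedcost), col = 'blue', lty = 2)
abline(h = mean(data$estimatedcost) + 2 * sd(data$estimatedcost), col = 'red', lty = 2)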

Density Plus Values

Instead of just plotting the values, let's get the density, or more precisely, a kernel density estimation, basically a smooth graph of the distribution of the data, and plot that:

plot(density(data$estimatedcost))
rug(data$estimatedcost,col='red')


I went ahead and added the values down below so that you can see how the distribution goes as well as showing the smooth curve:

[Figure: kernel density plot of the estimated plan costs, with a rug of the individual values]

That one pretty much tells the tale. The vast majority of the values are clumped up at one end, along with a scattering of cost estimates above the value of 5, but not by huge margins.
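
If you want to zoom in on the interesting part of that curve, a quick variation (my own sketch, not from the original post) is to limit the x-axis and drop a reference line at the default cost threshold of 5:

plot(density(data$estimatedcost), xlim = c(0, 20))
rug(data$estimatedcost, col = 'red')
# reference line at the default cost threshold for parallelism
abline(v = 5, col = 'blue', lty = 2)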

Conclusion

With the standard deviation in hand, and a quick rule of thumb that says roughly 95% of all values in a normal distribution fall within two standard deviations of the mean, I can determine that a value of 16 on my cost threshold for parallelism is going to cover most cases, and will ensure that only a small percentage of queries go parallel on my system, but that those which do go parallel are actually costly queries, not some that just fall outside the default value of 5.
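
Given how skewed the data is, it's also worth checking that conclusion empirically rather than relying on the normal-distribution rule of thumb. A quick sketch (the threshold values of 5 and 16 are just the ones discussed above):

# proportion of plans that would exceed each candidate threshold
mean(data$estimatedcost > 5)
mean(data$estimatedcost > 16)

# upper percentiles of the cost distribution
quantile(data$estimatedcost, probs = c(0.90, 0.95, 0.99))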

I've made a couple of assumptions that are not completely held up by the data. Using the two, or even three, standard deviations to cover just enough of the data isn't actually supported in this case because I don't have a normal distribution of data. In fact, the distribution here is quite heavily skewed to one end of the chart. There's also no data on the frequency of these calls. You may want to add that into your plans for setting your cost threshold.

However, using the math that I can quickly take advantage of, and the ability to plot out the data, I can, with a much higher degree of confidence, make choices on how I want my cost threshold for parallelism to be set on my servers. From there, I measure the effects of my choices and adjust as necessary.


Published at DZone with permission of Grant Fritchey, DZone MVB. See the original article here.
