Oh, the things you read on the Internet. For example, “Don’t use a sub-query because that hurts performance.”
Where do people get these things?
Let’s Test It
I’ve written before about the concept of cargo cult data professionals. They see one issue, one time, and consequently extrapolate that to all issues, all the time. It’s the best explanation I have for why someone would suggest that a sub-query is flat out wrong and will hurt performance.
Let me put a caveat up front (which I will reiterate in the conclusion, just so we’re clear), there’s nothing magically good about sub-queries just like there is nothing magically evil about sub-queries. You can absolutely write a sub-query that performs horribly, does horrible things, runs badly, and therefore absolutely screws up your system. Just as you can with any kind of query. I am addressing the bad advice that a sub-query is to be avoided because they will inherently lead to poor performance.
Let’s start with a simple test, just to validate the concept of how a sub-query performs within SQL Server:
SELECT sd.OrderQty, pr.Name FROM (SELECT * FROM Sales.SalesOrderDetail AS sod ) AS sd JOIN (SELECT * FROM Production.Product AS p ) AS pr ON pr.ProductID = sd.ProductID WHERE sd.SalesOrderID = 52777; SELECT sod.OrderQty, p.Name FROM Sales.SalesOrderDetail AS sod JOIN Production.Product AS p ON p.ProductID = sod.ProductID WHERE sod.SalesOrderID = 52777;
If there is something inherently wrong with a sub-query, then there is something twice as wrong with two sub-queries. Here are the resulting execution plans:
Huh. Looks sort of, I don’t know, almost identical. Let’s compare the plans using the new SSMS plan comparison utility:
Well, darn. Displayed in pink are the common sets of operations between the two plans. In other words, for these plans, everything except the properties of the SELECT operator are exactly the same. Let’s take a look at those properties:
OK. Now we have some interesting differences and some especially interesting similarities. Let’s start with the similarities. First of all, we have exactly the same QueryPlanHash value in both plans. In addition, we also have identical estimated rows and costs. In short, the optimizer created two identical execution plans. Now, this is where things get a little bit interesting. See, the optimizer actually worked a little harder to create the first plan than the second. It took an extra tic on the CPU and just a little more CompileMemory and CompileTime. Interesting.
What about execution times? With a few runs on average, the execution times were identical at about 149mc with 11 reads. However, running a query once or twice isn’t testing. Let’s get a few thousand runs of both queries. The average results from the Extended Events sql_batch_completed event were 75.9 microseconds for both queries.
However, what about that extra little bit of compile time in the query that used sub-queries? Let’s add in a statement to free the procedure cache on each run and retry the queries. There is a measurable difference now:
More work is done by the optimizer on the sub-query to compile the same execution plan. We’re adding work to the optimizer, requiring it to unpack the, admittedly, silly query written above. When we refer only to the compile time and not the execution time, there is a performance hit. Once the query is compiled, the performance is identical. Whether or not you get a performance hit from a sub-query then, in part, depends on the degree to which you’re experiencing compiles or recompiles. Without the recompile, there is no performance hit. At least in this example.
Let’s Test It Again, Harder
I firmly believe in the old adage; if you ain’t cheatin’, you ain’t fightin’. It’s time to put the boot in.
Let’s go with much more interesting queries that are more likely to be written than the silly example above. Let’s assume some versioned data like in this article on Simple-Talk. We could express a query to bring back a single version of one of the documents in one of three ways from the article. We’re just going to mess with two of them. One that uses a sub-query, and one that does not:
--no sub-query SELECT TOP 1 d.DocumentName, d.DocumentID, v.VersionDescription, v.VersionID, ROW_NUMBER() OVER (ORDER BY v.VersionID DESC) AS RowNum FROM dbo.Document d JOIN dbo.Version v ON d.DocumentID = v.DocumentID WHERE d.DocumentID = 9729; --sub-query SELECT d.[DocumentName], d.[DocumentId], v.[VersionDescription], v.[VersionId] FROM dbo.[Document] d CROSS APPLY (SELECT TOP (1) v2.VersionId, v2.VersionDescription FROM dbo.[Version] v2 WHERE v2.DocumentId = d.DocumentId ORDER BY v2.DocumentId, v2.VersionId DESC ) v WHERE d.[DocumentId] = 9729;
As per usual, we can run these once and compare results, but that’s not really meaningful. We’ll run them thousands of times. Also, to be sure we’re comparing apples to apples, we’ll force a recompile on every run, just like in the first set of tests. The results this time:
You’ll note that, even with the compile on each execution, the query using a sub-query actually out-performed the query that was not using a sub-query. The results are even more dramatic when we take away the compile time:
We can also look to the execution plans to get an understanding of how these queries are being resolved:
The plan on top is the sub-query plan, and the plan on the bottom is the plan for just the plain query. You can see that the regular query is doing a lot more work to arrive at an identical set of data. The differences are visible in the average execution time, about a 20% improvement.
You could argue that we’re comparing two completely different queries, but that’s not true. Both queries return exactly the same result set. It just so happens that the query using the sub-query performs better overall in this instance. In short, there’s no reason to be scared of using a sub-query.
Is it possible for you to write horrid code inside of a sub-query that seriously negatively impacts performance? Yes. Absolutely. I’m not arguing that you can’t screw up your system with poor coding practices. You absolutely can. The query optimization process within SQL Server deals well with common coding practices. Therefore, the queries you write can be fairly sophisticated before, by nature of that sophistication, you begin to get serious performance degradation.
You need to have a method of validation for some of what you read on the internet. People should provide both the queries they are testing with and the numbers that their tests showed. If you’re just seeing completely unsupported, wildly egregious statements, they’re probably not true.
In conclusion, it’s safe to use sub-queries. Just be careful with them.