An Overview of Adaptive Joins
An Overview of Adaptive Joins
An Adaptive Join helps database performance quite a lot. You can optimizing joins on the fly, with no code or structure changes needed.
Join the DZone community and get the full member experience.Join For Free
MariaDB TX, proven in production and driven by the community, is a complete database solution for any and every enterprise — a modern database for modern applications.
I was surprised to find out that a lot of people hadn't heard about the new join type, Adaptive Join. So, I figured I could do a quick overview.
Adaptive Join Behavior
Currently, the Adaptive Join only works with columnstore indexes, but according to Microsoft, at some point, they will also work with rowstore. The concept is simple. For larger datasets, frequently (but not always... let's not try to cover every possible caveat — it depends, right?), a hash join is much faster than a loops join. For smaller datasets, frequently, a loops join is faster. Wouldn't it be nice if we could change the join type on the fly so that the most effective join is used depending on the data in the query? Ta-da! Enter: the Adaptive Join.
First, the statistics are used at compile time for the tables we're joining. Based on those statistics, a row target it set. Below that threshold, a loops join will be used. Above that threshold, a hash join. The way the row count is determined is that the operator will always build the hash table. With the hash table built and loaded, it will know how many rows it has. If it's going to do a loops join, the hash table is tossed and a loops join commences. If the threshold has been passed on the row counts and it's going to do a hash join, it already has the hash table built and proceeds to do a hash join. It's easy to understand. Let's see it in action.
Adaptive Join in Action
First, let's create a columnstore index inside AdventureWorks:
CREATE NONCLUSTERED COLUMNSTORE INDEX csTest ON Production.TransactionHistory (ProductID, Quantity, ActualCost);
With that in place, we can run a query that looks like this:
DECLARE @quantity INT; SET @quantity = 9726; SELECT p.NAME, Count(th.productid) AS CountProductID, Sum(th.quantity) AS SumQuantity, Avg(th.actualcost) AS AvgActualCost FROM production.transactionhistory AS th JOIN production.product AS p ON p.productid = th.productid WHERE th.quantity = @quantity GROUP BY th.productid, p.NAME; SET @quantity = 1; SELECT p.NAME, Count(th.productid) AS CountProductID, Sum(th.quantity) AS SumQuantity, Avg(th.actualcost) AS AvgActualCost FROM production.transactionhistory AS th JOIN production.product AS p ON p.productid = th.productid WHERE th.quantity = @quantity GROUP BY th.productid, p.NAME;
What we have are two instances of the same query, just passing in two different values to get two different result sets. The first returns one row and the second returns 229 rows. Let's look at the estimated execution plan for the query:
Reading the plan in the logical processing order, the Adaptive Join operator is the first operator on the left after the
SELECT operator (which, by the way, technically isn't an operator, but doesn't really have any other name, so...). The key to understanding how to read an Adaptive Join is to know that the two paths displayed below it represent, going down, the hash join path and the loops join path. In this case, then, the hash join path would have an Index Scan and a Filter operator while the loops path would have just a Clustered Index Seek.
To see the threshold, we just have to go to the properties of the Adaptive Join operator:
In this instance, then, the threshold is 11.4837 rows. Basically, 11 rows or less, you'll get a loops join (and the lower path), 12 rows or greater and you get a hash join (and the upper path).
Looking at an estimated plan, you can't see which path it took. You have to look at the actual plan. So, if we capture the plans for the query above, they look like this:
Just looking at the graphical plan, you only have one clue as to which path may have been taken. In the second lower plan, you can see that the pipes to the top input of the Adaptive Join operator are much thicker than they are in the first plan. That indicates data being moved, so you can assume that was the path. However, to know for sure, you have to look at the properties to see the actual number of rows being processed by the operators. If it's zero for one of the paths, then the other path was chosen. You can't get a situation where there will be values in both paths. It will always use a single path.
There are quite a bit more details to the implantation, but you can now see the basics. An Adaptive Join helps performance quite a lot. The nature of having to build the hash table before the path is picked does mean that the loops join path won't be as fast as a natural loops join, but who cares? We're optimizing joins on the fly now, with no code or structure changes needed. That's pretty cool and very handy.
Published at DZone with permission of Grant Fritchey , DZone MVB. See the original article here.
Opinions expressed by DZone contributors are their own.