Introduction to the Join Ordering Problem
In this article, we define the join ordering problem and walk you through how to estimate the complexity of join planning to ensure good overall performance.
Join the DZone community and get the full member experience.Join For Free
A typical database may execute an SQL query in multiple ways, depending on the selected operators' order and algorithms. One crucial decision is the order in which the optimizer should join relations. The difference between optimal and non-optimal join order might be orders of magnitude. Therefore, the optimizer must choose the proper order of joins to ensure good overall performance. In this blog post, we define the join ordering problem and estimate the complexity of join planning.
Consider the TPC-H schema. The
customer may have
orders. Every order may have several positions defined in the
lineitem table. The
customer table has 150,000 records, the
orders table has 1,500,000 records, and the
lineitem table has 6,000,000 records. Intuitively, every customer places approximately 10 orders, and every order contains four positions on average.
Suppose that we want to retrieve all
lineitem positions for all
orders placed by the given
SELECT lineitem.* FROM customer, orders, lineitem WHERE c_custkey = ? AND c_custkey = o_custkey AND o_orderkey = l_orderkey
Assume that we have a cost model where an operator's cost is proportional to the number of processed tuples.
We consider two different join orders. We can join
orders and then with
lineitem. This join order is very efficient because most customers are filtered early, and we have a tiny intermediate relation.
Alternatively, we can join
lineitem and then with
customer. It produces a large intermediate relation because we map every
lineitem to an
order only to discard most of the produced tuples in the second join.
The two join orders produce plans with very different costs. The first join strategy is highly superior to the second.
A perfect optimizer would need to construct all possible equivalent plans for a given query and choose the best plan. Let's now see how many options the optimizer would need to consider.
We model an n-way join as a sequence of
n-1 two-way joins that form a full binary tree. Leaf nodes are original relations, and internal nodes are join relations. For three relations there are 12 valid join orders:
We count the number of possible join orders for
N relations in two steps. First, we count the number of different orders of leaf nodes. For the first leaf, we choose one of
N relations; for the second leaf, we choose one of the remaining
N-1 relations, etc. This gives us
N! different orders.
Second, we need to calculate the number of all possible shapes of a full binary tree with
N leaves, which is the number of ways of associating
N-1 applications of a binary operator. This number is known to be equal to the Catalan number
C(N-1). Intuitively, for the given fixed order of
N leaf nodes, we need to find the number of ways to set
N-1 pairs of open and close parenthesis. E.g., for the four relations
[a,b,c,d], we have five different parenthesizations:
Multiplying the two parts, we get the final equation:
The number of join orders grows exponentially. For example, for three tables, the number of all possible join plans is
12; for five tables, it is
1,680; for ten tables, it is
17,643,225,600. Practical optimizers use different techniques to ensure the good enough performance of the join enumeration.
Second, optimizers may use various heuristics to limit the search space instead of doing an exhaustive search. A common heuristic is to prune the join orders that yield cross-products. While good enough in the general case, this heuristic may lead to non-optimal plans, e.g., for some star joins. A more aggressive pruning approach is to enumerate only left- or right-deep trees. This significantly reduces planning complexity but degrades the plan quality even further. Probabilistic algorithms might be used (e.g., genetic algorithms or simulated annealing), also without any guarantees on the plan optimality.
In this post, we took a sneak peek at the join ordering problem and got a bird's-eye view of its complexity. In further posts, we will explore the complexity of join order planning for different graph topologies, dive into details of concrete enumeration techniques, and analyze existing and potential strategies of join planning in Apache Calcite. Stay tuned!
Published at DZone with permission of Alexey Goncharuk. See the original article here.
Opinions expressed by DZone contributors are their own.