Less Is More: Why Cost-Based Optimizer?
Find out here!
Less is more.
— Ludwig Mies van der Rohe
This is no truer statement on the goals of a query optimizer. Do less: Less memory, less CPU, less disk, less IO, fewer instructions, fewer partitions, less overflow. Less everything for the query plan it creates. This is the guiding light for SQL and NoSQL optimizer.
You may also like: Cost-Based Optimzer for Couchbase N1QL (SQL for JSON)
In Couchbase 6.5, we announced the cost-based optimizer (CBO-preview) for N1QL in query service. Here, I’ve tried to answer the questions from NoSQL users unfamiliar with the benefits of CBO.
- Why do we need a CBO?
- What are the performance implications without a CBO?
The topic is vast. Answers here are brief and nonexhaustive.
In 2019, when it matters — like getting to your kid’s recital or a ballgame on time — would you use a static direction map that doesn’t account for the traffic? Google Maps’ route optimizer will optimize for time. The optimizers try to come up with a plan to execute the query with the least resources: CPU, memory. Knowing this, why would you accept a static rule (or query shape!) based optimization on your business-critical database workload?
The database optimizer makes decisions. These decisions have major implications on query performance, system throughput and your ability to meet the SLAs. Databases with a better optimizer will make it easier to develop, manage and meet the SLAs.
SQL is the most successful 4th generation language. As a language, it’s extraordinarily flexible, even when the underlying schema isn’t. You can select, join, project any relation (table or intermediate relations) without planning all of the combinations ahead. This benefits app development and data analysis. This article explains how major NoSQL databases have implemented various elements of SQL. Therefore even NoSQL databases have to care about optimization.
With the extraordinary flexibility of a query language, it comes extraordinary responsibility to optimize and run the queries efficiently. Initial implementations of SQL used rule-based optimizers. This leads to the complexity of rules, user-defined optimizer hints and the query plan efficiency issues for complex queries. The cost-based optimizer changed everything.
The CBO optimizer would correctly optimize the query for a variety of data, data skew, and workloads. It’s not an exaggeration to say RDBMS would not have been so successful in handling such rich use cases at such low cost without a cost-based optimizer. The same is true for NoSQL systems with optimizers.
The database optimizer makes decisions. Bad decisions have huge negative performance implications. For real-world workloads, decisions based on statistics are much better than rule-based decisions. Statistics show that!
The optimizer, broadly speaking, does the following:
Rewrite the query to its optimal equivalent form to make the optimization easier. This includes evaluating constant filters, converting joins, subquery flattening, folding the subqueries and more. The type of rewrites depends on the specific capabilities, nuances of the optimizer in subsequent phases.
Select from available indexes or full scan (primary index scan in case of Couchbase) for each keyspace (equivalent to tables). Here we select one or more indexes for each keyspace, decide the predicates (spans) for each scan request, decide whether it’s covering or not.
The purpose is to limit the size of the intermediate result set. JOINS are performed on two keyspaces (tables) at a time. Depending on the type of join, we can change the order without changing the meaning and result of the query. For example, ((t1 INNER JOIN t2) INNER JOIN t3) is the same as ((t3 INNER JOIN t2) INNER JOIN t1). Here, we select the sequence in which the joins are performed. N1QL Optimizer doesn’t reorder the joins yet.
Each query engine is capable of certain types of joins. Couchbase query service and analytics service both support the nested loop(NLJ) and hash join (HJ). For query service, the nested loop is the default and for analytics service, the hash join is the default. Once the join type is chosen, additional decisions have to be made in order within the join. For NLJ, we need to decide which table is the outer table in which one is the inner table. Typically, we want to choose the table (keyspace) with a smaller result set to be the outer table. For HJ, we need to decide which table is the (hash table) build side and the other becomes the probe side of the plan.
There are additional considerations for optimizations (e.g. first-row optimization when the LIMIT clause is specified).
Create Execution Tree
Finally, create the query execution tree (plan) with the operators and the parameter values that represent the decisions in the earlier phases.
SELECT id, address FROM customer WHERE postalcode = 57020;
The same query can operate on a single row, millions of rows or billions of rows. This is possibly as simple as a query gets, but complexity is hiding just beneath the surface. The optimizer may have many options to get to the data.
- A full table scan is always an option. If the customer table has only a few rows fitting in a database page or two, a full table scan may be the most efficient path to get to the data.
- Imagine you had an index on the table.
- CREATE INDEX i1 ON customer (postalcode).
You would think the index path, where you first scan the index to find the rowid of the rows matching the predicate and then get the rows to project the addition columns (id, address) will be the best. Not so fast. What if the table has a million rows and ALL of them had the same postalcode – 57020? Then the index access path is expensive than a table scan.
Now consider a slight modification to the query.
SELECT id, address FROM customer WHERE postalcode = 57020 and yob < 1980;
Consider you have the following indexes:
CREATE INDEX i1 ON customer(postalcode); CREATE INDEX i2 ON customer(yob); CREATE INDEX i3 ON customer(postalcode, yob); CREATE INDEX i4 ON customer(yob, postalcode); CREATE INDEX i5 ON customer(postalcode, id, address); CREATE INDEX i6 ON customer(yob, id, address); CREATE INDEX i7 ON customer(postalcode, yob, id, address); CREATE INDEX i8 ON customer(yob, postalcode, id, address);
The choice of a valid access path for the optimizer will be:
- Each index i1 through i8 is a valid access path.
- A table scan is always an option.
- Multiple indexes combined.
Suddenly, it’s not easy to choose the best index for the query — even for this simple query. A rule-based optimizer keeps a set of rules and follows those rules consistently to come up with the best plan. The set of rules followed by the N1QL rule-based optimizer is well documented.
These rules were not set in stone from day one. You start preferring index paths, indexes with most keys, etc. Even then you’ll have conflicts.
Query: SELECT id, address FROM customer WHERE postalcode = 57020 and yob < 1980;
CREATE INDEX i7 ON customer(postalcode, yob, id, address); CREATE INDEX i8 ON customer(yob, postalcode, id, address);
A rule-based optimizer can’t figure out which of these indexes is the most efficient. It all comes down to data skew: The index selection in one database won’t be optimal in another database.
SELECT c.state, d.status, SUM(o.sale_amt) FROM order o INNER JOIN customer c ON (o.cid = c.id) INNER JOIN demo d ON (c.did = d.did) WHERE d.edu = “college” AND d.mstatus = “married” GROUP BY c.state, d.status
In the given FROM clause, all the following orders are valid. Which one of them should the optimizer choose?
- ((order INNER JOIN customer) INNER JOIN demo).
- ((customer INNER JOIN order) INNER JOIN demo).
- ((order INNER JOIN demo) INNER JOIN customer).
- ((customer INNER JOIN demo) INNER JOIN order).
- ((demo INNER JOIN order) INNER JOIN customer).
- ((demo INNER JOIN customer) INNER JOIN order).
Choices increase and selection becomes more complicated as the number of keyspaces (or tables) increases in the FROM clause. Having the wrong order means the intermediate results could be huge, only to discard most of that later on. For example, in the query above, joining order with the customer first will create the huge intermediate result set because we’re only interested in “college” educated, married customers. A bad join order negatively affects both the latency of your query and the throughput of the system
SELECT c.state, c.zip, SUM(sale_amt) FROM order o INNER JOIN customer c ON (o.cid = c.id) WHERE o.year = “2018” GROUP BY state, zip;
Two decisions to be made here. JOIN type and order of the tables. Without knowing the statistics on each, it’s impossible to decide intelligently. Hence, rule-based optimizers will simply default to one method and depend on the user to change from the default. This is inefficient and infeasible for large queries. The performance implications of this are huge — from seconds to minutes or from minutes to hours.
Again, the statistical estimates come to the rescue. In enterprise applications, queries with many keyspaces (tables) and complex predicates are common.
For real-world workloads, decisions based on statistics are much better than rule-based decisions. Period. That’s the reason N1QL implemented the cost-based optimizer. Download Couchbase 6.5 now and try it.
And before you decide on a NoSQL database, ask the vendor: Do you have a cost-based optimizer?
- The Unreasonable Effectiveness of SQL in NoSQL Databases: A Comparative Study. https://blog.couchbase.com/the-unreasonable-effectiveness-of-sql-in-nosql-databases/
- The Unreasonable Effectiveness of SQL https://blog.couchbase.com/unreasonable-effectiveness-of-sql/
- Download Couchbase 6.5: https://couchbase.com/downloads?family=server&product=couchbase-server-developer
- An Overview of Query Optimization in Relational Systems. https://cs.stanford.edu/people/chrismre/cs345/rl/chaudhuri98.pdf
Approaches to Query Optimization in NoSQL
Cool SQL Optimizations That Do Not Depend on the Cost Model (Part 1)