Over a million developers have joined DZone.
{{announcement.body}}
{{announcement.title}}

Cool SQL Optimizations That Do Not Depend on the Cost Model (Part 1)

DZone's Guide to

Cool SQL Optimizations That Do Not Depend on the Cost Model (Part 1)

Learn about five simple optimizations that can be implemented purely based on metadata (i.e. constraints) and the query itself.

· Database Zone
Free Resource

Learn how to create flexible schemas in a relational database using SQL for JSON.

Cost-based optimization is the de-facto standard way to optimize SQL queries in most modern databases. It is the reason why it is really hard to implement a complex, hand-written algorithm in a 3GL (third-generation programming language) such as Java that outperforms a dynamically calculated database execution plan that has been generated from a modern optimizer. I’ve recently delivered a talk about that topic:

Today, we don’t want to talk about cost-based optimization, i.e. optimizations that depend on a database’s cost model. We’ll look into much simpler optimizations that can be implemented purely based on metadata (i.e. constraints) and the query itself. They’re usually no-brainers for a database to optimize because the optimization will always lead to a better execution plan, independently of whether there are any indexes, or how much data you have, or how skewed your data distribution is.

So, they’re not no-brainers in the sense whether they’re easy for the optimiser teams to implement, but they’re no-brainers in the sense whether they should be done.

These optimizations remove needless, optional work (as opposed to needless, mandatory work, which I’ve blogged about before)

Where Do These Optimizations Apply?

Most of these optimizations are applied to:

  • Fix mistakes in queries
  • Allow for reusing complex views without actually executing the entire logic from the view

In the first case, you could claim: “Well, then fix the stupid SQL already,” but then again, who never makes any mistakes, right?

Specifically, the second case is really cool, as these optimizations allow us to build complex libraries of views and table-valued functions, which we can reuse in several layers.

Databases Being Used

This post will evaluate 10 SQL optimizations on the five most popular RDBMS (according to the db-engines ranking):

  • Oracle 12.2
  • MySQL 8.0.2
  • SQL Server 2014
  • PostgreSQL 9.6
  • DB2 LUW 10.5

In all of this article, I will be using queries against the Sakila database — as always.

Sakila database

These will be the ten optimization types:

  1. Transitive closure
  2. Impossible predicates and unneeded table accesses
  3. JOIN elimination
  4. Removing “silly” predicates
  5. Projections in EXISTS subqueries
  6. Predicate merging
  7. Provably empty sets
  8. CHECK constraints
  9. Unneeded self JOIN
  10. Predicate pushdown

We'll talk about 1-5 today, and 6-10 in Part 2.

1. Transitive Closure

Let’s start with something simple: transitive closure. It’s a really trivial concept that applies to a variety of math operations, i.e. to the equality operator. It can be said that if A = B and B = C, then A = C.

Duh, right? But this has some nice implications on SQL optimizers.

Let’s look at an example. Let’s get all films for ACTOR_ID = 1:

SELECT first_name, last_name, film_id
FROM actor a
JOIN film_actor fa ON a.actor_id = fa.actor_id
WHERE a.actor_id = 1;

The result being:

FIRST_NAME      LAST_NAME  FILM_ID
PENELOPE        GUINESS    1
PENELOPE        GUINESS    23
PENELOPE        GUINESS    25
PENELOPE        GUINESS    106
PENELOPE        GUINESS    140
PENELOPE        GUINESS    166
...

Now, observe the execution plan if we run this query in Oracle:

--------------------------------------------------------------
| Id  | Operation                    | Name          | Rows  |
--------------------------------------------------------------
|   0 | SELECT STATEMENT             |               |       |
|   1 |  NESTED LOOPS                |               |    19 |
|   2 |   TABLE ACCESS BY INDEX ROWID| ACTOR         |     1 |
|*  3 |    INDEX UNIQUE SCAN         | PK_ACTOR      |     1 |
|*  4 |   INDEX RANGE SCAN           | PK_FILM_ACTOR |    19 |
--------------------------------------------------------------

Predicate Information (identified by operation id):
---------------------------------------------------

   3 - access("A"."ACTOR_ID"=1)
   4 - access("FA"."ACTOR_ID"=1)

Specifically, the predicate section is really interesting. The predicate ACTOR_ID = 1 is applied to both the ACTOR and FILM_ACTOR tables because of transitive closure. If:

  • A.ACTOR_ID = 1 (from the WHERE predicate) and…
  • A.ACTOR_ID = FA.ACTOR_ID (from the ON predicate)

Then:

  • FA.ACTOR_ID = 1

This has a few nice effects on more complex queries. In particular, the cardinality estimates will be much more precise this way, as we can pick the estimate based on a concrete, constant predicate value, rather than, for example, the average number of films per actor as in this query (which returns the same result):

SELECT first_name, last_name, film_id
FROM actor a
JOIN film_actor fa ON a.actor_id = fa.actor_id
WHERE first_name = 'PENELOPE'
AND last_name = 'GUINESS'

The plan being:

----------------------------------------------------------------------------
| Id  | Operation                            | Name                | Rows  |
----------------------------------------------------------------------------
|   0 | SELECT STATEMENT                     |                     |       |
|   1 |  NESTED LOOPS                        |                     |     2 |
|*  2 |   TABLE ACCESS BY INDEX ROWID BATCHED| ACTOR               |     1 |
|*  3 |    INDEX RANGE SCAN                  | IDX_ACTOR_LAST_NAME |     3 |
|*  4 |   INDEX RANGE SCAN                   | PK_FILM_ACTOR       |    27 |
----------------------------------------------------------------------------

Predicate Information (identified by operation id):
---------------------------------------------------

   2 - filter("A"."FIRST_NAME"='PENELOPE')
   3 - access("A"."LAST_NAME"='GUINESS')
   4 - access("A"."ACTOR_ID"="FA"."ACTOR_ID")

As you can see, the estimate for the number of of FILM_ACTOR rows is too high, and the estimate for the NESTED LOOP result is too low. Here are some interesting numbers:

SELECT count(*) FROM film_actor WHERE actor_id = 1;

SELECT avg(c) FROM (
  SELECT count(*) c FROM film_actor GROUP BY actor_id
);

Resulting in:

19
27.315

That’s where those estimates come from. If the database knows we’re dealing with ACTOR_ID = 1, it can pick the statistics on the number of films for that actor. If it doesn’t know this (because our standard statistics don’t correlate FIRST_NAME / LAST_NAME with ACTOR_ID), then we get the average number of films for any actor. Simple, insignificant error in this particular case, but when that error propagates in a complex query, it can add up and lead to the wrong choice of JOIN down the line (or up the plan).

So, when you can, always design JOIN and ordinary predicates to profit from transitive closure.

What other databases support this?

DB2

Yes!

Explain Plan                                               
-----------------------------------------------------------
ID | Operation              |                 Rows | Cost  
 1 | RETURN                 |                      |   13  
 2 |  NLJOIN                |              27 of 1 |   13  
 3 |   FETCH ACTOR          |     1 of 1 (100.00%) |    6  
 4 |    IXSCAN PK_ACTOR     |   1 of 200 (   .50%) |    0  
 5 |   IXSCAN PK_FILM_ACTOR | 27 of 5462 (   .49%) |    6  

Predicate Information                                      
 4 - START (Q2.ACTOR_ID = 1)                               
      STOP (Q2.ACTOR_ID = 1)                               
 5 - START (1 = Q1.ACTOR_ID)                               
      STOP (1 = Q1.ACTOR_ID)                               

Btw, want cool execution plans like the above on DB2 LUW? Go visit Markus Winand’s script.

MySQL

Unfortunately, MySQL explain plans are not very useful for such analyses. We don’t really see the predicate itself in this output:

ID  SELECT TYPE  TABLE  TYPE   REF    ROWS
------------------------------------------
1   SIMPLE       a      const  const  1 
1   SIMPLE       fa     ref    const  19

But the fact that the REF column is two times “const” indicates that we’re scanning for a constant value in both tables. Conversely, the plan that queries for FIRST_NAME / LAST_NAME looks like this:

ID  SELECT TYPE  TABLE  TYPE   REF         ROWS
-----------------------------------------------
1   SIMPLE       a      ref    const       3 
1   SIMPLE       fa     ref    a.actor_id  27

And you can see that REF has now switched to a column reference from the JOIN predicate. The cardinality estimate is now almost the same as in Oracle.

So, yes, MySQL supports transitive closure, too.

PostgreSQL

Yes!

QUERY PLAN                                                                          
------------------------------------------------------------------------------------
Nested Loop  (cost=4.49..40.24 rows=27 width=15)                                    
  ->  Seq Scan on actor a  (cost=0.00..4.50 rows=1 width=17)                        
        Filter: (actor_id = 1)                                                      
  ->  Bitmap Heap Scan on film_actor fa  (cost=4.49..35.47 rows=27 width=4)         
        Recheck Cond: (actor_id = 1)                                                
        ->  Bitmap Index Scan on film_actor_pkey  (cost=0.00..4.48 rows=27 width=0) 
              Index Cond: (actor_id = 1)                                            

SQL Server

Yes!

  |--Nested Loops(Inner Join)
       |--Nested Loops(Inner Join)
       |    |--Index Seek (SEEK:([a].[actor_id]=(1)))
       |    |--RID Lookup
       |--Index Seek (SEEK:([fa].[actor_id]=(1)))

Summary

All databases can do transitive closure:

Database Transitive closure
DB2 LUW 10.5 Yep
MySQL 8.0.2 Yep
Oracle 12.2.0.1 Yep
PostgreSQL 9.6 Yep
SQL Server 2014 Yep

Stay tuned, though, for #6 in the next post. There are more complex cases of transitive closure, where not all databases get it right.

2. Impossible Predicates and Unneeded Table Accesses

This optimization is really silly, but hey, why not? If users write impossible predicates, then why even execute them? Here are some examples:

-- "Obvious"
SELECT * FROM actor WHERE 1 = 0

-- "Subtle"
SELECT * FROM actor WHERE NULL = NULL

The first query should obviously never return any results, but the same is true for the second one — because while NULL IS NULL yields TRUE, always, NULL = NULL evaluates to NULL, which has the same effect as FALSE according to three-valued logic.

This doesn’t need much explanation, so let’s immediately jump to see which databases optimize this.

DB2

Yes!

Explain Plan                       
-----------------------------------
ID | Operation      |   Rows | Cost
 1 | RETURN         |        |    0
 2 |  TBSCAN GENROW | 0 of 0 |    0

As you can see, the table access to the ACTOR table is completely eliminated from the plan. There’s only a GENROW operation, which generates zero rows. Perfect.

MySQL

Yes!

ID  SELECT TYPE  TABLE   EXTRAS
-----------------------------------------
1   SIMPLE         Impossible WHERE

This time, MySQL has been so kind to indicate that the WHERE clause is impossible. Thanks! That’s helpful when analyzing — more so than the other databases.

Oracle

Yes!

---------------------------------------------------------------
| Id  | Operation          | Name  | Starts | E-Rows | A-Rows |
---------------------------------------------------------------
|   0 | SELECT STATEMENT   |       |      1 |        |      0 |
|*  1 |  FILTER            |       |      1 |        |      0 |
|   2 |   TABLE ACCESS FULL| ACTOR |      0 |    200 |      0 |
---------------------------------------------------------------

Predicate Information (identified by operation id):
---------------------------------------------------

   1 - filter(NULL IS NOT NULL)

Now, observe that the plan still shows the table access to the ACTOR table, and the estimated number of rows is still 200, but there’s a FILTER operation on Id=1, which can never be true. Because Oracle really really doesn’t like the SQL standard BOOLEAN type, they display NULL IS NOT NULL in the plan, rather than simply FALSE. Oh well

But seriously, do check for this predicate. I’ve been debugging 1,000-line-long execution plan subtrees with super high costs before noticing that the entire subtree was “cut off” by NULL IS NOT NULL. A bit misleading, if you ask me.

PostgreSQL

Yes!

QUERY PLAN                                 
-------------------------------------------
Result  (cost=0.00..0.00 rows=0 width=228) 
  One-Time Filter: false                   

That’s nicer. No noisy ACTOR access and a nice little FALSE predicate.

SQL Server

Yes!

  |--Constant Scan

SQL Server calls this a “constant scan,” i.e. a scan where nothing happens — just like DB2.

All databases can eliminate impossible predicates:

Database Impossible predicates Unneeded table access
DB2 LUW 10.5 Yep Yep
MySQL 8.0.2 Yep Yep
Oracle 12.2.0.1 Yep Yep
PostgreSQL 9.6 Yep Yep
SQL Server 2014 Yep Yep

3. JOIN Elimination

In the previous section, we’ve seen unneeded table access for single table queries. But what happens if one out of several table accesses is unneeded in a JOIN?

I’ve already blogged about JOIN elimination in a previous blog post. SQL engines can determine, based on the way a query is written, and based on the presence of PRIMARY KEYs and FOREIGN KEYs, whether any given JOIN is really required in a query, or whether it could be eliminated without affecting the semantics of the query.

In all of the following three examples, the JOIN is unnecessary.

To-one INNER JOINs can be removed if there’s a NOT NULL FOREIGN KEY:

Instead of this:

SELECT first_name, last_name
FROM customer c
JOIN address a ON c.address_id = a.address_id

The database can run this:

SELECT first_name, last_name
FROM customer c

To-one INNER JOINs can be replaced if there’s a nullable FOREIGN KEY:

The above works if there’s also a NOT NULL constraint on the FOREIGN KEY. If there isn’t, for example, as in this query:

SELECT title
FROM film f
JOIN language l ON f.original_language_id = l.language_id

The JOIN can still be eliminated, but there needs to be a replacement NOT NULLpredicate, as such:

SELECT title
FROM film
WHERE original_language_id IS NOT NULL

To-one OUTER JOINs can be removed if there’s a UNIQUE KEY:

Instead of this:

SELECT first_name, last_name
FROM customer c
LEFT JOIN address a ON c.address_id = a.address_id

The database can again run this:

SELECT first_name, last_name
FROM customer c

…even if there is no FOREIGN KEY on CUSTOMER.ADDRESS_ID.

Tto-many DISTINCT OUTER JOINs  can be removed:

Instead of this:

SELECT DISTINCT first_name, last_name
FROM actor a
LEFT JOIN film_actor fa ON a.actor_id = fa.actor_id

The database can run this:

SELECT DISTINCT first_name, last_name
FROM actor a

All of these examples are explained in detail in the previous article, so I’m not going to repeat it, but here’s again the summary of what each database can eliminate:

Database INNER JOIN:
to-one
INNER JOIN nullable:
to-one
OUTER JOIN:
to-one
OUTER JOIN DISTINCT:
to-many
DB2 LUW 10.5 Yep Yep Yep Yep
MySQL 8.0.2 Nope Nope Nope Nope
Oracle 12.2.0.1 Yep Yep Yep Nope
PostgreSQL 9.6 Nope Nope Yep Nope
SQL Server 2014 Yep Nope Yep Yep

Unfortunately, not all databases can eliminate all joins. DB2 and SQL Server are the clear winners here!

4. Removing “Silly” Predicates

Equally silly are predicates that are (almost) always true. As you can imagine, if you search for:

SELECT * FROM actor WHERE 1 = 1;

...then databases will not actually evaluate the predicate but ignore it. This was a recent Stack Overflow question that I’ve answered, which actually gave me the idea to write this blog post.

I’ll leave it to you to check this, but what happens if the predicate is just slightly less silly? For example:

SELECT * FROM film WHERE release_year = release_year;

Do we actually have to compare the value with itself on each row? No, there’s no value where this can be FALSE, right? Right. But we still have to do a check. While the predicate can never be FALSE, it can totally be NULL, again because of three-valued logic. The RELEASE_YEAR column is a nullable column, and if RELEASE_YEAR IS NULL for any given row, then NULL = NULL yields NULL, and the row must be excluded.

So, the query is transformed into this:

SELECT * FROM film WHERE release_year IS NOT NULL;

Which databases do this?

DB2

Yes!

Explain Plan                                     
-------------------------------------------------
ID | Operation    |                   Rows | Cost
 1 | RETURN       |                        |   49
 2 |  TBSCAN FILM | 1000 of 1000 (100.00%) |   49

Predicate Information                            
 2 - SARG Q1.RELEASE_YEAR IS NOT NULL            

MySQL

Very regrettably, again, because MySQL doesn’t display predicates in their execution plans, it’s a bit hard to find out whether MySQL performs this particular optimization. We could benchmark things and see if some really big string comparisons are executed or not. Or, we add an index:

CREATE INDEX i_release_year ON film (release_year);

And get the plans for these queries instead:

SELECT * FROM film WHERE release_year = release_year;
SELECT * FROM film WHERE release_year IS NOT NULL;

If the optimization works, then both queries should produce exactly the same plan. But they don’t in this case:

ID  TABLE  POSSIBLE_KEYS   ROWS  FILTERED  EXTRA
------------------------------------------------------
1   film             1000  10.00           Using where

ID  TABLE  POSSIBLE_KEYS   ROWS  FILTERED  EXTRA
------------------------------------------------------
1   film   i_release_year  1000  100.00    Using where

As you can see, the two queries differ substantially in that the POSSIBLE_KEYS and FILTERED columns yield different values. I’m making an educated guess and say MySQL does not optimise this.

Oracle

Yes!

----------------------------------------------------
| Id  | Operation         | Name | Starts | E-Rows |
----------------------------------------------------
|   0 | SELECT STATEMENT  |      |      1 |        |
|*  1 |  TABLE ACCESS FULL| FILM |      1 |   1000 |
----------------------------------------------------

Predicate Information (identified by operation id):
---------------------------------------------------

   1 - filter("RELEASE_YEAR" IS NOT NULL)

PostgreSQL

Disappointingly, no!

QUERY PLAN                                                    
--------------------------------------------------------------
Seq Scan on film  (cost=0.00..67.50 rows=5 width=386)         
  Filter: ((release_year)::integer = (release_year)::integer) 

The plans and the costs are different. Specifically, observe the cardinality estimate, which is totally off, when this predicate:

SELECT * FROM film WHERE release_year IS NOT NULL;

…yields much better results:

QUERY PLAN                                               
---------------------------------------------------------
Seq Scan on film  (cost=0.00..65.00 rows=1000 width=386) 
  Filter: (release_year IS NOT NULL)                     

Bummer!

SQL Server

Surprisingly, also SQL Server doesn’t seem to do this:

  |--Table Scan(OBJECT:([film]), WHERE:([release_year]=[release_year]))

However, the cardinality estimate is correct when looking at the visual plan, and the costs are all correct as well. From what I’ve seen in the past on SQL Server, though, I’m going to say that in this case, the optimization is not taking place because SQL server would display the actually executed predicate in the plan (look at the CHECK constraint examples below to see why).

What about “silly” predicates on NOT NULL columns?

The above transformation was only needed, because RELEASE_YEAR is a nullable column. What if we did the same silly query with, for example, FILM_ID?

SELECT * FROM film WHERE film_id = film_id

This is now the same as not putting a predicate at all. Or at least it should be. Is it, though?

DB2

Yes!

Explain Plan                                     
-------------------------------------------------
ID | Operation    |                   Rows | Cost
 1 | RETURN       |                        |   49
 2 |  TBSCAN FILM | 1000 of 1000 (100.00%) |   49

No predicate is applied at all, and we’re selecting all the films.

MySQL

Yes! (Educated guess, again.)

ID  TABLE  POSSIBLE_KEYS   ROWS  FILTERED  EXTRA
------------------------------------------------------
1   film                   1000  100.00

Observe how now the EXTRA column is empty as if we didn’t have any WHERE clause!

Oracle

Yes!

----------------------------------------------------
| Id  | Operation         | Name | Starts | E-Rows |
----------------------------------------------------
|   0 | SELECT STATEMENT  |      |      1 |        |
|   1 |  TABLE ACCESS FULL| FILM |      1 |   1000 |
----------------------------------------------------

Again, no predicates are applied.

PostgreSQL

Gee, still no!

QUERY PLAN                                            
------------------------------------------------------
Seq Scan on film  (cost=0.00..67.50 rows=5 width=386) 
  Filter: (film_id = film_id)                         

The filter is applied and the cardinality estimate is still 5. Bummer!

SQL Server

Also, still no!

  |--Table Scan(OBJECT:([film]), WHERE:([film_id]=[film_id]))

Summary

This appears like a simple optimization, but it is not applied in all databases, surprisingly not in SQL Server!

Database Silly but needed predicates
(NULL semantics)
Silly unneeded predicates
(no NULL semantics)
DB2 LUW 10.5 Yep Yep
MySQL 8.0.2 Nope Yep
Oracle 12.2.0.1 Yep Yep
PostgreSQL 9.6 Nope Nope
SQL Server 2014 Nope Nope

5. Projections in EXISTS Subqueries

Interestingly, this one, I get asked all the time in my SQL Masterclass where I advocate that SELECT * is mostly bad.

The question, then, is, Is it okay to use SELECT * in an EXISTS subquery? For instance, if we wanted to find actors who have played in films...

SELECT first_name, last_name
FROM actor a
WHERE EXISTS (
  SELECT * -- Is this OK?
  FROM film_actor fa
  WHERE a.actor_id = fa.actor_id
)

And the answer is... yes. It is okay. The asterisk has no impact on the query. How can we “prove” this? Consider the following query:

-- DB2
SELECT 1 / 0 FROM sysibm.dual

-- Oracle
SELECT 1 / 0 FROM dual

-- PostgreSQL, SQL Server
SELECT 1 / 0

-- MySQL
SELECT pow(-1, 0.5);

All databases report a division by zero error. Note that interestingly, in MySQL, dividing by zero yields NULL, not an error, so we’re doing something else that’s illegal.

Now, what happens if we do this, instead?

-- DB2
SELECT CASE WHEN EXISTS (
  SELECT 1 / 0 FROM sysibm.dual
) THEN 1 ELSE 0 END
FROM sysibm.dual

-- Oracle
SELECT CASE WHEN EXISTS (
  SELECT 1 / 0 FROM dual
) THEN 1 ELSE 0 END
FROM dual

-- PostgreSQL
SELECT EXISTS (SELECT 1 / 0)

-- SQL Server
SELECT CASE WHEN EXISTS (
  SELECT 1 / 0
) THEN 1 ELSE 0 END

-- MySQL
SELECT EXISTS (SELECT pow(-1, 0.5));

Now, none of the databases fail the query. All of them return TRUE or 1. This means that none of the databases actually evaluated the projection (i.e. the SELECT clause) of the EXISTS subquery.

SQL Server, for instance, shows the following plan:

  |--Constant Scan(VALUES:((CASE WHEN (1) THEN (1) ELSE (0) END)))

As you can see, the CASE expression was transformed to a constant, the subquery has been eliminated. Other databases still have the subquery in their plan and don’t mention anything about a projection, so let’s again look at the original query’s plan in Oracle:

SELECT first_name, last_name
FROM actor a
WHERE EXISTS (
  SELECT *
  FROM film_actor fa
  WHERE a.actor_id = fa.actor_id
)

The plan for the above is:

------------------------------------------------------------------
| Id  | Operation             | Name                    | E-Rows |
------------------------------------------------------------------
|   0 | SELECT STATEMENT      |                         |        |
|*  1 |  HASH JOIN SEMI       |                         |    200 |
|   2 |   TABLE ACCESS FULL   | ACTOR                   |    200 |
|   3 |   INDEX FAST FULL SCAN| IDX_FK_FILM_ACTOR_ACTOR |   5462 |
------------------------------------------------------------------

Predicate Information (identified by operation id):
---------------------------------------------------

   1 - access("A"."ACTOR_ID"="FA"."ACTOR_ID")

Column Projection Information (identified by operation id):
-----------------------------------------------------------

   1 - (#keys=1) LAST_NAME, FIRST_NAME
   2 - (rowset=256) A.ACTOR_ID, FIRST_NAME, LAST_NAME
   3 - FA.ACTOR_ID

Observe the projection information on the FILM_ACTOR access in Id=3. In fact, we’re not even accessing the FILM_ACTOR table, because we don’t have to. The EXISTS predicate can be executed using the foreign key index on the ACTOR_ID column only, that’s all we need for this query — despite us having written SELECT *

Summary

Luckily, all databases can remove the projection in EXISTS subqueries:

Database EXISTS projection
DB2 LUW 10.5 Yep
MySQL 8.0.2 Yep
Oracle 12.2.0.1 Yep
PostgreSQL 9.6 Yep
SQL Server 2014

Yep

Stay tuned for Part 2, where we'll talk about more cool SQL optimizations

Create flexible schemas using dynamic columns for semi-structured data. Learn how.

Topics:
database ,sql ,optimization ,cost-based optimization

Published at DZone with permission of Lukas Eder, DZone MVB. See the original article here.

Opinions expressed by DZone contributors are their own.

{{ parent.title || parent.header.title}}

{{ parent.tldr }}

{{ parent.urlSource.name }}