Over a million developers have joined DZone.
{{announcement.body}}
{{announcement.title}}

Cool SQL Optimizations That Do Not Depend on the Cost Model (Part 2)

DZone's Guide to

Cool SQL Optimizations That Do Not Depend on the Cost Model (Part 2)

Learn about five (more!) simple optimizations that can be implemented purely based on metadata (i.e. constraints) and the query itself.

· Database Zone
Free Resource

Read why times series is the fastest growing database category.

Be sure to check out Part 1 here first!

Predicate Merging

This one is interesting and has bitten me in the past when I erroneously assumed that a given database, I could do it.

Consider the following query:

SELECT * 
FROM actor
WHERE actor_id IN (2, 3, 4)
AND actor_id IN (1, 2, 3);

Obviously, the two predicates overlap and can be merged. I would expect the database to transform the above into:

SELECT * 
FROM actor
WHERE actor_id IN (2, 3);

Looks obvious, right? It is a more sophisticated case of transitive closure. Which database can do it?

DB2

Yes!

Explain Plan                                      
--------------------------------------------------
ID | Operation         |               Rows | Cost
 1 | RETURN            |                    |   11
 2 |  FETCH ACTOR      |   2 of 2 (100.00%) |   11
 3 |   IXSCAN PK_ACTOR | 2 of 200 (  1.00%) |    0

Predicate Information                             
 3 - SARG Q3.ACTOR_ID IN (2, 3)                   

MySQL

Again, unfortunately, MySQL doesn’t display the predicate information very nicely. We get the same plan for both queries:

ID  TABLE  TYPE   KEY      ROWS  FILTERED  EXTRA
------------------------------------------------------
1   actor  range  PRIMARY  2     100.00    Using where

2x the same cardinalities, 2x Using where with no indication what exactly is being done inside of where, but given the cardinality, we can assume that the transformation happened correctly. We can look at it differently; let’s try this query:

SELECT * FROM actor
WHERE actor_id IN (3, 4, 5)
AND actor_id IN (1, 2, 3);

Which should be transformed into this one:

SELECT * FROM actor
WHERE actor_id = 3;

And indeed, it happens:

ID  TABLE  TYPE   KEY      ROWS  FILTERED  EXTRA
------------------------------------------------------
1   actor  const  PRIMARY  1     100.00

Observe how TYPE=range changed to TYPE=const.

So, we can conclude that yes, MySQL implements this optimization.

Oracle

Yes!

----------------------------------------------------------
| Id  | Operation                    | Name     | E-Rows |
----------------------------------------------------------
|   0 | SELECT STATEMENT             |          |        |
|   1 |  INLIST ITERATOR             |          |        |
|   2 |   TABLE ACCESS BY INDEX ROWID| ACTOR    |      2 |
|*  3 |    INDEX UNIQUE SCAN         | PK_ACTOR |      2 |
----------------------------------------------------------

Predicate Information (identified by operation id):
---------------------------------------------------

   3 - access(("ACTOR_ID"=2 OR "ACTOR_ID"=3))

The predicate that is being applied only includes the values 2 and 3, so the transformation has worked out correctly.

PostgreSQL

Regrettably, no — this is not optimized!

QUERY PLAN                                                                                     
-----------------------------------------------------------------------------------------------
Seq Scan on actor  (cost=0.00..5.50 rows=1 width=25)                                           
  Filter: ((actor_id = ANY ('{2,3,4}'::integer[])) AND (actor_id = ANY ('{1,2,3}'::integer[])))

Both predicates are still present in the execution plan, and the cardinality estimate is wrong; it should be 2, not 1. If I manually transform the query, I’m getting this plan instead:

QUERY PLAN                                           
-----------------------------------------------------
Seq Scan on actor  (cost=0.00..4.50 rows=2 width=25) 
  Filter: (actor_id = ANY ('{2,3}'::integer[]))      

In particular, we can see the wrong plan if the two predicates do not overlap, in case of which, an impossible predicate is formed:

SELECT * 
FROM actor
WHERE actor_id IN (2, 3, 4)
AND actor_id IN (7, 8, 9)

Still, this yields a “wrong” plan:

QUERY PLAN                                                                                     
-----------------------------------------------------------------------------------------------
Seq Scan on actor  (cost=0.00..5.50 rows=1 width=25)                                           
  Filter: ((actor_id = ANY ('{2,3,4}'::integer[])) AND (actor_id = ANY ('{7,8,9}'::integer[])))

Bummer!

SQL Server

Yes, this works:

  |--Nested Loops(Inner Join)
       |--Index Seek(SEEK:([actor_id]=(2) OR [actor_id]=(3)))
       |--RID Lookup(OBJECT:([actor]))

Summary

Database Predicate merging
DB2 LUW 10.5 Yep
MySQL 8.0.2 Yep
Oracle 12.2.0.1 Yep
PostgreSQL 9.6 Nope
SQL Server 2014 Yep

Provably Empty Sets

This one is really cool. We’ve seen Impossible predicates and unneeded table accesses before. What if we do this again, but this time with a JOIN? Can JOIN elimination kick in, too?

We’re trying the following queries.

IS NULL on NOT NULL column:

The predicate in the WHERE clause cannot be TRUE because we have a NOT NULLconstraint on the FILM_ID column.

SELECT first_name, last_name
FROM actor a
JOIN (
  SELECT *
  FROM film_actor
  WHERE film_id IS NULL
) fa ON a.actor_id = fa.actor_id;

The derived table FA cannot return any rows because of that NOT NULL constraint on the FA.FILM_ID column, so it is provably empty. Because an INNER JOIN with an empty table cannot produce any rows, either, this should save us from accessing the ACTOR table, so the above query should be rewritten to something like this:

SELECT NULL AS first_name, NULL AS last_name
WHERE 1 = 0;

The predicate is never evaluated and the JOIN is eliminated.

INTERSECT NULL and NOT NULL columns:

In principle, this is the same as the previous example, but using a bit more sophisticated syntax:

SELECT *
FROM actor a
JOIN (
  SELECT actor_id, film_id
  FROM film_actor
  INTERSECT
  SELECT NULL, NULL
  FROM dual
) fa ON a.actor_id = fa.actor_id;

Because of the NOT NULL constraints on both FA.ACTOR_ID and FA.FILM_ID, an INTERSECToperation with a (NULL, NULL) tuple should not yield any results, and thus the derived table is provably empty, and thus the INNER JOIN can be eliminated.

Funky, but why not?

Let’s repeat, with EXISTS:

Finally, let’s repeat the above type of query, but this time with a SEMI JOIN instead of an INNER JOIN. First with an impossible predicate

SELECT *
FROM actor a
WHERE a.actor_id IN (
  SELECT actor_id
  FROM film_actor
  WHERE actor_id IS NULL
);

…then again with an intersection.

SELECT *
FROM actor a
WHERE a.actor_id IN (
  SELECT actor_id
  FROM film_actor
  INTERSECT
  SELECT NULL
  FROM sysibm.dual
)

Let’s go. Which database can do which optimization?

DB2

Joining a provably empty set (IS NULL predicate):

Explain Plan                       
-----------------------------------
ID | Operation      |   Rows | Cost
 1 | RETURN         |        |    0
 2 |  TBSCAN GENROW | 0 of 0 |    0

Predicate Information              
 2 - RESID (1 = 0)                 

Joining a provably empty set (INTERSECT):

Explain Plan                       
-----------------------------------
ID | Operation      |   Rows | Cost
 1 | RETURN         |        |    0
 2 |  TBSCAN GENROW | 0 of 0 |    0

Predicate Information              
 2 - RESID (1 = 0)                 

Semi joining a provably empty set (IS NULL predicate):

Explain Plan                       
-----------------------------------
ID | Operation      |   Rows | Cost
 1 | RETURN         |        |    0
 2 |  TBSCAN GENROW | 0 of 0 |    0

Predicate Information              
 2 - RESID (1 = 0)                 

Semi-joining a provably empty set (INTERSECT):

Explain Plan                       
-----------------------------------
ID | Operation      |   Rows | Cost
 1 | RETURN         |        |    0
 2 |  TBSCAN GENROW | 0 of 0 |    0

Predicate Information              
 2 - RESID (1 = 0)                 

Wow, cool! Looks like a winner!

MySQL

Joining a provably empty set (IS NULL predicate):

ID  TABLE   EXTRA
----------------------------
1           Impossible WHERE

Cool! I didn’t expect this!

MySQL doesn’t support INTERSECT, regrettably.

Semi-joining a provably empty set (IS NULL predicate):

ID  TABLE   EXTRA
----------------------------
1           Impossible WHERE

That’s still a great result for MySQL!

Oracle

Joining a provably empty set (IS NULL predicate):

---------------------------------------------------------------------------
| Id  | Operation              | Name          | Starts | E-Rows | A-Rows |
---------------------------------------------------------------------------
|   0 | SELECT STATEMENT       |               |      1 |        |      0 |
|*  1 |  FILTER                |               |      1 |        |      0 |
|*  2 |   HASH JOIN            |               |      0 |   5462 |      0 |
|   3 |    TABLE ACCESS FULL   | ACTOR         |      0 |    200 |      0 |
|   4 |    INDEX FAST FULL SCAN| PK_FILM_ACTOR |      0 |   5462 |      0 |
---------------------------------------------------------------------------

Predicate Information (identified by operation id):
---------------------------------------------------

   1 - filter(NULL IS NOT NULL)
   2 - access("A"."ACTOR_ID"="FILM_ACTOR"."ACTOR_ID")

Again, a very confusing execution plan in Oracle, but the NULL IS NOT NULL filter is there, and it happens before all the other operations, which are not executed.

Joining a provably empty set (INTERSECT):

---------------------------------------------------------------------------------
| Id  | Operation                    | Name          | Starts | E-Rows | A-Rows |
---------------------------------------------------------------------------------
|   0 | SELECT STATEMENT             |               |      1 |        |      0 |
|   1 |  NESTED LOOPS                |               |      1 |      1 |      0 |
|   2 |   NESTED LOOPS               |               |      1 |      1 |      0 |
|   3 |    VIEW                      |               |      1 |      1 |      0 |
|   4 |     INTERSECTION             |               |      1 |        |      0 |
|   5 |      SORT UNIQUE             |               |      1 |   5462 |   5463 |
|   6 |       INDEX FAST FULL SCAN   | PK_FILM_ACTOR |      1 |   5462 |   5463 |
|   7 |      FAST DUAL               |               |      1 |      1 |      1 |
|*  8 |    INDEX UNIQUE SCAN         | PK_ACTOR      |      0 |      1 |      0 |
|   9 |   TABLE ACCESS BY INDEX ROWID| ACTOR         |      0 |      1 |      0 |
---------------------------------------------------------------------------------

Predicate Information (identified by operation id):
---------------------------------------------------

   8 - access("A"."ACTOR_ID"="FA"."ACTOR_ID")

Interesting. This plan will indeed access the entire FILM_ACTOR primary key. It can save accesses to the ACTOR table and primary key index because it does the derived table first (which yields no rows), but still those Ids 5 and 6 should not be there. Bummer!

Semi joining a provably empty set (IS NULL predicate):

This works again:

-------------------------------------------------------------------------------------
| Id  | Operation              | Name                    | Starts | E-Rows | A-Rows |
-------------------------------------------------------------------------------------
|   0 | SELECT STATEMENT       |                         |      1 |        |      0 |
|*  1 |  FILTER                |                         |      1 |        |      0 |
|*  2 |   HASH JOIN SEMI       |                         |      0 |    200 |      0 |
|   3 |    TABLE ACCESS FULL   | ACTOR                   |      0 |    200 |      0 |
|   4 |    INDEX FAST FULL SCAN| IDX_FK_FILM_ACTOR_ACTOR |      0 |   5462 |      0 |
-------------------------------------------------------------------------------------

Predicate Information (identified by operation id):
---------------------------------------------------

   1 - filter(NULL IS NOT NULL)
   2 - access("A"."ACTOR_ID"="ACTOR_ID")

…with the same confusing plan that keeps around the unexecuted subtree.

Semi-joining a provably empty set (INTERSECT):

Again, no optimization here:

-------------------------------------------------------------------------------------------
| Id  | Operation                    | Name                    | Starts | E-Rows | A-Rows |
-------------------------------------------------------------------------------------------
|   0 | SELECT STATEMENT             |                         |      1 |        |      0 |
|   1 |  NESTED LOOPS                |                         |      1 |      1 |      0 |
|   2 |   NESTED LOOPS               |                         |      1 |      1 |      0 |
|   3 |    VIEW                      | VW_NSO_1                |      1 |      1 |      0 |
|   4 |     INTERSECTION             |                         |      1 |        |      0 |
|   5 |      SORT UNIQUE             |                         |      1 |   5462 |    200 |
|   6 |       INDEX FAST FULL SCAN   | IDX_FK_FILM_ACTOR_ACTOR |      1 |   5462 |   5463 |
|   7 |      FAST DUAL               |                         |      1 |      1 |      1 |
|*  8 |    INDEX UNIQUE SCAN         | PK_ACTOR                |      0 |      1 |      0 |
|   9 |   TABLE ACCESS BY INDEX ROWID| ACTOR                   |      0 |      1 |      0 |
-------------------------------------------------------------------------------------------

Predicate Information (identified by operation id):
---------------------------------------------------

   8 - access("A"."ACTOR_ID"="ACTOR_ID")

Not so good!

PostgreSQL

Disappointingly, PostgreSQL doesn’t fare well in this experiment!

Joining a provably empty set (IS NULL predicate):

Nope:

QUERY PLAN                                                                                  
--------------------------------------------------------------------------------------------
Hash Join  (cost=8.31..13.07 rows=1 width=13)                                               
  Hash Cond: (a.actor_id = film_actor.actor_id)                                             
  ->  Seq Scan on actor a  (cost=0.00..4.00 rows=200 width=17)                              
  ->  Hash  (cost=8.30..8.30 rows=1 width=2)                                                
        ->  Index Scan using idx_fk_film_id on film_actor  (cost=0.28..8.30 rows=1 width=2) 
              Index Cond: (film_id IS NULL)                                                 

Joining a provably empty set (INTERSECT):

Even worse:

QUERY PLAN                                                                                         
---------------------------------------------------------------------------------------------------
Hash Join  (cost=166.60..171.36 rows=1 width=29)                                                   
  Hash Cond: (a.actor_id = fa.actor_id)                                                            
  ->  Seq Scan on actor a  (cost=0.00..4.00 rows=200 width=25)                                     
  ->  Hash  (cost=166.59..166.59 rows=1 width=4)                                                   
        ->  Subquery Scan on fa  (cost=0.00..166.59 rows=1 width=4)                                
              ->  HashSetOp Intersect  (cost=0.00..166.58 rows=1 width=8)                          
                    ->  Append  (cost=0.00..139.26 rows=5463 width=8)                              
                          ->  Subquery Scan on "*SELECT* 2"  (cost=0.00..0.02 rows=1 width=8)      
                                ->  Result  (cost=0.00..0.01 rows=1 width=4)                       
                          ->  Subquery Scan on "*SELECT* 1"  (cost=0.00..139.24 rows=5462 width=8) 
                                ->  Seq Scan on film_actor  (cost=0.00..84.62 rows=5462 width=4)   

Semi joining a provably empty set (IS NULL predicate) — same as inner-join:

QUERY PLAN                                                                                       
-------------------------------------------------------------------------------------------------
Hash Semi Join  (cost=6.06..10.60 rows=1 width=25)                                               
  Hash Cond: (a.actor_id = film_actor.actor_id)                                                  
  ->  Seq Scan on actor a  (cost=0.00..4.00 rows=200 width=25)                                   
  ->  Hash  (cost=6.05..6.05 rows=1 width=2)                                                     
        ->  Index Only Scan using film_actor_pkey on film_actor  (cost=0.28..6.05 rows=1 width=2)
              Index Cond: (actor_id IS NULL)                                                     

Semi-joining a provably empty set (INTERSECT) — unsurprisingly:

QUERY PLAN                                                                                        
--------------------------------------------------------------------------------------------------
Hash Semi Join  (cost=152.94..157.48 rows=1 width=25)                                             
  Hash Cond: (a.actor_id = "ANY_subquery".actor_id)                                               
  ->  Seq Scan on actor a  (cost=0.00..4.00 rows=200 width=25)                                    
  ->  Hash  (cost=152.93..152.93 rows=1 width=2)                                                  
        ->  Subquery Scan on "ANY_subquery"  (cost=0.00..152.93 rows=1 width=2)                   
              ->  HashSetOp Intersect  (cost=0.00..152.92 rows=1 width=6)                         
                    ->  Append  (cost=0.00..139.26 rows=5463 width=6)                             
                          ->  Subquery Scan on "*SELECT* 2"  (cost=0.00..0.02 rows=1 width=6)     
                                ->  Result  (cost=0.00..0.01 rows=1 width=2)                      
                          ->  Subquery Scan on "*SELECT* 1"  (cost=0.00..139.24 rows=5462 width=6)
                                ->  Seq Scan on film_actor  (cost=0.00..84.62 rows=5462 width=2)  

SQL Server

SQL Server shines, like DB2.

Joining a provably empty set (IS NULL predicate):

  |--Constant Scan

Joining a provably empty set (INTERSECT):

  |--Constant Scan

Semi-joining a provably empty set (IS NULL predicate):

  |--Constant Scan

Semi-joining a provably empty set (INTERSECT):

  |--Constant Scan

Summary

Database JOIN / NULL JOIN / INTERSECT SEMI JOIN / NULL SEMI JOIN / INTERSECT
DB2 LUW 10.5 Yep Yep Yep Yep
MySQL 8.0.2 Yep Not supported Yep Not supported
Oracle 12.2.0.1 Yep Nope Yep Nope
PostgreSQL 9.6 Nope Nope Nope Nope
SQL Server 2014 Yep Yep Yep Yep

On a side note, this could be done in thousands of other ways. Feel free to comment with your own ideas on how to create “provably empty sets” to see if this is optimised by any of the databases.

CHECK Constraints

Oh, this is cool! Our Sakila database has a CHECK constraint on the FILM.RATING table:

CREATE TABLE film (
  ..
  RATING varchar(10) DEFAULT 'G',
  ..
  CONSTRAINT check_special_rating 
    CHECK (rating IN ('G','PG','PG-13','R','NC-17')),
  ..
);

Seriously, use CHECK constraints for data integrity. The cost of adding them is super low — much less than other constraints like PRIMARYUNIQUE, and FOREIGN KEY constraints, as they do not profit from an index to enforce them, so you get them almost for “free.”

But there’s also an interesting optimization aspect here! Check out these queries.

Impossible Predicate

We’ve seen impossible predicates before, even with NOT NULL constraints (which are special types of CHECK constraints, in fact), but this is even more powerful:

SELECT *
FROM film
WHERE rating = 'N/A';

There can be no such film because the CHECK constraint prevents its insertion (or update). This should again be transformed into a NOOP. Now, what about this?

CREATE INDEX idx_film_rating ON film (rating);

SELECT count(*)
FROM film
WHERE rating NOT IN ('G','PG','PG-13','R');

With the above index, we should probably simply run a quick index scan to count all the films of rating = NC-17, because that’s the only remaining rating. So the query should be rewritten to this:

SELECT count(*)
FROM film
WHERE rating = 'NC-17';

It should be, regardless of the index, because comparing the column with a single value is faster than comparing it with four values.

So, which database can do these things?

DB2

Impossible predicate (rating = N/A) — cool!

Explain Plan                       
-----------------------------------
ID | Operation      |   Rows | Cost
 1 | RETURN         |        |    0
 2 |  TBSCAN GENROW | 0 of 0 |    0

Predicate Information              
 2 - RESID (1 = 0)                 

Inverse predicate (rating = NC-17) — nope…

Explain Plan                                                
------------------------------------------------------------
ID | Operation                |                  Rows | Cost
 1 | RETURN                   |                       |   34
 2 |  GRPBY (COMPLETE)        |    1 of 210 (   .48%) |   34
 3 |   IXSCAN IDX_FILM_RATING | 210 of 1000 ( 21.00%) |   34

Predicate Information                                       
 3 - SARG  NOT(Q1.RATING IN ('G', 'PG', 'PG-13', 'R'))      

While the index is used on ID=3 and while the cardinalities are correct, it is scanned entirely, as we do not have a range predicate but a SARG predicate. For more details, see Markus Winand’s overview here.

We can also show this by manually inverting the predicate to get:

Explain Plan                                                
------------------------------------------------------------
ID | Operation                |                  Rows | Cost
 1 | RETURN                   |                       |    7
 2 |  GRPBY (COMPLETE)        |    1 of 210 (   .48%) |    7
 3 |   IXSCAN IDX_FILM_RATING | 210 of 1000 ( 21.00%) |    7

Predicate Information                                       
 3 - START (Q1.RATING = 'NC-17')                            
      STOP (Q1.RATING = 'NC-17')                            

Now, we’re getting the desired range predicate.

MySQL

MySQL supports the CHECK constraint syntax but doesn’t enforce it for whatever reason. Try this:

CREATE TABLE x (a INT CHECK (a != 0));
INSERT INTO x VALUES (0);
SELECT * FROM x;

You’ll get:

A
-
0

Zero points for MySQL (really, why not just support CHECK constraints?)

Oracle

Impossible predicate (rating = N/A):

--------------------------------------------------------------
| Id  | Operation          | Name | Starts | E-Rows | A-Rows |
--------------------------------------------------------------
|   0 | SELECT STATEMENT   |      |      1 |        |      0 |
|*  1 |  FILTER            |      |      1 |        |      0 |
|*  2 |   TABLE ACCESS FULL| FILM |      0 |     89 |      0 |
--------------------------------------------------------------

Predicate Information (identified by operation id):
---------------------------------------------------

   1 - filter(NULL IS NOT NULL)
   2 - filter("RATING"='N/A')

Again, the super confusing NULL IS NOT NULL filter that cuts off the FULL TABLE SCAN, which might as well be removed entirely from the plan. But at least it works!

Inverse predicate (rating = ‘NC-17’)... oops:

----------------------------------------------------------------------------
| Id  | Operation             | Name            | Starts | E-Rows | A-Rows |
----------------------------------------------------------------------------
|   0 | SELECT STATEMENT      |                 |      1 |        |      1 |
|   1 |  SORT AGGREGATE       |                 |      1 |      1 |      1 |
|*  2 |   INDEX FAST FULL SCAN| IDX_FILM_RATING |      1 |    415 |    210 |
----------------------------------------------------------------------------

Predicate Information (identified by operation id):
---------------------------------------------------

   2 - filter((RATING'PG-13' AND RATING'R' AND RATING'PG' AND RATING'G'))

The predicate could not be inversed; we get a much off cardinality estimate; we get an INDEX FAST FULL SCAN instead of an INDEX RANGE SCAN and a filter predicate rather than an access predicate. Here’s what we should have gotten, i.e. when manually inverting the predicate:

------------------------------------------------------------------------
| Id  | Operation         | Name            | Starts | E-Rows | A-Rows |
------------------------------------------------------------------------
|   0 | SELECT STATEMENT  |                 |      1 |        |      1 |
|   1 |  SORT AGGREGATE   |                 |      1 |      1 |      1 |
|*  2 |   INDEX RANGE SCAN| IDX_FILM_RATING |      1 |    210 |    210 |
------------------------------------------------------------------------

Predicate Information (identified by operation id):
---------------------------------------------------

   2 - access("RATING"='NC-17')

Bummer!

PostgreSQL

Note that the Sakila database in its PostgreSQL version uses an ENUM type instead of a CHECK constraint on the RATING column. I’ve duplicated the table to use a CHECKconstraint instead.

Impossible predicate (rating = N/A) doesn’t work:

QUERY PLAN                                            
------------------------------------------------------
Seq Scan on film2  (cost=0.00..67.50 rows=1 width=385)
  Filter: ((rating)::text = 'N/A'::text)              

Inverse predicate (rating = ‘NC-17’)... also nope:

QUERY PLAN                                                        
------------------------------------------------------------------
Aggregate  (cost=70.53..70.54 rows=1 width=8)                     
  ->  Seq Scan on film2  (cost=0.00..70.00 rows=210 width=0)      
        Filter: ((rating)::text  ALL ('{G,PG,PG-13,R}'::text[]))

Too bad!

SQL Server

Impossible predicate (rating = N/A) — yes!

  |--Constant Scan

Inverse predicate (rating = NC-17) — also yes!

  |--Compute Scalar
       |--Stream Aggregate
            |--Index Seek(OBJECT:([idx_film_rating]), SEEK:([rating]='NC-17'))

Summary

Database Impossible predicate Inverse predicate
DB2 LUW 10.5 Yep Nope
MySQL 8.0.2 Not supported Not supported
Oracle 12.2.0.1 Yep Nope
PostgreSQL 9.6 Nope Nope
SQL Server 2014 Yep Yep

Unneeded Self JOIN

When your queries get more complex, it might well happen that you’re going to self JOIN a table based on its primary key. Trust me, this is common practice when you build complex views and JOIN them to each other, so a database noticing this is a crucial step in optimising complex SQL. I won’t show a complex example, but a simple one, for example:

SELECT a1.first_name, a1.last_name
FROM actor a1
JOIN actor a2 ON a1.actor_id = a2.actor_id;

This could be considered a special case of JOIN elimination as we don’t really need the JOIN of A2, we can do everything with A1 only. Now, INNER JOIN elimination normally works in the presence of a FOREIGN KEY only, which we don’t have here. But because of the PRIMARY KEY on ACTOR_ID, we can prove that in fact A1 = A2. In a way, this is transitive closure all over again.

We can take this one step further and use columns from both A1 and A2:

SELECT a1.first_name, a2.last_name
FROM actor a1
JOIN actor a2 ON a1.actor_id = a2.actor_id;

In the classic JOIN elimination case, we could no longer eliminate the JOIN because we’re projecting from both tables. But since we’ve already proven that A1 = A2, we can use them interchangeably, so the expectation is for this query to be transformed into:

SELECT first_name, last_name
FROM actor;

Who can do this?

DB2

Projecting from A1 only, yes:

Explain Plan                                    
------------------------------------------------
ID | Operation     |                 Rows | Cost
 1 | RETURN        |                      |   20
 2 |  TBSCAN ACTOR | 200 of 200 (100.00%) |   20

Projecting from A1 and A2 …and yes:

Explain Plan                                    
------------------------------------------------
ID | Operation     |                 Rows | Cost
 1 | RETURN        |                      |   20
 2 |  TBSCAN ACTOR | 200 of 200 (100.00%) |   20

MySQL

Projecting from A1 only? Nope:

ID  TABLE  REF          EXTRA
-----------------------------------
1   a1
1   a2     a1.actor_id  Using index

Projecting from A1 and A2… and nope:

ID  TABLE  REF          EXTRA
-----------------------------------
1   a1
1   a2     a1.actor_id  

That’s disappointing…

Oracle

Projecting from A1 only, yes!

--------------------------------------------
| Id  | Operation         | Name  | E-Rows |
--------------------------------------------
|   0 | SELECT STATEMENT  |       |        |
|   1 |  TABLE ACCESS FULL| ACTOR |    200 |
--------------------------------------------

Projecting from A1 and A2... and yes!

--------------------------------------------
| Id  | Operation         | Name  | E-Rows |
--------------------------------------------
|   0 | SELECT STATEMENT  |       |        |
|   1 |  TABLE ACCESS FULL| ACTOR |    200 |
--------------------------------------------

PostgreSQL

Projecting from A1 only? Nope:

QUERY PLAN                                                          
--------------------------------------------------------------------
Hash Join  (cost=6.50..13.25 rows=200 width=13)                     
  Hash Cond: (a1.actor_id = a2.actor_id)                            
  ->  Seq Scan on actor a1  (cost=0.00..4.00 rows=200 width=17)     
  ->  Hash  (cost=4.00..4.00 rows=200 width=4)                      
        ->  Seq Scan on actor a2  (cost=0.00..4.00 rows=200 width=4)

Projecting from A1 and A2 — and nope:

QUERY PLAN                                                           
---------------------------------------------------------------------
Hash Join  (cost=6.50..13.25 rows=200 width=13)                      
  Hash Cond: (a1.actor_id = a2.actor_id)                             
  ->  Seq Scan on actor a1  (cost=0.00..4.00 rows=200 width=10)      
  ->  Hash  (cost=4.00..4.00 rows=200 width=11)                      
        ->  Seq Scan on actor a2  (cost=0.00..4.00 rows=200 width=11)

SQL Server

Projecting from A1 only... surprisingly, no! (But remember, this is SQL Server 2014; maybe this got fixed in a more recent version. I should definitely upgrade!)

  |--Merge Join(Inner Join, MERGE:([a2].[actor_id])=([a1].[actor_id]))
       |--Index Scan(OBJECT:([a2]))
       |--Sort(ORDER BY:([a1].[actor_id] ASC))
            |--Table Scan(OBJECT:([a1]))

Projecting from A1 and A2 — also no, and even with a different, worse plan:

  |--Hash Match(Inner Join, HASH:([a1].[actor_id])=([a2].[actor_id]))
       |--Table Scan(OBJECT:([sakila].[dbo].[actor] AS [a1]))
       |--Table Scan(OBJECT:([sakila].[dbo].[actor] AS [a2]))

Summary

I would have frankly expected this to work on all databases, but I was proven very wrong, which is a shame. Along with JOIN elimination, this is one of the most crucial optimizations to enable building huge SQL queries from reusable parts, such as views and table-valued functions. Unfortunately, this is not supported in three out of five of the most popular databases.

Database Self-join elimination,
single table projection
Self-join elimination,
complete projection
DB2 LUW 10.5 Yep Yep
MySQL 8.0.2 Nope Nope
Oracle 12.2.0.1 Yep Yep
PostgreSQL 9.6 Nope Nope
SQL Server 2014 Nope Nope

Predicate Pushdown

This optimization doesn’t belong here 100% because it is not entirely true to assume this transformation is not cost-based. But since I cannot think of a single obvious reason why an optimizer should not push down predicates into derived tables, I’m listing this here along with the other, non-cost-based optimizations.

Consider this query:

SELECT *
FROM (
  SELECT *
  FROM actor
) a
WHERE a.actor_id = 1;

The derived table has absolutely no value in this query and it should be eliminated, as well, by unnesting it. But let’s ignore that for a moment.

We’d expect the database to perform this query instead:

SELECT *
FROM (
  SELECT *
  FROM actor
  WHERE actor_id = 1
) a;

And then again, possibly, eliminate the outer query.

A more sophisticated example would be when using UNION:

SELECT *
FROM (
  SELECT first_name, last_name, 'actor' type
  FROM actor
  UNION ALL
  SELECT first_name, last_name, 'customer' type
  FROM customer
) people
WHERE people.last_name = 'DAVIS';

The result of this query is:

FIRST_NAME  LAST_NAME  TYPE
----------------------------
JENNIFER    DAVIS      actor
SUSAN       DAVIS      actor
SUSAN       DAVIS      actor
JENNIFER    DAVIS      customer

Now, we’d love the database optimiser to run this statement instead:

SELECT *
FROM (
  SELECT first_name, last_name, 'actor' type
  FROM actor
  WHERE last_name = 'DAVIS'
  UNION ALL
  SELECT first_name, last_name, 'customer' type
  FROM customer
  WHERE last_name = 'DAVIS'
) people;

...pushing down the predicate into the derived table, and from there on into the two UNION ALL subqueries, because after all, we have indexes on both ACTOR.LAST_NAME and CUSTOMER.LAST_NAME columns.

Again, this transformation might be motivated based on costs in most databases, but I still think it’s a no-brainer to do because it’s almost always better to reduce the number of processed tuples as early as possible in any algorithm. If you know a case where this transformation is a bad idea, please comment! I’d be very curious.

So, which databases can do this? (And please, this is so basic, yet important, let the answer be: all.)

DB2

Simple derived table... yes:

Explain Plan                                      
--------------------------------------------------
ID | Operation         |               Rows | Cost
 1 | RETURN            |                    |    6
 2 |  FETCH ACTOR      |   1 of 1 (100.00%) |    6
 3 |   IXSCAN PK_ACTOR | 1 of 200 (   .50%) |    0

Predicate Information                             
 3 - START (Q1.ACTOR_ID = 1)                      
      STOP (Q1.ACTOR_ID = 1)                      

UNION derived table. Yes, again:

Explain Plan                                                     
-----------------------------------------------------------------
ID | Operation                        |               Rows | Cost
 1 | RETURN                           |                    |   20
 2 |  UNION                           |             2 of 1 |   20
 3 |   FETCH CUSTOMER                 |   1 of 1 (100.00%) |   13
 4 |    IXSCAN IDX_CUSTOMER_LAST_NAME | 1 of 599 (   .17%) |    6
 5 |   FETCH ACTOR                    |   1 of 1 (100.00%) |    6
 6 |    IXSCAN IDX_ACTOR_LAST_NAME    | 1 of 200 (   .50%) |    0

Predicate Information                                            
 4 - START (Q1.LAST_NAME = 'DAVIS')                              
      STOP (Q1.LAST_NAME = 'DAVIS')                              
 6 - START (Q3.LAST_NAME = 'DAVIS')                              
      STOP (Q3.LAST_NAME = 'DAVIS')                              

Also, in both cases, the derived table (view) was removed from the plan, as it is not really necessary.

MySQL

Simple derived table; yes:

ID  TABLE  TYPE   KEY      REF    EXTRA
---------------------------------------
1   actor  const  PRIMARY  const

The usual PRIMARY KEY access by a constant value is applied.

UNION derived table — oops, nope:

ID  SELECT_TYPE  TABLE       TYPE  KEY          REF    ROWS  EXTRA
------------------------------------------------------------------
1   PRIMARY        ref   const  10
2   DERIVED      actor       ALL                       200
3   UNION        customer    ALL                       599

The manual transformation would yield:

ID  SELECT_TYPE  TABLE       TYPE  KEY                  REF    ROWS  EXTRA
--------------------------------------------------------------------------
1   PRIMARY        ALL                               5
2   DERIVED      actor       ref   idx_actor_last_name  const  3
3   UNION        customer    ref   idx_last_name        const  1

That’s really a problem if you want to nest complex queries in MySQL!

Oracle

Simple derived table — yes:

---------------------------------------------------------------------------
| Id  | Operation                   | Name     | Starts | E-Rows | A-Rows |
---------------------------------------------------------------------------
|   0 | SELECT STATEMENT            |          |      1 |        |      1 |
|   1 |  TABLE ACCESS BY INDEX ROWID| ACTOR    |      1 |      1 |      1 |
|*  2 |   INDEX UNIQUE SCAN         | PK_ACTOR |      1 |      1 |      1 |
---------------------------------------------------------------------------

Predicate Information (identified by operation id):
---------------------------------------------------

   2 - access("ACTOR"."ACTOR_ID"=1)

The derived table has been unnested, too.

UNION derived table — works, as well:

---------------------------------------------------------------------------------
| Id  | Operation                             | Name                   | E-Rows |
---------------------------------------------------------------------------------
|   0 | SELECT STATEMENT                      |                        |        |
|   1 |  VIEW                                 |                        |      4 |
|   2 |   UNION-ALL                           |                        |        |
|   3 |    TABLE ACCESS BY INDEX ROWID BATCHED| ACTOR                  |      3 |
|*  4 |     INDEX RANGE SCAN                  | IDX_ACTOR_LAST_NAME    |      3 |
|   5 |    TABLE ACCESS BY INDEX ROWID BATCHED| CUSTOMER               |      1 |
|*  6 |     INDEX RANGE SCAN                  | IDX_CUSTOMER_LAST_NAME |      1 |
---------------------------------------------------------------------------------

Predicate Information (identified by operation id):
---------------------------------------------------

   4 - access("LAST_NAME"='DAVIS')
   6 - access("LAST_NAME"='DAVIS')

However, without unnesting the derived table. The Id=1 “VIEW” indicates that it’s still there. This isn’t a problem, in this case; just perhaps a bit cosmetic overhead.

PostgreSQL

Simple derived table; yes, it works:

QUERY PLAN                                          
----------------------------------------------------
Seq Scan on actor  (cost=0.00..4.50 rows=1 width=25)
  Filter: (actor_id = 1)                            

Note, interestingly, PostgreSQL sometimes doesn’t even use the PRIMARY KEY for a single row lookup but scans the entire table. In this case, 200 rows x 25 bytes per row (“width”) fits in a single block, so why bother reading the index, anyway, generating more I/O for this small table access?

UNION derived table; yes, this works as well:

QUERY PLAN                                                                         
-----------------------------------------------------------------------------------
Append  (cost=0.00..12.83 rows=4 width=45)                                         
  ->  Seq Scan on actor  (cost=0.00..4.50 rows=3 width=45)                         
        Filter: ((last_name)::text = 'DAVIS'::text)                                
  ->  Index Scan using idx_last_name on customer  (cost=0.28..8.29 rows=1 width=45)
        Index Cond: ((last_name)::text = 'DAVIS'::text)                            

Again, the index on ACTOR.LAST_NAME is not used, but the one on CUSTOMER.LAST_NAME is, as the CUSTOMER table is quite larger.

SQL Server

Simple derived table — yes, works:

  |--Nested Loops(Inner Join)
       |--Index Seek(SEEK:([actor_id]=(1)))
       |--RID Lookup(OBJECT:([actor]))

UNION derived table works, as well:

  |--Concatenation
       |--Compute Scalar(DEFINE:([Expr1003]='actor'))
       |    |--Nested Loops(Inner Join)
       |         |--Index Seek(SEEK:([actor].[last_name]='DAVIS'))
       |         |--RID Lookup(OBJECT:([actor]))
       |--Compute Scalar(DEFINE:([Expr1007]='customer'))
            |--Nested Loops(Inner Join)
                 |--Index Seek(SEEK:([customer].[last_name]='DAVIS'))
                 |--RID Lookup(OBJECT:([customer]))

Summary

My hopes were scattered. MySQL 8.0.2 doesn’t support this simple optimization completely yet. All others do, however:

Database Simple derived table pushdown UNION derived table pushdown
DB2 LUW 10.5 Yep Yep
MySQL 8.0.2 Yep Nope
Oracle 12.2.0.1 Yep Yep
PostgreSQL 9.6 Yep Yep
SQL Server 2014 Yep Yep

Conclusion

The list presented here is far from complete. There are many more of these simple SQL transformations that are (or should be) a no-brainer for a database to implement, even before the cost-based optimizer kicks in. They remove what I call unnecessary, optional work (as opposed to unnecessary, mandatory work). They are essential tools for:

  • Preventing silly mistakes from affecting SQL performance. Everyone makes mistakes, and as projects grow larger and SQL queries grow more complex, these mistakes might accumulate, yet hopefully without effect.
  • Enabling the reuse of complex building blocks, such as views and table-valued functions, which can be inlined into parent SQL queries, transformed, and parts removed or rewritten

These features are essential for the second part. Without them, it is very difficult to build 4,000 LOC SQL queries that still perform decently, based on a library of reusable SQL components.

Unfortunately for users of PostgreSQL and MySQL, these two popular open-source databases are still much behind their commercial counterparts DB2, Oracle, and SQL Server — where DB2 fared best in this article, Oracle and SQL Server being roughly on par.

SQL is a wonderful language because it is declarative and any statement can be rewritten to something simpler or more sophisticated, which performs much better than what the author has written.

Learn how to get 20x more performance than Elastic by moving to a Time Series database.

Topics:
database ,sql ,optimization ,cost-based model

Published at DZone with permission of Lukas Eder, DZone MVB. See the original article here.

Opinions expressed by DZone contributors are their own.

{{ parent.title || parent.header.title}}

{{ parent.tldr }}

{{ parent.urlSource.name }}