Over a million developers have joined DZone.
{{announcement.body}}
{{announcement.title}}

Why You Should (Sometimes) Avoid Expressions in SQL Predicates

DZone's Guide to

Why You Should (Sometimes) Avoid Expressions in SQL Predicates

In this post, we take a look at a performance issue in an Oracle database that just may affect your own databases. Read on to find out more.

· Database Zone
Free Resource

Whether you work in SQL Server Management Studio or Visual Studio, Redgate tools integrate with your existing infrastructure, enabling you to align DevOps for your applications with DevOps for your SQL Server databases. Discover true Database DevOps, brought to you in partnership with Redgate.

I’ve recently discovered a rather significant performance issue on a productive Oracle 11g customer database. And I’m sure you have this issue too, which is why I’m documenting it here.

This is a simplified representation of the setup at the customer site:

        ID PAYMENT_DATE TEXT                               
---------- ------------ -----------------------------------
     33803 21.05.16     DcTNBOrkQIgMtbietUWOsSFNMIqGLlDw...
     29505 09.03.16     VIuPaOAQqzCMlFBYPQtvqUSbWYPDndJD...
     10738 25.07.16     TUkxGpZPrGKaHzDRxczrktkFWvGmiwjR...

Let’s produce the above table with the following statement:

CREATE TABLE payment (
  id            NOT NULL PRIMARY KEY,
  payment_date  NOT NULL,
  text
) AS
SELECT
  level,
  SYSDATE - dbms_random.value(1, 500),
  dbms_random.string('a', 500)
FROM dual
CONNECT BY level <= 50000
ORDER BY dbms_random.value;

CREATE INDEX i_payment_date ON payment(payment_date);

EXEC dbms_stats.gather_table_stats('TEST', 'PAYMENT');

What we’re doing here: There’s a payment table with an ID primary key column, an indexed payment_date column and some “text” info. The real table is, of course, much bigger than this.

Now, this table needs infrequent housekeeping. In nightly batch jobs, the customer ran through the table and removed all payments that were older than some fixed amount of days, e.g.:

DELETE
FROM payment
WHERE payment_date < SYSDATE - 470 // Older than 470 days

This might look fine at first, because:

  1. We have an index on payment_date, and it can be used for the deletion
  2. SYSDATE is constant per query, so the expression SYSDATE - 470 should also be constant per query

We can prove both statements easily:

1. Index Usage

Let’s run:

EXPLAIN PLAN FOR
DELETE
FROM payment
WHERE payment_date < SYSDATE - 470;

SELECT * FROM TABLE(dbms_xplan.display);

And we’re getting:

----------------------------------------------------------------------------
| Id  | Operation         | Name           | Rows  | Cost (%CPU)| Time     |
----------------------------------------------------------------------------
|   0 | DELETE STATEMENT  |                |  3008 |    10   (0)| 00:00:01 |
|   1 |  DELETE           | PAYMENT        |       |            |          |
|*  2 |   INDEX RANGE SCAN| I_PAYMENT_DATE |  3008 |    10   (0)| 00:00:01 |
----------------------------------------------------------------------------

Predicate Information (identified by operation id):                         
---------------------------------------------------

   2 - access("PAYMENT_DATE"<SYSDATE@!-470)                                 

So, we get a relatively cheap access predicate on the I_PAYMENT_DATE index, and there’s this weird SYSDATE@!-470 constant in the plan.

2. SYSDATE Being Constant

There’s a lot of confusion about what the actual value of these non-deterministic, non-pure date/time functions is in each database. Sometimes, the function is evaluated on each row individually and produces a new value each time. In my opinion, that’s the worst, as such functions are completely unpredictable.

Sometimes, there’s a guarantee that these timestamps stay the same for the entire duration of a transaction. That’s a bit weird as a default, but OK why not.

In Oracle, it seems that SYSDATE (and SYSTIMESTAMP) are evaluated only a single time on a per-statement level. This would explain that weird SYSDATE@! notation in the execution plan, and it can be “proven” by doing something like this:

CREATE OR REPLACE FUNCTION sleep(i NUMBER) RETURN NUMBER 
AS
BEGIN
  dbms_lock.sleep(i);
  RETURN i;
END;
/

SELECT 
  min(CAST(sysdate AS TIMESTAMP)), 
  max(CAST(sysdate AS TIMESTAMP))
FROM dual
CONNECT BY level <= 5 AND sleep(1) > 0;

DROP FUNCTION sleep;
/

The result of this statement (which unsurprisingly takes around 4 seconds to execute) is:

MIN(CAST(SYSDATEASTIMESTAMP)) MAX(CAST(SYSDATEASTIMESTAMP))
----------------------------- -----------------------------
01.11.16 11:15:20.000000000   01.11.16 11:15:20.000000000  

Some background info on this OTN thread here. Unfortunately, the SYSDATE documentation fails to clearly specify this behaviour…

So, What’s the Problem?

Even if we’re probably relying on an undocumented feature, it looks like everything is fine, right? SYSDATE - 470 is more or less a constant, so we’re fine putting it in that WHERE clause.

Wrong!

Let’s run a benchmark, comparing 3 times an equivalent query running each query 100 times (and for repeatability, we use SELECT, not DELETE):

SET SERVEROUTPUT ON
DECLARE
  v_ts TIMESTAMP;
  v_repeat CONSTANT NUMBER := 100;
  v_range CONSTANT NUMBER := 470;
  v_date CONSTANT DATE := SYSDATE - v_range;
BEGIN
  v_ts := SYSTIMESTAMP;

  -- Original query with inline expression
  FOR i IN 1..v_repeat LOOP
    FOR rec IN (
      SELECT *
      FROM payment
      WHERE payment_date < SYSDATE - v_range
    ) LOOP
      NULL;
    END LOOP;
  END LOOP;

  dbms_output.put_line('Statement 1 : ' || (SYSTIMESTAMP - v_ts));
  v_ts := SYSTIMESTAMP;

  -- Pre-calculated PL/SQL local variable
  FOR i IN 1..v_repeat LOOP
    FOR rec IN (
      SELECT *
      FROM payment
      WHERE payment_date < v_date
    ) LOOP
      NULL;
    END LOOP;
  END LOOP;

  dbms_output.put_line('Statement 2 : ' || (SYSTIMESTAMP - v_ts));
  v_ts := SYSTIMESTAMP;

  -- Magical 11g scalar subquery caching
  FOR i IN 1..v_repeat LOOP
    FOR rec IN (
      SELECT *
      FROM payment
      WHERE payment_date < (SELECT SYSDATE - v_range FROM dual)
    ) LOOP
      NULL;
    END LOOP;
  END LOOP;

  dbms_output.put_line('Statement 3 : ' || (SYSTIMESTAMP - v_ts));
END;
/

The above benchmark uses 3 techniques that all produce the same result but have different timing characteristics:

  1. is using an inline expression in the predicate
  2. is using a bind variable
  3. is using a scalar subquery

When we check out the estimated execution plans, nothing seems to indicate that any solution might be better than the other:

EXPLAIN PLAN FOR
SELECT *
FROM payment
WHERE payment_date < SYSDATE - :v_range;

SELECT * FROM TABLE(dbms_xplan.display);

EXPLAIN PLAN FOR
SELECT *
FROM payment
WHERE payment_date < :v_date;

SELECT * FROM TABLE(dbms_xplan.display);

-- CAST is there because of a "bug" in the EXPLAIN PLAN parser
EXPLAIN PLAN FOR
SELECT *
FROM payment
WHERE payment_date < (
  SELECT CAST(SYSDATE - :v_range AS DATE) FROM dual
);

SELECT * FROM TABLE(dbms_xplan.display);

The output is:

--------------------------------------------------------------------------------------
| Id  | Operation                   | Name           | Rows  | Cost (%CPU)| Time     |
--------------------------------------------------------------------------------------
|   0 | SELECT STATEMENT            |                |  2500 |   453   (0)| 00:00:06 |
|   1 |  TABLE ACCESS BY INDEX ROWID| PAYMENT        |  2500 |   453   (0)| 00:00:06 |
|*  2 |   INDEX RANGE SCAN          | I_PAYMENT_DATE |   450 |     3   (0)| 00:00:01 |
--------------------------------------------------------------------------------------

Predicate Information (identified by operation id):                                   
---------------------------------------------------                                   

   2 - access("PAYMENT_DATE"<SYSDATE@!-:V_RANGE)                                      


--------------------------------------------------------------------------------------
| Id  | Operation                   | Name           | Rows  | Cost (%CPU)| Time     |
--------------------------------------------------------------------------------------
|   0 | SELECT STATEMENT            |                |  2500 |   453   (0)| 00:00:06 |
|   1 |  TABLE ACCESS BY INDEX ROWID| PAYMENT        |  2500 |   453   (0)| 00:00:06 |
|*  2 |   INDEX RANGE SCAN          | I_PAYMENT_DATE |   450 |     3   (0)| 00:00:01 |
--------------------------------------------------------------------------------------

Predicate Information (identified by operation id):                                   

Predicate Information (identified by operation id):                                   
---------------------------------------------------                                   

   2 - access("PAYMENT_DATE"<:V_DATE)                                                 


--------------------------------------------------------------------------------------
| Id  | Operation                   | Name           | Rows  | Cost (%CPU)| Time     |
--------------------------------------------------------------------------------------
|   0 | SELECT STATEMENT            |                |  2500 |   455   (0)| 00:00:06 |
|   1 |  TABLE ACCESS BY INDEX ROWID| PAYMENT        |  2500 |   453   (0)| 00:00:06 |
|*  2 |   INDEX RANGE SCAN          | I_PAYMENT_DATE |   450 |     3   (0)| 00:00:01 |
|   3 |    FAST DUAL                |                |     1 |     2   (0)| 00:00:01 |
--------------------------------------------------------------------------------------


Predicate Information (identified by operation id):                                   
---------------------------------------------------                                   

   2 - access("PAYMENT_DATE"< (SELECT CAST(SYSDATE@!-:V_RANGE AS DATE) FROM           
              "SYS"."DUAL" "DUAL"))                                                   

The only exception is that in the 3rd plan, we have this FAST DUAL operation which hints at scalar subquery caching kicking in.

Let’s compare actual results, though! (As always, I’m not publishing the real results, just qualitative numbers in a fictive unit of measurement to comply with Oracle legal blah blah.)

-- Run 1
Statement 1 : 1.98 wombos
Statement 2 : 1.17 wombos
Statement 3 : 0.80 wombos

-- Run 2
Statement 1 : 1.49 blorfs
Statement 2 : 1.19 blorfs
Statement 3 : 0.78 blorfs

-- Run 3
Statement 1 : 1.46 gnarls
Statement 2 : 1.20 gnarls
Statement 3 : 0.75 gnarls

Wow! As you can see, using a bind variable certainly helps, but when you apply scalar subquery caching, THIS is where you get the most benefits. Now, this query is just a very simple example. The actual customer query was immensely more complex, and trust me – the performance improvement was 10x (I couldn’t believe it myself at first, and I still don’t know exactly why!).

Conclusion (for Oracle 11g)

Modern optimizers “recognize” a lot of expressions to be constant. For instance, in most databases, it doesn’t matter if you’re writing COUNT(1) or COUNT(*), they’re both translated to the same thing.

In this particular case, however, I was quite disappointed by the fact that there is a significant difference between these three perfectly equivalent queries as far as I’m concerned, and the least intuitive solution using scalar subquery caching performed the best on Oracle 11g.

But Wait (Oracle 12c)

I’ve tested the same behaviour on my Oracle 12c on Docker instance:

Statement 1 : 1.23 xmfs
Statement 2 : 1.11 xmfs
Statement 3 : 1.14 xmfs

Statement 1 : 1.27 xlorgs
Statement 2 : 1.07 xlorgs
Statement 3 : 1.30 xlorgs

Statement 1 : 1.23 glrls (I can invent units of measurement all day!)
Statement 2 : 1.00 glrls
Statement 3 : 1.39 glrls

… where now there seems to be a “regression” in the scalar subquery solution (it’s now the same as without the scalar subquery) and the bind variable solution now seems to be the fastest.

If anything at all, this just proves that with SQL, you will always have to measure stuff on your end, because your setup and database versions may differ. The bottom line is: If the performance of a single statement matters to you, chances are that you can improve things by 2-digit percentages with just some simple tricks, and in a batch job or under heavy load, this definitely matters!

Certainly, you should not use any expressions of the form SYSDATE - some_value in your predicates.

It’s easier than you think to extend DevOps practices to SQL Server with Redgate tools. Discover how to introduce true Database DevOps, brought to you in partnership with Redgate

Topics:
performance ,sql ,oracle ,database ,predicates ,expression

Published at DZone with permission of Lukas Eder, DZone MVB. See the original article here.

Opinions expressed by DZone contributors are their own.

THE DZONE NEWSLETTER

Dev Resources & Solutions Straight to Your Inbox

Thanks for subscribing!

Awesome! Check your inbox to verify your email so you can start receiving the latest in tech news and resources.

X

{{ parent.title || parent.header.title}}

{{ parent.tldr }}

{{ parent.urlSource.name }}