Over a million developers have joined DZone.
{{announcement.body}}
{{announcement.title}}

When Using Bind Variables Is Not Enough: Dynamic IN Lists

DZone's Guide to

When Using Bind Variables Is Not Enough: Dynamic IN Lists

In this article, I will show an example where regrettably, bind variables are not enough, and you can still run into significant performance issues in production. That’s when you create dynamic IN lists.

· Database Zone ·
Free Resource

RavenDB vs MongoDB: Which is Better? This White Paper compares the two leading NoSQL Document Databases on 9 features to find out which is the best solution for your next project.  

In a previous post, I wrote about why you should (almost) always default to using bind variables. There are some exceptions, which I will cover in another follow-up post, but by default, bind variables are the right choice, both from a performance and from a security perspective.

In this article, I will show an example where regrettably, bind variables are not enough, and you can still run into significant performance issues in production. That’s when you create dynamic IN lists.

What’s the Problem With Dynamic IN Lists?

Again, this post is applicable for most databases, but I’ll use Oracle as an example.

An IN predicate that takes a list is a handy abbreviation for an equivalent OR predicate. The following two queries are semantically equivalent:

-- IN predicate
SELECT *
FROM actor
WHERE actor_id IN (1, 2, 3)
 
-- OR predicate
SELECT *
FROM actor
WHERE actor_id = 1
OR actor_id = 2
OR actor_id = 3

The execution plans are, respectively:

IN list:

---------------------------------------------------------
| Id  | Operation                    | Name     | Rows  |
---------------------------------------------------------
|   0 | SELECT STATEMENT             |          |     3 |
|   1 |  INLIST ITERATOR             |          |       |
|   2 |   TABLE ACCESS BY INDEX ROWID| ACTOR    |     3 |
|*  3 |    INDEX UNIQUE SCAN         | PK_ACTOR |     3 |
---------------------------------------------------------
 
Predicate Information (identified by operation id):
---------------------------------------------------
 
   3 - access("ACTOR_ID"=1 OR "ACTOR_ID"=2 OR "ACTOR_ID"=3)

OR predicate:

---------------------------------------------------------
| Id  | Operation                    | Name     | Rows  |
---------------------------------------------------------
|   0 | SELECT STATEMENT             |          |     3 |
|   1 |  INLIST ITERATOR             |          |       |
|   2 |   TABLE ACCESS BY INDEX ROWID| ACTOR    |     3 |
|*  3 |    INDEX UNIQUE SCAN         | PK_ACTOR |     3 |
---------------------------------------------------------
 
Predicate Information (identified by operation id):
---------------------------------------------------
 
   3 - access("ACTOR_ID"=1 OR "ACTOR_ID"=2 OR "ACTOR_ID"=3)

We get the exact same plan. And the plan is also optimal. When looking for three actors, we’ll run tgree primary key lookups and get a cardinality estimate of 3. Perfect — what could be wrong about this?

Remember Execution Plan Caching

In that previous [ost, we’ve seen the detrimental effects we can get under load when execution plans are no longer cached in the execution plan cache (or in Oracle, cursor cache). But even when we’re using bind variables, we can run into these issues. Consider the following dynamic SQL queries:

SELECT count(*) FROM actor WHERE actor_id IN (?)
SELECT count(*) FROM actor WHERE actor_id IN (?, ?)
SELECT count(*) FROM actor WHERE actor_id IN (?, ?, ?)
SELECT count(*) FROM actor WHERE actor_id IN (?, ?, ?, ?)
SELECT count(*) FROM actor WHERE actor_id IN (?, ?, ?, ?, ?)

Now, let’s generate these SQL queries using dynamic SQL in PL/SQL. PL/SQL is not a very dynamic SQL friendly language as can be seen in the following snippet:

SET SERVEROUTPUT ON
DECLARE
  v_count NUMBER;
   
  FUNCTION binds (n NUMBER) RETURN VARCHAR2 IS
    v_result VARCHAR2(1000);
  BEGIN
    FOR i IN 1..n LOOP
      IF v_result IS NULL THEN
        v_result := ':b' || i;
      ELSE
        v_result := v_result || ', :b' || i;
      END IF;
    END LOOP;
     
    RETURN v_result;
  END binds;
   
  FUNCTION vals (n NUMBER) RETURN VARCHAR2 IS
    v_result VARCHAR2(1000);
  BEGIN
    FOR i IN 1..n LOOP
      IF v_result IS NULL THEN
        v_result := i;
      ELSE
        v_result := v_result || ', ' || i;
      END IF;
    END LOOP;
     
    RETURN v_result;
  END vals;
BEGIN
  FOR i IN 1..5 LOOP
    EXECUTE IMMEDIATE '
      DECLARE 
        v_count NUMBER;
      BEGIN
        EXECUTE IMMEDIATE ''
          SELECT count(*)
          FROM actor
          WHERE actor_id IN (' || binds(i) || ')
        ''
        INTO v_count
        USING ' || vals(i) || ';
         
        :v_count := v_count;
      END;
    '
    USING OUT v_count
    ;
     
    dbms_output.put_line(v_count);
  END LOOP;
END;
/

Running this snippet yields, as expected:

1
2
3
4
5

There are a few noteworthy things to say about the above:

  • The functions BINDS() and VALS() generate strings like :b1, :b2, :b3 and 1, 2, 3. We need these to generate dynamically a list of bind variables for the IN predicates, as well as a list of bind values for the USING clause of the nested EXECUTE IMMEDIATE statement.
  • The EXECUTE IMMEDIATE .. USING clause does not take dynamically sized bind value lists, so we have to generate that, as well, in a nested EXECUTE IMMEDIATEstatement.

An appropriate GIF at this point when nesting EXECUTE IMMEDIATE like we did is this one:

Or also:

So, now that we got this covered… let’s look at execution plans:

SELECT s.sql_id, p.*
FROM v$sql s, TABLE (
  dbms_xplan.display_cursor (
    s.sql_id, s.child_number, 'ALLSTATS LAST'
  )
) p
WHERE lower(s.sql_text) LIKE '%actor_id in %'
AND lower(s.sql_text) NOT LIKE '%v$sql%'
AND lower(s.sql_text) NOT LIKE '%execute_immediate%';

And the result is (some additional formatting applied):

SQL_ID  cwm09k8zqp2at, child number 0
-------------------------------------
SELECT count(*) FROM actor WHERE actor_id IN (:b1)

Plan hash value: 1971314150

------------------------------------------------
| Id  | Operation          | Name     | E-Rows |
------------------------------------------------
|   0 | SELECT STATEMENT   |          |        |
|   1 |  SORT AGGREGATE    |          |      1 |
|*  2 |   INDEX UNIQUE SCAN| PK_ACTOR |      1 |
------------------------------------------------

Predicate Information (identified by operation id):
---------------------------------------------------

   2 - access("ACTOR_ID"=:B1)

SQL_ID  1wypurx1x0mrp, child number 0
-------------------------------------
SELECT count(*) FROM actor WHERE actor_id IN (:b1, :b2)

Plan hash value: 577114894

-------------------------------------------------
| Id  | Operation           | Name     | E-Rows |
-------------------------------------------------
|   0 | SELECT STATEMENT    |          |        |
|   1 |  SORT AGGREGATE     |          |      1 |
|   2 |   INLIST ITERATOR   |          |        |
|*  3 |    INDEX UNIQUE SCAN| PK_ACTOR |      2 |
-------------------------------------------------

Predicate Information (identified by operation id):
---------------------------------------------------

   3 - access(("ACTOR_ID"=:B1 OR "ACTOR_ID"=:B2))

SQL_ID  2y06nwgpxqttn, child number 0
-------------------------------------
SELECT count(*) FROM actor WHERE actor_id IN (:b1, :b2, :b3)

Plan hash value: 577114894

-------------------------------------------------
| Id  | Operation           | Name     | E-Rows |
-------------------------------------------------
|   0 | SELECT STATEMENT    |          |        |
|   1 |  SORT AGGREGATE     |          |      1 |
|   2 |   INLIST ITERATOR   |          |        |
|*  3 |    INDEX UNIQUE SCAN| PK_ACTOR |      3 |
-------------------------------------------------

Predicate Information (identified by operation id):
---------------------------------------------------

   3 - access(("ACTOR_ID"=:B1 OR "ACTOR_ID"=:B2 OR "ACTOR_ID"=:B3))

SQL_ID  d4nn8qf9n22yt, child number 0
-------------------------------------
SELECT count(*) FROM actor WHERE actor_id IN (:b1, :b2, :b3, :b4)

Plan hash value: 577114894

-------------------------------------------------
| Id  | Operation           | Name     | E-Rows |
-------------------------------------------------
|   0 | SELECT STATEMENT    |          |        |
|   1 |  SORT AGGREGATE     |          |      1 |
|   2 |   INLIST ITERATOR   |          |        |
|*  3 |    INDEX UNIQUE SCAN| PK_ACTOR |      4 |
-------------------------------------------------

Predicate Information (identified by operation id):
---------------------------------------------------

   3 - access(("ACTOR_ID"=:B1 OR "ACTOR_ID"=:B2 OR "ACTOR_ID"=:B3 OR 
              "ACTOR_ID"=:B4))

SQL_ID  4n9b4zgxr5cwj, child number 0
-------------------------------------
SELECT count(*) FROM actor WHERE actor_id IN (:b1, :b2, :b3, :b4, :b5)

Plan hash value: 577114894

-------------------------------------------------
| Id  | Operation           | Name     | E-Rows |
-------------------------------------------------
|   0 | SELECT STATEMENT    |          |        |
|   1 |  SORT AGGREGATE     |          |      1 |
|   2 |   INLIST ITERATOR   |          |        |
|*  3 |    INDEX UNIQUE SCAN| PK_ACTOR |      5 |
-------------------------------------------------

Predicate Information (identified by operation id):
---------------------------------------------------

   3 - access(("ACTOR_ID"=:B1 OR "ACTOR_ID"=:B2 OR "ACTOR_ID"=:B3 OR 
              "ACTOR_ID"=:B4 OR "ACTOR_ID"=:B5))

So, again, we’re down to having five different SQL_IDs and statements in the execution plan cache/cursor cache, even if they all derive from a single logical statement. Now, in PL/SQL, this is so hard to do, you’re probably not doing it at all.

But if you’re using a query builder (or have written your own), like JPA’s Criteria Query, or jOOQ, then you will be writing something like this (using jOOQ, because in fact, using the Criteria Query API, dynamic SQL is about as hard to do as in PL/SQL):

for (int i = 1; i <= 5; i++) {
    System.out.println(
        DSL.using(configuration)
           .select(count())
           .from(ACTOR)
           .where(ACTOR.ACTOR_ID.in(
                IntStream.rangeClosed(1, i)
                         .boxed()
                         .collect(toList())
           )
           .fetchOne(count())
    );
}

So, this query will generate the same queries as the PL/SQL code, and this really wasn’t that hard. Perhaps, in this particular case, it’s not that much of a good thing that this isn’t so hard to do!

Let’s Benchmark This Again

There are different ways to benchmark this problem. One mistake we could make here is to just benchmark small IN lists with large ones, but that’s not fair. If you have to send a large IN list to the server, you cannot easily change that.

Another option is to benchmark alternative approaches to generating those IN lists

  • The default way is to have a list of exactly size N
  • An alternative way would be to pad the list to a size M where M = 2 ^ X such that M(X) >= N and M(X - 1) < N

The latter approach would, instead of generating the original queries:

SELECT count(*) FROM actor WHERE actor_id IN (?)
SELECT count(*) FROM actor WHERE actor_id IN (?, ?)
SELECT count(*) FROM actor WHERE actor_id IN (?, ?, ?)
SELECT count(*) FROM actor WHERE actor_id IN (?, ?, ?, ?)
SELECT count(*) FROM actor WHERE actor_id IN (?, ?, ?, ?, ?)

Generate these ones:

SELECT count(*) FROM actor WHERE actor_id IN (?)
SELECT count(*) FROM actor WHERE actor_id IN (?, ?)
SELECT count(*) FROM actor WHERE actor_id IN (?, ?, ?, ?)
SELECT count(*) FROM actor WHERE actor_id IN (?, ?, ?, ?)
SELECT count(*) FROM actor WHERE actor_id IN (?, ?, ?, ?, ?, ?, ?, ?)

Where the last bind variable is always repeated (padded) until the list is filled up to a length that is a power of 2 (and in Oracle, of course, we’d split lists longer than 1000 items into several lists).

This padding is what jOOQ already offers.

And maybe, a future Hibernate will offer it too, after I’ve suggested the feature to Vlad Mihalcea from the Hibernate team.

Here’s the benchmark logic. It’s a bit more complex than previously:

SET SERVEROUTPUT ON
 
ALTER SYSTEM FLUSH SHARED_POOL;
ALTER SYSTEM FLUSH BUFFER_CACHE;
 
CREATE TABLE results (
  run     NUMBER(2),
  stmt    NUMBER(2),
  elapsed NUMBER
);
 
DECLARE
  v_ts TIMESTAMP WITH TIME ZONE;
  v_repeat CONSTANT NUMBER := 20000;
  v_first_name actor.first_name%TYPE;
  v_last_name  actor.last_name%TYPE;
   
  FUNCTION binds (n NUMBER, pad BOOLEAN) RETURN VARCHAR2 IS
    v_result VARCHAR2(32767);
    v_n_pad  NUMBER(10) := power(2, ceil(ln(n)/ln(2)));
  BEGIN
    FOR i IN 1..n LOOP
      IF v_result IS NULL THEN
        v_result := ':b' || i;
      ELSIF mod(i, 1000) = 0 THEN
        v_result := v_result || ') OR actor_id IN (:b' || i;
      ELSE
        v_result := v_result || ', :b' || i;
      END IF;
    END LOOP;
     
    IF pad THEN
      FOR i IN n + 1 .. v_n_pad LOOP
        IF mod(i, 1000) = 0 THEN
          v_result := v_result || ') OR actor_id IN (:b' || i;
        ELSE
          v_result := v_result || ', :b' || i;
        END IF;
      END LOOP;
    END IF;
     
    RETURN v_result;
  END binds;
   
  FUNCTION vals (n NUMBER, pad BOOLEAN) RETURN VARCHAR2 IS
    v_result VARCHAR2(32767);
    v_n_pad  NUMBER(10) := power(2, ceil(ln(n)/ln(2)));
  BEGIN
    FOR i IN 1..n LOOP
      IF v_result IS NULL THEN
        v_result := i;
      ELSE
        v_result := v_result || ', ' || i;
      END IF;
    END LOOP;
     
    IF pad THEN
      FOR i IN n + 1 .. v_n_pad LOOP
        v_result := v_result || ', ' || n;
      END LOOP;
    END IF;
     
    RETURN v_result;
  END vals;
   
  PROCEDURE run(i NUMBER, pad BOOLEAN) IS
  BEGIN
    EXECUTE IMMEDIATE '
      DECLARE 
        v_count NUMBER;
      BEGIN
        EXECUTE IMMEDIATE ''
          SELECT count(*)
          FROM actor
          WHERE actor_id IN (' || binds(i, pad) || ')
        ''
        INTO v_count
        USING ' || vals(i, pad) || ';
      END;
    ';
  END run;
BEGIN
 
  -- Repeat the whole benchmark several times to avoid warmup penalty
  FOR r IN 1..5 LOOP
    v_ts := SYSTIMESTAMP;
       
    FOR i IN 1..v_repeat LOOP
      run(mod(i, 100) + 1, FALSE);
    END LOOP;
   
    INSERT INTO results VALUES (r, 1, 
      SYSDATE + ((SYSTIMESTAMP - v_ts) * 86400) - SYSDATE);
    v_ts := SYSTIMESTAMP;
       
    FOR i IN 1..v_repeat LOOP
      run(mod(i, 100) + 1, TRUE);
    END LOOP;
   
    INSERT INTO results VALUES (r, 2, 
      SYSDATE + ((SYSTIMESTAMP - v_ts) * 86400) - SYSDATE);
  END LOOP;
   
  FOR rec IN (
    SELECT
      run, stmt, 
      CAST(elapsed / MIN(elapsed) OVER() AS NUMBER(10, 5)) ratio 
    FROM results
  )
  LOOP
    dbms_output.put_line('Run ' || rec.run || 
      ', Statement ' || rec.stmt || 
      ' : ' || rec.ratio);
  END LOOP;
END;
/
 
DROP TABLE results;

Bear in mind the following:

  • We run the benchmark five times.
  • In each run, we run 20,000 queries with each technique.
  • Those 20000 queries have between 1 and 100 elements in the IN list.

The results show that both approaches fare equally well:

Run 1, Statement 1 : 1.16944
Run 1, Statement 2 : 1.11204
Run 2, Statement 1 : 1.06086
Run 2, Statement 2 : 1.03591
Run 3, Statement 1 : 1.03589
Run 3, Statement 2 : 1.03605
Run 4, Statement 1 : 1.33935
Run 4, Statement 2 : 1.2822
Run 5, Statement 1 : 1
Run 5, Statement 2 : 1.04648

This means that there is no significant overhead caused by the padding in this benchmark. That’s good news.

Running the following query shows the number of executions per statement:

SELECT executions, regexp_count(sql_text, ':'), sql_text
FROM v$sql
WHERE lower(sql_text) LIKE '%actor_id in %'
AND lower(sql_text) NOT LIKE '%v$sql%'
AND lower(sql_text) NOT LIKE '%execute_immediate%'
ORDER BY sql_text;

The result being:

EXECS   BINDS   SQL_TEXT
20001       SELECT count(*) FROM actor WHERE actor_id IN (:b1)         
20002       SELECT count(*) FROM actor WHERE actor_id IN (:b1, :b2)         
10003       SELECT count(*) FROM actor WHERE actor_id IN (:b1, :b2, :b3)         
30004       ...
10005       ...
10006       ...
10007       ...
50008       ...
10009       ...
100010      ...
100011      ...
100012      ...
100013...
100014...
100015...
900016...
...
1000100...
36000128...

The version that creates exactly sized IN lists is always executed 1000 times. The version that has padded IN lists sized to the next power of 2 (M = 2 ^ X) is executed 1,000 * X times (plus another 1,000 times because there’s also an IN list of that exact size)

In the end, there are some “lone” 128-element sized IN lists. This is expected. The number of different SQL statements is:

  • 100 for the exactly sized IN lists
  • 7 for the padded IN lists (128 = 2 ^ 7)

So, thus far, we haven’t gained anything, but also not lost much.

What If the Lists Get Longer?

The real problem isn’t the individual query execution time, which is fine in both cases, but the effect that dynamic IN lists have on a system. Let’s generate IN lists of up to size 4096, and bearing in mind that Oracle won’t allow this, the lists will be of the form:

SELECT count(*)
FROM actor
WHERE actor_id IN (:b1, :b2, ..., :b1000)
OR actor_id IN (:b1001, :b1002, ..., :b2000)
OR actor_id IN (:b2001, ...)

Now, with a maximum size of 4,096, there are quite a few distinct statements that need to be parsed, as opposed to only 12 when we use the padded in lists. The padded size grows logarithmically, so we’ll never really run into this same problem again — at least, not for a single IN list.

How does the comparison fare in another benchmark? We must be careful not to overdo things. Running 2x 20,000 queries with variable-sized IN lists can easily run for more than an hour on Oracle. Perhaps it’s Oracle’s way of punishing you for doing something you shouldn’t: having super-long IN lists. I’ve played around with different sets of parameters to the benchmark and the clear message here is that having long bind variable lists creates a bottleneck per se (i.e. the binding of the variables takes time), so the difference in parse times isn’t as drastic as it was before, for individual query executions. The overall improvement is this:

Run 1, Statement 1 : 1.42696
Run 1, Statement 2 : 1

So, we get around a 1.5x increase of performance when repeating the bind values. Yet still, the fact that we have far less cached execution plans does matter, especially when we start exhausting the cursor cache, because with this approach, at least this exhaustion is not going to happen.

Alternative Options

The above shows that if you really cannot get around an IN list, a simple, viable workaround is to pad the lists. But often, you can get around them. In databases like Oracle and PostgreSQL, that offer array types, you can consider using those once your lists start growing. I’ve blogged about this in a previous post.

If your lists become huge, a bulk/batch inserting the IDs into a temporary table might be a far better option, and then semi-join that one instead:

SELECT count(*)
FROM actor
WHERE actor_id IN (
  SELECT id FROM temp_table
)

And ideally, if your list is really not dynamic but originates from some other table, then please, stop passing around IDs and semi-join the original query that produced the IDs in the first place:

SELECT count(*)
FROM actor
WHERE actor_id IN (
  SELECT id FROM other_table WHERE ...
)

That last option should always be your first choice as it will very likely outperform all the others.

Get comfortable using NoSQL in a free, self-directed learning course provided by RavenDB. Learn to create fully-functional real-world programs on NoSQL Databases. Register today.

Topics:
database ,tutorial ,bind variables ,dynamic ,database performance

Published at DZone with permission of

Opinions expressed by DZone contributors are their own.

{{ parent.title || parent.header.title}}

{{ parent.tldr }}

{{ parent.urlSource.name }}