In their eternal attempts to circumvent the N+1 problem, Hibernate users often resort to IN predicates with ID lists. In this post, we’ll see how those users might just be replacing a horrible thing with a bad one, which is better but not yet good. Here’s why:
The N+1 Problem
SELECT id, name FROM albums SELECT id, name FROM songs WHERE album_id = 1 SELECT id, name FROM songs WHERE album_id = 2 SELECT id, name FROM songs WHERE album_id = 3 SELECT id, name FROM songs WHERE album_id = 4 SELECT id, name FROM songs WHERE album_id = 5
This set of queries is often produced by ORMs such as Hibernate, when entities are configured to be lazy fetched.
The article also tackles the problem by replacing the second set of N=5 queries by a single query with an IN predicate:
SELECT id, title, filename FROM songs WHERE album_id IN (1, 2, 3, 4, 5)
This will reduce the number of queries from N+1 to 1+1, which is certainly faster.
But let’s look at things from the SQL side
Is such an IN predicate with an ID List a good solution? It is certainly viable for very small lists. But when your list grows, consider these things:
IN list size
Not all databases support arbitrary lengths of IN lists. In particular the following limitations exist:
- Oracle IN predicate: 1000 elements
- Ingres: 1024 total bind values
- SQLite: 999 total bind values
- Sybase ASE: 2000 total bind values
- SQL Server 2008 R2: 2100 total bind values
- Large Oracle IN predicates are split into several
- Large amounts of bind values are detected at SQL rendering time and replaced by inline values
Variable binding speed
There’s quite a bit of work involved with variable binding in some JDBC drivers. Essentially, some database protocols will need to transfer many values one-by-one to the database. This doesn’t happen when you inline bind values, as the only thing transferred is a single SQL string. Having too many bind values (I’m talking about 10k or more), is certainly not a good idea.
Cursor cache misses
Sophisticated databases such as Oracle maintain cursor caches, which can be leveraged for cursor sharing. This means that subsequent executions of identical SQL statements will profit from expensive execution plan calculations having been done already, along with cursor statistics being collected. Think about it this way:
-- This is the first time Oracle encounters this -- query. The DB has to parse the query and -- calculate an execution plan, which can be quite -- expensive if you have lots of JOINs SELECT id, name FROM songs WHERE album_id IN (?, ?) -- This is the second time Oracle encounters this -- same query. The DB can now re-use the previous -- execution plan as it is likely to be optimal -- again SELECT id, name FROM songs WHERE album_id IN (?, ?) -- This is not the same query as the previous ones -- A new execution plan has to be calculated SELECT id, name FROM songs WHERE album_id IN (?, ?, ?)
As you can quickly see, the above example shows that an ID list in an IN predicate is a moving target, which is likely to remove the usefulness of bind values entirely, as each query is prone to produce new cursors and new execution plans in your database. You might as well have inlined your bind values, which would have even helped you prevent bind value peeking issues.
So what’s better than ID lists?
There are a number of things that are better. Note that not all of them may be suitable for your concrete problem, and not all of them will always outperform ID lists. Use common sense and maybe a load and/or performance test to be sure, which is the best query in your situation.
Explicit “eager” fetching, using JOINs
Sometimes, it would just be easier to denormalise the data in the database. Instead of fetching songs one by one, just fetch them along with the albums:
SELECT a.id a_id, a.name a_name, s.id s_id, s.name s_name FROM albums a JOIN songs s ON s.album_id = a.id
This will transfer more data over the wire (repeating album information) in exchange for executing only a single query (reducing N+1 to 1). This is only good for slight denormalisations. If you JOIN dozens of 1:N relationships, you might not be happy with this solution.
Semi-joining the original query
If you can access the original query’s SQL code, just semi-join it when fetching songs! It’s simple:
SELECT id, name FROM songs WHERE album_id IN ( SELECT id FROM albums ) -- Or using EXISTS SELECT s.id, s.name FROM songs s WHERE EXISTS ( SELECT 1 FROM albums a WHERE a.id = s.album_id )
This will require some SQL transformation. Again, using a typesafe query builder / SQL builder to compose queries, such as jOOQ, JaQu or Criteria API, you may be able to implement such SQL transformation / SQL composition more easily.
Note that this is probably the fastest solution that you can choose, at least in sophisticated databases with powerful query optimisers.
Using arrays for ID lists
If you really cannot query your songs without an ID list, at least, use a single array as a bind variable as such (Oracle dialect):
SELECT id, name FROM songs WHERE album_id IN ( SELECT * FROM TABLE(?) )
The above syntax is Oracle-specific. Check out this Stack Overflow question for other alternatives. Note that Oracle’s
TABLE types are strongly typed, i.e. you will have to have such a type, first:
CREATE TYPE numbers AS TABLE OF NUMBER(38);
Alternatively, you can use one of these “built-in” table types:
Creating discrete-sized IN lists
If your database doesn’t support arrays, and you need to rely on ID lists, there is one last option that you may have to avoid too many cursor cache misses and hard parses. Create discrete-sized IN lists, filling up the bind values to the next discrete length. Let’s assume lengths 2, 3, 5, 8, 13. This is best explained by example:
-- Of course, this only makes sense with bind values. Inlining is done -- for the purpose of the example only -- Two IDs fill up to 2 album_id IN (1, 2) -- Three IDs fill up to 3 album_id IN (1, 2, 3) -- Four IDs fill up to 5 album_id IN (1, 2, 3, 4, 4) -- Five IDs fill up to 5 album_id IN (1, 2, 3, 4, 5) -- Six IDs fill up to 8 album_id IN (1, 2, 3, 4, 5, 6, 6, 6)
There is no rule of thumb at what steps your IN list sizes should increase, so you might want to actually measure this.
Note!: You may use
NULL to fill up IN lists of an
IN predicate, but not of a
NOT IN predicate. To learn more about this, read this blog post about
NOT IN predicates.
TL;DR: Get back in control of your SQL
As soon as a decent amount of data is involved with your data processing, common ORM models may not be sufficient anymore, as it is very hard to tune such ORMs. You may need to resort to SQL and explicitly express your SQL statements in the most optimal way for your problem domain.