MySQL: When NOT IN Is Not Equal to NOT EXISTS
In MySQL, is "NOT IN" the same thing as "NOT EXISTS"? The answer is: not always. And, of course, nulls are at fault.
When you want to perform a difference operation between two tables, you have a choice: NOT EXISTS with a correlated subquery, or NOT IN. The latter is arguably simpler to write and makes the intent of the query more obvious. And modern database systems will optimize the two queries into similar execution plans, handling the correlation between outer and inner queries (I say “modern” because when I was working with Oracle 7.3 in the mid-90s I learned the hard way that it did not).
There is one key difference between the two constructs: if the subquery returns a NULL in its results, then the NOT IN condition can never be true (so the outer query returns no rows), because NULL is neither equal to nor not equal to any other value. But if you guard against that, they should be equivalent; indeed, some sources will tell you that NOT IN is faster and therefore preferred. This post is about one case where it's dramatically slower, and nulls are to blame.
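To make the null behavior concrete, here's a minimal sketch using two throwaway tables (the table and column names are invented purely for illustration):

-- A NULL in the subquery result makes NOT IN return nothing.
create table A (ID integer);
create table B (REF integer);

insert into A values (1), (2), (3);
insert into B values (1), (null);

-- Returns no rows: ID <> NULL evaluates to unknown, so the predicate is never true.
select ID from A where ID not in (select REF from B);

-- Returns 2 and 3 once the nulls are filtered out.
select ID from A where ID not in (select REF from B where REF is not null);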
Consider the following two tables, which might be used to track clickstream data. Since we track both anonymous and registered users, EVENTS.USER_ID is nullable. However, for the rows where USER_ID is not null, the secondary index on it has high cardinality.
create table USERS
(
  ID      integer auto_increment primary key,
  ...
);

create table EVENTS
(
  ID      integer auto_increment primary key,
  TYPE    smallint not null,
  USER_ID integer,
  ...
);

create index EVENTS_USER_IDX on EVENTS(USER_ID);
OK, now let's use these tables: from a small set of users, we want to find the ones that haven't had a particular event. Using a NOT IN, and ensuring that nulls don't appear in the inner results, the query looks like this:
select ID
from USERS
where ID in (1, 7, 2431, 87142, 32768)
and ID not in
(
select USER_ID
from EVENTS
where TYPE = 7
and USER_ID is not null
);
For my test dataset, the USERS table has 100,000 rows and the EVENTS table has 10,000,000, of which approximately 75% have a null USER_ID. I'm running on my laptop, which has a Core i7 processor, 12 GB of RAM, and an SSD. And I consistently get runtimes of around 2 minutes, which is … wow.
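(As an aside, if you want to build a dataset of roughly this shape to experiment with, here's one minimal sketch: MySQL 8.0 recursive CTEs, a rand()-based 75/25 null split, and the elided columns ignored. Treat it as an approximation of the shape described above, not the exact setup behind these numbers.)

-- Recursive CTEs default to a depth of 1,000; raise the limit for bulk generation.
set session cte_max_recursion_depth = 10000000;

-- 100,000 users.
insert into USERS (ID)
with recursive seq (n) as
(
  select 1
  union all
  select n + 1 from seq where n < 100000
)
select n from seq;

-- 10,000,000 events: roughly 75% anonymous (null USER_ID),
-- the rest assigned to a random registered user.
insert into EVENTS (TYPE, USER_ID)
with recursive seq (n) as
(
  select 1
  union all
  select n + 1 from seq where n < 10000000
)
select floor(rand() * 10),
       case when rand() < 0.75 then null
            else floor(1 + rand() * 100000)
       end
from seq;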
Let's replace that NOT IN with a NOT EXISTS and correlated subquery:
select ID
from USERS
where ID in (1, 7, 2431, 87142, 32768)
and not exists
(
select 1
from EVENTS
where USER_ID = USERS.ID
and TYPE = 7
);
This version runs in 0.01 seconds, which is more like what I expected.
Time to compare execution plans. The first plan is from the NOT IN query, the second is from the NOT EXISTS version.
+----+--------------------+--------+------------+----------------+-----------------+-----------------+---------+------+------+----------+--------------------------+
| id | select_type | table | partitions | type | possible_keys | key | key_len | ref | rows | filtered | Extra |
+----+--------------------+--------+------------+----------------+-----------------+-----------------+---------+------+------+----------+--------------------------+
| 1 | PRIMARY | USERS | NULL | range | PRIMARY | PRIMARY | 4 | NULL | 5 | 100.00 | Using where; Using index |
| 2 | DEPENDENT SUBQUERY | EVENTS | NULL | index_subquery | EVENTS_USER_IDX | EVENTS_USER_IDX | 5 | func | 195 | 10.00 | Using where |
+----+--------------------+--------+------------+----------------+-----------------+-----------------+---------+------+------+----------+--------------------------+
+----+--------------------+--------+------------+-------+-----------------+-----------------+---------+------------------+------+----------+--------------------------+
| id | select_type | table | partitions | type | possible_keys | key | key_len | ref | rows | filtered | Extra |
+----+--------------------+--------+------------+-------+-----------------+-----------------+---------+------------------+------+----------+--------------------------+
| 1 | PRIMARY | USERS | NULL | range | PRIMARY | PRIMARY | 4 | NULL | 5 | 100.00 | Using where; Using index |
| 2 | DEPENDENT SUBQUERY | EVENTS | NULL | ref | EVENTS_USER_IDX | EVENTS_USER_IDX | 5 | example.USERS.ID | 97 | 10.00 | Using where |
+----+--------------------+--------+------------+-------+-----------------+-----------------+---------+------------------+------+----------+--------------------------+
Almost identical: both select rows from the USERS table, and then use a nested-loops join (“dependent subquery”) to retrieve rows from the EVENTS table. Both claim to use EVENTS_USER_IDX to select rows in the subquery. And they estimate similar numbers of rows at each step.
But look more closely at the join types. The NOT IN version uses index_subquery, while the NOT EXISTS version uses ref. Also look at the ref column: the NOT EXISTS version uses an explicit reference to the outer column, while the NOT IN version uses a function. What's going on here?
The index_subquery join type indicates that MySQL will scan the index to find relevant rows for the subquery. Could that be the problem? I don't think so, because EVENTS_USER_IDX is “narrow”: it only has one column, so the engine should not have to read a lot of blocks to find rows corresponding to the IDs from the outer query (indeed, I've tried a variety of queries to exercise this index, and all run in a few hundredths of a second).
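For example, the kind of index probe I mean looks something like this (this particular query is only an illustration, not one of the specific queries from my tests):

-- Illustrative probe of EVENTS_USER_IDX: look up a handful of USER_ID values directly.
select USER_ID, count(*)
from EVENTS
where USER_ID in (1, 7, 2431, 87142, 32768)
group by USER_ID;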
For more information, I turned to the “extended” execution plan. To see this plan, prefix the query with explain extended and follow it with show warnings.
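Concretely, for the NOT IN query above, that looks like this:

explain extended
select ID
from USERS
where ID in (1, 7, 2431, 87142, 32768)
and ID not in
(
select USER_ID
from EVENTS
where TYPE = 7
and USER_ID is not null
);

show warnings;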
Here's what you get from the NOT IN query (reformatted for clarity):
/* select#1 */ select `example`.`USERS`.`ID` AS `ID`
from `example`.`USERS`
where ((`example`.`USERS`.`ID` in (1,7,2431,87142,32768))
and (not(
(`example`.`USERS`.`ID`,
(
(
(`example`.`USERS`.`ID`) in EVENTS on EVENTS_USER_IDX checking NULL where ((`example`.`EVENTS`.`TYPE` = 7) and (`example`.`EVENTS`.`USER_ID` is not null)) having
(`example`.`EVENTS`.`USER_ID`)))))))
I haven't been able to find an explanation of “on EVENTS_USER_IDX checking NULL”, but here's what I think is happening: the optimizer believes that it is executing an IN query that can have NULL in the results; it does not consider the null check in the where clause when making this decision. As a result, it will examine the 7.5 million rows where USER_ID is null, along with the few dozen rows where it matches the values from the outer query. And by “examine,” I mean that it will read the table row and only then apply the is not null condition. Moreover, based on the time taken to run the query, I think it's doing this for every one of the five candidate values in the outer query, which works out to roughly 5 × 7.5 million, or nearly 40 million, row reads.
So, bottom line: whenever you're thinking of using an IN or NOT IN subquery on a nullable column, rethink and use EXISTS or NOT EXISTS instead.