DZone
Thanks for visiting DZone today,
Edit Profile
  • Manage Email Subscriptions
  • How to Post to DZone
  • Article Submission Guidelines
Sign Out View Profile
  • Post an Article
  • Manage My Drafts
Over 2 million developers have joined DZone.
Log In / Join
Refcards Trend Reports Events Over 2 million developers have joined DZone. Join Today! Thanks for visiting DZone today,
Edit Profile Manage Email Subscriptions Moderation Admin Console How to Post to DZone Article Submission Guidelines
View Profile
Sign Out
Refcards
Trend Reports
Events
Zones
Culture and Methodologies Agile Career Development Methodologies Team Management
Data Engineering AI/ML Big Data Data Databases IoT
Software Design and Architecture Cloud Architecture Containers Integration Microservices Performance Security
Coding Frameworks Java JavaScript Languages Tools
Testing, Deployment, and Maintenance Deployment DevOps and CI/CD Maintenance Monitoring and Observability Testing, Tools, and Frameworks
Partner Zones AWS Cloud
by AWS Developer Relations
Culture and Methodologies
Agile Career Development Methodologies Team Management
Data Engineering
AI/ML Big Data Data Databases IoT
Software Design and Architecture
Cloud Architecture Containers Integration Microservices Performance Security
Coding
Frameworks Java JavaScript Languages Tools
Testing, Deployment, and Maintenance
Deployment DevOps and CI/CD Maintenance Monitoring and Observability Testing, Tools, and Frameworks
Partner Zones
AWS Cloud
by AWS Developer Relations
The Latest "Software Integration: The Intersection of APIs, Microservices, and Cloud-Based Systems" Trend Report
Get the report
  1. DZone
  2. Data Engineering
  3. Databases
  4. MySQL: When NOT IN Is Not Equal to NOT EXISTS

MySQL: When NOT IN Is Not Equal to NOT EXISTS

In MySQL, is "NOT IN" the same thing as "NOT EXISTS"? The answer is: not always. And, of course, nulls are at fault.

Keith Gregory user avatar by
Keith Gregory
·
Feb. 14, 17 · Tutorial
Like (0)
Save
Tweet
Share
12.88K Views

Join the DZone community and get the full member experience.

Join For Free

When you want to perform a difference operation between two tables, you have a choice: NOT EXISTS with a correlated subquery, or NOT IN. The latter is arguably simpler to write and makes the intent of the query more obvious. And modern database systems will optimize the two queries into similar execution plans, handling the correlation between outer and inner queries (I say “modern” because when I was working with Oracle 7.3 in the mid-90s I learned the hard way that it did not).

There is one key difference between the two constructs: if the subquery returns a NULL in its results then the NOT IN condition will fail, because null is neither equal-to nor not-equal-to any other value. But if you guard against that, they should be equivalent — indeed, some sources will tell you that NOT IN is faster and therefore preferred.

This post is about one case where it's dramatically slower, and nulls are to blame.

Consider the following two tables, which might be used to track clickstream data. Since we track both anonymous and registered users, EVENTS.USER_ID is nullable. However, when the user is not null, the secondary index has high cardinality.

create table USERS
(
  ID    integer auto_increment primary key,
  ...
)

create table EVENTS
(
  ID      integer auto_increment primary key,
  TYPE    smallint not null,
  USER_ID integer
  ...
)

create index EVENTS_USER_IDX on EVENTS(USER_ID);


OK, now let's use these tables: From a small set of users, we want to find the ones that haven't had a particular event. Using a NOT IN, and ensuring that nulls don't appear in the inner results, the query looks like this:

select  ID
from    USERS
where   ID in (1, 7, 2431, 87142, 32768)
and     ID not in
        (
        select  USER_ID
        from    EVENTS
        where   TYPE = 7
        and     USER_ID is not null
        );


For my test dataset, the USERS table has 100,000 rows and the EVENTS table has 10,000,000, of which approximately 75% have a null USER_ID. I'm running on my laptop, which has a Core i7 processor, 12 GB of RAM, and an SSD.

And I consistently get runtimes of around 2 minutes, which is … wow.

Let's replace that NOT IN with a NOT EXISTS and correlated subuery:

select  ID
from    USERS
where   ID in (1, 7, 2431, 87142, 32768)
and     not exists
        (
        select  1
        from    EVENTS
        where   USER_ID = USERS.ID
        and     TYPE = 7
        );


This version runs in 0.01 seconds, which is more what I expected.

Time to compare execution plans. The first plan is from the NOT IN query, the second is from the NOT EXISTS.

+----+--------------------+--------+------------+----------------+-----------------+-----------------+---------+------+------+----------+--------------------------+
| id | select_type        | table  | partitions | type           | possible_keys   | key             | key_len | ref  | rows | filtered | Extra                    |
+----+--------------------+--------+------------+----------------+-----------------+-----------------+---------+------+------+----------+--------------------------+
|  1 | PRIMARY            | USERS  | NULL       | range          | PRIMARY         | PRIMARY         | 4       | NULL |    5 |   100.00 | Using where; Using index |
|  2 | DEPENDENT SUBQUERY | EVENTS | NULL       | index_subquery | EVENTS_USER_IDX | EVENTS_USER_IDX | 5       | func |  195 |    10.00 | Using where              |
+----+--------------------+--------+------------+----------------+-----------------+-----------------+---------+------+------+----------+--------------------------+
+----+--------------------+--------+------------+-------+-----------------+-----------------+---------+------------------+------+----------+--------------------------+
| id | select_type        | table  | partitions | type  | possible_keys   | key             | key_len | ref              | rows | filtered | Extra                    |
+----+--------------------+--------+------------+-------+-----------------+-----------------+---------+------------------+------+----------+--------------------------+
|  1 | PRIMARY            | USERS  | NULL       | range | PRIMARY         | PRIMARY         | 4       | NULL             |    5 |   100.00 | Using where; Using index |
|  2 | DEPENDENT SUBQUERY | EVENTS | NULL       | ref   | EVENTS_USER_IDX | EVENTS_USER_IDX | 5       | example.USERS.ID |   97 |    10.00 | Using where              |
+----+--------------------+--------+------------+-------+-----------------+-----------------+---------+------------------+------+----------+--------------------------+


Almost identical: both select rows from the USERS table, and then use a nested-loops join (“dependent subquery&rquo;) to retrieve rows from the EVENTS table. Both claim to use EVENTS_USER_IDX to select rows in the subquery. And they estimate similar numbers of rows at each step.

But look more closely at the join types. The NOT IN version uses index_subquery, while the NOT EXISTS version uses ref. Also look at the ref column: the NOT EXISTS version uses an explicit reference to the outer column, while the NOT IN uses a function. What's going on here?

The index_subquery join type indicates that MySQL will scan the index to find relevant rows for the subquery. Could that be the problem? I don't think so, because the EVENTS_USER_IDX is “narrow”: it only has one column, so the engine should not have to read a lot of blocks to find rows corresponding to the IDs from the outer query (indeed, I've tried a variety of queries to exercise this index, and all run in a few hundredths of a second).

For more information, I turned to the “extended” execution plan. To see this plan, prefix the query with explain extended, and follow it with show warnings. Here's what you get from the NOT IN query (reformatted for clarity):

/* select#1 */  select `example`.`USERS`.`ID` AS `ID` 
                from    `example`.`USERS` 
                where   ((`example`.`USERS`.`ID` in (1,7,2431,87142,32768)) 
                        and (not(

    (`example`.`USERS`.`ID`,


     (

      (

       (`example`.`USERS`.`ID`) in EVENTS on EVENTS_USER_IDX checking NULL where ((`example`.`EVENTS`.`TYPE` = 7) and (`example`.`EVENTS`.`USER_ID` is not null)) having 

        (`example`.`EVENTS`.`USER_ID`))))))) 





I haven't been able to find an explanation of “on EVENTS_USER_IDX checking NULL” but here's what I think is happening: the optimizer believes that it is executing an IN query that can have NULL in the results; it does not consider the null check in the where clause when making this decision. As a result, it will examine the 7.5 million rows where USER_ID is null, along with the few dozen rows where it matches the values from the outer query. And by “examine,” I mean that it will read the table row and only then apply the is not null condition. Moreover, based on the time taken to run the query, I think it's doing this for every one of the candidate values in the outer query.

So, bottom line: whenever you're thinking of using an IN or NOT IN subquery on a nullable column, rethink and use EXISTS or NOT EXISTS instead.

Database MySQL

Published at DZone with permission of Keith Gregory, DZone MVB. See the original article here.

Opinions expressed by DZone contributors are their own.

Popular on DZone

  • The Path From APIs to Containers
  • REST vs. Messaging for Microservices
  • How to Submit a Post to DZone
  • HTTP vs Messaging for Microservices Communications

Comments

Partner Resources

X

ABOUT US

  • About DZone
  • Send feedback
  • Careers
  • Sitemap

ADVERTISE

  • Advertise with DZone

CONTRIBUTE ON DZONE

  • Article Submission Guidelines
  • Become a Contributor
  • Visit the Writers' Zone

LEGAL

  • Terms of Service
  • Privacy Policy

CONTACT US

  • 600 Park Offices Drive
  • Suite 300
  • Durham, NC 27709
  • support@dzone.com
  • +1 (919) 678-0300

Let's be friends: