{{announcement.body}}
{{announcement.title}}

Understanding Hash Joins in MySQL 8

DZone 's Guide to

Understanding Hash Joins in MySQL 8

See when and how to use Hash Joins in MySQL 8.

· Database Zone ·
Free Resource

married-couple-holding-hands

In MySQL 8.0.18 there is a new feature called Hash Joins, and I wanted to see how it works, and in which situations it can help us. Here, you can find a nice detailed explanation about how it works under the hood.

The high-level basics are the following: if there is a join, it will create an in-memory hash table based on one of the tables and will read the other table row by row, calculate a hash, and do a lookup on the in-memory hash table.

Great, but Does This Give Us Any Performance Benefits?

First of all, this only works on fields that are not indexed, so that is an immediate table scan, and we usually do not recommend doing joins without indexes because it is slow. Here is where Hash Joins in MySQL can help because it will use an in-memory hash table instead of Nested Loop.

You may also like: An Overview of Adaptive Joins.

Let’s do some tests and see. First, I created the following tables:

CREATE TABLE `t1` (
`id` int(11) NOT NULL AUTO_INCREMENT ,
`c1` int(11) NOT NULL DEFAULT '0',
`c2` int(11) NOT NULL DEFAULT '0',
PRIMARY KEY (`id`),
KEY `idx_c1` (`c1`)
) ENGINE=InnoDB;

CREATE TABLE `t2` (
`id` int(11) NOT NULL AUTO_INCREMENT ,
`c1` int(11) NOT NULL DEFAULT '0',
`c2` int(11) NOT NULL DEFAULT '0',
PRIMARY KEY (`id`),
KEY `idx_c1` (`c1`)
) ENGINE=InnoDB;


I have inserted 131072 random rows into both tables.

mysql> select count(*) from t1;
+----------+
| count(*) |
+----------+
| 131072   |
+----------+


First test – Hash Joins

Run a join based on c2, which is not indexed.

mysql> explain format=tree select count(*) from t1 join t2 on t1.c2 = t2.c2\G
*************************** 1. row ***************************
EXPLAIN: -> Aggregate: count(0)
-> Inner hash join (t2.c2 = t1.c2) (cost=1728502115.04 rows=1728488704)
-> Table scan on t2 (cost=0.01 rows=131472)
-> Hash
-> Table scan on t1 (cost=13219.45 rows=131472)

1 row in set (0.00 sec)


We have to use explain format=tree to see if the Hash Join will be used or not. It is going to be a nested loop, which I think it is very misleading. I have already filed a bug report because of this, and in the ticket, you can see some comments from developers saying:

The solution is to stop using traditional EXPLAIN (it will eventually go away).

So, this is not going to be fixed in traditional explain and we should start using the new way.

Back to the query; we can see it is going to use Hash Join for this query, but how fast is it?

mysql> select count(*) from t1 join t2 on t1.c2 = t2.c2;
+----------+
| count(*) |
+----------+
| 17172231 |
+----------+
1 row in set (0.73 sec)


0.73s for a more than 17m rows join table. Looks promising.

Second Test – Non-Hash Joins

We can disable it with an optimizer switch or optimizer hint.

mysql> select /*+ NO_HASH_JOIN (t1,t2) */ count(*) from t1 join t2 on t1.c2 = t2.c2;
+----------+
| count(*) |
+----------+
| 17172231 |
+----------+
1 row in set (13 min 36.36 sec)


Now, the same query takes more than 13 minutes. That is a huge difference, and we can see a Hash Join helps a lot here.

Third Test – Joins Based on Indexes

Let’s create indexes and see how fast a join based on indexes is.

create index idx_c2 on t1(c2);
create index idx_c2 on t2(c2);

mysql> select count(*) from t1 join t2 on t1.c2 = t2.c2;
+----------+
| count(*) |
+----------+
| 17172231 |
+----------+
1 row in set (2.63 sec)
2.6s


 Hash Join is even faster than the Index-based join in this case.

However, I was able to force the optimizer to use Hash Joins even if an index is available by using ignore index:

mysql> explain format=tree select count(*) from t1 ignore index (idx_c2) join t2 ignore index (idx_c2) on t1.c2 = t2.c2 where t1.c2=t2.c2\G
*************************** 1. row ***************************
EXPLAIN: -> Aggregate: count(0)
-> Inner hash join (t2.c2 = t1.c2) (cost=1728502115.04 rows=17336898)
-> Table scan on t2 (cost=0.00 rows=131472)
-> Hash
-> Table scan on t1 (cost=13219.45 rows=131472)

1 row in set (0.00 sec)


I still think it would be nice if I can tell the optimizer with a hint to use Hash Joins even if an index is available. This way we do not have to ignore indexes on all the tables. I have created a feature request for this.

However, if you read my first bug report carefully, you can see a comment from a MySQL developer, which indicates this might not necessary:

BNL (Block Nested-Loop) will also go away entirely at some point, at which point this hint will be ignored.

That could mean they are planning to remove BNL joins in the future and maybe replace it with Hash join.

Limitations

We can see a Hash Join can be powerful, but there are limitations:

  • As I mentioned, it only works on columns that do not have indexes (or you have to ignore them).
  • It only works with equi-join conditions.
  • It does not work with LEFT or RIGHT JOIN.

I would like to see a status metric as well to monitor how many times Hash Join was used, and for this, I filled another feature request.

Conclusion

Hash Join seems like a very powerful new join option, and we should keep an eye on this because I would not be surprised if we get some other features in the future as well. In theory, it would be able to do Left and Right joins as well, and, as we can see, in the comments on the bug report, that Oracle has plans for it in the future.


Further Reading

Topics:
database ,mysql ,hash ,hash join ,left join ,right join ,crud ,sql server ,tutorial

Published at DZone with permission of

Opinions expressed by DZone contributors are their own.

{{ parent.title || parent.header.title}}

{{ parent.tldr }}

{{ parent.urlSource.name }}