A Look at How Postgres Executes a Tiny Join - Part 2
A Look at How Postgres Executes a Tiny Join - Part 2
In this post, I’ll show you exactly how Postgres executed a tiny join consisting of just a few records using the hash join algorithm.
Join the DZone community and get the full member experience.Join For Free
Using a Bitmask to Choose a Hash Bucket
Now that it has a hash, what does Postgres do with it? You can see a clue above in the C comments:
The best hash table sizes are powers of 2. There is no need to do mod a prime (mod is sooo slow!). If you need less than 32 bits, use a bitmask.
A hash table consists of an array of “buckets,” which are a series of pointers to linked lists. Initially Postgres creates an empty array of bucket pointers just before starting to scan the publications:
As you can guess, Postgres saves each publication in one of the buckets in the hash table, based on the calculated hash value. Later, when it scans over the authors, it will be able to find the publications again quickly by recalculating the same hash values. Instead of scanning over all of the publications again, Postgres can just look up each publication’s author using the hash. The hash is a record of where each publication is saved in the hash table.
However, if two publications turn out to have the same author, as we have in our example, then Postgres will have to save them both in the same bucket. This is why each bucket is a linked list; each bucket has to save more than one publication.
Because our example has three publications, does Postgres use a hash table with three buckets? Or with two buckets, because of the repeated author value? No. It actually uses 1024 buckets! Why 1024? For two reasons: First, Postgres was designed to query large amounts of data. Its hash join algorithm was optimized to handle extremely large data sets, containing millions of records or even more. A table containing three records is truly tiny! Postgres doesn’t bother with small hash tables and uses a minimum size of 1024.
And, why a power of two? This makes it easier to decide which bucket to use for a given hash. Instead of trying to return hash values that match the number of buckets, it’s easier and faster to always return very large values. What Postgres does instead is distribute the large hash values evenly over the number of buckets it does have. By choosing a power of two for the bucket array size, Postgres can use a fast bitwise operation to decide which bucket to save each publication in, like this:
Above, you can see how Postgres decides where to put “Edgar Codd” in the hash table: It subtracts one from the number of buckets: 1024-1 = 1023. Written in binary this is 1111111111. Then using your microprocessor’s binary computing circuits, Postgres quickly masks out the left bits, and keeps just the 10 least significant or rightmost bits. This yields 0000001111 binary, or the number 15. Using this fast calculation, Postgres decides to save Edgar in bucket #15:
Postgres also saves the title string, because it will need it later to produce the final result set. Along with the two strings, Postgres saves the hash value and a “next” pointer that will form the linked list.
Building the Rest of the Hash Table
Postgres now continues to scan over the publications, arriving at the second publication.
We have Edgar again! Clearly, he was a central figure behind database theory. Calculating the hash again for the same string will always return the same value: 2130627599, yielding bucket #15 a second time. We know the Edgar Codd records will always appear in bucket 15.
Also notice that Postgres saves each new publication at the head of the linked list–this means we have the second Edgar publication first on the left, and Edgar’s first publication second on the right. As we’ll see next, this yields the reverse order of Edgar’s records we saw above in the conceptual algorithm.
Finally, Postgres continues scanning and saves the third publication in the hash table; this time Postgres calculates a hash for “Jim Gray:”
You can see this time the 10 rightmost bits of 3344886182 evaluate to 422. So, Postgres saves Jim in bucket #422. Drawing the bucket array more to scale it might look something like this:
After saving all the publications in the hash table, Postgres can now scan over the authors table:
Now, finding the matching publication is simple. Instead of scanning over all the publications, Postgres simply calls the hash function again on the name string from the authors table, and repeats the bitmask operation. Because the first author record is Edgar, Postgres knows the matching publications will be in bucket #15.
In our tiny example, the only records in bucket 15 will be for Edgar Codd. But, remember in a large SQL query there might be millions of publications. It’s possible that publications with different authors might appear in this bucket. This would happen because either:
- The hash function returned the same hash number for two different author strings. This is possible but unlikely. In Computer Science this would be known as a hash collision.
- The 10 least significant bits of the hash were the same. For millions of publications this would happen frequently. However, as the number of records in the join increases Postgres uses more and more bits in the bitmask. 1024 (10 bits) was the minimum number it uses for our tiny query. Still, hash table buckets in practice will contain multiple key values.
Therefore, Postgres has to check each author in the matching bucket to be sure that it’s a match. This process is known as scanning the bucket. To do this, Postgres first checks the hash values:
This is a simple numerical comparison and so is quite fast. And if the hashes are the same, Postgres checks the actual strings just in case the hash function did return the same hash for different strings:
Because the author names match, Postgres can finally perform the join! To do this, it projects the columns that our query selects into a single joined record, in the desired order:
This becomes the first record in our result set.
Returning Multiple Records: The Hash Join State Machine
One of the most beautiful and important aspects of the Postgres implementation is the way it orchestrates building up and searching the hash table in the midst of a larger enclosing SQL expression. To see this for yourself, take a look at the hash join implementation, in nodeHashJoin.c.
("ExecHashJoin" method - view on postgresql.org)
Postgres calls ExecHashJoin once for each record in the join result set. For our example with 3 result records, Postgres calls ExecHashJoin three times. ExecHashJoin keeps track of how many times it has been called, and what it needs to do next, using a state machine.
The best way to understand how this state machine works, and how it fits into the larger structure of Postgres’s architecture, is to imagine that we asked for one record at a time. For example, imagine that we select just a single record from the join:
select title, company from publications, authors where author = name limit 1
By appending limit 1 we tell Postgres to stop after 1 record. For this query, to return just one record, ExecHashJoin will use the following states in its state machine:
Here’s what ExecHashJoin does to obtain the first joined record:
- HJ_BUILD_HASHTABLE: This code builds the hash table by scanning over all the publications records, as we saw above. Postgres calls publications the “inner relation.”
- HJ_NEED_NEW_OUTER: This code starts scanning the “outer relation” or the authors table in this example, and returns a single record.
- HJ_SCAN_BUCKET: This code takes one outer relation record (an author) and looks for the matching inner relation records in the hash table (publications).
Now, imagine that I ask Postgres for two records, by using limit 2:
select title, company from publications, authors where author = name limit 2
The second time Postgres calls ExecHashJoin, it only executes HJ_NEED_NEW_OUTER and HJ_SCAN_BUCKET – it already created the hash table the first time it was called:
Postgres pays the large price of scanning the entire inner relation and building the hash table as soon as you ask for one record. Returning the second and all subsequent records is much faster because Postgres already has the hash table.
If you read the C code, you’ll see some interesting optimizations. For example, Postgres actually scans the outer relation first to get a single record, just in case it might be empty. (This is what the C comment above refers to.) There’s no need to build a hash table if we’re not going to look up any values! Also, the HJ_FILL_INNER and HJ_FILL_OUTER states handle executing right or left outer joins respectively. ExecHashJoin implements these as well.
By using a state machine like this Postgres can execute this join inside the context of a large, complex SQL statement. It could be that we are joining together result sets from complex inner SQL clauses, or that the result set from this join becomes part of a larger expression. The state inside of ExecHashJoin allows Postgres to keep track of what it was doing–and of what it needs to do next–in the appropriate place on the execution stack.
The last state value handled by ExecHashJoin, HJ_NEED_NEW_BATCH, handles the case where the hash table doesn’t fit into the server’s memory. In this case, Postgres will create a series of hash tables and save some of them out to disk in “batch files.” This algorithm is what the term Hybrid Hashjoin refers to.
When I have time, I’d love to write about how Postgres handles a large join instead of a tiny one: How do batch files work? What configuration settings have an effect on batch files and join performance? And there’s also an interesting optimization Postgres uses for frequently occurring join key values.
Postgres does some amazing things internally to speed up your queries; it’s time to shed some light on the great work the Postgres open source community has done over the years!
(Make sure to read part 1 if you haven't already.)
Published at DZone with permission of Pat Shaughnessy , DZone MVB. See the original article here.
Opinions expressed by DZone contributors are their own.