Aside from saving and retrieving data, the primary feature of a relational database is the ability to execute join queries, to relate data in one table with data from another. While many developers are turning to NoSQL solutions, joining one set of data values with another remains one of our most common and important use cases while writing server code.
But what does the word “join” actually mean? And what really happens when I execute a join query? Too often we take our tools for granted, without really understanding what they are doing. This month I decided to look at the source code for PostgreSQL, a popular open source database server, to see how it implements join queries.
Reading and experimenting with the Postgres source code turned out to be a great learning experience. Today I’d like to report my observations; I’ll show you exactly how Postgres executed a tiny join consisting of just a few records, using the hash join algorithm. In future articles I’ll show you some interesting optimizations Postgres uses for larger joins, and other computer science algorithms at work inside of Postgres.
What is a Join?
But, before we get to the Postgres source code, let’s start by reviewing what join queries are. Here’s an introduction from the excellent Postgres documentation:
The Postgres docs then explain how to use joins: inner vs. outer joins, joining a table with itself, etc. But, I’m intrigued by the highlighted disclaimer. What is that “more efficient manner”? And, how could Postgres possibly get away with not “actually comparing each possible pair of rows”?
A Tiny Join
As an example today, let’s work with two tables: publications, which contains three ground breaking computer science journal articles I never read in college, and authors, which records where each author worked.
> select * from publications; title | author | year --------------------------------------------------------+------------+------ A Relational Model of Data for Large Shared Data Banks | Edgar Codd | 1970 Relational Completeness of Data Base Sublanguages | Edgar Codd | 1972 The Transaction Concept: Virtues and Limitations | Jim Gray | 1981 (3 rows) > select * from authors; name | company ------------+------------------------- Edgar Codd | IBM Research Laboratory Jim Gray | Tandem Computers (2 rows)
Today’s goal is to understand exactly what happens when Postgres joins one table with the other:
> select title, company from publications, authors where author = name; title | company --------------------------------------------------------+------------------------- Relational Completeness of Data Base Sublanguages | IBM Research Laboratory A Relational Model of Data for Large Shared Data Banks | IBM Research Laboratory The Transaction Concept: Virtues and Limitations | Tandem Computers (3 rows)
A Conceptual Model for Joining Two Tables
Before we look at the algorithm Postgres actually uses, let’s review what join queries do conceptually. Above the documentation stated that Postgres implements joins by “comparing each possible pair of rows” and then selecting “the pairs of rows where these values match”.
Reading this, I imagine Postgres takes each publication and loops over all of the authors, looking for that publication’s author:
In blue on the left are the publications, and I show the author records on the right in green. This process of iterating over the rows in the authors table is known as a scan in the Postgres source code. We are scanning over all of the authors for the first publication, trying to find matching names.
What do we do with each publication-author pair? We have to evaluate the WHERE clause from my example SQL statement:
Do the names match? Yes. This pair should be included in the result set. What about the second pair?
Do these names match? This time they don’t–this pair of rows should be filtered out.
Once we have a matching pair of rows, we copy just the selected columns into a new, joined record and return that to the client:
A Nested Loop
What’s wrong with this conceptual model? It seems like a very simple, straightforward way of obtaining the values we need. If we proceed to scan through the rest of the publications, it produces the same result that Postgres does, although in a different order. (We’ll see why the order changes later.)
The problem is that it’s very inefficient. First we scan over all of the authors for the first row in the publications table:
And then, we repeat the same scan of the authors table for the second publication:
And again for the third row. To find all of the matching pairs, in fact, we need to loop over all the authors for each publication:
For my tiny query this isn’t a problem. There are 3*2 or 6 combinations of rows; comparing names 6 times would only take a few microseconds on a modern computer. However, as the number of rows increases in either table, the total number of comparisons will explode. If we have 1000 publications and 1000 authors, suddenly we would have to compare name strings 1000*1000 or 1 million times! Computer scientists describe this algorithm as O(n2).
But do we really need to search the entire authors table for each publication? “Edgar Codd” appears in the publications table twice–why do we need to scan the authors table for the same name more than once? After we find Edgar the first time, there should be some way of recording where he was so we can find him again. And, even if there were no repeated author names in publications, it still seems wasteful to loop over the authors table over and over again. There must be some way of avoiding all of these repeated scans.
And there is; we can use a hash table.
Avoiding Repeated Scans
The problem with our naive algorithm, the conceptual model from the Postgres documentation, is that we loop over the authors table over and over again. To avoid those repeated loops, imagine if we scanned the authors only once and then saved them in some kind of data structure:
Now that we have the author records, what do we need to do with them? Well, we have to scan the publications, obtain each publication’s author, and find the matching author records, if any. In other words, we need to be able to quickly and easily find the author record with a given name:
You’ve probably seen this data structure before; in fact, it might be something you use everyday in your own code. If you’re a Rubyist like me, you call this a hash. If you prefer Python it’s a dictionary, or in Clojure it’s hash map.
With all the authors organized by their names, we can scan over the publications and quickly find out if there’s a matching author record:
But what are hash tables, exactly? And, how do they work? If only we could go back in time and sneak back into our college Computer Science classroom again. But if you installed Postgres from source, using Homebrew or with some Linux package manager, you already have an open source, world class implementation of the hash table algorithm right on your computer! To learn more about it, all we have to do is read the Postgres source code.
It turns out that for this query Postgres actually hashes the publications and then iterates over the authors. Before starting to execute a query, Postgres first parses the SQL we give it and generates a “query plan.” Probably because the publications table is larger (I’m not sure), Postgres’s query planner decides to save the publications, not the authors, in the hash table.
To do this, Postgres has to scan the publications just as we did above:
And for each publication, Postgres selects just two of the three columns: author and title. Postgres refers to the query plan and finds out it will need the author for the WHERE join condition, and the title for the final SELECT returning the result set. It leaves the year values behind.
This process of selecting the desired attributes from the matching pair is known in the Postgres C source code as a projection. We “project” a few values from one set of columns to another. (The term project is actually much older even than Postgres; Edgar Codd first used it in this context in A Relational Model of Data for Large Shared Data Banks back in 1970.)
Next, Postgres calculates a hash based on the author string. A hash is some integer value that can be calculated quickly in a repeatable, reproducible way. For the same author string, e.g. “Edgar Codd,” Postgres always calculates the same hash number. As we’ll see in a moment, Postgres uses the hash value to decide where to save the author name in the hash table.
You can find Postgres’s hash algorithm in a C file called hashfunc.c. Even if you’re not a C developer, there are extensive code comments explaining what’s going on, along with a link to an article written by Bob Jenkins, who developed the algorithm in 1997.
("hash_any" method - view on postgresql.org)
In my example, Postgres passes “Edgar Codd,” the string value in the author column in the first publication record, to hash_any. The complex bitwise calculations in hash_any step over the characters in Edgar’s name and return this hash value:
(Stay tuned for part 2 of this entry.)