SQL DISTINCT Is Not a Function
SQL DISTINCT Is Not a Function
Let's see where people go wrong when using the DISTINCT keyword in SQL, how you should be using it, and some tips for PostgreSQL users.
Join the DZone community and get the full member experience.Join For Free
A very common misconception I often encounter with SQL users is the idea that
DISTINCT is something like a function, and that it can take parenthesised arguments. Just recently, I’ve seen this Stack Overflow question where the OP was looking for a way to express this in jOOQ:
Notice the parentheses around
(emp.id), which look as though this is some special kind of
DISTINCT usage, which is akin to a
DISTINCT function. The idea is often that:
- The behaviour is somewhat different from omitting the parentheses
- The performance is faster, because only the ID needs to be considered for distinctness
This Is Incorrect
These claims are incorrect, of course. There is no semantic or performance difference between the two. The parentheses are merely parentheses around a column expression, in a similar way as you would use parentheses to influence operator precedence. Think of it this way:
In the above example, we do not apply a “DISTINCT function” to the expression
emp.id + 1. We merely placed parentheses around a column expression
emp.id + 1 to make sure the addition happens before the multiplication. The
DISTINCT operator happens after the projection, always. If SQL had used a more logical syntax, rather than following English grammar (it was originally called Structured English QUEry Language, or SEQUEL), then we would write the OP’s statement like this:
DISTINCT operation always happens after the projection (
SELECT clause content), and is applied to the entirety of the projection. There is no way in standard SQL to apply distinctness only to parts of the projection (there is in PostgreSQL, see further down).
What Would it Mean Anyway?
We can revert the question back to the OP and ask ourselves, what would it mean for a
DISTINCT operation to apply to only one column, anyway? Let’s assume this data set:
|id |fname|name| |---|-----|----| |1 |A |A | |1 |B |B |
If we applied
DISTINCT only to the ID column (and didn’t project anything else), clearly, we’d only get one row as a result:
|id | |---| |1 |
But if we wanted to also project
NAME, which row would “win”? Would we display the first or the second row, or any random row? The behaviour would be undefined, and SQL doesn’t like undefined behaviour, so this is not possible. The only reasonable application of
DISTINCT is always on the entire projection.
Fortunately (or to add more to syntactic confusion: unfortunately), PostgreSQL has implemented an extension to the SQL standard. With
DISTINCT ON, it is effectively possible to apply distinctness only to parts of the projection:
The output is now what the OP desired (but couldn’t use, because they were using MySQL):
|id |fname|name | |-----------|-----|-----| |1 |A |A |
I personally don’t like
DISTINCT ON. While it is very useful, no doubt, it makes something that is already very difficult to explain to SQL beginners even more complicated. With a “more reasonable” syntax, the query would be written like this:
With this syntactic order of operation, there would be no doubt about the semantics of
Published at DZone with permission of Lukas Eder , DZone MVB. See the original article here.
Opinions expressed by DZone contributors are their own.