How SQL DISTINCT and ORDER BY are Related

These two pieces of SQL may seem similar, but just how similar are they? We take an in-depth look.

Lukas Eder

Jul. 16, 18 · Analysis

Likes (4)

Comment

Save

36.8K Views

One of the things that confuse SQL users all the time is how DISTINCT and ORDER BY are related in a SQL query.

The Basics

Running some queries against the Sakila database, most people quickly understand:

SELECT DISTINCT length FROM film

This returns results in an arbitrary order because the database can (and might apply hashing rather than ordering to remove duplicates):

length |
-------|
129    |
106    |
120    |
171    |
138    |
80     |
...

Most people also understand:

SELECT length FROM film ORDER BY length

This will give us duplicates, but in order:

length |
-------|
46     |
46     |
46     |
46     |
46     |
47     |
47     |
47     |
47     |
47     |
47     |
47     |
48     |
...

And, of course, we can combine the two:

SELECT DISTINCT length FROM film ORDER BY length

Resulting in...

length |
-------|
46     |
47     |
48     |
49     |
50     |
51     |
52     |
53     |
54     |
55     |
56     |
...

Then Why Doesn't This Work?

Maybe somewhat intuitively, we may want to order the lengths differently, e.g. by title:

SELECT DISTINCT length FROM film ORDER BY title

Most databases fail this query with an exception like Oracle's:

ORA-01791: not a SELECTed expression

At first sight, this seems funny, because this works after all:

SELECT length FROM film ORDER BY title

Yielding:

length |
-------|
86     |
48     |
50     |
117    |
130    |
...

We could add the title to illustrate the ordering:

length |title                       |
-------|----------------------------|
86     |ACADEMY DINOSAUR            |
48     |ACE GOLDFINGER              |
50     |ADAPTATION HOLES            |
117    |AFFAIR PREJUDICE            |
130    |AFRICAN EGG                 |

So, How Are These Different?

We have to rewind and check out the logical order of SQL operations (as opposed to the syntactic order). And always remember, this is the logical order, not the actual order, executed by the optimizer.

When we write something like this:

SELECT DISTINCT length FROM film ORDER BY length

The logical order of operations is:

FROM clause, loading the FILM table
SELECT clause, projecting the LENGTH column
DISTINCT clause, removing distinct tuples (with projected LENGTH columns)
ORDER BY clause, ordering by the LENGTH column

If we look at this step by step, we have:

Step 1: SELECT * FROM film

The intermediary data set is something like:

film_id |title                       |length | ...
--------|----------------------------|-------| ...
1       |ACADEMY DINOSAUR            |86     | ...
2       |ACE GOLDFINGER              |48     | ...
3       |ADAPTATION HOLES            |50     | ...
4       |AFFAIR PREJUDICE            |117    | ...
5       |AFRICAN EGG                 |130    | ...
...     |...                         |...    | ...

Step 2: SELECT length ...

The intermediary data set is something like:

length |
-------|
86     |
48     |
50     |
117    |
130    |
...
86     | <-- duplicate

Step 3: SELECT DISTINCT length ...

Now we're getting a new random order (due to hashing) and no duplicates anymore:

length |
-------|
129    |
106    |
120    |
171    |
138    |
...

Step 4: ... ORDER BY length

And we're getting:

length |
-------|
46     |
47     |
48     |
49     |
50     |
...

It seems obvious.

So Why Did This Work?

Remember, this query worked:

SELECT length FROM film ORDER BY title

Even if after projecting the LENGTH column, it seems as though it is no longer available for sorting, it really is, according to the SQL standard and to common sense. There is a concept called extended sort key columns in the SQL standard, which means the above query has a slightly different order of operations (apart from the fact that there is no DISTINCT operation):

FROM clause, loading the FILM table
SELECT clause, projecting the LENGTH column from the select list and the TITLE from the extended sort key columns
ORDER BY clause, ordering by the TITLE column
SELECT clause (implicit), projecting only the LENGTH column, discarding the TITLE column

Again, this is what happens logically. Database optimisers may choose other ways to implement this. By example:

Step 1: SELECT * FROM film

Same as before

film_id |title                       |length | ...
--------|----------------------------|-------| ...
1       |ACADEMY DINOSAUR            |86     | ...
2       |ACE GOLDFINGER              |48     | ...
3       |ADAPTATION HOLES            |50     | ...
4       |AFFAIR PREJUDICE            |117    | ...
5       |AFRICAN EGG                 |130    | ...
...     |...                         |...    | ...

Step 2: SELECT length, title...

We get that synthetic extended sort key column TITLE along with the LENGTH column that we requested

length |title                       |
-------|----------------------------|
86     |ACADEMY DINOSAUR            |
114    |ALABAMA DEVIL               |
50     |ADAPTATION HOLES            |
117    |AFFAIR PREJUDICE            |
168    |ANTITRUST TOMATOES          |
...

Step 3: ...ORDER BY title

...we can now order by that column

length |title                       |
-------|----------------------------|
86     |ACADEMY DINOSAUR            |
48     |ACE GOLDFINGER              |
50     |ADAPTATION HOLES            |
117    |AFFAIR PREJUDICE            |
130    |AFRICAN EGG                 |
...

Step 4: SELECT length

...and finally discard it because we never wanted it

length |
-------|
86     |
48     |
50     |
117    |
130    |

So Why Can't We Use DISTINCT?

If we try to run this:

SELECT DISTINCT length FROM film ORDER BY title

We would get an additional DISTINCT operation in our logical set of operations:

FROM clause, loading the FILM table
SELECT clause, projecting the LENGTH column from the select list and the TITLE from the extended sort key columns
DISTINCT clause, removing duplicate (LENGTH, TITLE) values... Ooops
ORDER BY clause, ordering by the TITLE column
SELECT clause (implicit), projecting only the LENGTH column, discarding the TITLE column

The problem is, since we have synthetically added the extended sort key columnTITLE to the projection in order to be able to ORDER BY it, DISTINCT wouldn't have the same semantics anymore as can be seen here:

SELECT count(*)
FROM (
  SELECT DISTINCT length FROM film
) t;

SELECT count(*)
FROM (
  SELECT DISTINCT length, title FROM film
) t;

Yielding:

140
1000

All titles are distinct. There is no way this query can be executed reasonably. Either DISTINCT doesn't work (because the added extended sort key column changes its semantics), or ORDER BY doesn't work (because after DISTINCT we can no longer access the extended sort key column).

A more constructed example. T contains this data:

CREATE TABLE t (a INT, b INT);
INSERT INTO t VALUES (1, 1);
INSERT INTO t VALUES (1, 2);
INSERT INTO t VALUES (2, 3);
INSERT INTO t VALUES (1, 4);
INSERT INTO t VALUES (2, 5);

What would this query produce?

SELECT DISTINCT a FROM t ORDER BY b;

Clearly, we should only get 2 rows with values 1, 2, because of DISTINCT a:

A 
--
1
2

Now, how do we order these by B? There are 3 values of B associated A = 1 and 2 values of B associated with A = 2:

A   B
------------------
1   Any of 1, 2, 4
2   Any of 3, 5

Should we get 1, 2 or 2, 1 as a result? Impossible to tell.

But There Are Some Exceptions

The way I read the SQL standard, the following exception should be possible. The SQL standard ISO/IEC 9075-2:2016(E), 7.17 <query expression>, Syntax Rules 28) d) i) 6) references the "Left normal form derivation." But I may be reading this wrong, see also a discussion on the PostgreSQL mailing list:
https://www.postgresql.org/message-id/20030819103859.L69440-100000%40megazone.bigpanda.com

In any case, it still makes sense to me. For instance, we can form expressions on the columns in the select list. This is totally fine in MySQL (strict mode) and Oracle:

SELECT DISTINCT length 
FROM film 
ORDER BY mod(length, 10), length;

It will produce:

PostgreSQL doesn't allow this because the expression MOD(LENGTH, 10) is not in the select list. How to interpret this? We're looking again at the order of SQL operations:

FROM clause, loading the FILM table
SELECT clause, projecting the LENGTH column from the select list. MOD(LENGTH, 10) does not have to be put in the extended sort key columns, because it can be fully derived from the select list.
DISTINCT clause, removing duplicate LENGTH values ... all fine, because we don't have the verboten extended sort key columns
ORDER BY clause, ordering by the mod(LENGTH, 10), LENGTH columns. Totally fine, because we can derive all of these order by expressions from expressions in the select list

Makes sense, right?

Back to our constructed table T:

We are allowed to write:

SELECT DISTINCT a, b FROM t ORDER BY a - b;

We would get:

Again, the order by expressions can be derived completely from the select list. This also works in Oracle:

SELECT DISTINCT a - b FROM t ORDER BY abs(a - b);

The select list contains a column A - B, so we can derive any ORDER BY expression from it. But these don't work:

SELECT DISTINCT a - b FROM t ORDER BY a;
SELECT DISTINCT a - b FROM t ORDER BY b;
SELECT DISTINCT a - b FROM t ORDER BY b - a;

It is easy to build an intuition for why these don't work. Clearly, the data set we want is:

A - B  A             B             B - A
------------------------------------------
-3     Any of 1, 2   Any of 4, 5   3
-1     Any of 2, 1   Any of 3, 2   1
 0     Any of 1      Any of 1      0

Now, how are we supposed to order these by A, B or B - A? It looks as though we should be able to sort by B - A in this case. We could derive a complicated transformation of expressions that can be reasonably transformed into each other, such as A - B = -(B - A), but this simply isn't practical. The expression in the projection is A - B, and that's the only expression we can re-use in the ORDER BY. For example, we could even do this in Oracle:

SELECT DISTINCT a - b FROM t ORDER BY abs((a - b) + (a - b));

Or start using aliases:

SELECT DISTINCT a - b AS x FROM t ORDER BY abs(x + x);

PostgreSQL DISTINCT ON

PostgreSQL has a nice feature for when you want to order by something from within a group of non-distinct values. Remember how this wasn't possible?

SELECT DISTINCT length FROM film ORDER BY title

Well, this is:

SELECT DISTINCT ON (title) length FROM film ORDER BY title

And we're getting now:

length |
-------|
86     |
48     |
50     |
117    |
130    |
169    |
62     |
...

What we're essentially doing is, we take all distinct lengths, and for each group of identical lengths, we're taking the top title as a criteria to order by. In a way, this is syntax sugar for this:

SELECT length
FROM (
  SELECT length, MIN(title) title
  FROM film
  GROUP BY length
) t
ORDER BY title

Which is what most people really want, when they ORDER BY something they cannot really order by.

Conclusion

The SQL language is quirky. This is mostly because the syntactical order of operations doesn't match the logical order of operations. The syntax is meant to be human readable (remember Structured English Query Language?) but when reasoning about a SQL statement, we would often like to directly write down the logical order of operations.

In this article, we haven't even touched the implications of adding

Which add more fun rules to what's possible and what isn't. Our previous article on the true logical order of SQL operations explains this completely.

Need more explanation? Check this out.

Database sql MySQL

Published at DZone with permission of Lukas Eder. See the original article here.

Opinions expressed by DZone contributors are their own.

Related

Trending