Over a million developers have joined DZone.
{{announcement.body}}
{{announcement.title}}

Faster SQL Through Choosing Natural Keys Over Surrogate Keys

DZone's Guide to

Faster SQL Through Choosing Natural Keys Over Surrogate Keys

It's surrogate keys versus natural keys. Which one should you be using? Is the answer really that obvious? It might not be.

· Performance Zone
Free Resource

Evolve your approach to Application Performance Monitoring by adopting five best practices that are outlined and explored in this e-book, brought to you in partnership with BMC.

There are many many opinions out there regarding the old surrogate key vs. natural key debate. Most of the times, surrogate keys (i.e., sequence-generated IDs) win because they’re much easier to design.

  • They’re easy to keep consistent across a schema (i.e., every table has an ID column, and that’s always the primary key).
  • They’re thus a no-brainer to add. When you create a new table, you don’t need to worry about any candidate keys.
  • They’re guaranteed to be unique, because they have absolutely no business value, only a technical value.

Great. So why bother using natural keys in the first place?

Well, there is a very compelling reason!

Performance!

Whenever you introduce surrogate keys, this means that your key data becomes completely meaningless. From a design perspective, that’s not too bad. You can easily join that other table to get the interesting, meaningful information that hides behind the surrogate foreign key value — for example, in our Sakila database:

We have a typical many-to-many relationship modeled with a relationship table between the FILM table and the CATEGORY table — state-of-the-art normalization. But check out this interesting thing:

  • The FILM_CATEGORY relationship table doesn’t contain any interesting information at all — just the relationships.
  • The category table only contains a single useful column: The NAME column.
  • The remaining columns (CATEGORY_ID and LAST_UPDATE) have absolutely no meaning.

With this in mind, we could design a much simpler schema, where we use the category name as a natural key, and in fact, we don’t even need the CATEGORY table anymore, we can now remove it (that’s optional here). Check this out:

Now, if we run a query like the following one against our entire Sakila schema:

SELECT c.name, count(*)
FROM film_actor fa USING (actor_id)
JOIN film_category fc USING (film_id)
JOIN category c USING (category_id)
WHERE actor_id = 1
GROUP BY c.name
ORDER BY count(*) DESC

The query finds all categories a given actor played in, and the number of films that the given actor played in each category. For instance, this could be the result:

NAME       COUNT(*)
-------------------
Horror     3
Classics   2
Family     2
New        2
Games      2
Animation  1
Sports     1
Children   1
...

With an alternative schema where the category NAME has been moved to a new FILM_CATEGORY_NATURAL table, we could run this much simpler query:

SELECT fc.name, count(*)
FROM film_actor fa 
JOIN film_category_natural fc 
  USING (film_id)
WHERE actor_id = 1
GROUP BY fc.name
ORDER BY count(*) DESC

Notice how we can omit an entire JOIN.

The execution plans (here on Oracle) are quite different. Check this out.

Before:

After:

The cost has gone from 8 to 5 on our trivial database. That’s a significant improvement for something this simple. Imagine the improvement if we could roll out this kind of better query throughout the system?

What if we benchmark the two queries using the same silly benchmark, as always, repeating each statement 100 times?

SET SERVEROUTPUT ON
DECLARE
  v_ts TIMESTAMP;
  v_repeat CONSTANT NUMBER := 100;
BEGIN
  v_ts := SYSTIMESTAMP;

  FOR i IN 1..v_repeat LOOP
    FOR rec IN (
      SELECT c.name, count(*)
      FROM actor a
      JOIN film_actor fa USING (actor_id)
      JOIN film f USING (film_id)
      JOIN film_category fc USING (film_id)
      JOIN category c USING (category_id)
      WHERE actor_id = 1
      GROUP BY c.name
      ORDER BY count(*) DESC
    ) LOOP
      NULL;
    END LOOP;
  END LOOP;

  dbms_output.put_line('Statement 1 : ' || (SYSTIMESTAMP - v_ts));
  v_ts := SYSTIMESTAMP;

  FOR i IN 1..v_repeat LOOP
    FOR rec IN (
      SELECT fc.name, count(*)
      FROM film_actor fa 
      JOIN film_category_natural fc 
        USING (film_id)
      WHERE actor_id = 1
      GROUP BY fc.name
      ORDER BY count(*) DESC
    ) LOOP
      NULL;
    END LOOP;
  END LOOP;

  dbms_output.put_line('Statement 2 : ' || (SYSTIMESTAMP - v_ts));
END;
/

The results are drastic:

Statement 1 : 00:00:00.122070000
Statement 2 : 00:00:00.051179000

A factor of 2.5x faster!

The JOIN Is Completely Unnecessary

The only reason why we joined the CATEGORY table each time is because we needed to display the meaningful business value of a category to the user, the CATEGORY.NAME. We could have avoided the JOIN if that was not a requirement, but displaying surrogate keys to the user (the CATEGORY_ID) would be rather harsh, wouldn’t it?

And we’re doing this all the time with all sorts of tables. We have:

  • Category tables (category name is a good candidate key).
  • Translation tables (label is a good candidate key).
  • Country tables (ISO 3166 country codes are a good candidate key).
  • Language tables (ISO 639 language codes are a good candidate key).
  • Currency tables (ISO 4217 currency codes are a good candidate key).
  • Stock symbol tables (ISIN security codes are a good candidate key).
  • …and much more!

Working with natural keys can be quite cumbersome. But in some entities, the internationally standardized codes are really good candidate keys, and most of the time, they’re sufficient. What does the LANGUAGE_ID 47 even mean? It means nothing. An experienced DBA will remember, after a while, that it means “English.” But wouldn’t EN be a much better value?

You would have EN as a primary key and foreign key value, so chances are, because everyone (including frontend developers who probably hard-code some translations anyway) knows what language EN is (but no one knows what 47 means), you will almost never need to join the language table again – except in those rare cases where you want to work with secondary language columns, such as, for instance, DESCRIPTION.

Now, imagine what happens if we search for English entries, or as in our previous example, for films of category “Classics”? Our entire JOIN graph would be simplified.

(Another place where we don’t need additional surrogate keys is the relationship table. In this particular case, there’s no such key anyway)

Conclusion

This should be quite a straightforward refactoring for many applications. If your tables are extremely obvious picks for a natural key (like the above), then do use natural keys. Your queries will immediately be faster — not necessarily much faster, but probably you can speed up a significant number of queries, i.e., take the load off your entire system. Plus, your database will be more user-friendly.

And all of this at the price of not using the identical table design everywhere. I mean, when was having an identical table design a real business case anyway, right?

Side Note

In some cases, you could take this even one step further and denormalize your schema by putting categories as arrays or XML or JSON data structures directly inside your films. You’ll lose the normalization benefits, but you could further win in terms of performance. For more details, read this very interesting article (about PostgreSQL).

Learn tips and best practices for optimizing your capacity management strategy with the Market Guide for Capacity Management, brought to you in partnership with BMC.

Topics:
performance ,sql ,join ,speed ,keys

Published at DZone with permission of Lukas Eder, DZone MVB. See the original article here.

Opinions expressed by DZone contributors are their own.

THE DZONE NEWSLETTER

Dev Resources & Solutions Straight to Your Inbox

Thanks for subscribing!

Awesome! Check your inbox to verify your email so you can start receiving the latest in tech news and resources.

X

{{ parent.title || parent.header.title}}

{{ parent.tldr }}

{{ parent.urlSource.name }}