Math and SQL Part 6: The Problem with NULLs
Math and SQL Part 6: The Problem with NULLs
Join the DZone community and get the full member experience.Join For Free
Take 60 minutes to understand the Power of the Actor Model with "Designing Reactive Systems: The Role Of Actors In Distributed Architecture". Brought to you in partnership with Lightbend.
This will be the final installment on Math and SQL and will cover the problem with NULLs. NULL handling is probably the most poorly thought-out feature of SQL and is inconsistent generally with the relational model. Worse, a clear mathematical approach to NULLs is impossible with SQL because too many different meanings are attached to the same value.
Unfortunately, nulls are also indispensable because wider tables are more expressive than narrower tables. This makes advice such as "don't allow nulls in your database" somewhat dangerous because one ends up having to add them back in fairly frequently.
At the same time understanding the problems that NULLs introduce is key to avoiding the worst of the problems and managing the rest.
Definition of a Null Set
A null set is simply a set with no members. This brings us to the most obvious case of the use of a NULL, used when an outer join results in a row not being found. This sort of use by itself doesn't do too much harm but the inherent semantic ambiguity of "what does that mean?" also means you can't just substitute join tables for nullable columns and solve the problems that NULLs bring into the database. This will hopefully become more clear below.
Null as Unknown
The first major problem surfaces when we ask the question, "when I do a left join and the row to the right is not found, does that mean we don't know the answer yet or that there is no value associated?" In all cases, a missing result from an outer join will sometimes mean that the answer is not yet known, if only because we are still inserting the data in stages. But it can also mean that maybe there is an answer and that there is no value associated. In almost all databases, this may also be the case in this situation.
But then there is no additional harm done in allowing NULLs to represent unknowns in the tables themselves, right?
Handling NULLs as unknown values complicates database design and introduces problems so many experts like Chris Date tend to be generally against their use. The problem is that using joins doesn't solve the problem but instead only creates additional failure cases to be aware of. So very often times, people do use NULL in the database to mean unknown despite the problems.
NULL as unknown introduces problems to predicate logic because it introduces three value logic (true, false, and unknown), but these are typically only problems when one is storing a value (as opposed to a reference such as a key) in the table. 1 + NULL IS NULL. NULL OR FALSE IS NULL. NULL OR TRUE IS TRUE. This makes things complicated. But sometimes we must....
Null as Not Applicable
One severe antipattern that is frequently seen is the use of NULL to mean "Not Applicable" or "No Value." There are a few data types which have no natural empty/no-op types. Prime among these are numeric types. Worse, Oracle treats NULL as the same value as an empty string for VARCHAR types.
Now, the obvious problem here is that the database does't know here that NULL is not unknown, and therefore you end up having to track this yourself, use COALESCE() functions to convert to sane values, etc. In general, if you can avoid using NULL to mean "Not Applicable" you will find that worthwhile.
Now, if you have to do this, one strategy to make this manageable is to include other fields to tell you what the null means. Consider for example:
CREATE TABLE wage_class ( id int not null, label text not null ); INSERT INTO wage_class VALUES(1, 'salary'), (2, 'hourly'); CREATE TABLE wage ( ssn text not null, emp_id int not null, wage_class int not null references wage_class(id), hourly_wage numeric, salary numeric, check (wage_class = 1 or salary is null), check (wage_class = 2 or hourly_wage is null) );This approach allows us to select and handle logic based on the wage class and therefore we know based on the wage_class field whether hourly_wage is applicable or not. This is far cleaner and allows for better handling in queries than just putting nulls in and expecting them to be semantically meaningful. This solution can also be quite helpful because it ensures that one does not accidentally process an hourly wage as a salary or vice versa.
What Nulls Do to Predicate Logic
Because NULLs can represent unknowns, they introduce three-valued predicate logic. This itself can be pretty nasty. Consider the very subtle difference between:
WHERE ssn like '1234%' AND salary < 50000vs
WHERE ssn like '1234%' AND salary < 50000 IS NOT FALSEThe latter will pull in hourly employees as well, as they have a NULL salary.
Nulls and Constraints
Despite all the problems, NULLs have become a bit of a necessary evil. Constraints are a big part of the reason why.
Constraints are far simpler to maintain if they are self-contained in a tuple and therefore require no further table access to verify. This means that wider tables admit to more expression relating to constraints than narrow tables.
In the example above, we can ensure that every hourly employee has no salary, and every salaried employee has no hourly wage. This level of mutual exclusion would not be possible if we were to break off salaries and wages into separate, joined tables.
Nulls and Foreign Keys
Foreign keys are a special case of NULLs where the use is routine and poses no problems. NULL always means "no record referenced" in this context and because of the specifics of three-valued boolean logic, they always drop out of join conditions.
NULLs in foreign keys make foreign key constraints and 5th Normal Form possible in many cases where it would not be otherwise. Consequently they can be used routinely here with few if any ill effects.
What Nulls Should Have Looked Like: NULL, NOVALUE, UNKNOWN
In retrospect, SQL would be cleaner if we could be more verbose about what we mean by a NULL. UNKNOWN could then be reserved for rare cases where we really must need to store a record with incomplete data in it. NULL could be returned from outer joins, and NOVALUE could be used for foreign keys and places where we know the field is not applicable.
Published at DZone with permission of Chris Travers , DZone MVB. See the original article here.
Opinions expressed by DZone contributors are their own.