NULL Confusion
Join the DZone community and get the full member experience.
Join For FreeThe SQL database offers a domain-independent NULL value. This is a
terrible thing, and should be treated with a depth of respect and fear.
With two meanings for a single value, hilarity always ensues.
Before
using NULL values in a database, read things like "Null Values
in Fuzzy Datasets" and "Null Values
Revisited in Prospect of Data Integration".
See
this question,
and the answers -- people are very, very confused by NULL. The issue
is that the SQL NULL conflates several separate and unrelated concepts.
- Not Available Now. We expect that the data will be discovered later. That is, this NULL is a work-around for a process issue.
- Not Applicable (or Optional). This means that disjoint subtypes have been unified into a single table with optional or not-applicable attributes. This is an optimization choice. This also be due to state changes: the attribite is not used in one state, but will be used in another state.
With two meanings for a single value, hilarity always ensues.
Further Confusion
The
NULL value doesn't participate in comparisons or indexes. This is --
apparently -- shocking to some people. Here's a nice summary in "Why
NULL never compares false to anything in SQL". Also, some notes
for Oracle users in "Understand
Oracle null values" and "Oracle
conditions and how they handle NULL"
Because
of this "NULL doesn't compare" problem, people get baffled by the use
of NVL (or IFNULL) functions.
The Rules of
NULL
The first rule of NULL is "Don't."
Don't define a data model that depends on nullability. Define a model
where each class is distinct and all the attributes are required.
This will lead to a number of focused, distinct class definitions. A
large number. Get over it. Don't pre-optimize a design to reduce your
number of classes or tables.
Once you have a
model that makes sense -- one that you can prove will work -- one that
precisely matches the semantics of your problem -- you can optimize.
But don't start out by pre-optimizing or taking "obvious" short-cuts.
The
second rule of NULL is "Don't conflate Availability with
Applicability." If you have data that is not available, you may
have serious issues in the process you're trying to automate. Often,
this is because you have multiple views of a single entity. If you have
data that's not applicable, you've done your design wrong -- you put
disjoint subtypes into a single class definition. Factor them (for now)
into class definitions where all attributes are required.
If
you have inapplicable or unavailable data, you must factor things into
pieces where all attributes are required. You will then find that your
"thing with optional attributes" is either "thing that changes state" or
"thing with multiple subsets of attributes that must be joined
together." Later, you can think about optimizing.
"But,"
you object, "it's a single entity, I can't meaningfully
decompose it."
Consider the root causes for
missing data before you take this position too seriously.
Let's
take a concrete example. We're doing econometric analysis of
businesses. We have a "Business" entity that has various kinds of
measurements from various sources. Not all sources of information have
data on all businesses. There's a lot of "not available" data. Also,
depending on the type of business, there may be a certain amount of "not
applicable" data. (For example, not-for-profit corporations, privately
held companies and public companies all have different kinds of
reporting requirements.)
What you have is a
core "Business Name" entity with the minimal information. Name; maybe
other identifying information. But often, the name is all you know.
You
have a "Business Address" entity with mailing address. Small
businesses may have only one of these. Large businesses will have more
than one.
You have "Econometric Scoring From
Source X" which may have a normalized name, a different address and
other scores.
Conceptually, this is a single
entity, a business. But from three different points of view.
Initially, this is not a single class definition. It may be optimized
into a single table with NULLs. That's a lousy design, but very
popular. There are multiple addresses; scores change over time.
Implementing this as a "single-table-with-nulls" seems to be a bad
policy.
The third rule of NULL is "Only as
an optimization." If you can prove that a particular join is
nearly one-to-one, and you can prove that the cost of the join is too
high, then you can consider pre-joining and using NULLs.
Common
Mistakes
There are two common NULL-join
mistakes: optional joins and date ranges.
One
common mistake is attempting to join on an "optional" attribute. You
wind up with NVL functions in the WHERE clause. RED ALERT.
If
you have NVL functions in a WHERE clause then (1) you've defeated all
indexing and (2) you've reinvented the wheel.
An
NVL to a where clause is stand-in for a UNION. When you think you need
an NVL, you've got two subsets that you're tying to put together: the
subset with a non-NULL value and the subset with a NULL value. This is a
UNION, and the UNION will probably be faster than your NVL construct.
(If it isn't, good on you for benchmarking.)
Another
common mistake is date ranges. Folks insist on having "open-ended"
date ranges where the "end-of-time" is represented as a NULL. RED ALERT.
NULL already means not applicable or not available. The end of time
is both applicable and available. Don't add a new meaning to NULL.
Coding
the end of time as NULL is simply wrong. The end of time (for now) is
12-21-2112. It's an easy date to remember. It's a cooler date in
Europe than the US.
"But," the deeply
argumentative claim, "I can't have my application dependent on a mystery
constant like that". Lighten up. First, your app won't be running in
2112. Second, your app is full of mystery constants like that. You've
got codes, your users have codes they think are important. Your
localtime offset from GMT is a mystery constant. The start and end
dates for daylight savings time are mystery constants. Please. You
have lots of mystery constants. If you want to be "transparent", make
it a configuration parameter.
Just code the
end-of-time as a real date not a NULL and use ordinary date-range logic
with no silly NVL business.
Database
Data (computing)
Published at DZone with permission of Steven Lott, DZone MVB. See the original article here.
Opinions expressed by DZone contributors are their own.
Trending
-
Building A Log Analytics Solution 10 Times More Cost-Effective Than Elasticsearch
-
How To Design Reliable IIoT Architecture
-
Build a Simple Chat Server With gRPC in .Net Core
-
Managing Data Residency, the Demo
Comments