The '= NULL' Mistake and Other SQL NULL Heresies
The '= NULL' Mistake and Other SQL NULL Heresies
Let's explore NULL and SQL NULL-related calamities.
Join the DZone community and get the full member experience.Join For Free
The SQL Prompt Best Practice rule checks whether a comparison or expression includes a NULL literal ('NULL'), which in SQL Server, rather than result in an error, will simply always produce a NULL result. Phil Factor explains how to avoid this, and other SQL NULL-related calamities.
SQL Prompt has a code analysis rule (BP011) that checks whether a comparison or expression includes a
NULL literal (
'NULL'). These will always produce a
NULL result. To determine whether a datatype is or isn’t
Beyond that, working with a nullable datatype in an expression requires the use of the
NULLIF() functions, as appropriate, to deal with
NULL values safely.
What Is NULL?
SQL allows any datatype to have a
NULL value. This isn’t the same as a blank string or a zero integer. It boils down to the meaning ‘Unknown’. If you ask SQL theoreticians, they’ll say that it represents missing information and inapplicable information.
NULL string, for example, isn’t the same as knowing that you have a blank string. It means you don’t know what is in the string. If you are doing a database of your ancestors and you put the value of
NULL in for your great grandmother, it means that you don’t yet know her name, but it certainly doesn’t imply that you haven’t got a great grandmother.
NULL is a state, not a value. It is useful to have a state for data, and the father of the relational model, E.F.Codd, wanted two of them: ‘A-Values’ and ‘I-Values’, representing ‘Missing But Applicable’ and ‘Missing But Inapplicable’. Sadly, there is but one.
NULLs in a Comparison
If you’re not used to the concept of ‘null’ and you treat it like a number such as pi, then strange perplexing things will happen with the result of SQL Expressions. Many developers get into trouble when they want a
WHERE clause that filters on a column with a
NULL value in it.
Imagine you have a database of Irish saints, each with their saints’ days, contained in a date column. If a saint doesn’t have a saint’s day, then this column is
NULL for that row.
CREATE TABLE #IrishSaintsDay (name NVARCHAR(50), saintsDay DATETIME2(7)); INSERT INTO #IrishSaintsDay VALUES (N'Abbán moccu Corbmaic', N'2019-03-16T00:00:00'), (N'Abel of Reims', N'2019-08-05T00:00:00'), (N'Abran', NULL), (N'Adalgis', NULL), (N'Adomnán', NULL), (N'Adomnán of Coldingham', NULL), (N'Áed mac Bricc', N'2019-11-10T00:00:00'), (N'Áed of Clonmore', N'2019-04-11T00:00:00'), (N'Aidan of Lindisfarne', NULL), (N'Ailbe of Emly', N'2019-09-12T00:00:00'), (N'Ailerán', NULL), (N'Alto of Altomünster', NULL), (N'Andrew the Scot', NULL), (N'Assicus (Tassach)', N'2019-04-14T00:00:00'), (N'Athracht', N'2019-08-11T00:00:00'), (N'Autbod', NULL), (N'Baíthéne', NULL)
The full table build with 190 saints (there are plenty more) is included with this article (CreateIrishSaintsDays) so you can play along.
You want to find all the saints that don’t have a saint’s day:
SELECT Name FROM #IrishSaintsDay AS ISD WHERE ISD.saintsDay = NULL
This seems to go well. No rows are returned, which surely means there are no Irish saints without a saint’s day, which is a relief. Wrong! If you are using SQL Prompt, you will see a squiggly green line under that
ISD.saintsDay = NULL expression.
Upon investigation, you will see this…
ISD.saintsDay = NULL means “where the saints day is equal to something unknown“. This sort of expression must return
NULL regardless of whether you compare it to a value or to another
NULL. Putting a
NULL in the value for the saints day could mean we don’t know the saint’s day, or it could mean that the saint never had his or her day, so how can one ever say that something unknown is the same as something else unknown?
To be accurate, the SQL Standard says that this use of
WHERE<expression> = NULL is invalid syntax and shall lead to an error message or an exception. Unfortunately, in SQL Server it doesn’t: it just returns ‘incorrect’ results.
What you meant to ask is whether the value in the datatype is unknown, not whether it is equal to something that is unknown. In other words, the SQL standard insists that you cannot use
<>NULL, and instead should use
We change it…
NULLs in Expressions
It isn’t just a comparison with
NULL that will always return
NULL. When you do arithmetic on numeric datatypes, and one of the operands or arguments is
NULL, it is likely that the result will be
NULL. Imagine we were to try running this query:
Yes, it returns
NULL and you get the warning from SQL Prompt with the squiggly green line for the BP011 warning. However, it is an odd mistake to make.
What is more likely is that a developer creates a variable and forgets to initialize it.
DECLARE @pi FLOAT DECLARE @MutiplicationFactor INT SELECT @Pi=22.0/(7.0*@MutiplicationFactor) SELECT @pi
The result will be null. Obviously, if you have seven apples and you multiply that by an unknown quantity of apples, then the result is an unknown quantity.
We hurriedly initialize the variable with a value.
DECLARE @pi FLOAT DECLARE @MutiplicationFactor INT SELECT @MutiplicationFactor =1 SELECT @Pi=22.0/(7.0*@MutiplicationFactor) SELECT @pi
This returns 3.14285714285714 (which isn’t quite pi, I know, but doing the Gregory-Leibniz Nilakantha approximation might be a distraction).
So, why didn’t SQL Prompt give a squiggly line when we used a null variable in an expression? It will only protest if you use a null literal (
NULL) in either an expression or a condition. It is doing a static test, so it cannot detect if you have a
NULL value in a variable. Even if it could do so, it wouldn’t be a good thing to flag because then we’d get a warning on the code that uses nullable variables in the useful and valuable ways that you intended.
NULLs can save a lot of tedious work in building strings from components, some of which are optional (concatenation); it also makes building lists easier and in creating reports where the data isn’t continuous.
NULLs in String Concatenations
How could we use
NULL to our advantage? Well, to show you an obvious example, we’ll use the
person.person table in
AdventureWorks that has the various parts of what we understand to be a person’s name.
Some of these constituents must be there (
NOTNULL), such as
LastName, but others can be
NULL, meaning you just don’t know. You may not know a person’s preferred title of address, middle names, or suffix. If you know that someone doesn’t have a middle name, for example, you can legitimately put a blank string. Alternatively, for a part of the name that isn’t necessary for searching and sorting or identification, then entering ‘unknown’ is a reasonable option. Had the creators of the data known definitively that no-one had suffixes, for example, then there would have been no need to use a nullable column to store it.
We are told by the moody boss that he needs a list of all names in the
Adventureworks database, not just first and last names, but all parts of the name that we have. We’ll do a subset of the ten most recent customers.
SELECT TOP 10 Title + FirstName + MiddleName + LastName + Suffix AS customer FROM Person.Person ORDER BY LastName DESC;
That didn’t go so well, did it? Again, we didn’t get BP011 warning because SQL Prompt, as discussed, merely detects
NULL literals in expressions or conditions.
NULLs in every row (the suffix was, in every case, unknown) and they ‘propagated’ to create a resulting
NULL. It makes sense. If you concatenate strings and you don’t know what is in one of the strings, the result of the concatenation must indeed be
NULL. From the perspective of SQL Server, you asked a silly question and got a silly, but correct, answer.
What you meant to say was ‘if you don’t know it, leave that bit out’, and for this task, the
COALESCE function is your friend. You just give it a list of parameters. It evaluates each in turn until it finds one that is not
NULL and it returns this as a value. If they are all
NULL, it returns a
NULL. In this case, we can use it to return a blank string if the value in the column is
NULL. Its potential power in dealing with
NULLs is great.
SELECT TOP 10 Coalesce(Title, '') + ' ' + FirstName + ' ' + Coalesce(MiddleName, '') + ' ' + LastName + Coalesce(Suffix, '') AS customer FROM Person.Person ORDER BY LastName DESC;
After your excitement dies down, you notice that the list doesn’t left-align; there are extra spaces here and there that will trip you up subsequently. You could have tidied this up with a mix of
REPLACE, but then other database developers would smirk at it when doing code reviews. So, you do this:
SELECT TOP 10 Coalesce(Title + ' ', '') + FirstName + Coalesce(' ' + MiddleName + ' ', ' ') + LastName + Coalesce(' ' + Suffix + ' ', '') AS customer FROM Person.Person ORDER BY LastName DESC;
Suddenly, you are over the pain barrier and letting
NULLs work for you. If there is a value in the
Title column, you add a space to it and return it. If the Title column has no value, then
NULL, when added to a string with a space in it, still results in a
NULL, and so the
COALESCE function examines the next parameter in the list and returns the blank string. Some columns in the list are nullable and some aren’t, so we can put spaces, or any other delimiter we want, between the column values, according to whether they are
NULL. We’ve done all this with only slight changes.
However, the moody boss is not happy. It turns out he wanted a comma-delimited list. OK. We now know all about dealing with delimiters because a space character is a delimiter of sorts.
We create a variable, which by its very nature is nullable. We want a comma delimiter between each string and if we just join strings together that we’ve added to a ‘,’ delimiter, we get a comma at the head of the list. To remove it, we use the
COALESCE function to make sure that the first name isn’t preceded with a comma. If the variable has never been set it is
NULL, so it is easy to have special conditions for the first string to be added to a variable. If you add a comma to a
NULL you get a
NULL, so before any values are inserted into the variable, you won’t get the initial delimiter.
DECLARE @CustomerList NVARCHAR(4000) SELECT TOP 10 @CustomerList=coalesce(@CustomerList+', ','') + Coalesce(Title + ' ', '') + FirstName + Coalesce(' ' + MiddleName + ' ', ' ') + LastName + Coalesce(' ' + Suffix + ' ', '') FROM Person.Person ORDER BY LastName DESC; SELECT @CustomerList
Which gives the result that we want:
Mr. Michael J. Zwilling, Michael J Zwilling, Jake Zukowski, Ms. Judy N. Zugelder, Ms. Patricia M. Zubaty, Ms. Carla J. Zubaty, Karin Zimprich, Ms. Karin Zimprich, Tiffany E Zimmerman, Marc Zimmerman
However, the moody boss now comes up with an odd request. He only wants people from the list who have a single initial for the middle name or nothing at all. Those with full middle names will be dealt with in a separate list.
Simple. We just exclude those whose middle-names are more than one character long:
SELECT TOP 10 Coalesce(Title + ' ', '') + FirstName + Coalesce(' ' + MiddleName + ' ', ' ') + LastName + Coalesce(' ' + Suffix + ' ', '') AS customer FROM Person.Person WHERE LEN(middlename) < 2 ORDER BY LastName DESC;
Unfortunately, this does not work. It gives us all those with single initials (so we no longer see any who have a single initial followed by a dot). However, we’ve lost all the people with no middle name:
If the middle name is NULL, then the answer to “is NULL less that 2” is unknown, and the row is excluded.
COALESCE() would again have come to the rescue.
SELECT TOP 10 Coalesce(Title + ' ', '') + FirstName + Coalesce(' ' + MiddleName + ' ', ' ') + LastName + Coalesce(' ' + Suffix + ' ', '') AS customer FROM Person.Person WHERE LEN(Coalesce(middlename,'')) < 2 ORDER BY LastName DESC;
This now gives the correct answer.
COALESCE() is by far the most useful function for dealing with nullable columns or variables safely. However, there are two other functions that I’ll mention just briefly:
ISNULL()– does less than
COALESCE()and could almost have been devised to confuse the unwary because it looks like
ISNULL, which means something else. It is also not standard SQL.
NULLif two expressions passed as parameters are equal. This is SQL Standard, but is just syntactic candy on top of the
CASEexpression, which is easier to understand.
For more details, see How to Get NULLs Horribly Wrong in SQL Server.
NULLs in Aggregate Values
Now, if you are adding up a list of numbers, you will want the
NULLs to propagate. If you add a list of numbers, and you don’t know one of the numbers, then the result should be
NULL, meaning that there is no way of knowing what the sum of that list is. Let’s put this to the test.
SELECT Sum(TheValue) AS [sum of The First value], Sum(OtherValue) AS [sum of The other value] FROM (VALUES (1, 1), (3, 1), (7, 2), (NULL, 0)) AS f (TheValue, OtherValue);
Surprisingly, from what we’ve learned so far, this SQL aggregation expression returns 11 for the sum of the first column rather than
…but it will warn you that
NULL values were eliminated.
Why was the
NULL ignored rather than propagated? The SQL Standard, I believe, dropped the ball. Officially, ‘The inconsistencies in the SQL standard mean that it is not possible to ascribe any intuitive logical semantics to the treatment of nulls in SQL”.
The SQL Standard decided that all aggregate functions except
COUNT(*) and grouping functions should ignore
COUNT(*) function returns either a number or zero. All the other aggregate functions will ignore rows that contain
NULL and return
NULL only if the data set contains no rows or contains only rows with
NULLs as arguments to the aggregate function.
NULLs is not the same as replacing
NULL with zero. We can demonstrate this with our existing example by getting the average calculated in different ways.
SELECT Avg(TheValue) AS [Average of The First value], Sum(TheValue)/Count(*) AS [Calculated Average of The First value], Avg(OtherValue) AS [Average of The second value], Sum(OtherValue)/Count(*) AS [Calculated Average of The second value] FROM (VALUES (Convert(FLOAT,1),Convert(FLOAT, 1)), (3, 1), (7, 2), (NULL, 0)) AS f (TheValue, OtherValue);
TheValue column, which had a
NULL value, the calculated average wasn’t the same as the average produced by the aggregate function. Why was that? The
COUNT(*) returned the number of rows whether or not they contained null values, whereas the
AVG function ignored the row if the column contained a
What I should have done was to specify the column in the
COUNT() aggregate. Then it can ignore rows whenever that column has a
NULL in it:
SELECT Avg(TheValue) AS [Average of The First value], Sum(TheValue)/Count(TheValue) AS [Calculated Average of The First value], Avg(OtherValue) AS [Average of The second value], Sum(OtherValue)/Count(OtherValue) AS [Calculated Average of The second value] FROM (VALUES (Convert(FLOAT,1),Convert(FLOAT, 1)), (3, 1), (7, 2), (NULL, 0)) AS f (TheValue, OtherValue);
OK. We’re feeling less confused now. When you ask for the sum or average, you are asking for all non-
NULL values in the column to be aggregated. If you are doing calculations which rely on the value of the count of the values, you must always specify the column in the
COUNT() aggregate so as to count only the non-null rows for that column.
NULLs With Grouping, Sorting and Deduplicating
The SQL Standard says that
NULL is a state of a datatype rather than a value. So then how do you sort data? You might expect from the behavior of
NULL values in expressions that sorting by a nullable column will result in those rows that have a
NULL value in a sort column being scattered randomly around the sort order. Not so. The
NULLs are sorted as if they were the same value, all together. Where in the sort order should they be? The SQL Standard allows you to specify whether they float to the top or sink to the bottom of the sort order with the keywords
NULLS LAST. SQL Server is always
We can try this with the saints of Ireland and their saint’s days.
SELECT * FROM #IrishSaintsDay ORDER BY saintsDay, name DESC;
SELECT * FROM #IrishSaintsDay ORDER BY saintsDay, name ASC;
You will find that in both cases, the result starts with
Like sorting, grouping is done as if all
NULL values are considered equal, in contrast to the way that they are treated in an expression. They are collected into a single group.
SELECT Count(*) AS [Number of saints], SaintsDay FROM #IrishSaintsDay GROUP BY #IrishSaintsDay.saintsDay
62 saints have an unknown or non-existent saint’s day.
UNION when a column has
NULLs in it? The
UNION operation will remove all duplicate rows. However, what if two rows both have
NULLs for one or columns but are otherwise-identical? If the database engine were unable to compare
NULLs, then it would be unable to say ‘deduplicate’ during
UNION operations. We can easily test this.
SELECT Count(*) FROM (SELECT * FROM #IrishSaintsDay WHERE name <'d' UNION SELECT * FROM #IrishSaintsDay)saints
The result is still 190, which is the number in the original table. The
UNION operation compared rows with NULLs but was otherwise the same, decided they were identical and removed the duplicate row!
If we had done a
UNIONALL instead of a
UNION, we’d have found that there were 272 rows because
UNIONALLdoesn’t remove duplicates from the result.
SELECT Count(*) FROM (SELECT * FROM #IrishSaintsDay WHERE name <'d' UNION all SELECT * FROM #IrishSaintsDay)saints
SQL also gives you the answer you expect if you use
SELECT DISTINCT name, SaintsDay FROM (SELECT * FROM #IrishSaintsDay WHERE name <'d' UNION all SELECT * FROM #IrishSaintsDay)saints
Again, only 190 saints are found.
To cut a long story short, whether you use
EXCEPT, SQL plays nice, figures out the answer you expect, and treats
NULL values as if they were a single value, in contrast to the way that it treats then in expressions or comparisons.
You will, however, be thinking to yourself, like professor Ron van der Meyden, ‘The inconsistencies in the SQL standard mean that it is not possible to ascribe any intuitive logical semantics to the treatment of nulls in SQL‘.
NULL values are an essential part of the relational database, because data can be both intentionally and unintentionally missing, and it can be unknown.
NULL can be assigned, but using ‘
<>NULL‘, or any other comparison operator, in an expression with
NULL as a value, is illegal in SQL and should trigger an error. It can never be correct. The expression will always return
Comparisons or arithmetic involving either a variable datatype or a nullable column are unlikely to be safe unless used with
ISNULL(). However, these functions can be very handy if used guardedly.
Grouping, sorting, and ‘uniquifying’ will behave intuitively by ignoring rows where the columns being specified have
NULLs in them, but when doing sums and averages, for example, care needs to be taken to specify the column in the
COUNT() aggregate, rather than using the
NULLs can trip up a programmer, yet provide a very versatile way of handling data, providing a range of software devices for dealing successfully with commercial database processes and reports. Don’t avoid
NULLs; instead, get familiar with them and their use.
Published at DZone with permission of Phil Factor . See the original article here.
Opinions expressed by DZone contributors are their own.