DZone
Thanks for visiting DZone today,
Edit Profile
  • Manage Email Subscriptions
  • How to Post to DZone
  • Article Submission Guidelines
Sign Out View Profile
  • Post an Article
  • Manage My Drafts
Over 2 million developers have joined DZone.
Log In / Join
Refcards Trend Reports Events Over 2 million developers have joined DZone. Join Today! Thanks for visiting DZone today,
Edit Profile Manage Email Subscriptions Moderation Admin Console How to Post to DZone Article Submission Guidelines
View Profile
Sign Out
Refcards
Trend Reports
Events
Zones
Culture and Methodologies Agile Career Development Methodologies Team Management
Data Engineering AI/ML Big Data Data Databases IoT
Software Design and Architecture Cloud Architecture Containers Integration Microservices Performance Security
Coding Frameworks Java JavaScript Languages Tools
Testing, Deployment, and Maintenance Deployment DevOps and CI/CD Maintenance Monitoring and Observability Testing, Tools, and Frameworks
Partner Zones AWS Cloud
by AWS Developer Relations
Culture and Methodologies
Agile Career Development Methodologies Team Management
Data Engineering
AI/ML Big Data Data Databases IoT
Software Design and Architecture
Cloud Architecture Containers Integration Microservices Performance Security
Coding
Frameworks Java JavaScript Languages Tools
Testing, Deployment, and Maintenance
Deployment DevOps and CI/CD Maintenance Monitoring and Observability Testing, Tools, and Frameworks
Partner Zones
AWS Cloud
by AWS Developer Relations
Building Scalable Real-Time Apps with AstraDB and Vaadin
Register Now

Trending

  • Building A Log Analytics Solution 10 Times More Cost-Effective Than Elasticsearch
  • How To Design Reliable IIoT Architecture
  • Build a Simple Chat Server With gRPC in .Net Core
  • Managing Data Residency, the Demo

Trending

  • Building A Log Analytics Solution 10 Times More Cost-Effective Than Elasticsearch
  • How To Design Reliable IIoT Architecture
  • Build a Simple Chat Server With gRPC in .Net Core
  • Managing Data Residency, the Demo
  1. DZone
  2. Data Engineering
  3. Data
  4. NULL Confusion

NULL Confusion

Steven Lott user avatar by
Steven Lott
·
Apr. 30, 10 · News
Like (0)
Save
Tweet
Share
4.08K Views

Join the DZone community and get the full member experience.

Join For Free
The SQL database offers a domain-independent NULL value. This is a terrible thing, and should be treated with a depth of respect and fear.


Before using NULL values in a database, read things like "Null Values in Fuzzy Datasets" and "Null Values Revisited in Prospect of Data Integration".


See this question, and the answers -- people are very, very confused by NULL. The issue is that the SQL NULL conflates several separate and unrelated concepts.

  • Not Available Now. We expect that the data will be discovered later. That is, this NULL is a work-around for a process issue.
  • Not Applicable (or Optional). This means that disjoint subtypes have been unified into a single table with optional or not-applicable attributes. This is an optimization choice. This also be due to state changes: the attribite is not used in one state, but will be used in another state.

With two meanings for a single value, hilarity always ensues.


Further Confusion


The NULL value doesn't participate in comparisons or indexes. This is -- apparently -- shocking to some people. Here's a nice summary in "Why NULL never compares false to anything in SQL". Also, some notes for Oracle users in "Understand Oracle null values" and "Oracle conditions and how they handle NULL"


Because of this "NULL doesn't compare" problem, people get baffled by the use of NVL (or IFNULL) functions.


The Rules of NULL


The first rule of NULL is "Don't." Don't define a data model that depends on nullability. Define a model where each class is distinct and all the attributes are required. This will lead to a number of focused, distinct class definitions. A large number. Get over it. Don't pre-optimize a design to reduce your number of classes or tables.


Once you have a model that makes sense -- one that you can prove will work -- one that precisely matches the semantics of your problem -- you can optimize. But don't start out by pre-optimizing or taking "obvious" short-cuts.


The second rule of NULL is "Don't conflate Availability with Applicability." If you have data that is not available, you may have serious issues in the process you're trying to automate. Often, this is because you have multiple views of a single entity. If you have data that's not applicable, you've done your design wrong -- you put disjoint subtypes into a single class definition. Factor them (for now) into class definitions where all attributes are required.


If you have inapplicable or unavailable data, you must factor things into pieces where all attributes are required. You will then find that your "thing with optional attributes" is either "thing that changes state" or "thing with multiple subsets of attributes that must be joined together." Later, you can think about optimizing.


"But," you object, "it's a single entity, I can't meaningfully decompose it."


Consider the root causes for missing data before you take this position too seriously.


Let's take a concrete example. We're doing econometric analysis of businesses. We have a "Business" entity that has various kinds of measurements from various sources. Not all sources of information have data on all businesses. There's a lot of "not available" data. Also, depending on the type of business, there may be a certain amount of "not applicable" data. (For example, not-for-profit corporations, privately held companies and public companies all have different kinds of reporting requirements.)


What you have is a core "Business Name" entity with the minimal information. Name; maybe other identifying information. But often, the name is all you know.


You have a "Business Address" entity with mailing address. Small businesses may have only one of these. Large businesses will have more than one.


You have "Econometric Scoring From Source X" which may have a normalized name, a different address and other scores.


Conceptually, this is a single entity, a business. But from three different points of view. Initially, this is not a single class definition. It may be optimized into a single table with NULLs. That's a lousy design, but very popular. There are multiple addresses; scores change over time. Implementing this as a "single-table-with-nulls" seems to be a bad policy.


The third rule of NULL is "Only as an optimization." If you can prove that a particular join is nearly one-to-one, and you can prove that the cost of the join is too high, then you can consider pre-joining and using NULLs.


Common Mistakes


There are two common NULL-join mistakes: optional joins and date ranges.


One common mistake is attempting to join on an "optional" attribute. You wind up with NVL functions in the WHERE clause. RED ALERT.


If you have NVL functions in a WHERE clause then (1) you've defeated all indexing and (2) you've reinvented the wheel.


An NVL to a where clause is stand-in for a UNION. When you think you need an NVL, you've got two subsets that you're tying to put together: the subset with a non-NULL value and the subset with a NULL value. This is a UNION, and the UNION will probably be faster than your NVL construct. (If it isn't, good on you for benchmarking.)


Another common mistake is date ranges. Folks insist on having "open-ended" date ranges where the "end-of-time" is represented as a NULL. RED ALERT. NULL already means not applicable or not available. The end of time is both applicable and available. Don't add a new meaning to NULL.


Coding the end of time as NULL is simply wrong. The end of time (for now) is 12-21-2112. It's an easy date to remember. It's a cooler date in Europe than the US.

"But," the deeply argumentative claim, "I can't have my application dependent on a mystery constant like that". Lighten up. First, your app won't be running in 2112. Second, your app is full of mystery constants like that. You've got codes, your users have codes they think are important. Your localtime offset from GMT is a mystery constant. The start and end dates for daylight savings time are mystery constants. Please. You have lots of mystery constants. If you want to be "transparent", make it a configuration parameter.

Just code the end-of-time as a real date not a NULL and use ordinary date-range logic with no silly NVL business.
Database Data (computing)

Published at DZone with permission of Steven Lott, DZone MVB. See the original article here.

Opinions expressed by DZone contributors are their own.

Trending

  • Building A Log Analytics Solution 10 Times More Cost-Effective Than Elasticsearch
  • How To Design Reliable IIoT Architecture
  • Build a Simple Chat Server With gRPC in .Net Core
  • Managing Data Residency, the Demo

Comments

Partner Resources

X

ABOUT US

  • About DZone
  • Send feedback
  • Careers
  • Sitemap

ADVERTISE

  • Advertise with DZone

CONTRIBUTE ON DZONE

  • Article Submission Guidelines
  • Become a Contributor
  • Visit the Writers' Zone

LEGAL

  • Terms of Service
  • Privacy Policy

CONTACT US

  • 600 Park Offices Drive
  • Suite 300
  • Durham, NC 27709
  • support@dzone.com

Let's be friends: