Thursday, March 5, 2009

SQL Server And NULL Values, Revisited

Ladies and Gentlemen, it was brought to my attention yesterday that our mutual friend, The Caped Crusader, hates NULL values. It isn’t easy for me to say this. As you all know, SQLBatman has always been something of a dark knight, wandering the deepest recesses of the intertubes to keep the forces of evil at bay. And yet, even as I say this, I fear that our hero has turned to darkness, his mind consumed by the very madness that he once set out to destroy. We can only hope that SQLBatman will persevere over his demons, and come out the other side even stronger from the battle.

For The Love Of NULL Values

If you haven’t yet done so, take a minute to read through SQLBatman’s treatise on NULL values, SQL Server And Null Values. His article, in which he displays greater than usual amounts of moderate decorum, lists the following as the evils brought on by the use of NULLs in a SQL Server database (I’ve paraphrased, of course):

  • NULL values are not immediately apparent to an application developer or data consumer; as such, they are an entry vector for assumption errors.
  • Query Tool designers don’t seem to have adopted a standard setting with respect to ANSI_NULLS. As such, we are again presented with the possibility of errors due to the differing settings.

Now, in all fairness to the dark knight, he did qualify that he hates the mistakes that people make by assuming that they know all about NULL values, and not the NULL values themselves. But that doesn’t make for a very interesting rebuttal, doesn’t it? And I did promise a rebuttal….

Now, we could go into the nuances of NULL values, performance implications, ANSI standards, and yada yada yada, but in keeping with SQLBatman’s article, I’m going to focus on what has always seemed to me to be the most compelling reason to understand, and where appropriate make use of NULL values: they just make sense.

That was anti-climactic, wasn’t it? Don’t worry – it gets better.

What is a NULL? Well logically speaking, a NULL isn’t nothing; it’s “undefined”. That is to say, a data value is NULL when we simply don’t know what it’s value is. Sounds awfully ambiguous, doesn’t it? It can be, when abused. You see, the reason why so many people hate NULL values is because, by and large, they are used where they simply aren’t appropriate.

Take, for example, a Customer table. A simple customer table might look something like this:

CREATE TABLE Customer

(

CustKey INT IDENTITY(1,1) NOT NULL PRIMARY KEY CLUSTERED,

CustID VARCHAR(50) NOT NULL UNIQUE,

CustName VARCHAR(200) NOT NULL,

CustStatus tinyint NOT NULL --Enumeration: 1=Active, 2=On Hold, 3=In Collections, 4=Closed

)

And if it did, we wouldn’t have a problem. Notice that I’m not using any NULLs in this table, because there isn’t any valid case for NULL values. Let’s think about it, column by column:


  • CustKey is our primary key, so we’d be stupid to allow null values here.
  • CustID is our “friendly” key. We’re using a surrogate key in the CustKey, so CustID lets us have a “human friendly”, modifiable equivalent for the end-user to reference. Note also the UNIQUE constraint.
  • CustName should be pretty obvious……and you can’t go around sending your invoices to customers based upon their CustKey or CustID….
  • CustStatus is the customer’s “status”. That was very repetitive of me, wasn’t it? This is where many application developers would get into trouble. Can anyone see any reason why CustStatus would be NULL? If you can, and I’m just missing it, please post a comment and let me see the error of my ways. I don’t see a valid reason, and here’s why: if you don’t know what the status is, who the hell does? If CustStatus were allowed to be NULL, how should you treat this customer? Should you ship to them? Should you send them a payment reminder? Should you send goons? Even for a new customer, you should have a CustStatus. Perhaps they’re active by default, or perhaps they’re On Hold until reviewed by AR. Whatever the case, CustStatus should never be NULL, and if you go about making those kinds of fields NULLable, you’re just throwing fuel on the “I hate NULL values” fire.

OK, so you’ve told us why we shouldn’t use NULL values. I thought you loved NULLs?

And I do. If you haven’t yet discovered that I tend to take a very long and circuitous road to any given point, you really haven’t been reading my blog for long.

So why are NULLs beautiful? Because there are situations in which unknown values are most definitely expected, and when this occurs, they need to be stored as unknown values. Let’s look at another example. We’ll add another column into our “Customer” table above, so it’s DDL now looks like this:

CREATE TABLE Customer

(

CustKey INT IDENTITY(1,1) NOT NULL PRIMARY KEY CLUSTERED,

CustID VARCHAR(50) NOT NULL UNIQUE,

CustName VARCHAR(200) NOT NULL,

CustStatus tinyint NOT NULL, --Enumeration: 1=Active, 2=On Hold, 3=Closed

AnnualRevenue decimal(15,2) NULL

)

Our Marketing department has decided to profile our customers in order to understand how to better target their marketing efforts. They have asked us to track each of our clients’ “Annual Revenue” as a part of this profiling effort. In this case, a NULL value is absolutely, positively required. Why? Well, let’s say that you create AnnualRevenue as a non-nullable field. As soon as you create the field, you will need to populate it using a default value – and what should that value be? Well….logic would have it that the most “neutral” thing that we can drop into this field is a zero, right? So we create our AnnualRevenue column, and populate each record with a default value of zero. Zero means “we haven’t yet surveyed this client”.

Marketing needs to develop a report to analyze the data that we’ve harvested. Unfortunately, all of our zero-value rows skew their summaries and averages, dragging the “AnnualRevenue” down, and leading to statistical inaccuracies. Well, that’s simple, right? Let’s just exclude customers with zero revenue. Perfecto! We’ve handled unsurveyed customers, and we still didn’t let those ooky NULL values into our otherwise immaculate database.

A few days later, your survey team works down to the “T” section. They phone up Twitter, one of your largest customers:


“Hello, this is Sally from The Bandwidth Bin. Could I please speak with Mr. F Whale?” says Sally.

“One moment please….” tweets the helpful receptionist.

“Hello, this is F. Whale. How can I help you, in 140 characters or less?” says Mr. Whale.

“Hi Mr. Whale, this is Sally from The Bandwith Bin. We’re collecting information about our customers. If you don’t mind me asking, what is your annual revenue?” inquires Sally.

“Hang on a sec…let me check. Mmmhmm….mmmhmm….carry the one…our annual revenue is zero, rounded to the nearest two decimal places” replies Mr. Whale.



Sally ends the conversation, and goes to enter her results into the Customer table. Wuh-oh. If we’re lucky, she realizes that there’s a problem, and raises the alarm. If she doesn’t, she might just assume that somebody had already spoken with Mr. Whale, and already recorded their results in the table. If she does this, Twitter is forever going to be excluded from our Marketing team’s reports.

If Sally does raise the alarm, all eyes turn back to the DBA for a solution. Well…let’s see. We could decide on a different “default” value – say, negative a million. That doesn’t work – it’s not intuitive, and is begging for errors of an even greater scale. We could add an “AnnualRevenueCollected” field (a bit, perhaps) to indicate whether or not the client’s information has been collected. That fails – we now have redundant data, and room for inconsistencies. It may not be such a big deal with a single question, but imagine five or ten different questions, and five or ten corresponding “Answered?” fields. Sound convoluted enough for ya?

Or….we can embrace the NULL value. It isn’t always easy to work with, but then, lots of what we do as SQL Server professionals isn’t “easy”. We need to be careful working with NULL values, just as we need to be careful working with any other construct. I haven’t abandoned integers just because integer division produces rounded results by default. If anything, we should celebrate the fact that NULL values can break our code – whenever it does, it’s a very clear signal that we haven’t accounted for something, and we really need to give our code a second look.

7 comments:

SQL Batman said...

Nicely written, you were able to show that NULLs actually have a value!

Michael Swart said...

If you're CREATE TABLE happy, you can always avoid NULLs, to use your example, you could create a table like the following:

CREATE TABLE CustomerAnnualRevenue(
CustKey INT NOT NULL PRIMARY KEY CLUSTERED FOREIGN KEY REFERENCES Customer(CustKey),
AnnualRevenue DECIMAL(15,2) NULL
)

There, you've successfully avoided a null! However, when following the rule "Normalize 'til it hurts, Denormalize 'til it works". This should definitely go into the Denormalize column.

Aaron Alton said...

Good point, Michael. Again, you'd run into trouble once you want to collect more than just revenue - I can envision a "CustomerAdditionalStats" table, to hold all manner of additional statistics, but this wouldn't get rid of the NULL values. And I agree that you can't justify one table per stat - JOINs are just too expensive to do that.

BTW - I think you meant "NOT NULL" in your CREATE TABLE script. ;-)

Michael Swart said...

Of course I meant "NOT NULL". Kind of an essential typo.

Martin S. Stoller said...

Personally, used correctly, I love the NULL (of course I used to be a K&R C programmer... um... not that you should hold that against me, lol...)

Another great read! One day you might want to roll these "short stories" (for I think they have valid entertainment value) into a book (or at least, an eBook...)

Cheers,
Martin

UncleBid said...

Another great reason to use nulls is to take advantage of their effect on aggregate functions like COUNT, AVG, and SUM. For example, AVG(Answer5)will not be affected by NULL answers from survey respondents who skipped question 5.

Joe Borysko said...

Responding to the response of Michael Swart's comment 5 years ago...

A theoretical "CustomerAdditionalStats" table would be problematic. The better solution along Michael's line of thinking would be to have a different table for every null-able attribute. Of course this would immensely increase the amount of tables and make queries slower due to joins. I think the one case where it does make sense to create a separate table for a null-able attribute would be if that attribute was oftentimes null as opposed to filled in and if space was an issue.