Friday, February 6, 2009

ROW_NUMBER(), RANK(), and DENSE_RANK() – Flexibility at a Price

One of the most handy features introduced in SQL 2005 were the ranking functions; ROW_NUMBER(), RANK(), and DENSE_RANK(). For anyone who hasn’t been introduced to these syntactic gems, here’s a quick rundown (for those of you who are very familiar with these functions already, feel free to read through, or skip right down to “There’s No Such Thing as a Free Ride” below).

OK – so the general syntax for any one of these commands is more or less the same:

ROW_NUMBER() OVER ([<partition_by_clause>] <order_by_clause>) 
RANK() OVER ([<partition_by_clause>] <order_by_clause>)
DENSE_RANK() OVER ([<partition_by_clause>] <order_by_clause>)

So the PARTITION BY part is optional, but everything else is required. An example of a non partitioned, and then a partitioned ROW_NUMBER() clause are listed below:

ROW_NUMBER() OVER (ORDER BY TotalDue DESC) 
ROW_NUMBER() OVER (PARTITION BY CustomerID ORDER BY TotalDue DESC)

The difference between the three functions is best explained using an example. Here’s the data I’m using for this example, in case you want to follow the bouncing ball at home ;-)

CREATE TABLE OrderRanking

(

OrderID INT IDENTITY(1,1) NOT NULL,

CustomerID INT,

OrderTotal decimal(15,2)

)


INSERT OrderRanking (CustomerID, OrderTotal)

SELECT 1, 1000

UNION
SELECT
1, 500

UNION
SELECT
1, 650

UNION
SELECT
1, 3000

UNION
SELECT
2, 1000

UNION
SELECT
2, 2000

UNION
SELECT
2, 500

UNION
SELECT
2, 500

UNION
SELECT
3, 500

I’ll use the following (admittedly ugly) query to demonstrate the difference between each function:

SELECT  *,

ROW_NUMBER() OVER (ORDER BY OrderTotal DESC) AS RN,

ROW_NUMBER() OVER (PARTITION BY CustomerID ORDER BY OrderTotal DESC) AS RNP,

RANK() OVER (ORDER BY OrderTotal DESC) AS R,

RANK() OVER (PARTITION BY CustomerID ORDER BY OrderTotal DESC) AS RP,

DENSE_RANK() OVER (ORDER BY OrderTotal DESC) AS DR,

DENSE_RANK() OVER (PARTITION BY CustomerID ORDER BY OrderTotal DESC) AS DRP

FROM OrderRanking

ORDER BY OrderTotal DESC

Excuse the terrible aliases. Anything longer and the code snippets and output in this blog entry get really, really ugly. When we run the query, this is what we get:


image

So from the example above, we can see that:

  • ROW_NUMBER() assigns sequential numbers to each partition in a result set (an unpartitioned result set simply has a single partition), based upon the order of the results as specified in the ORDER BY clause. If you look carefully, you’ll see that the values in column RN are based upon a simple sort of TotalDue, while the values in Column RNP (Row_Number partitioned) are first partitioned or “grouped” by CustomerID, and then numbered by TotalDue, with the row number resetting on change of customer.
  • Contrary to popular belief, RANK() does not sort rows based upon how bad they smell. RANK() does much the same thing as ROW_NUMBER(), only it acknowledges ties in the columns specified in the ORDER BY clause, and assigns them the same rank. Where a tie occurs (as was the case for orders 6/3, and 1/5/8), the numbers that would otherwise have been “used up” are skipped, and numbering resumes at the next available number. As you can see, RANK() leaves a gap whenever there is a tie.
  • DENSE_RANK() doesn’t like gaps. It’s more of an Abercrombie & Fitch kind of function (ba-dum-ching!). Ohhhhh…that was terrible. My sense of humour may give me up for lent. You might follow it. Anyway….DENSE_RANK() “fills in the gaps”. It starts from the next number after a tie occurs, so instead of 1, 2, 3, 3, 5 you get 1, 2, 3, 3, 4.

There’s No Such Thing as a Free Ride

Ranking functions are not only useful for simple ranking – they’re also great for solving complex problems. In fact, once you get to know them, you’ll find that you’re using them for waaaaaay more than just ranking. In anything from splitting strings to deleting duplicates, ranking functions are the cat’s meow.

But just like most things in life SQL Server, less is more, when you can get away with it. For example, let’s say that you need to get the top order (by TotalDue) for each Customer in AdventureWorks (AdventureWorks2008 in my examples below). You can definitely use ROW_NUMBER (or any of the other ranking functions, for that matter) to do this:

SELECT soh.*

FROM (SELECT CustomerID, SalesOrderID, TotalDue,
ROW_NUMBER() OVER (PARTITION BY CustomerID ORDER BY TotalDue DESC) AS RowNumber

FROM Sales.SalesOrderHeader) AS soh

WHERE soh.RowNumber = 1

The WHERE soh.RowNumber = 1 restricts our results to the top order for each customer. Lovely. And the really beautiful thing about this is, if you need the top 2 orders for each customer, or 3, or 4, or x, all you need to do is replace the = 1 with <=2 (for example), and you’re good to go. Now, with that in mind, let’s look at this query:

SELECT soh.CustomerID, soh.TotalDue

FROM Sales.SalesOrderHeader soh

JOIN (SELECT CustomerID, MAX(TotalDue) AS MaxTotalDue

FROM Sales.SalesOrderHeader

GROUP BY CustomerID) AS ttls ON soh.CustomerID = ttls.CustomerID

AND soh.TotalDue = ttls.MaxTotalDue

If you plug this bad boy in, and run it, you might be surprised by the outcome. It’s actually about half as expensive as the Row_Number solution – but why? Well, as you may or may not know, sorts can be very, very expensive in SQL Server. If we’re only fetching the highest $ sales order for a given customer, the MAX solution does it without a sort, whereas the ROW_NUMBER solution needs to sort (the ORDER BY clause is mandatory, remember).

But there are some caveats to the MAX solution – most notably, how in the world can we get the top 5 orders for each customer? Well…the short answer is, we can’t. We need to change the query up, and in doing so, we’re once again going to incur a sort. Once we get beyond a query that the MAX or MIN tricks can satisfy – for instance, if we need to fetch the top 5 orders for each customer, we may as well take advantage of the ease of coding, and the improved readability of the Row_Number solution. If we want a solution for the “top 5” problem without invoking a ranking function, we’re going to end up with something like this:

SELECT soh.CustomerID, soh.TotalDue

FROM Sales.SalesOrderHeader soh

WHERE soh.SalesOrderID IN
(SELECT TOP 5 SalesOrderID

FROM Sales.SalesOrderHeader soh2

WHERE soh2.CustomerID = soh.CustomerID

ORDER BY TotalDue DESC)

Which in this case is a very, very crappy alternative to a ranking function. Not only is it uglier, but the query plan isn’t nearly as efficient, and the execution times were consistently about 20% longer in my tests.

Now that said, my tests are against a single data set only, and based upon the nature of your data, your mileage may vary. As a general rule, I would use aggregate functions if I’m only looking for the highest or lowest data point in a series, and a ranking function for anything that can’t be solved by simple aggregation.

14 comments:

hatic said...

Hi,
How can I convert these two rows to access sql?

dense_rank() over(partition by field1 order by field2) as name1,

row_number() over(partition by fld1, (dense_rank() over(partition by fldnm1 order by fldnm2)) order by fld2) as name2

thanks in advance

Aaron Alton said...

Hi hatic,

Sorry - I don't know Access SQL syntax from a hole in the wall. You can give the Microsoft Access Newsgroups a try though. Good luck!

Brad Schulz said...

Hi Aaron...

For the TOP 5 orders per customer, the following CROSS APPLY method is a more efficient approach than the SalesOrderID IN method. The query plan is roughly half the cost. (Excuse the formatting... it may not come through):

SELECT CustomerID,TotalDue
FROM (SELECT DISTINCT CustomerID
FROM Sales.SalesOrderHeader) soh
CROSS APPLY (SELECT TOP 5 TotalDue
FROM Sales.SalesOrderHeader soh2
WHERE soh2.CustomerID=soh.CustomerID
ORDER BY TotalDue DESC) F1

That being said, though, the ROW_NUMBER() approach is the way to go (no question) for this type of thing.

--Brad

Brian Tkatch said...

I've been using CROSS APPLY for only one record at a time. I forgot/didn't realize it can return an entire set.

huruy said...

Thank you so much.

Thirumal Reddy said...

Excellent Article ThankQ so much for posting this article

Somy said...

Excellent post, thanks for the article.

Abixel said...

Hi Great article.This tutorial saved me a lot in trying to write my own custom ranking function.By simply following your example i had to substitute row_number() with rank() in three of my stored procedures and voila!!!, my problem of detecting ties in records was solved.Keep up the good work if you were near i would have bought you a cup of coffee.

Abixel said...

Thanks your article it saved me a lot of time that I would have spent tying to write a custom ranking function.Substituted the ROW_NUMBER() function with RANK() function in three of my stored procedures and voila!!!!, ties are being smoothly detected.Would have bought you a cup of coffee if you were near.We need people in the world like you.

K said...

I know this is a bit late ...

But I'm curious about the queries you tested in the 'no free ride' section. The 'MAX' query is actually sensibly less-expensive, namely because everything but the inner subquery is extraneous. I wonder what your performance test results would be like if the outer query also returned an additional column from the SalesOrderHeader table. I have a hunch that they'll be different and I think that's what I've observed before.

salil patil said...

Very Helpful :)
Thank You

karthik said...

Brilliant article,this is exactly what i was looking for.You saved my googling time for perfect article on rank functions.Thanks

Raghuram Tadimalla said...

Great article. Very well explained.

I have a scenario that I am not able to get the solution to.

Here is the scenario. Pardon the formatting if it messes up at the time of posting.

ACCT_NUM CERT_ID Code Date Desired Output
A 1 10 1/1/2007 1/1/2008
A 1 10 1/1/2008 1/1/2008
A 1 20 1/1/2009 1/1/2010
A 1 20 1/1/2010 1/1/2010
A 1 10 1/1/2011 1/1/2012
A 1 10 1/1/2012 1/1/2012
A 2 20 1/1/2007 1/1/2008
A 2 20 1/1/2008 1/1/2008
A 2 10 1/1/2009 1/1/2010
A 2 10 1/1/2010 1/1/2010
A 2 30 1/1/2011 1/1/2011
A 2 10 1/1/2012 1/1/2013
A 2 10 1/1/2013 1/1/2013

As you can see, I need to do a MAX on the date based on each group of code values (apart from ACCT_NUM and CERT_ID) before the value changes. If the same value repeats, I need to a MAX of the data again for that group separately. For example, for CERT_ID of '1', I cannot group all four rows of Code 10 to get a MAX date of 1/1/2012. I need to get the MAX for the first two rows and then another MAX for the next two rows separately since there is another code in between. I am trying to accomplish this in Framework Manager.

Raghuram Tadimalla said...

Can anybody respond to my previous post and help with the desired output that I require? I am not able to get that yet....your help is very much appreciated. Thank you.