Tuesday, January 20, 2009

Breaking a String into “Words”

Occasionally on the MSDN forums, someone has the requirement to break out a string into words, one word per line. The poster usually says “I have the requirement to break out a string into words, one word per line”. Just kidding. The requirement usually rears it’s head in the form of a question like this:

“I have a column that contains a string of text and I need to pull out all any dates that show up in the column values”, or

“I need to pull out everything after the second (x) character”

Or something of the sort. Once you get used to this technique you’ll actually use it in more than a few places. Note that there are a lot of “split” functions in circulation that do this kind of thing very well – most of them use RBAR operations or XML though, which I’m trying to avoid.

This is the kind of thing that is relatively easy to do with a row-based operation such as a cursor or a WHILE loop. And if you would like to know how to do it using one of those constructs, c’est la vie. :P

I came up with this method of parsing data values when answering a question on the forums about six months ago. At the time, I proclaimed myself High Genius of the Royal Order of Brainiacs. Good thing I didn’t tell anyone; since then, I’ve seen a few other variants of the same thing on the intertubes. I’m sure that mine was the original, and everyone else has somehow pilfered and anonymized it.

My medication usually works better than this. Hmmmmm….. wonder if I need to up the dose….. oh.. pardon me, where were we?

Ah yes, you can probably look at the two requirements listed above and think of a few other ways to do them – I know that I can – but I found this solution to be particularly flexible with respect to any additional processing that you may need to do.

Basically, when I started coding this, I was banking on the fact that you can use the SUBSTRING function to pull each character in a string individually. So given the sentence “All work and no play makes Jack a dull boy”, I can use substring to pull “A”, “l”, “l”, “ “, “w”, “o”…..you get the picture.

Now…I’m not really interested in all of the characters. I only care about the spaces right now. And I want to use a set-based operation, not a cursor, WHILE loop, or other such profane construct.

Enter the numbers table. If you haven’t yet been introduced to the numbers table, you don’t know what you’re missing. My favorite way to create a numbers table is through the method outlined on Adam Machanic’s blog article “You REQUIRE a Numbers Table!”. Dunno why. It’s just nice, elegant, and compact. It loses a few brownie points because it uses SQL 89 syntax, but we likes it anyway.

Anyhow, we can easily join the table that we need to parse to the numbers table based upon a SUBSTRING join condition. Here’s the demo data for this article (it includes Adam’s numbers table):

--Adam Machanic's numbers table

CREATE TABLE Numbers

(Number INT NOT NULL,
CONSTRAINT PK_Numbers PRIMARY KEY CLUSTERED (Number)
WITH FILLFACTOR = 100)

INSERT INTO Numbers

SELECT (a.Number * 256) + b.Number AS Number

FROM (SELECT number
FROM master..spt_values
WHERE type = 'P'
AND number <= 255 ) a (Number),
(
SELECT number

FROM master..spt_values
WHERE type = 'P'
AND number <= 255 ) b (Number)

GO





CREATE TABLE #TestWordBreak

(RowKey INT IDENTITY (1,1) PRIMARY KEY CLUSTERED,

TxtCol VARCHAR(500))


INSERT #TestWordBreak(TxtCol)

SELECT 'Please ship no later than 01/12/2009'

UNION

SELECT
'Drop off at loading dock 6'

UNION

SELECT
'Replaces 6 pieces from damaged order sent on 01/01/2009'

UNION

SELECT
'Please hold all orders from 01/03/2009 through 01/25/2009'

And this query returns the position of each space in TxtCol, along with the RowKey and full TxtCol value:

SELECT tb.RowKey,

tb.TxtCol,

n.Number AS PositionOfSpace

FROM #TestWordBreak tb

JOIN Numbers n ON SUBSTRING(tb.TxtCol, n.Number, 1) = ' '

AND n.Number <= LEN(tb.TxtCol)

(Sample Results)

RowKey    TextCol                                                                PositionOfSpace

3                  Please ship no later than 01/12/2009    0


3                  Please ship no later than 01/12/2009    7


3                  Please ship no later than 01/12/2009    12


3                  Please ship no later than 01/12/2009    15


3                  Please ship no later than 01/12/2009    21


3                  Please ship no later than 01/12/2009    26



Note that the query does return a row where PositionOfSpace = 0. This is deliberate, and will be used in the next step.

OK – so now that we know where the spaces are, we can look to return the words themselves, rather than the position of the space. To do so using SUBSTRING, we need position of the first character of each word, and the length of each word. The position is easy – it’s nothing more than the position returned in the previous step, plus one. The length…not so much. To get the length, I’m going to join to a variation on my query above more than once. Have a look:

SELECT     frst.RowKey, frst.TxtCol, frst.FirstCharacter,

COALESCE(lst.LastCharacter, LEN(frst.TxtCol)) AS LastCharacter,

(
1 + COALESCE(lst.LastCharacter, LEN(frst.TxtCol)) - frst.FirstCharacter) AS WordLength

FROM (SELECT tb.RowKey,

tb.TxtCol,

n.Number + 1 AS FirstCharacter,

ROW_NUMBER() OVER (PARTITION BY tb.RowKey ORDER BY n.Number) AS RowNumber

FROM #TestWordBreak tb

JOIN Numbers n ON SUBSTRING(tb.TxtCol, n.Number, 1) = ' '

AND n.Number <= LEN(tb.TxtCol)) AS frst

LEFT JOIN (SELECT tb2.RowKey,

n.Number - 1 AS LastCharacter,

ROW_NUMBER() OVER (PARTITION BY RowKey ORDER BY n.Number ASC) AS RowNumber

FROM #TestWordBreak tb2

JOIN Numbers n ON SUBSTRING(tb2.TxtCol, n.Number, 1) = ' '

AND n.Number <= LEN(tb2.TxtCol)) AS lst
ON frst.RowKey = lst.RowKey

AND frst.RowNumber + 1 = lst.RowNumber

Basically, the first derived table grabs the position of the first character in each word using the logic described above. The second derived table grabs the position of the last character in each word. Actually, it grabs the position of the next space, and we use ROW_NUMBER() to make sure that we only get the next space. At the top, I coalesced the last character and the length of TxtCol – the “next space” will be NULL if we’re on the last word, but the length of the last word in a string can be expressed as the length of the entire string less the position of the first character.


Now, it’s easy as pie to get each word:

SELECT     frst.RowKey, frst.TxtCol,
SUBSTRING( frst.TxtCol,
frst.FirstCharacter,

(
1 + COALESCE(lst.LastCharacter, LEN(frst.TxtCol)) - frst.FirstCharacter)

)
AS Word

FROM (SELECT tb.RowKey,

tb.TxtCol,

n.Number + 1 AS FirstCharacter,

ROW_NUMBER() OVER (PARTITION BY tb.RowKey ORDER BY n.Number) AS RowNumber

FROM #TestWordBreak tb

JOIN Numbers n ON SUBSTRING(tb.TxtCol, n.Number, 1) = ' '

AND n.Number <= LEN(tb.TxtCol)) AS frst

LEFT JOIN (SELECT tb2.RowKey,

n.Number - 1 AS LastCharacter,

ROW_NUMBER() OVER (PARTITION BY RowKey ORDER BY n.Number ASC) AS RowNumber

FROM #TestWordBreak tb2

JOIN Numbers n ON SUBSTRING(tb2.TxtCol, n.Number, 1) = ' '

AND n.Number <= LEN(tb2.TxtCol)) AS lst
ON frst.RowKey = lst.RowKey

AND frst.RowNumber + 1 = lst.RowNumber

Woohoo! Now for the flexibility….

You want dates only? Use this query as a CTE or a derived table, and specify WHERE ISDATE(Word) = 1. Same goes for numbers, and ISNUMBER(Word) = 1.

And we’re not just restricted to using space characters as delimiters. Replace a space with any other character, and you’re now parsing just about everything. You can even have multiple delimiters – just replace

SUBSTRING(tb.TxtCol, n.Number, 1) = ' '
with

SUBSTRING(tb.TxtCol, n.Number, 1) IN (' ', ',', '.')

Very cool…..happy splitting!

No comments: