Tuesday, June 9, 2009

Full-Text Search vs Denormalized Tables vs Link Tables: Handling Tags in SQL Server

“Tags” seem to be a very popular way of categorizing content nowadays. And by “nowadays” I mean “for the last 20 years or so”. But they’re especially popular today, with apps like Flickr and Picasa, sites like the MSDN Forums and StackOverflow, and even the blog post that you’re reading, all using tags to categorize content.

So how can we handle tags efficiently in SQL Server? Well, let’s take a look at the problem first:

  • Tags will fit into a conventional datatype definition. In fact, most tags generally consume 50 characters or less.
  • The relationship between an item and the number of tags that item has is not consistent. In most applications, an item can have as few as zero, and as may as ten different tags.

The first attribute is OK, but the second one presents a bit of a storage dilemma, doesn’t it? I can think of three (reasonable) solutions to handle this, because ultimately a tag/item relationship is not too different from a parent/child relationship. Given that assumption, we could use:

  • A classic parent/child relationship, supported with a link table:
    --"Tags" table used for the "link table" tag reference method.

    --We'll also use it to populate the full text table (see next method)

    CREATE TABLE Tags

    (

    TagID INT IDENTITY(1,1) PRIMARY KEY CLUSTERED,

    TagName VARCHAR(50)

    )

    GO


    --Table for use with Link Table

    CREATE TABLE ItemsWithoutTags

    (

    ItemID INT IDENTITY(1,1) PRIMARY KEY CLUSTERED,

    ItemText VARCHAR(2000)

    )

    GO

    --Link Table

    CREATE TABLE ItemTagXRef

    (

    ItemID INT NOT NULL FOREIGN KEY REFERENCES ItemsWithoutTags(ItemID),

    TagID INT NOT NULL FOREIGN KEY REFERENCES Tags(TagID),

    CONSTRAINT PK_ItemTagXRef_ItemID_TagID PRIMARY KEY NONCLUSTERED(ItemID, TagID)

    )

    CREATE INDEX CX_ItemTagXRef_ItemID ON ItemTagXRef(ItemID)

    GO


    This solution stores items in one table (pictures, blog posts, forum entries, etc), tags in a second table, and uses a link table to store item/tag relationships.

  • A single column with a full-text index:

    CREATE TABLE ItemsWithFTSTags

    (

    ItemID INT IDENTITY(1,1) PRIMARY KEY NONCLUSTERED,

    ItemText VARCHAR(2000),

    Tags VARCHAR(200),

    )

    CREATE UNIQUE CLUSTERED INDEX PK_ItemsWithFTSTags_ItemID ON ItemsWithFTSTags(ItemID);

    CREATE FULLTEXT CATALOG FTCatDemo AS DEFAULT;

    CREATE FULLTEXT INDEX ON ItemsWithFTSTags(Tags) KEY INDEX PK_ItemsWithFTSTags_ItemID;

    GO

    To employ this type of solution, we would need to delimit the tags, for instance: <Tag1><Tag2><Tag3>


  • A denormalized items table with one column for each possible tag:

    CREATE TABLE ItemsWithInlineTags

    (

    ItemID INT IDENTITY(1,1) PRIMARY KEY CLUSTERED,

    ItemText VARCHAR(2000),

    Tag1 VARCHAR(50),

    Tag2 VARCHAR(50),

    Tag3 VARCHAR(50),

    Tag4 VARCHAR(50)

    )


There are probably more, but these three are what 90% of designers will come out with. Each solution has it’s strengths and weaknesses. The Item/Tag/Link solution allows for a true 0-n relationship, without necessitating schema modification as the number of tags increases. This may not be a requirement for every application, but it’s good to be able to increase the number of permitted tags without having to modify the schema. The Full Text Search solution uses only one table (remember, joins aren’t free), and allows us the flexibility of the full-text operators to do our tag search (though that flexibility may not be required in this situation). The denormalized table requires only one table, and doesn’t have a dependency on full-text search.

The next logical question would be: how do they perform? While I was designing a test to answer this very question, it struck me that the denormalized table solution was…well…stupid. I was sitting trying to figure out the best way to index four columns, each of which is potentially nullable, each of which could be the target of a given search (WHERE Tag1 = ‘Java’ OR Tag2 = ‘Java’ OR ….). The solution started to feel like a stretch. One of our challenges as developers and designers is to learn when an idea, no matter how immaculate it seemed when it was conceived, needs to be killed off. When the only thing providing a solution with legitimacy is my desire for it to be a good solution, it's time to kick it to the curb. And that’s just what I did here.

And then there were two…

To create a decent test, I decided to load about 500 tags and 10,000 items. I “borrowed” the tags from a list of programming languages on Wikipedia (truncated so the post wouldn’t turn into a miniseries, but feel free to contact me or leave a comment if you'd like the script):

--First, the Tags table.  I need a decent dataset, so I'm passing a list

--of programming languages that I "borrowed" from
--Wikipedia (http://en.wikipedia.org/wiki/List_of_programming_languages)

--567 rows in all.

INSERT Tags(TagName) SELECT 'A# (Axiom)'

INSERT Tags(TagName) SELECT 'A# .NET'

INSERT Tags(TagName) SELECT 'A+'

INSERT Tags(TagName) SELECT 'A++'

INSERT Tags(TagName) SELECT 'A-0'

INSERT Tags(TagName) SELECT 'ABAP'

INSERT Tags(TagName) SELECT 'ABC'

INSERT Tags(TagName) SELECT 'ABC ALGOL'

.....

INSERT Tags(TagName) SELECT 'XQuery'

INSERT Tags(TagName) SELECT 'XSLT - See XPath'

INSERT Tags(TagName) SELECT 'Y'

INSERT Tags(TagName) SELECT 'YACC'

INSERT Tags(TagName) SELECT 'Yorick'

INSERT Tags(TagName) SELECT 'Z'

INSERT Tags(TagName) SELECT 'Z notation - like UML.'

INSERT Tags(TagName) SELECT 'Zonnon'

INSERT Tags(TagName) SELECT 'ZOPL'

INSERT Tags(TagName) SELECT 'ZPL'

INSERT Tags(TagName) SELECT 'ZZT-oop'


I then populated ItemsWithoutTags, ItemTagXRef and ItemsWithFTSTags using a bit of loop logic and RAND(). I know, every time you loop in SQL Server, an angel loses it’s wings. I’m sorry, but RAND() only evaluates once per statement, which necessarily meant that I had to run the statement 10,000 times. Here’s the script:

--Now for the items being tagged.  10k for each table,

--with 1-4 tags per item.

--2500 items each with 1,2,3 and 4 tags. (Hence the modulo logic below)

--I'm using a standard numbers table for enumeration

INSERT ItemsWithoutTags (ItemText)

SELECT REPLICATE('A', 50)

FROM Numbers

WHERE Number < 10000



--Crude, but effective tagging script

--You'll get a few errors for duplicate tags

--on a given item, but only a few - statistically

--1/567 * the number of tags for a given item,

--* the number of items



DECLARE @CurrentTagItem INT

DECLARE
@NumberOfTags INT

DECLARE
@LoopCounter INT



SET
@NumberOfTags = (SELECT COUNT(*) FROM Tags)

SET @CurrentTagItem = 1

SET @LoopCounter = 1



WHILE @CurrentTagItem <= 10000

BEGIN



WHILE
@LoopCounter <= (@CurrentTagItem%4) + 1

BEGIN

INSERT
ItemTagXRef(ItemID, TagID)

SELECT @CurrentTagItem,
CAST(RAND() * @NumberOfTags AS INT)

SET @LoopCounter = @LoopCounter + 1

END


SET
@CurrentTagItem = @CurrentTagItem + 1

SET @LoopCounter = 1
END



--Now to populate the full-text table.

--Again, 2500ish each with 1, 2, 3, and 4 tags.

--We'll delimit the tags as such: ztagz

SET @NumberOfTags = (SELECT COUNT(*) FROM Tags)



SELECT @LoopCounter = 1, @CurrentTagItem = 1

DECLARE @TagString VARCHAR(250)

SET @TagString = ''



WHILE @CurrentTagItem <= 10000

BEGIN

WHILE
@LoopCounter <= (@CurrentTagItem%4) + 1

BEGIN

SET
@TagString = @TagString +
(
SELECT 'z' + t.TagName + 'z'

FROM Tags AS t

WHERE t.TagID = CAST(RAND() * @NumberOfTags AS INT))

SET @LoopCounter = @LoopCounter + 1

END

INSERT
ItemsWithFTSTags(ItemText, Tags)

SELECT REPLICATE('a', 50), @TagString


SET @LoopCounter = 1

SET @TagString = ''


SET @CurrentTagItem = @CurrentTagItem + 1

END

Still with me? OK, nobody has dozed off – must be doing something right.

Notice the bit of a hack required to store the tags in the full text table? I’ve delimited the tag with a “z” on either end. By default, querying full text search via the CONTAINS operator will find any block of text containing a given word, so a search for “Java” will find not only the “Java” tag, but also “General Java”, “Join Java”, and the like. “Z” isn’t the best choice of delimiters, however you’ll find that SQL Server strips out most symbols when it indexes a column, so (for instance) storing the tags separated by vertical pipes won’t do you any good. Score 1 for the ItemXRef solution – we don’t need to muck around with it in order to get it to work, which leaves far more time for intelligent development and optimization.

Now let’s try querying both tables, for items tagged with the tag “Java”:

SELECT i.*

FROM ItemsWithoutTags i

JOIN ItemTagXRef itx ON i.ItemID = itx.ItemID

JOIN Tags t ON itx.TagID = t.TagID

WHERE t.TagName = 'Java'





SELECT i.*

FROM ItemsWithFTSTags i

WHERE CONTAINS(i.Tags, '"zJavaz"')

The full-text syntax is a little painful, but I can live with it.

Of course, a key factor in the suitability of any solution is it’s performance, so I wrapped each query in a loop, and executed them 100x apiece for five iterations each. Here’s what I found:


image

Pretty comparable, in the end. Of course, YMMV when two solutions are this close to each other, so if you’re considering applying one of these techniques, I recommend testing it in your environment with realistic sample data. If the difference is negligible, the XRef solution will most likely tip the scales as a result of it’s flexibility and ease of coding.

Until next time, that's all folks.

5 comments:

David C said...

Hey, enjoyed the post. :) How about the result quality? Did you get really different result sets, given that the fulltext solution is going to return "similar" matches as well as exact matches, while the join is going to only return exact tags? I'm guessing that because of the expected behavior of tags, your join solution would give a better result set (unless some modifications were made to the FTS solution). Cheers.

Aaron Alton said...

Hi David,

Thanks for the comment.

I specificially intended to get exact matches only, which is why I had to fudge the FTS solution with the "z" prefix and suffix (zJavaz). I used the CONTAINS operator, which looks for a more or less exact match. The end result was that I got exact matches only, but it wass a bit of a runaround to get it working as such!

Cade Roux said...

I like a fully normalized tag solution. It also lets you do other things with tags, like setting up hierarchies. For tech support this is good, because you can have printers and different brands and models of printers and let them roll up without having to be tagged as such. For instance, StackOverflow suffers from not having sqlserver2000, sqlserver2005 and sqlserver2008 rollup easily to sqlserver automatically for searching purposes. I actually like them to be a non-strict tree, so a particular tag could be a member of different parts of a hierarchy.

Aaron Alton said...

Cade,

Thanks for the feedback. You make an excellent point about tag hierarchies, and supplimentary tag attributes.

I much prefer the fully normalized tag solution as well.

Michael said...

Try:

SELECT RAND(CHECKSUM(NEWID()))
FROM Numbers
WHERE ...

...should give you as many random numbers as you want in one statement (and no Angels have to lose their wings)