Digg.com’s SEO Mistake And The 1 Second Fix

by Hessam on January 21, 2010

Update February 4: Following this post and my email conversations with the folks at Digg, they corrected their robots.txt file on January 29.

I was recently looking through some of the major tech news sites around the web and studying their use of the robots.txt protocol. I came across Digg.com and found a potential error in their robots file that is likely costing them a lot of search engine traffic, as well as advertising revenue. Since it looks like an honest mistake, I thought I'd write about it so that you won't make the same one.

For those of you not familiar with robots.txt, it is a way for webmasters to give instructions to search engine crawlers (also called robots or spiders) about what they are allowed to crawl and which areas of the website should not be accessed. You can read more about the robots.txt protocol here.
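As a generic illustration (the user-agent and paths here are made-up examples, not Digg's actual rules), a minimal robots.txt simply pairs a user-agent with the paths it is not allowed to crawl:

User-agent: *
Disallow: /private/
Disallow: /search/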

Coming back to Digg.com: among the instructions in their robots.txt I found a mistake that makes a large part of the website (the Gadgets directory) inaccessible to search engines. If you take a look at their robots.txt file, you will find the following line:

Disallow: /gadget

If you are an SEO geek (like me), you have probably noticed the open ending of this line and the wildcard-like effect it has in practice: Disallow rules match by prefix, so this line blocks access to every URL whose path begins with '/gadget'. Note that this is not the same as blocking only the "gadget" directory; it is far more inclusive. In practice, the rule also covers Digg's Gadgets directory, 'digg.com/gadgets' (note the trailing 's'), and hence the problem! The correct implementation ends the line with a forward slash in order to avoid the wildcard effect:

Disallow: /gadget/
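To see the difference in practice, here is a minimal sketch (not from the original post) using Python's standard urllib.robotparser; the story URL is made up purely for illustration:

from urllib.robotparser import RobotFileParser

broken = ["User-agent: *", "Disallow: /gadget"]    # missing the trailing slash
fixed = ["User-agent: *", "Disallow: /gadget/"]    # blocks only the /gadget/ directory

for label, rules in (("broken", broken), ("fixed", fixed)):
    parser = RobotFileParser()
    parser.parse(rules)    # feed the rules to the parser
    ok = parser.can_fetch("Googlebot", "http://digg.com/gadgets/some-story")
    print(label, "-> /gadgets/ crawlable:", ok)

# The open-ended rule catches /gadgets/ too; the fixed rule does not:
# broken -> /gadgets/ crawlable: False
# fixed -> /gadgets/ crawlable: True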

Digg has a large number of pages in the search engines' indexes; for instance, a Google site search shows close to 290 thousand pages in the 'digg.com/apple' directory alone. We can therefore only speculate about exactly how much search engine traffic Digg.com is shutting out by accidentally blocking the /gadgets/ directory (Yahoo Site Explorer shows 108 thousand pages in /gadgets/). This could be intentional, but it seems more likely to be an honest mistake (it's a single missing "/", after all). According to Compete.com, Digg has around 40 million monthly unique visitors, so I can only speculate about how much money such a small mistake has cost them in advertising revenue during this period.

Finally, the lessons we learn from this are to:

1) write easy-to-read robots.txt files so it's easier to spot such mistakes. There are online tools that help with constructing and deciphering robots.txt files, and you can also script a simple check yourself (see the sketch after this list).

2) always use web analytics tools and segment your traffic so it’s clear to you what is going on across the various parts of your site.

3) be afraid of the wildcard effect; in fact, just completely avoid anything with the word "wild" in it :)
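On point 1: besides the online tools, a small self-check script can catch this kind of slip early. Here is a minimal sketch (again using Python's urllib.robotparser) that fetches a live robots.txt and warns if any section you care about has become blocked; the URL list is a hypothetical example:

from urllib.robotparser import RobotFileParser

# Sections of the site that should always stay crawlable (example URLs)
IMPORTANT_URLS = [
    "http://digg.com/gadgets/",
    "http://digg.com/apple/",
]

parser = RobotFileParser("http://digg.com/robots.txt")
parser.read()    # download and parse the live robots.txt

for url in IMPORTANT_URLS:
    if not parser.can_fetch("Googlebot", url):
        print("WARNING: crawlers are blocked from", url)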

What are other important things to think about when creating a robots.txt file?

