Working with robots.txt files at the enterprise level can be a nightmare. Common stumbling blocks include millions of URLs, incredibly large spreadsheets, hundreds of appended URL parameters, sporadic directory structures, and more.
One of the easiest ways to manage URL-hell in an enterprise environment is to take advantage of wildcards in your robots.txt file.
Wildcards allow you to block portions of URLs that match specific patterns.
For example, many publishers and retailers allow visitors to view "printable" versions of their content. A printable page is always a duplicate of the original and almost always lives on its own URL.
Let's say that http://www.example.com/dan/enterprise-seo/wildcards.html is the URL of my article.
I have a "printable" version of this article at http://www.example.com/dan/enterprise-seo/wildcards.html?print=on
I can use a robots.txt wildcard entry to disallow this URL by adding this line:
Disallow: /*print=on*
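To see why that line catches the printable URL, here is a minimal sketch of the matching logic in Python. It assumes Google-style pattern rules: the pattern is compared from the start of the path, `*` matches any run of characters, and a trailing `$` anchors the end of the URL. The helper names `rule_to_regex` and `is_disallowed` are my own for illustration, not part of any crawler or library, and real crawlers may differ in the details.

```python
import re

def rule_to_regex(rule):
    """Translate a robots.txt path pattern into a compiled regex.

    Assumed semantics: "*" matches any run of characters and a
    trailing "$" anchors the end of the URL.
    """
    anchored = rule.endswith("$")
    if anchored:
        rule = rule[:-1]
    # Escape the literal pieces and join them with ".*" for each "*".
    body = ".*".join(re.escape(part) for part in rule.split("*"))
    return re.compile(body + ("$" if anchored else ""))

def is_disallowed(path, rule):
    """True if the rule matches the path, starting from its first character."""
    return rule_to_regex(rule).match(path) is not None

print(is_disallowed("/dan/enterprise-seo/wildcards.html?print=on", "/*print=on*"))
print(is_disallowed("/dan/enterprise-seo/wildcards.html", "/*print=on*"))
```

Under these assumptions the first call matches and the second does not, so the article stays crawlable while its printable copy is blocked. Because the match is prefix-based, the trailing `*` in the rule should be redundant: `/*print=on` would behave the same way in this sketch.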
As of fall of 2006, all of the major search engines accept wildcard entries in the robots.txt file. In fact, Google has published their own guidelines for taking advantage of "pattern matched" entries. Read through their rules to determine how to best implement wildcards in your robots.txt.
REMEMBER: Robots.txt wildcard entries are case-sensitive (for Google, at least). Be sure to check your robots.txt file against your targeted URLs using Webmaster Tools (Tools > Analyze robots.txt).
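Before uploading a change, it can also help to run a quick local check of your rules against a list of target URLs. The sketch below is a rough stand-in for that kind of check, not a replacement for the search engine's own tool. It assumes the same Google-style semantics (`*` as a wildcard, an optional trailing `$`, case-sensitive comparison); the function name `blocked_by` and the mixed-case `?Print=On` URL are hypothetical, added only to show the case-sensitivity pitfall.

```python
import re
from urllib.parse import urlsplit

def blocked_by(url, disallow_rules):
    """Return the Disallow patterns that match the URL's path and query."""
    parts = urlsplit(url)
    target = parts.path + ("?" + parts.query if parts.query else "")
    hits = []
    for rule in disallow_rules:
        if not rule:
            continue  # an empty Disallow value blocks nothing
        end_anchor = rule.endswith("$")
        core = rule[:-1] if end_anchor else rule
        regex = ".*".join(re.escape(piece) for piece in core.split("*"))
        if end_anchor:
            regex += "$"
        # No re.IGNORECASE flag here, so matching stays case-sensitive.
        if re.match(regex, target):
            hits.append(rule)
    return hits

rules = ["/*print=on*"]
urls = [
    "http://www.example.com/dan/enterprise-seo/wildcards.html",
    "http://www.example.com/dan/enterprise-seo/wildcards.html?print=on",
    "http://www.example.com/dan/enterprise-seo/wildcards.html?Print=On",  # hypothetical variant
]
for url in urls:
    print(url, "->", blocked_by(url, rules) or "allowed")
```

In this sketch only the lowercase `?print=on` URL is reported as blocked. The mixed-case variant slips through, which is exactly the kind of gap worth catching before the crawler finds it.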