Saturday, September 11, 2004

I, Robots.txt - They came, They saw, They Cataloged!

The robots.txt file is a plain-text file placed in your web server's root directory (meaning it should be accessible by typing www.yoursite.com/robots.txt) that contains specific details about your site, making a search engine's job much easier, as well as telling it what NOT to index. This is called the "Robot Exclusion Standard".

The format for the robots.txt file is simple. It consists of records, and each record consists of a User-agent line and one or more Disallow: lines. Each line follows the format:

[field] ":" [value]
The following directives are allowed in the robots.txt file, and examples are given for their usage:
  • User-agent

    The User-agent line specifies the robot. For example:

User-agent: googlebot OR User-agent: * (* = all robots)

You can find user agent names in your own logs by checking for requests to robots.txt. Most major search engines have short names for their spiders.

  • Disallow:
    The second part of a record consists of one or more Disallow: directive lines. These lines specify files and/or directories. For example:

Disallow: /email.htm OR Disallow: /cgi-bin/

If you leave the Disallow: line blank, it indicates that ALL files may be retrieved. At least one Disallow: line must be present for each User-agent directive for the record to be valid. A completely empty robots.txt file is the same as if it were not present. (A quick way to test how robots interpret these directives is sketched just after this list.)

  • Any line in the robots.txt file that begins with # is treated as a comment. The standard also allows comments at the end of directive lines, but this is really bad style:

Disallow: /bob #comment

EXAMPLE ROBOTS.TXT FILES:

#Allowing all robots everywhere:
User-agent: *
Disallow:


#This one keeps all those nosy robots out:

User-agent: *
Disallow: /


#The next one bars all robots from the illegal_documents and invoices directories:
User-agent: *
Disallow: /illegal_documents/
Disallow: /invoices/

#This one bans Google's robot (its real name is Googlebot) from poking around:
User-agent: Googlebot
Disallow: /

#This one keeps googlebot from indexing the all_my_credit_card_numbers.html file:
User-agent: googlebot
Disallow: /all_my_credit_card_numbers.html

Once you are finished banning and allowing robots, run your file through a robots.txt validator. Let me know how you did!
