Importance of a robots.txt file
June 11, 2009
Rob Ackerman
What is robots.txt? The robots.txt file is a plain text file stored in the root directory of your website (for example, http://my.domain.com/robots.txt). Its main function is to tell the spiders which pages and directories you do *NOT* want indexed. It is a guideline containing rules and restrictions for the spiders to follow.
How does robots.txt work? The robots.txt file uses a small set of directives that most major search engines recognize. They are: User-agent, Disallow, Crawl-delay, and Sitemap.
User-agent specifies which robots, such as Googlebot or Yahoo's Slurp, the rules that follow apply to. The most common value is "*", meaning the rules apply to all bots. You'll find this written as "User-agent: *".
Disallow specifies directories or files that you don't want indexed. If you have a directory called "MyStuff" at the root of your site that you don't want indexed, a directive of "Disallow: /MyStuff" would be an appropriate entry in your robots.txt file.
Crawl-delay allows you to tell crawlers how long to wait before requesting another page. For high-traffic sites or sites with limited bandwidth, this can help keep crawlers from bringing down your site. In the robots.txt file, a crawl delay of 10 seconds would be entered as "Crawl-delay: 10".
Sitemap is becoming a key component of the robots.txt file. This is where you point crawlers to your XML sitemap to ensure they know about all your pages. The directive appears in the robots.txt file as "Sitemap: http://my.domain.com/sitemap.xml". Keep in mind that your sitemap should follow the Sitemaps protocol.
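To see how the four directives fit together, here is a rough Python sketch that assembles a robots.txt file from its parts. The helper name build_robots_txt and its parameters are made up for this illustration; they are not part of any library.

```python
def build_robots_txt(user_agent="*", disallow=(), crawl_delay=None, sitemap=None):
    """Assemble robots.txt text from the four directives.

    Hypothetical helper, for illustration only.
    """
    lines = [f"User-agent: {user_agent}"]
    # One Disallow line per path that should not be crawled.
    for path in disallow:
        lines.append(f"Disallow: {path}")
    if crawl_delay is not None:
        lines.append(f"Crawl-delay: {crawl_delay}")
    if sitemap is not None:
        # A blank line sets the sitemap apart from the user-agent rules.
        lines.append("")
        lines.append(f"Sitemap: {sitemap}")
    return "\n".join(lines) + "\n"


robots = build_robots_txt(
    disallow=["/MyStuff"],
    crawl_delay=10,
    sitemap="http://my.domain.com/sitemap.xml",
)
print(robots)
```

The printed text could be saved as robots.txt at the root of the site.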
An example of a simple robots.txt:
User-agent: googlebot
Disallow: /private
Sitemap: http://my.domain.com/sitemap.xml
This robots.txt tells Googlebot not to index anything residing in or below the private folder of your site. This could save you bandwidth and money if you had a large directory of product manuals that you didn't want indexed, or if you didn't want your site indexed by a particular spider (for instance, the Wayback Machine). Keeping crawlers out of that directory frees up the bandwidth for your visitors. It can also keep private files and directories from being indexed by the major search engines.
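If you want to check how a rule set like this one is interpreted, Python's standard library includes a robots.txt parser. The sketch below feeds it the example above; the file names manual.pdf and index.html are invented for the test, and a real crawler may interpret edge cases differently than this parser does.

```python
from urllib.robotparser import RobotFileParser

EXAMPLE = """\
User-agent: googlebot
Disallow: /private
Sitemap: http://my.domain.com/sitemap.xml
"""

parser = RobotFileParser()
parser.parse(EXAMPLE.splitlines())

# URLs under /private are off limits to Googlebot...
blocked = parser.can_fetch("Googlebot", "http://my.domain.com/private/manual.pdf")
# ...but the rest of the site is not.
allowed = parser.can_fetch("Googlebot", "http://my.domain.com/index.html")
# No rule group names this bot, so nothing is disallowed for it.
other_bot = parser.can_fetch("Slurp", "http://my.domain.com/private/manual.pdf")

print(blocked, allowed, other_bot)  # False True True
```

Note that the rules only apply to the robot named in the User-agent line; every other bot is still free to crawl the private folder unless you add a "User-agent: *" group as well.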
Another good reason for a robots.txt file is to tell the search engines where your sitemap file is located. As of April 2007, all the major search engines have agreed to accept the auto-discovery of sitemaps through the robots.txt file. All you need to do is put "Sitemap: http://my.domain.com/sitemap.xml" in your robots.txt file.
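From the crawler's side, auto-discovery amounts to scanning the robots.txt file for Sitemap lines. Here is a minimal Python sketch of that idea; the function name find_sitemaps is hypothetical, and real crawlers are likely more forgiving about malformed lines.

```python
def find_sitemaps(robots_txt):
    """Return every URL declared on a Sitemap line, in file order."""
    urls = []
    for raw_line in robots_txt.splitlines():
        # Drop comments and surrounding whitespace.
        line = raw_line.split("#", 1)[0].strip()
        name, sep, value = line.partition(":")
        # Directive names are matched case-insensitively.
        if sep and name.strip().lower() == "sitemap":
            urls.append(value.strip())
    return urls


robots_txt = """\
User-agent: *
Disallow: /private
Sitemap: http://my.domain.com/sitemap.xml
"""
print(find_sitemaps(robots_txt))  # ['http://my.domain.com/sitemap.xml']
```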
Some additional features of the robots.txt file (which are considered non-standard) are Visit-time, written as "Visit-time: 0600-0845", and Crawl-delay. They are not yet supported by Google (Google actually understands Crawl-delay, but ignores it anyway), though some of the other search engines respect these rules; Ask, for example, supports Crawl-delay. Visit-time tells the spiders what times are appropriate to index your site (maybe during off hours or non-rush times). Crawl-delay tells the spiders how many seconds to wait between page requests.
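To make the two directives concrete, here is a small Python sketch of how a polite crawler might read them. The Visit-time helpers are hypothetical and assume the value is always HHMM-HHMM within a single day, with no time zone handling; Crawl-delay is read with the standard library's robots.txt parser.

```python
from datetime import time
from urllib.robotparser import RobotFileParser


def parse_visit_time(value):
    """Turn a value like '0600-0845' into a (start, end) pair of times."""
    start, end = value.strip().split("-")
    return (time(int(start[:2]), int(start[2:])),
            time(int(end[:2]), int(end[2:])))


def in_visit_window(now, window):
    """True if `now` falls inside the window (same-day windows only)."""
    start, end = window
    return start <= now <= end


window = parse_visit_time("0600-0845")
print(in_visit_window(time(7, 30), window))  # True
print(in_visit_window(time(12, 0), window))  # False

# Crawl-delay is understood by the standard-library parser.
parser = RobotFileParser()
parser.parse(["User-agent: *", "Crawl-delay: 10"])
print(parser.crawl_delay("*"))  # 10
```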
As a final note, having a robots.txt file could clear up some of those 404s in your website logs. Spiders look for the robots.txt file, and if you don't have one, your website will return a 404 for each of those requests. While a robots.txt file is not technically required, it is good for SEO.
For more information on robots.txt and its usage, please visit robotstxt.org and Wiki.