Setting up a robots.txt to Control Search Engine Spiders
Way back in the day, when I published my very first website, I never thought that I would ever need to know about setting up a robots.txt to control search engine spiders. As a new webmaster, you are only too happy to have the search engine spiders crawling and indexing your website. Today, all the websites I develop have a robots.txt in the root directory.
For those of you who have not come across a robots.txt file, it is simply a text file placed in the main, or root, directory of your website. It tells search engine spiders and other robots which files they should not crawl and index. Not all spiders and robots access this file and follow its instructions, but all the major search engine spiders do.
Why is a robots.txt important?
Let's examine why setting up a robots.txt to control search engine spiders is so important.
Minimize Server Resources
Most websites rely on various bits of code, called scripts, to perform certain functions. For example, look at the main menu on this website and you will see a site search. I put that search option there so that my readers have a way of finding the information they are looking for. The search engine spiders have no need to access this search form, so I prevent access to it via my robots.txt. If I did not block access to the site search, each and every robot and spider could trigger the site search and use valuable server resources. I also block access to all contact forms and to a variety of directories that contain scripts developed to assist humans.
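As a rough sketch of what such rules might look like, the Python snippet below holds a small robots.txt as a string and checks it with the standard library's urllib.robotparser module. The /search/ and /contact/ paths and the "ExampleBot" name are assumptions made for illustration, not the actual paths used on this site.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical paths: adjust them to wherever your own search and
# contact scripts live.
ROBOTS_TXT = """\
User-agent: *
Disallow: /search/
Disallow: /contact/
"""


def is_allowed(robots_txt: str, user_agent: str, path: str) -> bool:
    """Return True if user_agent may fetch path under the given rules."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(user_agent, path)


if __name__ == "__main__":
    # "ExampleBot" is a made-up spider name used only for this check.
    for path in ("/search/?q=robots", "/contact/form.php", "/articles/index.html"):
        print(path, is_allowed(ROBOTS_TXT, "ExampleBot", path))
```

A well-behaved spider that reads these rules would skip the search and contact paths but remain free to crawl the articles.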
As I mentioned earlier, there are a few rogue spiders that have little regard for convention and hit your domain hard, draining all your server resources and often bringing your website down. If you know of any of these robots, you could block them via your robots.txt, but understand that a rogue spider will probably ignore the rules in your robots.txt.
Reduce Bandwidth Usage
Have a look at your website statistics and you will notice that your robots.txt file receives numerous hits each and every day. Search engine spiders and robots request your robots.txt file before crawling your website to see if you have left them any instructions. If you do not have a robots.txt file, spiders and robots trying to access it will get the default 404 file not found message. If you have customised your 404 page, they will get the customised page instead, which is probably bigger than the default 404 error message. Without a robots.txt file, spiders and robots therefore use up more of your bandwidth each time your 404 page is retrieved.
Some spiders will try to access files that you don't need indexed, such as images (.gif, .jpg, .png, etc.). If you would like to stop the search engine spiders from accessing these files, adding an instruction to your robots.txt is a great way to achieve this.
Web Statistics are Easier to Read and Act Upon
Whenever I access my website statistics, I always pay attention to pages that my readers tried to access but that returned a 404 file not found error. These are usually clues to a problem such as a broken link or incorrect spelling. If you do not have a robots.txt file, requests for it will generate multiple 404 file not found errors each and every day. These crowd your 404 statistics and may cause you to lose focus on fixing the linking errors or, worse still, to miss the linking errors altogether. If for no other reason than this, it is worth setting up a robots.txt file to control search engine spiders.
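To see how much of your 404 list is caused by a missing robots.txt, you could count the 404 responses per requested path. The sketch below assumes your server writes its access log in the Common Log Format. The sample lines, addresses and paths are made up for illustration.

```python
from collections import Counter

# Made-up sample lines in Common Log Format; a real log would be read
# from the file your web server writes.
SAMPLE_LOG = """\
192.0.2.10 - - [10/Oct/2008:13:55:36 +0200] "GET /robots.txt HTTP/1.1" 404 512
192.0.2.10 - - [10/Oct/2008:13:55:37 +0200] "GET /index.html HTTP/1.1" 200 10240
192.0.2.77 - - [10/Oct/2008:14:02:11 +0200] "GET /artcles/seo.html HTTP/1.1" 404 512
192.0.2.99 - - [10/Oct/2008:14:10:02 +0200] "GET /robots.txt HTTP/1.1" 404 512
"""


def count_404s(log_text: str) -> Counter:
    """Count 404 responses per requested path."""
    counts = Counter()
    for line in log_text.splitlines():
        parts = line.split('"')
        if len(parts) < 3:
            continue  # not a recognisable log line
        request = parts[1].split()        # e.g. ['GET', '/robots.txt', 'HTTP/1.1']
        status_fields = parts[2].split()  # e.g. ['404', '512']
        if len(request) >= 2 and status_fields and status_fields[0] == "404":
            counts[request[1]] += 1
    return counts


if __name__ == "__main__":
    for path, hits in count_404s(SAMPLE_LOG).most_common():
        print(hits, path)
```

In this sample, the robots.txt requests outnumber the one genuine linking error, which is exactly the kind of noise that a real robots.txt file removes.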
Blocking Specific Robots
Occasionally I find a robot accessing my site at such high speed that it impacts negatively on my readers, as they are not able to browse the site effectively, or worse still, it brings the website down. When I see this, I will usually exclude the robot from indexing my website via a robots.txt instruction. Of course, this will only work if the robot in question accesses the robots.txt file and obeys its rules.
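A minimal sketch of such an exclusion, again checked with Python's urllib.robotparser, might look like this. "BadBot" is a placeholder name, not a real spider. Substitute the name the offending robot reports about itself.

```python
from urllib.robotparser import RobotFileParser

# "BadBot" is a placeholder for the robot you want to keep out.
ROBOTS_TXT = """\
User-agent: BadBot
Disallow: /
"""

parser = RobotFileParser()
parser.parse(ROBOTS_TXT.splitlines())

# The named robot is shut out of the whole site...
badbot_allowed = parser.can_fetch("BadBot", "/index.html")
# ...while robots that are not named are unaffected by this record.
others_allowed = parser.can_fetch("Googlebot", "/index.html")

if __name__ == "__main__":
    print("BadBot allowed:", badbot_allowed)
    print("Googlebot allowed:", others_allowed)
```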
How to Set Up a robots.txt File
Setting up a robots.txt to control search engine spiders is really easy. Use an ASCII text editor like Notepad to create your robots.txt and then upload it to your website's root directory. If your domain is mywebsite.com, then place your robots.txt at mywebsite.com/robots.txt. Remember that the robots.txt is a robots exclusion file: it tells robots what not to index. The original Robots Exclusion Standard gives you no way to instruct a robot to index a directory via your robots.txt.
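If you would rather generate the file from a script than type it in Notepad, a small helper along these lines could do it. The function name write_robots_txt and the "public_html" folder are assumptions for illustration. Uploading the result to your server is still up to you.

```python
from pathlib import Path


def write_robots_txt(site_root: Path, content: str) -> Path:
    """Write content to <site_root>/robots.txt as plain ASCII text."""
    target = site_root / "robots.txt"  # the file belongs in the root directory
    # encoding="ascii" raises an error if a non-ASCII character slips in.
    target.write_text(content, encoding="ascii")
    return target


if __name__ == "__main__":
    # "public_html" is a hypothetical local copy of the website root.
    root = Path("public_html")
    root.mkdir(exist_ok=True)
    print(write_robots_txt(root, "User-agent: *\nDisallow: /scripts/\n"))
```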
The file itself consists of the name of the spider on one line, followed by the files and directories that you want excluded on subsequent lines, one file or directory per line. You may use the wildcard "*" (the asterisk, without quotes) instead of naming spiders individually. Here are a few examples:
User-agent: Googlebot
Disallow: /scripts/
The above instructs Google's spider not to index anything in the "scripts" folder or any of its subfolders.
User-agent: *
Disallow: /
The above instructs all spiders not to index anything on the website, because "/" matches the root directory and everything beneath it.
User-agent: *
Disallow: /images/
Disallow: /media/

User-agent: Googlebot-Image
Disallow: /
The above instructs all spiders not to index anything in the images or media directories, and instructs the Google Images spider not to index anything on the website.
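One way to sanity-check rules like the three examples above before uploading them is to run them through Python's urllib.robotparser. This is only a sketch: the test paths and the "ExampleBot" name are made up, the third example is written with Google's image crawler token, Googlebot-Image, and real spiders may interpret edge cases differently from Python's parser.

```python
from urllib.robotparser import RobotFileParser


def can_fetch(robots_txt: str, user_agent: str, path: str) -> bool:
    """Parse robots_txt and report whether user_agent may fetch path."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(user_agent, path)


# The three examples, with a blank line between the two records of the third.
EXAMPLE_1 = "User-agent: Googlebot\nDisallow: /scripts/\n"
EXAMPLE_2 = "User-agent: *\nDisallow: /\n"
EXAMPLE_3 = (
    "User-agent: *\n"
    "Disallow: /images/\n"
    "Disallow: /media/\n"
    "\n"
    "User-agent: Googlebot-Image\n"
    "Disallow: /\n"
)

if __name__ == "__main__":
    # Example 1: only Googlebot is kept out of /scripts/ and its subfolders.
    print(can_fetch(EXAMPLE_1, "Googlebot", "/scripts/sub/search.php"))
    print(can_fetch(EXAMPLE_1, "ExampleBot", "/scripts/sub/search.php"))
    # Example 2: every spider is kept out of the whole site.
    print(can_fetch(EXAMPLE_2, "ExampleBot", "/index.html"))
    # Example 3: everyone skips /images/ and /media/; Googlebot-Image skips everything.
    print(can_fetch(EXAMPLE_3, "ExampleBot", "/media/clip.mp4"))
    print(can_fetch(EXAMPLE_3, "ExampleBot", "/index.html"))
    print(can_fetch(EXAMPLE_3, "Googlebot-Image", "/index.html"))
```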
Where do I Find the Robot Names?
If you wish to block a particular spider from accessing specific files on your website, you need to know its name. Most reputable spiders have a page on their website that explains how to prevent them from accessing your website.
Common Mistakes in robots.txt Files
If you are new to writing robots.txt files, here are a few mistakes to look out for.
One file per disallow line
Remember that the Robots Exclusion Standard only allows for one file or directory per disallow statement. Listing more than one file or directory is likely to result in none of the files or directories in that disallow statement being excluded.
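A quick check for this mistake can be scripted. The helper below is an illustrative sketch, not part of any standard tool. It flags Disallow lines whose value contains more than one whitespace-separated path, and it does not try to catch other separators such as commas.

```python
def find_multi_path_disallows(robots_txt: str) -> list[int]:
    """Return 1-based line numbers of Disallow lines listing more than one path."""
    bad_lines = []
    for number, line in enumerate(robots_txt.splitlines(), start=1):
        line = line.split("#", 1)[0].strip()  # drop comments and outer whitespace
        field, sep, value = line.partition(":")
        if sep and field.strip().lower() == "disallow" and len(value.split()) > 1:
            bad_lines.append(number)
    return bad_lines


if __name__ == "__main__":
    sample = "User-agent: *\nDisallow: /images/ /media/\nDisallow: /scripts/\n"
    print(find_multi_path_disallows(sample))
```

The fix for any line it reports is to split it into separate Disallow lines, one path each.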
Do not list your private directories
Your robots.txt is a public file and is always in the root directory of your website, so anyone can download it and read its contents. If you have 'private' (secret) directories, it is not a good idea to list them in your robots.txt.
There are no Guarantees
Even though we have a document called the Robots Exclusion Standard, there are a few robots and spiders that do not access the robots.txt file prior to indexing your website. If you really need to exclude a spider, read my article on using a .htaccess file to block a bot.
Robots.txt update
A recent update to the Robots Exclusion Standard now allows you to link a sitemaps protocol file, listing all of your pages. For more on this, read my article on how to get the search engines to index all of your webpages.
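In practice the link is a single Sitemap line containing the full URL of the sitemap file. The sketch below reads it back with Python's urllib.robotparser. The sitemap.xml file name is an assumption, and site_maps() requires Python 3.8 or later.

```python
from urllib.robotparser import RobotFileParser

# The sitemap file name is hypothetical; use the URL of your own sitemap.
ROBOTS_TXT = """\
User-agent: *
Disallow: /scripts/

Sitemap: http://mywebsite.com/sitemap.xml
"""

parser = RobotFileParser()
parser.parse(ROBOTS_TXT.splitlines())

# site_maps() returns the listed sitemap URLs, or None if there are none.
sitemaps = parser.site_maps()

if __name__ == "__main__":
    print(sitemaps)
```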
No SEO Article Reprint Rights
Please note that unless you have express permission, you are not entitled to reprint this article in part or in its entirety.