How to Make the Most Out of the Robots.txt File
The robots.txt file is a plain text file containing instructions for search engine robots about content they are not allowed to index. These instructions tell search engines which pages of a website should and should not be indexed. The address of the robots.txt file is: www.yoursitename.com/robots.txt.
By default, every well-behaved robot looks for the robots.txt file first and then follows its rules while indexing the site's content.
Every record in a robots.txt file must contain two fields: User-agent and Disallow.
Robots.txt Syntax
# comment
User-agent: [robot name] or [* wildcard character for all robots]
Disallow: [/ for everything] or [specific directory] or [specific file location]
User-agent
The value of this field is the name of the robot for which the record describes the access policy. If more than one User-agent field is present, the record describes an identical access policy for more than one robot. At least one User-agent field must be present per record, and you can list multiple (more than one) User-agents in one entry, as in the example below.
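For example, a single record such as the following (the robot names here are hypothetical, chosen only for illustration) applies one identical policy to both crawlers:
User-agent: examplebot-one
User-agent: examplebot-two
Disallow: /private/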
Disallow
The value of this field specifies a partial URL that is not to be visited. This can be a full path or a partial path; any URL that starts with this value will not be retrieved. For example, Disallow: /help disallows both /help.html and /help/index.html, whereas Disallow: /help/ would disallow /help/index.html but allow /help.html. An empty value indicates that all URLs can be retrieved. At least one Disallow field needs to be present in a record.
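In directive form, the difference described above looks like this (an illustrative fragment; the paths are placeholders):
User-agent: *
Disallow: /help      # blocks /help.html as well as /help/index.html
# Disallow: /help/   # would block /help/index.html but still allow /help.html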
Things to remember while writing robots.txt:
- Robots.txt should be written in a plain text editor like Notepad. Do not use MS Word or any other word processor to create robots.txt. The bottom line is that the file must be plain text with the extension ".txt", otherwise it will be useless.
- A robots.txt file is always stored in the root of your site, and is always named in lower case. Spiders will always search for it in the root directory (e.g. http://www.example.com/robots.txt)
- There can only be one instruction per line.
- Avoid putting spaces before the instructions (recommended simply to avoid mistakes).
- Do not rely on robots.txt to hide sensitive or private areas of your site: anybody at all can view your robots.txt file (see the sketch below), so the entries themselves reveal those locations.
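To see for yourself that robots.txt is publicly readable, a few lines of Python are enough. This is a minimal sketch; it assumes the target site is reachable and actually serves a robots.txt file:
import urllib.request

# Fetch and print a site's robots.txt exactly as any crawler (or visitor) would see it.
# Replace the hostname with any site you want to inspect.
url = "https://www.example.com/robots.txt"
with urllib.request.urlopen(url) as response:
    print(response.read().decode("utf-8", errors="replace"))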
Important Note:
An empty “/robots.txt” file has no special meaning; it will be treated as if it were not present, i.e. all robots will consider themselves welcome.
Standard Syntax Examples:
Allow all robots to visit all files; the wildcard “*” applies the record to every robot
User-agent: *
Disallow:
Disallow all robots from visiting any file, i.e. block the entire site for every robot
User-agent: *
Disallow: /
Allow all crawlers, but restrict access to some directories/parts of the website by disallowing them
User-agent: *
Disallow: /private/
Disallow: /images/
Disallow: /temp/
Disallow: /CustomErrorPages/
Ban/disallow a specific crawler from accessing certain files or directories
User-agent: BadBotName
Disallow: /private/
Tell all crawlers not to fetch one or more specific files (block specific files)
User-agent: *
Disallow: /folder/file1.html
Disallow: /folder/file2.html
Add comments to robots.txt files by starting the line with the ‘#’ (hash) symbol
# this is comment.
While most robots have been following the above standard syntax for years, Google, Yahoo and MSN have created several useful newer directives for better access management. These extensions are not followed by every robot, but Google and Yahoo have implemented them (see the examples below).
Non-Standard Extensions Examples
Sitemaps auto-discovery
The Sitemap directive specifies the location of the site’s list of URLs (its XML sitemap). It can be placed anywhere in the file.
Sitemap: http://www.example.com/sitemap.xml.gz
Crawl-delay
The Crawl-delay parameter is set to the number of seconds a crawler should wait between successive requests to the same server
User-agent: *
Crawl-delay: 10
Allow directive
This extension may not be recognized by all search engine bots. To block access to all pages in a subdirectory except one
User-agent: Googlebot
Disallow: /folder1/
Allow: /folder1/myfile.html
Wildcard characters and pattern support
Googlebot and some other major web robots also support exclusion by patterns;
“*” matches any sequence of characters, and “$” indicates the end of the URL.
User-agent: Googlebot
Disallow: /*affid= # blocks all dynamic URLs containing affid=
Disallow: /*sid= # blocks all dynamic URLs containing sid=
Disallow: /*.aspx$ # blocks all files ending with .aspx
Disallow: /*.gif$ # blocks all files ending with .gif
The first rule disallows all dynamic URLs where the variable ‘affid’ (affiliate ID) is part of the query string, and the second does the same for ‘sid’. The third rule excludes .aspx pages without a query string from crawling, and the fourth tells the crawler to skip .gif files while other image formats remain crawlable.
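If you want to reason about these patterns programmatically, they translate naturally into regular expressions. The following Python sketch only approximates Googlebot-style matching (the function name and the test URL are invented for illustration); it is not an official implementation:
import re

def robots_pattern_to_regex(pattern):
    # '*' matches any sequence of characters; a trailing '$' anchors the end of the URL.
    anchored = pattern.endswith("$")
    body = pattern[:-1] if anchored else pattern
    regex = "".join(".*" if ch == "*" else re.escape(ch) for ch in body)
    return re.compile("^" + regex + ("$" if anchored else ""))

# The rules from the example above, tested against a hypothetical URL.
url = "/catalog/item.aspx?affid=42"
for rule in ["/*affid=", "/*sid=", "/*.aspx$", "/*.gif$"]:
    blocked = bool(robots_pattern_to_regex(rule).match(url))
    print(rule, "blocks" if blocked else "allows", url)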
More Examples
To block Googlebot crawling any URL that includes a ? (more specifically, any URL that begins with your domain name, followed by any string, followed by a question mark, followed by any string):
User-agent: Googlebot
Disallow: /*?
If you want to disallow every bot except Googlebot:
User-Agent: *
Disallow: /
User-Agent: Googlebot
Allow: /
Prevent all other robots from accessing a part of the site, while allowing only one robot (e.g. Googlebot) to index everything on your site
User-agent: googlebot
Disallow:
User-agent: *
Disallow: /folder/
Many people believe that it is necessary to define the robot-specific rules before the general rules. This is not required by the robots.txt exclusion standard, but it may be worth doing if it helps things work as you intend. Several hundred spider robots are already out on the internet crawling your website. The database containing details of all known robots can be downloaded from here, and the list of bad robots is available here. Even if we assume they are all good, you still may not want every one of them visiting your home (website) at any time. You can exclude any spider robot from accessing your site at any time using the syntax shown above.
General Rule of thumb
There are several other methods to restrict access by spider robots. Each has its own importance and usefulness, so you can choose whichever is most suitable for you. You can also use any combination of these methods, or even apply all of them simultaneously.
Want to block robots from accessing | Best method
Websites or directories             | robots.txt
Single pages                        | robots meta tag
Single links                        | nofollow attribute
FAQ
What are bad robots or spam bots, and what do they do? Is there any way to get rid of them?
A bad robot typically does all of the following things:
* Ignores the robots.txt file and its guidelines
* Follows links through CGI scripts
* Traverses the whole website in seconds, slowing it down during this time
* Searches for email addresses to build lists for e-mail spamming
* Keeps revisiting the website too often
If you want to keep the bad robots away, you can ban them. Here (http://www.fleiner.com/bots/#banning) is the small amount of code you will need to ban bad robots.
Could you please tell me the names of a few popular web crawlers and their bot names?
Some popular search engines and their robot names
Search engine   | Robot name
Alexa.com       | ia_archiver
Yahoo.com       | slurp
Google.com      | googlebot
Altavista.com   | scooter
Msn.com         | msnbot
DMoz.org        | Robozilla
How can I create a new robots.txt file for my new site?
Making a robots.txt file takes about ten minutes. Read the post above completely; it should help you create your own robots.txt. If you are still confused, you can use the automatic robots.txt generator below.
Various useful Resources about robots.txt
robots.txt generator
It can generate a robots.txt file automatically: choose from the dropdown list which search engine spiders you want to block or allow, point it at your sitemap, set a crawl delay, and adjust many other options to generate the file to suit your needs. A bare-bones generator of your own could look like the sketch below.
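As a rough illustration of what such a tool does, here is a minimal, hypothetical generator sketch in Python (the function name and the example rules are invented for the demonstration):
def generate_robots_txt(rules, sitemap=None):
    # rules: mapping of user-agent name -> list of paths to disallow.
    # An empty list produces "Disallow:", which allows everything for that agent.
    lines = []
    for agent, paths in rules.items():
        lines.append("User-agent: %s" % agent)
        for path in (paths if paths else [""]):
            lines.append("Disallow: %s" % path)
        lines.append("")  # blank line separates records
    if sitemap:
        lines.append("Sitemap: %s" % sitemap)
    return "\n".join(lines)

print(generate_robots_txt(
    {"googlebot": [], "*": ["/private/", "/temp/"]},
    sitemap="http://www.example.com/sitemap.xml",
))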
robots.txt checker or validator
If you have already put a robots.txt file on your site, you can check and validate which pages or sections of your website are accessible or not accessible to robots. A very useful tool for webmasters; you can also run a quick check yourself, as in the sketch below.
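If you prefer to check things from code, Python's standard library ships urllib.robotparser, which can parse a robots.txt file and answer "can this bot fetch this URL?" questions. Note that it implements the original exclusion standard (plus Allow), so the wildcard extensions above may not be interpreted the way Googlebot would interpret them. Treat this as a rough sketch; the rules come from an example in this post and "SomeOtherBot" is a placeholder name:
from urllib import robotparser

# Parse rules directly from text; alternatively use set_url() and read() to fetch a live file.
rules = """
User-agent: googlebot
Disallow:

User-agent: *
Disallow: /folder/
"""

rp = robotparser.RobotFileParser()
rp.parse(rules.splitlines())

print(rp.can_fetch("googlebot", "/folder/page.html"))     # True: Googlebot may crawl everything
print(rp.can_fetch("SomeOtherBot", "/folder/page.html"))  # False: other bots are kept out of /folder/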
Making WordPress SEO friendly using robots.txt
This web page contains all the information required to optimize a WordPress blog for search engines.