How to Make the Most Out of the Robots.txt File
The robots.txt file is a plain text file containing instructions for search engine robots about content they are not allowed to index. These instructions tell search engines which pages of a website should and should not be indexed. The address of the robots.txt file is: www.yoursitename.com/robots.txt.
By default, every well-behaved robot looks for the robots.txt file first and then follows its rules while indexing the site's content.
Every record in a robots.txt file must contain two fields: User-agent and Disallow.
Robots.txt Syntax
# comment
User-agent: [robot name] or [* wildcard character for all robots]
Disallow: [/ for everything] or [specific directory] or [specific file location]
User-agent
The value of this field is the name of the robot for which the record describes the access policy. If more than one User-agent field is present, the record describes an identical access policy for more than one robot. At least one User-agent field must be present per record, and you can list multiple (more than one) User-agents in one entry, as in the example below.
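For example, a single record such as the following (the robot names here are hypothetical, chosen only for illustration) applies one identical policy to both crawlers:
User-agent: examplebot-one
User-agent: examplebot-two
Disallow: /private/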
Disallow
The value of this field specifies a partial URL that is not to be visited. This can be a full path or a partial path; any URL that starts with this value will not be retrieved. For example, Disallow: /help disallows both /help.html and /help/index.html, whereas Disallow: /help/ would disallow /help/index.html but allow /help.html. An empty value indicates that all URLs can be retrieved. At least one Disallow field needs to be present in a record.
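In directive form, the difference described above looks like this (an illustrative fragment; the paths are placeholders):
User-agent: *
Disallow: /help      # blocks /help.html as well as /help/index.html
# Disallow: /help/   # would block /help/index.html but still allow /help.html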
Things to remember while writing robots.txt:
- Robots.txt should be written in a plain text editor like Notepad. Do not use MS Word or any other word processor to create robots.txt. The bottom line is that the file must be plain text with the extension ".txt", otherwise it will be useless.
- A robots.txt file is always stored in the root of your site, and is always named in lower case. Spiders will always search for it in the root directory (e.g. http://www.example.com/robots.txt)
- There can only be one instruction per line.
- Avoid putting spaces before the instructions (recommended simply to avoid mistakes).
- Do not rely on robots.txt to hide sensitive or private areas of your site: anybody at all can view your robots.txt file (see the sketch below), so the entries themselves reveal those locations.
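To see for yourself that robots.txt is publicly readable, a few lines of Python are enough. This is a minimal sketch; it assumes the target site is reachable and actually serves a robots.txt file:
import urllib.request

# Fetch and print a site's robots.txt exactly as any crawler (or visitor) would see it.
# Replace the hostname with any site you want to inspect.
url = "https://www.example.com/robots.txt"
with urllib.request.urlopen(url) as response:
    print(response.read().decode("utf-8", errors="replace"))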
Important Note:
An empty “/robots.txt” file has no special meaning; it will be treated as if it were not present, i.e. all robots will consider themselves welcome.
Standard Syntax Examples:
Allow all robots to visit all files; the wildcard “*” applies the record to every robot
User-agent: *
Disallow:
Disallow all robots from visiting any file, i.e. block the entire site for every robot
User-agent: *
Disallow: /
Allow all crawlers, but restrict access to some directories/parts of the website by disallowing them
User-agent: *
Disallow: /private/
Disallow: /images/
Disallow: /temp/
Disallow: /CustomErrorPages/
Ban/disallow a specific crawler from accessing certain files or directories
User-agent: BadBotName
Disallow: /private/
Tell all crawlers not to fetch one or more specific files (block specific files)
User-agent: *
Disallow: /folder/file1.html
Disallow: /folder/file2.html
Add comments to robots.txt files by starting the line with the ‘#’ (hash) symbol
# this is comment.
While most robots have been following the above standard syntax for years, Google, Yahoo and MSN have created several useful newer directives for better access management. These extensions are not followed by every robot, but Google and Yahoo have implemented them (see the examples below).
Non-Standard Extensions Examples
Sitemaps auto-discovery
The Sitemap directive specifies the location of the site’s list of URLs (its XML sitemap). It can be placed anywhere in the file.
Sitemap: http://www.example.com/sitemap.xml.gz
Crawl-delay
The Crawl-delay parameter is set to the number of seconds a crawler should wait between successive requests to the same server
User-agent: *
Crawl-delay: 10
Allow directive
This extension may not be recognized by all search engine bots. To block access to all pages in a subdirectory except one
User-agent: Googlebot
Disallow: /folder1/
Allow: /folder1/myfile.html
Wildcard characters and pattern support
Googlebot and some other major web robots also support exclusion by patterns;
“*” matches any sequence of characters, and “$” indicates the end of the URL.
User-agent: Googlebot
Disallow: /*affid= # blocks all dynamic URLs containing affid=
Disallow: /*sid= # blocks all dynamic URLs containing sid=
Disallow: /*.aspx$ # blocks all files ending with .aspx
Disallow: /*.gif$ # blocks all files ending with .gif
The first rule disallows all dynamic URLs where the variable ‘affid’ (affiliate ID) is part of the query string, and the second does the same for ‘sid’. The third rule excludes .aspx pages without a query string from crawling, and the fourth tells the crawler to skip .gif files while other image formats remain crawlable.
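If you want to reason about these patterns programmatically, they translate naturally into regular expressions. The following Python sketch only approximates Googlebot-style matching (the function name and the test URL are invented for illustration); it is not an official implementation:
import re

def robots_pattern_to_regex(pattern):
    # '*' matches any sequence of characters; a trailing '$' anchors the end of the URL.
    anchored = pattern.endswith("$")
    body = pattern[:-1] if anchored else pattern
    regex = "".join(".*" if ch == "*" else re.escape(ch) for ch in body)
    return re.compile("^" + regex + ("$" if anchored else ""))

# The rules from the example above, tested against a hypothetical URL.
url = "/catalog/item.aspx?affid=42"
for rule in ["/*affid=", "/*sid=", "/*.aspx$", "/*.gif$"]:
    blocked = bool(robots_pattern_to_regex(rule).match(url))
    print(rule, "blocks" if blocked else "allows", url)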
More Examples
To block Googlebot crawling any URL that includes a ? (more specifically, any URL that begins with your domain name, followed by any string, followed by a question mark, followed by any string):
User-agent: Googlebot
Disallow: /*?
If you want to disallow every bot except Googlebot:
User-Agent: *
Disallow: /
User-Agent: Googlebot
Allow: /
Prevent all other robots from accessing a part of the site, while allowing only one robot (e.g. Googlebot) to index everything on your site
User-agent: googlebot
Disallow:
User-agent: *
Disallow: /folder/
Many people believe that it is necessary to define the robot-specific rules before the general rules. This is not required by the robots.txt exclusion standard, but it may be worth doing if it helps things work as you intend. Several hundred spider robots are already out on the internet crawling your website. The database containing details of all known robots can be downloaded from here, and the list of bad robots is available here. Even if we assume they are all good, you still may not want every one of them visiting your home (website) at any time. You can exclude any spider robot from accessing your site at any time using the syntax shown above.
General Rule of thumb
There are several other methods to restrict access by spider robots. Each has its own importance and usefulness, so you can choose whichever is most suitable for you. You can also use any combination of these methods, or even apply all of them simultaneously.
Want to block robots from accessing | Best method
Websites or directories             | robots.txt
Single pages                        | robots meta tag
Single links                        | nofollow attribute
FAQ
What are bad robots or spam bots, and what do they do? Is there any way to get rid of them?
A bad robot typically does all of the following things:
* Ignores the robots.txt file and its guidelines
* Follows links through CGI scripts
* Traverses the whole website in seconds, slowing it down during this time
* Searches for email addresses to build lists for e-mail spamming
* Keeps revisiting the website too often
If you want to keep the bad robots away, you can ban them. Here (http://www.fleiner.com/bots/#banning) is the small amount of code you will need to ban bad robots.
Could you please tell me the names of a few popular web crawlers and their bot names?
Some popular search engines and their robot names
Search engine   | Robot name
Alexa.com       | ia_archiver
Yahoo.com       | slurp
Google.com      | googlebot
Altavista.com   | scooter
Msn.com         | msnbot
DMoz.org        | Robozilla
How can I create a new robots.txt file for my new site?
Making a robots.txt file takes about ten minutes. Read the post above completely; it should help you create your own robots.txt. If you are still confused, you can use the automatic robots.txt generator below.
Various useful Resources about robots.txt
robots.txt generator
It can generate a robots.txt file automatically: choose from the dropdown list which search engine spiders you want to block or allow, point it at your sitemap, set a crawl delay, and adjust many other options to generate the file to suit your needs. A bare-bones generator of your own could look like the sketch below.
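As a rough illustration of what such a tool does, here is a minimal, hypothetical generator sketch in Python (the function name and the example rules are invented for the demonstration):
def generate_robots_txt(rules, sitemap=None):
    # rules: mapping of user-agent name -> list of paths to disallow.
    # An empty list produces "Disallow:", which allows everything for that agent.
    lines = []
    for agent, paths in rules.items():
        lines.append("User-agent: %s" % agent)
        for path in (paths if paths else [""]):
            lines.append("Disallow: %s" % path)
        lines.append("")  # blank line separates records
    if sitemap:
        lines.append("Sitemap: %s" % sitemap)
    return "\n".join(lines)

print(generate_robots_txt(
    {"googlebot": [], "*": ["/private/", "/temp/"]},
    sitemap="http://www.example.com/sitemap.xml",
))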
robots.txt checker or validator
If you have already put a robots.txt file on your site, you can check and validate which pages or sections of your website are accessible or not accessible to robots. A very useful tool for webmasters; you can also run a quick check yourself, as in the sketch below.
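If you prefer to check things from code, Python's standard library ships urllib.robotparser, which can parse a robots.txt file and answer "can this bot fetch this URL?" questions. Note that it implements the original exclusion standard (plus Allow), so the wildcard extensions above may not be interpreted the way Googlebot would interpret them. Treat this as a rough sketch; the rules come from an example in this post and "SomeOtherBot" is a placeholder name:
from urllib import robotparser

# Parse rules directly from text; alternatively use set_url() and read() to fetch a live file.
rules = """
User-agent: googlebot
Disallow:

User-agent: *
Disallow: /folder/
"""

rp = robotparser.RobotFileParser()
rp.parse(rules.splitlines())

print(rp.can_fetch("googlebot", "/folder/page.html"))     # True: Googlebot may crawl everything
print(rp.can_fetch("SomeOtherBot", "/folder/page.html"))  # False: other bots are kept out of /folder/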
Making WordPress SEO friendly using robots.txt
This web page contains all the information required to optimize a WordPress blog for search engines.