Posts tagged: robots

What is a robots.txt file?

There may be areas of your site that you do not want search engine spiders to crawl, such as the admin area or a test page. One way to tell search engines which files and folders to avoid is the robots meta tag. However, not all search engines read meta tags, so webmasters use the robots.txt file to tell search engines which areas of the site to avoid.

Image: Search engine robots (from http://www.flickr.com/photos/microcosmos/1265783338/)

What is a robots.txt file?
The robots.txt file is a plain text file placed in the root folder of a website (for example: www.example.com/robots.txt).

Why is it used?

It gives instructions about the website to search engine spiders. The robots.txt file lists the pages and folders that should not be crawled, and it can also give the location of the XML sitemap. Many people use it to stop search engines from crawling a page or a group of pages, for example while the site is still being built and should not yet appear in search engines.
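As a rough illustration of the sitemap point, here is a minimal Python sketch that reads the sitemap location out of a robots.txt file using the standard library's urllib.robotparser. The sample file contents, the example.com URLs and the helper name sitemap_urls are assumptions made up for this example, and site_maps() needs Python 3.8 or later.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt contents, used only for illustration.
SAMPLE = """\
User-agent: *
Disallow: /admin

Sitemap: http://www.example.com/sitemap.xml
"""


def sitemap_urls(robots_txt):
    """Return the sitemap URLs listed in the given robots.txt text."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    # site_maps() returns None when no Sitemap line is present.
    return parser.site_maps() or []


if __name__ == "__main__":
    print(sitemap_urls(SAMPLE))
```

A file without a Sitemap line simply gives back an empty list here, since the sitemap entry is optional.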

What does it look like?

User-agent: *
Disallow: /admin
Disallow: /enquiry-form/
Disallow: /shoppingbasket/

The “*” means any robot. Each part of the site that you do not want robots to crawl must go on its own Disallow line.
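To see how a well-behaved robot might read these rules, here is a small Python sketch using the standard library's urllib.robotparser. The robot name "ExampleBot", the example.com host and the helper is_allowed are assumptions for illustration only; real search engines use their own parsers, which may treat edge cases differently.

```python
from urllib.robotparser import RobotFileParser

# The same rules as in the example above.
RULES = """\
User-agent: *
Disallow: /admin
Disallow: /enquiry-form/
Disallow: /shoppingbasket/
"""

parser = RobotFileParser()
parser.parse(RULES.splitlines())


def is_allowed(path, user_agent="ExampleBot"):
    """Check whether the (hypothetical) robot may crawl the given path."""
    return parser.can_fetch(user_agent, "http://www.example.com" + path)


if __name__ == "__main__":
    for path in ("/admin/login", "/enquiry-form/", "/products/"):
        print(path, "allowed" if is_allowed(path) else "blocked")
```

In this parser a Disallow value is matched as a prefix of the path, so "/admin" also covers anything underneath it, such as "/admin/login".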

Setting up a robots.txt file does not guarantee that every robot will respect it; robots can ignore the file. The file is also publicly available, so anyone can see which parts of your site you do not want robots to visit.

If you want to find out more about the robots.txt file, here are some useful links:

http://www.robotstxt.org/robotstxt.html

http://www.google.com/support/webmasters/bin/answer.py?hl=en&answer=40361

News publishers and the robots.txt file

OK, so for those who missed it, there was some news this week about Google indexing news publishers' sites.
We all know that if you do not want Google to crawl and index your website, you simply add the following to your robots.txt file:

User-agent: *
Disallow: /
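Note that the rule above applies to every robot, not just Google. As a hedged sketch, the Python below compares it with a variant that names only Googlebot (the user-agent token of Google's crawler), again using the standard library's urllib.robotparser. The name "OtherBot", the story URL and the helper can_crawl are hypothetical and exist only for this example.

```python
from urllib.robotparser import RobotFileParser

# Blocks every robot from the whole site (the rule shown above).
BLOCK_ALL = "User-agent: *\nDisallow: /"

# Variant for comparison: blocks only Google's crawler.
BLOCK_GOOGLEBOT_ONLY = "User-agent: Googlebot\nDisallow: /"

# Hypothetical page on a news site.
STORY_URL = "http://www.example.com/news/story.html"


def can_crawl(robots_txt, user_agent, url=STORY_URL):
    """Return True if the given robots.txt text lets user_agent fetch url."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(user_agent, url)


if __name__ == "__main__":
    for name, rules in (("block all", BLOCK_ALL),
                        ("block Googlebot only", BLOCK_GOOGLEBOT_ONLY)):
        for agent in ("Googlebot", "OtherBot"):
            print(name, agent, can_crawl(rules, agent))
```

With the second variant, robots other than Googlebot find no matching rule and are therefore allowed to crawl.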

But for some reason the news publishers seem to think that all of their content is going to be indexed by the search engine. According to other blogs, news publishers want to charge Google for access to their sites; that is, they want Google to pay them to index their content. Of course, this will not happen, and the publishers know they can block the crawler easily.

Search engines have always checked for permission before crawling pages from a website. Webmasters, including news publishers, are aware of the Robots Exclusion Protocol (REP) and use it to tell search engines whether or not their site, or an individual web page, can be crawled.
