The importance of the robots.txt file shouldn't be underestimated if you want to follow SEO best practices. Robots.txt lets you communicate with search engines by giving their bots instructions about which sections of your website should be crawled. Your site may still get indexed without a robots.txt file, but without one you lose a real opportunity to get the most SEO value out of your site.
An overview of robots.txt
Let’s have a quick overview of why the robots.txt file is essential.
⦁ Robots.txt, also called the robots exclusion standard (or protocol), tells web bots which pages to crawl and which pages not to crawl.
⦁ A search engine robot checks robots.txt for instructions before it starts crawling a site.
After reading this, you might wonder why you would need to block web robots from crawling pages on your website. A typical website contains a large number of pages, and if a search engine robot has to go through every single one of them, it takes a long time to finish the task. That delay can have a negative impact on your search engine ranking.
Making a robots.txt file
Making a robots.txt file is easy. You can find a typical robots.txt file in the root folder of most sites, and you can open or create one with a simple text editor such as Notepad. There are also many YouTube video tutorials available if you want a deeper understanding of how to create robots.txt files.
To start with, you should understand the robots exclusion protocol (REP). REP tags are attached to a URL or resource to keep indexers from performing particular tasks on it. Different search engines interpret REP differently:
⦁ Google may completely remove URL-only listings from its search engine result pages (SERPs) if a resource is marked with a noindex tag.
⦁ Bing behaves differently and may still list such references on its SERPs.
REP tags can be included in the meta elements of an HTML file as well as in HTTP headers. Robots tags overrule any conflicting directives found in meta elements.
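For illustration, a noindex directive might appear either as a meta element in the page's HTML head or as an HTTP response header; both forms below are generic examples, not tied to any particular site.
In the HTML head:
<meta name="robots" content="noindex, nofollow">
As an HTTP response header:
X-Robots-Tag: noindex, nofollow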
Next, a typical robots.txt file contains one or more rules. Each rule blocks or allows a crawler's access to particular paths on the site.
A simple robots.txt file with two rules may look like the example below.
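Sketched here with the placeholder domain used in the explanation that follows:

User-agent: Googlebot
Disallow: /nogooglebot/

User-agent: *
Allow: /

Sitemap: http://www.website.com/sitemap.xml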
This works as follows:
⦁ The crawler "Googlebot" will not access the folder at http://website.com/nogooglebot/ or any of its subdirectories.
⦁ All user agents other than Googlebot can access the entire site.
⦁ The sitemap file is located at http://www.website.com/sitemap.xml.
Basic guidelines of robots.txt
It is essential to read the complete syntax of robots.txt files, as they have some subtle behaviors you should understand.
⦁ Text editor
As discussed above, you can use any plain text editor such as Notepad to create robots.txt. The file should be encoded as UTF-8 or ASCII. However, avoid word processors, as they can add unexpected characters like curly quotes that may confuse crawlers. A good option is an online robots.txt Tester tool, which lets you write and edit robots.txt files and also test the file's syntax and behavior against your site.
⦁ Naming convention and location
The file must be named exactly robots.txt so that crawlers can find it. A site can have only a single robots.txt file, and it belongs in the root folder of the site; it should not be placed in a subdirectory.
For example, to control crawling of URLs under http://www.website.com/, you have to place the robots.txt file at http://www.website.com/robots.txt. If you lack permission to access the root directory, or are unsure how to do this, contact your hosting service.
⦁ Robots.txt syntax
– A robots.txt file consists of one or more rules, with each rule containing multiple directives, one per line.
– Each rule tells a given user agent which directories it can access and which it cannot.
– Rules are processed from top to bottom. A user agent matches only one rule set: the first, most specific group of rules that applies to that agent.
– Rules are case sensitive. The rule Disallow: /file.asp will not apply to a file named FILE.asp.
– A page or directory that is not covered by a Disallow rule can be crawled by any user agent.
Basic do's and don'ts of creating a robots.txt file
Next is an overview of some do's and don'ts of creating a robots.txt file, as suggested by experts.
Do’s:
⦁ Scrutinize all of your site's directories; you will likely find some areas that need to be blocked.
⦁ Also, check your website's files for information such as customers' e-mail addresses or phone numbers, which must be blocked.
⦁ It is best not to let search engines index pages or areas of your site with duplicate content, such as a repeated manual or a printable version of the same page content.
⦁ While creating robots.txt, make sure you don't block search engines from crawling the main site.
Don’ts:
⦁ Don't scatter free-form notes through the file outside of proper comment syntax (lines starting with '#'); stray text that isn't a valid directive can confuse crawlers.
⦁ Don't put sensitive details about your files into the .txt file. Remember that the public can access it, which would defeat your attempt to mask certain areas of your website.
⦁ Don't add "/allow" entries for paths that aren't disallowed anywhere; they have no effect.
Some key directives used in robots.txt files
⦁ User-agent
This specifies the name of the search robot to which the rule applies, and it is the first line of any rule. You can find user agent names at the Web Robots DB. You can use the '*' wildcard to match a prefix, suffix, or full string.
For example, to block Googlebot and AdsBot:
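A minimal sketch, assuming Google's ads crawler is addressed by the AdsBot-Google token; Disallow: / keeps both agents out of the whole site.

User-agent: Googlebot
User-agent: AdsBot-Google
Disallow: /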
⦁ Disallow
This specifies a directory or page that the user agent should not crawl. For a page, give the full page path; a directory should end with a '/'. This directive also supports the '*' wildcard for a path prefix, suffix, or full string.
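For illustration, with made-up paths: the full path blocks a single page, while the trailing slash blocks a whole directory.

User-agent: *
Disallow: /private-page.html
Disallow: /tmp/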
⦁ Allow
This specifies a directory or page that the named agent(s) should be allowed to crawl. It can override a Disallow directive, permitting a page or subdirectory inside an otherwise disallowed directory to be crawled. The syntax is the same as for Disallow.
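For example, to keep crawlers out of a directory while still allowing one page inside it (the paths here are placeholders):

User-agent: *
Disallow: /archive/
Allow: /archive/index.html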
⦁ Sitemap
This optional directive specifies the location of the sitemap, which must be given as a complete, valid URL. Sitemaps are a good way to indicate which content should be crawled, as opposed to which content may or may not be crawled.
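Sketched with the placeholder domain used in this article; note that the full, absolute URL is required:

Sitemap: http://www.website.com/sitemap.xml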
Microformats
An indexer directive applied as a microformat will overrule conflicting page-level settings for particular HTML elements. For example, if a page's robots meta tag says "follow," a nofollow directive placed on an individual link will win for that link. It is also possible to apply robots tags only to a specific group of URLs, although this may require additional programming skills and knowledge of the HTTP protocol and web servers.
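A minimal illustration (the link URL is a placeholder): the page-level robots meta tag says links may be followed, but the rel="nofollow" on the individual link takes precedence for that link.

<meta name="robots" content="index, follow">
<a href="http://website.com/partner-page/" rel="nofollow">A link crawlers are asked not to follow</a>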
Protecting private info
You should also remember that the robots.txt file on your site is publicly available; anyone can check it to see which areas are blocked from search engines. If you want to hide private information on your web pages from the public, you have to use other techniques: higher-level protections such as passwords and secure log-ins should be required to access such information.
Another thing to be aware of is that a malicious crawler may simply ignore the robots.txt file before it acts, so you cannot rely on robots.txt as a security measure. Remember that with robots.txt you can only ask well-behaved bots to keep out of the specified URLs, nothing more.
Another tip to remember while setting up robots.txt is that Google and Bing accept the asterisk (*) and dollar sign ($) pattern characters, which is a smart way to get the most out of pattern-based exclusion. Always proofread the robots.txt file carefully: because it is case sensitive, if crawlers seem to misbehave, first check whether your finger slipped over the shift key while making an entry.
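For illustration (the file extension is only an example): '*' matches any sequence of characters and '$' anchors the pattern to the end of the URL, so the rule below blocks crawling of URLs ending in .pdf.

User-agent: *
Disallow: /*.pdf$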
There are many advantages to using robots.txt files. However, keep in mind that they are highly sensitive and need to be handled carefully; very unfortunate things can happen if your robots.txt file is not set up well.
There have been several cases in which well-developed sites with genuine, valuable backlinks and plenty of organic content ended up with no SEO luck for seemingly unfathomable reasons. Many of them turned out to have been hit simply because of a disallowed forward slash in the file, which instructed the crawler bots of the major search engines to stay away from every page on the site.
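That fatal configuration is only two lines long; it is sketched here so you can recognize it, since it tells every compliant crawler to avoid the entire site:

User-agent: *
Disallow: /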