In the vast world of Search Engine Optimization (SEO), some things take up all your time and effort, while robots.txt sits on the sidelines, largely ignored.
It is a simple file that can reap big rewards. However, people don’t show it the respect it deserves, even though a single syntax error in the robots.txt file can make your entire website vanish from every search engine. After that, even if you follow Google’s guidelines to the last full stop, it won’t be of any use.
In this blog, we will discuss what this protocol is, the technicalities associated with it, and how you can implement it on your website.
What is robots.txt?
Robots.txt is a text file that carries instructions telling search engine robots, also called crawlers, how and what to crawl on your website.
It is part of the Robots Exclusion Protocol (REP), a set of standards that regulates the crawling and indexing behavior of web robots.
In simpler words, if the crawler of a particular search engine comes to your website, it will look for a robots.txt file. If it finds the file, it will know where it can and cannot go. Think of it as a road sign that tells drivers which roads they are not allowed to enter.
Yes, in essence, the purpose of a robots.txt is that simple. However, various aspects need to be understood and considered when creating it.
Here is the simplest example of a robots.txt file:
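```
# An empty Disallow value blocks nothing, so every page may be crawled
User-agent: *
Disallow:
```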
In this example, all user-agents are allowed to crawl every page of the website. This robots.txt file has a single user-agent line and a single directive. However, a file can contain multiple sets of user-agent directives.
Each user-agent directive set is separated by a line break, and it has two components:
User-agent: The user-agent is the web crawling robot being addressed. Each search engine has its own crawler. For example, Google has Googlebot, Bing has Bingbot, and so on.
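For instance, a WordPress site like mysite.com that wants to keep all crawlers out of its wp-admin and wp-content folders would use:

```
User-agent: *
Disallow: /wp-admin/
Disallow: /wp-content/
```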
In the above example, the website mysite.com doesn’t want crawlers to crawl wp-admin and wp-content subpages.
The use of an asterisk ( * ) means that no specific user-agent is mentioned, and the directive applies to all the user-agents. Here is a crawler list of almost all the search engines.
Directive: Directive is the command given to the user-agent that will determine its further course of action. These directives can be of three types:
Allow: This directive tells the crawler it is permitted to crawl the specified path. It is useful when you want the crawler to access a section of your site even though its parent directory is disallowed.
This directive isn’t commonly used, as web robots crawl all pages by default unless they are disallowed.
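As a sketch (the /media/ paths here are hypothetical), Allow can open up a single file inside an otherwise disallowed directory:

```
User-agent: *
Disallow: /media/
Allow: /media/logo.png
```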
Disallow: The directive ‘disallow’ specifies to a web crawler which pages not to crawl.
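For example:

```
User-agent: msnbot
Disallow: /
```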
In this example, msnbot is not allowed to crawl any webpage on this website. Keep in mind that some user-agents, such as malware bots, simply ignore these directives. So it is wise not to rely on disallow rules to hide pages with sensitive information; use proper security measures to stop nefarious crawlers from reaching those pages.
Crawl-delay: A crawl rate defines how frequently search engine robots crawl your website. A crawl delay limits that frequency and saves your website from being overwhelmed by requests generated by the bots.
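For example:

```
User-agent: bingbot
Crawl-delay: 10
```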
In this example, Bingbot must wait 10 seconds before crawling each page on this website. Googlebot ignores the crawl-delay directive, so you have to manage its crawl rate from Google Search Console.
Why should you have a robots.txt file?
Robots.txt is not an integral part of SEO. Your website can function even without it, and the crawlers will crawl it. However, having a robots.txt file does offer certain benefits.
Saves crawl budget: Websites can have anywhere from a few hundred to several thousand pages. Say an e-commerce website has thousands of product pages and several hundred other pages; which pages would you want crawled and indexed? Your product or money pages, of course. But if the unnecessary pages aren’t disallowed, your website’s crawl budget might be exhausted on them, and your real money pages would be left behind.
Improves website’s performance: Every time a webpage is crawled, it generates a request to your server that consumes part of its bandwidth. If multiple requests are made per second and there are thousands of pages on your website, this can take a considerable toll on your server’s speed, increasing your website’s load time and resulting in a bad user experience.
Reduces duplication: With robots.txt, you can stop pages with similar content from being crawled. This saves you from content duplication in the SERPs.
Prioritizes important pages: A website can have many pages, but not all of them are important. If unimportant pages consume all of your crawl budget, your SERP rankings will tank. You can control this with a robots.txt file.
How to create a robots.txt file?
A robots.txt file sits in the root directory of the website. Appending /robots.txt to your domain (for example, https://example.com/robots.txt) will show you the file.
If your website doesn’t have a robots.txt file, that URL will return an empty page or a 404 error.
Creating a robots.txt file is a simple process. It can be written manually in any text editor, or you can use a robots.txt generator to help you.
However, before creating one, it is important to understand the guidelines you must follow to create a perfect file.
- A robots.txt file must be named exactly ‘robots.txt.’
- One website can have only one robots.txt file.
- A robots.txt file must live in the root directory and can’t be placed in a subdirectory.
A robots.txt file is syntax sensitive and must be created with a text editor that produces UTF-8 encoded text. You can use Notepad on Windows or TextEdit on a Mac.
Now you can start to create a robots.txt file. As we discussed above, there are two components: user-agent and directive.
You have to start by deciding whether the crawl directives are for a specific user-agent or all of them.
In case of a specific user-agent – consider Googlebot in this case – you’d write:
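```
# Target only Google’s crawler
User-agent: Googlebot
```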
Next is the directive. Let’s say you want to disallow Googlebot from crawling the checkout page of your site. So, your directive would be:
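```
# Assuming the checkout page lives at /checkout/; adjust to your site’s URL structure
Disallow: /checkout/
```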
Done. You have got your robots.txt file.
Yes, it is that simple. Together, they form one complete user-agent and directive set of your robots.txt file:
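```
# Hypothetical example: keep Googlebot away from a /checkout/ page
User-agent: Googlebot
Disallow: /checkout/
```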
You can add as many directives as you like in this file. However, each different set is separated by a line break.
Here is a robots.txt example from our own website, Growth Proton:
One thing that is noticeable here is the last two lines that contain the URL of the site map. It is not mandatory to add sitemap to robots.txt file, but having it there has a considerable advantage.
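The general shape of such a file (an illustrative sketch with placeholder URLs, not a reproduction of the actual Growth Proton file) looks like this:

```
User-agent: *
Disallow: /wp-admin/

# Placeholder URLs; point these at your own sitemap locations
Sitemap: https://www.example.com/sitemap.xml
Sitemap: https://www.example.com/post-sitemap.xml
```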
See, the bots coming to your website will look for the robots.txt file first. If they find it, they will know where they can and cannot go. And if they also find the location of the sitemap, it’s even better! They will jump straight to it, and your webpages can be crawled easily.
The happy bots would be in and out in no time.
If you have a WordPress website, you can use plugins to create a robots.txt file. There are several WordPress robots.txt plugins that can be used to create and modify the file.
Common Robots.txt issues
A robots.txt file is simple to create, but it comes with its own set of issues that must be taken care of. For example, if you disallow the wrong pages, you will fail to get any traffic from the search engine.
Robots.txt shouldn’t be used to hide your web pages from search engines. A search engine’s bot coming to your site might not crawl a disallowed page, but if you have done link building and another website links to that page, it can still be indexed and appear in the SERPs as a bare listing with no description.
To keep a page out of the index, use meta robots directives instead, such as a noindex tag in the page’s HTML, which control the page’s indexing behavior.
The robots.txt filename is also case sensitive: it must always be written as robots.txt, not Robots.txt or ROBOTS.TXT.
A crawler can only find a robots.txt file in the root directory, or main folder, of a site. If the file is placed in a subdirectory, the bots won’t locate it, and your entire website will be crawled.
A robots.txt file is a simple text file that can be created even in Notepad. But its presence on your website massively benefits your rankings in the SERPs. A fully optimized robots.txt file allows your important pages to be crawled and indexed while leaving aside all the unnecessary ones that would eat into your crawl budget.
So, go to your website’s root directory and look for a robots.txt file. If you find one, great. But if nothing comes up, you should create one today.