An inexpensive way to create & configure a perfect robots.txt file [step-by-step guide]
Did you know that robots.txt is the first thing a search engine bot looks for after entering your website/blog?
Search engines are a tough nut to crack, especially for beginners. And of all the essential things to do after creating a website, the one most beginners miss is creating or updating the robots.txt file. When it comes to leveraging the power of the internet to grow traffic from the search engines, one of the most powerful tools in your arsenal is the robots.txt file.
It’s a ninja technique that you can use right away and easily. It’s a technique that can help you take advantage of the search engine crawler’s natural flow. It’s a tiny .txt file that any website can have, but not many utilize to the fullest.
In this post, I will be talking about some inexpensive ways to create a perfect robots.txt file & best practices for a robots.txt file for your website/blog.
As part of this post, you will learn the following:
- What is a robots.txt file?
- Finding your robots.txt file
- Robots.txt examples
- How to create a robots.txt file?
- How does robots.txt work?
- Optimize robots.txt for SEO
- Pros & Cons of robots.txt file
- Robots.txt for WordPress
- Robots.txt tester, testing everything is working fine.
Without further ado let’s begin 😉
This is a really long post, sit back and relax while you learn everything about the robots.txt file ☕
What is a robots.txt file? Why is it important to have the file?
Robots.txt is a plain text file that sits at the root of a website and gives search engine web crawlers instructions on how to crawl your website/blog pages.
Furthermore, the robots.txt file is part of the robots exclusion protocol (REP). REP is a globally accepted internet standard that governs how search engine bots crawl your webpages, access the content in them, index it, and serve that content to the users.
In simpler words, a robots.txt file specifies your preferences for which parts of your website should & shouldn’t be crawled by the web crawlers.
There are two behaviors in the file that instruct the web crawlers, ‘Allow’ & ‘Disallow’, which mean ‘allowing’ and ‘not allowing’ respectively. (‘follow’ & ‘nofollow’ are different things: they belong to the robots meta tag and to links, not to the robots.txt file.) Allowing or not allowing the web crawlers to crawl certain pages of your website/blog is totally up to you. No standard dictates which pages you should pick.
User-agent: * [crawler/bot name]
Disallow: [URL path not to be crawled]
These lines of code can include/exclude any URL on your website/blog from any web crawler. Furthermore, you can include any number of user agents and directives (i.e., allows, disallows, crawl-delays, etc.).
Here’s an example of a robots.txt file:
In the above example, msnbot, discobot, and slurp each have their own set of instructions. For the rest of the user agents, the instructions are given under the user-agent: * group.
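To see how such groups behave, here is a minimal sketch using Python's built-in urllib.robotparser module. The bot names come from the example; the paths and the crawl delay are made up purely for illustration, so treat this as a rough model and not as anyone's real file.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt: the bot names come from the text above,
# but every path and the delay value are assumptions for illustration.
GROUPED_ROBOTS = """\
User-agent: msnbot
Disallow: /private/

User-agent: discobot
Disallow: /

User-agent: Slurp
Crawl-delay: 10
Disallow: /tmp/

User-agent: *
Disallow: /admin/
"""

grouped = RobotFileParser()
grouped.parse(GROUPED_ROBOTS.splitlines())

# Each named bot follows its own group; everyone else falls back to "*".
print(grouped.can_fetch("msnbot", "/private/report.html"))  # False
print(grouped.can_fetch("discobot", "/blog/"))               # False
print(grouped.crawl_delay("Slurp"))                          # 10
print(grouped.can_fetch("SomeOtherBot", "/blog/"))           # True
```

Note that a bot with its own group ignores the user-agent: * group entirely, so anything you want every bot to obey has to be repeated in each group.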
Finding your robots.txt file
Creating a robots.txt file is as important as creating your website/blog itself. There is a good chance that you do not have a robots.txt file for your website/blog yet. Unlike a sitemap file, this file is not created by default, as the preferences depend completely on you. Hence you have to create a robots.txt file from scratch.
If you are not sure whether or not you already have a robots.txt file, you can simply add ‘/robots.txt’ at the end of your website/blog’s domain. This way you can even take a sneak peek at other sites’ files and learn what they’re doing as part of their SEO strategy.
One of three outcomes will turn up:
1] You will locate the file
2] You’ll find an empty page, like Disney
3] You’ll land on a 404 page.
If you get a 404 page, it’s time to create a file of your own from scratch.
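If you would rather script this check, the sketch below builds the robots.txt address for any page of a site and sorts a response into the three outcomes above. The function names are hypothetical, the https:// scheme is an assumption, and the actual HTTP request is left out so the logic stays easy to follow; status codes other than 404 are not treated specially here.

```python
from urllib.parse import urljoin

def robots_url(page_url):
    """Return the robots.txt address for the site that page_url belongs to."""
    return urljoin(page_url, "/robots.txt")

def classify(status_code, body):
    """Map an HTTP status code and response body to one of the three outcomes."""
    if status_code == 404:
        return "missing"   # outcome 3: time to create the file
    if not body.strip():
        return "empty"     # outcome 2: the file exists but says nothing
    return "found"         # outcome 1: there are instructions to read

print(robots_url("https://www.domain.com/blog/some-post/"))
print(classify(200, "User-agent: *\nDisallow:\n"))
print(classify(200, ""))
print(classify(404, "Not Found"))
```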
To create a robots.txt file, use a plain text editor like Notepad or TextEdit (avoid using MS Word, as it adds hidden code to the content).
Now, let’s take a look at some examples to create a perfect file for your website/blog.
Robots.txt file examples
Consider a website www.domain.com. Here are some examples of its inclusion/exclusion file:
Assuming that the text file is available, the URL to access it would be: www.domain.com/robots.txt
Allow all (access everything):
This code will allow all the web crawlers/bots/user-agents to crawl all of the content on www.domain.com, so that it can be indexed and served to the end users. If you are just starting out, do not let the web crawlers crawl everything on your website/blog. Just to be on the safer side, it helps to block certain sensitive pages.
If you decide to keep everything crawlable by all the bots, make sure you have thorough, quality content. You cannot afford to create a bad first impression on new users.
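As a minimal sketch, an allow-all file can be written as shown in the string below and checked with Python's built-in urllib.robotparser. The exact two lines are an assumption about what such a file usually looks like, and the bot names and URLs are only examples.

```python
from urllib.robotparser import RobotFileParser

# Assumed allow-all file: one group for every bot, with the whole site allowed.
ALLOW_ALL = """\
User-agent: *
Allow: /
"""

allow_all = RobotFileParser()
allow_all.parse(ALLOW_ALL.splitlines())

# Any bot may fetch any URL, including the homepage.
print(allow_all.can_fetch("Googlebot", "http://www.domain.com/"))               # True
print(allow_all.can_fetch("AnyOtherBot", "http://www.domain.com/a/page.html"))  # True
```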
Disallow all (access nothing):
This code in the file tells bots not to crawl any URL on the website, and it applies to all crawlers. With this code in the file, the bots will not crawl even the homepage.
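A disallow-all file, sketched under the same assumptions (the two lines in the string are the usual form, the bot name and URLs are examples), can be checked like this:

```python
from urllib.robotparser import RobotFileParser

# Assumed disallow-all file: a single slash blocks every path on the site.
DISALLOW_ALL = """\
User-agent: *
Disallow: /
"""

disallow_all = RobotFileParser()
disallow_all.parse(DISALLOW_ALL.splitlines())

# Even the homepage is off limits.
print(disallow_all.can_fetch("Googlebot", "http://www.domain.com/"))       # False
print(disallow_all.can_fetch("Googlebot", "http://www.domain.com/blog/"))  # False
```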
Blocking a specific bot from a specific folder:
This code will block Google’s bot (with the user-agent name Googlebot) from crawling any URL under ‘/example-folder’.
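Here is a rough sketch of such a rule, again checked with Python's urllib.robotparser. The trailing slash on the folder and the page names are assumptions for illustration.

```python
from urllib.robotparser import RobotFileParser

# Assumed rule: only Googlebot is named, and only one folder is blocked.
BLOCK_FOLDER = """\
User-agent: Googlebot
Disallow: /example-folder/
"""

folder_rules = RobotFileParser()
folder_rules.parse(BLOCK_FOLDER.splitlines())

# Googlebot is kept out of the folder but may crawl the rest of the site.
print(folder_rules.can_fetch("Googlebot", "http://www.domain.com/example-folder/page.html"))  # False
print(folder_rules.can_fetch("Googlebot", "http://www.domain.com/blog/"))                     # True
# Bots that are not named in the file are not affected at all.
print(folder_rules.can_fetch("Bingbot", "http://www.domain.com/example-folder/page.html"))    # True
```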
Blocking a specific bot from a specific webpage:
This code in robots.txt will tell the Google bot not to crawl the specific page at the URL www.domain.com/example/folder/this-page-is-blocked.html
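Sketched the same way, a single-page rule for the URL above would look roughly like this. The sibling page name is hypothetical and is only there to show that nothing else in the folder is affected.

```python
from urllib.robotparser import RobotFileParser

# Assumed rule: the full path of one page, for Googlebot only.
BLOCK_PAGE = """\
User-agent: Googlebot
Disallow: /example/folder/this-page-is-blocked.html
"""

page_rules = RobotFileParser()
page_rules.parse(BLOCK_PAGE.splitlines())

blocked = "http://www.domain.com/example/folder/this-page-is-blocked.html"
sibling = "http://www.domain.com/example/folder/another-page.html"  # hypothetical page

print(page_rules.can_fetch("Googlebot", blocked))  # False
print(page_rules.can_fetch("Googlebot", sibling))  # True
```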
Note: Just like blocking bots in the aforementioned examples, you can also explicitly allow bots to crawl the same things. Simply replace ‘Disallow’ with ‘Allow’ and the robots.txt file will be ready.
Now that you know how to create a robots.txt file, you can allow or disallow search engines from accessing anything or everything on your website/blog. Check the full list of user-agent names of top search engine bots before creating the file.
If you are using WordPress, then you might see a file when you go to www.yoursite.com/robots.txt even though you never created one. That’s because WordPress generates a virtual robots.txt on the fly, which you cannot edit directly. To customize it, you need to create a real robots.txt file from scratch and place it in the root folder of your website; it then takes the place of the virtual one. I will discuss this in the subsequent section, read on.
How to create a robots.txt file
With the examples we saw in the previous section, let’s create an inclusion/exclusion file for your website/blog. The simplest robots.txt file contains a minimum of two lines and two directive names: User-agent & Allow/Disallow.
With this basic thing in mind, let me show you how to set up a simple robots.txt file step by step:
1] Open a plain text editor on your machine
2] Type user-agent: * (this will make the file applicable for all the search engine bots)
3] Type Disallow: and leave it blank after that
Since there’s no instruction telling the bot what not to crawl, everything on your website is on the radar of the web crawler. You can say this is another way of allowing the web crawlers to crawl everything on your website/blog.
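If you prefer to script the three steps, the sketch below writes the same two lines to a file named robots.txt and reads it back to confirm that nothing is blocked. The temporary folder is only a stand-in for wherever you keep your site's files.

```python
import tempfile
from pathlib import Path
from urllib.robotparser import RobotFileParser

# Steps 2 and 3 above: one group for all bots, with an empty Disallow value.
content = "User-agent: *\nDisallow:\n"

# Save it as plain text under the exact name "robots.txt".
# A temporary folder stands in for your working directory here.
folder = Path(tempfile.mkdtemp())
path = folder / "robots.txt"
path.write_text(content, encoding="utf-8")

# Read the file back and confirm that an empty Disallow blocks nothing.
check = RobotFileParser()
check.parse(path.read_text(encoding="utf-8").splitlines())
print(path.name)                                       # robots.txt
print(check.can_fetch("Googlebot", "/any/page.html"))  # True
```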
If you want to mention the sitemap of your website/blog, you can do that by simply adding Sitemap: http://www.yoursite.com/sitemap.xml on its own line in the file and you’re done. If you don’t have a sitemap, you can create one easily by using WordPress plugins for creating a sitemap. For non-WordPress users, you too can create XML sitemaps within 20 seconds.
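A quick way to confirm that the sitemap line is being picked up is to parse the file and list the sitemaps it declares. This sketch assumes Python 3.8 or newer, where RobotFileParser gained the site_maps() method; the file content simply combines the allow-all group from the steps above with the sitemap line.

```python
from urllib.robotparser import RobotFileParser

WITH_SITEMAP = """\
User-agent: *
Disallow:

Sitemap: http://www.yoursite.com/sitemap.xml
"""

sitemap_rules = RobotFileParser()
sitemap_rules.parse(WITH_SITEMAP.splitlines())

# site_maps() returns the declared sitemap URLs, or None if there are none.
print(sitemap_rules.site_maps())  # ['http://www.yoursite.com/sitemap.xml']
```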
Before we optimize the file, let’s first understand how it works, so we’re in a better position to make the necessary changes.
How does a robots.txt file work?
Search engines are super busy day in and day out. Each one has a component that crawls millions and millions of links on the web continuously, without any stop. If I were to summarize the whole process of a search engine, it would be crawling the links, indexing the crawled information, and then serving it to the end users.
As soon as the bot arrives at a website, the first thing it looks for is the robots.txt file. If the bot finds it, it will read the instructions and then proceed accordingly. If it doesn’t, then it will generally crawl everything on that website, without any exception or special instruction.
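The decision described above can be sketched in a few lines of Python. This is a simplified model of a polite bot, not how any particular search engine is implemented; may_crawl is a hypothetical helper, and the robots.txt content is passed in as text (or None when the file is missing) so that no network access is needed.

```python
from urllib.robotparser import RobotFileParser

def may_crawl(robots_text, user_agent, url):
    """Decide whether a polite bot should fetch url.

    robots_text is the content of /robots.txt, or None when the file
    was not found (for example, the server answered 404).
    """
    if robots_text is None:
        # No file means no instructions: everything is crawlable.
        return True
    parser = RobotFileParser()
    parser.parse(robots_text.splitlines())
    return parser.can_fetch(user_agent, url)

rules_text = "User-agent: *\nDisallow: /secret/\n"    # hypothetical file
print(may_crawl(None, "Googlebot", "/secret/page"))        # True
print(may_crawl(rules_text, "Googlebot", "/secret/page"))  # False
print(may_crawl(rules_text, "Googlebot", "/blog/"))        # True
```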
Best practices for robots.txt file
1] Place the file in the top-level directory of your website. That is, place it in the main folder/root folder on your web server. Depending on the FTP client or file manager you use, the way to access the root folder will vary.
2] The file name is case-sensitive and should be “robots.txt” (all small letters, nothing in capitals)
3] The robots.txt is public and anyone can access it. Simply adding /robots.txt at the end of your URL would expose the strategy to everyone. Therefore, trying to hide private information with that file would be a regrettable mistake.
4] If you have sub-domains on your website/blog, make sure each has its own robots.txt file. For example, blog.yoursite.com and yoursite.com should have individual files in their respective root folders.
5] Include the sitemap in the file for better results. It helps the bot discover all the pages of your website/blog.
So it is very important for you to have a file to instruct the bot accordingly. If you don’t have any instruction file, the bot will apply its default crawl-index-serve settings to your website/blog. Considering how many bloggers do not have a robots.txt file, you can imagine the queue for indexing newly created websites.
To get your site indexed quickly, you will need to optimize the file properly. That being said, let’s now take a look at how to do it.
How to optimize robots.txt for SEO?
Depending on how you want to optimize your robots.txt file, the content you include in the file will vary. The possible combinations of user-agents, directives and inclusions/exclusions are practically endless. Hence, it’s totally up to you how you want to optimize, or to be more precise, customize the robots.txt file.
Out of the many possible white hat ways of optimizing the file, I will cover the most fruitful ones.
Considering the significance of crawling frequency, it’s important for you to know that the robots.txt file is one of the best ways to get started with search engine optimization for your website/blog, and it’s absolutely free.
To begin with, you can disallow the bots from crawling the login page of your website. Since it’s just the login page to the backend of your website/blog, it is simply a waste of time for the bots to crawl that page. This way you can save the bots some time and direct them to more important pages.
If you are a WordPress user, you can use the following code in your robots.txt file.
This code will block the admin page from being crawled and allow everything else to be crawled. You can directly copy and paste this code and get started with step 1 of your SEO. If you want to block any other specific page that you feel shouldn’t show up in search results, you can simply add that page’s path after “Disallow:”, between two slashes. That is, if you want to block the bot from crawling http://yoursite.com/sample-page/, simply add
Disallow: /sample-page/
to the robots.txt file. It’s that simple.
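As a rough sketch of what such a WordPress file can look like, the string below blocks the admin area (WordPress serves it under /wp-admin/) plus the sample page, and the check confirms that ordinary posts stay crawlable. The post URL is hypothetical, and your own file may need more or fewer lines.

```python
from urllib.robotparser import RobotFileParser

# Assumed WordPress file: admin area and one extra page blocked for all bots.
WORDPRESS_ROBOTS = """\
User-agent: *
Disallow: /wp-admin/
Disallow: /sample-page/
"""

wp_rules = RobotFileParser()
wp_rules.parse(WORDPRESS_ROBOTS.splitlines())

print(wp_rules.can_fetch("Googlebot", "http://yoursite.com/wp-admin/"))     # False
print(wp_rules.can_fetch("Googlebot", "http://yoursite.com/sample-page/"))  # False
print(wp_rules.can_fetch("Googlebot", "http://yoursite.com/hello-world/"))  # True
```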
Now, if you were wondering which pages you should disallow and allow, here’s a list of possible pages.
Duplicate content
While duplicate content is not at all desirable, at times it is inevitable. For example, a printer-friendly version of a web page is duplicate content for the search engine, but the printer-friendly version is important for user experience.
Hence, blocking the printer-friendly version comes in handy to avoid being penalized by the search engine for duplicate content.
Thank you pages
Lead generation is the bread and butter of digital marketing. If you don’t capture leads, you don’t grow at the right pace. Furthermore, lead generation is incomplete without thank you pages. A thank you page is where you greet users for doing what you wanted them to do. Having these pages accessible via the search engine lets visitors skip the lead generation process altogether.
Therefore, disallowing the bots from crawling such pages is super fruitful. If you block search engine bots from crawling thank you pages, you can help ensure that only qualified leads see those pages and not everyone.
Disallowing the crawl of a thank you page works the same way as the aforementioned method. Just place the URL path of the thank you page(s) after the Disallow: directive in the robots.txt file.
Sign up/Call to action pages
If you are into email marketing as part of your growth hack strategy, you’d know the importance of sign up pages. Sign up pages are the pages where end users enter their details (generally email id & name) to get something in return: either newsletters or exclusive content that you might not have included in the post directly.
Such pages can also be disallowed for the bots, to help ensure that only highly targeted users see those pages and not everyone.
Comments
The comment section is a rich source for you to dig for feedback and ideas for your next topic. But wouldn’t it look odd if a comment page of your website/blog got indexed on SERPs? Sure, you want visibility from the search engine, but not this way.
Blocking the comments feed would do the trick here. Simply adding Disallow: /comments/feed/ to the robots.txt file would do the magic for you in this case.
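Once you have decided which of these page types to block, the file can be put together from a simple list of paths. The sketch below is one possible way to do it; build_robots is a hypothetical helper, and apart from /comments/feed/ every path is a placeholder that you would swap for your own URLs.

```python
from urllib.robotparser import RobotFileParser

def build_robots(disallowed_paths, user_agent="*"):
    """Return robots.txt text with one Disallow line per path."""
    lines = [f"User-agent: {user_agent}"]
    lines += [f"Disallow: {path}" for path in disallowed_paths]
    return "\n".join(lines) + "\n"

# /comments/feed/ comes from the text above; the other paths are hypothetical.
robots_text = build_robots(["/print/", "/thank-you/", "/sign-up/", "/comments/feed/"])
print(robots_text)

# Confirm the result: the feed is blocked, a normal post is not.
checker = RobotFileParser()
checker.parse(robots_text.splitlines())
print(checker.can_fetch("Googlebot", "/comments/feed/"))  # False
print(checker.can_fetch("Googlebot", "/my-first-post/"))  # True
```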
Pros & Cons of robots.txt file
Pro: Better crawl budget
The first thing the crawl bot looks for is a robots.txt file (which you already know), and depending on the instructions, it will begin crawling. Furthermore, if you don’t have the file, it will crawl everything under your domain name. This will unnecessarily consume what SEOs call “crawl budget”.
Crawl budget is, roughly, the amount of crawling a bot is willing to spend on your website, and how well it gets spent depends on the robots.txt file. If you specify the sections that you don’t want the bots to crawl, they will not waste the crawl budget on unnecessary pages. The robots.txt file saves time and crawl budget, both for you and the bot.
Con: cannot remove pages from indexing
The Disallow directive alone is not capable of stopping the search engine from serving those pages to the end users. There are two more directives that you should use to ensure that those pages are not indexed in the SERPs.
The first is noindex.
Unlike Disallow, noindex does not belong in the robots.txt file: it is not part of the robots.txt standard, and Google stopped honoring noindex lines in robots.txt in September 2019. Instead, add it to the page you don’t want indexed, either in a robots meta tag in the page’s HTML or in an X-Robots-Tag HTTP header. Keep that page crawlable, because a bot that is blocked from a page never gets to see its noindex.
This will ensure that the specific pages do not show up on SERPs.
The second is nofollow. It tells the bots not to follow the links on the page that carries it. Furthermore, pages that are disallowed but not noindexed can still appear on SERPs and will look like this.
Source: Yoast SEO
Con: Link value is lost
If the crawl bot cannot crawl through a page, the link value of that page including the links inside that page is lost. However, if the bot can crawl and not index that page, then the link value is not lost.
Robots.txt for WordPress
With WordPress being such a popular platform, robots.txt has a special spot on the open-source platform. First, it’s super easy for anyone to create a robots.txt file on WordPress. Second, it’s even easier to configure it. There are a handful of plugins that can help you create the file.
But there’s an ultimate plugin for everything SEO: the Yoast SEO plugin
You can easily create and configure the robots.txt file with the Yoast SEO plugin for WordPress. Once you create the file, follow this guide to configure the file for SEO.
Robots.txt tester, testing everything out
To test the configuration of your robots.txt file, head over to Google Search Console (you can sign up if you don’t already have an account)
Google Search Console is a webmaster tool to monitor your website traffic and relevant data.
Step 1: Sign in to Google Search Console
Step 2: Select your property, and click on the ‘Crawl’ menu on the left panel.
Step 3: Click on “robots.txt Tester”
There may be some default code in the robots.txt Tester’s editor screen. Delete it and replace it with the new file content that you just created. Click on the “Test” button at the bottom of your screen.
Make sure the “Test” button changes to “Allowed”, which means that the file is valid. Here’s some additional information that you’d find useful while testing the file in the robots.txt tester. Once done, upload the text file to your root directory and make sure it’s named “robots.txt” only.
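Before uploading, you can also run a quick local check of the same kind. The sketch below is a hypothetical helper built on Python's urllib.robotparser: it reports, for a list of URLs, whether a given bot would be allowed or blocked. It is not Google's parser, so treat it as a first sanity check only; its handling of wildcards and rule precedence may differ from the official tester.

```python
from urllib.robotparser import RobotFileParser

def crawl_report(robots_text, user_agent, urls):
    """Return {url: "Allowed" or "Blocked"} for the given user agent."""
    parser = RobotFileParser()
    parser.parse(robots_text.splitlines())
    return {
        url: "Allowed" if parser.can_fetch(user_agent, url) else "Blocked"
        for url in urls
    }

# Hypothetical file content and URLs, for illustration only.
draft = "User-agent: *\nDisallow: /wp-admin/\n"
report = crawl_report(draft, "Googlebot", [
    "http://www.yoursite.com/",
    "http://www.yoursite.com/wp-admin/options.php",
])
for url, verdict in report.items():
    print(verdict, url)
```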
Now, you’re backed by a simple, inexpensive and powerful tool that is capable of increasing your visibility on the web.
Final thoughts on robots.txt
While you help yourself get visitors, you’re also giving Google a chance to serve better quality content, and the users the information they’re looking for. It’s a win-win situation for all three parties, and the robots.txt file ensures that it’s done right.
If you help bots spend the crawl budget effectively, they will return the favor by indexing the most important pages of your website/blog in SERPs. Furthermore, it doesn’t take much effort to put this tiny thing to work. It would take some time to analyze the inclusions and exclusions, and as the content and traffic grow, you will need to update the file occasionally. Here’s a helpful video that would give you an idea of how frequently you should update the file.
Other helpful official guides from Google on robots.txt:
Over to you. What do you think of this tiny little file? Have you already created the file for your website/blog? If yes, which pages have you blocked the bots from crawling? Let me know in the comment section below.
Do you know someone who’s looking for this information? Share this with them, and also share it in your social network.
Originally published at www.btricks.in on July 4, 2018.