What is a robots.txt file & How To Use It?

Ever wondered what the purpose of a robots.txt file is? We are going to share the common rules that you might want to use to communicate with search engine robots like Googlebot. The primary purpose of a robots.txt file is to restrict which parts of your website search engine robots, or ‘bots’, are allowed to crawl. The file is quite literally a simple .txt (text file) that can be created and opened in almost any notepad or HTML editor. If you use a word processor, make sure you save the file as plain text.

To make a start, name your file robots.txt and add it to the root directory of your website. This is quite important, as all of the main, reputable search engine spiders will automatically look for this file there and take instruction from it before crawling your website. So here’s how the file should look to start with. On the very first line, add ‘User-agent: *’. This first command essentially addresses the instructions to all search bots:

Starting your Robots.txt File

User-agent: *

Once you’ve addressed a specific bot or, as in our case with the asterisk, all search bots, you come onto the Allow and Disallow commands that you can use to specify your restrictions.
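To see how addressing works in practice, here’s a minimal sketch in Python using the standard library’s urllib.robotparser module. The bot name ‘SomeOtherBot’, the example.com URLs and the /drafts/ folder are made up for illustration, and the Disallow lines it uses are explained in the sections that follow. Python’s parser is only one interpretation of the rules, so treat the results as indicative rather than as a guarantee of how any particular search engine behaves.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical rules: one group addressed to a named bot,
# and one group addressed to every other bot via the asterisk.
RULES = """\
User-agent: Googlebot
Disallow: /drafts/

User-agent: *
Disallow: /
"""


def build_parser(rules: str) -> RobotFileParser:
    """Parse robots.txt text held in a string (no network access needed)."""
    parser = RobotFileParser()
    parser.parse(rules.splitlines())
    return parser


parser = build_parser(RULES)

# The named bot is matched against its own group.
googlebot_home = parser.can_fetch("Googlebot", "https://example.com/index.html")
googlebot_draft = parser.can_fetch("Googlebot", "https://example.com/drafts/post.html")

# Any bot without its own group falls back to the '*' group.
other_home = parser.can_fetch("SomeOtherBot", "https://example.com/index.html")

print(googlebot_home, googlebot_draft, other_home)
```

In this sketch the named bot only has to stay out of /drafts/, while every other bot is asked to stay away from the whole site.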

Banning Bots from the entire website

To simply ban the bots from the entire website directory, including the homepage, you’d add the following code:

User-agent: *
Disallow: /

The forward slash represents the root directory of your website, so this rule covers everything beneath it. In most cases, you won’t want to restrict your entire website, just specific folders and files. To do this, you specify each restriction on its own line, preceded by the ‘Disallow:’ command. In the example below you can see the necessary code to restrict access to an admin folder along with everything in it.

Banning specific folders and/or files

User-agent: *
Disallow: /admin/
Disallow: /secure/file.php

If you’re looking to restrict individual pages or files, the format is very similar. On line 3 (above), we’re not restricting the entire ‘secure’ folder, just one PHP file within it. You should bear in mind that the paths in these directives are case-sensitive, so you need to ensure that what you specify in your robots.txt file exactly matches the file and folder names of your website. So those are the basics, and next we come onto the slightly more advanced pattern matching techniques.
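If you’d like to check rules like these before uploading them, one option is Python’s standard urllib.robotparser module. The sketch below feeds it the folder and file rules from above and asks which paths a bot may fetch. The example.com domain and the extra paths (login.php, other.php, index.html) are assumptions added purely for testing.

```python
from urllib.robotparser import RobotFileParser

# The folder and file rules from the example above.
RULES = """\
User-agent: *
Disallow: /admin/
Disallow: /secure/file.php
"""

parser = RobotFileParser()
parser.parse(RULES.splitlines())


def allowed(path: str) -> bool:
    """Return True if a bot addressed by '*' may fetch the given path."""
    return parser.can_fetch("*", "https://example.com" + path)


# Hypothetical paths used only to exercise the rules.
for path in [
    "/admin/login.php",   # inside the restricted folder
    "/secure/file.php",   # the one restricted file
    "/secure/other.php",  # same folder, different file
    "/Admin/login.php",   # different case, so a different path
    "/index.html",        # not mentioned in the rules
]:
    print(path, "allowed" if allowed(path) else "blocked")
```

Only the first two paths come back as blocked. The ‘/Admin/’ variant is allowed because the match is case-sensitive, which is exactly the mistake to watch out for.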

Pattern Matching

User-agent: *
Disallow: /*?
Disallow: /ADMIN*/
Disallow: /*.php$

These can be quite handy if you’re looking to block files or directories in bulk without having multiple lines of commands in your robots file. These bulk commands are known as ‘Pattern Matching’, and the most common one you might want is to restrict access to all dynamically generated URLs that contain a question mark. So if you check out line 2 (above), all you need to do to catch all of these is type a forward slash, an asterisk and the question mark symbol.

You can also use pattern matching to block access to all directories that begin with a given string, ‘ADMIN’ for example. If you check out line 3 (above), you’ll see how we’ve again used an asterisk to match all directories that start with ‘ADMIN’. So if, for example, you had the following folders in your root directory: ADMIN-PANEL, ADMIN-FILES, ADMIN-SECURE, then this one line in your robots file would block access to all three of these folders, as they all start with ADMIN. Because matching is case-sensitive, a lowercase ‘admin-panel’ folder would not be covered.

The final Pattern Matching command is the ability to identify and restrict access to all files that end with a specific extension. On line 4 (above) you’ll see how this single, short command will instruct the search engine bots and spiders not to crawl, and ideally not to cache, any page whose URL ends with the .php extension. So after your initial forward slash, use the asterisk followed by a full stop and then the extension. To signify an extension rather than a string that can appear anywhere in the URL, you conclude with the dollar sign. The dollar tells the bots that the URL to be restricted must end with this extension.
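To make the asterisk and dollar sign a little more concrete, here’s a rough Python sketch that turns each pattern into a regular expression and tests some paths against it. The matches function is our own hypothetical helper, not part of any library, and the sample paths are invented. It reflects the behaviour described above (asterisk matches any run of characters, dollar pins the match to the end of the URL), but individual search engines may treat edge cases differently.

```python
import re


def matches(pattern: str, path: str) -> bool:
    """Return True if a robots.txt-style pattern matches the path.

    '*' matches any run of characters; a trailing '$' means the path
    must end exactly where the pattern ends. Matching is case-sensitive
    and always starts at the beginning of the path.
    """
    anchored = pattern.endswith("$")
    if anchored:
        pattern = pattern[:-1]
    # Escape every literal piece, then join the pieces with '.*' for each '*'.
    regex = ".*".join(re.escape(piece) for piece in pattern.split("*"))
    if anchored:
        return re.fullmatch(regex, path) is not None
    return re.match(regex, path) is not None


# The three patterns from the example above.
PATTERNS = ["/*?", "/ADMIN*/", "/*.php$"]


def is_blocked(path: str) -> bool:
    """Return True if any of the example patterns matches the path."""
    return any(matches(pattern, path) for pattern in PATTERNS)


# Hypothetical paths used only to exercise the patterns.
for path in [
    "/search?q=shoes",
    "/ADMIN-PANEL/index.html",
    "/admin-panel/index.html",
    "/contact.php",
    "/about.html",
]:
    print(path, "blocked" if is_blocked(path) else "allowed")
```

Note that a URL such as /contact.php?ref=1 does not end with .php, so in this sketch it is caught by the question mark pattern rather than by the extension pattern.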

Robots.txt files and Hackers!

So that’s it. Using those 6 different techniques you should be able to put together a well-optimized robots.txt file that flags to search engines the content that you don’t wish to be crawled and cached.

It’s important we point out, however, that website hackers often check robots.txt files, as they can indicate where security vulnerabilities may lie. A robots.txt file is publicly readable and does not itself block access; it only asks well-behaved bots to stay away. Always make sure that you password protect and test the security of your dynamic pages, particularly if you’re advertising their location in a robots.txt file.