How to use search engine robots

September 10, 2019

How to manage search engine crawlers with either a robots.txt file or an .htaccess file.

What you need 

  • FTP access to your Nexcess server. For details about how to use FileZilla, a popular FTP client, refer to How to use FileZilla.  
  • A Nexcess account in a physical (non-cloud) environment.

Search engine robots in robots.txt 

Locating 

  1. Using your preferred FTP client, navigate to your site's directory and open the /html directory.
  2. Within this directory, locate the robots.txt file. If you are unable to locate this file, create a text file with the name robots.txt.

Adding functions  

The following sections describe the formatting for allowing or disallowing crawlers to access specific folders on your website.

ATTENTION: Search engine crawlers do not scan the robots.txt file each time they crawl your site, so changes to your robots.txt file might not be read by the search engine for up to a week.

Blocking search-engine crawlers 

If you are performing development work on your site and would prefer that Google or Bing not crawl it, blocking your site from search engines is an option.

  1. The first line of the robots.txt file will be User-agent: followed by the name of the search engine you want to block.
  2. On the next line, type Disallow: followed by the folders and files you want to block the bot from crawling. For example:
    User-agent: googlebot
    Disallow: /photos
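
For example, to block every search engine crawler from the entire site while it is under development, a minimal robots.txt might look like this:

    User-agent: *
    Disallow: /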

Allowing search engines to crawl specific folders of your site

If you would like to block most of your site from search engine crawlers but still let them crawl specific folders, configure an allow rule.

  1. The first line of the robots.txt file will be User-agent: followed by the name of the search engine crawler.
  2. On the next line, type Allow: followed by the name of the folder you would like to allow the bot to crawl (see the example below).
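
For example, the following entry blocks googlebot from the entire site except a /blog folder (the folder name here is only illustrative):

    User-agent: googlebot
    Disallow: /
    Allow: /blog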

Adding crawl delays for search-engine robots

If your site is experiencing a large amount of traffic, and it appears to be caused by multiple search engine crawlers simultaneously visiting your site, configure a search engine crawler delay.

ATTENTION: Adding a crawl delay to your robots.txt file is considered a non-standard entry, and some search engines do not abide by this rule. Check the documentation of the specific search engine you want to delay for details.

  1. The first line of your robots.txt addition will be User-agent: followed by the name of the search engine.

  2. The second line will be Crawl-delay: followed by a number between 1 and 30. This is the number of seconds the crawling search engine should wait between requests to your site. If your site is being crawled by multiple bots simultaneously, consider adding a crawl delay of 10 seconds or more (see the example below).
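
For example, an entry asking bingbot to wait 10 seconds between requests might look like this (the bot name and delay value are illustrative):

    User-agent: bingbot
    Crawl-delay: 10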

The following table is a list of search engines and their corresponding bot names:

Search Engines         Search Bot Name
Google                 googlebot
Bing                   bingbot
Baidu                  baiduspider
MSN Bot                msnbot
Yandex.ru              yandex
All Search Engines     *

For example, to block Google's bot from viewing your /photos folder, add the following lines to your robots.txt file:

User-agent: googlebot
Disallow: /photos

Search engine robots in .htaccess

Depending on the way your website is configured, your robots.txt file might not work properly with search engine crawlers. In that case, you can make changes to your .htaccess file instead.

ATTENTION: Search engine crawlers do not scan the robots.txt file each time they crawl your site, so changes to your robots.txt file might not be read by the search engine for as long as a week.

Locating .htaccess

  1. Using your preferred FTP client, navigate to your site's directory and open the /html directory.

  2. Within this directory, locate the .htaccess file. If the file does not exist, create a text file with the name .htaccess.

Adding functions  

  1. Once you have located or created the .htaccess file, open the file in your preferred text editor.

  2. If you are creating a new .htaccess file, the first line should be RewriteEngine On

  3. If the .htaccess file already exists and you are editing it, make sure the RewriteEngine On line is at the top of the file.

  4. The following line reads the visitor's user agent string and matches it against the name provided in the .htaccess file. Regular visitors will not match this condition and therefore will not be blocked. Replace [crawler] with the name of the search engine:

    RewriteCond %{HTTP_USER_AGENT} ^[crawler]$ [NC]

  5. The final line tells the server what to do with a matched user agent; in this example, it returns a 403 Forbidden response.
    RewriteRule .* - [R=403,L]

For example, to block Yandex from crawling any pages of your site, the .htaccess file will look something like this:

RewriteEngine On
RewriteCond %{HTTP_USER_AGENT} ^Yandex$ [NC]
RewriteRule .* - [R=403,L]
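
To block more than one crawler, the conditions can be chained with the OR flag. The following is a sketch, assuming you want to block both Yandex and Baidu; the bot names are illustrative, and dropping the ^ and $ anchors matches the bot name anywhere in the user agent string:

RewriteEngine On
# Match either bot name anywhere in the User-Agent header, case-insensitively
RewriteCond %{HTTP_USER_AGENT} Yandex [NC,OR]
RewriteCond %{HTTP_USER_AGENT} baiduspider [NC]
# Return 403 Forbidden to any matched crawler
RewriteRule .* - [R=403,L]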

Adding crawl delays for search engine robots

If adding a crawl delay to the robots.txt file was unsuccessful, you can block the bot entirely by adding the following to your .htaccess file (a filled-in example follows the line-by-line breakdown below):

SetEnvIf User-Agent [botname] GoAway=1 
Order allow,deny 
Allow from all 
Deny from env=GoAway
  • The first line checks the User-Agent header, where [botname] is the name of the bot:
    SetEnvIf User-Agent [botname] GoAway=1 
  • The second and third lines allow all traffic that does not match the first line:
    Order allow,deny
    Allow from all
  • The fourth line denies all traffic that matches the GoAway variable:
    Deny from env=GoAway
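
For example, to block bingbot entirely, the entry might look like the following sketch; the bot name is illustrative, and SetEnvIf treats the value as a regular expression matched against the User-Agent header:

# Flag any request whose User-Agent contains "bingbot"
SetEnvIf User-Agent "bingbot" GoAway=1
# Allow everyone else, then deny flagged requests
Order allow,deny
Allow from all
Deny from env=GoAway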


For 24-hour assistance any day of the year, contact our support team by email or through your Client Portal.


Jason Dobry