As a site administrator, there is something you need to be aware of: bots. Bots account for more than half of all web traffic and a little over 40% of all internet traffic. Given the volume of traffic behind these bots, we must know everything relevant about them. In this article, we'll be discussing what they are, how to differentiate between good and bad bots, how they can benefit or harm our sites, and how to prevent bots from crawling our sites.
What Are Bots, and Why Can They Be Problematic for Sites?
In simple terms, bots are software applications configured to run tasks automatically; these tasks are often repetitive and can include imitating human behavior. With the increasing integration of the internet into our daily lives, bots have become an integral part of the online structure. Bots index websites so that search engines can lead users to the most relevant content for a specific keyword. They have also proven helpful in keeping track of new information, for instance, by continuously checking hotels for new room availability to keep prospective guests updated.
Bots can be a great tool, but they can also become dangerous weapons that can harm our sites depending on how we use them. For that reason, we classify them into two categories independently of their programmed tasks: “Good Bots” and “Bad Bots.”
Types Of Bots
As previously mentioned, we’ll be dividing bots into “Good” or “Bad.” However, based on what they do, we can subdivide them into smaller categories.
When referring to “Good Bots,” we are discussing bots that perform tasks that benefit both users and site owners and make the internet more efficient, safer, and friendlier to use. Some of the most common types of good bots are listed below:
Spider Bots (crawlers): These bots examine content on websites across the internet and index it so that search engines can return more accurate results.
Chatbots: Bots that emulate a live operator, assisting users with simple tasks or general inquiries.
Monitoring Bots: Bots that monitor sites to detect abnormal behavior or to measure availability.
Web Scanners: Similar to the monitoring bots, they scan a site looking for strange patterns. However, web scanners are focused on security and can help detect if a site has been compromised or has been infected with malware.
“Bad Bots” are bots that are programmed with malicious intent. They can interact with sites, apps, and services the same way a human would, but they misuse and abuse resources instead. More than half of targeted bad bots have a high level of sophistication, making them even harder to spot and stop. Some of the most common types of bad bots are listed below:
Spam Bots: Bots configured to send spam, usually after exploiting legitimate email accounts. Other forms of spamming are fake reviews, comments pointing to a scam, etc.
Scrapers: Content scraping is the practice of searching for and stealing content, often for malicious purposes. A bot can download all the data from a website; this data can then be used to create a similar but fraudulent copy of the site, among other harmful practices. Another common use of these bots is denial of inventory: the bots place orders for a specific item to make it unavailable to legitimate users, thus damaging potential sales.
Social Media Bots: Bots that emulate human social behavior, post comments (usually with misinformation), offer to “earn money for free,” or spread certain political opinions, among other uses.
Other Malicious Bot Activities: Several bad practices rely on the usage of bots, such as DDoS attacks, credential stuffing, click fraud, credit card theft, etc.
The examples above make it clear that we need a bot management policy, both to control bad bots as much as possible and to take advantage of good bots, since even good bots can cause trouble in some instances.
This is especially relevant for small or medium sites, the most vulnerable targets. A significant percentage of them are using a content management system, such as WordPress. We can make sites harder targets for bad bots and supervise the good ones. We’ll focus on three main options: Directory Protection (robots.txt), Web Server Protection, and External Solutions.
Directory Protection: robots.txt
A primary technique for protecting against and regulating bot activity is the robots.txt file. It's a special plain text file, placed at the root of a site, that tells bots how they may crawl it. The syntax is relatively simple; we use the special word “User-agent” to specify the agent the rule will apply to, and the “Allow,” “Disallow,” or “Crawl-delay” directives to set the behavior for those agents. Although there are other directives, these are the most common. Let’s see an example:
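A minimal robots.txt along these lines might look as follows. This is only a sketch built from the three directives named above; the 5-second value is illustrative.

```text
# Applies to every crawler
User-agent: *
# The whole site, starting at the root, may be crawled
Allow: /
# Ask crawlers to wait 5 seconds between requests
Crawl-delay: 5
```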
In this case, the “*” symbol represents all agents, and the “/” reflects the root directory of the site, so all bots are allowed to crawl the entire site. The Crawl-delay directive asks them to wait 5 seconds between requests (this directive might be interpreted differently depending on the agent, and some agents do not honor it at all). Let’s take a look at another use case. Say we want to exclude two bots from accessing the wp-admin directory. We can do so with this syntax:
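As a sketch of that second case, the file could look like the one below. The agent names ExampleBotA and ExampleBotB are hypothetical placeholders; in practice they would be replaced with the User-agent tokens of the bots to exclude.

```text
# ExampleBotA and ExampleBotB are placeholder names, not real crawlers
User-agent: ExampleBotA
User-agent: ExampleBotB
# Keep both agents out of the WordPress admin directory
Disallow: /wp-admin/
```

Listing several User-agent lines before the rule applies it to each of those agents, which avoids repeating the Disallow line once per bot.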
The difference from the previous example is that we use the “Disallow” directive to exclude that directory (/wp-admin/) after listing the undesired user-agents.
The robots.txt file is very convenient, yet it has its limitations. The file only expresses the desired behavior; nothing enforces it. Malicious bots may ignore these rules completely, and some agents interpret the syntax differently. Therefore, the rules we set might be ineffective for some bots.
Web Server Protection
A custom web server configuration can help block the majority of non-sophisticated bots. Several rules and features are available to detect and restrict bot activity, for instance, the widely used ModSecurity firewall. Another option to prevent bots from crawling your site is “Access control” rules. With these rules, we can block requests from a specific IP or hostname, and we can even block by more arbitrary variables, such as the User-agent. If we would like to block bots using the IP “22.214.171.124”, and the MSN bot, we can add these rules to the configuration file (the rules are using Apache’s syntax):
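One possible form of such rules is sketched below, assuming Apache 2.4 with its “Require” directives and placement inside a directory context or an .htaccess file. The hostname search.msn.com is an assumption about how the MSN bot's hosts resolve and should be verified before use.

```apache
<RequireAll>
    # Start by allowing everyone
    Require all granted
    # Then refuse the IP address mentioned above
    Require not ip 22.214.171.124
    # And refuse clients whose hostname ends in search.msn.com (assumed MSN bot hosts)
    Require not host search.msn.com
</RequireAll>
```

Hostname-based rules make Apache perform DNS lookups on incoming requests, so blocking by IP is usually the cheaper choice when the address is known.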
To block requests by User-agent instead (in this case we are blocking “Useless-Bot”), the rule should be:
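A sketch of one way to write this, again assuming Apache 2.4 with mod_setenvif available. The environment variable name bad_bot is an arbitrary label chosen for this example.

```apache
# Set the bad_bot variable when the User-Agent header contains "Useless-Bot" (case-insensitive)
BrowserMatchNoCase "Useless-Bot" bad_bot

<RequireAll>
    # Allow everyone by default
    Require all granted
    # Refuse any request that was flagged above
    Require not env bad_bot
</RequireAll>
```

More BrowserMatchNoCase lines that set the same variable can be added to extend the list without touching the access rule itself.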
The above rules are considerably more effective when we create an “Allow-List” or “Block-List” with all the User-agents and IP addresses (or a combination of the two) that we want to allow or block.
We can also use special modules, such as mod_evasive, to enforce request limits on IPs or hosts that ignore the rate set with the Crawl-delay directive in the robots.txt file.
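As a rough illustration, a mod_evasive configuration could look like the following. It assumes the module is already installed and loaded, and the thresholds are example values rather than recommendations; they should be tuned to the site's real traffic.

```apache
# Size of the hash table used to track clients
DOSHashTableSize    3097
# Block a client that requests the same page more than 2 times within DOSPageInterval seconds
DOSPageCount        2
DOSPageInterval     1
# Block a client that makes more than 50 requests to the site within DOSSiteInterval seconds
DOSSiteCount        50
DOSSiteInterval     1
# Number of seconds a blocked client keeps receiving 403 responses
DOSBlockingPeriod   10
```

Unlike Crawl-delay, these limits are enforced by the server, so they also apply to bots that never read robots.txt.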
External Solutions
Some service providers already restrict problematic bots; however, it has become increasingly difficult to detect and block highly developed bots, because they can mimic human behavior almost perfectly. In such conditions, the options described above might prove ineffective, and reinforcements are needed.
Adding CAPTCHA tests can significantly decrease the effectiveness of malicious bots and help prevent undesired bots from crawling the site. A more advanced approach is to use a bot management solution. These special solutions should be able to identify bad bots using several techniques, including behavior analysis, challenge-based restrictions, etc.
Blocking All Bots?
We might be tempted to block all bot activity altogether; however, we need to be careful because that can harm our search engine ranking and, in turn, decrease our traffic volume and reach.
So, instead of blocking them all, the most effective way to manage bots is to use a combination of the methods described above and allow good bots to crawl and index our sites for optimal visibility.
A common misconception is that good bots can be told apart from bad bots based on the User-agent and the referer they send. However, a bot has no obligation to abide by best practices or standards; it can send whatever values its operator sees fit.
In fact, a common practice for bad bots is to disguise themselves using legitimate user-agents and referers. This is why using several methods in conjunction is far better than using a single filter criterion.