The previous articles in this series discussed:
- The high percentage of hostile bots in web traffic today, and the consequences of inadequate bot protection.
- The types of attacks for which hostile bots are used.
- The most significant problems caused by bots in nine prominent industries/verticals.
For any organization with significant web assets, effective bot management is crucial. However, before bot traffic can be blocked, it must first be identified.
This article will discuss traditional methods of bot detection, and why they can no longer identify all malicious bots being used today.
Traditional Bot Detection Methods
Historically, bots have been detected using several approaches:
- Rate limiting: monitoring consumption rates and blocking requestors above a certain threshold.
- Blacklisting: refusing incoming requests from IP addresses that are found on blacklists.
- Signature recognition: identifying patterns in the incoming requests that indicate that the originator is a bot.
- CAPTCHA (Completely Automated Public Turing test to tell Computers and Humans Apart): challenge-response puzzles presented to visitors.
However, problems arise when these are the only methods that an IDS (intrusion detection system) or bot management solution uses. (Unfortunately, this is still often the case.)
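As a rough illustration, the first three approaches can be sketched as a single request filter. This is a minimal sketch and not any product's implementation: the function name `classify`, the blacklisted address, the user-agent markers, and the threshold values are all assumptions chosen for the example.

```python
from collections import defaultdict, deque

# All values below are illustrative assumptions, not recommended settings.
BLACKLIST = {"203.0.113.7"}                              # hypothetical blacklisted IP
BOT_UA_MARKERS = ("curl", "python-requests", "scrapy")   # hypothetical signatures
MAX_REQUESTS = 5                                         # assumed per-IP threshold
WINDOW_SECONDS = 60                                      # assumed window length

_recent = defaultdict(deque)  # ip -> timestamps of requests inside the window


def classify(ip, user_agent, now):
    """Return the reason a request is rejected, or None if it passes."""
    # Blacklisting: refuse addresses that appear on the list.
    if ip in BLACKLIST:
        return "blacklist"

    # Signature recognition: look for known automation markers.
    ua = user_agent.lower()
    if any(marker in ua for marker in BOT_UA_MARKERS):
        return "signature"

    # Rate limiting: count this IP's requests in a sliding window.
    window = _recent[ip]
    while window and now - window[0] >= WINDOW_SECONDS:
        window.popleft()
    window.append(now)
    if len(window) > MAX_REQUESTS:
        return "rate_limit"

    return None
```

Every check keys on something the requestor controls or can change (source address, user agent string, request pacing), which is why these methods struggle when used alone.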
Why They Don’t Work Against Modern Bots
There are several reasons why traditional methods have become inadequate. Some have arisen from the increasing sophistication of attackers, while others are the result of larger trends.
More sophisticated attackers
Threat actors are continually striving to make their tools more effective. Newer generations of bots are designed to avoid detection.
- Rate limiting is evaded by rotating IPs and keeping the rate of requests to ‘reasonable’ levels.
- Blacklisting is also avoided by IP rotation. The increasing irrelevance of IP tracking is discussed further below.
- Signature recognition is defeated by spoofing user agent strings and other deception, so that the bot appears to be a human user.
Headless browsers in particular have become quite sophisticated. Because they have a number of legitimate uses, developers have created headless functionality in a variety of web languages and frameworks. Unfortunately, threat actors have taken advantage of this.
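The effect of IP rotation on rate limiting can be shown with a small simulation. This is a sketch under stated assumptions: the helper `blocked_requests`, the per-IP limit of 10, and the addresses are hypothetical, and a real limiter would also track time windows.

```python
from collections import Counter


def blocked_requests(source_ips, limit):
    """Count the requests a per-IP limit would reject within one window."""
    seen = Counter()
    blocked = 0
    for ip in source_ips:
        seen[ip] += 1
        if seen[ip] > limit:
            blocked += 1
    return blocked


LIMIT = 10  # assumed per-IP threshold for a single window

# 1,000 requests from one address versus 1,000 requests from 1,000 addresses.
single_source = ["198.51.100.9"] * 1000
rotating = [f"10.{i // 256}.{i % 256}.1" for i in range(1000)]

print(blocked_requests(single_source, LIMIT))  # most requests are rejected
print(blocked_requests(rotating, LIMIT))       # no address exceeds the limit
```

With the same total volume, the rotating source never crosses the per-IP threshold, so the limiter rejects nothing.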
CAPTCHA/reCAPTCHA problems
A well-known method for verifying human web users is CAPTCHA. These challenges are presented to dubious requestors; in theory, humans should be able to solve them, while bots cannot.
In reality, CAPTCHA has turned into an arms race. Researchers and threat actors are continually finding automated techniques for solving the challenges. As a result, several successive generations of CAPTCHA have been published, to increase the difficulty in solving them automatically.
Unfortunately, this also means that CAPTCHA and reCAPTCHA puzzles have created an increasingly negative user experience on sites that include them. Efforts like Google’s “No CAPTCHA reCAPTCHAs” have only partially mitigated this, since they frequently revert to full challenges.
A major development in this field was Google’s 2018 release of reCAPTCHA v3. This promised bot detection “without user interaction.” In other words, human users would no longer have to pass any on-screen challenges; all the bot detection would be done programmatically by Google behind the scenes. Unfortunately, reCAPTCHA v3 has two major flaws:
- It doesn’t accomplish its stated purpose, because it can be solved by an automated process 97.4 percent of the time.
- It has potential privacy issues, since it leverages Google’s cookies in the user’s browser. Since Google encourages site admins to place the reCAPTCHA code on all pages within a site, widespread adoption of v3 will allow Google to more thoroughly track the web usage of a large portion of Internet users. This has created some backlash and resistance to using it.
The goal of CAPTCHA is very attractive: to install a single code snippet and achieve robust bot detection and exclusion. But that goal has not yet been achieved.
The increasing irrelevance of IP and geolocation
Traditional methods of tracking attackers are based on IP address. Today, an IDS which relies heavily on IP tracking has limited effectiveness.
The reasons for this are similar to those behind Google’s BeyondCorp initiative and the Zero Trust Network model from John Kindervag that underlies it. Zero Trust is the opposite of the old castle-and-moat approach: rather than treating users inside the perimeter as “good” and those outside as “bad,” it scrutinizes all users, and grants or denies access to resources based on who they are.
Similarly, IP address is no longer a useful way of distinguishing hostile web attackers from legitimate users. There is no longer a useful distinction between “good” IPs and “bad” IPs, because all users (whether hostile or legitimate) can have varying addresses.
For example, a legitimate user might access a web application over a mobile device, and the device’s address and connection type (4G, LTE, etc.) can change multiple times during a session. Or perhaps the person is at an airport (or in a coffee shop, on an airplane, or in a library…) while trying to use a native app. The device is accessing the application API through public WiFi, and sharing a public IP with hundreds or even thousands of other Internet users that day. If an earlier user was hostile, and the API endpoint’s WAF blacklisted the IP, then subsequent legitimate users will not have access.
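The shared-address scenario can be sketched in a few lines. This is an illustrative model only: the function `denied_legitimate_users`, the user names, and the addresses are hypothetical, and the "WAF" here is reduced to a set of blacklisted IPs.

```python
def denied_legitimate_users(requests, hostile_users):
    """Process (user, ip) requests in order; blacklist an IP on its first
    hostile request and return the legitimate users who are later denied."""
    blacklist = set()
    denied = []
    for user, ip in requests:
        if ip in blacklist:
            if user not in hostile_users:
                denied.append(user)
            continue
        if user in hostile_users:
            blacklist.add(ip)
    return denied


SHARED_WIFI_IP = "192.0.2.44"  # hypothetical public IP of an airport hotspot

requests = [
    ("alice", SHARED_WIFI_IP),    # served: the IP is not yet blacklisted
    ("mallory", SHARED_WIFI_IP),  # hostile: the IP is blacklisted here
    ("bob", SHARED_WIFI_IP),      # legitimate, but shares the blacklisted IP
    ("carol", "192.0.2.80"),      # different address, unaffected
]

print(denied_legitimate_users(requests, hostile_users={"mallory"}))
```

Only the user who arrives after the hostile request on the same address is locked out, even though nothing about that user's own behavior was suspicious.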
Meanwhile, attackers are deliberately obscuring their IP-based identities. Threat actors are abusing cellular gateways, infecting IoT devices, distributing compromised browser plugins, and using other techniques to gain remote access to vast numbers of addresses.
Today, malicious users are able to rotate IPs on a per-request basis. Reblaze regularly sees attacks where each request comes in on a unique IP. (In larger incidents, we’ll see 20-25K malicious requests per minute, rotating through IPs from all over the world, in continuous attacks that go on for several weeks. Ultimately, millions of addresses will be used.) In these situations, there’s no single IP that can be quarantined. A specific IP will not be used more than once.
Therefore, IP and geolocation are no longer “facts” associated with attackers. They are not useful foundations for detecting and tracking web threats.
Does this imply that they are completely useless? No; many less-sophisticated attackers do not rotate IPs. Therefore, blacklists and other IP-based security policies can still be used as a low-overhead method to block them. But IP-based methods are only effective against a fraction of hostile Internet traffic today, and the percentage is shrinking rapidly.
The challenge of API protection
Traditional detection methods were originally developed for scrubbing website traffic. However, for many organizations today, mobile/native apps and microservices account for a significant portion of their traffic. Threat actors frequently reverse-engineer the underlying APIs, for example by downloading a mobile client from an app store and sniffing its communications with the endpoint. Hackers then program bots to mimic application behavior and attack vulnerable spots within it (e.g., credit card validation, credential stuffing, brute force logins, gift cards, and coupon codes). Any communication that can be initiated by a legitimate API user can also be abused by automated traffic from a botnet. For a large application, the potential damage can be millions of dollars per month.
Unfortunately, many of the traditional methods for bot detection are not useful for API protection. For example, an API user has no web browser environment that could be verified.
Older bots still make up a significant portion of automated traffic. And as mentioned above, traditional detection methods are still effective against them. Therefore, web security solutions still use these methods as part of their bot mitigation. A significant tranche of bots can be detected and blocked without large computational workloads.
Nevertheless, to detect more sophisticated bots—which are an increasingly large percentage of Internet traffic today, for both web applications and APIs—newer detection methods are needed.
This will be the topic of the next article in this series: Bot Protection in 2019, Part 5.
This article is part 4 of a six-part series. You can download the complete report here: 2019 State of Bot Protection.