One of the most common questions I receive from customers is, "How do we know whether search bot traffic is valid or not?" Great question!
Good Bots vs. Bad Bots
Let's start from square one. A bot is a simple application that runs automated tasks across the Internet. Bots are everywhere; some bots are good and others are bad. According to a 2014 study by Incapsula, traffic to an average website consists of between 63% and 80% bot traffic. Therefore, the ability to identify and analyze bot traffic is critical to understanding and protecting your website.
Search bot traffic is not always easy to validate. One reason is that it is easy, and very common, for bad bots to disguise themselves as friendly ones. This is known as user agent spoofing: a client identifies itself to a website as something other than what it is. User agent spoofing has common and legitimate uses, for example when website developers set a desktop browser to present itself as a mobile device to see how a site renders on mobile. However, it can also be a means for bad bots to avoid detection.
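Because the User-Agent header is just a string chosen by the client, spoofing takes almost no effort. The Python sketch below is illustrative only: the function name claimed_bot is hypothetical, the header value is a user agent string of the kind Googlebot publishes, and no request is actually sent.

```python
from urllib.request import Request

# A user agent string of the kind Googlebot sends; any client can copy it.
GOOGLEBOT_UA = "Mozilla/5.0 (compatible; Googlebot/2.1; +http://www.google.com/bot.html)"


def claimed_bot(user_agent):
    """Return the search bot a User-Agent string claims to be, or None.

    This inspects only the client-supplied header, so it proves nothing
    about who actually sent the request.
    """
    ua = user_agent.lower()
    if "googlebot" in ua:
        return "googlebot"
    if "bingbot" in ua:
        return "bingbot"
    return None


# Building a request with a spoofed header takes one line; nothing is sent here.
spoofed = Request("http://example.com/", headers={"User-Agent": GOOGLEBOT_UA})

# urllib stores header names capitalized as "User-agent".
print(claimed_bot(spoofed.get_header("User-agent")))  # prints: googlebot
```

A rule that exempts traffic based on this check alone would wave the spoofed request through, which is why the claim needs to be confirmed against the source IP.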
Wolves in Sheep's Clothing: Bad Bots
Hackers and malicious actors leverage user agent spoofing because many sites, especially eCommerce sites, value the traffic that Google and Bing generate. Search engine traffic directly translates into revenue through referrals; therefore, it is often exempted from many of the common firewall rules that protect against bad traffic.
Firewall rules are commonly based on user agents because this is significantly easier to implement than IP-based whitelisting: Google and Bing do not use hard-coded lists and expect webmasters to verify IP addresses individually. Microsoft provides a Bingbot verification tool. Google does not offer such a service.
- Valid user agents from Google search bots are listed here: Googlebot User-Agents
- Valid user agents for Microsoft bots are listed here: Bingbot User-Agents.
How to Verify Valid Bot Traffic
To verify that an IP address you see in your logs really belongs to a search engine, the easiest tool is the nslookup command. Open a command-line window (nslookup is available on most operating systems) and type nslookup followed by the IP. For example:
    $ nslookup 126.96.36.199
    Server: 188.8.131.52
    Address: 184.108.40.206#53

    Non-authoritative answer:
    220.127.116.11.in-addr.arpa name = crawl-66-249-65-17.googlebot.com.
The above is a simple nslookup on a Googlebot IP. You can see that the returned name ends in googlebot.com. Most malicious actors are not going to go so far as to fake this record, but the check is not complete yet: you should next run nslookup on the returned name and verify that it resolves back to the IP address you entered above.
    $ nslookup crawl-66-249-65-17.googlebot.com
    Server: 18.104.22.168
    Address: 22.214.171.124#53

    Non-authoritative answer:
    Name: crawl-66-249-65-17.googlebot.com
    Address: 126.96.36.199
In this output, look for the Address entry in the Non-authoritative answer section. If it matches your original IP, the source of the request is confirmed. Names for Googlebot end with googlebot.com, while names for Bingbot end with search.msn.com. Further confirmation can be done by performing a whois request on the IP in question. The following is an excerpt from a whois request, with skipped sections marked by an ellipsis (...).
    $ whois 188.8.131.52
    ...
    NetRange: 184.108.40.206 - 220.127.116.11
    CIDR: 18.104.22.168/19
    NetName: GOOGLE
    NetHandle: NET-66-249-64-0-1
    Parent: NET66 (NET-66-0-0-0-0)
    NetType: Direct Allocation
    OriginAS:
    Organization: Google Inc. (GOGL)
    RegDate: 2004-03-05
    Updated: 2012-02-24
    Ref: http://whois.arin.net/rest/net/NET-66-249-64-0-1

    OrgName: Google Inc.
    OrgId: GOGL
    Address: 1600 Amphitheatre Parkway
    City: Mountain View
    StateProv: CA
    PostalCode: 94043
    Country: US
    RegDate: 2000-03-30
    Updated: 2015-11-06
    Ref: http://whois.arin.net/rest/org/GOGL
    ...
Both the nslookup and the whois results show that this IP belongs to Google and that the host is a Googlebot crawler, so this is a valid request. You can also see the other IPs in the same allocation by looking at the NetRange entry in the whois output. This range reflects the current allocation only and can change, which is why Google and Microsoft both ask that you not hard-code any of their IP ranges into a whitelist.
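To make the relationship between the NetRange and CIDR entries concrete, here is a small Python sketch using the standard ipaddress module. The function names are hypothetical, and the addresses in the usage lines are reserved documentation addresses standing in for values read from a fresh whois response. It is meant for a one-off check, not as a way to build a stored whitelist.

```python
import ipaddress


def range_to_cidrs(first, last):
    """Collapse a whois NetRange (first and last address) into CIDR blocks."""
    networks = ipaddress.summarize_address_range(
        ipaddress.ip_address(first), ipaddress.ip_address(last)
    )
    return [str(net) for net in networks]


def ip_in_range(ip, first, last):
    """True if ip lies inside the inclusive range first..last."""
    low = ipaddress.ip_address(first)
    high = ipaddress.ip_address(last)
    return low <= ipaddress.ip_address(ip) <= high


# Placeholder values; substitute the NetRange from the whois output you just ran.
print(range_to_cidrs("198.51.100.0", "198.51.100.255"))  # ['198.51.100.0/24']
print(ip_in_range("198.51.100.42", "198.51.100.0", "198.51.100.255"))  # True
```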
What if you are intimidated by the console, your operating system doesn't have these tools, or you're already in the browser? Not to fear: there are browser-based sites that provide the same functionality for checking IPs. A good resource for whois requests is AbuseIPDB.com, and its website tool is free to use. A tool for reverse DNS lookups can be found at MXToolbox.com, which also provides a host of other tools. Using these tools is generally simple, and the process is the same as with the console-based tools: enter the IP address in question and gather the same information as above. Once the information checks out, you will know that you're looking at legitimate traffic.
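When there are too many addresses to check by hand, whether in the console or in the browser, the same reverse and forward lookups can be scripted. The following Python sketch is one possible approach rather than an official method: the names in_domain and verify_crawler_ip and the CRAWLER_DOMAINS tuple are assumptions made for illustration, and the default resolver calls handle IPv4 only.

```python
import socket

# Assumed crawler domains; confirm them against each search engine's documentation.
CRAWLER_DOMAINS = ("googlebot.com", "search.msn.com")


def in_domain(hostname, domain):
    """True if hostname equals domain or is a subdomain of it.

    Matching on the dot boundary rejects names such as fakegooglebot.com
    or googlebot.com.evil.example.
    """
    host = hostname.rstrip(".").lower()
    return host == domain or host.endswith("." + domain)


def verify_crawler_ip(ip, domains=CRAWLER_DOMAINS, reverse=None, forward=None):
    """Return the verified hostname for ip, or None if any check fails.

    reverse maps an IP to a hostname and forward maps a hostname to a list
    of IPs. Both default to the system resolver and can be replaced with
    stubs for offline testing.
    """
    reverse = reverse or (lambda addr: socket.gethostbyaddr(addr)[0])
    forward = forward or (lambda name: socket.gethostbyname_ex(name)[2])
    try:
        name = reverse(ip)  # step 1: reverse lookup (IP to name)
        if not any(in_domain(name, d) for d in domains):
            return None  # name is not under a known crawler domain
        # step 2: forward lookup must return the IP we started with
        return name if ip in forward(name) else None
    except OSError:  # covers socket.herror and socket.gaierror
        return None


# Example call (performs real DNS queries, so it is left commented out):
# verify_crawler_ip("<IP address from your logs>")
```

Requiring both steps matters because whoever controls an IP block also controls its reverse DNS record; only the forward lookup is answered by the search engine's own domain.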
This article was written by Philip Truax