
Google, Is That Really You? Verifying Legitimate Bot Traffic

How do you know whether that traffic is really what it claims to be? Studies have shown that 63%-80% of average website traffic is bot traffic. Learn about good and bad bots, and how to verify bot traffic.



One of the most common questions I receive from customers is, "How do we know whether search bot traffic is valid or not?" Great question!

Good Bots vs. Bad Bots

Let's start from square one. A bot is a simple application that runs automated tasks across the Internet. Bots are everywhere; some bots are good and others are bad. According to a 2014 study by Incapsula, traffic to an average website consists of between 63% and 80% bot traffic. Therefore, the ability to identify and analyze bot traffic is critical to understanding and protecting your website.

Search and bot traffic is not always easy to validate, one reason being that it is easy and very common for bad bots to disguise themselves as friendly. This technique is known as user-agent spoofing, in which a client identifies itself to a website as something other than what it really is. User-agent spoofing has common and legitimate use cases, for example, when website developers use a desktop browser to see how a site renders as a mobile website. However, it can also be a means for bad bots to avoid detection.
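As a minimal illustration of how little effort spoofing takes, here is a hedged Python sketch; it assumes the third-party requests library and uses example.com as a placeholder target, neither of which comes from the original article:

import requests

# Any client can claim to be Googlebot simply by setting the User-Agent
# header; nothing in the request itself proves the claim.
spoofed_headers = {
    "User-Agent": "Mozilla/5.0 (compatible; Googlebot/2.1; +http://www.google.com/bot.html)"
}

# example.com is a placeholder, not a real endpoint to test against.
response = requests.get("https://example.com/", headers=spoofed_headers)
print(response.status_code)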

Wolves in Sheep's Clothing: Bad Bots

Hackers and malicious actors leverage user-agent spoofing because many sites, especially eCommerce sites, value the traffic that Google and Bing generate. Search engine traffic directly translates into revenue through referrals; therefore, search engine traffic is often exempted from many of the common firewall rules that protect against bad traffic.

Firewall rules are commonly based on user agents because this is significantly easier to implement than IP-based whitelisting: Google and Bing do not publish hard-coded IP lists and instead expect webmasters to verify addresses individually. Microsoft provides a Bingbot verification tool; Google does not offer such a service.
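To make the weakness concrete, the following is an illustrative Python sketch of a user-agent-only allow rule of the kind described above; the function and token names are hypothetical and not taken from any real firewall product:

# Naive allow rule keyed solely on the client-supplied User-Agent header.
# Because the header is entirely client-controlled, a bad bot that spoofs
# "Googlebot" passes this check just as easily as the real crawler does.
TRUSTED_CRAWLER_TOKENS = ("Googlebot", "bingbot")

def is_trusted_crawler(user_agent: str) -> bool:
    return any(token in user_agent for token in TRUSTED_CRAWLER_TOKENS)

# A spoofed request slips through: prints True.
print(is_trusted_crawler("Mozilla/5.0 (compatible; Googlebot/2.1; +http://www.google.com/bot.html)"))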

How to Verify Valid Bot Traffic

The easiest way to verify that an IP address you see in your logs actually belongs to the bot it claims to be is the nslookup command. Simply open a command-line window (available on most operating systems) and type nslookup followed by the IP. For example:

$ nslookup 66.249.65.17

Non-authoritative answer:
17.65.249.66.in-addr.arpa    name = crawl-66-249-65-17.googlebot.com.

The above is a simple nslookup on a Googlebot IP. You can see that the name ends with the googlebot.com domain. Most malicious actors are not going to go this far, but to be thorough you should next verify that the name resolves back to the IP address you entered above by running a forward nslookup on it:

$ nslookup crawl-66-249-65-17.googlebot.com

Non-authoritative answer:
Name:    crawl-66-249-65-17.googlebot.com
Address: 66.249.65.17

Here you're looking for the IP in the Address entry of the Non-authoritative answer section. If it matches your original IP, you have confirmed that the source of the traffic is valid. Google hostnames will end with googlebot.com, while Bingbot hostnames will end with search.microsoft.com. Further confirmation can be done by performing a whois request on the IP in question. The following is an excerpt from a whois request, with skipped sections noted using an ellipsis (...).

$ whois 66.249.65.17
NetRange:       66.249.64.0 - 66.249.95.255
NetName:        GOOGLE
NetHandle:      NET-66-249-64-0-1
Parent:         NET66 (NET-66-0-0-0-0)
NetType:        Direct Allocation
Organization:   Google Inc. (GOGL)
RegDate:        2004-03-05
Updated:        2012-02-24
Ref:            http://whois.arin.net/rest/net/NET-66-249-64-0-1

OrgName:        Google Inc.
OrgId:          GOGL
Address:        1600 Amphitheatre Parkway
City:           Mountain View
StateProv:      CA
PostalCode:     94043
Country:        US
RegDate:        2000-03-30
Updated:        2015-11-06
Ref:            http://whois.arin.net/rest/org/GOGL

We can see that both the nslookup and the whois show that this address belongs to Google and is used by Googlebot, so this is a valid request. You can also see all of the other IPs that fall within the valid range by looking at the NetRange entry in the whois output. These are the current IPs, but they cannot always be expected to remain accurate, which is why Google and Microsoft both request that you not hard-code any of their IP ranges into a whitelist.
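If you would rather script this check than run it by hand, here is a minimal Python sketch of the same reverse-then-forward DNS verification using only the standard library; the hostname suffixes are the ones mentioned above, and the helper name is made up for illustration:

import socket

# Suffixes taken from the examples above; consult each search engine's
# current documentation before relying on them.
VALID_SUFFIXES = (".googlebot.com", ".search.microsoft.com")

def is_verified_search_bot(ip: str) -> bool:
    """Reverse-resolve the IP, check the hostname suffix, then confirm
    that the hostname resolves back to the original IP."""
    try:
        hostname, _, _ = socket.gethostbyaddr(ip)             # reverse (PTR) lookup
    except socket.herror:
        return False                                          # no PTR record at all
    if not hostname.endswith(VALID_SUFFIXES):                 # e.g. crawl-66-249-65-17.googlebot.com
        return False
    try:
        _, _, forward_ips = socket.gethostbyname_ex(hostname) # forward (A) lookup
    except socket.gaierror:
        return False
    return ip in forward_ips                                  # forward and reverse must agree

# Example using the Googlebot IP from the nslookup output above.
print(is_verified_search_bot("66.249.65.17"))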

What if you are too intimidated by the console, your operating system doesn't have these tools, or you're already in the browser? Not to fear: there are browser-based sites that provide the same functionality for checking IPs. A good resource for whois requests is AbuseIPDB.com, and its website tool is free to use. A tool for reverse DNS lookups can be found at MXToolbox.com, which also provides a host of other tools. Using these tools is generally simple: enter the IP address in question and gather the same information as above to determine whether the request is valid. The process is the same as with the console-based tools. Verify the information and you will know whether you're looking at legitimate traffic.

This article was written by Philip Truax



Published at DZone with permission of Alex Pinto, DZone MVB. See the original article here.

