Server Log Analysis: It's More Important Than Google Analytics
This article explains why analyzing server logs matters. The author also demonstrates a server log dashboard built with the open source ELK Stack of Elasticsearch, Logstash, and Kibana.
This little script changed the world in November 2005:
Why? As online marketing was becoming more and more popular with the increasing use of websites and online communities (as well as social media networks in the not-too-distant future), people needed to understand what was going on.
Google had already released the advertising platform AdWords in 2000 and had surpassed Yahoo! in terms of search engine market share. Now, Google needed a way to tell people the exact sources of traffic to their websites as well as what the visitors were doing. Enter Google Analytics. (Of course, Google offered this data in exchange for being able to collect certain information for the company’s own use.)
Today, Google Analytics offers many features, including e-commerce transaction information, campaign-specific UTM parameters, and behavioral demographics. But in the end, Google Analytics still depends on that little script executing in the visitor's browser every time that a page is loaded.
Rates of Analytics Script Blocking
Jason Packer, the principal consultant at Quantable, analyzed the issue in December 2015 and again in June 2016 and found the following:
Just to summarize the data: More than 70 million people are estimated to be using at least one advertising blocker that either blocks Google Analytics by default or can be easily configured to do so. (Plus, there are many other blockers in addition to these, the rates of ad block usage can be far higher for certain demographics, and Packer notes that many browsers have a “Do Not Track” HTTP request header as well.)
And it’s not only Google Analytics. Take a look at the advertising and analytics scripts that are blocked by Ghostery when I visit Boston.com:
In short, the use of such platforms will become less and less effective as more and more people use ad and script blockers. (As I like to say, people tolerate offline advertising but hate online advertising.) Plus, these analytics packages were never enough to monitor complete server activity in the first place.
The solution is to forget about the front end and go to the back end. The server log contains the only data that is one-hundred percent accurate in terms of how people — and even bots such as Googlebot — are using your website and accessing your server.
Server Logs, Not Analytics Scripts
For our own purposes and to make sure that we have the correct analytics for our own website and server, we created a server log dashboard using the open source ELK Stack of Elasticsearch, Logstash, and Kibana:
For those who want to use the open source ELK Stack to monitor their server logs, we have created resources on Apache, IIS, and NGINX log analysis with ELK in addition to a guide to AWS log analysis. These dashboards can track website and server activity that Google Analytics misses. In addition, we have free, pre-made dashboards for server log analysis in our ELK Apps library.
Every time that a web browser requests something from a server, a log entry is created. Here’s what one generally looks like:
A quick reference:
- White — the IP address in question
- Blue — the timestamp of the specific log
- Green — the access method (usually to “get” something such as an image file or to “post” something such as a blog post comment)
- Red — the uniform resource identifier (URI) that the browser is requesting (this is usually a URL)
- Orange — the HTTP status code (also called a response code)
- White — the size of the file returned (in this case, it is zero because no number is shown)
- Purple — the browser and the user-agent that is making the request (in this instance, it is Googlebot)
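As a rough illustration of how the fields listed above can be pulled out of a log entry, here is a minimal Python sketch. It assumes the widely used Apache/NGINX "combined" log format. The sample line, the regular expression, and the function name parse_log_line are illustrative assumptions rather than the output or configuration of any particular server.

```python
import re
from datetime import datetime

# Pattern for the "combined" access log format (an assumption; adjust it
# if your server uses a custom log format).
LOG_PATTERN = re.compile(
    r'(?P<ip>\S+) \S+ \S+ '
    r'\[(?P<timestamp>[^\]]+)\] '
    r'"(?P<method>\S+) (?P<uri>\S+) (?P<protocol>[^"]+)" '
    r'(?P<status>\d{3}) (?P<size>\d+|-) '
    r'"(?P<referrer>[^"]*)" "(?P<user_agent>[^"]*)"'
)


def parse_log_line(line):
    """Return a dict of fields for one log line, or None if it does not match."""
    match = LOG_PATTERN.match(line)
    if match is None:
        return None
    entry = match.groupdict()
    entry["timestamp"] = datetime.strptime(
        entry["timestamp"], "%d/%b/%Y:%H:%M:%S %z"
    )
    entry["status"] = int(entry["status"])
    # A "-" in the size field means that no bytes were returned.
    entry["size"] = 0 if entry["size"] == "-" else int(entry["size"])
    return entry


# Hypothetical log line, made up for this example.
SAMPLE_LINE = (
    '66.249.66.1 - - [12/Jul/2016:08:15:42 +0000] '
    '"GET /blog/post.html HTTP/1.1" 304 - '
    '"-" "Mozilla/5.0 (compatible; Googlebot/2.1; +http://www.google.com/bot.html)"'
)

entry = parse_log_line(SAMPLE_LINE)
print(entry["ip"], entry["method"], entry["uri"], entry["status"], entry["size"])
# prints: 66.249.66.1 GET /blog/post.html 304 0
```

Returning None for lines that do not match keeps a malformed entry from stopping the analysis of a whole log file.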
If your website receives 50,000 visitors a day and each person or bot views an average of ten pages, then your server will generate 500,000 log entries within a single log file every single day.
Regardless of how you choose to monitor all of this server log data, here is a list of some of the items to check in terms of server and website performance. As you’ll see, a lot of the information is what Google Analytics purports to show — but the server log data will be a lot more complete and accurate.
Server Statistics to Monitor
- Visits, unique visitors, and visit duration
- Times and dates of visits
- Last crawl dates
- Visitor locations
- The operating systems and browsers used
- Response codes
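To give a sense of how a few of these statistics might be derived from parsed log entries, here is a minimal Python sketch. The entries list is hypothetical sample data, and the function name summarize is an assumption made for this example. Counting distinct IP addresses only approximates unique visitors, and visit duration would need additional session logic that is not shown here.

```python
from collections import Counter
from datetime import datetime

# Hypothetical, already-parsed log entries (made up for this example).
entries = [
    {"ip": "203.0.113.5", "timestamp": datetime(2016, 7, 12, 8, 15), "status": 200},
    {"ip": "203.0.113.5", "timestamp": datetime(2016, 7, 12, 8, 47), "status": 404},
    {"ip": "198.51.100.7", "timestamp": datetime(2016, 7, 12, 9, 2), "status": 200},
    {"ip": "66.249.66.1", "timestamp": datetime(2016, 7, 12, 9, 30), "status": 301},
]


def summarize(entries):
    """Compute a few basic server statistics from parsed log entries."""
    return {
        "requests": len(entries),
        # Distinct IPs are only a rough proxy for unique visitors.
        "unique_ips": len({e["ip"] for e in entries}),
        "status_codes": Counter(e["status"] for e in entries),
        "requests_per_hour": Counter(
            e["timestamp"].strftime("%Y-%m-%d %H:00") for e in entries
        ),
    }


summary = summarize(entries)
print(summary["requests"], summary["unique_ips"])
print(dict(summary["status_codes"]))
print(dict(summary["requests_per_hour"]))
```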
Business Intelligence Issues to Check
Advanced server monitoring can provide insights for business intelligence purposes:
- Search-engine bot traffic. If the number of times that Googlebot (or any other search-engine bot) visits your site suddenly drops, that could result in fewer indexed pages in search results and then in fewer visitors from organic search. Check your robots.txt file and your meta-robots tags to see if you are inadvertently blocking these bots.
- Crawl priorities. Which pages and directories of your website get the most and least attention from search-engine bots and human visitors? Does that match your business priorities? You can influence bot-crawl priorities in your XML sitemap and human navigation through your internal linking structure. You can move pages and directories that you want crawled more often closer to the home page, and you can have more internal links going there from the home page.
- Last crawl date. If a recently published or updated page is not appearing in Google search results, check for when Google last visited that URL in the server logs. If it has been a long time, try submitting that URL directly in Google Search Console.
- Crawl budget waste. Google allocates a crawl budget to every website. If Googlebot hits that limit before crawling new or updated pages, it will leave the site without knowing about them. The use of URL parameters often results in crawl budget waste because Google crawls the same page from multiple URLs. There are two solutions: block Google in the robots.txt file from crawling all URLs with any defined tracking parameters and use the URL Parameters tool in Google Search Console to do the same thing.
- Monitoring Salesforce. Although Salesforce data is separate from server log data, Salesforce information can be analyzed to monitor cross-team sales metrics for BI uses.
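Several of the crawl-related checks above can be sketched in a few lines of Python. The example below works on hypothetical, already-parsed log entries. It counts Googlebot requests per day, finds the last crawl time for each URL, and groups parameterized URLs by path to hint at crawl budget waste. The function names and sample data are assumptions for illustration. Matching on the user-agent string alone is also a simplification, because user agents can be spoofed.

```python
from collections import Counter, defaultdict
from datetime import datetime
from urllib.parse import urlsplit

GOOGLEBOT_UA = "Mozilla/5.0 (compatible; Googlebot/2.1; +http://www.google.com/bot.html)"
BROWSER_UA = "Mozilla/5.0 (Windows NT 10.0; rv:47.0) Gecko/20100101 Firefox/47.0"

# Hypothetical, already-parsed log entries (made up for this example).
entries = [
    {"timestamp": datetime(2016, 7, 10, 6, 0), "uri": "/pricing", "user_agent": GOOGLEBOT_UA},
    {"timestamp": datetime(2016, 7, 10, 6, 5), "uri": "/blog/post?utm_source=a", "user_agent": GOOGLEBOT_UA},
    {"timestamp": datetime(2016, 7, 11, 7, 0), "uri": "/blog/post?utm_source=b", "user_agent": GOOGLEBOT_UA},
    {"timestamp": datetime(2016, 7, 11, 7, 30), "uri": "/blog/post", "user_agent": GOOGLEBOT_UA},
    {"timestamp": datetime(2016, 7, 11, 9, 0), "uri": "/pricing", "user_agent": BROWSER_UA},
]


def is_googlebot(entry):
    # Simplification: trusts the user-agent string, which can be spoofed.
    return "Googlebot" in entry["user_agent"]


def bot_hits_per_day(entries):
    """Googlebot requests per calendar day; a sudden drop deserves a look."""
    return Counter(e["timestamp"].date() for e in entries if is_googlebot(e))


def last_crawl(entries):
    """Most recent Googlebot request time for each requested URI."""
    latest = {}
    for e in entries:
        if is_googlebot(e):
            uri = e["uri"]
            if uri not in latest or e["timestamp"] > latest[uri]:
                latest[uri] = e["timestamp"]
    return latest


def parameter_variants(entries):
    """Paths that Googlebot crawled under more than one distinct URI."""
    by_path = defaultdict(set)
    for e in entries:
        if is_googlebot(e):
            by_path[urlsplit(e["uri"]).path].add(e["uri"])
    return {path: uris for path, uris in by_path.items() if len(uris) > 1}


hits = bot_hits_per_day(entries)
latest = last_crawl(entries)
variants = parameter_variants(entries)
print(dict(hits))
print(latest["/pricing"])
print(variants)
```

In this sample, /blog/post is crawled under three different URIs, which is the kind of pattern that the crawl budget item above describes.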
The point to remember: If you rely only on Google Analytics or any other platform that relies on front-end scripts, then you are seeing only part of the story as a result of the increasing use of script blockers. To analyze your complete server and website activity properly, you need to monitor your server log files directly.
Published at DZone with permission of Samuel Scott, DZone MVB. See the original article here.