Getting Unique Counts from a Log File

Two colleagues of mine ask very similar interview questions. The question is not particularly hard, nor does it require a lot of thought to solve, but it's something that, as a developer or an ops person, you might find yourself needing to do. The question is: given a log file in a particular format, tell me how many times something occurs in that log file. For example, tell me the number of unique IP addresses in an access log and the number of times each IP visited the system.

It's amazing how many people don't know what to do with this. One of my peers asks candidates to do this using the command line; the other tells candidates they can do it any way they want. I like this question because it's VERY practical: I do tasks like this every day, and I expect the people I work with to be able to do them too.

A More Concrete Example

I like the shell solution because it's basically a one-liner. So let's walk through it using access logs as an example.

Here is a very basic sample of a common access_log I threw together for this:

- - [10/Oct/2000:13:55:36 -0700] "GET /apache_pb.gif HTTP/1.0" 200 2326
- - [10/Oct/2000:13:55:41 -0700] "GET /missing.html HTTP/1.0" 404 506
- - [10/Oct/2000:13:55:48 -0700] "GET /missing.html HTTP/1.0" 404 506
- - [10/Oct/2000:13:56:42 -0700] "GET /missing.html HTTP/1.0" 404 506
- - [10/Oct/2000:13:57:05 -0700] "GET /missing.html HTTP/1.0" 404 506
- - [10/Oct/2000:13:58:36 -0700] "GET /missing2.html HTTP/1.0" 404 506
- - [10/Oct/2000:13:59:28 -0700] "GET /exitst.html HTTP/1.0" 200 1506
- - [10/Oct/2000:14:15:20 -0700] "GET /exitst.html HTTP/1.0" 200 1506
- - [10/Oct/2000:14:16:32 -0700] "GET /missing3.html HTTP/1.0" 404 506
- - [10/Oct/2000:14:20:54 -0700] "GET /exitst.html HTTP/1.0" 200 1506
- - [10/Oct/2000:13:22:42 -0700] "GET /exitst.html HTTP/1.0" 200 1506

Let's say you want to count how many times each unique IP address has visited this system. Using nothing more than awk, sort, and uniq, you can find the answer. Pull the first field with awk, pipe that through sort so identical values end up adjacent, and then pipe it through uniq -c, which collapses each run of identical lines into one line prefixed with its count. This isn't fancy, but it returns the result very quickly without a whole lot of fuss.

Like so:

~/Projects/access_logs$ awk '{print $1}' < access_logs | sort | uniq -c

This gives you each hostname or IP, preceded by the number of times it has contacted this server.
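To try the pipeline without a real log, here is a minimal sketch that writes a few entries to a temporary file and counts hits per client. The client addresses (taken from the 192.0.2.0/24 documentation range), the count_hits function name, and the temporary file are assumptions made for illustration. The trailing sort -rn is an addition that puts the busiest client first.

```shell
#!/bin/sh
# Hypothetical sample data: the client addresses below are made up for this sketch.
log=$(mktemp)
trap 'rm -f "$log"' EXIT
cat > "$log" <<'EOF'
192.0.2.10 - - [10/Oct/2000:13:55:36 -0700] "GET /apache_pb.gif HTTP/1.0" 200 2326
192.0.2.11 - - [10/Oct/2000:13:55:41 -0700] "GET /missing.html HTTP/1.0" 404 506
192.0.2.10 - - [10/Oct/2000:13:55:48 -0700] "GET /missing.html HTTP/1.0" 404 506
192.0.2.12 - - [10/Oct/2000:13:56:42 -0700] "GET /missing.html HTTP/1.0" 404 506
192.0.2.10 - - [10/Oct/2000:13:59:28 -0700] "GET /exitst.html HTTP/1.0" 200 1506
EOF

# count_hits [FILE...]: print "count client" pairs, busiest client first.
# With no FILE argument, awk reads standard input instead.
count_hits() {
    awk '{print $1}' "$@" | sort | uniq -c | sort -rn
}

count_hits "$log"
```

With this sample, the first output line reports 3 requests from 192.0.2.10; the other two clients have one request each.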

Upping the Complexity

Now for something more complex: let's say you want to find the most commonly requested document that returns a 404. Again, we can do this in a shell one-liner. We still need awk, sort, and uniq, but this time we'll also use tail. We use awk to examine the status field (field 9) and print the URL field (field 7) if the status is 404. We then use sort and uniq -c to count each URL, and sort -n to order the counts from lowest to highest. Finally, we use tail to print only the last line, which has the highest count, and awk to print just the requested document.

So here is what this looks like:

~/Projects/access_logs$ awk '$9 == "404" {print $7}' access_logs | sort | uniq -c | sort -n | tail -n 1 | awk '{print $2}'
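The same pipeline can be checked against a small self-contained sample. This is a sketch under assumptions: the client addresses, the top_404 function name, and the temporary file are hypothetical and exist only for illustration. Note that /exitst.html is requested most often overall, but it is ignored because its status is 200.

```shell
#!/bin/sh
# Hypothetical sample data: the client addresses below are made up for this sketch.
log=$(mktemp)
trap 'rm -f "$log"' EXIT
cat > "$log" <<'EOF'
192.0.2.10 - - [10/Oct/2000:13:55:36 -0700] "GET /apache_pb.gif HTTP/1.0" 200 2326
192.0.2.11 - - [10/Oct/2000:13:55:41 -0700] "GET /missing.html HTTP/1.0" 404 506
192.0.2.10 - - [10/Oct/2000:13:55:48 -0700] "GET /missing.html HTTP/1.0" 404 506
192.0.2.12 - - [10/Oct/2000:13:58:36 -0700] "GET /missing2.html HTTP/1.0" 404 506
192.0.2.10 - - [10/Oct/2000:13:59:28 -0700] "GET /exitst.html HTTP/1.0" 200 1506
192.0.2.11 - - [10/Oct/2000:14:15:20 -0700] "GET /exitst.html HTTP/1.0" 200 1506
192.0.2.12 - - [10/Oct/2000:14:20:54 -0700] "GET /exitst.html HTTP/1.0" 200 1506
EOF

# top_404 [FILE...]: print the most frequently requested URL that returned 404.
# Keep only lines whose status (field 9) is 404, count each URL (field 7),
# sort the counts ascending, and take the URL from the last (largest) line.
top_404() {
    awk '$9 == "404" {print $7}' "$@" |
        sort | uniq -c | sort -n | tail -n 1 | awk '{print $2}'
}

top_404 "$log"   # prints /missing.html
```

If two documents tie for the highest count, tail -n 1 simply reports whichever one sort -n happens to place last.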

Of course, there are many other ways to do this. This one is simple, and the best part is that you can count on these tools being on almost every *nix system.
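As one example of another way, the per-client count can be done in a single awk pass with an associative array instead of sort and uniq. This is only a sketch: the count_hits_awk name and the three input lines (with made-up client addresses) are assumptions for illustration.

```shell
#!/bin/sh
# count_hits_awk [FILE...]: tally requests per client in a single awk pass.
# The order of awk's output is unspecified because array keys are unordered,
# so the caller sorts the result if order matters.
count_hits_awk() {
    awk '{hits[$1]++} END {for (ip in hits) print hits[ip], ip}' "$@"
}

# Hypothetical input: three made-up entries fed on standard input.
printf '%s\n' \
    '192.0.2.10 - - [10/Oct/2000:13:55:36 -0700] "GET /apache_pb.gif HTTP/1.0" 200 2326' \
    '192.0.2.11 - - [10/Oct/2000:13:55:41 -0700] "GET /missing.html HTTP/1.0" 404 506' \
    '192.0.2.10 - - [10/Oct/2000:13:55:48 -0700] "GET /missing.html HTTP/1.0" 404 506' |
    count_hits_awk | sort -rn
```

This avoids sorting the whole log before counting, at the cost of holding one counter per distinct client in memory.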

Published at DZone with permission of Geoffrey Papilion , DZone MVB .

Opinions expressed by DZone contributors are their own.
