
Getting Unique Counts from a Log File


Two colleagues of mine ask a very similar question in interviews. The question is not particularly hard, nor does it require a lot of thought to solve, but it's something that you might find yourself needing to do as a developer or an ops person. The question is: given a log file in a particular format, tell me how many times something occurs in that log file. For example, tell me the number of unique IP addresses in an access log, and the number of times each IP has visited this system.

It's amazing how many people don't know what to do with this. One of my peers asks people to do this using the command line; the other tells candidates they can do it any way they want. I like this question because it's VERY practical; I do tasks like this every day, and I expect the people I work with to be able to do them too.

A More Concrete Example

I like the shell solution because it's basically a one-liner. So let's walk through it using access logs as an example.

Here is a very basic sample of a common access_log I threw together for this:

- - [10/Oct/2000:13:55:36 -0700] "GET /apache_pb.gif HTTP/1.0" 200 2326
- - [10/Oct/2000:13:55:41 -0700] "GET /missing.html HTTP/1.0" 404 506
- - [10/Oct/2000:13:55:48 -0700] "GET /missing.html HTTP/1.0" 404 506
- - [10/Oct/2000:13:56:42 -0700] "GET /missing.html HTTP/1.0" 404 506
- - [10/Oct/2000:13:57:05 -0700] "GET /missing.html HTTP/1.0" 404 506
- - [10/Oct/2000:13:58:36 -0700] "GET /missing2.html HTTP/1.0" 404 506
- - [10/Oct/2000:13:59:28 -0700] "GET /exitst.html HTTP/1.0" 200 1506
- - [10/Oct/2000:14:15:20 -0700] "GET /exitst.html HTTP/1.0" 200 1506
- - [10/Oct/2000:14:16:32 -0700] "GET /missing3.html HTTP/1.0" 404 506
- - [10/Oct/2000:14:20:54 -0700] "GET /exitst.html HTTP/1.0" 200 1506
- - [10/Oct/2000:13:22:42 -0700] "GET /exitst.html HTTP/1.0" 200 1506

Let's say you want to count how many times each unique IP address has visited this system. Using nothing more than awk, sort, and uniq, you can find the answer. What you'll want to do is pull the first field with awk, pipe that through sort so that identical values end up on adjacent lines, and then pipe it through uniq -c, which collapses each run of identical lines and prefixes it with a count. This isn't fancy, but it returns the result very quickly without a whole lot of fuss.

Like so:

~/Projects/access_logs$ awk '{print $1}' < access_logs | sort | uniq -c

This gives you each hostname or IP, and the number of times they’ve contacted this server.
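The pipeline above can be tried on a tiny sample. This is only a sketch: the host values are hypothetical placeholders from the 192.0.2.0/24 documentation range, and `sample_log`, `count_hosts`, and `count_unique_hosts` are illustrative names. It adds a final `sort -rn` so the busiest host comes first, and a second helper that reports how many distinct hosts appear at all.

```shell
#!/bin/sh
# Hypothetical sample data: the hosts are made-up placeholder addresses.
sample_log() {
    cat <<'EOF'
192.0.2.10 - - [10/Oct/2000:13:55:36 -0700] "GET /apache_pb.gif HTTP/1.0" 200 2326
192.0.2.11 - - [10/Oct/2000:13:55:41 -0700] "GET /missing.html HTTP/1.0" 404 506
192.0.2.10 - - [10/Oct/2000:13:55:48 -0700] "GET /missing.html HTTP/1.0" 404 506
192.0.2.12 - - [10/Oct/2000:13:56:42 -0700] "GET /missing.html HTTP/1.0" 404 506
192.0.2.10 - - [10/Oct/2000:13:57:05 -0700] "GET /missing.html HTTP/1.0" 404 506
EOF
}

# Requests per host, busiest host first. Reads a log on stdin.
count_hosts() {
    awk '{print $1}' | sort | uniq -c | sort -rn
}

# Number of distinct hosts. Reads a log on stdin.
count_unique_hosts() {
    awk '{print $1}' | sort -u | wc -l
}

sample_log | count_hosts
sample_log | count_unique_hosts
```

Reading from stdin keeps the helpers usable with any source, for example `count_hosts < access_logs`.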

Upping the Complexity

Now for something more complex: let's say you want to get the most commonly requested document that returns a 404. Again, we can do this all in a shell one-liner. We still need awk, sort, and uniq, but this time we'll also use tail. First, awk examines the status field ($9) and prints the URL field ($7) if the status returned was 404. Then sort and uniq -c group and count the URLs, and sort -n orders them by count, ascending, so the most requested document ends up on the last line. Finally, we'll use tail to print only that last line, and a second awk to print just the requested document.

So here is what this looks like:

~/Projects/access_logs$ awk '{if($9=="404"){print $7}}' access_logs | sort | uniq -c | sort -n | tail -1 | awk '{print $2}'
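To check that the field numbers line up, the same idea can be run against a small sample. This is a minimal sketch, assuming the host is present as the first field of each line: the host values are hypothetical placeholders from the 192.0.2.0/24 documentation range, and `sample_log` and `top_404` are illustrative names. It swaps `sort -n | tail` for `sort -rn | head` so the count stays visible and more than one result can be requested.

```shell
#!/bin/sh
# Hypothetical sample data: the hosts are made-up placeholder addresses,
# included so that the URL lands in field 7 and the status in field 9.
sample_log() {
    cat <<'EOF'
192.0.2.10 - - [10/Oct/2000:13:55:41 -0700] "GET /missing.html HTTP/1.0" 404 506
192.0.2.11 - - [10/Oct/2000:13:55:48 -0700] "GET /missing.html HTTP/1.0" 404 506
192.0.2.10 - - [10/Oct/2000:13:58:36 -0700] "GET /missing2.html HTTP/1.0" 404 506
192.0.2.12 - - [10/Oct/2000:13:59:28 -0700] "GET /exitst.html HTTP/1.0" 200 1506
192.0.2.12 - - [10/Oct/2000:14:15:20 -0700] "GET /exitst.html HTTP/1.0" 200 1506
192.0.2.12 - - [10/Oct/2000:14:20:54 -0700] "GET /exitst.html HTTP/1.0" 200 1506
EOF
}

# Print the N most requested documents that returned a 404, with counts.
# Reads a log on stdin; $1 is N and defaults to 1.
top_404() {
    awk '$9 == "404" {print $7}' | sort | uniq -c | sort -rn | head -n "${1:-1}"
}

sample_log | top_404 2
```

In this sample /exitst.html is the most requested document overall, but it never returns a 404, so the status filter keeps it out of the result.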

Of course, there are many other ways to do this. This is simply a very easy one, and the best part is that you can count on these tools being on almost every *nix system.
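As one example of another way, the counting can be done inside awk with an associative array, which avoids the sort and uniq stages. This is a sketch rather than a recommendation: `most_common_404` is a hypothetical name, the hosts in the sample are placeholders, and the field numbers again assume the host is the first field.

```shell
#!/bin/sh
# Tally 404s per URL inside awk, then print the URL with the highest count.
# Reads a log on stdin. On a tie, the winner depends on awk's array
# traversal order, which is unspecified.
most_common_404() {
    awk '
        $9 == "404" { hits[$7]++ }
        END {
            for (url in hits)
                if (hits[url] > max) { max = hits[url]; best = url }
            if (best != "") print best
        }
    '
}

# Hypothetical sample data: the hosts are made-up placeholder addresses.
most_common_404 <<'EOF'
192.0.2.10 - - [10/Oct/2000:13:55:41 -0700] "GET /missing.html HTTP/1.0" 404 506
192.0.2.11 - - [10/Oct/2000:13:55:48 -0700] "GET /missing.html HTTP/1.0" 404 506
192.0.2.10 - - [10/Oct/2000:13:58:36 -0700] "GET /missing2.html HTTP/1.0" 404 506
192.0.2.12 - - [10/Oct/2000:13:59:28 -0700] "GET /exitst.html HTTP/1.0" 200 1506
EOF
```

The trade-off is that awk holds one counter per distinct URL in memory, while the sort-based pipeline can spill to disk on very large logs.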



Published at DZone with permission of Geoffrey Papilion, DZone MVB. See the original article here.

