Getting Unique Counts from a Log File
Two colleagues of mine ask a very similar question in interviews. The question is not particularly hard, nor does it require a lot of thought to solve, but it's something that, as a developer or an ops person, you might find yourself needing to do. The question is: given a log file of a particular format, tell me how many times something occurs in that log file. For example, tell me the number of unique IP addresses in an access log, and the number of times each IP has visited the system.
It's amazing how many people don't know what to do with this. One of my peers asks people to do this using the command line; the other tells candidates they can do it any way they want. I like this question because it's VERY practical; I do tasks like this every day, and I expect the people I work with to be able to do them too.
A More Concrete Example
I like the shell solution, because it's basically a one-liner. So let's walk through it using access logs as an example.
Here is a very basic sample of a common access_log I threw together for this:
127.0.0.1 - - [10/Oct/2000:13:55:36 -0700] "GET /apache_pb.gif HTTP/1.0" 200 2326
192.168.0.1 - - [10/Oct/2000:13:55:41 -0700] "GET /missing.html HTTP/1.0" 404 506
192.168.0.2 - - [10/Oct/2000:13:55:48 -0700] "GET /missing.html HTTP/1.0" 404 506
192.168.0.5 - - [10/Oct/2000:13:56:42 -0700] "GET /missing.html HTTP/1.0" 404 506
192.168.0.6 - - [10/Oct/2000:13:57:05 -0700] "GET /missing.html HTTP/1.0" 404 506
192.168.0.1 - - [10/Oct/2000:13:58:36 -0700] "GET /missing2.html HTTP/1.0" 404 506
192.168.0.1 - - [10/Oct/2000:13:59:28 -0700] "GET /exitst.html HTTP/1.0" 200 1506
192.168.0.3 - - [10/Oct/2000:14:15:20 -0700] "GET /exitst.html HTTP/1.0" 200 1506
192.168.0.7 - - [10/Oct/2000:14:16:32 -0700] "GET /missing3.html HTTP/1.0" 404 506
192.168.0.7 - - [10/Oct/2000:14:20:54 -0700] "GET /exitst.html HTTP/1.0" 200 1506
192.168.0.8 - - [10/Oct/2000:13:22:42 -0700] "GET /exitst.html HTTP/1.0" 200 1506
Let's say you want to count the number of times each unique IP address has visited this system. Using nothing more than awk, sort, and uniq you can find the answer. Pull the first field with awk, pipe that through sort so identical addresses end up on adjacent lines, and then pipe it through uniq -c, which collapses each run of identical lines and prefixes it with a count. This isn't fancy, but it returns the result very quickly without a whole lot of fuss.
Like so:
~/Projects/access_logs$ awk '{print $1}' < access_logs | sort | uniq -c
      1 127.0.0.1
      3 192.168.0.1
      1 192.168.0.2
      1 192.168.0.3
      1 192.168.0.5
      1 192.168.0.6
      2 192.168.0.7
      1 192.168.0.8
This gives you each hostname or IP, and the number of times they’ve contacted this server.
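The question as posed also asks for the number of unique addresses, which the pipeline above implies (one output line per address) but does not print. The following is a minimal sketch of one way to get that number and to rank the addresses by request count. The `sample_log` scratch file and the `unique_ips` variable are illustrative names, not part of the original walkthrough.

```shell
# Write the sample log to a scratch file so the sketch runs on its own.
# (sample_log is a hypothetical name used only for this illustration.)
sample_log=$(mktemp)
cat > "$sample_log" <<'EOF'
127.0.0.1 - - [10/Oct/2000:13:55:36 -0700] "GET /apache_pb.gif HTTP/1.0" 200 2326
192.168.0.1 - - [10/Oct/2000:13:55:41 -0700] "GET /missing.html HTTP/1.0" 404 506
192.168.0.2 - - [10/Oct/2000:13:55:48 -0700] "GET /missing.html HTTP/1.0" 404 506
192.168.0.5 - - [10/Oct/2000:13:56:42 -0700] "GET /missing.html HTTP/1.0" 404 506
192.168.0.6 - - [10/Oct/2000:13:57:05 -0700] "GET /missing.html HTTP/1.0" 404 506
192.168.0.1 - - [10/Oct/2000:13:58:36 -0700] "GET /missing2.html HTTP/1.0" 404 506
192.168.0.1 - - [10/Oct/2000:13:59:28 -0700] "GET /exitst.html HTTP/1.0" 200 1506
192.168.0.3 - - [10/Oct/2000:14:15:20 -0700] "GET /exitst.html HTTP/1.0" 200 1506
192.168.0.7 - - [10/Oct/2000:14:16:32 -0700] "GET /missing3.html HTTP/1.0" 404 506
192.168.0.7 - - [10/Oct/2000:14:20:54 -0700] "GET /exitst.html HTTP/1.0" 200 1506
192.168.0.8 - - [10/Oct/2000:13:22:42 -0700] "GET /exitst.html HTTP/1.0" 200 1506
EOF

# Number of distinct client addresses: sort -u drops duplicates and wc -l
# counts what is left. tr strips the padding some wc implementations add.
unique_ips=$(awk '{print $1}' "$sample_log" | sort -u | wc -l | tr -d ' ')
echo "unique IPs: $unique_ips"

# Per-address counts, busiest first (sort -rn = numeric, descending).
awk '{print $1}' "$sample_log" | sort | uniq -c | sort -rn
```

On the sample data this reports 8 unique addresses, with 192.168.0.1 listed first at 3 requests.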
Upping the Complexity
Now for something more complex: let's say you want to get the most commonly requested document that returns a 404. Again, we can do this in a shell one-liner. We still need awk, sort, and uniq, but this time we'll also use tail. We use awk to examine the status field ($9) and print the URL field ($7) if the status returned was 404. We then use sort, uniq -c, and sort -n to order the results by count. Finally, we use tail to print only the last line, which has the highest count, and awk to print just the requested document.
So here is what this looks like:
~/Projects/access_logs$ awk '{if($9=="404"){print $7}}' access_logs | sort | uniq -c | sort -n | tail -1 | awk '{print $2}'
/missing.html
Of course, there are many other ways to do this. This is a simple way to do it, and the best part is that you can count on these tools being on almost every *nix system.
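As one example of those other ways, the counting can be done inside awk itself with an associative array instead of sort and uniq. This is a sketch of an alternative, not the approach used above; `sample_log`, `top_404`, and the array name `hits` are illustrative, and the data is a trimmed copy of the sample log.

```shell
# A trimmed copy of the sample log, written to a scratch file so the
# sketch runs on its own (sample_log is a hypothetical name).
sample_log=$(mktemp)
cat > "$sample_log" <<'EOF'
127.0.0.1 - - [10/Oct/2000:13:55:36 -0700] "GET /apache_pb.gif HTTP/1.0" 200 2326
192.168.0.1 - - [10/Oct/2000:13:55:41 -0700] "GET /missing.html HTTP/1.0" 404 506
192.168.0.2 - - [10/Oct/2000:13:55:48 -0700] "GET /missing.html HTTP/1.0" 404 506
192.168.0.5 - - [10/Oct/2000:13:56:42 -0700] "GET /missing.html HTTP/1.0" 404 506
192.168.0.6 - - [10/Oct/2000:13:57:05 -0700] "GET /missing.html HTTP/1.0" 404 506
192.168.0.1 - - [10/Oct/2000:13:58:36 -0700] "GET /missing2.html HTTP/1.0" 404 506
192.168.0.7 - - [10/Oct/2000:14:16:32 -0700] "GET /missing3.html HTTP/1.0" 404 506
EOF

# Tally each URL that returned a 404 in an awk array keyed by URL, then
# walk the array once at the end and keep the URL with the largest tally.
top_404=$(awk '
  $9 == "404" { hits[$7]++ }
  END {
    for (url in hits)
      if (hits[url] > max) { max = hits[url]; top = url }
    print top
  }
' "$sample_log")
echo "$top_404"
```

This reads the file once and needs no sorting, which can matter on large logs, at the cost of holding one array entry per distinct URL in memory. If two documents tie for the highest count, which one is printed depends on awk's array iteration order.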
Published at DZone with permission of Geoffrey Papilion, DZone MVB. See the original article here.