How Often Should You Look at Your Event and System Logs?
The motivation for this post came from a question posted to a LinkedIn system administrator group this weekend:
“Do you look at your event and system logs daily, weekly, or just when there is a problem?”
And I guess a natural follow-on question is:
“and how often should you look at them?”.
There’s no single answer that’s absolutely correct. When I’m asked this, I usually respond by saying something as definitive as:
“well, that depends….” :-D
How often you should analyse your log data really depends on the reason why you are carrying out the task in the first place, i.e., why are you analysing your logs, and what exactly are you interested in finding? Below I list some of the common use cases for analysing logs and give some pointers on how often it makes sense to do so in these situations.
During iterative development and deployment you will need to keep an eye on your logs to make sure you are not introducing any bugs as you add new features or fix known issues. This is particularly the case when you are doing regular (e.g. daily) deployments on either production or staging environments. How often you need to analyse your logs is going to depend on how often you make changes and how often you deploy. Some bugs may raise their head immediately and some more intermittently, so you’ll also want to review your logs periodically to make sure that you didn’t break anything and set up relevant alerts in case you did.
Load and performance testing is a key part of the software development cycle. It is imperative that systems are tested such that they meet performance requirements. For example, web pages should be load tested so that they are responsive under different conditions (e.g. low user load, medium load, high/peak load). You may also want to carry out stress testing – i.e. load test the system to see at what point your system becomes unresponsive/breaks or at what point the response time is no longer acceptable from a performance perspective. With stress testing, you get to know your system limits and when you are going to need to add more servers or modify your architecture.
Either way, during load and performance testing you will need to keep an eye on your logs periodically to see if anything breaks as the stress on the system increases. You may also want to do a full review after the test has completed to see whether the test was clean or whether errors were being written to the logs. It’s worth noting that the volume of logs produced during load testing can be particularly high, and it’s not uncommon for performance test teams to run what they term ‘long run’ tests, which can last for days. So it’s a good idea to automate your log analysis as much as possible, e.g. using tagging, alerting, or log indexing and search functionality to make it easy to find issues.
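To give a feel for what automated tagging can look like, here is a minimal Python sketch that scans log lines and counts how many match each tag. The tag names, patterns and sample lines are all made up for illustration and are not how Logentries implements tagging; you would swap in patterns that suit your own log format.

```python
import re
from collections import Counter

# Hypothetical tag names and patterns; adjust these to your own log format.
TAG_PATTERNS = {
    "error": re.compile(r"\b(ERROR|FATAL)\b"),
    "exception": re.compile(r"Exception|Traceback"),
    "timeout": re.compile(r"timed? ?out", re.IGNORECASE),
}


def tag_line(line):
    """Return the set of tag names whose pattern matches the line."""
    return {tag for tag, pattern in TAG_PATTERNS.items() if pattern.search(line)}


def summarize(lines):
    """Count how many lines carry each tag."""
    counts = Counter()
    for line in lines:
        counts.update(tag_line(line))
    return counts


# Invented sample lines standing in for the output of a load test.
sample = [
    "2024-01-01 10:00:00 INFO request served in 120ms",
    "2024-01-01 10:00:01 ERROR connection timed out",
    "2024-01-01 10:00:02 FATAL java.lang.NullPointerException",
]
summary = summarize(sample)
for tag, count in sorted(summary.items()):
    print(tag, count)
```

Running something like this over the full log of a long test gives you a quick per-tag summary, so you only need to dig into the raw lines when a count looks wrong.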
Support requests or reported issues can require analysing your logs. For example, we have support teams that regularly use Logentries to answer user queries about failed logins or issues with payments. In this situation log analysis is usually performed on a per-request basis.
When managing live systems you will want to know immediately when something is up, and your system logs are one of the first places to look if you suspect a problem. However, it’s unlikely that you will want to be constantly looking at your logs ‘in case’ something breaks. So it’s a good idea to set up some alerts. That way, if there is suddenly a spike in exceptions or errors, you will be notified. It’s likely that you will also want to glance at your logs periodically.
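As a rough illustration of a spike alert, the Python sketch below compares the error count for the latest interval against the average of recent intervals. The function name, the 3x factor and the minimum event count are assumptions chosen for the example, not recommended values; a real alerting rule would be tuned to your own traffic.

```python
from statistics import mean


def is_spike(history, current, factor=3.0, min_events=10):
    """Return True if `current` is well above the recent average.

    history    -- error counts for recent intervals (e.g. per minute)
    current    -- error count for the latest interval
    factor     -- how many times the baseline counts as a spike (assumed value)
    min_events -- ignore tiny counts so a quiet system does not alert on noise
    """
    if not history:
        return False  # no baseline yet, so nothing to compare against
    baseline = mean(history)
    return current >= min_events and current > factor * baseline


recent_errors = [2, 3, 2, 4]  # invented per-minute error counts
if is_spike(recent_errors, 30):
    print("ALERT: error rate spiked")
```

In practice the print call would be replaced by whatever notification channel you use, such as email or a chat webhook.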
A quick glance at your logs can give you a ton of info about how your system is behaving or how people are using your system. For example, it is common to see a daily pattern of log volumes increasing during busy periods and decreasing when load drops. The Logentries log graph below shows a typical log volume pattern for a web site operating normally, with peaks and troughs of activity, over the course of a week.
A sudden spike in log events often reflects increased load on the system, but it can also point to a fault, as in the log graph below, where log events suddenly jumped from 50 events per second to almost 3,000 events per second due to an internal system error.
Also, if you set up tags in your system, a quick glance can tell you immediately if there were any dropped web requests or exceptions in your system. You can also set up tags to understand any business-related events in your system, such as the number of sign-ins or payments in the last 24 hours. For example, the log graph below shows a combination of fatal errors and business events over a three-day period.
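If your events are already tagged, counting them over the last 24 hours is straightforward. The Python sketch below assumes each event is a (timestamp, tag) pair; the tag names and timestamps are invented for the example.

```python
from collections import Counter
from datetime import datetime, timedelta


def count_recent(events, now, window=timedelta(hours=24)):
    """Count events per tag whose timestamp falls in the window ending at `now`."""
    cutoff = now - window
    return Counter(tag for ts, tag in events if cutoff <= ts <= now)


# Invented sample data: (timestamp, tag) pairs.
now = datetime(2024, 1, 2, 12, 0, 0)
events = [
    (datetime(2024, 1, 2, 11, 0, 0), "sign-in"),
    (datetime(2024, 1, 2, 9, 30, 0), "payment"),
    (datetime(2024, 1, 1, 13, 0, 0), "sign-in"),
    (datetime(2024, 1, 1, 10, 0, 0), "sign-in"),  # older than 24 hours, excluded
]
recent = count_recent(events, now)
print(dict(recent))
```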
If you need to review logs to meet compliance or security standards, you may actually be required to review your logs at a particular frequency. For example, the Payment Card Industry Data Security Standard (PCI DSS) applies to organizations that handle credit card transactions. It mandates logging specific details, log retention and daily log review procedures. To be precise, under PCI DSS Requirement 10, which is dedicated to logging and log management, logs for all system components must be reviewed at least daily. For all you need to know about PCI DSS compliance, check out Anton Chuvakin’s recent work. There are also similar rules around the storage of health information as mandated by HIPAA.
I hope this helps somewhat to answer the question on how often it makes sense to analyse your logs. As always let us know if we’ve left any use cases out that you think we should add to the list!