High Performance And Smarter Logging
Join the DZone community and get the full member experience.
Join For Free1. Summary
Logging, often a back-burner activity in the development cycle can actually be the very lifeline of the development team once the application is in production. Assuming that readers are well acquainted with various logging frameworks like Log4J, SLF4J, etc, this article lays guidelines for 'real-world' production logging, examines its influence on the application quality attributes and also rediscovers a forgotten high performance, smarter logging technique with codes and centralized logging.
2Introduction
Logging is often relegated to a back-burner activity while architecting, designing and developing applications or even when enhancing applications. Finally when the application is ready, you deploy it. And then what happens?
Your application now is out of your development environment - it is not running out of your favorite IDE workbench and much less having a debugger at your service. It is now that you realize the importance of logs. And as you wade through a huge set of logs trying to debug any problem with your application, you realize that it is not a trivial task. It is boring, tedious and time-taking. You ask yourself, with all your experience and expertise, is this what you were destined to do - finding a needle in a haystack of logs! Or possibly you off-load this activity to your sub-ordinates or an application maintenance team who equally curse you for their tedious and thankless job! Or possibly they would just throw up their hands, and you the specialized architect/designer would get involved, muttering under your breath how your skills are being wasted away.
Logs can be the very lifeline of your production application and should not be taken lightly or as an afterthought. Sure, most of us know the basics - that there are various logging levels which can be turned on or off at will, that there are log categories and using one's favorite logging framework - Log4J, Commons Logging or SLF4J one can direct logs to various destinations such as Files, Databases, JMS Queues, etc.
But how many of us actually plan our logs? How many of us understand how logging affects other system quality attributes? How many of us continually prune logs keeping in mind how logs can impact our system and our lives once the application goes live? How many of us have used the advanced features of our favorite logging frameworks to achieve high performance logging?
This article attempts to emphasize the importance of this oft-ignored but vital aspect of an application and provides guidelines for application logging. Finally it revives the design of a forgotten logging technique that could achieve high quality logging without impacting performance and other quality attributes.
3Logging with respect to other System Quality Attributes
3.1.1Monitoring
Logging is the most frequent implementation of the Monitoring attribute. It helps the application developer get to the root cause of application bugs. But often it is made out to be the only implementation of Monitoring, possibly because it is the easiest thing in the world for an application developer to write a System.out.println() or logger.log() statement at various places in the code. However there is such a thing as too much logging, which can result in having useful incident logs being rolled over thus defeating the very purpose and intention of logging. Application developers would do well to find other techniques of monitoring the system, such as using or developing JMX Consoles. This would be discussed in detail in the section 'Logging for production-ready applications'.
3.1.2Performance
Ask any performance specialist or architect and in nearly 90% of the applications, excessive logging has been found to have a very detrimental effect on performance. Logging is an I/O intensive activity and can get in the way of the application's performance, particularly when logging is done in the traditional way that is, synchronously writing to a FileAppender in the context of application threads. Not only that, it can often contribute to heavy heap consumption and garbage collection due to logging code such as -
if (logger.isDebugEnabled()) logger.debug("name "+person.name+" age "+person.age+" address "+person.address);
Not only that, the doAppend() method that is internally called for logging to a Log4J Appender is synchronized for thread safety. What this also means is that application threads are not only doing heavy disk I/O operation synchronously, they are also blocking each other while writing to the log! In one of the performance critical situations for an e-Governance Portal, a thread dump revealed that portal threads were blocking each other for doing logging into a single file appender as the application logs were being written to a single appender file set!
In fact, ascertaining the current log level of the application is the first and foremost thing that a performance specialist does, and often just changing the log level from DEBUG to INFO or WARNING gives significant performance benefits. However, after the performance benchmarking exercise or immediate scalability problem is resolved, the log level is changed back again to DEBUG by some application developer trying to find the root cause of a functional bug in the application. This is an indication of unhealthy logging practices. In 'Logging for production-ready applications' we also discuss discipline and maintaining hygiene of logs. The 'High performance and quality logging' section elaborates techniques to achieve quality logging without compromising on the performance of the application.
3.1.3Security (auditing and other sensitive information)
Audit logs are a special class of logs that enable security auditing of the application, and are used to track actions of users. This is an example of how logging aids security practices.
At the other extreme, however, if logs carry sensitive information such as passwords of user accounts, they may expose a system vulnerability.
Thirdly, logging application events and flow may aid developer's monitoring and debugging of an application but may also expose the internal architecture of the application which is certainly not desirable.
In today's cloud application environments, where applications may be hosted on public clouds, such vulnerabilities can pose risks to the Intellectual Property of the application owner.
3.1.4Scalability and high availability
Logging similarly influences and is influenced by scalability and high availability. By improving performance of an application with 'High performance and quality logging' techniques, you improve scalability and availability of the application with lesser hardware, as your application is already able to pack a heavier punch with the existing resources and has lesser reasons to fail.
When an application is scaled out or is enabled for an active-passive configuration for availability, there are multiple instances of the application and the strategy of logging becomes important. Would the application support or development team like to gather logs from 10 different machines or directories or have a centralized location for collecting logs? This is where centralized distributed logging becomes essential.
3.1.5Recoverability
Major databases like Oracle already use redo logs to ensure recoverability of transactions. Applications may also borrow a leaf and use a special class of logs for recoverability if required.
3.1.6Error Handling and Fault tolerance
In most applications, logging is one of the, or sometimes only measure taken for error handling. In cases of recurrent errors such as unavailability of SMS / Mail server or even database for long periods of time, there is no benefit gained and even much harm incurred by logging the error repeatedly and frequently, especially with huge exception stack traces, due to increased I/O activity. In the process, one may discover that important logs that were important for analyzing a week old frustrating issue have been rolled over. This approach only makes a bad situation worse.
3.1.7Capacity
While doing capacity sizing of an application, architects would do well to consider the size of logs that would be generated by the application during production and estimate the disk space required, the provisioning of centralized file system and so on.
For centralized logging in a distributed environment, it is also relevant to estimate the size of log objects distributed over the network to a remote machine.
4A case study of missing the importance of logging
This section presents the experiences of the author in the form of an embedded application case study, where the importance of logging was missed during architecture, design and development phase, only to be driven home later on through bitter lessons learned the hard way.
The embedded application in question was a GPS device unit which was to be deployed on vehicles for enabling location tracking. The device did not provide any user interface apart from LEDs and a few buttons, so there would be nobody to administer the application inside the vehicles, unlike the luxury one is used to with server side applications. So if anything went wrong with the device application, how would the application developers supposed to diagnose it, when all logs were being written inside the device being conveyed around the whole of the country inside trucks?
What the application developers had naively assumed was that the Client Organization would have service engineers nearly around the whole of India who could take down the device and bring it to them for analysis. That being not possible, one of the developers had to accompany the service engineers to copy the logs by logging into the device operating system. You can imagine what an effect those late night sojourns had on his health and state of mind.
The next thing done by the application developers was to build a log download button in the desktop servicing application for the device. The service engineers would carry the desktop servicing application in laptops and download the logs from the devices. This was at least an improvement, that the device would not have to be taken off the vehicle and there was respite for the poor developer who was earlier accompanying the service engineers to remote places at ungodly hours. However, soon it became clear that the service engineers also could not travel to remote locations all the time and they were demanding that it should be the application development team's lookout to download the logs.
Finally, the development team had to enhance the device application so that it could send its logs to the backend server on demand over wireless GPRS just like it was sending other tracking data.
The point to be noted here was that all this extra effort had certainly not been captured during requirements gathering or budgeted for during initial estimation and development. The development team had a typical customer functional requirement driven mind-set. Application logging was not a customer-driven requirement and was not a prominent non-functional attribute and the novice developers did not have the foresight to convince their equally short-sighted seniors/managers to give them time to build such a facility as a budgeted effort. And when dozens of those devices went live (with 100s more waiting at the store) and they met all the above problems, they had to burn midnight oil while their managers and clients breathed angrily down their necks. Such a thankless job!
5Logging for production-ready applications
The previous section on 'Logging with respect to other System Quality Attributes' we have already come across some issues that are faced in production environment. Let us enlist them and more such dos and don'ts in this section, which would make an application production ready.
·Discipline and code review of logs
Discipline is of prime importance and this single step would pave the path for other best practices discussed in this article. There are many production systems having frivolous logs such as 'Hi', 'Came over here', 'Done', 'XXXYYYYZZZZ'. Most probably these were introduced during debugging of the application or even during unit testing in the development phase of the application.
The excuse given for such logs is that they would be at the debug level and could easily be turned off. However, in practice, this is rarely done and what is more, there are other logs at debug level which are not so frivolous. To control such logs, one would require extremely fine-grained and careful level of configuration of the logging framework and this is quite often missed out in production.
Code reviews have to be more effective and senior developers or team leads reviewing the logs must be firm to get less useful logs removed even on the face of defiant comments such as - 'I have just done it for testing', 'What's the brouhaha about the whole thing; it is just a simple log'. It would be well to have a rule that logs should not be the means of debugging basic problems during development – for that we have debuggers in IDEs!
·Categorize logs with the same care as you would do for modularizing your application
This seems an obvious practice but it is surprising how many times it has been seen in the production environment that application logs are being written with the Application Server's SystemOut or SystemError files, or as discussed previously, the case where an e-Governance Portal's threads were blocking one another, contending for a single FileAppender and in turn a single file I/O.
At the very least, developers are expected to have package-based or fully qualified class naming convention for obtaining Loggers and possibly use different Appender destinations for different classifications of logs.
It would be well to keep in mind that the log level during production is supposed to be WARNING or depending on the information in logs, at INFO level.
An effective technique would be to list down all the possible logs in a central configuration or constants classes and allow developers to only use those logs. A more effective technique is discussed in 'The Design of High Performance Smarter Logging'.
·Logging in a clustered and distributed environment
Nearly all production server-side applications have to be clustered and distributed for scalability and/or availability. In a clustered environment, logs should reflect the components, modules, subsystems and even process instances from where they emerged.
There is great benefit of having a centralized log server in a distributed and clustered environment to prevent the hassle of collecting logs from multiple directories and machines. This would also have the extremely beneficial side effect of moving the I/O to a separate machine so that the performance of the application is not impeded by logging I/O.
·Logging in thick distributed clients or embedded applications
As illustrated in the anecdote covered in section ‘A case study of missing the importance of logging’, the logs in thick clients or embedded applications are not easily accessible to the development team to aid their analysis of defects. The mechanism of transferring the logs remotely should be well thought out and developed with proper inclusion in the project plan.
·Using Mapped Diagnostic Context and Nested Diagnostic Context
Log4J's MappedDiagnosticContext (MDC) and NestedDiagnosticContext (NDC) use ThreadLocal storage to store context specific information. They can be used to store information such as user name or transaction id to identify all operations done by the particular user or transaction. This obviates the need to pass context specific information across classes and methods just for the purpose of logging. With PatternLayout's %x or %X{key}, the stored value would be rendered in the log.
·Planning life-cycle and maintenance of logs
This includes planning the size of each log file and the maximum number of files before they get rolled over. This is a no-brainer, but often occurs - the size of the log file should not be so huge that it can’t be opened with a text editor! Just as there are scripts to take regular database backups, there should be scripts to backup and archive logs as well. It is a good idea to zip the log files in case there are limits of disk space, which would also make them easier to transport remotely.
·Resist the temptation to log source location information at runtime
Location information comes at the price of loss of performance as the logging framework tries to determine the current thread stack and thereby the method, filename and line number, which is a costly operation. Rather, the log information in itself should suffice to find out the source of the log by giving information of the server, subsystem, module, component, thread and so on.
·Avoid recording errors with long stack traces repeatedly
If possible, there should be enough information in the log to indicate where the error has occurred and huge stack traces should be avoided as far as possible. Of course, this is not a mandate especially for exceptions such as NullPointerException. But it can be done for some specific application errors which can be easily identified. Also, when there is a recurrent problem that stays for a long duration such as connectivity issue with Email, SMS servers or Database Servers also, it would be well to record the issue every 5 minutes or so rather than filling up the logs with huge stack traces every few seconds.
·Do not use AOP indiscriminately for injecting logs, especially in production
The most basic textbook AOP example given to a newbie is for logging. The argument is that logging is a cross-cutting concern and one can use the powers of AOP to inject logs before entering or after exiting a method. This example or argument should be seriously considered before applying in production worthy applications. As has already been established above, logging is no trifling matter but which deserves planning like most other Non Functional Requirements (NFRs).
·Do not use logging as a replacement for other monitoring strategies
One of the most bizarre cases of using logging is for ‘performance logs’ such as the following –
19 Sept 2010 10:20:30 PERF INFO Thread-25 OrderInsertAction.java Time taken in processing OrderInsertAction: 50ms
19 Sept 2010 10:20:33 PERF INFO Thread-8 OrderInsertDao.java Time taken in insert 30ms
The author has been guilty of doing this herself, not realizing that she was jeopardizing the very performance she was measuring by increasing the I/O.
What would be infinitely better practice is to capture a ‘TimeStatistic’ with total time and a counter to give the average time, the maximum time and the minimum time taken and to display the same with a GUI screen.
6The Design of High Performance Smarter Logging
The strategy discussed in this section is essentially to combine centralized logging with a logger facade that uses integer codes instead of Strings for logs. This technique has been used by the author at the advice of her mentor with much success.
There are numerous articles on the internet to build centralized logging with JMS queues and topics or even with sockets. Centralized logging enables high performance by moving the I/O activity to a different machine leaving the application nodes to have a slight overhead due to network I/O.
However, the key here is to combine the centralized logging with the usage of codes instead of lengthy strings. The practice of using strings to log information innocently encouraged by existing frameworks like Log4J or Commons Logging has the effect of using up precious memory, disk and network resources where a simple code could suffice.
A separate file can list the mapping between the error codes and the complete human decipherable String.
Consider the following log-
[090822 16:02:48] TX WARNING (Tx-2-thread-1: 1163 transmitData): Server has not responded with an ACK, so trying again.
in contrast with
1300604499194,4,192168001002,20600,1001,2,500000
The above log indicates the timestamp in long, the log level, the ip address of the machine generating the log, integer values for process, module of the process, the instance id of the application and an integral error code which when translated would convey the same meaning as the complete lengthy string. Such an object is also extremely light to transport over the network in binary format. In case there is contextual information to further qualify the information in the log, an Object[] array can also be passed while the main error code would translate to a String with placeholders as in printf() format.
Using short codes for errors is a practice that is quite common in nearly all major products such as Oracle, WebSphere, Microsoft. For example in Microsoft Office applications when there is an error, all one gets in the error dialog box, is an undecipherable integer code which can be sent to Microsoft support to diagnose.
The various error codes could be translated to the complete strings by a log translator while viewing.
There are the following benefits to consider -
·Reduction of disk space and thereby longer shelf life of logs and smaller file sizes
·Some security that the application's internal design and working is not revealed by log. The logs can be translated to full text by using a translator for viewing offline.
·Reduction of memory usage by avoiding construction or network transfer of lengthy strings.
·As the logs transferred in the network are extremely light, it is possible to keep even debug logging on with minimal overhead.
·Random construction of logs is prevented
·More efficient searching of specific errors by custom tools.
Furthermore, logging discipline can be enforced by preventing direct usage of Log4J or other such frameworks and writing a custom facade on top of your favorite logging framework or custom logging such as -
public class LogClientFacade { public void log(int logLevel, int instanceId, int subsystemId, int componentId, int errorCode); public void logWithContext(int logLevel, int instanceId, int subsystemId, int componentId, int errorCode, Object[] contextInfo); public void logWithEx(int logLevel, int instanceId, int subsystemId, int componentId, int errorCode, Throwable ex); ... }
Such a log interface would ensure that developers take care to log essential information such as subsystems, components and so on for diagnosing problems in a distributed environment and would obviate the need for example, of recording the source line number and file name by expensive runtime operations.
he above diagram shows the suggested design for such a solution. The intention is to make the process of logging as non-blocking as possible by collecting information in queues and asynchronously processing or transferring them in receiver threads. Maximum benefit would be obtained by transferring the information in binary serialized format over the network, especially if the complete solution is homogeneously using a single language.
There would be a simple GUI to watch logs as they pour into the server and also look at offline logs. It would use a translator to convert the logs into textual format. The logs themselves would be written on the disk in binary format. It may be noted that this methodology has significant relevance in cloud environments to safeguard the confidentiality of the deployer's intellectual property.
6.1A note about the implementation
It is possible to extend existing frameworks such as Log4J / Commons Logging / SLF4J. However, in doing so there may be some reduction in efficiency to comply with the generic nature of the internal APIs of the frameworks. For example, Log4J serializes log messages and stack traces as strings to send them over network in SocketAppender and JMSAppender. It is fairly easy to extend and override selected portions of the chosen frameworks, such as by adding or extending new Appenders, extending internal data transfer objects like LoggingEvent and doing custom serialization. Or, one could just create a simple custom logging framework which would not take long to implement, if maximum flexibility is desired.
Another interesting decision is whether to use JMS which can be used if the application is running in an application server or by employing a standalone queue such as WebSphereMQ, HornetQ or ActiveMQ. However, the following is the author's advice if the chosen transport is JMS-
·Use lenient quality attributes - such as avoiding transactions, avoiding persistence and allowing duplicates in the queue transport. Remember that stringent reliability will bring down performance and is most likely not necessary for logs.
·Within a single JVM, that is either at the log server or at the client subsystems, it is advisable to use either lightweight java.util.concurrent Queues or an in-VM queue implementation which would avoid serialization overheads.
·It would be advisable to use message brokers or bridges for the transport, rather than having a centralized queue and doing remoting calls to the same.
The author's personal preference is using simple sockets instead.
7Conclusion
With this article, we have discussed best practices and identified malpractices in general for logging. We have also proposed a technique to combine centralized logging with the usage of codes instead of strings as a high performance smarter logging practice.
The author credits her former mentor, Mr Akash Gupta, Solution Architect at InterGlobe Technologies (http://in.linkedin.com/pub/akash-gupta/3/79/2b3) to have conceived the proposed design of High Performance and Smarter Logging, and under whose guidance she successfully implemented and observed the tremendous performance benefits of this technique.
8References
Pro Apache Log4J by Samudra Gupta
Opinions expressed by DZone contributors are their own.
Trending
-
Health Check Response Format for HTTP APIs
-
Step Into Serverless Computing
-
Performance Comparison — Thread Pool vs. Virtual Threads (Project Loom) In Spring Boot Applications
-
How To Approach Java, Databases, and SQL [Video]
Comments