Need for Application Log Analytics
One of the biggest hurdles in restoring service after an application crash is wading through the log files to identify the reason(s) for the failure. Tools like vi and grep remain popular for diagnosing application failures largely because many applications do not follow well-defined guidelines when writing log information. In most cases, development teams consider application logging an overhead and treat it merely as a ‘tick in the box’. Application developers need to realize that they must provide system administrators with sufficient information to allow quicker resolution and restoration of service. Simply including packages like log4j and dumping some text is not very helpful.
The situation is aggravated because enterprises are deploying a large number of applications, each often running on many servers to handle volume and to provide reliability and availability. However robustly developed, applications are known to fail once in a while, and the task of restoring service falls on the shoulders of the system administrators, who need to go through the log files and identify the causes of failure. When the number of servers is very large, scanning log files on multiple servers becomes a monumental task.
The Pillars of Application Log Analytics
As the system of manually scanning log files is not scalable, we need a better solution. We need to introduce automation and the ability to identify failures as soon as they occur. Today, many tools are available for parsing log files and generating reports from them. Some of the tools are Splunk, Fluentd, Logstash, Elasticsearch, Kibana, Google Analytics, SiteCatalyst, Papertrail, Airbrake, Exceptional, Logscape and Loggly. While auto-correction may be the end goal, we can at least aim to put in place a system that generates an email alert or logs a service ticket whenever a failure is detected.
To be in a position to introduce automation for application logs and to generate reports, alerts or tickets, we believe that application developers need to implement the two pillars (as we are calling them) of application logging: ‘Annotated Data’ and ‘Search’. By implementing these pillars, enterprises can enable analytics for their application logs. Initially, the analytics can be confined to raising alerts, sending email and/or creating tickets. With the passage of time, the analytics solution can be extended to predict, with confidence, the occurrence of significant events.
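As a minimal sketch of the "alert on failure" idea, the Python snippet below scans log lines for failure markers and builds an email alert without sending it. The marker list, the function names and the example.com addresses are assumptions made for illustration; they do not come from any particular tool.

```python
from email.message import EmailMessage

# Hypothetical keywords that mark a log line as a failure.
FAILURE_MARKERS = ("ERROR", "FATAL")

def find_failures(lines):
    """Return the log lines that contain any failure marker (case-insensitive)."""
    return [line.rstrip("\n") for line in lines
            if any(marker in line.upper() for marker in FAILURE_MARKERS)]

def build_alert(failures, sender, recipient):
    """Build (but do not send) an email alert listing the failing lines."""
    msg = EmailMessage()
    msg["Subject"] = f"{len(failures)} failure(s) detected in application log"
    msg["From"] = sender
    msg["To"] = recipient
    msg.set_content("\n".join(failures))
    return msg

# Sample data: the INFO lines are invented for this example.
sample_log = [
    "INFO: purchase order 1001 created",
    "Error: error in database connection, errortype: database, table: department, module: purchase",
    "INFO: purchase order 1002 created",
]

failures = find_failures(sample_log)
alert = build_alert(failures, "logwatch@example.com", "sysadmin@example.com")
```

In a real deployment the message could be handed to the standard library's smtplib.SMTP.send_message(), or the same list of failures could be posted to a ticketing system instead.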
The first pillar of application log analytics is a good data set. From an analytics and inference perspective, our inferences and insights will only be as good or as bad as our data. Today, in many cases, the quality of the information written to log files is quite poor. Log entries are written without much thought and do not have a well-defined structure. Many development teams consider logging to be an overhead and hence do not follow well-defined guidelines while writing application logs. For example, if an exception occurs while fetching a record from the database, the corresponding log entry might be
error connecting to database
While this message conveys what went wrong, it does not provide any additional information that can be used for failure analysis. Instead, the error message can be written to identify the error type, the table and the module involved.
Error: error in database connection, errortype: database, table: department, module: purchase
This annotated information can be used to generate reports that list errors by error type, by table, by module and the like. By annotating the data written to the log file, we not only make debugging and problem identification easier, we also open the door to automation and application log analytics.
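To illustrate how annotated entries lend themselves to reporting, here is a small Python sketch that splits a "key: value" entry into fields and counts errors per table. The parse_annotated helper is hypothetical, and the second and third sample lines are invented variations of the entry above. The sketch assumes that values never contain a comma followed by a space.

```python
from collections import Counter

def parse_annotated(line):
    """Split a 'key: value, key: value' log line into a dict with lowercase keys."""
    record = {}
    for part in line.split(", "):
        key, sep, value = part.partition(": ")
        if sep:  # ignore fragments that are not key/value pairs
            record[key.strip().lower()] = value.strip()
    return record

lines = [
    "Error: error in database connection, errortype: database, table: department, module: purchase",
    # The two lines below are invented for this example.
    "Error: error in database connection, errortype: database, table: employee, module: payroll",
    "Error: error in database connection, errortype: database, table: department, module: inventory",
]

records = [parse_annotated(line) for line in lines]
errors_by_table = Counter(record["table"] for record in records)
```

The same records could just as easily be grouped by module or by error type, which is what makes the annotated form more useful than free text.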
As another example, consider the ‘semi-annotated’ log entry given below. We call this example ‘semi-annotated’ because the error message does not contain a timestamp. Hence it will not be possible to locate this record by time range. We will need to search specifically for the business unit field or the source IP address.
error at SynchronizeDCM() method, for BusinessUnit = 20807 and SourceIP = 10.25.7.1
The following message, in contrast, has been annotated with a severity level, a timestamp, an error type and a descriptive error message.
ERROR, 2012-12-09 06:06:18, Type: Application, Error message: Failed to execute the command 'UpdateCommand' for table 'ShoppingCart'; the transaction was rolled back. Ensure that the command syntax is correct.
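Because this entry carries a timestamp, it can be located by time range. The Python sketch below parses the entry and checks whether it falls inside a time window. The parse_entry and in_window helpers are hypothetical, and the field layout (level, timestamp, type, message, separated by a comma and a space) is assumed from this single example.

```python
from datetime import datetime

def parse_entry(line):
    """Parse 'LEVEL, timestamp, Type: ..., Error message: ...' into a dict."""
    # maxsplit=3 keeps any later commas inside the error message intact.
    level, stamp, type_field, message_field = line.split(", ", 3)
    return {
        "level": level,
        "timestamp": datetime.strptime(stamp, "%Y-%m-%d %H:%M:%S"),
        "type": type_field.partition(": ")[2],
        "message": message_field.partition(": ")[2],
    }

def in_window(entry, start, end):
    """True if the entry's timestamp lies in the half-open interval [start, end)."""
    return start <= entry["timestamp"] < end

line = ("ERROR, 2012-12-09 06:06:18, Type: Application, Error message: "
        "Failed to execute the command 'UpdateCommand' for table 'ShoppingCart'; "
        "the transaction was rolled back. Ensure that the command syntax is correct.")

entry = parse_entry(line)
morning = in_window(entry, datetime(2012, 12, 9, 6, 0), datetime(2012, 12, 9, 7, 0))
```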
The second pillar of application log analytics is search. When dealing with voluminous data sets, the easiest way to locate data is to search for it. If you are not convinced, you only need to look at the popularity of the Google search engine. Just as Google aims to index all the information in the world, our aim should be to index (make searchable) all the application logs.
When presented with annotated text, it is easy to parse the data and store it as documents or in a database, making it easier to search. While databases have been used to store log information, they are not really well-suited to the task of search. What we need is a search engine. Apache Lucene is a popular search engine library that can be used for this purpose. Recently, Elasticsearch has gained a lot of popularity as an open source search engine. Splunk is another tool that is very popular in the application log analytics area.
Splunk is a tool for searching, monitoring, and analyzing machine-generated big data, via a web-style interface. It captures, indexes and correlates real-time data in a searchable repository from which it can generate graphs, reports, alerts, dashboards and visualizations. Splunk aims to make machine data accessible across the organization by identifying data patterns, providing metrics, diagnosing problems and providing intelligence for business operations.
While not in the same league as Splunk, Elasticsearch is gaining prominence as an open source alternative to it. Elasticsearch is commonly deployed as part of the ELK stack of tools, where E stands for Elasticsearch, L stands for Logstash (log parser) and K stands for Kibana (visualization).
Elasticsearch is an open source search engine built on top of Apache Lucene™, a full-text search engine library. Lucene is a complex, advanced, high-performance and fully featured library. Elasticsearch uses Lucene internally for all of its indexing and searching, but aims to make full-text search easy by hiding Lucene’s complexities behind a simple, coherent, RESTful API. Elasticsearch also provides the following:
- A distributed real-time document store where every field is indexed and searchable
- A distributed search engine with real-time analytics
- The ability to scale to hundreds of servers and petabytes of structured and unstructured data
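The idea of a document store in which every field is indexed and searchable can be illustrated with a toy in-memory inverted index in Python. This is only a sketch of the concept and not how Elasticsearch or Lucene is implemented. The LogIndex class is hypothetical, and the second and third sample records are invented.

```python
from collections import defaultdict

class LogIndex:
    """Toy in-memory inverted index over every field of each log record."""

    def __init__(self):
        self.docs = []                    # position in list = document id
        self.postings = defaultdict(set)  # (field, token) -> set of document ids

    def add(self, record):
        """Store a record and index each whitespace-separated token of every field."""
        doc_id = len(self.docs)
        self.docs.append(record)
        for field, value in record.items():
            for token in str(value).lower().split():
                self.postings[(field, token)].add(doc_id)
        return doc_id

    def search(self, **criteria):
        """Return the records that match every field=token criterion (logical AND)."""
        result = None
        for field, token in criteria.items():
            ids = self.postings.get((field, token.lower()), set())
            result = ids if result is None else result & ids
        return [self.docs[i] for i in sorted(result or ())]

index = LogIndex()
index.add({"errortype": "database", "table": "department", "module": "purchase"})
# The two records below are invented for this example.
index.add({"errortype": "database", "table": "employee", "module": "payroll"})
index.add({"errortype": "network", "table": "department", "module": "purchase"})

hits = index.search(errortype="database", table="department")
```

A lookup touches only the postings for the requested field values rather than scanning every record, which is the property that makes a search engine a better fit for log data than repeated text scans.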
With many applications being deployed on many servers, the task of managing them is becoming time-consuming. For business-critical applications, any downtime results in loss of business and in many escalations. By annotating the logs (so that they contain all the information relevant to failure analysis) and by ensuring that the generated log files are easily accessible, it is possible to implement an analytics solution that reduces the time needed to identify problems as and when they occur. Such a solution also serves as the first step towards predictive analytics.