This article was written by Gal Berg, VP R&D at Xpolog, and was republished with his permission.
It happens to everyone. Bugs pass through the most stringent testing processes, finding their way into the production environment. It’s true in long version cycles and more so in an Agile environment with continuous deployment. These problems can be the result of configuration problems and malfunctions in the production environment as well as bugs in the code itself.
There is an ongoing discussion about the role DevOps teams play in troubleshooting these problems and about how log management and analysis solutions in general, and Augmented Search in particular, can help expedite this process. In this article, I will show an example of how a bug in the code can remain undetected as it goes into production, and then how analytics technologies and Augmented Search can be used to discover and troubleshoot the problem within minutes.
Code Problem Example
Let’s say we have an application in production and have just added new code which is responsible for creating user accounts. The image below represents the browser-based front-end of this function, where users type the name of the user and click “Add”.
The Add User screen invokes the following AddUserAjax.JSP file, which receives the string that the user typed and sends an Ajax request to the AddUserAjaxAction.JSP file.
The AddUserAjaxAction.JSP file creates the Users class and places the string in it through the Users.add(name) function. If the function fails, i.e. it is not able to add the user to the database, it should throw an exception. In this case, as we can see below in the AddUserAjaxAction.JSP code, the developer did not instantiate the HashSet (the HashSet is null). Note that this is a very simple example; in most cases the code would be much more complex, involving many more classes, APIs, components, etc.
At the bottom of the code we can see that the developer handled the exception with log.error("Failed to add user " + name, e), which writes the error to the log data. But due to a developer oversight, the function does not return any indication that it failed; the only indication of failure is the log event. This means that the end user will not see any error after clicking Add. The method signature itself (public void addUser(String name)) gives the caller no way to learn that the user was not added. In this case, diagnostics tools would not be able to identify the problem, and of course there is no performance issue here at all.
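To make the failure mode concrete, here is a minimal sketch of the pattern described above. It is a reconstruction under assumptions, not the original application code: the field name names, the helper contains, and the class SilentFailureDemo are hypothetical, and java.util.logging stands in for whatever logging framework sits behind log.error.

```java
import java.util.Set;
import java.util.logging.Level;
import java.util.logging.Logger;

// Hypothetical reconstruction of the Users class described in the text.
class Users {
    private static final Logger log = Logger.getLogger(Users.class.getName());

    // Bug: the set is declared but never instantiated, so it stays null.
    private Set<String> names;

    // Returns void, so the caller cannot tell success from failure.
    public void addUser(String name) {
        try {
            names.add(name); // throws NullPointerException because names is null
        } catch (Exception e) {
            // The log event is the only trace that anything went wrong.
            log.log(Level.SEVERE, "Failed to add user " + name, e);
        }
    }

    // Hypothetical helper, added only so the demo can check the outcome.
    public boolean contains(String name) {
        return names != null && names.contains(name);
    }
}

class SilentFailureDemo {
    public static void main(String[] args) {
        Users users = new Users();
        users.addUser("alice"); // completes normally from the caller's point of view
        System.out.println(users.contains("alice")); // prints false
    }
}
```

From the caller's side the call to addUser looks like a success, which is why the front end has nothing to show the end user.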
This mishandling of the error is just one example of a very common problem: creating an error event in the log file without any indication of the exception within the application itself. This one-line error message will most probably be written to a log file that may include thousands of other events from multiple applications, so finding it without a log analysis system is virtually impossible. At this stage users may start to complain that they have added their user name but don’t seem to be registered, and so the clock starts ticking to find a resolution for this problem.
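One common way to avoid this pattern is to log the error and also propagate it, so that the calling page can report the failure to the user. The sketch below illustrates that idea only; the names UserStore and UserCreationException are hypothetical and do not come from the original application, and java.util.logging is again an assumed stand-in for the real logger.

```java
import java.util.HashSet;
import java.util.Set;
import java.util.logging.Level;
import java.util.logging.Logger;

// Hypothetical checked exception that forces callers to handle the failure.
class UserCreationException extends Exception {
    UserCreationException(String message, Throwable cause) {
        super(message, cause);
    }
}

// Hypothetical variant of the user store that reports failures to the caller.
class UserStore {
    private static final Logger log = Logger.getLogger(UserStore.class.getName());
    private final Set<String> names;

    // Passing null here simulates the uninitialized HashSet from the example.
    UserStore(Set<String> names) {
        this.names = names;
    }

    public void addUser(String name) throws UserCreationException {
        try {
            names.add(name);
        } catch (RuntimeException e) {
            // Keep the log event, but also surface the failure to the caller.
            log.log(Level.SEVERE, "Failed to add user " + name, e);
            throw new UserCreationException("Failed to add user " + name, e);
        }
    }
}

class PropagatedFailureDemo {
    public static void main(String[] args) {
        try {
            new UserStore(new HashSet<>()).addUser("bob"); // succeeds
            new UserStore(null).addUser("alice");          // fails and throws
        } catch (UserCreationException e) {
            // A front end could now show this message after the user clicks Add.
            System.out.println(e.getMessage());
        }
    }
}
```

With a checked exception the compiler forces every caller to decide what to do with the failure, so the log file is no longer the only place it appears.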
Troubleshooting with traditional log analysis
Assuming that the DevOps team has a log management and analysis solution, they will use it to start exploring the log data. We should remember that at this point they don’t have a clue as to the source of the problem, and so they may start by using broad search queries such as “error” or “user”.
These broad searches yield thousands and often tens of thousands of results. Finding the error message is like finding a needle in a haystack. If they know exactly what they are searching for, they can continue narrowing their searches until they find the problem. But this process can often take too much time.
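The gap between a broad and a precise query can be illustrated with a small sketch. Everything here is made up for illustration: the log lines are synthetic, the class LogSearchSketch is hypothetical, and a plain substring match stands in for a real log management tool's query engine.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Locale;

class LogSearchSketch {
    // Case-insensitive substring search over raw log lines.
    static List<String> search(List<String> lines, String term) {
        String needle = term.toLowerCase(Locale.ROOT);
        List<String> hits = new ArrayList<>();
        for (String line : lines) {
            if (line.toLowerCase(Locale.ROOT).contains(needle)) {
                hits.add(line);
            }
        }
        return hits;
    }

    // Synthetic log: 10,000 routine events plus the one event that matters.
    static List<String> syntheticLog() {
        List<String> lines = new ArrayList<>();
        for (int i = 0; i < 5000; i++) {
            lines.add("INFO  request " + i + " served for user session " + i);
            lines.add("ERROR cache miss while handling request " + i);
        }
        // Bury the relevant event in the middle of the noise.
        lines.add(5000, "ERROR Failed to add user alice");
        return lines;
    }

    public static void main(String[] args) {
        List<String> log = syntheticLog();
        System.out.println(search(log, "error").size());              // broad query
        System.out.println(search(log, "user").size());               // broad query
        System.out.println(search(log, "Failed to add user").size()); // exact query
    }
}
```

In this synthetic log each broad query returns 5,001 lines while the exact phrase returns one, but the exact phrase is only available to someone who already knows what the error message says.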
Troubleshooting with Augmented Search
With Augmented Search, DevOps can start with a broad search over the time interval of interest (in this case, when the users started complaining) and then simply look at the augmentation layer to examine what happened during that period. As we can see in the screenshot above, out of 7,349 search results, Augmented Search highlighted several high-priority and medium-priority events (represented by the small rectangles on the timeline). We can immediately see the java.lang.NullPointerException and the Failed to add user message with minimal action.
The system surfaced these events through semantic analysis of the more than 7,000 events logged during the same time period. A troubleshooting process that would otherwise take hours can now take seconds.
Furthermore, DevOps can zoom in on the flagged events and receive additional information that helps them find the root cause and the impact (on servers, databases, etc.) of those events. This makes it easy for them to go back to the relevant dev team in order to develop a fix. This troubleshooting speed is essential in an Agile environment with continuous deployment of new code.