Ranking Incidents Using Document Similarity


A way to use big data analytics to improve the lives of IT helpdesk workers, saving time so they can help with bigger problems.



Resolving multiple incidents a day is routine for people who work on a help desk. To resolve an incident, teams rely on their own expertise, consult the resolution guide, or search through historical incidents. To make it easier to find historical incidents and their resolutions, we implemented a solution based on the concept of document similarity. This article gives an overview of that solution.


An expert, typically a person with vast experience and knowledge of how to resolve incidents raised by customers, is a boon to an enterprise. That same boon can turn into a bane. Experience is a boon when it resolves issues much faster than the average person could. It is a bane when it keeps people from changing roles and accepting more responsibility. It also means that the team runs into rough weather if the experienced person leaves the organization or moves to a different team.

Experience is particularly important for help desk teams, which need to go through user-logged incidents and resolve them as soon as possible. Most help desk teams have one or two experts who not only resolve the incidents assigned to them, but also help other team members resolve theirs. They can do this because they recall older incidents that were similar in nature and can drill down to those incidents to find the resolution.

As the number of incidents in the repository grows, it becomes difficult to scan through them and identify the one incident that is similar to the problem being resolved. To mitigate this challenge, most solutions allow users to search for relevant text. While this mechanism is quite popular, the number of results shown for a search can be quite large. Additionally, most of the time, the results are ordered by date, type, or a similar attribute rather than by how closely they match the problem. It would be far more useful to rank the results by relevance, so that the most similar incidents appear first. Ranking documents in this way can be done using the concept of ‘document similarity’.

Document Similarity

Document similarity, alternately called semantic similarity, is a method commonly used in the field of text analytics. It is typically used to find out how similar two documents are to each other. As per Wikipedia, semantic similarity is a metric defined over a set of documents or terms, where the idea of distance between them is based on the likeness of their meaning or semantic content as opposed to similarity which can be estimated regarding their syntactical representation (e.g. their string format). In simple terms, and as a practical approximation, two documents are considered similar to each other if they contain largely the same set of words. Popular techniques for measuring similarity include Cosine Similarity, the Jaccard Index, TF-IDF weighting, BM25, and locality-sensitive hashing, to name a few.
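
To make the "same set of words" idea concrete, here is a minimal Python sketch of the Jaccard Index, one of the measures listed above. It is an illustration only and not part of the solution described in this article; the function name and the sample sentences are made up.

```python
def jaccard_index(doc_a: str, doc_b: str) -> float:
    """Share of distinct words that two documents have in common."""
    words_a = set(doc_a.lower().split())
    words_b = set(doc_b.lower().split())
    if not words_a and not words_b:
        return 0.0  # convention chosen here for two empty documents
    # Size of the intersection divided by the size of the union.
    return len(words_a & words_b) / len(words_a | words_b)


# Two of the four distinct words are shared, so this prints 0.5.
print(jaccard_index("printer not responding", "printer not printing"))
```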

Cosine Similarity

Cosine Similarity is a measure of similarity between two non-zero vectors of an inner product space: it is the cosine of the angle between them. In simple terms, each document is turned into a vector of word counts, and the cosine of the angle between the two vectors indicates how closely the documents match. A value of 1 means the documents use the same words in the same proportions, and a value of 0 means they share no words.
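
For two word-count vectors A and B, the measure is cos θ = (A · B) / (|A| |B|). The following Python sketch shows one way to compute it. It assumes plain whitespace tokenization and raw word counts; the function name is illustrative and is not taken from the solution described in this article.

```python
import math
from collections import Counter


def cosine_similarity(doc_a: str, doc_b: str) -> float:
    """Cosine of the angle between the word-count vectors of two documents."""
    counts_a = Counter(doc_a.lower().split())
    counts_b = Counter(doc_b.lower().split())
    # Dot product: only words present in both documents contribute.
    shared = counts_a.keys() & counts_b.keys()
    dot = sum(counts_a[word] * counts_b[word] for word in shared)
    # Euclidean length of each word-count vector.
    norm_a = math.sqrt(sum(c * c for c in counts_a.values()))
    norm_b = math.sqrt(sum(c * c for c in counts_b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0  # cosine is undefined for an empty document; treated as no match
    return dot / (norm_a * norm_b)


# Two of three words are shared: 2 / (sqrt(3) * sqrt(3)), roughly 0.67.
print(cosine_similarity("vpn connection fails", "vpn connection drops"))
```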

Description of Solution Implemented

The solution we implemented is described in this section.

As the incident management system was implemented using a popular product, it was not possible for us to modify it. We decided to implement the solution using the ELK (Elasticsearch, Logstash, Kibana) trio of tools. To create the Elasticsearch index, we exported the incident information into a CSV (Comma Separated Values) file, which was then imported into an Elasticsearch index using Logstash. A simple frontend for the help desk personnel was developed using JSP (JavaServer Pages) technology, hosted on an Apache Tomcat instance. The web application allows teams to enter a search term, which it sends to Elasticsearch as a search query. The web application collects all the records returned and ranks them using the Cosine Similarity method. The ranked incidents are arranged in descending order of match percentage.
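
The ranking step can be sketched as follows. This is a Python illustration under stated assumptions, not the JSP code used in the solution: the Elasticsearch call is left out, a hard-coded list stands in for the records it returns, and the field names `id`, `description`, and `match_pct` as well as the sample incidents are hypothetical.

```python
import math
from collections import Counter


def cosine(text_a: str, text_b: str) -> float:
    """Cosine similarity of the word-count vectors of two texts."""
    a, b = Counter(text_a.lower().split()), Counter(text_b.lower().split())
    dot = sum(a[w] * b[w] for w in a.keys() & b.keys())
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(sum(v * v for v in b.values()))
    return dot / norm if norm else 0.0


def rank_incidents(search_text: str, candidates: list[dict]) -> list[dict]:
    """Attach a match percentage to each candidate and sort best match first."""
    scored = [
        {**incident, "match_pct": round(100 * cosine(search_text, incident["description"]), 1)}
        for incident in candidates
    ]
    return sorted(scored, key=lambda incident: incident["match_pct"], reverse=True)


# Stand-in for the records returned by the search query (made-up data).
candidates = [
    {"id": "INC-1", "description": "email client crashes on startup"},
    {"id": "INC-2", "description": "vpn connection drops every hour"},
    {"id": "INC-3", "description": "vpn connection fails after password reset"},
]
for incident in rank_incidents("vpn connection fails", candidates):
    print(incident["id"], incident["match_pct"])
```

The incident that shares the most words with the search text, relative to its length, is listed first, and an incident with no words in common receives a match percentage of 0.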

It is important to note that a lot of pre-processing needs to take place before applying any of the similarity methods. For example, we need to remove punctuation marks from the text and convert the documents to the same case before comparing them. Similarly, removing commonly occurring articles like ‘a’ and ‘the’, along with other stop words, improves the chances of finding a matching document in the repository.
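
The pre-processing steps mentioned above could look like this in Python. This is a minimal sketch: the stop-word list is a tiny illustrative sample rather than the list used in the solution, and the function name is made up.

```python
import string

# Tiny illustrative stop-word list; a real one would be much longer.
STOP_WORDS = {"a", "an", "the", "is", "on", "to", "of", "and", "in"}

# Translation table that maps every ASCII punctuation mark to a space.
PUNCT_TO_SPACE = str.maketrans(string.punctuation, " " * len(string.punctuation))


def preprocess(text: str) -> list[str]:
    """Lower-case the text, strip punctuation, and drop stop words."""
    cleaned = text.lower().translate(PUNCT_TO_SPACE)
    return [word for word in cleaned.split() if word not in STOP_WORDS]


print(preprocess("The printer is NOT responding, on the 3rd floor!"))
# ['printer', 'not', 'responding', '3rd', 'floor']
```

Words such as ‘not’ are deliberately kept out of the sample stop-word list, because dropping them can change the meaning of an incident description.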


Though the creation of a new system adds to the headache of maintenance, we have reaped benefits from implementing this solution. The solution, developed using the ELK stack of tools and fronted by a simple web application, allows teams to search for similar incidents in the repository. After retrieving the matching incidents, based on the search string provided by the user, the solution applies the Cosine Similarity method to assign a ‘match percentage’ value to each incident. Incidents that fall below a threshold are pruned, and the remaining incidents are displayed in descending order of match percentage.
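
The pruning and ordering step can be sketched in a few lines of Python. The threshold of 30 percent is an arbitrary value chosen for illustration, since the article does not state the one used, and the function name and sample scores are hypothetical.

```python
def prune_and_order(scored: list[tuple[str, float]],
                    threshold_pct: float = 30.0) -> list[tuple[str, float]]:
    """Drop incidents below the threshold and sort the rest best match first."""
    kept = [(incident_id, pct) for incident_id, pct in scored if pct >= threshold_pct]
    return sorted(kept, key=lambda pair: pair[1], reverse=True)


# (incident id, match percentage) pairs with made-up scores.
scored = [("INC-1", 12.5), ("INC-2", 51.6), ("INC-3", 70.7)]
print(prune_and_order(scored))
# [('INC-3', 70.7), ('INC-2', 51.6)]
```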

Given the importance of automation in today’s world, it is worth noting that the solution described can easily be automated. Text from all the incidents reported during the day (or at set time intervals) can be collected and ranked against the repository. The solution would then generate a report that lists all the incidents that need resolution, each with a list of matching historical incidents.
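
Such a report could be assembled roughly as shown below. This is a hypothetical Python sketch, not something the article's solution is known to contain: the matcher is passed in as a function, and a simple shared-keyword matcher with made-up data stands in for the search and ranking steps.

```python
from typing import Callable


def build_report(new_incidents: dict[str, str],
                 find_matches: Callable[[str], list[str]]) -> dict[str, list[str]]:
    """Map each open incident id to the ids of its matching historical incidents."""
    return {incident_id: find_matches(text) for incident_id, text in new_incidents.items()}


# Made-up historical incidents and a toy matcher based on shared keywords.
history = {"INC-2": "vpn connection drops", "INC-7": "printer offline"}


def keyword_matcher(text: str) -> list[str]:
    words = set(text.lower().split())
    return [hist_id for hist_id, hist_text in history.items() if words & set(hist_text.split())]


print(build_report({"NEW-1": "vpn fails", "NEW-2": "printer jammed"}, keyword_matcher))
# {'NEW-1': ['INC-2'], 'NEW-2': ['INC-7']}
```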


