10 Things to Consider When Parsing with Logstash
After spending the last couple of weeks using the ELK (Elasticsearch, Logstash, Kibana) stack to process application logs, I have collated the following points that need to be considered while developing Logstash scripts.
- 'sincedb_path' parameter of 'file' plugin
When using the 'file' plugin in the 'input' section of the Logstash script, always specify the 'sincedb_path' parameter. If it is not specified, Logstash creates a 'sincedb' file for each input file, in which it stores the position of the last byte read from that file. On the next execution for the same input file, Logstash resumes reading from the position recorded in this file. If we try to process the same file again, the read pointer goes to the end of the file and waits for new data (which never arrives). To the user, Logstash appears unresponsive, even though the input file contains data.
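A minimal sketch of such an input section (the log and sincedb paths are illustrative, not prescribed):

```
input {
  file {
    path => "/var/log/app/app.log"
    # Keep the read-offset record in a known, easy-to-find location
    sincedb_path => "/var/tmp/app-log.sincedb"
  }
}
```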
- Delete the file mentioned in 'sincedb_path' before each execution
In development and test environments we typically run the parsing script multiple times over the same data set before the script is finalized. If the input file has not changed between runs, Logstash does not respond, because of the byte offset it stored during the previous run. We therefore need to make Logstash forget the offset for the input file, which can be done by deleting the 'sincedb' file before each execution (indicating the file has not been read before).
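On Unix-like systems, an alternative to deleting the file manually before each run is to point 'sincedb_path' at '/dev/null', so the offset is never persisted. A sketch, with an illustrative log path:

```
input {
  file {
    path => "/var/log/app/app.log"
    # Offsets written here are discarded, so every run re-reads the file
    # (Unix/Linux only; on Windows, delete the sincedb file instead)
    sincedb_path => "/dev/null"
  }
}
```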
- Read file from the beginning
In the development and test environments we end up processing the same file multiple times, without making changes to the file. To force Logstash to process the file from the start, we need to set the 'start_position' parameter of the 'file' plugin to 'beginning'.
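Combined with the previous point, a typical development-time input section might look like this (paths are illustrative):

```
input {
  file {
    path => "/var/log/app/app.log"
    # Read from the first byte instead of tailing from the end
    start_position => "beginning"
    sincedb_path => "/var/tmp/app-log.sincedb"
  }
}
```

Note that 'start_position' only applies to files Logstash has not seen before; if an offset is already recorded in the sincedb file, it still takes precedence.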
- Handling log data spread across multiple physical lines
To process log data that spans multiple related physical lines (e.g. the lines of a Java stack trace), use the 'multiline' plugin. This plugin merges the related lines from the input into a single event that can then be processed using a suitable 'grok' plugin. In other words, the 'multiline' plugin lets Logstash put all of them into a single document.
- Placement of 'multiline'
The 'multiline' plugin should be placed in the 'filter' section rather than the 'input' section (where it is treated as a 'codec'). In the filter section it works across multiple forms of input: file, TCP/IP, stdin, etc. When 'multiline' is placed in the 'input' section and the input method is 'tcp', I have observed that line breaks are not recognized properly and the input text is broken at arbitrary places. For example, if the input line contains text such as 'java.lang.OutOfMemoryError: PermGen space', the 'tcp' plugin can return fragments such as 'java.' or 'java.lang.OutOf'.
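A sketch of the 'multiline' filter for stack traces, assuming continuation lines start with whitespace (the pattern is an assumption and must match your log format):

```
filter {
  multiline {
    # Lines beginning with whitespace (e.g. "\tat com.example...")
    # are continuations of the previous line's event
    pattern => "^\s"
    what => "previous"
  }
}
```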
- Regular expression in the 'grok' plugin
Ensure that the regular expression used by a 'grok' plugin that follows a 'multiline' plugin matches the final output of 'multiline'. In other words, the 'grok' pattern that converts the text into a document must match the complete merged text generated as output by the 'multiline' plugin.
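In particular, the merged text still contains the embedded newlines, which patterns such as GREEDYDATA do not cross by default. A hedged sketch (the log layout and field names are assumptions):

```
filter {
  grok {
    # (?m) makes '.' match across the newlines left in the merged event,
    # so GREEDYDATA can capture the whole stack trace
    match => { "message" => "(?m)%{TIMESTAMP_ISO8601:logTimestamp} %{LOGLEVEL:level} %{GREEDYDATA:stacktrace}" }
  }
}
```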
- Use multiple parsing expressions ('grok' statements)
If the log format is not the same across each line of input, we need to define multiple parsing expressions. Using a single, generic expression that reads all the input into one field is not advisable, as it defeats the purpose of parsing: we would simply replicate the input file, as a single field, into a document structure, and the business logic that processes this data would then have to split it into its constituent parts, increasing its complexity. For simplicity and ease of use, we should instead use Logstash to split the input data into its constituent parts and store the results in the relevant fields of the document. By parsing data into appropriate fields, we make the task of the business logic easier, and queries formulated to fetch documents from the store can be more precise and will generate accurate results.
- Sequence of 'grok' statements
When using multiple 'grok' statements, ensure that they are placed in the proper sequence. Place the most specific 'grok' statement at the beginning of the 'filter' section and order the statements such that the most generic 'grok' statement will be the last. The most generic statement will act as a 'catch all' statement in order to ensure that we do not skip any data from the input.
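One way to express this ordering is to give a single 'grok' filter an array of patterns, which are tried in order until one matches ('break_on_match' defaults to true). The patterns and field names below are illustrative:

```
filter {
  grok {
    match => { "message" => [
      # Most specific first: timestamp, level, class and message
      "%{TIMESTAMP_ISO8601:logTimestamp} %{LOGLEVEL:level} %{JAVACLASS:class}: %{GREEDYDATA:msg}",
      # Less specific: timestamp, level and message only
      "%{TIMESTAMP_ISO8601:logTimestamp} %{LOGLEVEL:level} %{GREEDYDATA:msg}",
      # Catch-all, last: ensures no input line is skipped
      "%{GREEDYDATA:rawMessage}"
    ] }
  }
}
```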
- Split input into relevant fields
Use Logstash to parse the input into data fields (document attributes) that are as precise as possible. Be as granular as possible. By splitting data into relevant fields, we make the tasks of querying, computation and visualization much easier. For example, in one of my use cases, the log file contained the text 'Business Unit = 234' for each error that was logged. Given that this information was in the log file, I parsed the value '234' into a field named 'businessUnit' of the resulting document (with field type as integer). This allowed me to create a monthly pie chart (in Kibana) that showed the proportion of errors generated by the various business units.
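A sketch of such an extraction, based on the 'Business Unit = 234' example above (the surrounding pattern is an assumption about the log layout):

```
filter {
  grok {
    # The trailing ':int' casts the captured value to an integer
    match => { "message" => "Business Unit = %{INT:businessUnit:int}" }
  }
}
```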
- Convert fields into proper formats
By default, Logstash parses every input field as a string. It is a good idea to convert the parsed data into their respective types, making querying, business logic implementation and visualization easier. For example, instead of retaining a value such as '2014-04-01 12:40:00' as a string, we should convert it into a date. By doing so, we can use date-based 'filter' queries in Elasticsearch and restrict the scope of the queries to the specified date range. String data can be converted into an integer using the 'mutate' plugin, and into an equivalent date using the 'date' plugin. Note that a date format, locale and time zone need to be specified during date conversion.
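A sketch of both conversions, assuming fields named 'businessUnit' and 'logTimestamp' were extracted earlier (the format, locale and time zone values are illustrative):

```
filter {
  mutate {
    # Cast the string "234" to the integer 234
    convert => { "businessUnit" => "integer" }
  }
  date {
    # Parse "2014-04-01 12:40:00" into a proper date value
    match => [ "logTimestamp", "yyyy-MM-dd HH:mm:ss" ]
    locale => "en"
    timezone => "UTC"
  }
}
```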
- Replace newline, linefeed and tab characters
When merging multiple physical lines from the input file, the 'multiline' plugin retains the line separators ('\n' on Unix/Linux systems and '\r\n' on Windows) in the merged text. If the input contains the '\t' character, it too is retained. These characters can create problems for the business logic or for queries. Hence you may want to either remove these characters or at least replace them with other easily identifiable placeholders. To replace such characters, we need to add a 'mutate' plugin immediately after the 'multiline' plugin.
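A sketch of the replacement using the 'gsub' option of 'mutate' (the choice of a space as the placeholder is an assumption; any marker will do):

```
filter {
  mutate {
    # Each triple is: field name, pattern to replace, replacement text
    gsub => [
      "message", "\r\n", " ",
      "message", "\n",   " ",
      "message", "\t",   " "
    ]
  }
}
```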
Opinions expressed by DZone contributors are their own.