Refcard #290

Getting Started With Log Management

The reality of modern application design means that when an unexpected issue occurs, the ability to find the root cause can be difficult. This is where the concept of centralized log management can provide a great deal of assistance. This Refcard teaches you the basic flow of a log management process, provides a comprehensive checklist of questions to consider when evaluating log management solutions, advises you on what you should and should not log, and covers advanced functionality for log management.

Brought to You By

Hydrolix
Written By

John Vester
Senior Staff Engineer, Marqeta
Table of Contents
► Introduction ► Who Uses Centralized Log Management? ► Log Management: The Basic Process and Techniques ► How to Get Started With Log Management ► Conclusion
Section 1

Introduction

As application architectures have matured, development has shifted toward specialized, distributed technologies. Microservices, modern JavaScript frameworks, containerized deployments, cloud-native infrastructure, and Infrastructure as Code practices have introduced a far more complex and decentralized landscape for generating log events and operational telemetry.

Each of these components can log events as it participates in application service delivery. Unfortunately, the format and structure of these logs vary from one system to the next. Without a centralized log management (CLM) solution, those who rely on these logs to support and maintain applications are at a disadvantage that can affect the bottom line.

This is where centralized log management can provide a great deal of assistance. As Figure 1 illustrates, a CLM solution ingests log events from all components of an application and provides a single source to review and analyze.

Figure 1: Centralized log management

When the logs are combined and organized properly, the engineer assigned to the situation can easily walk through the event without having to toggle between log files contained within disparate and proprietary systems.

In the last few years, new challenges have emerged as priorities within log management:

  • The need to analyze and debug faster
  • The need to move beyond index-heavy architectures that drive up cost and query latency at scale
  • The ability to adopt appropriate storage tiers to reduce periodic costs

Preferred log management solutions should account for all these needs, either directly or via some form of integration. Additionally, forward-thinking solutions will consider features such as:

  • Recurring pattern identification
  • Machine learning and crowdsourcing support
  • Anomaly detection
  • Data visualization
Section 2

Who Uses Centralized Log Management?

DevOps

When an unexpected situation emerges, a DevOps engineer often has a collection of information to analyze. Source logs from all the components in the application landscape are buried deep within a file system, most likely using proprietary logging formats. Even when all the logs can be gathered, stepping through each log chronologically to determine the cause of the issue is a tedious task at best.

By contrast, if a CLM system is in place, the logs are not only consolidated but also standardized and organized in a manner that will allow the DevOps engineer to play back the scenario that is under investigation. In doing so, the wider view provides a far greater opportunity to identify the root cause.

In most cases, justification for centralized log management is met by this benefit to DevOps staff. However, four other areas can become beneficiaries of a CLM solution: security, compliance, IT operations, and software engineering.

Security

Security teams can benefit from a CLM solution when scanning for unauthorized access to a given application or service, whether the source is an anonymous external entity or an internal account. Reports can be created within the CLM system to match logged events that may indicate suspect activity. Consider the following errors logged by an API that the solution manages:

Shell
 
2022-02-14 03:07:15.824 ERROR 36209 --- [main] AccessServiceImpl   : User id (someUserId) does not have access to perform this operation

2022-02-14 05:27:07.212 ERROR 36209 --- [main] OrderServiceImpl   : User id (null) does not have access to review order (1001001)

2022-02-14 10:17:37.542 ERROR 36209 --- [main] AccessServiceImpl   : Attempt to access API without a proper token on IP 127.0.0.1


Once enriched and ingested into the CLM solution, the report can be set up to keep track of error type (36209), showing the date/time information, plus the message that was logged.
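
As a rough illustration of such a report, the sketch below parses lines in the layout used by the three sample errors and keeps the ERROR events whose message contains a suspect phrase. The regular expression, the phrase list, and the function name are assumptions made for this example; a real CLM product would express the same idea through its own query or report builder.

```python
import re

# Assumed layout, inferred from the sample lines above:
# "<date> <time> <LEVEL> <id> --- [<thread>] <logger> : <message>"
LINE_RE = re.compile(
    r"^(?P<ts>\d{4}-\d{2}-\d{2} \d{2}:\d{2}:\d{2}\.\d{3}) "
    r"(?P<level>[A-Z]+) (?P<id>\d+) --- \[(?P<thread>[^\]]+)\] "
    r"(?P<logger>\S+)\s*: (?P<message>.*)$"
)

# Hypothetical phrases a security team might treat as suspect.
SUSPECT_PHRASES = ("does not have access", "without a proper token")


def suspect_events(lines):
    """Return (timestamp, logger, message) for ERROR lines with a suspect phrase."""
    report = []
    for line in lines:
        match = LINE_RE.match(line.strip())
        if not match or match["level"] != "ERROR":
            continue  # skip unparseable lines and lower severities
        if any(phrase in match["message"] for phrase in SUSPECT_PHRASES):
            report.append((match["ts"], match["logger"], match["message"]))
    return report
```

Each returned tuple carries the date/time information, the originating class, and the logged message, which is the minimum a reviewer needs to follow up on an event.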

Compliance

As a part of periodic compliance and audit tasks, a CLM solution can help verify that the application and its users are following the expectations that have been established. For applications required to comply with a regulatory guideline (SOX, FDA, HIPAA, etc.), a centralized logging solution can provide a one-stop source for analysis and certification.

IT Operations

Monitoring and understanding the complexities of an IT infrastructure can become less intrusive when a CLM solution is adopted. IT operations staff can use the tool to gain an understanding of how systems interact with each other, thus providing a tool to help make decisions when routine tasks (like system maintenance outages) are scheduled.

Software Engineering

Software engineers who focus on building features, services, or integrations can benefit from centralized log management, regardless of the frameworks being employed. This is because the log events needed for analysis and debugging during the development phase are sourced by frameworks, services, and containers called from proprietary services.

In a non-centralized model, the time required to locate and analyze independent logs quickly impacts the engineer’s time to focus on delivering functionality planned for their current iteration. This time can lead to delivery delays, impacting the features and functionality requested by the business sponsor.

Section 3

Log Management: The Basic Process and Techniques

A centralized log management solution follows a flow like the one outlined in Figure 2:

Figure 2: Example log management workflow

Collect

Collect is the process of establishing a connection to a source system and ingesting the logs as they are natively created. A determination can be made regarding the log levels that are routed to the CLM solution, if necessary.
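
The log-level determination can be pictured as a small routing rule applied at the collector. The sketch below is a minimal illustration only: the level names, their ranking, and the `should_forward` function are hypothetical, and actual collectors expose this as configuration rather than code.

```python
# Hypothetical severity ranking; real collectors define their own level names.
LEVEL_RANK = {"DEBUG": 10, "INFO": 20, "WARN": 30, "ERROR": 40}


def should_forward(level, minimum="WARN"):
    """Return True when an event's level meets the configured minimum."""
    rank = LEVEL_RANK.get(level.upper())
    # Unknown levels are forwarded so unexpected events are not silently dropped.
    return rank is None or rank >= LEVEL_RANK[minimum]
```

Forwarding unknown levels is a deliberate choice in this sketch: it trades a little extra volume for the assurance that a misconfigured source does not disappear from view.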

Parse

Parse provides the ability to transform source log messages into a format that is standardized within the CLM solution. This is an important aspect since logs produced by an API using an “extended” Apache format will be different from a log from a database server (as an example).

Correctly parsed, the following log events:

Shell
 
1550149377 INFO Userid (someUserId) successfully logged in from IP 127.0.0.1
02/14/2022 03:07:15.824 ERROR 36209 --- [main] AccessServiceImpl   : User id (someUserId) does not have access to perform this operation


Could be updated to appear as shown below:

Shell
 
2019-02-14 13:02:57.000 (GMT) INFO Userid (someUserId) successfully logged in from IP 127.0.0.1
2022-02-14 08:07:15.824 (GMT) ERROR 36209 – AccessServiceImpl – User id (someUserId) does not have access to perform this operation


The resulting parsed messages provide a standardized appearance for all messages, allowing the analyst to process the results of the logs more efficiently.
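
A minimal sketch of the timestamp part of this step follows. It assumes two input styles, epoch seconds and `MM/DD/YYYY HH:MM:SS.mmm`, and it assumes the second source writes local time at UTC-5, since the sample message carries no offset of its own. The function names and the offset constant are illustrative; the sketch does not restructure the rest of the message.

```python
from datetime import datetime, timedelta, timezone

# Assumption: the second source system writes local time at UTC-5 with no
# offset in the message, so the offset must be supplied by configuration.
ASSUMED_SOURCE_TZ = timezone(timedelta(hours=-5))


def to_utc_string(moment):
    """Format an aware datetime as 'YYYY-MM-DD HH:MM:SS.mmm (GMT)'."""
    utc = moment.astimezone(timezone.utc)
    return utc.strftime("%Y-%m-%d %H:%M:%S.") + f"{utc.microsecond // 1000:03d} (GMT)"


def normalize(line):
    """Rewrite the leading timestamp of a log line into the standard form."""
    first, rest = line.split(" ", 1)
    if first.isdigit():
        # Epoch seconds already identify an absolute point in time.
        moment = datetime.fromtimestamp(int(first), tz=timezone.utc)
        return f"{to_utc_string(moment)} {rest}"
    # Otherwise expect 'MM/DD/YYYY HH:MM:SS.mmm' in the assumed source zone.
    clock, rest = rest.split(" ", 1)
    moment = datetime.strptime(f"{first} {clock}", "%m/%d/%Y %H:%M:%S.%f")
    return f"{to_utc_string(moment.replace(tzinfo=ASSUMED_SOURCE_TZ))} {rest}"
```

With these assumptions, the epoch value 1550149377 normalizes to 2019-02-14 13:02:57.000 (GMT), and 02/14/2022 03:07:15.824 normalizes to 2022-02-14 08:07:15.824 (GMT). Making the source time zone an explicit setting matters because a message without an offset cannot be converted reliably on its own.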

Enrich

Enrich introduces the ability to further define the log event. As an example, enrichment functionality could perform the necessary logic to analyze a logged IP address to make it easier for the analyst to understand the system or service that is being referenced. Application or service-specific constants can be transcribed here as well to limit the need to cross-reference logged information.

The following event:

Shell
 
1550149377 ERROR 90215 – ServiceProviderCallOut – Could not access service on 127.0.1.1 due to an internal error code 10017.


Could be enriched as shown below:

Shell
 
2019-02-14 13:02:57.000 (GMT) – ERROR 90215 – ServiceProviderCallOut – Could not access DocumentGenerationProcessor 4C (127.0.1.1) due to an internal Socket Failure (error code 10017).


With the above enrichment, the timestamp was converted to a GMT date, the IP address was translated to include the host system, and the error code was looked up to return additional information.
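
The two lookups described above can be sketched with a pair of dictionaries. Everything named here (`HOSTS`, `ERROR_CODES`, `enrich`) is hypothetical, and the tables are hard-coded only to keep the example self-contained; an actual deployment would more likely draw them from an inventory system or an error catalog.

```python
import re

# Hypothetical lookup tables, hard-coded only for illustration.
HOSTS = {"127.0.1.1": "DocumentGenerationProcessor 4C"}
ERROR_CODES = {"10017": "Socket Failure"}

IP_RE = re.compile(r"\b(?:\d{1,3}\.){3}\d{1,3}\b")
CODE_RE = re.compile(r"\ban internal error code (\d+)")


def enrich(message):
    """Annotate known IP addresses and error codes; leave unknown values as-is."""

    def host_for(match):
        ip = match.group(0)
        return f"{HOSTS[ip]} ({ip})" if ip in HOSTS else ip

    def label_for(match):
        code = match.group(1)
        if code not in ERROR_CODES:
            return match.group(0)
        return f"an internal {ERROR_CODES[code]} (error code {code})"

    return CODE_RE.sub(label_for, IP_RE.sub(host_for, message))
```

Unknown addresses and codes pass through unchanged, so the enrichment step never removes information that the analyst might still need.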

Store

Store persists the collected, parsed, and enriched logs into a data store utilized by the CLM solution. At this point, indexes and filters can be leveraged to provide greater insight into the native logs in the source system.

Alert

With the necessary information stored in the CLM solution, alerts can be configured to catch events before they escalate to a higher severity level. At this point, CLM solutions can often send events to other systems to provide instant notification and escalation.

Advanced log management solutions should take things a step further:

  • Real-time alerts put engineers at the front end of a situation by notifying them as soon as an alert surfaces.
  • Live tail log monitoring lets support teams watch logs as the source system writes them, avoiding the need to look at historical logs.
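
One common way to catch events before they escalate is a threshold rule over a sliding time window. The sketch below is an assumption-laden illustration, not a description of any product: the class name, its parameters, and the requirement that events arrive in time order are all choices made for the example.

```python
from collections import deque


class ThresholdAlert:
    """Fire when `limit` matching events arrive within `window_seconds`.

    Assumes events are observed in non-decreasing time order.
    """

    def __init__(self, limit, window_seconds):
        self.limit = limit
        self.window_seconds = window_seconds
        self._times = deque()

    def observe(self, epoch_seconds):
        """Record one matching event; return True if the alert should fire."""
        self._times.append(epoch_seconds)
        # Drop events that have aged out of the sliding window.
        while self._times and epoch_seconds - self._times[0] >= self.window_seconds:
            self._times.popleft()
        return len(self._times) >= self.limit
```

For example, `ThresholdAlert(limit=3, window_seconds=60)` stays quiet for two access errors in a minute and fires on the third, which is the point at which a notification to another system would be sent.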

Analyze

Using an interface into the CLM solution, an analyst can search, filter, and review all the events related to a given situation without the need to review logs directly from the source system.

Section 4

How to Get Started With Log Management

The time required to process and analyze logged information is important, especially when there is an urgent need to resolve an unexpected situation. As a result, a period of exploration should be taken to determine what information will and will not be logged. In the DevOps example noted earlier, the engineer needed to review the consolidated events for an application that encountered an incident. In most cases, the situation needs to be resolved as soon as possible.

This effort could easily include logs from the microservices, databases, client application frameworks, and the security layer. If the logs include aspects that are not pertinent for analysis, additional time will be required to review and discard this type of information.

Consider the following example of logs that are ingested from the authentication/authorization service participating in the centralized log management strategy:

Shell
 
1550149377 INFO Userid (someUserId) successfully logged in from IP 127.0.0.1
1550149382 INFO SSID for someUserId updated to reflect last login
1550149385 WARN Password for someUserId will expire in 26 hours
1550149415 INFO UserId (someUserId) granted access to SomeApplication via token #tokenGoesHere


While all of the information being logged is important, it might be best to filter out every event except the following message:

Shell
 
1550149377 INFO Userid (someUserId) successfully logged in from IP 127.0.0.1


In doing so, the number of log messages that would need to be reviewed during a crisis is minimized — especially in cases where hundreds (or thousands) of users are accessing the application.
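
This kind of filtering can be sketched as an allow-list applied before ingestion. The pattern list and function name below are hypothetical and exist only to illustrate the idea.

```python
import re

# Hypothetical allow-list: only login events are kept for this source.
KEEP_PATTERNS = [re.compile(r"successfully logged in from IP")]


def filter_events(lines):
    """Keep only the lines that match at least one allow-list pattern."""
    return [line for line in lines if any(p.search(line) for p in KEEP_PATTERNS)]
```

An allow-list is the stricter of the two obvious designs: new message types are excluded until someone decides they are worth ingesting, whereas a deny-list would let them through by default.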

Sensitive Information

Another aspect to consider is making sure that sensitive information (access tokens, database connection strings, encryption keys, account information, user information, etc.) is not stored in the CLM solution. In the log example above, the message containing #tokenGoesHere should be suppressed from ingestion into the CLM solution, since that token could be considered sensitive information. If the event is required, the message should be redacted so that only the following information is ingested:

Shell
 
1550149415 INFO UserId (someUserId) granted access to SomeApplication
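
A redaction rule for this case might look like the sketch below. It assumes the token always appears as a trailing "via token <value>" fragment, as in the sample message; the pattern and function name are illustrative, and other message shapes would need their own rules.

```python
import re

# Assumption: tokens appear as "via token <value>", as in the sample above.
TOKEN_RE = re.compile(r"\s+via token \S+")


def redact(line):
    """Remove the token fragment so the secret never reaches the CLM store."""
    return TOKEN_RE.sub("", line)
```

Redacting at the collector, before the event leaves the source system, keeps the token out of every downstream copy rather than only hiding it at query time.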


Establish Guidelines

The key is to establish guidelines that meet the needs of the entire user community that will utilize the solution. Think of this no differently than how any other application is architected — understanding the limits that are introduced by both not enough log events and too many log events. Once established, this information should be shared with teams who can create the events that are being captured by the CLM solution.

Centralized Log Management Checklist

When the decision is made to evaluate CLM solutions, the number of available offerings will certainly appear daunting. As a result, it is crucial to understand which features and functionality are important for your implementation. Below are some high-level aspects and questions to consider.

Zero to 60

  • What is involved in getting started?
  • How quickly can a new implementation span from setup and configuration to log analysis and debugging success?

While the time required to get up to full speed is not an exclusive metric, there is some merit in knowing if the product under review gives the ability to get started quickly without a great deal of setup and configuration.

Ease of Data Exploration

  • Can all types of users easily operate the system and locate data?  

Remember, the user is often more of a data explorer and not a data scientist. Any built-in reports or filters will lead to a better user experience when extracting results from the system.

Analyst Efficiency

  • How quickly does the system respond, including the ability to create complex searches or filters?
  • Once data is returned, how easy are the results to comprehend and utilize?

As noted above, time is often the driving component when trying to retrieve information from a CLM solution. Again, filters and reporting can help improve end-user efficiency.

Scalability 

  • How well does the solution work within your organization?
  • Can all systems function within one CLM solution or would multiple instances be required? How about five years from now?

It is important to understand how the CLM solution scales as more systems are introduced to the technology. Historically, some performance degradation was considered an acceptable tradeoff at high data volumes. But modern architectures no longer make you choose. Solutions that separate storage from compute can maintain consistent query performance regardless of data volume.

When evaluating scalability, ask vendors to demonstrate query response times not just at your current volume, but also at 10x and 100x that volume. The answer will tell you a great deal about the underlying architecture.

Storage

  • What are the storage expectations?
  • Where does the storage live?
  • What are the associated costs of a target implementation?

Storage architecture has evolved significantly. Traditional log management systems required tiered storage strategies, keeping recent data in fast, expensive storage and moving older data into slower, cheaper tiers. This created a practical tradeoff: Cost savings came at the expense of accessibility and speed. Modern solutions built on cloud-native object storage aim to eliminate the need for tiered storage by delivering subsecond query performance directly against inexpensive storage regardless of data age.

When evaluating storage, ask whether all of your data, recent and historical, is queryable at the same speed and cost. If the answer requires a tiering strategy, factor that into total cost of ownership and explore your alternatives.

Log Completeness

  • Does the information retained include everything necessary?
  • Is extra effort required to retrieve data from an additional source?

If you find yourself having to retrieve data from outside the CLM solution, there might be a gap in functionality with respect to your requirements.

Data Enrichment Functionality

  • Does enrichment functionality exist?
  • How easy is it to utilize and maintain?

It’s a good idea to review current log sources and understand more of the edge cases that could require data enrichment exceeding typical enrichment usage patterns.

Open/Closed Source

  • Does the solution utilize an open source approach?
  • How does this approach line up with other solutions your entity employs?

Log Collectors

  • How are the log collectors defined?
  • Are they proprietary or have they been created by third parties/vendors?
  • Can these log collectors be centrally managed by the solution?
  • Do they have to be independently managed at the log source level?

Typically, proprietary connectors lag behind those created by the third party or vendor itself. If the connectors can be managed and configured by the CLM solution, there is less need for configuration to be maintained on the log source.

Configuration as Code

  • Does the solution employ a “* as Code” approach, allowing the configuration of the CLM solution to be stored in a code repository?

If your organization is embracing the concept of “* as Code,” CLM solutions that adopt this philosophy will be able to build and configure instances programmatically.

Class-Specific Functionality

  • Does the solution contain features that help a particular class of user (e.g., DevOps, security, compliance, ITOps) obtain common results?
  • Do reports exist to locate regulatory exceptions (e.g., HIPAA)?

Having functionality built in to assist specific use cases will lessen the time required for such groups to get up to speed and recognize value in the CLM solution.

API Functionality

  • Is there a public API available for the solution?

Gaining familiarity with any underlying API for the CLM solution could further justify, or help extend, the value returned from implementation.

Anticipated Costs

  • How is the product licensed? 

Understanding the cost model will allow for product comparisons as time progresses and more components embrace centralized log management within your organization.

CLM vs. SIEM

  • Is the target solution a CLM solution, or is it a security information and event management (SIEM) product?
  • Do your current needs require one solution over the other… or perhaps both solutions?

The requirements approved for the product will be a guide to understand what type of solution should be considered. It is important, however, to understand the differences between CLM and SIEM.

ELK Stack

  • Does the underlying solution leverage the ELK Stack?
  • If not, what integrations exist to leverage these tools to improve the effectiveness and observability of the centralized logging solution?

Solutions like the ELK Stack have been a widely adopted foundation for centralized log management, and for many organizations, they remain a practical starting point. However, as log volumes grow, ELK-based architectures increasingly surface real constraints: Indexing overhead drives up storage costs, query performance degrades at high cardinality, and operational complexity grows with the data. Understanding where your current architecture hits those limits, and what alternatives exist, is an important part of any honest product evaluation.

During product evaluations, document the specific volume thresholds at which the solution was tested. Costs and performance characteristics at your current scale may differ significantly from what you will encounter as your log data grows.

Advanced Features for Log Management

While the centralized log management checklist section provided features and functionality that should be expected in an acceptable solution, below is a collection of advanced features for products that offer leading-edge functionality. These features are intended to leverage technology in order to enrich the log management experience.

Recurring Pattern Identification

Having an option to quickly see recurring patterns across all log data allows analysts to isolate issues and unexpected behavior. Implementing the ability to present a patterns perspective will alter the analyst’s view of a series of endless log entries to a categorized table, showing something like what is displayed below:

Shell
 
Count    Ratio   Pattern
-----    ------  ---------------------------
52,701   22.17%  License key has expired
19,457    7.99%  Invalid User ID
18,295    7.25%  New customer account created


In the example above, a large percentage of the messages match the pattern where a license key expired, which may not be easy to locate in a large volume of individual log entries.
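
One simple way to build such a table is to mask the variable parts of each message and count what remains. The sketch below is a deliberately naive stand-in for the pattern engines that products ship: the masking rules (parenthesized values and digit runs) and both function names are assumptions for illustration.

```python
import re
from collections import Counter


def to_pattern(message):
    """Mask parenthesized values and digit runs so similar messages group together."""
    masked = re.sub(r"\([^)]*\)", "(*)", message)
    return re.sub(r"\d+", "<n>", masked)


def pattern_table(messages):
    """Return (count, percent, pattern) rows, most frequent pattern first."""
    counts = Counter(to_pattern(m) for m in messages)
    total = sum(counts.values())
    return [
        (count, round(100 * count / total, 2), pattern)
        for pattern, count in counts.most_common()
    ]
```

Given three "License key <number> has expired" messages and one "Invalid User ID (<value>)" message, the function returns two rows with shares of 75.0 and 25.0 percent.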

Machine Learning and Crowdsourcing

Analysts often find themselves looking for the root cause of an unexpected issue, which can resemble a needle in a haystack. Rather than spending hours trying to narrow down the endless sea of log entries, the concept of machine learning and crowdsourcing support can minimize the time required to identify the root cause. Solutions that employ machine learning can help present key terms found in the centralized log events, along with the number of occurrences for each term. This enables the analyst to quickly reduce the number of log entries to process, thus making the haystack much smaller.
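
The "key terms with occurrence counts" view can be approximated with a plain frequency count, shown below. This is not machine learning; it is a simplified stand-in meant only to show the shape of the output, and the stop-word list and function name are hypothetical.

```python
import re
from collections import Counter

# Hypothetical stop words; a real system would use a far larger list.
STOP_WORDS = {"the", "to", "for", "a", "an", "of", "in", "on", "has", "is"}


def key_terms(messages, top=3):
    """Return the most frequent non-stop-word terms with their counts."""
    words = (w for m in messages for w in re.findall(r"[a-z]+", m.lower()))
    return Counter(w for w in words if w not in STOP_WORDS).most_common(top)
```

Selecting one of the returned terms and re-running the count on only the matching messages is the manual equivalent of shrinking the haystack.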

With a smaller collection of logs to process, some solutions provide additional information from sources outside the CLM service. For example, opening a given log entry provides the expected information related to the captured log data, but it goes a step further by linking crowdsourced data like:

  • Discussion threads related to the logged message or condition
  • Blog entries focused on how to correct the situation
  • Documentation for the method, function, or class being utilized
  • Additional information related to the error itself

Anomaly Detection

Issues that appear in the logs infrequently but carry significant diagnostic value are often referred to as anomalies. Advanced solutions should provide the ability to detect anomalies so they can be surfaced and addressed. Anomaly detection leverages machine learning to automatically isolate and investigate emergent problems based upon unusual behavior.

Once a baseline collection of fields has been specified and a query is established, the CLM service builds a model and begins to analyze and capture unusual behavior and events. Those detected events can be surfaced as new alerts in real time to provide instant access to emerging issues.
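
The baseline-then-detect idea can be illustrated with a z-score check, which is one of the simplest possible models and far less capable than what a product would build. The function name, the default threshold of three standard deviations, and the use of per-interval event counts as input are all assumptions for this sketch.

```python
from statistics import mean, pstdev


def find_anomalies(baseline, observed, threshold=3.0):
    """Return (index, value) pairs that sit far from the baseline mean.

    `baseline` and `observed` are assumed to be per-interval event counts.
    """
    center = mean(baseline)
    spread = pstdev(baseline)
    if spread == 0:
        # A perfectly flat baseline: any different value is unusual.
        return [(i, v) for i, v in enumerate(observed) if v != center]
    return [
        (i, v)
        for i, v in enumerate(observed)
        if abs(v - center) / spread > threshold
    ]
```

With a baseline that hovers around 10 errors per interval, an observed interval with 40 errors is returned as an anomaly, while intervals with 10 or 11 are not.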

Data Visualization

Visualizations enable analysts to view data in graphical format, which can allow them to better understand comparisons over time. By defining common elements for the x- and y-axes, data in the advanced CLM solution can be leveraged to communicate the state of the events being captured. These visualizations can be combined into a dashboard to show the overall status of the environment being monitored by the CLM service.

When Volume Changes Everything

Log management guidance often treats “high volume” as a relative term. But in practice, the architecture decisions that serve a team ingesting millions of events per day break down at billions, and break down further at hundreds of billions.

At lower volumes, index-based search, tiered storage, and periodic batch processing are workable approaches. As volumes scale toward enterprise-level traffic, those approaches introduce compounding costs and latency.

At that threshold, the architectural requirements shift: Streaming ingest replaces batch collection, storage and compute decouple, and query performance must hold regardless of how much data is stored. Understanding which threshold your organization is approaching, or has already crossed, is one of the most useful inputs to any log management platform decision.

Streaming vs. Batch Ingest

The collect step in most log management implementations has traditionally operated as a batch process: Logs are gathered from source systems on a schedule and pushed to the centralized solution periodically. For many use cases, this is sufficient. For teams operating in CDN, media delivery, security operations, or any environment where conditions change in seconds, batch ingest introduces a lag that can make the difference between catching an incident in progress and investigating it after the fact.

Streaming ingest architectures, often built on systems like Apache Kafka or similar message queues, allow log events to be ingested and made queryable in seconds. This enables live operational visibility rather than historical review. When evaluating log management solutions, consider whether your use case requires data to be actionable in seconds or minutes, and whether the solution’s ingest architecture can support that requirement without added infrastructure complexity.
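
The lag that batch collection introduces is easy to quantify. The sketch below assumes a collector that ships logs at fixed boundaries every `interval` seconds, with time measured from the first shipment cycle; the function name and that scheduling model are hypothetical simplifications.

```python
def batch_lag(event_time, interval):
    """Seconds an event waits before the next scheduled shipment.

    Assumes shipments occur at multiples of `interval` seconds.
    """
    # Ceiling division in integer arithmetic finds the next boundary.
    next_shipment = -(-event_time // interval) * interval
    return next_shipment - event_time
```

With a five-minute schedule, an event logged 61 seconds into a cycle waits 239 seconds before it can be queried, and one logged just after a shipment waits almost the full interval. A streaming pipeline replaces that wait with its own processing delay, which is the figure to ask vendors about.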

Section 5

Conclusion

The value of a CLM solution can be justified by several groups within an entity’s IT division. With proper guidelines and a successful implementation, the benefits from utilizing CLM can result in the ability to identify an event’s root cause and receive notification prior to issues escalating.

Implementation of a CLM solution should be treated no differently than any other tool being introduced into an organization. Requirements and understanding should drive what components are logged with the right level of logging ingested into the solution. Any sensitive information should be excluded from ingestion into the CLM solution, always adhering to any regulatory guidelines.

A major success factor for a CLM solution is the ability for an analyst to find the information needed in as little time as possible, without having to look outside the solution itself. CLM solutions that include pre-designed reports and functionality to assist all classes of users can provide a faster turnaround when performing routine requests.

Time should be taken to determine if a CLM solution is the right fit for your needs, or if your requirements fall into a security information and event management solution set. While some solutions may operate well as either, understanding your target end state is helpful in identifying the best solution for your organization.

As log volumes grow, the criteria for evaluating CLM solutions should evolve with them. Features matter, but architecture matters more at scale. Look for solutions that can ingest and query in real time, maintain performance as data volumes increase, and do so at a cost that does not force teams to sample, drop, or tier their data into inaccessible cold storage. Having advanced features available will not only equip analysts with a complete set of tooling to leverage, but also minimize the potential downtime from an unexpected situation.
