A Look Into Log Analysis and Effective Critical Alerting
While most DevOps teams believe in the importance of log analysis, they consider it akin to eating spinach — it’s good for you, but do you really have to do it? Yes!
Construction of the Great Wall of China began in the 7th century B.C. to protect the Chinese kingdom from Eurasian warriors. Chinese soldiers marshaled forces to defend the wall from enemy attack by using smoke signals to send alerts from tower to tower. This method of alerting enabled messages to reach garrisons hundreds of miles away in just a few hours. With these alerts, soldiers could prepare to convene and combat their enemies.
Today's IT teams need the same kind of early warning. How can they effectively analyze the vast amount of data their various systems produce? How can they use this data to troubleshoot issues when they arise? How can they use this data to prepare for the IT dangers they know of and those that are unforeseen? Furthermore, how can they make sure they are alerted when serious issues, and even dangers, arise?
Why Is Log Analysis Important for IT Teams?
While most developers and DevOps teams believe in the importance of log analysis, they consider it akin to eating spinach: it’s good for you, but do you really have to do it? Logs contain a lot of important information on how the system is behaving, but analyzing them is a lot of work. Avoiding this analysis, however, is dangerous. Without it, a company cannot recognize the threats and opportunities that lie before it.
Most companies run on multiple servers and have numerous devices producing logs that inform troubleshooting, monitoring, business intelligence, and SEO. Furthermore, as written in a previous article, IT infrastructure continues its move to public clouds such as Amazon Web Services, Microsoft Azure, and Google Cloud. As a result, isolating issues becomes more difficult, and because server usage in the cloud fluctuates with load, environment, and the number of active users, obtaining an accurate reading can be quite difficult.
Yet with centralized log analysis, you can normalize the data in one database and get a sense of what the system’s “normal state” looks like. Log analysis can provide insight into cloud-based services as well as local systems. It shows how the network looks when it is humming along. Once they know their baseline traffic, companies can recognize the outliers. What should our site traffic be like? Which error logs are normal and consistent with system traffic, and which are causes for alarm? Having answers to these questions enables engineers to make data-informed decisions.
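To make the idea of a baseline concrete, here is a minimal Python sketch. It assumes you already have per-minute request counts from a period you consider normal, and it flags new counts that sit far from that baseline. The sample numbers, the function names, and the three-standard-deviation threshold are illustrative assumptions, not values from any particular tool.

```python
from statistics import mean, stdev


def build_baseline(history):
    """Summarize a known-good window as (average, standard deviation)."""
    return mean(history), stdev(history)


def is_outlier(value, baseline, threshold=3.0):
    """Return True if value is more than `threshold` standard deviations from the average."""
    average, spread = baseline
    if spread == 0:
        # A perfectly flat history: anything different is unusual.
        return value != average
    return abs(value - average) / spread > threshold


# Hypothetical requests-per-minute counts from a quiet, normal period.
history = [120, 118, 125, 122, 119, 121, 123, 117, 124, 120]
baseline = build_baseline(history)

# Hypothetical new observations to compare against the baseline.
new_counts = [122, 119, 480, 121]
flagged = [count for count in new_counts if is_outlier(count, baseline)]
print(flagged)
```

A real deployment would likely account for daily and weekly traffic cycles, but even a simple comparison like this one separates routine fluctuation from a spike worth investigating.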
Furthermore, logs and log analysis can provide insight into many key points of information throughout a deployment. Analytics can be applied to system logs, web server logs, error logs, and app logs. Logs give us a way to see traffic, incidents, or events over time. When log analysis is part of healthy system monitoring, the seemingly impossible process of reading logs and responding to their information becomes possible. With log analysis in place, companies can optimize and debug system performance and pinpoint bottlenecks in the system.
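As a rough illustration of seeing events over time, the Python sketch below parses web server access log lines and counts server errors per minute. It assumes the logs use the Common Log Format; the sample lines, field names, and function names are made up for the example.

```python
import re
from collections import Counter
from datetime import datetime

# Matches the Common Log Format: host, identity, user, [time], "request", status, size.
LOG_PATTERN = re.compile(
    r'(?P<host>\S+) \S+ \S+ \[(?P<time>[^\]]+)\] '
    r'"(?P<method>\S+) (?P<path>\S+) \S+" (?P<status>\d{3}) (?P<size>\S+)'
)


def parse_line(line):
    """Turn one access log line into a dict, or return None if it does not match."""
    match = LOG_PATTERN.match(line)
    if match is None:
        return None
    event = match.groupdict()
    event["time"] = datetime.strptime(event["time"], "%d/%b/%Y:%H:%M:%S %z")
    event["status"] = int(event["status"])
    return event


def errors_per_minute(lines):
    """Count responses with a 5xx status, grouped by minute."""
    counts = Counter()
    for line in lines:
        event = parse_line(line)
        if event is not None and event["status"] >= 500:
            counts[event["time"].strftime("%Y-%m-%d %H:%M")] += 1
    return counts


# Hypothetical sample lines, including one that is not a valid log entry.
sample_lines = [
    '203.0.113.5 - - [10/Oct/2023:13:55:36 +0000] "GET /index.html HTTP/1.1" 200 2326',
    '203.0.113.9 - - [10/Oct/2023:13:55:41 +0000] "POST /checkout HTTP/1.1" 502 512',
    '203.0.113.9 - - [10/Oct/2023:13:55:59 +0000] "POST /checkout HTTP/1.1" 503 512',
    '203.0.113.7 - - [10/Oct/2023:13:56:02 +0000] "GET /cart HTTP/1.1" 500 128',
    'not a log line',
]
error_counts = errors_per_minute(sample_lines)
print(dict(error_counts))
```

Per-minute counts like these are the kind of series that can then be compared against a baseline or charted on a dashboard.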
Where Does ELK Come in?
Several software packages provide log analysis capabilities. Some large enterprises use packages such as Splunk and Sumo Logic, yet these can get quite expensive at scale. Instead, many in the DevOps community have moved towards the ELK (Elasticsearch, Logstash, and Kibana) stack for their log analysis. ELK components can be used separately, but when joined together, they give users the ability to run log analysis on top of open source software that everyone can run for free.
ELK has many advantages over competitors. It is open source, is easy to set up, and provides fast performance. Of additional value is the visibility it offers into the overall IT stack. When numerous servers are running multiple applications as well as virtual machines, you need a way to easily view and analyze problems. ELK provides this opportunity in a low-cost way that correlates metrics with logs.
Example of ELK Solutions
One of the biggest challenges of building an ELK deployment is making it scalable. After a new product deployment or upgrade, traffic and downloads to a site might skyrocket. Ensuring this influx doesn’t kill the system requires that all components of the ELK stack scale as well. Ideally, you would have a tool that combines these three components into a viable stack integrated into the cloud, so that scaling and security are taken care of. This is where a hosted ELK solution like Logz.io or Elastic Cloud steps in. Logz.io is built on top of Amazon Web Services (AWS) and enables this very type of scaling.
Additionally, when running a large environment, problems can originate from the network and cause an interruption in the application. Trying to correlate these issues can be very complicated and time-consuming. The ELK stack is useful in these cases because it provides a method to bring in data from multiple sources and create rich visualizations.
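The correlation idea can be sketched in a few lines of Python. The example assumes that events from different sources have already been collected with timestamps, and it pairs each application error with any network event that occurred shortly before it. The event fields, the sample data, and the 30-second window are hypothetical and chosen only for illustration; this is not how the ELK stack itself performs correlation.

```python
from datetime import datetime, timedelta


def correlate(app_errors, network_events, window=timedelta(seconds=30)):
    """Pair each application error with network events seen in the preceding window."""
    pairs = []
    for error in app_errors:
        related = [
            event for event in network_events
            if timedelta(0) <= error["time"] - event["time"] <= window
        ]
        if related:
            pairs.append((error, related))
    return pairs


# Hypothetical events gathered from two different log sources.
network_events = [
    {"time": datetime(2023, 10, 10, 13, 55, 30), "message": "packet loss on eth0"},
    {"time": datetime(2023, 10, 10, 14, 20, 0), "message": "link flap"},
]
app_errors = [
    {"time": datetime(2023, 10, 10, 13, 55, 41), "message": "upstream timeout"},
    {"time": datetime(2023, 10, 10, 13, 40, 0), "message": "null reference"},
]

matches = correlate(app_errors, network_events)
for error, related in matches:
    print(error["message"], "<-", [event["message"] for event in related])
```

Only the timeout that follows the packet loss by a few seconds is paired with a network event. The other application error has no nearby network cause, so it is left for separate investigation.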
Published at DZone with permission of Orlee Berlove, DZone MVB. See the original article here.