This guide is designed to help developers, DevOps engineers, and operations teams who run and manage applications on top of AWS to effectively analyze their log data and gain visibility into application layers, operating system layers, and different AWS services. It is a step-by-step guide to retrieving log data from all cloud layers and then visualizing and correlating these events to give a clear picture of your entire AWS infrastructure.
Why Should You Look at Your Logs?
Cloud applications are inherently distributed, built from a series of components that must operate together to deliver a service to the end user successfully. Analyzing logs becomes imperative in cloud environments because the practice allows the relevant teams to see how each building block of a cloud application behaves both independently and in correlation with the rest of the components.
Why ELK (Elasticsearch, Logstash, and Kibana)?
ELK is the most widely used log analytics platform in the world, adopted by companies including Netflix, LinkedIn, Facebook, Google, Microsoft, and Cisco. ELK is an open-source stack of three tools (Elasticsearch, Logstash, and Kibana) that parse, index, and visualize log data (and, yes, it’s free).
So instead of going through the challenging task of building a production-ready ELK stack internally, users can sign up for Logz.io and start working in a matter of minutes. In addition, Logz.io’s ELK as a service includes alerts, multi-user and role-based access, and unlimited scalability. On top of providing an enterprise-grade ELK platform as a service, Logz.io employs unique machine-learning algorithms to automatically surface critical log events before they impact operations, providing users with unprecedented operational visibility into their systems.
- How to Deploy ELK in Production
- Lessons Learned from Elasticsearch Cluster Disconnects
- How to Use ELK to Monitor Performance
- How to Integrate AWS CloudTrail and the ELK Stack
Analyzing Application Logs
Why Should I Analyze My Application Logs?
Application logs are fundamental to any troubleshooting process. This has always been true, even for mainframe applications and those that are not cloud-based. In the cloud, with the pace at which instances are spawned and decommissioned, the only way to troubleshoot an issue is to first aggregate all of the application logs from all of the layers of an application. This enables you to follow transactions across all layers within an application’s code.
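One technique that makes this kind of cross-layer correlation possible is stamping every log line with a shared transaction ID. Below is a minimal Python sketch; the JSON field names, logger name, and event messages are illustrative, not a standard:

```python
import json
import logging
import uuid

class JsonFormatter(logging.Formatter):
    """Render each log record as one JSON line, carrying a transaction ID."""
    def format(self, record):
        return json.dumps({
            "timestamp": self.formatTime(record),
            "level": record.levelname,
            "logger": record.name,
            "message": record.getMessage(),
            "transaction_id": getattr(record, "transaction_id", None),
        })

handler = logging.StreamHandler()
handler.setFormatter(JsonFormatter())
logger = logging.getLogger("webapp")
logger.addHandler(handler)
logger.setLevel(logging.INFO)

# Generate one ID per incoming request and attach it to every log call
# made while serving that request, in every layer of the application.
tx_id = str(uuid.uuid4())
logger.info("checkout started", extra={"transaction_id": tx_id})
logger.info("payment authorized", extra={"transaction_id": tx_id})
```

Once every layer emits the same `transaction_id` field, a single query on that field in the aggregated logs reconstructs the whole transaction.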
How Do I Ship Application Logs?
There are dozens of ways to ship application logs. The best method to use depends on the type of application, the format of the logs, and the operating system. For example, Java applications running on Linux servers can use Logstash or logstash-forwarder (a lightweight version that includes encryption), or ship logs directly from the application layer using a log4j appender over HTTP or HTTPS. You can read more in our essay on The 6 Must-Dos in Modern Log Management.
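As an illustration of the direct HTTP/HTTPS approach (the Java equivalent would be a log4j appender), Python’s standard library can ship records with `logging.handlers.HTTPHandler`. The host and path below are placeholders; a real deployment would point at your Logstash HTTP input or hosted-ELK listener:

```python
import logging
import logging.handlers

# Placeholder endpoint: substitute the host and path of your own
# Logstash HTTP input or hosted-ELK listener.
handler = logging.handlers.HTTPHandler(
    host="listener.example.com:8071",
    url="/logs",
    method="POST",
    secure=True,  # use HTTPS rather than plain HTTP
)

logger = logging.getLogger("myapp")
logger.setLevel(logging.INFO)
logger.addHandler(handler)

# Each call such as logger.info("order %s accepted", "12345") would now
# POST the record's fields (name, level, message, ...) to the endpoint.
```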
Analyzing Infrastructure Logs
What Are Infrastructure Logs?
We consider everything that is not the proprietary application code itself to be an infrastructure log. These include system logs, database logs, web server logs, network device logs, security device logs, and countless others.
Why Should I Analyze Infrastructure Logs?
Infrastructure logs can shed light on problems in the code that is running or supporting your application. Performance issues can be caused by overutilized or broken databases or web servers, so it is crucial to analyze those log files, especially when they are correlated with the application logs. While troubleshooting performance issues, we’ve seen many cases in which the root cause was a Linux kernel issue. Overlooking such low-level logs can make forensics processes long and fruitless. Read more about why it’s important to ship OS logs in our essay on Lessons Learned from Elasticsearch Cluster Disconnects.
How Do I Ship Infrastructure Logs?
Shipping infrastructure logs is usually done with open-source agents such as rsyslog, Logstash, logstash-forwarder, or NXLog that read the relevant operating system files such as access logs, kern.log, and database events. You can read about more methods to ship logs here.
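To give a sense of what the parsing stage of these shippers does, here is a simplified Python equivalent of a grok pattern for Apache’s common log format (a real shipper handles many more formats and edge cases):

```python
import re

# Simplified pattern for the Apache "common" log format:
# host ident authuser [timestamp] "request" status bytes
ACCESS_LOG = re.compile(
    r'(?P<host>\S+) \S+ (?P<user>\S+) '
    r'\[(?P<timestamp>[^\]]+)\] '
    r'"(?P<method>\S+) (?P<path>\S+) (?P<protocol>[^"]+)" '
    r'(?P<status>\d{3}) (?P<size>\d+|-)'
)

def parse_access_line(line):
    """Return the line's fields as a dict, or None if it doesn't match."""
    match = ACCESS_LOG.match(line)
    if not match:
        return None
    fields = match.groupdict()
    fields["status"] = int(fields["status"])
    fields["size"] = 0 if fields["size"] == "-" else int(fields["size"])
    return fields

line = '203.0.113.7 - frank [10/Oct/2023:13:55:36 +0000] "GET /index.html HTTP/1.1" 200 2326'
print(parse_access_line(line)["status"])  # 200
```

Structured fields like `status` and `path`, rather than raw text, are what make the logs searchable and chartable once they reach Elasticsearch.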
Monitoring System Performance with ELK
One of the challenges organizations face when troubleshooting performance issues is that they are looking at one dashboard that shows performance metrics and another to troubleshoot issues and analyze logs. In many cases, it’s possible to use a single dashboard that shows both the performance metrics and the visualized log data that is being generated by all of the components of your system. Performance issues are often related to events in application stacks that are recorded in log files. Collecting system performance metrics and shipping them as log entries then enables quick correlations between performance issues and their respective events in the logs.
How Do I Ship Performance Metrics?
To use ELK to monitor your platform’s performance, run probes on each host to collect system performance metrics. Operations teams can then visualize the data with Kibana and use the resulting charts to present their results.
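As a minimal sketch of such a probe, the following Python script samples load average and disk usage and prints them as a JSON log line that any of the shippers above can forward. It assumes a Unix host (`os.getloadavg` is Unix-only), and the field names, apart from `@timestamp`, are illustrative:

```python
import json
import os
import shutil
import time

def collect_metrics():
    """Sample a few host metrics and return them as a dict.

    os.getloadavg is Unix-only; on other platforms, swap in a
    platform-appropriate probe (or a library such as psutil).
    """
    load1, load5, load15 = os.getloadavg()
    disk = shutil.disk_usage("/")
    return {
        "@timestamp": time.strftime("%Y-%m-%dT%H:%M:%S%z"),
        "type": "performance-metric",
        "load_1m": load1,
        "load_5m": load5,
        "load_15m": load15,
        "disk_used_pct": round(100.0 * disk.used / disk.total, 1),
    }

# Emit one JSON log line per sample; run on a schedule (e.g. cron) and
# let a shipper forward the output into Elasticsearch, where Kibana can
# chart the metrics next to the application logs.
print(json.dumps(collect_metrics()))
```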
For example, we encapsulated Collectl in a Docker container to have a Docker image that covered all of our data collecting and shipping needs. Read more and get a download on our site: How to Use ELK to Monitor Platform Performance.
Be sure to catch us next time in Part II when we take a look at monitoring ELB logs. We will also cover AWS CloudTrail logs, AWS VPC Flow Logs, CloudFront logs, and S3 access logs.