So, you've decided to go with ELK to centralize and manage your logs.
The ELK Stack is now the world's most popular log analysis platform, with millions of downloads per month. The platform's open source foundation, scalability, speed, and high availability, as well as the growing community of users, are all excellent reasons for this decision. But before you go ahead and install Elasticsearch, Logstash, and Kibana, there is one crucial question that you need to answer: Are you going to run the stack on your own, or are you going to opt for a cloud-hosted solution?
Jumping to the conclusion of this article, it all boils down to time and money. When contemplating whether to invest the valuable resources at your disposal in doing ELK on your own, you must ask yourself if you have the resources to pull it off.
This article will break down the variables that need to be added into the equation.
These variables reflect what a production deployment of ELK needs to include based on the extensive experience of both our customers and ourselves while working with ELK. Also, these recommendations are based on the assumption that you are starting from scratch and require a scalable, highly available, and at least medium-sized ELK deployment.
Installation and Shipping
Installing ELK is usually hassle-free. Getting up and running with your first instances of Elasticsearch, Logstash, and Kibana is pretty straightforward, and there is plenty of documentation available if you encounter issues during installation (see our Elasticsearch tutorial, Logstash tutorial, and Kibana tutorial for help).
However, connecting the dots is not always error-free. Depending on whether you decided to install the stack on a local, cloud, or hybrid infrastructure, you may encounter various configuration and networking issues. Kibana not connecting with Elasticsearch, Kibana not being able to fetch mapping, and Logstash not running or not shipping data are all-too-frequent occurrences. (For more, see my prior post on troubleshooting five common ELK Stack glitches.)
Once you've resolved those issues, you need to establish a pipeline into the stack. This pipeline will greatly depend on the type of logs you want to ingest and the type of data source from which you are pulling the logs. You could be ingesting database logs, web server logs, or application logs. The logs could be coming in from a local instance, AWS, or Docker. Most likely, you will be pulling data from various sources. Configuring the integration and pipeline in Logstash can be complicated and extremely frustrating, and configuration errors can bring down your entire logging pipeline.
It's one thing to ship the logs into the stack. It's another thing entirely to have them actually mean something. When trying to analyze your data, you need the messages to be structured in a way that makes sense.
That is where parsing comes into the picture, beautifying the data and enhancing it to allow you to analyze the various fields constructing the log message more easily.
Fine-tuning Logstash to use a grok filter on your logs correctly is an art unto itself and can be extremely time-consuming. Take the timestamp format, for example. Just search for "Logstash timestamp" on Google, and you will quickly be drowned in thousands of Stack Overflow questions from people who are having issues with log parsing because of bad grokking.
Also, logs are dynamic. Over time, they change in format and require periodic configuration adjustments. This all translates into hours of work and money.
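To make the parsing problem concrete, here is a minimal sketch in Python of what a grok-style pattern does to an Apache-style access log line. It is not Logstash's grok implementation: the regular expression, the field names, and the sample line are all assumptions chosen for illustration. Note how much of the work goes into the timestamp alone.

```python
import re
from datetime import datetime

# Illustrative pattern for an Apache common log line.
# The field names are hypothetical, not Logstash's built-in grok names.
LINE_RE = re.compile(
    r'(?P<client>\S+) \S+ \S+ \[(?P<timestamp>[^\]]+)\] '
    r'"(?P<verb>\S+) (?P<request>\S+) \S+" (?P<status>\d{3}) (?P<bytes>\d+|-)'
)

def parse_line(line):
    """Return a dict of typed fields, or None if the line does not match."""
    match = LINE_RE.match(line)
    if match is None:
        return None  # the analogue of a grok parse failure
    fields = match.groupdict()
    # The usual trouble spot: day/month-name/year, time, numeric UTC offset.
    fields["timestamp"] = datetime.strptime(
        fields["timestamp"], "%d/%b/%Y:%H:%M:%S %z"
    )
    fields["status"] = int(fields["status"])
    fields["bytes"] = 0 if fields["bytes"] == "-" else int(fields["bytes"])
    return fields

# Made-up sample line for demonstration purposes.
sample = '203.0.113.7 - - [10/Oct/2016:13:55:36 -0700] "GET /index.html HTTP/1.1" 200 2326'
event = parse_line(sample)
unparsed = parse_line("this line follows a different format")
```

If the application changes its log format even slightly, `parse_line` starts returning `None`, which is the same failure mode that forces periodic adjustments to a real grok configuration.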
Elasticsearch mapping defines the different types that reside within an index. It defines the fields for documents of a specific type — the data type (such as string and integer) and how the fields should be indexed and stored in Elasticsearch.
With dynamic mapping (which is turned on by default), Elasticsearch automatically inspects the JSON properties in documents before indexing and storage. However, if your logs change and a field arrives with a type that conflicts with the existing mapping, Elasticsearch will reject those documents instead of indexing them. So, unless you monitor the Elasticsearch logs, you will likely not notice the resulting "MapperParsingException" error and thereby lose the logs rejected by Elasticsearch.
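The following Python sketch is a deliberately simplified model of that behavior, not Elasticsearch's actual mapping logic (which also handles coercion, nested fields, and much more). The function names and sample documents are hypothetical. It only illustrates the core idea: the first type seen for a field wins, and a later document whose field has a conflicting type is rejected.

```python
def json_type(value):
    """Map a Python value to a coarse JSON-style type name."""
    if isinstance(value, bool):
        return "boolean"
    if isinstance(value, (int, float)):
        return "number"
    if isinstance(value, str):
        return "string"
    if isinstance(value, dict):
        return "object"
    return "other"

def index_documents(docs):
    """Toy model of dynamic mapping: the first type seen for a field wins."""
    mapping, indexed, rejected = {}, [], []
    for doc in docs:
        conflicts = [
            field for field, value in doc.items()
            if field in mapping and mapping[field] != json_type(value)
        ]
        if conflicts:
            # Stands in for a MapperParsingException: the document is lost
            # unless something is watching for rejections.
            rejected.append((doc, conflicts))
            continue
        for field, value in doc.items():
            mapping.setdefault(field, json_type(value))
        indexed.append(doc)
    return mapping, indexed, rejected

docs = [
    {"user": "alice", "response": 200},
    # The log format changed: "response" is now an object, not a number.
    {"user": "bob", "response": {"code": 500, "reason": "timeout"}},
]
mapping, indexed, rejected = index_documents(docs)
```

Tracking the `rejected` list is the part that matters in practice: without some equivalent monitoring, the second document simply disappears.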
You've got your pipeline set up, and logs are coming into the system. To ensure high availability and scalability, your ELK deployment must be robust enough to handle pressure. For example, an event occurring in production will cause a sudden spike in traffic, with more logs being generated than usual. Such cases will require the installation of additional components on top (or in front) of your ELK Stack.
For example, we recommend that you place a queuing system before Logstash. This ensures that bottlenecks are not formed during periods of high traffic and Logstash does not cave in during the resulting bursts of data.
Installing additional Redis or Kafka instances means more time and more money, and in any case, you must make sure that these components will scale whenever needed. In addition, you will also need to figure out how and when to scale up your Logstash and Elasticsearch cluster manually.
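To see why a queue in front of the pipeline helps, consider this small Python simulation. It is a sketch under simplifying assumptions: time advances in discrete ticks, the consumer (standing in for Logstash) handles a fixed number of events per tick, and the traffic numbers are invented. It does not model Redis or Kafka themselves.

```python
from collections import deque

def simulate(arrivals, capacity_per_tick, buffer_size):
    """Return (processed, dropped) for a consumer fed through a bounded buffer.

    arrivals: number of events arriving in each tick.
    capacity_per_tick: events the consumer handles per tick.
    buffer_size: events the queue can hold; 0 models having no queue at all.
    """
    buffer = deque()
    processed = dropped = 0
    for count in arrivals:
        for _ in range(count):
            # An event is accepted if the consumer can take it this tick
            # or if there is still room in the queue.
            if len(buffer) < buffer_size + capacity_per_tick:
                buffer.append(1)
            else:
                dropped += 1
        for _ in range(min(capacity_per_tick, len(buffer))):
            buffer.popleft()
            processed += 1
    # Whatever is still queued is delivered after the burst has passed.
    processed += len(buffer)
    return processed, dropped

# Hypothetical traffic: steady load with one production incident in the middle.
arrivals = [10, 10, 100, 10, 10]
without_queue = simulate(arrivals, capacity_per_tick=20, buffer_size=0)
with_queue = simulate(arrivals, capacity_per_tick=20, buffer_size=100)
```

In this toy run the unbuffered consumer drops most of the burst, while the buffered one delivers every event, only later. The trade-off is that the queue itself must be sized, monitored, and scaled.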
While built for scalability, speed, and high availability, the ELK Stack — as well as the infrastructure (server, OS, network) on which you chose to set it up — requires fine-tuning and optimization to ensure high performance.
For example, you will want to configure the allocations for the different memory types used by Elasticsearch, such as the JVM heap and OS swap. The number of indices handled by Elasticsearch affects performance, so you will want to make sure you remove or freeze old and unused indices.
Fine-tuning shard size, configuring segment merges for unused indices, and handling shard recovery in the case of node failure are all tasks that will affect the performance of your ELK Stack deployment and will require planning and implementation.
These are just a few examples of the grunt work that is required to maintain your own ELK deployment. Again, it is totally doable — but it can also be very resource-consuming.
Data Retention and Archiving
What happens to all of the data once ingested into Elasticsearch? Indices pile up and eventually — if not taken care of — will cause Elasticsearch to crash and lose your data. If you are running your own stack, you can either scale up or manually remove old indices. Of course, manually performing these tasks in large deployments is not an option, so use Elastic's Curator or set up cron jobs to handle them.
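As a rough idea of what such a scheduled job has to decide, here is a minimal Python sketch that picks out daily indices past a retention window. It assumes indices follow Logstash's default daily naming (`logstash-YYYY.MM.DD`); the function name, the 14-day window, and the sample index list are hypothetical. It only selects names and does not call Elasticsearch or Curator.

```python
from datetime import date, datetime, timedelta

def expired_indices(index_names, today, keep_days, prefix="logstash-"):
    """Return names of daily indices older than keep_days, oldest first."""
    cutoff = today - timedelta(days=keep_days)
    expired = []
    for name in index_names:
        if not name.startswith(prefix):
            continue  # not one of ours, e.g. the Kibana index
        try:
            day = datetime.strptime(name[len(prefix):], "%Y.%m.%d").date()
        except ValueError:
            continue  # prefix matches but the suffix is not a date; leave it alone
        if day < cutoff:
            expired.append((day, name))
    return [name for _, name in sorted(expired)]

# Hypothetical index list for demonstration.
names = [
    "logstash-2016.01.30",
    "logstash-2016.01.01",
    "logstash-2016.01.20",
    ".kibana",
    "logstash-backup",
]
to_delete = expired_indices(names, today=date(2016, 1, 31), keep_days=14)
```

Skipping anything that does not parse as a dated index is a deliberate choice here: a retention job that deletes too little is an inconvenience, while one that deletes the wrong index is an outage.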
Curation is quickly becoming a de-facto compliance requirement, so you will also need to figure out how to archive logs in their original formats. Archiving to Amazon S3 is the most common solution, but this again costs more time and money. Cloud-hosted ELK solutions such as our Logz.io platform provide this service as part of the bundle.
Handling an ELK Stack upgrade is one of the biggest issues you must consider when deciding whether to deploy ELK on your own. In fact, upgrading a large ELK deployment in production is so daunting a task that you will find plenty of companies that are still using extremely old versions.
When upgrading Elasticsearch, making sure that you do not lose data is the top priority — so you must pay attention to replication and data synchronization while upgrading one node at a time. Good luck with that if you are running a multi-node cluster! This incremental upgrade method is not even an option when upgrading to a major version (e.g. 1.7.3 to 2.0.0), which is an action that requires a full cluster restart.
Upgrading Kibana can be a serious hassle with plugins breaking and visualizations sometimes needing total rewrites.
Think big. As your business grows, more and more logs are going to be ingested into your ELK Stack. This means more servers, more network usage, and more storage. The overall amount of computing resources needed to process all of this traffic can be substantial.
Log management systems consume huge amounts of CPU, network bandwidth, disk space, and memory. Sporadic data bursts are a frequent phenomenon (when an error takes place in production, your system will generate a large number of logs), so capacity allocation needs to follow suit. The underlying infrastructure needed can amount to tens of thousands of dollars per year.
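A back-of-the-envelope storage estimate can be sketched as follows. Every number and parameter name here is an assumption for illustration: real index overhead depends on your mappings, and the right amount of headroom depends on how bursty your traffic is.

```python
def required_storage_gb(daily_gb, retention_days, replicas=1,
                        index_overhead=1.1, headroom=0.2):
    """Estimate total disk needed across the cluster, in GB.

    daily_gb: raw log volume ingested per day.
    retention_days: how long indices are kept before deletion.
    replicas: replica copies per primary shard.
    index_overhead: assumed ratio of indexed size to raw size.
    headroom: fraction of disk kept free for bursts and merges.
    """
    primary = daily_gb * retention_days * index_overhead
    total = primary * (1 + replicas)
    return total / (1 - headroom)

# Hypothetical deployment: 50 GB/day, 30 days of retention, one replica.
estimate = required_storage_gb(daily_gb=50, retention_days=30)
```

Doubling either the daily volume or the retention period doubles the result, which is why growth in log volume translates so directly into infrastructure cost.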
In many cases, your log data is likely to contain sensitive information about yourself, your customers, or both. Just as you expect your data to be safe, so do your customers. As a result, security features such as authorization and authentication are a must to protect both the logs coming into your ELK Stack specifically and the success of your business in general.
The problem is that the open source ELK Stack does not provide easy ways to implement data protection strategies. Ironically, ELK is used extensively for PCI compliance and SIEM but does not include security out of the box. If you are running your own stack, your options are not great. You could try to hack your own solution, but as far as I know there is no easy and fast way to do that. Or, you could opt for using Shield — Elastic’s security ELK add-on.
You’ve probably heard of Netflix, Facebook, and LinkedIn, right? All these companies are running their own ELK Stacks, as are thousands of other very successful companies. So, running ELK on your own is definitely possible. But as I put it at the beginning, it all boils down to the amount of resources at your disposal in terms of time and money.
I have highlighted the main pain points involved in maintaining an ELK deployment over the long term. But for the sake of brevity, I have omitted a long list of features that are missing in the open source stack but are recommended for production-grade deployments. Some needed additions are user control and user authentication, alerting, and built-in Kibana visualizations and dashboards.
The overall cost of running your own deployment, combined with the missing enterprise-grade features that are necessary in any modern centralized log management system, makes a convincing case for choosing a cloud-hosted ELK platform.
Or do you think you can pull it off yourself?