If you work in the SaaS world and you operate high-availability systems, you need to monitor your systems closely. Monitoring is one of OpsGenie’s critical internal business continuity operations.
Why do we monitor our system? There are two main reasons. First, we need to be aware of any issues in our system, and monitoring lets us know what is happening. Second, we need to identify issues, as awareness of issues is not enough. In order to identify issues, we need to collect as much monitoring data from our application as we can.
The next question is what is monitoring data? The answer really depends on the application. For our systems (and in general), we are interested in three types of monitoring data:
- Audit (aka trace)
- Stat (or metric)
There are many products, frameworks, and tools for monitoring classical web applications. If you implement your system on AWS Lambda and want to monitor your Lambda functions, almost no monitoring tool (at least for Java) tells you what is going on in your Lambda function. Although AWS provides you some metrics about how long the invocation took, it doesn’t say much about what is going on under the hood. For monitoring your Lambda function in more detail, AWS XRay support had been added. But while AWS XRay provides metrics about calls to external services such as DynamoDB, HTTP endpoint, etc.; it doesn’t expose internal metrics. You still have no idea what is going on inside.
Let me explain why you need more detailed monitoring data for Lambda functions. Assume that you have a Lambda function and it takes a message by request, processes and then saves it to DynamoDB. Let’s call this Lambda function “message-handler”. Your Lambda function was implemented to be used by different callers so the request contains a regex and a message body text. Each caller invokes Lambda function with their own logic related regex. In here, regex is used for normalizing message body before saving it to DynamoDB.
You have deployed your Lambda function and it works like a charm. So far so good. But then you have seen that some Lambda invocations take a few minutes and some unlucky ones fail with timeout. Here, the only data you have is the invocation time and you have no idea which part of the code during invocation takes most of the time. Then you have enabled AWS XRay support. By AWS XRay metrics, you have seen that for the long-running Lambda invocations, DynamoDB calls are not the problem. Then what is the next part to blame?
This is one of the worst cases you might be faced with for your Lambda function in the production environment. Here, the problem might be about regex operation which causes catastrophic backtracking for some problematic regex and input texts but you have no metric to verify that is the problem. You cannot debug your Lambda function, you cannot profile it, and you cannot collect more detailed metrics. In this case, you should be a magician developer to find the problem by guessing with static code analyze.
At OpsGenie, we are implementing new applications as new microservices and splitting our existing monolithic architecture into microservices as we mentioned in our previous blog posts:
- OpsGenie is on a journey to reap the benefits of serverless architecture.
- Migrating a Monolithic Java Application to AWS Serverless Architecture.
So, we decided to use AWS Lambda as FaaS for deploying and running our microservices. But before we go to production with our new Lambda architecture, the problem mentioned above was one of the key challenges we needed to handle.
Thundra: Let the Monitoring Begin
Because of the reasons we mentioned above, we implemented our in-house monitoring infrastructure Thundra, which is a character from Aladdin, to collect more detailed metrics from Lambda functions to be used for analyzing issues. Note that Thundra is not specific to AWS Lambda, it provides monitoring (audit + stat + log) infrastructure and we have adapted Thundra to our Lambda infrastructure Sirocco, which is also another character from Aladdin, by using its plugin support.
All of the monitoring data types (audit + stat + log) can be published via “MonitorDataPublisher” API. Currently, there are CloudWatch, Firehose and Kinesis based publisher implementations and others can be implemented when needed. For Lambda environment, to minimize monitoring data publish overhead, we are publishing monitoring data through CloudWatch logs.
Monitoring data is written to CloudWatch logs (log streams of Lambda function log group) and then another Lambda “monitor-lambda” is subscribed to Lambda function log group with a filter pattern to only receive monitor data (audit + stat + log) events. Then monitor data events are demultiplexed and forwarded to their targets (NewRelic/ElasticSearch + Graylog).
The following diagram shows our Lambda monitoring architecture:
Audit and stat data are published to the following systems:
- NewRelic: Currently, we are using NewRelic’s Insights product (not the APM because we have our own APM solution for thw AWS Lambda platform) to only query and visualize our audit and stat data.
- Firehose: We also send our audit and stat data to Firehose for storing it as batch on both of Amazon S3 and ElasticSearch. S3 is used as the backup of raw audit and stat data currently. For possible further improvements, we are also thinking of analyzing audit and stat data on S3 via Elastic Map-Reduce. ElasticSearch is used for querying and visualizing (with Kibana and Grafana) audit and stat data like NewRelic Insights. Currently, we are using both of NewRelic Insights and ElasticSearch for similar requirements because they have advantages and disadvantages over each other.
Log data is published to Graylog. At first, we were sending log data to Graylog over HTTPS in GELF format. But then we switched to TCP-level communication via SSL sockets (based on Netty) in GELF format due to performance overhead of application layer (HTTPS) communication and intermittent SSL handshake errors when using HTTPS.
Let me give an overview of what Thundra provides for monitoring. As I already mentioned above, it has three basic monitoring features:
- Auditing and tracing
- Collecting stats
- Monitoring and enriching logs
Auditing is used for tracing executions (method calls), collecting metrics and publishing the audit data to be analyzed later. It starts at a time, collects audit metrics, and then finishes. It is generally used for tracing the execution during request/response based communication.
In the auditing infrastructure, the key point is the “AuditContext” which represents a logical scope started and finished at a time. Here, the scope depends on your logic and needs. It might be the whole request, single or multiple method calls. Scopes can contain children scopes so “AuditContext”s are hierarchical.
Let me explain in detail with the “message-handler” example mentioned before. There are three tiers of a typical web application: Controller, Service, and Repository layers.
- The Lambda handler can be considered as the Controller layer. Let’s call our Lambda function handler as “MessageHandler” class. “MessageHandler” opens the request and passes it to the Service layer via the “MessageService” class.
- Processing the message body with the supplied regex is done in the Service layer. Let’s call the class, which gets the message and then handles the processing message body logic, “MessageService”. “MessageService” uses “MessageNormalizer” class to normalize the given message body by given regex and passes the result to the Repository layer through “MessageRepository” class.
- Then the normalized message is saved to DynamoDB by the Repository layer. Let’s call the class that saves the message to DynamoDB “MessageRepository”.
Getting requests and returning responses are the root audit context. It starts with the request and ends with the response. Since this is a very high-level audit metric, for detailed analysis, we need to collect more sensitive audit data with deeper audit context levels.
Therefore, we should also trace the most important classes, which are the “MessageHandler”, “MessageService”, “MessageNormalizer” and “MessageRepository” classes. We can represent audit context names as “<class_name>::<method_name>” and each sub-method call is associated with the caller as sub audit context. Here are the durations of audit contexts:
With the help of this deep level auditing, it is easy to notice that the problematic part is the normalization of the message body with the given regex. If we have only the invocation duration (5100 milliseconds), we couldn’t say which part (normalization or save) is the long-running action.
You may think that since we can see the latency of the Repository layer via AWS XRay, DynamoDB metrics are not the problem; it is obvious that the normalization is the problematic one. It might make sense for this case, but the more parts/modules there are in your system, the harder it will be detected by guessing. You need concrete metrics to blame a specific part of your system.
They are all good, but manually opening and closing audit contexts is error-prone and messes up the code. They should be cross-cutting operations for configured/specified points (classes/methods). With this motivation, we have introduced an instrument plugin for our audit infrastructure. Using it, you can specify the classes/methods to be audited, and their bytecodes are instrumented to automatically open/close “AuditContext”s to injects their audit metrics.
Here are some of the highlighted features of our auditing infrastructure:
- “AuditContext”s can be opened and closed manually or automatically.
- “AuditContext”s are hierarchical so there might be sub-contexts.
- “AuditContext”s are inheritable so that thread local and audit trees of child threads attach to audit tree of their parent thread as sub audit tree.
- Custom properties can be added into “AuditContext” to be shown in the metrics.
- Audits are published via “AuditDataPublisher” API.
- The classes/methods to be audited can be specified/configured programmatically (“@Auditable” annotation or “AuditPointProvider” API) or by system property.
- Method arguments, return values and errors thrown from a method can be audited.
- You can audit methods line by line so you can see which line takes how many milliseconds.
- Custom actions can be defined to be executed before/after method call or when an exception is thrown.
- Custom actions can be defined to be triggered when the method calls exceed a defined time limit. In this case, it is considered a long-running method. Currently, CPU profiling is supported if the invocation exceeds its defined long-run limit. So if a method call takes more time than expected (long run limit), CPU profiling kicks in and, at the end of the method call, we have profiling information that says which parts of the application are hot.
- Audit infrastructure is integrated with AWS XRay. When there is an audit context — a mapped XRay sub-segment is created — so you can monitor and query audit metrics on AWS XRay. With this integration, we are on the road to perfect tracing by combining distributed tracing with local tracing.
- The audit Java agent can also be activated (attached to the JVM) on the fly. This is required, especially for AWS Lambda environments, because the agent cannot be specified as a VM argument to a Lambda function. Therefore, in a Lambda environment, our audit agent is activated at startup and applies instrumentation to audit points (classes/methods).
The following screenshots show how we use our collected audit data in production:
- Lambda cold start occurrences for each function:
- Integration of our audit infrastructure with AWS XRay (so we have full tracing by connecting local and distributed trace data):
Stat infrastructure collects stats (cumulative or instantaneous) and publishes the stats data to be analyzed later. Stats can be application/environment-specific (CPU stats, memory stats, etc …), module/layer-specific (cache stats, DynamoDB stats, etc …), or domain-specific (user stats, …).
Here are some of the important features of stat infrastructure:
- Stats are collected through the “StatInfoProvider” API on demand or periodically.
- The trigger of “StatInfoProvider”s can be configured and controlled by the “StatInfoProviderStrategy” API. Trigger behavior can be count-based, time-based, custom, or a composition of any of them with the AND/OR operator.
- Stats are published via “StatDataPublisher” API.
And these are the currently available stats:
- CPU Stats: Process/system CPU load, system load average, processor count, process CPU time, total/free physical memory sizes, total/free swap space sizes, committed virtual memory size, etc …
- Memory Stats: Heap and non-heap memory usages (init/used/committed/max), memory usages of each memory pool (Eden space, Survivor spaces, Tenured space).
- GC Stats: Minor/Major GC counts/durations, GC CPU percentages, etc …
- Thread Stats: Current thread count, peak thread count, total created thread count, metrics (CPU/user time, blocked time, waited time) of the most CPU-intensive threads.
- Custom Stats: Custom statistics (cache statistics, DynamoDB statistics, etc …) published to “StatContext” for the thread-local stat session (ex. during the request).
For example, the following screenshot shows CPU percentages of each function:
Logging infrastructure provides a log4j logger (“org.apache.log4j.Logger”), which decorates logs with application/environment information (application name, type, id, version, profile, hostname, host IP, etc …) and provides domain-specific custom log properties (customer name, user name, etc …).
After decoration, generated log messages are published via “MonitoredLogPublisher” API (through CloudWatch, Firehose, Kinesis, etc.) to log repositories such as Graylog/ElasticSearch. With the help of decorated properties, log messages can be identified where they have been collected.
In addition, log data can have a reference to active audit context, if there is any. So you can access the details and metrics of the execution where the log is created. If the log message is not one of the expected ones, the referenced audit context can be analyzed to see the big picture and understand why the code branched to the place where the log is created.
At OpsGenie, we are using AWS Lambda in production and facing many challenges. We will continue sharing our experiments in our engineering blog and observations on AWS Lambda. Stay tuned!