The Ultimate Guide to Monitoring Serverless Applications
What is serverless monitoring and why is it important?
Serverless applications usually distribute their logic across multiple functions and services. As an application grows and agents and wrappers are attached, it becomes more complex and more costly to run. This is where serverless monitoring comes in. But what is serverless monitoring?
Serverless monitoring gives developers insight into what happens during each execution and event: errors become more visible, and resource consumption can be measured for each invocation. Simply put, there is no better way to optimize the costs and performance of your applications than using a serverless monitoring tool.
While the old tools for AWS logging and monitoring are obsolete here, the requirements for a good logging system remain:
information should be granular
data should be available in the shortest amount of time
log collection should not impact application performance
These are key elements to look out for when finding the most comprehensive serverless monitoring tool.
What is AWS Lambda?
Let’s go back to basics first and remind ourselves exactly what AWS Lambda is and its purpose.
Serverless architectures extend the principles of Service-Oriented Architecture (SOA), where services (functions) communicate using messages (events). Used correctly, this approach can reduce code complexity and make an application easier to manage.
AWS Lambda is a service which runs your code deployed to a container with pre-allocated CPU, disk and memory. Together, your code and its associated configuration are called a Lambda function; these functions run in response to external events or triggers. Lambda functions are “stateless” with no affinity to the underlying infrastructure, allowing developers to focus solely on the code. Lambda is undoubtedly at the heart of Serverless applications.
Function-as-a-Service (FaaS) solves many problems that previous architectural models had to cope with. From a developer’s point of view, the most important one is the ability to run code without having to consider server administration, scalability and availability. On the other hand, some aspects have to be dealt with differently than before, such as monitoring.
For more detailed technical knowledge and information around Serverless as a whole, head to this Serverless Knowledge Base.
What to Monitor
One of the contributing factors that make serverless applications harder to monitor is the setup overhead of analytics services. In most serverless applications there are many more units to monitor, their life cycles are short, and configuring agents directly adds latency and cost.
The good thing about such services, however, is that by default, they make themselves observable.
Observability does not mean that you have visibility. Rather, it means that the systems emit data that makes it possible to understand, from the outside, what is happening.
Monitoring can be both specific and generic when it comes to serverless applications, depending on your needs and the platforms used. First though, here’s what we think are the most important areas to monitor to gain maximum benefit from serverless: latency; cold starts; invocation errors; and costs and usage.
For latency, large datasets can skew results, making it hard to notice when an important user-facing function has started to take longer to execute. In these large datasets, average metrics usually hide the outlying data points: the average execution speed can look acceptable even though some percentage of users experience significantly longer response times.
A good way to keep an eye on latencies is to construct a custom dashboard of all mission-critical functions and observe for outliers. Once you detect a function that is taking longer than expected, you can drill down into detailed metrics.
Additionally, as a developer, it’s not uncommon to be faced with SLA requirements that go something like this: “99% of requests must finish quicker than 1 second”. Even if this doesn’t apply, this type of requirement is good to use because it’s actionable and easily measurable. This is where percentile metrics come into play.
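To make the difference between an average and a percentile concrete, here is a minimal sketch in Python. The sample durations are invented for illustration, and the nearest-rank method used here is only one of several ways to define a percentile; a monitoring service may compute it differently.

```python
import math


def percentile(values, pct):
    """Nearest-rank percentile: the smallest sample with at least pct% of samples at or below it."""
    if not values:
        raise ValueError("values must not be empty")
    ordered = sorted(values)
    rank = math.ceil(pct * len(ordered) / 100)  # 1-based rank into the sorted samples
    return ordered[max(rank, 1) - 1]


def meets_sla(durations_ms, pct=99, limit_ms=1000):
    """True if pct% of requests finished within limit_ms."""
    return percentile(durations_ms, pct) <= limit_ms


# Hypothetical sample: 97 fast requests and 3 slow ones.
durations_ms = [120] * 97 + [4000] * 3
average_ms = sum(durations_ms) / len(durations_ms)

print(f"average: {average_ms:.1f} ms")
print(f"p99:     {percentile(durations_ms, 99)} ms")
print(f"SLA met: {meets_sla(durations_ms)}")
```

In this sample the average stays well under one second, while the 99th percentile shows that the slowest requests take four seconds, so the "99% under 1 second" requirement is not met.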
Each Lambda function runs inside an isolated container. When a function is invoked for the first time, AWS spins up a new container and then executes the function inside it. This increases latency and may make your application seem slow to the user initially: the requester usually waits from a few hundred milliseconds to several seconds for a response. After this initial latency, however, the function is kept ‘warm’ for a period of time. During this time, new invocations don’t suffer similar latencies and feel much more responsive to the end user.
There’s an additional complication which is the concurrency issue. If you receive a burst of traffic simultaneously, AWS scales the function by spinning up more containers to handle all the new requests. This causes a whole different sequence of cold starts, which has nothing to do with resources being left idle for too long.
There are many scenarios where cold starts are undesirable. If that’s the case for your application, you need to detect and monitor cold starts in your stack. Cloud services usually won’t provide this information directly, but monitoring services such as Dashbird will. Monitoring cold starts is a great way to keep on top of these issues, in order to assess where architectural improvements can be made.
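If you want a rough signal from inside the function itself, one common pattern is a module-level flag: code at module level runs once per container, so the first invocation that sees the flag still set is a cold start. The sketch below assumes a Python Lambda function; the handler name and the log field names are hypothetical, and this is not how any particular monitoring service detects cold starts.

```python
import json
import time

# Module-level code runs once per container, so this flag is True only
# for the first invocation handled by a freshly started container.
_is_cold_start = True
_container_started_at = time.time()


def handler(event, context):
    """Hypothetical Lambda handler that tags each invocation as cold or warm."""
    global _is_cold_start
    cold_start = _is_cold_start
    _is_cold_start = False

    # One structured log line per invocation; a log-based monitor can count these.
    print(json.dumps({
        "cold_start": cold_start,
        "container_age_s": round(time.time() - _container_started_at, 3),
    }))

    return {"cold_start": cold_start}
```

Because a burst of traffic starts several containers at once, each new container reports its own cold start, which is the behavior described above.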
We go more in depth in this article on how to solve cold starts, if you’d like to learn more.
An AWS Lambda invocation can raise errors for a variety of reasons. When an invocation error occurs, Lambda returns a 400-series or 500-series HTTP status code. Some of these errors occur because the invocation request is rejected before your function receives it. You can find out more from this list of some of the most common errors or see this complete list of invocation errors.
On average, small and medium-sized businesses (SMBs) use about 20 SaaS tools; many of those are connected via an API to the service itself and often play a crucial role in the business logic. The problem arises when one of these endpoints goes down without anyone knowing. Understanding what doesn’t work, or where the bottleneck is, is imperative to running a business and having insight into applications. Being notified of a failure and pinpointing where and when the error happened saves hours of searching through logs and reduces the downtime that affects end users.
Below is an example of a critical errors dashboard in the Dashbird app. It’s important that you quickly understand the severity and root cause of a specific error, and when it first and last occurred in your app.
Costs & Usage
Lambda pricing is very straightforward, with its billable factors including:
Number of requests
Amount of memory provisioned
To break it down, let’s start with the number of requests. The first 1 million requests are free every month. After that you will be charged $0.20 per million requests for the remainder of that month. You can also use our free Lambda Cost Calculator to calculate the exact Lambda cost for your specific case.
Independent of the number of requests, you also pay for the memory allocated to your function along with the compute time that the function takes to do its job. You specify the memory allocation (GB) when you deploy a function, while the compute time (seconds) can vary from one invocation to the next. With these together, a GB-seconds charge is billed where the first 400,000 GB-seconds are free, and anything after is charged at $0.00001667 per GB-second.
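The arithmetic above can be sketched as a small Python function. It uses only the prices and free-tier allowances quoted in this article, which may differ by region or change over time, and it ignores the additional resources mentioned below. The function and parameter names are made up for illustration.

```python
FREE_REQUESTS = 1_000_000          # free requests per month
PRICE_PER_MILLION_REQUESTS = 0.20  # USD, after the free requests
FREE_GB_SECONDS = 400_000          # free GB-seconds per month
PRICE_PER_GB_SECOND = 0.00001667   # USD, after the free GB-seconds


def monthly_lambda_cost(requests, avg_duration_s, memory_mb):
    """Estimate the monthly Lambda bill from request count, average duration and memory."""
    billable_requests = max(requests - FREE_REQUESTS, 0)
    request_cost = billable_requests / 1_000_000 * PRICE_PER_MILLION_REQUESTS

    # GB-seconds = invocations x seconds per invocation x allocated memory in GB
    gb_seconds = requests * avg_duration_s * (memory_mb / 1024)
    billable_gb_seconds = max(gb_seconds - FREE_GB_SECONDS, 0)
    compute_cost = billable_gb_seconds * PRICE_PER_GB_SECOND

    return {
        "request_cost": request_cost,
        "compute_cost": compute_cost,
        "total": request_cost + compute_cost,
    }


# Hypothetical workload: 10 million requests of 1 second each at 1024 MB.
estimate = monthly_lambda_cost(requests=10_000_000, avg_duration_s=1.0, memory_mb=1024)
print(f"requests: ${estimate['request_cost']:.2f}")
print(f"compute:  ${estimate['compute_cost']:.2f}")
print(f"total:    ${estimate['total']:.2f}")
```

For a smaller workload, say 3 million requests of 200 ms each at 512 MB, the compute usage stays inside the free 400,000 GB-seconds and only the request charge of about $0.40 remains.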
There will likely be some additional charges for resources like an AWS S3 bucket, VPC, DynamoDB, etc.
As you can see, costs and usage can grow quickly, and with distributed models in particular it’s easy to lose track of what has been spent and which resources have been used efficiently and intentionally.
When it comes to the cost of the system, it is best to monitor at the account level. Only if that metric changes significantly does it make sense to drill down to function level.
Below is a 12-hour general statistics dashboard in the Dashbird app. You can drill down from a month’s view to a detailed 5-minute view to get a quick, under-the-hood understanding of your serverless application. From here, you’ll be able to spot trends and anomalies within your application.
Learn more about Serverless Best Practices in this free ebook.
Let’s start with what’s already in the box.
The built-in tool for Lambda, AWS CloudWatch, organises logs based on function, version and containers while Lambda adds metadata for each invocation. In addition, runtime and container errors and timestamps are included in the logs. With the use of these CloudWatch logs, metrics can be collected and tracked providing an infrastructure-wide view on resource usage, application performance and operational health.
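As an illustration of the per-invocation metadata, Lambda writes a REPORT line to CloudWatch Logs at the end of each invocation. The sketch below parses such a line in Python. It assumes the tab-separated layout commonly seen in these logs, where an Init Duration field appears only on cold starts; the sample line and the output field names are invented for this example, and the log format itself may change.

```python
def parse_report_line(line):
    """Parse a Lambda REPORT log line into a dict, or return None for other lines."""
    if not line.startswith("REPORT "):
        return None

    # Fields are tab-separated "Name: value" pairs, e.g. "Duration: 102.25 ms".
    fields = {}
    for part in line[len("REPORT "):].strip().split("\t"):
        if not part.strip():
            continue
        key, _, value = part.partition(": ")
        fields[key.strip()] = value.strip()

    def number(name):
        raw = fields.get(name)
        return float(raw.split()[0]) if raw else None

    return {
        "request_id": fields.get("RequestId"),
        "duration_ms": number("Duration"),
        "billed_duration_ms": number("Billed Duration"),
        "memory_size_mb": number("Memory Size"),
        "max_memory_used_mb": number("Max Memory Used"),
        "init_duration_ms": number("Init Duration"),
        # Assumption: Init Duration is only present when a new container was started.
        "cold_start": "Init Duration" in fields,
    }


# Invented sample line in the assumed format.
sample = (
    "REPORT RequestId: abc-123\tDuration: 102.25 ms\tBilled Duration: 103 ms\t"
    "Memory Size: 128 MB\tMax Memory Used: 55 MB\tInit Duration: 250.10 ms\t\n"
)
print(parse_report_line(sample))
```

Comparing max_memory_used_mb with memory_size_mb across invocations is one simple way to see whether a function is over-provisioned or close to running out of memory.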
The service also includes CloudWatch Alarms, which can be set up as both metric alarms and composite alarms. The latter take other alarms into account, working together to help reduce alarm noise. CloudWatch alarms let you know how much RAM a given application is using, how much CPU usage pops up across your entire system and more. You’re also able to set data limits by pre-determining events that notify you when you’ve reached or are approaching the limit, helping projects and applications stick to budget.
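As a rough sketch of a metric alarm, the Python code below builds the parameters for a CloudWatch alarm on the 99th-percentile duration of a single Lambda function and passes them to boto3's put_metric_alarm. The function name, threshold, evaluation settings and SNS topic ARN are placeholders chosen for illustration, and creating the alarm requires the boto3 package and AWS credentials.

```python
def duration_alarm_params(function_name, threshold_ms, sns_topic_arn):
    """Build put_metric_alarm parameters for a p99 duration alarm on one function."""
    return {
        "AlarmName": f"{function_name}-p99-duration",
        "Namespace": "AWS/Lambda",
        "MetricName": "Duration",  # reported by Lambda in milliseconds
        "Dimensions": [{"Name": "FunctionName", "Value": function_name}],
        "ExtendedStatistic": "p99",
        "Period": 300,             # evaluate in 5-minute buckets
        "EvaluationPeriods": 3,    # three consecutive breaching periods trigger the alarm
        "Threshold": float(threshold_ms),
        "ComparisonOperator": "GreaterThanThreshold",
        "TreatMissingData": "notBreaching",  # no invocations is not treated as a breach
        "AlarmActions": [sns_topic_arn],
    }


def create_alarm(params):
    """Create or update the alarm in the current AWS account and region."""
    import boto3  # third-party AWS SDK, imported here so the builder works without it

    boto3.client("cloudwatch").put_metric_alarm(**params)


# Placeholder values: replace with a real function name and SNS topic ARN.
params = duration_alarm_params(
    function_name="checkout",
    threshold_ms=1000,
    sns_topic_arn="arn:aws:sns:us-east-1:123456789012:alerts",
)
# create_alarm(params)  # uncomment to create the alarm with real credentials
```

Keeping the parameter builder separate from the AWS call makes it easy to review or test the alarm definition before anything is created.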
When you start building your first FaaS application, chances are you will begin with using CloudWatch. CloudWatch lets you track issues and is a great starting point for serverless monitoring and instilling good monitoring habits. While CloudWatch provides just enough monitoring tools for some users, others with more volume might need something more comprehensive to significantly cut down their discovery and resolution time.
Instant Benefits of Serverless Monitoring
It’s pretty clear now that using a monitoring tool for your serverless architecture provides a multitude of benefits, from instilling good habits and best practice to creating a more productive and effective business, team and application. It’s a no brainer, but here are a few more reasons why it’s so important and beneficial.
Issue Management and Team Collaboration
Any cloud application, even one with minimal complexity, will frequently generate a fair number of issues, especially while it is under active development. The development teams behind such applications need a way to manage these issues effectively.
Monitoring platforms give teams a user-friendly way to visualize and control issues that are open, those that are resolved and those that have been temporarily muted. This setup creates better cohesion and clear communication for the team and the resolution workflow.
Quickly visualizing past occurrences of the same issue can be important as some cases require further investigation. They can also indicate any current bug fixing approaches that aren’t working as expected.
In all, this can help avoid making the same mistakes in both initial development and error remediation, and improve expertise and knowledge by creating a continuous learning and assessment approach.
Developers won’t have the time to keep monitoring application logs for themselves, so they need a monitoring tool that alerts them proactively when something requires their attention.
An automated alerting system may sound like something that any service provider offers today. However, the key to its effectiveness is what’s in the alerts. With an immense amount of application logs, it’s easy for the monitor to miss relevant signals.
The alerting mechanism should detect not only application errors, but also infrastructure faults that can affect the application indirectly. In the case of AWS Lambda, this would include timeouts, container crashes, memory exhaustion and more. You can learn more about Dashbird’s automated alerting here.
For parts of the system that are more tolerant to faults, developers may disable individual issue alerting and set up aggregation metrics. This allows the attention to shift from development to debugging only when it’s really required.
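One simple aggregation metric is the error rate over a sliding window of recent invocations. The Python sketch below alerts only when that rate exceeds a threshold, instead of alerting on every individual failure. The class name, window size and threshold are arbitrary choices for illustration, not settings from any particular monitoring tool.

```python
from collections import deque


class ErrorRateMonitor:
    """Track the error rate over the most recent invocations."""

    def __init__(self, window_size=100, max_error_rate=0.05):
        self.results = deque(maxlen=window_size)  # True means the invocation failed
        self.max_error_rate = max_error_rate

    def error_rate(self):
        return sum(self.results) / len(self.results) if self.results else 0.0

    def should_alert(self):
        # Wait for a full window so a single early failure does not trigger an alert.
        window_full = len(self.results) == self.results.maxlen
        return window_full and self.error_rate() > self.max_error_rate

    def record(self, failed):
        """Record one invocation result and report whether an alert is due."""
        self.results.append(bool(failed))
        return self.should_alert()


# Hypothetical usage: tolerate up to 20% failures over the last 10 invocations.
monitor = ErrorRateMonitor(window_size=10, max_error_rate=0.2)
for failed in [False] * 8 + [True, True]:
    monitor.record(failed)
print(monitor.error_rate(), monitor.should_alert())
```

With this setup, two failures in ten invocations stay within tolerance, and a third failure in the same window would raise an alert.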
Having customization abilities when it comes to alerts is crucial for successful monitoring and error debugging.
When something goes wrong in an application, developers are usually racing against time to mitigate damage and fix the root cause. Receiving alerts is important, but getting them in the fastest and most convenient way is also essential to save time.
Most development teams today use instant messaging services such as Slack. Having a channel dedicated to receiving issue alerts can help developers cut through the noise, enabling instant alerts and quicker fixes.
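As a minimal sketch of such a channel integration, the Python code below posts an alert to a Slack incoming webhook using only the standard library. The webhook URL is a placeholder that you would replace with the one Slack generates for your channel, and the function names and message format are arbitrary choices made for this example.

```python
import json
import urllib.request


def build_slack_request(webhook_url, function_name, message):
    """Build the HTTP request for a Slack incoming webhook without sending it."""
    payload = {"text": f"Alert for {function_name}: {message}"}
    return urllib.request.Request(
        webhook_url,
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def send_alert(webhook_url, function_name, message, timeout_s=5):
    """Send the alert and return True if the webhook answered with HTTP 200."""
    request = build_slack_request(webhook_url, function_name, message)
    with urllib.request.urlopen(request, timeout=timeout_s) as response:
        return response.status == 200


# Placeholder URL: replace with the webhook URL created for your alerts channel.
WEBHOOK_URL = "https://hooks.slack.com/services/T000/B000/XXXX"
request = build_slack_request(WEBHOOK_URL, "checkout", "timeout after 30s")
# send_alert(WEBHOOK_URL, "checkout", "timeout after 30s")  # uncomment to send
```

A short timeout keeps a slow or unreachable webhook from delaying whatever process raised the alert.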
And there you have it, our ultimate guide to serverless monitoring. Monitoring your serverless applications makes sense for many reasons, but possibly the most important is that it makes a developer’s job easier and more enjoyable: it gives you confidence in your app’s reliability and frees up time otherwise spent on debugging, so that you can focus on developing your product.
Published at DZone with permission of Taavi Rehemägi, DZone MVB. See the original article here.