AWS Adventures (Part 3): Post Mortem on Lambdas
This breakdown of AWS Lambda considers where the serverless offering shines and where your use case could be served better by other options.
Friends have asked that I report from the trenches to update on my thoughts on Lambdas. I fell in love about a year ago, so how do I feel now? I’ll cover the pros and cons I faced, what I learned, and what I still don’t know.
If you read my previous article on what Lambdas are, and how you build an automation workflow around them, you’ll see why I like them. In short, a front-end developer with a bit of back-end Node experience can use continuous deployment to get working REST services (or non-REST ones) up quickly, without worrying about maintenance or sudden spikes in traffic. I’m a coder, and with the AWS Node and Python SDKs, I can code my way onto the cloud vs. learning Unix, various weirdness with Terraform/Chef/Ansible/Docker, etc.
Regulation and Long Deployments
… except that’s not wholly true. I work in a large credit card company, Capital One. We are heavily regulated, some by the government, some by us as a preventative measure. Deploying code to QA is easy because, basically, every developer gets a free AWS account (within reason) to play, learn, and develop on. The URL part, however, is extremely complicated, as is getting your code onto prod. Sadly, QA and Prod don’t always match, depending on what team you are on.
API Gateway Too Insecure For Us
I tried and failed to get an AWS API Gateway exception in my company, even without PCI data. We have a huge list of what services on AWS are approved for us to build with, those that are not, and in between. The typical scenario is to create your own private one using approved Route53 that points to your ELB or ALB (load balancers, they find out which EC2 is the least busy, and drive traffic to it) which will then route to whatever code you have on your EC2s. However, that assumes you are using round-the-clock EC2s or Docker containers in an ECS cluster, which is exactly what I didn’t want: more crap to manage and look after and learn. I want to learn serverless, not servers. This project was a lift and shift, meaning getting the code to the cloud as quickly as possible vs. re-writing it FOR the cloud.
In short, everyone wants serverless, but none of the uber-security is there yet without a lot of work.
The company has its own API Gateway, homegrown and different from AWS API Gateway, managed by an internal team, and they’ve solved a lot of common problems API Gateways have. Given we’re a company constantly moving various products from on-prem (servers nearby) to the cloud (AWS, I hear others use Azure and Google), this API Gateway can handle very company-specific issues we have around security, access controls, etc. It’s not as easy to set up as AWS API Gateway.
Lesson: Always check what’s approved by the company before I start building ideas around it. i.e. Lambda isn’t as fun without API Gateway. Even Lambda is only partially approved. Meaning, we can’t put un-encrypted PCI data on ’em because we have no clue what EC2s RAM we’re putting data into, so suddenly everything needs to be tokenized (encrypted).
No API Gateway?
Lambdas are built for burst traffic or the occasional code run for simple services. I was building a content deployment pipeline that I thought ran about 12 times a day, many of which were bursts in the early morning where I don’t want to worry about EC2s being awake. Perfect Lambda use case, right?
What good is a Lambda for a REST API without an API Gateway? Well, AWS is flexible, so I found another way. Lambdas are actually functions, not actual “API’s”. Meaning, anything can trigger them; they ARE functions. Files put on an S3 bucket (hard drive), some log messages in CloudWatch that match a string, and even periodic cron jobs managed by AWS.
More important, though, is that Lambdas can invoke other Lambdas.
Lesson: Even with AWS, you can code yourself out of any negative situation. *flexes bicep*
S3 Calls Lambda Calls Lambda Calls EC2
I’m an impatient person. It took 1 month for my company’s information security to respond to my initial request (I think they’re insanely busy). While waiting, the first week I experimented. Instead of calling API Gateway URL’s as a REST API, I just used lambda.invoke in the Python AWS SDK.
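For readers who haven't used it, here is a minimal sketch of that kind of request/response call through the Python SDK's Lambda client. The function name `invoke_sync` and the payload shape are assumptions for illustration, not the code from the project. The client is passed in so the function can be exercised without AWS credentials.

```python
import json


def invoke_sync(lambda_client, function_name, payload):
    """Invoke a Lambda and wait for its JSON response (request/response style)."""
    response = lambda_client.invoke(
        FunctionName=function_name,
        InvocationType="RequestResponse",  # block until the function returns
        Payload=json.dumps(payload).encode("utf-8"),
    )
    body = json.loads(response["Payload"].read())
    # Lambda sets FunctionError when the invoked code raised an exception.
    if "FunctionError" in response:
        raise RuntimeError("{} failed: {}".format(function_name, body))
    return body
```

In real use you would pass `boto3.client("lambda")` as `lambda_client`, for example `invoke_sync(boto3.client("lambda"), "LambdaB", {"ip": "10.0.0.1"})`, where `LambdaB` stands in for whatever your second function is called.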
At first, this was amazing. I had 2 microservices, 1 in a different security group, that could talk to each other in a request/response fashion, easy to unit and integration test, and speedy. Speedy to me was less than 100 to 200 ms. I know to some that is insanely slow, but again, we’re dealing with what I thought was 12 requests a day, none of them concurrent. The pricing and speed was more than great for our needs.
That, and our Python code only had 2 libraries totaling 500 kb without any C++ dependencies having to compile things through pip. I don’t recall, and yes I looked, any “cold starts”; all our code seemed 100 ms or less usually when not counting REST calls. We could use the lowest RAM lambdas and they ran fine.
… except that was only fun on QA. For whatever reason, our IAM roles managed by an Ops team started to have “configuration drift”. Despite my drinking deeply of functional programming, immutability, and never throwing exceptions, the permission issues weren’t immediately obvious to debug in CloudWatch. As with most projects, your logs increase and then you have the opposite problem: too much information.
The good news was both the Python and Node AWS SDK are solid around those types of exceptions, so once you see it, it’s predictable, and you can write code to detect, and appropriately log it.
Lesson: 99% of the time my code broke during a deployment, it was because of IAM, subnet, or security group reasons, not because the code itself was broken. I’m still learning how to better integration test these types of issues before and after deployments. For IAM reasons, you’ll sometimes get a permission error. However, when you’re blocked for subnet/security reasons, you’ll typically get nothing, or timeout errors that may or may not reflect the real problem, which makes it frustrating to determine the root cause.
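One way to make those two failure modes easier to spot in the logs is a small classifier around the SDK call. This is a rough sketch, not what we shipped: the set of error codes and the class-name check for timeouts are assumptions you should adjust to whatever your own logs actually show.

```python
# Error codes that usually mean "IAM said no". The exact code differs per
# service, so treat this set as a starting assumption, not a complete list.
PERMISSION_CODES = {"AccessDenied", "AccessDeniedException", "UnauthorizedOperation"}


def classify_failure(exc):
    """Return 'permission', 'network', or 'other' for an exception from an AWS call."""
    # botocore's ClientError carries the service error in exc.response["Error"]["Code"].
    code = getattr(exc, "response", {}).get("Error", {}).get("Code")
    if code in PERMISSION_CODES:
        return "permission"
    # Heuristic: subnet or security group blocks tend to surface as timeouts or
    # connection errors, so match on the exception class name.
    name = type(exc).__name__.lower()
    if "timeout" in name or "connectionerror" in name:
        return "network"
    return "other"
```

Logging the returned label next to the original exception gives you one string to search for after a deployment, instead of guessing whether a silent failure was IAM or networking.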
Years ago, Lambdas were limited to 1 minute. Then they got 5 minutes. 5 minutes is still not enough once you have so much power at your fingertips.
The workflow started like this:
- Third party drops content on an S3 bucket
- drops a specifically named file that triggers our LambdaA function
- LambdaA uses AWS SDK to get a list of EC2 IP addresses behind ELB and tells LambdaB to update their content from S3.
- LambdaB, who can’t use the AWS SDK for security reasons, but CAN access the EC2s if given the IP addresses, calls a REST API on those EC2s
- EC2s respond pretty quickly once they’ve copied the changes
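The LambdaA half of that workflow can be sketched roughly like this. The trigger file name, the ELB name, and the payload shape are all hypothetical, and the sketch assumes a classic ELB, since that is what `describe_instance_health` on the `elb` client covers. Clients are passed in so the logic can be tested without AWS.

```python
import json

TRIGGER_KEY = "content-ready.txt"  # hypothetical name of the trigger file
ELB_NAME = "content-elb"           # hypothetical load balancer name


def instance_ips(elb_client, ec2_client, load_balancer_name):
    """Private IPs of the EC2 instances registered with a classic ELB."""
    health = elb_client.describe_instance_health(LoadBalancerName=load_balancer_name)
    ids = [state["InstanceId"] for state in health["InstanceStates"]]
    if not ids:
        # An empty InstanceIds filter would match every instance in the account.
        return []
    reservations = ec2_client.describe_instances(InstanceIds=ids)["Reservations"]
    return [
        inst["PrivateIpAddress"]
        for res in reservations
        for inst in res["Instances"]
        if "PrivateIpAddress" in inst
    ]


def handle_s3_event(event, elb_client, ec2_client, lambda_client):
    """LambdaA: react to the trigger file and hand the IP list to LambdaB."""
    keys = [record["s3"]["object"]["key"] for record in event.get("Records", [])]
    if TRIGGER_KEY not in keys:
        return {"triggered": False}
    ips = instance_ips(elb_client, ec2_client, ELB_NAME)
    lambda_client.invoke(
        FunctionName="LambdaB",
        InvocationType="RequestResponse",
        Payload=json.dumps({"ips": ips}).encode("utf-8"),
    )
    return {"triggered": True, "ips": ips}
```

A real Lambda entry point would be a thin `handler(event, context)` that builds the three clients with `boto3.client("elb")`, `boto3.client("ec2")`, and `boto3.client("lambda")` and delegates to `handle_s3_event`.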
That enjoyable experience didn’t last long. We found that syncing content ourselves was too hard to debug, so we used the s3 sync command instead. That’s where things went downhill. While our code on our EC2s became insanely easy and simple, this took 8 minutes for even 1 file. Our guess was that this was because of the CRC checksums of the thousands of files we had to verify against the S3 bucket.
Lambdas only last 5 minutes. Now what?
No Job Record
One solution proposed was to use DynamoDB, or another database, to log a job. You then use a CloudWatch cron job to have your Lambda run every minute, check the database for any outstanding jobs, and ask the EC2s for a status update. The EC2s, when doing their long jobs, will write their completion in Dynamo for the Lambdas to pick up and report on.
Dynamo isn’t approved for us for the same reason Lambdas have issues, specifically around non-encrypted data at rest, even with no PCI data. Given our project was a lift and shift, investing in a Postgres database in some container or EC2 strictly for the status of various EC2s copying content wasn’t going to fly.
No Step Functions
While Step Functions, a state machine around your Lambda work, were approved towards the end of our project, they invert the reactive architecture. Instead of S3 tells Lambda tells Lambda tells EC2 tells S3 tells CloudWatch tells Lambda…, it was EC2s polling the Step Function for work to do. The team had no appetite for the amount of code required to do this, including edge cases around when EC2s went down and new ones came up under the ELB.
S3 Job Record
Our Lambda ran every minute from CloudWatch cron jobs to check for statuses.
Lesson: There are multiple solutions for Lambdas whose work could take longer than 5 minutes: Step Functions, which invert the problem, or job records in a database or file system. 5 minutes sucks, I get it, but there are solutions around that “challenge”.
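To make the job-record idea concrete, here is a minimal sketch of a status file on S3 that the once-a-minute Lambda can poll. The JSON layout (a `hosts` map from IP to state) and the function names are invented for illustration; the article does not spell out the real record format.

```python
import json


def write_job(s3_client, bucket, key, status):
    """Store the job record as a small JSON object on S3."""
    s3_client.put_object(Bucket=bucket, Key=key, Body=json.dumps(status).encode("utf-8"))


def read_job(s3_client, bucket, key):
    """Load the job record back into a dict."""
    return json.loads(s3_client.get_object(Bucket=bucket, Key=key)["Body"].read())


def poll(s3_client, bucket, key):
    """Run by the once-a-minute Lambda: report whether every EC2 has finished."""
    job = read_job(s3_client, bucket, key)
    pending = sorted(ip for ip, state in job["hosts"].items() if state != "done")
    return {"complete": not pending, "pending": pending}
```

Each poll is short and stateless, so no single Lambda invocation ever has to outlive the 5 minute limit while the EC2s do their 8 minute sync.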
Lambda Invoke vs. Fire And Forget
The second was LambdaB waiting on 3 EC2s that took 8 minutes. Since we were using Python, it’s a blocking language. To use something like Twisted that allows concurrency in Python 2.x, you need to compile C++ libraries to get true concurrency. To compile C++ correctly, you need to ensure you do it for an Amazon compliant AMI (operating system virtual machine). To do that, you typically run your pip or npm commands in a Docker that houses the AWS AMI. This ensures the compiled binaries will work once uploaded to your Lambda.
Instead, I just had each EC2 IP REST call invoke its own Lambda. Instead of LambdaA invoking LambdaB to talk to the EC2s, I instead invoked 3 LambdaB instances and gave each one 1 IP Address to invoke. While they often timed out, it didn’t matter, since we had a log trail in CloudWatch and Splunk to verify all parties did what they were supposed to do.
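A sketch of that fan-out, assuming "fire and forget" means the SDK's asynchronous `Event` invocation type. The function name `fan_out` and the `{"ip": ...}` payload are illustrative, not the project's actual code.

```python
import json


def fan_out(lambda_client, function_name, ips):
    """Start one asynchronous Lambda invocation per EC2 IP address."""
    accepted = []
    for ip in ips:
        response = lambda_client.invoke(
            FunctionName=function_name,
            InvocationType="Event",  # fire and forget: returns once the event is queued
            Payload=json.dumps({"ip": ip}).encode("utf-8"),
        )
        # Asynchronous invocations answer with HTTP 202 when the event is accepted.
        if response["StatusCode"] == 202:
            accepted.append(ip)
    return accepted
```

Because the caller never waits on the response, it finishes in milliseconds no matter how long each EC2 takes, which is why the logs, and not the return value, become the record of what happened.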
Lesson: If your language’s concurrency is hard, use Lambda's built-in ability to spawn multiple Lambdas so you can use as many as you need...or just use Node for this type of thing. Or Python 3 (3 wasn’t available when we started).
Remember the “great S3 outage” in early 2017? You can view the AWS dashboard, as well as your account one, to see that S3, EC2, Lambda, and other services went down because of it.
… except, we have times when CloudWatch cron jobs went down, and weren’t reported on AWS’s public dashboard, nor our internal one. No notices at all. However, if you dig for the Lambda invocations, you can see clear gaps.
Here are the two pieces of good news. I don’t know what message system they are using behind the scenes, but those cron job timestamps are NOT actually lost. The 2 hours that CloudWatch was down, when it came up, it actually ran every single one of those missed “call this Lambda every minute”…all at once. There was then another 30 minutes of outage, and then another 30 calls, and everything was fine again.
The 2nd piece of good news is that 200 Lambda invocations worked because Lambdas (say it with me kids) “are made for burst traffic.”
I have no idea how to fix this. I usually throw beer and Teslas at the Ops team responsible for outages at 3 a.m. the next day when they come in all bleary-eyed.
Lesson: While CloudWatch cron jobs triggering your Lambda functions are mostly accurate to the second, they DO go down. While the gaps are eventually resolved, you better hope your code doesn’t have any weird state when 200 of those events happen at the same time.
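If you want to find those gaps without eyeballing the console, a few lines of Python over the invocation timestamps will do it. This sketch assumes you have already pulled the timestamps out of your logs as `datetime` objects; the one minute schedule and 30 second tolerance are defaults I picked for illustration.

```python
from datetime import datetime, timedelta


def find_gaps(timestamps, expected=timedelta(minutes=1), tolerance=timedelta(seconds=30)):
    """Return (start, end) pairs where consecutive invocations are too far apart."""
    ordered = sorted(timestamps)
    return [
        (earlier, later)
        for earlier, later in zip(ordered, ordered[1:])
        if later - earlier > expected + tolerance
    ]
```

Each returned pair brackets a stretch where the every-minute trigger did not fire, which is handy evidence when no public dashboard admits anything went wrong.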
The search in a CloudWatch stream is ok. The search across CloudWatch groups DOES NOT EXIST. By default, Lambdas create logs in a log group based on a time stamp. After a few seconds or a minute, another group is created. If you have to go back a day and see what went wrong, it’s a nightmare. Not only do you have to click each log group in the console, but you can’t middle click to open multiple at a time. Worse, each stream only shows about 20 log messages at a time. If your Lambda has a ton of logs or concurrent execution, you can scroll for minutes at a time to find the surrounding log messages.
In short, I encourage you to do what other smart people already knew: Use Splunk or Logstash in the ELK stack.
A Plan B that I learned is to create 1 log group with a special name and log EVERYTHING to that. That way, you can search the past year of messages, and they come up pretty quickly with no hunting.
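Once everything lives in one log group, you can also search it from code instead of the console. Here is a rough sketch using the CloudWatch Logs client's `filter_log_events` call and following its `nextToken` pagination. The function name `search_log_group` is an assumption, and the group name you pass in is whatever special name you chose.

```python
def search_log_group(logs_client, group_name, pattern, start_ms, end_ms):
    """Collect every message in one log group that matches a filter pattern."""
    kwargs = {
        "logGroupName": group_name,
        "filterPattern": pattern,
        "startTime": start_ms,  # milliseconds since the Unix epoch
        "endTime": end_ms,
    }
    messages = []
    while True:
        page = logs_client.filter_log_events(**kwargs)
        messages.extend(event["message"] for event in page.get("events", []))
        token = page.get("nextToken")
        if not token:
            return messages
        kwargs["nextToken"] = token  # ask for the next page of results
```

In real use, `logs_client` would be `boto3.client("logs")`. For very chatty groups this can return a lot of data, so keep the time window narrow.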
As a horrible Plan C, you can download the logs to S3 as a batch job from CloudWatch, and Python’s Whoosh library is very similar to ElasticSearch as you can do full-text searches and get back search results. I made some headway playing with this in my spare time, but in retrospect would have rather funneled my logs to Splunk or an ELK stack.
Tons of Lambda Invocations
There are 24 hours in a day. 60 minutes in an hour. That’s 1,440 minutes in a day. That’s also how many times our Lambdas are now running a day in both us-east-1 and us-west-2. 99% of the invocations are less than 100ms, and most are from CloudWatch. The S3 ones are about 12 in the wee hours of the morning, and the same Lambdas just take a different code base. Python 2 is fast for what we’re doing.
Are we saving money using these Lambdas vs. having an EC2 up all day? I haven’t done the math, but I’d say yes given they are uber fast, have no race conditions given our source of truth is 1 lock file on S3, and the EC2s remained focused on keeping legacy architecture running in the cloud with the latest security updates.
Lesson: Had I known I’d be running them 1,440 times a day vs. the 12 I thought we were doing, I would have invested in seeing how to get the legacy content deployment to ping our EC2s or ELB to let us know about new content to suck up from S3 vs. these 2 Lambdas + CloudWatch cron job in our deployment pipeline.
5,000 Invocation Limit
Apparently, Lambdas are limited to 5,000 invocations per AWS account. This means that in a company of my size, even in just my old department, that uses crazy amounts of AWS stuff, you actually risk your Lambda “not running” because there are no EC2s to run your code. I have no idea what happens here. Are you in a queue? Do you get an error from the trigger service?
Lesson: This fear has caused my colleagues to stop recommending Lambdas for Node services, and instead encourage automated ECS Docker clusters for now, and the upcoming “you build your container, AWS handles everything else.”
Deployment was a huge contention amongst my team. I was a huge proponent of automating everything. This passion wasn’t shared — and became less so over time.
In my company, developers who write the code are not allowed to push it to production per SEC rules. Yet, we still follow the “you build it, you maintain it” mantra. Well, some teams anyway. We have to use a dedicated Ops team whose sole job is to push and roll back code and infrastructure, and handle 2nd tier support requests. It’s an adversarial and generally horrible relationship. I hate them, they loathe us. Given QA and Production aren’t 100% the same, you sometimes have to fail a deployment on purpose just to learn about how your code behaves on production, and what errors were thrown in your logs so you can recover and debug and fix. Getting a deployment lined up is tons of emails and stupid paperwork that is zero fun and spaced out a week vs. “multiple times a day.” There are ways of bypassing this, but that’s a story for another day. We have to manually write the steps in a Word document, and depending on the Ops guy you get, they could fail the deployment based on a mis-ordered step or misspelling.
Thus, you have a huge incentive to make it as easy as possible for the Ops team to deploy. The ideal is a 1 click Jenkins job, or a simple shell script.
As I’ve written, deploying Lambdas is super easy. Debugging IAM permissions is not. I was close to solving this, ensuring your Jenkins jobs could run Python I wrote to destroy and rebuild and redeploy our Lambda code + CloudWatch Cron jobs + integration tests for us-east-1 and us-west-2. For reasons out of my control, they thought it’d be faster to go the Word doc route. While the AWS console is nice, it’s a nightmare to document setting up a Lambda using VPC, subnets, and security rules with the litany of other options required. We had multiple failed deployments because of “bad Word documents”, not our code.
In my professional opinion, you have zero excuse not to automate the deployment of immutable Lambda resources. The AWS SDK for Python and Node is solid, and if an art student can automate all that given his first time using AWS in just 5 months, so can your team.
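To give a feel for how little code that automation needs, here is a minimal sketch of a deploy step a Jenkins job could run. It is not the script I wrote: it updates an existing function instead of destroying and rebuilding it, it assumes the function's permission for CloudWatch Events to invoke it was already granted, and the target Id is arbitrary.

```python
def deploy(lambda_client, events_client, function_name, zip_bytes, rule_name):
    """Push new code and (re)create the once-a-minute CloudWatch schedule."""
    # Upload the new deployment package for an existing function.
    function_arn = lambda_client.update_function_code(
        FunctionName=function_name, ZipFile=zip_bytes
    )["FunctionArn"]
    # put_rule creates the rule or overwrites it if it already exists.
    rule_arn = events_client.put_rule(
        Name=rule_name, ScheduleExpression="rate(1 minute)", State="ENABLED"
    )["RuleArn"]
    # Point the schedule at the function we just updated.
    events_client.put_targets(Rule=rule_name, Targets=[{"Id": "1", "Arn": function_arn}])
    return {"function": function_arn, "rule": rule_arn}
```

To cover both regions, you would call `deploy` once with clients built for `us-east-1` and once for `us-west-2`, for example `boto3.client("lambda", region_name="us-east-1")` and `boto3.client("events", region_name="us-east-1")`.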
Where Do We Go From Here?
A guy reached out to me on the company Slack asking for help in Python. He’d never coded in his life, is about my age (38), and manages databases that were recently moved from on-prem to AWS. He basically wanted to have an alarm (an AWS log event) trigger a Lambda whose purpose was, if the database was running out of hard drive space, to allocate 50 megs more. I gave him a crash course in Python and what I had learned of AWS and Lambdas, and he was up and running in prod in about a month with a 30-line Python Lambda.
The people using AWS Batch and other short-lived EC2s for minutes to seconds worth of work are moving some of that to Lambdas for price reasons. Various Lambdas are in the background, facilitating various logging, monitoring, or reactionary/reactive roles to help functionality in Ops or for developers. It’s become “just another tool in the toolbox,” not some new, shiny thing to not be trusted by the cynics.
There are even rumors that AWS waited so long for EC2s being charged by the second recently to ensure they didn’t cannibalize their Lambda offerings.
It’s also telling that 100 lines of Node + 1 shell command for internal security reasons at my company can provide your Continuous Deployment solution, whereas for Docker/ECS you need this shell + Terraform + Jenkins behemoth. Large swaths of code never last long. Just ask the C++ guys who moved to Java, or then moved to Ruby, or then moved to Elixir/Crystal/Go. Same reason why the Ansible/Chef crew likes Terraform instead, and yet still invests in Kubernetes. We’re still at the beginning of serverless, where cloud providers make it easy for people like me, yet hard when you’re in the Enterprise and need security and scale for everything.
If I were outside of CapitalOne, I’d use Lambda + API Gateway for my REST services, and S3 for my static websites. Internally, despite the fact they aren’t really that great for REST API’s, they still have a ton of use everywhere else on the edges of your infrastructure, or filling crucial, concurrent pipelines that can take bursts of traffic.
The constant in both places: They are growing, and more and more people from a variety of skill sets want to use them at my company. That’s telling.
Published at DZone with permission of Jesse Warden, DZone MVB. See the original article here.