Dev discussion with Rocky Madden, an avid functional programming, computational science, and high-scalability engineer. Lover of the sciences. Core contributor on the Cloud Elements Platform team. Background: The Cloud Elements Platform is transitioning towards more microservices, including serverless microservices, in its architecture.
One of the key pieces in extending our platform is the use of a Functions-as-a-Service (FaaS) platform. Functions execute code when triggered by events, on a regular schedule, or in an ad-hoc manner. However, current commercial and open source services have issues with performance and cost and, most importantly, lack features we need for our use cases, so we built our own FaaS platform.
Was there a key driver in taking on more functional programming approaches?
Rocky Madden: As we looked at our growth rate and the scaling needs we were moving towards, it was clear we needed a platform that scales extremely well. For example, in building out our new Formulas microservice, we embraced deeply functional and logic programming constructs and were able to compile each customer's Formula into a single function. This yields performance and scaling characteristics far beyond what others on the market are able to accomplish. As we look to upgrade Bulk, Transformations, and other Cloud Elements services, it is likely they too will adopt similar approaches.
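To make "compile a Formula into a single function" concrete, here is a minimal sketch. It assumes, purely for illustration, that a Formula can be modeled as an ordered list of pure step functions; `compileFormula`, `normalize`, and `tag` are hypothetical names, and the actual Formulas compiler is not described in this discussion.

```javascript
// Hypothetical model: a "formula" is an ordered list of pure step functions.
// compileFormula folds them into one function, so running the formula is a single call.
const compileFormula = (steps) => (input) =>
  steps.reduce((value, step) => step(value), input);

// Illustrative steps only; these are not real Formula steps.
const normalize = (contact) => ({
  ...contact,
  email: contact.email.trim().toLowerCase(),
});
const tag = (contact) => ({
  ...contact,
  tags: [...(contact.tags || []), "synced"],
});

// The compiled result is an ordinary function that can be cached and called repeatedly.
const syncContact = compileFormula([normalize, tag]);
const result = syncContact({ email: "  Ada@Example.com " });
console.log(result);
```

The point of the sketch is only that the per-step structure is resolved once, at compile time, rather than being interpreted on every execution.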
Can you describe what it has been like to build out our own Function as a Service?
RM: When we were first scoping it out, we went through what needs a Function-as-a-Service engine would have in the general sense. Next, we took our day-to-day use cases, which turn out to be edge cases for most others, and determined that no commercial or open source FaaS came close to solving many of them. For example, all other FaaSes put a hard limit on the execution duration of a function. For AWS Lambda, this is 5 minutes. This is, in the general sense, a very reasonable limitation that sidesteps Halting Problem-like concerns.
However, this is an example of a limitation we absolutely cannot have in any FaaS we adopt. We have Formulas that our customers have created which execute for hours, days, or perhaps even longer. One of the ways we ended up solving this was by having our FaaS provide an object as the last argument (the tail position) to all functions. Functions can opt in to using this object, which contains, among other things, an event bus shared between the FaaS and the function. Functions can then run for as long as they wish, but when they receive a "pause" event from the FaaS, they must halt and emit a "paused" event in return, carrying the arguments they can later be called with to "resume".
Another example is our need to solve not only for long-running functions, but also for extremely low-latency use cases where a function might be used in a hot code path, either internally or externally by our customers. Some FaaSes will use ephemeral containers. The idea is to spin up a Docker container, run the function, and then destroy the container. This works well, but introduces container creation latency and doesn't scale well once you get into the range of tens of thousands of function executions per second. What we have instead is the idea of persistent customer containers. Each customer has a set of auto-scaling containers whose specs and count are determined by that customer's needs. Each time a function is executed for a customer, it runs within their already running containers.
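The pause and resume handshake described above might look roughly like the following sketch. It uses Node's built-in `EventEmitter` as the shared bus. The `bus` property name, the `progress` event, the `resumeArgs` payload shape, and the `sumRange` function are all assumptions made for illustration; only the "pause" and "paused" event names come from the discussion.

```javascript
const { EventEmitter } = require("node:events");

// Hypothetical function: adds the integers from `start` to `limit` onto `total`.
// The FaaS-provided object arrives as the last argument; `bus` is an assumed name.
function sumRange(limit, start, total, { bus }) {
  let pauseRequested = false;
  const onPause = () => { pauseRequested = true; };
  bus.once("pause", onPause);

  for (let i = start; i <= limit; i += 1) {
    if (pauseRequested) {
      // Halt and hand back the arguments needed to pick up where we stopped.
      bus.emit("paused", { resumeArgs: [limit, i, total] });
      return undefined;
    }
    total += i;
    bus.emit("progress", { i, total }); // assumed event, used here to drive the demo
  }
  bus.off("pause", onPause);
  return total;
}

// Simulated FaaS side: record the resume arguments, and ask for a pause
// once the function reports that it has processed i = 3.
const bus = new EventEmitter();
let resumeArgs = null;
bus.on("paused", (event) => { resumeArgs = event.resumeArgs; });
bus.on("progress", ({ i }) => { if (i === 3) bus.emit("pause"); });

const first = sumRange(10, 1, 0, { bus });      // halts early, returns undefined
const final = sumRange(...resumeArgs, { bus }); // resumes with [10, 4, 6]
console.log(first, resumeArgs, final);          // undefined [ 10, 4, 6 ] 55
```

This demo is synchronous so the flow is easy to follow. A real long-running function would do its work asynchronously and yield to the event loop so that a "pause" event can arrive at any time.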
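The difference between the two container models can be illustrated with a toy simulation. Nothing here starts a real container: `makeContainerFactory`, `runEphemeral`, and `CustomerPool` are hypothetical names, the pool size is fixed rather than auto-scaled, and "creating a container" is just a counter increment standing in for the expensive step.

```javascript
// Hypothetical stand-in for container creation; a real system would start a
// Docker container here, which is where the latency comes from.
function makeContainerFactory() {
  let created = 0;
  return {
    create() {
      created += 1;
      return { id: created, run: (fn, ...args) => fn(...args) };
    },
    count: () => created,
  };
}

// Ephemeral model: every execution pays for a new container.
function runEphemeral(factory, fn, ...args) {
  return factory.create().run(fn, ...args);
}

// Persistent model: a customer's containers are created once and reused round-robin.
class CustomerPool {
  constructor(factory, size) {
    this.containers = Array.from({ length: size }, () => factory.create());
    this.next = 0;
  }

  run(fn, ...args) {
    const container = this.containers[this.next];
    this.next = (this.next + 1) % this.containers.length;
    return container.run(fn, ...args);
  }
}

const double = (n) => n * 2;

const ephemeralFactory = makeContainerFactory();
for (let i = 0; i < 100; i += 1) runEphemeral(ephemeralFactory, double, i);

const pooledFactory = makeContainerFactory();
const pool = new CustomerPool(pooledFactory, 2);
for (let i = 0; i < 100; i += 1) pool.run(double, i);

console.log(ephemeralFactory.count(), pooledFactory.count()); // 100 2
```

With one hundred executions, the ephemeral model creates one hundred containers while the pooled model creates two up front, which is the creation cost that persistent customer containers keep out of the hot path.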
Can you walk through the difficulty of building out our own FaaS engine?
RM: We iterated on the initial design for a while, trying different things and making pivots. Over time, for example, we moved to tighter coupling with the Node ecosystem to add features that were not possible otherwise. As an example, we wanted to make it dead-simple to publish a function, including functions capable of leveraging any NPM package. Some FaaSes have you bundle your dependencies, which is flexible but adds complexity and friction to getting your function published. Some FaaSes, which are also tightly coupled to the Node ecosystem, have an outdated and limited whitelist of allowed NPM modules.
Instead, we aligned with the ideas behind building a Node project. You declare your dependencies upfront as one property, just as they look in "package.json", and provide your function source in another property. The function source property maps 1:1 to a module in Node, where the module exports a single function. This has the neat benefit that you can write your functions locally as a Node project, with tests and all the other tooling you wish. You can then publish the functions without modification to our FaaS and it just works. Once a function is published, the FaaS automatically installs any NPM dependency for you at runtime, without restriction. We also handle use cases in which one function uses version 1 of a package and another function uses version 2. We did this via a now open source project called "multi-tool", available on GitHub.
What are some of the cons we faced when building out our own FaaS?
RM: Well, I suppose the biggest con is the fact that we had to build out our own FaaS. There is a lot of serious engineering and operational complexity in building your own FaaS that shouldn't be taken lightly. But, as I mentioned previously, after running through our use cases we knew we had to head down this path due to the problems we solve for our customers. Part of what makes Cloud Elements an amazing solution is our ability to solve really hard problems others can't. Solving really hard problems means dealing with really hard edge cases, and that permeates through to our FaaS requirements and needs.
Another difficulty, one that we have self-imposed, is creating features in our FaaS that facilitate solving said hard problems, but doing it in a way that is generic and reusable. Internally on our engineering team, we think of the FaaS as a product itself, one that might be used in interesting new ways in the future. So, we often get into territory where we solve a problem at hand in one service but approach it in a manner that makes the solution generally applicable. For example, our FaaS is real-time and includes real-time event streams. This ended up solving a handful of specific needs in the Formulas service, but it is now also a generally reusable construct that is available to any FaaS consumer in the future. All sorts of interesting new use cases have come up from this one feature we added.
What about testing? Have there been new practices or frameworks implemented as part of moving towards more FaaS approaches?
RM: New services have definitely changed from a testing perspective. All new services have 100% coverage of all branches, functions, lines, and statements. But that's just the starting point. Each individual service often includes benchmarking tests, leak tests, integration and system tests, vulnerability tests, and more. We compose all this up, so each new composition of services also has all these great tests. Beyond that, just because the items I mentioned are covered doesn't mean they were truly tested. For example, did you simply test that an object was returned from a function in a unit test, or did you actually test every part of the object? We do the latter.