
Lessons Learned: Serverless Chatbot Architecture

There's more to this prize-winning chatbot than meets the eye. Take a look at how this one project makes use of serverless architecture to help DevOps teams.


marbot forwards alerts from AWS to your DevOps team via Slack. It was one of the winners of the AWS Serverless Chatbot Competition in 2016. Today, I want to show you how marbot works and what we've learned so far.

Let’s start with the architecture diagram.

marbot architecture diagram

Architecture

The marbot API is provided by an API Gateway, which receives our incoming requests.

The API Gateway forwards HTTP requests to one of our Lambda functions. All of them are implemented in Node.js and store their state in DynamoDB tables.
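The request flow can be sketched roughly like this. The table name, item shape, and handler logic below are assumptions for illustration, not marbot's actual code; the response object follows the format that API Gateway's Lambda proxy integration expects.

```javascript
// Hypothetical sketch of one HTTP-handling Lambda function:
// API Gateway passes the HTTP request as an event, the handler
// stores state in DynamoDB and returns an HTTP response.

// Build the DynamoDB PutItem parameters for a new alert.
function alertItemParams(alert) {
  return {
    TableName: 'alerts', // assumed table name
    Item: {
      alertId: {S: alert.id},
      teamId: {S: alert.teamId},
      createdAt: {N: String(alert.createdAt)}
    }
  };
}

// Build the response object API Gateway's Lambda proxy integration expects.
function apiResponse(statusCode, body) {
  return {
    statusCode: statusCode,
    headers: {'Content-Type': 'application/json'},
    body: JSON.stringify(body)
  };
}

// The handler wires both together; `dynamodb` would be
// `new AWS.DynamoDB({apiVersion: '2012-08-10'})` in real code.
function handler(event, dynamodb, callback) {
  const alert = JSON.parse(event.body);
  dynamodb.putItem(alertItemParams(alert), function(err) {
    if (err) {
      callback(null, apiResponse(500, {error: 'could not store alert'}));
    } else {
      callback(null, apiResponse(201, {id: alert.id}));
    }
  });
}
```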

One special case is the Slack Button API: when you press a button in a Slack message, marbot has 2 seconds to respond to it. Responding to a button press may require several calls to the Slack API.

Lessons Learned

Decoupling the Process

By looking at our CloudWatch data, we learned that we missed the 2-second timeout quite often. To avoid this, before we respond to the API request we now only put a record containing all relevant data into a Kinesis stream. Writing to Kinesis is a fast operation, and we haven’t seen 2-second timeouts since we switched to Kinesis streams.
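The "respond first, work later" pattern can be sketched like this. The stream name and payload fields are assumptions, not marbot's real schema:

```javascript
// Before answering Slack's button request, we only write one record to
// a Kinesis stream; the slow Slack API calls happen later in a consumer.

// Build the Kinesis PutRecord parameters for a button press.
function buttonPressRecord(press) {
  return {
    StreamName: 'marbot-events',   // assumed stream name
    PartitionKey: press.channelId, // keeps a channel's events in order
    Data: JSON.stringify({         // everything the consumer will need
      type: 'button-press',
      channelId: press.channelId,
      userId: press.userId,
      action: press.action
    })
  };
}

// In the Lambda function behind API Gateway you would then do:
//   kinesis.putRecord(buttonPressRecord(press), function(err) { ... });
// and respond to Slack right away, well within the 2-second limit.
```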

We read from the Kinesis stream as soon as possible and process the records within a Lambda function. Kinesis comes with its own challenges: if you fail to process a record, the Lambda Kinesis integration retries that record until it expires from the stream. All newer records on that shard will not be processed until the failed record expires or you fix the bug!
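A consumer sketch makes this retry behavior concrete (the processing logic is illustrative). Lambda delivers a batch of Kinesis records; if the callback is invoked with an error, the whole batch is retried, which is exactly why one poisoned record can block a shard:

```javascript
// Stand-in for real processing, so the sketch is runnable.
const handled = [];
function handleEvent(payload) {
  handled.push(payload);
}

// Lambda handler for a Kinesis event source.
function handler(event, context, callback) {
  try {
    event.Records.forEach(function(record) {
      // Kinesis payloads arrive base64 encoded.
      const payload = JSON.parse(
        Buffer.from(record.kinesis.data, 'base64').toString('utf8'));
      handleEvent(payload); // a throw here fails the batch -> retry
    });
    callback(null);
  } catch (err) {
    callback(err); // Lambda will retry the same batch
  }
}
```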

We also thought about using SQS, but:

  • there is no native SQS Lambda integration
  • we could not build one ourselves that is serverless and responds within a second

So we decided to use Kinesis, knowing that a single error can stall our whole processing pipeline.

Resilient Remote Calls

HTTP requests are hard. A lot of things can go wrong. Two things we learned early on when talking to the Slack API:

  1. Set timeouts: we currently use 3 seconds and are thinking about reducing this to 2 seconds.
  2. Retry on failures such as timeouts or 5XX responses.

Our Node.js implementation of Slack API calls relies on the requestretry package:

const requestretry = require('requestretry');
const AWSXRay = require('aws-xray-sdk');
function invokeSlack(method, qs, cb) {
  requestretry({
    method: 'GET',
    url: `https://slack.com/api/${method}`,
    qs: qs,
    json: true,
    maxAttempts: 3, // try at most 3 times (including the first attempt)
    retryDelay: 100, // wait 0.1 seconds between two retries
    timeout: 3000, // timeout after 3 seconds
    httpModules: {
      'http:': AWSXRay.captureHTTPs(require('http')), // enable X-Ray tracing for http calls
      'https:': AWSXRay.captureHTTPs(require('https')) // enable X-Ray tracing for https calls
    }
  }, function(err, res, body) { /* ... */ });
}


The following screenshot shows an X-Ray trace in which Slack API calls were retried because they hit the 3-second timeout.

X-Ray trace

Implementing Timers on AWS

For every alert that arrives in marbot, we keep a timer. Five minutes after the alert is received, we check whether someone acknowledged it. If not, we escalate the alert to another engineer or to the whole team. We decided to use SQS queues for this: if you send a message to an SQS queue, you can set a delay, and the message only becomes visible in the queue after the delay has elapsed. Exactly what we need! The only downside to this solution is that there is no native way to connect Lambda and SQS, but with a few lines of code you can implement this on your own.
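The timer can be sketched like this. The queue URL and message shape are placeholders, not marbot's real configuration:

```javascript
// Build the SQS SendMessage parameters for a delayed escalation check.
function escalationMessage(alertId, delayMinutes) {
  return {
    QueueUrl: 'https://sqs.eu-west-1.amazonaws.com/123456789012/escalations', // placeholder
    DelaySeconds: delayMinutes * 60, // SQS supports delays of up to 15 minutes
    MessageBody: JSON.stringify({alertId: alertId})
  };
}

// On arrival of an alert:
//   sqs.sendMessage(escalationMessage(alert.id, 5), function(err) { ... });
// Five minutes later the message becomes visible; the polling consumer
// checks whether the alert was acknowledged and escalates if not.
```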

Keeping Secrets Secure

We use git to version our source code. To communicate with the Slack API, we need to store a secret that we use to authenticate with Slack. We keep such secrets in a JSON file that is added to git as well, but we encrypt the whole file with KMS before putting it into git, using the AWS CLI:

aws kms encrypt --key-id XXX --plaintext fileb://config_plain.json --output text --query CiphertextBlob | base64 --decode > config.json

Make sure to put config_plain.json into your .gitignore file!

Outside of the Lambda handler code, we use this code snippet to decrypt the configuration:

const fs = require('fs');
const AWSXRay = require('aws-xray-sdk');
const AWS = AWSXRay.captureAWS(require('aws-sdk'));
const kms = new AWS.KMS({apiVersion: '2014-11-01'});
const config = new Promise(function(resolve, reject) {
  fs.readFile('config.json', function(err, encrypted) {
    if (err) {
      reject(err);
    } else {
      kms.decrypt({CiphertextBlob: encrypted}, function(err, data) {
        if (err) {
          reject(err);
        } else {
          try {
            resolve(JSON.parse(data.Plaintext.toString('utf8'))); // Plaintext is a Buffer
          } catch (err) {
            reject(err);
          }
        }
      });
    }
  });
});


Inside the Lambda handler code, you can access the config like this:

config
  .then(function(c) {
    // do something
  })
  .catch(function(err) {
    // handle error
  });


Using this approach, you make only one API call to KMS per Lambda container: the promise resolves once, and its result is reused across invocations.

Getting Insights

We use custom CloudWatch metrics to get insights into:

  • How many Slack teams installed marbot
  • The number of alerts and escalations created
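Publishing one of these business metrics might look like the following sketch. The namespace and metric name are assumptions, not marbot's real ones:

```javascript
// Build the CloudWatch PutMetricData parameters for a business metric.
function alertCountMetric(count) {
  return {
    Namespace: 'marbot', // assumed custom namespace
    MetricData: [{
      MetricName: 'AlertsCreated',
      Unit: 'Count',
      Value: count,
      Timestamp: new Date()
    }]
  };
}

// cloudwatch.putMetricData(alertCountMetric(1), function(err) { ... });
```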

We use a CloudWatch Dashboard to display those business metrics together with some technical metrics.

marbot dashboard

Deploying the Infrastructure

Our pipeline for deploying marbot works like this:

  1. Download dependencies (npm install)
  2. Lint code
  3. Run unit tests (we mock all external HTTP calls with nock)
  4. cloudformation package
  5. cloudformation deploy to an integration stack
  6. Run integration tests with newman
  7. cloudformation deploy to a prod stack

Jenkins runs the pipeline. Since our code is hosted on Bitbucket, we cannot easily use CodePipeline at the moment.



Published at DZone with permission of Michael Wittig, DZone MVB. See the original article here.

Opinions expressed by DZone contributors are their own.
