Over a million developers have joined DZone.
{{announcement.body}}
{{announcement.title}}

How to Create a Downtime Dashboard Tool With Zapier and Google Sheets

DZone's Guide to

How to Create a Downtime Dashboard Tool With Zapier and Google Sheets

Sometimes, the most useful projects are hacked together with whatever resources happen to be available.

· Performance Zone ·
Free Resource

xMatters delivers integration-driven collaboration that relays data between systems, while engaging the right people to proactively resolve issues. Read the Monitoring in a Connected Enterprise whitepaper and learn about 3 tools for resolving incidents quickly.

At Lucid, we want to monitor the health of our services both individually and in aggregate. Last month, we worked on setting up a new downtime dashboard tool. We wanted the following functionalities:

  • A record of incidents containing the duration of downtime, the service(s) affected, and information about the cause and resolution.
  • Ability to mark incidents as false positives and exclude them from downtime calculations.
  • Ability to correct the downtime recorded for an incident.
  • Ability to see the number of minutes of downtime for one or more services over a given time period.
  • A graph of monthly downtime per service.
  • A graph of total downtime.

After several hours of searching, we weren’t able to find any tool that could do what we wanted better than a spreadsheet. So, we decided to use Google Sheets to record incidents and graph trends. The next problem was how to get the incident data into Google Sheets. At first, we were going to write a Cron job that would poll OpsGenie for alerts, process any alerts about downtime, and use Zapier to add rows to a Google Sheet. However, after looking into Zapier a little, we discovered Zapier could actually do everything we needed the Cron job to do.

Diagram of data flow from OpsGenie to Google Sheets

Zapier is a tool that makes it easier to send data between different cloud services. In many ways, it is similar to piping Unix commands together. You take data from a source, perform some transformations on the data, and feed it into a destination. In our case, the source was OpsGenie and the destination was a Google Sheet.

Step 1: Getting the Data

The first thing that we needed to do was get the incident information into Zapier. There are two kinds of web hooks that Zapier supports: catch hooks and polling. With a catch hook, Zapier provides you with a URL to which you can post JSON data to begin the pipeline. With polling, on the other hand, you give Zapier a URL which it will periodically fetch and then you look for differences. If there is a difference from the last poll, it will start processing the new data. Catch hooks are preferable since a request is only made when new data is actually available. However, depending on your application, catch hooks may not be feasible.

Selecting the type of web hook

Fortunately, OpsGenie has the capability to send data to a web hook on various events (the web hook integration), so we added a web hook that would send any alerts containing the string “downtime:” in the message to Zapier when the alert was closed.

Unfortunately, the incident data sent by OpsGenie didn’t contain the time the alert was closed at, which we needed to calculate the duration of the downtime. Thankfully, Zapier had another feature that allowed us to get the information we needed. As a second step in the Zapier pipeline, we added another web hook that makes a GET request to OpsGenie to get the full incident report using the ID sent from OpsGenie to Zapier in the first step.

At this point, we have all the information we need about the incident in Zapier.

Data values pulled in from OpsGenie

Step 2: Sending Data to Google Sheets

The main reason that we went with Zapier is that it makes it extremely easy to add data to Google Sheets. Essentially, you just create a template that uses data retrieved from previous steps in the pipeline and that template is used to create or update a row in a spreadsheet. The template is basically a form with fields that come from a header row on the spreadsheet that the data will be exported to. The sheets integration also has a “Lookup Spreadsheet Row” option, which allows you to look up a row based on a key column and possibly add a new row if one doesn’t exist. This is the option we used to avoid adding duplicate entries using the alert ID as the deduplication key.

Google Sheets Zapier template

One slight inconvenience is that Zapier doesn’t really have a good way to handle spreadsheet expressions in the template that reference cells in the same row. There are two ways to deal with this. The first is to output a spreadsheet expression in the template that uses the inputs from earlier in the pipeline directly. For example, the start field in the screenshot. This works in many cases, but not always. In our case, we wanted the user to be able to update two columns in the spreadsheet (“Additional Downtime” and “Include in calculation”) and automatically update the “Total downtime” column. In order to do this, we needed to create a formula that could somehow reference cells in the same row without the actual cell name.

This is possible using the INDIRECTand ROW functions. For example, in order to reference the D cell on the current cell, you would use INDIRECT("D"&ROW()). This concatenates the string “D” with the number of the row and then converts the string to a cell reference. This works pretty well but is somewhat brittle since if the order of the columns ever changes the Zapier template needs to be updated to match.

Finally, although not completely necessary, transforming the data a little bit in Zapier before sending it to Google Sheets made building the spreadsheet easier and more performant. This is possible using the “Code by Zapier” action, which allows you to run a simple javascript function to transform your data. We did three things with this:

  1. Parse the name of the service from the message of the OpsGenie alert.
  2. Convert the timestamp of the alert creation from nanoseconds to days because days are the units Google Sheets uses for datetimes.
  3. Convert the closeTime (i.e., the duration) from milliseconds to minutes.

Step 3: Analyzing Data

Once we had the incident data on a page of our spreadsheet, we needed to create pages for the analysis and graphs. The graphs themselves are pretty straightforward, but generating the data to graph is less so.

First, we wanted to be able to calculate the minutes of downtime and percentage of uptime for each service between a start and end date. In order to do so, we created a cell for the start date and a cell for the end date. Conveniently, Google sheets will pop up a date selector if you double click on a cell containing a date.

To sum all the downtime for all incidents in the time range split by services we use the following formula =sumifs(incidents!C:C,A2,arrayformula(datevalue(incidents!D:D)),">"&$D$2,arrayformula(datevalue(incidents!D:D)),"<"&$E$2, where incidents!C:C is a range containing the total downtime for the incidents, A2is the cell containing the name of the service, incidents!D:D is a range containing the timestamps corresponding to the durations in C, and D2 and E2 are the cells containing the start and end date, respectively. To make the dashboard easier to use, we also generate the service names using the formula =sort(unique(incidents!C:C)).

Spreadsheet formula and incidents sheet header

Creating data for graphs by service works very similarly, except that rather than using specific cells for the start and end date, we use the boundaries of months and create a grid of service vs month that can then be graphed.

Graphing total downtime is a little more complicated because if multiple services have downtime at the same time, then we don’t want to double-count the overlapping time (because typically that means they are related). To do that we wrote a javascript function to use in Google app script, which is accessible from Tools > Script Editor, which takes a range containing the start and end times for each incident and calculates the total downtime collapsing any overlapping time ranges. Here is the code:

function totalDowntime(range) {
  if (typeof(range) != 'object') {
    return 0;
  }
  var merged = [];
  var current = null;
  for (var i = 0; i < range.length; i++) {
    var start = range[i][0],
        end = range[i][1];
    if (current && start <= current[1]) {
      current[1] = end;
    } else {
      current = [start, end];
      merged.push(current);
    }
  }
  return merged.reduce(function(minutes, dur) {
    return minutes + (dur[1] - dur[0])/60000;
  }, 0);
}

Once defined, the function can be used from formulas in cells of the spreadsheet.

Conclusion

Although we were disappointed when we were unable to find a downtime dashboard tool that provided the functionality we needed, it was surprisingly easy to build something that worked using Zapier and Google Sheets. So far, we have been pretty happy with the result. Eventually, we may need something a little more robust and scalable, but for now, an automatically generating spreadsheet is sufficient. What was your most successful project that was quickly hacked together with minimal resources?

3 Steps to Monitoring in a Connected Enterprise. Check out xMatters.

Topics:
zapier ,performance ,google sheets ,downtime

Published at DZone with permission of

Opinions expressed by DZone contributors are their own.

{{ parent.title || parent.header.title}}

{{ parent.tldr }}

{{ parent.urlSource.name }}