DZone
Thanks for visiting DZone today,
Edit Profile
  • Manage Email Subscriptions
  • How to Post to DZone
  • Article Submission Guidelines
Sign Out View Profile
  • Post an Article
  • Manage My Drafts
Over 2 million developers have joined DZone.
Log In / Join
Refcards Trend Reports Events Over 2 million developers have joined DZone. Join Today! Thanks for visiting DZone today,
Edit Profile Manage Email Subscriptions Moderation Admin Console How to Post to DZone Article Submission Guidelines
View Profile
Sign Out
Refcards
Trend Reports
Events
Zones
Culture and Methodologies Agile Career Development Methodologies Team Management
Data Engineering AI/ML Big Data Data Databases IoT
Software Design and Architecture Cloud Architecture Containers Integration Microservices Performance Security
Coding Frameworks Java JavaScript Languages Tools
Testing, Deployment, and Maintenance Deployment DevOps and CI/CD Maintenance Monitoring and Observability Testing, Tools, and Frameworks
Culture and Methodologies
Agile Career Development Methodologies Team Management
Data Engineering
AI/ML Big Data Data Databases IoT
Software Design and Architecture
Cloud Architecture Containers Integration Microservices Performance Security
Coding
Frameworks Java JavaScript Languages Tools
Testing, Deployment, and Maintenance
Deployment DevOps and CI/CD Maintenance Monitoring and Observability Testing, Tools, and Frameworks
  1. DZone
  2. Testing, Deployment, and Maintenance
  3. Maintenance
  4. Monitoring Your Customers with Selenium and Nagios

Monitoring Your Customers with Selenium and Nagios

Geoffrey Papilion user avatar by
Geoffrey Papilion
·
Sep. 23, 12 · Interview
Like (0)
Save
Tweet
Share
9.45K Views

Join the DZone community and get the full member experience.

Join For Free

In a brief conversation with Noah Sussman at DevOps Days, when discussing the challenges of continious deployment for B2B services with SLAs, we got side tracked discussing using Selenium and Nagios in production.

A few years back while working for a B2B company that was compensated by an attributable sales, I got on a phone call early in the morning to discuss fixing a client side display issue. The previous night, after a release a integration engineer modified a config that broke our service from rendering on almost every page at a single customer. The bug was fairly subtle, allowing what he was working on to display correctly, but breaking every other div on the site. This was pushed in early Spring, at off hours, and caught at the beginning of the day on the east coast.

At 9am PST, we held a post-mortem with all of our engineers. We discussed the impact of the issue on our revenue, which fortunately was pretty small, and laid out the timeline. Immediately, we discussed whether this was a testing issue or a monitoring failure. The CEO came back, and said while it was understandable that we missed the failure, our goal as an Ops team should be to catch any rendering issue within 5 minutes of a failure. I was a little annoyed, but agreed to build a series of test to try and catch this.

Why Other Metrics Failed Us

We had a fairly sophisticated monitoring setup at this point. We tracked daily revenue per customer, and we would generally know within 30 minutes if we had a problem. Our customers were very sites, but typically for US only sites had almost no traffic between 0:00 PST/PDT and 6:00 PST/PDT; in that time period it wasn’t unusual to have 0-2 sales. Once we got into a busier sales period the issue was spotted withing 30 minutes, and we were alerted. During this time period it turns out our primary monitoring metric for this type of failure was useless.

QA Tools Can Solve Ops Problems Too

I was familiar with Selenium from some acceptance tests I helped our QA guys write. I began to put together a test suite that met my needs(can’t provide the code for this, sorry). It consisted of:

  • rendering the main page
  • navigating to a page which we displayed content
  • clicking a link we provided
  • verifying that we displayed our content on a new page

This worked fairly well, but I had to tweak some timings to make sure the page was “visable”. I rigged this up to run through jUnit, and left a selenium server running all the time. Every 5 minutes the test suite would execute, leaving behind a log of successes and failures. We eventually built a test suite for every sizable customer. Every 5 minutes we checked the output of the jUnit with a custom Nagios test, that would tell us which customers had failures, an send an individual alert for each one.

Great Success!

I was really annoyed when I first had this conversation with the CEO; I thought this was a boondoggle that ops should not be responsible for. Within the first month my annoyance turned to delight as I started getting paged when our customers had site issues. I typically called them before their NOC had noticed, and most of the time these were issue they introduced to their site. I’d do it again in a heartbeat, and recommend that anyone else give it a try.

Nagios Test suite

Published at DZone with permission of Geoffrey Papilion, DZone MVB. See the original article here.

Opinions expressed by DZone contributors are their own.

Popular on DZone

  • Microservices Discovery With Eureka
  • How Observability Is Redefining Developer Roles
  • What Is a Kubernetes CI/CD Pipeline?
  • Mr. Over, the Engineer [Comic]

Comments

Partner Resources

X

ABOUT US

  • About DZone
  • Send feedback
  • Careers
  • Sitemap

ADVERTISE

  • Advertise with DZone

CONTRIBUTE ON DZONE

  • Article Submission Guidelines
  • Become a Contributor
  • Visit the Writers' Zone

LEGAL

  • Terms of Service
  • Privacy Policy

CONTACT US

  • 600 Park Offices Drive
  • Suite 300
  • Durham, NC 27709
  • support@dzone.com
  • +1 (919) 678-0300

Let's be friends: