This was originally posted on the Zenoss Community Blog.
The DataSift platform consumes, processes, filters, stores and streams terabytes upon terabytes of social data to businesses around the world. We in the Operations team needed to know about any problems, then acknowledge and investigate them as quickly and easily as possible.
Zenoss had long been the mainstay of our monitoring for general server and application health but we were planning on extending our monitoring by coupling it with metrics that were to be emitted by every piece of software and hardware in the platform.
Using a fork of Etsy’s statsd and a fork of Graphite, coupled with our own interfaces into each, we started gathering thousands of metrics a second. These metrics are displayed on screens in the Operations area and are also evaluated for erroneous behaviour by another piece of in-house software.
(The strobe light that can be seen in the picture is also hooked up to Zenoss; http://www.youtube.com/watch?v=OqkILJCdCA0 )
When erroneous events are detected they are sent to Zenoss (which has been deployed and configured by Chef). Zenoss automatically knows the server or class to which the metric relates and can set the severity accordingly (unless we have sent an override as part of the JSON API request).
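The severity rule described above can be sketched roughly as follows. This is only an illustration and not the actual DataSift or Zenoss logic: the SeverityResolver class, the device class names and the default severity numbers are all hypothetical, and it assumes severities are plain integers where a higher number is more severe.

```java
import java.util.Map;

// Hypothetical helper: picks the severity for an incoming metric event.
final class SeverityResolver {
    // Assumed per-class defaults; the class names and numbers are illustrative only.
    private static final Map<String, Integer> DEFAULTS = Map.of(
            "/Server/Linux", 4,
            "/Network", 5);

    // Assumed fallback for classes with no configured default.
    private static final int FALLBACK = 3;

    // An explicit override sent with the request wins over the class default.
    static int resolve(String deviceClass, Integer override) {
        if (override != null) {
            return override;
        }
        return DEFAULTS.getOrDefault(deviceClass, FALLBACK);
    }

    public static void main(String[] args) {
        System.out.println(resolve("/Server/Linux", null)); // class default applies
        System.out.println(resolve("/Server/Linux", 2));    // override applies
    }
}
```

Passing null for the override models a request that did not include one.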
This got us halfway to our goal: Zenoss was now monitoring the platform using a variety of ZenPacks as well as being informed of platform metrics and what they mean with regard to health. However, we weren’t yet in control of alerting and acknowledgement if we were on the move, in a meeting, asleep in bed or otherwise not near a desk or within easy reach of a laptop.
Android was an easy choice: it is open source and has native JSON support, HTTP libraries, background services, notifications and everything else we’d need to create an app that would alert us within moments of Zenoss detecting an issue and allow us to acknowledge the event and delve into the details, all from a phone or tablet. The only problem was that someone had to write it.
As Rhybudd uses the Zenoss JSON API it is quite easy to perform the initial authentication:
    HttpPost httpost = new HttpPost(URL + "/zport/acl_users/cookieAuthHelper/login");

    List<NameValuePair> nvps = new ArrayList<NameValuePair>();
    nvps.add(new BasicNameValuePair("__ac_name", UserName));
    nvps.add(new BasicNameValuePair("__ac_password", Password));
    nvps.add(new BasicNameValuePair("submitted", "true"));
    nvps.add(new BasicNameValuePair("came_from", URL + "/zport/dmd"));
    httpost.setEntity(new UrlEncodedFormEntity(nvps, HTTP.UTF_8));

    // Response from POST not needed, just the cookie
    HttpResponse response = httpclient.execute(httpost);

    // Consume so we can reuse httpclient
    response.getEntity().consumeContent();
Once the httpclient object has the cookie returned by this request, other API functions can be called quite simply:
    HttpPost httpost = new HttpPost(ZENOSS_INSTANCE + "/zport/dmd/Events/evconsole_router");
    httpost.addHeader("Content-type", "application/json; charset=utf-8");
    httpost.setHeader("Accept", "application/json");

    JSONObject dataContents = new JSONObject();
    dataContents.put("evid", _EventID);

    JSONArray data = new JSONArray();
    data.put(dataContents);

    JSONObject reqData = new JSONObject();
    reqData.put("action", "EventsRouter");
    reqData.put("method", "detail");
    reqData.put("data", data);
    reqData.put("type", "rpc");
    reqData.put("tid", String.valueOf(this.reqCount++));

    httpost.setEntity(new StringEntity(reqData.toString()));

    HttpResponse response = httpclient.execute(httpost);
    String zenossJSON = EntityUtils.toString(response.getEntity());
    response.getEntity().consumeContent();

    JSONObject ZenossObject = new JSONObject(zenossJSON);
With Android versions of each API endpoint complete, the app needed to call them on a regular basis to poll Zenoss and then process the alerts:
    AlarmManager am = (AlarmManager) getSystemService(ALARM_SERVICE);
    Intent Poller = new Intent(this, ZenossPoller.class);
    PendingIntent Monitoring = PendingIntent.getService(this, 0, Poller, PendingIntent.FLAG_UPDATE_CURRENT);
    am.setRepeating(AlarmManager.ELAPSED_REALTIME_WAKEUP,
            0,
            Long.parseLong(settings.getString("BackgroundServiceDelay", "60")) * 1000,
            Monitoring);
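The interval passed to the alarm comes straight from a stored string, so a malformed value would throw. A defensive way to derive it might look like the sketch below. The PollInterval class is hypothetical and not part of Rhybudd; the 60 second default mirrors the snippet above, and the 30 second and one hour bounds are assumed from the polling range the app offers.

```java
// Hypothetical helper: turns the stored "seconds" setting into a safe interval in milliseconds.
final class PollInterval {
    static final long MIN_SECONDS = 30;      // assumed lower bound
    static final long MAX_SECONDS = 3600;    // assumed upper bound (one hour)
    static final long DEFAULT_SECONDS = 60;  // same default as the snippet above

    static long toMillis(String storedSeconds) {
        long seconds = DEFAULT_SECONDS;
        if (storedSeconds != null) {
            try {
                seconds = Long.parseLong(storedSeconds.trim());
            } catch (NumberFormatException e) {
                // Malformed setting: keep the default rather than crash the poller.
                seconds = DEFAULT_SECONDS;
            }
        }
        // Clamp into the supported range before converting to milliseconds.
        seconds = Math.max(MIN_SECONDS, Math.min(MAX_SECONDS, seconds));
        return seconds * 1000L;
    }

    public static void main(String[] args) {
        System.out.println(toMillis("60"));
        System.out.println(toMillis("not a number"));
    }
}
```

The result could then be passed as the interval argument in place of the inline Long.parseLong call.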
If Rhybudd detects an event that matches your configured filters it will then raise a notification with sound and vibration (if enabled) containing a brief overview of what caused the alert.
In older versions of the application, clicking on the notification would launch another poll which, depending on your connection speed and other factors, could cause an unacceptable delay before you could start acknowledging events. One of the key tenets of the Android Design Guide is to make an app feel responsive, so Rhybudd will now only force a refresh on launch if the data in the local cache is determined to be stale. This means the latest events will be ready to view almost instantly each time you launch the app (with background polling enabled).
A single tap on an event will display a popup on a phone, allowing you to acknowledge or view details of the event, whereas a tablet will rearrange its layout to instantly display information about the event in all the extra screen space available and provide an acknowledge button.
You don’t, however, have to tap on every event to acknowledge it: you can long press on several events to acknowledge all selected events at once, or tap the acknowledge all button which, as its name suggests, acknowledges all currently unacknowledged events.
Rhybudd has other features and functionality that can be read about in more detail on the Zenoss wiki, but at its core it is an app that can poll your Zenoss instance from as often as every 30 seconds up to every hour and then alert you via Android notifications with an alert sound of your choice (it can even be set to repeat the notification sound indefinitely until you acknowledge the alert if you’re a heavy sleeper!). Once alerted you can acknowledge the alert, delve a little deeper and even add log messages or escalate to colleagues.
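The "refresh only when stale" check can be sketched as below. This is an assumption about the general shape of such a check, not Rhybudd's actual implementation: the EventCache class and its maximum age parameter are hypothetical, and timestamps are passed in explicitly so the logic stays easy to test.

```java
// Hypothetical cache bookkeeping: decides whether a launch should force a fresh poll.
final class EventCache {
    private final long maxAgeMillis;
    private long lastRefreshMillis = -1; // -1 means the cache has never been filled

    EventCache(long maxAgeMillis) {
        this.maxAgeMillis = maxAgeMillis;
    }

    // Record that a poll has just completed successfully.
    void markRefreshed(long nowMillis) {
        lastRefreshMillis = nowMillis;
    }

    // Stale if never filled, or if the last refresh is older than the allowed age.
    boolean isStale(long nowMillis) {
        return lastRefreshMillis < 0 || nowMillis - lastRefreshMillis > maxAgeMillis;
    }

    public static void main(String[] args) {
        EventCache cache = new EventCache(120_000L);
        System.out.println(cache.isStale(System.currentTimeMillis())); // never filled
        cache.markRefreshed(System.currentTimeMillis());
        System.out.println(cache.isStale(System.currentTimeMillis())); // just refreshed
    }
}
```

On launch the app would show the cached events immediately and only poll first when isStale returns true.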
Problems and Mistakes:
Not all UI elements need to be used. Originally Rhybudd offered the user a seek bar slider to specify the time between polls. UI-wise it made sense to me (no doubt in the early hours of a morning), but many people found trying to move a tiny circle across 3” of seek bar with an accuracy of 1 pixel per second incredibly infuriating. It’s now a drop-down menu.
Backwards and forward compatibility with the remote platform is a requirement. When Rhybudd was first released it had only been tested against Zenoss 3.2 but people were already using the alpha version of 4. Unfortunately neither the Beta status of Rhybudd nor the Alpha status of Zenoss 4 prevented people from leaving negative feedback if they hit an issue.
People want to be notified about their Events not spammed about problems. Until the latest version the app would spawn a notification for every new event that was received during the poll cycle. In a cascade failure scenario that could result in hundreds of individual notifications crushing the Android notification queue, status bar, sound subsystem and everything else in between. As seen above the new version creates a single notification with extracts from the most important alerts and a count to indicate how many other events there are.
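Collapsing a burst of events into one notification body could be done along these lines. This is a sketch under stated assumptions rather than Rhybudd's code: the NotificationSummary and Event types are hypothetical, a higher severity number is assumed to mean a more important event, and the text format is invented for illustration.

```java
import java.util.ArrayList;
import java.util.Comparator;
import java.util.List;

// Hypothetical summariser: one notification body for a whole poll cycle.
final class NotificationSummary {
    // Minimal stand-in for an event; the fields are illustrative.
    static final class Event {
        final String device;
        final String summary;
        final int severity; // assumed: higher means more severe

        Event(String device, String summary, int severity) {
            this.device = device;
            this.summary = summary;
            this.severity = severity;
        }
    }

    static String build(List<Event> events, int maxExtracts) {
        // Copy before sorting so the caller's list is left untouched.
        List<Event> sorted = new ArrayList<>(events);
        sorted.sort(Comparator.comparingInt((Event e) -> e.severity).reversed());

        int shown = Math.max(0, Math.min(maxExtracts, sorted.size()));
        StringBuilder body = new StringBuilder();
        for (int i = 0; i < shown; i++) {
            if (i > 0) {
                body.append('\n');
            }
            body.append(sorted.get(i).device).append(": ").append(sorted.get(i).summary);
        }

        // Count whatever did not fit instead of raising a notification for each event.
        int remaining = sorted.size() - shown;
        if (remaining > 0) {
            if (body.length() > 0) {
                body.append('\n');
            }
            body.append("+ ").append(remaining).append(" more events");
        }
        return body.toString();
    }

    public static void main(String[] args) {
        List<Event> events = new ArrayList<>();
        events.add(new Event("web1", "disk full", 3));
        events.add(new Event("db1", "down", 5));
        events.add(new Event("web2", "high load", 4));
        System.out.println(build(events, 2));
    }
}
```

However many events arrive in a cascade failure, the status bar then holds a single entry whose text stays bounded by maxExtracts.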
Rhybudd has met its initial goal of providing me and the rest of our Operations team with a way of pulling alerts from Zenoss on our schedule and empowering us to acknowledge, respond to and escalate them as required. The challenge now is to ensure that the other people who have the app installed can do what they want with it too.
There are still a few rough edges on the app that I plan to smooth out but what it really needs is some direction from the community as to which features you would find most beneficial.
Feel free to email me at Gareth@NetworksAreMadeOfString.co.uk or send me a tweet (@NetworkString) with any feedback you may have.
The app is free, doesn’t have ads, is open source and available on Google Play now.