Over a million developers have joined DZone.

Configuration only Deployments: Reduce outage windows by 90%

DZone's Guide to

Configuration only Deployments: Reduce outage windows by 90%

· DevOps Zone
Free Resource

“Automated Testing: The Glue That Holds DevOps Together” to learn about the key role automated testing plays in a DevOps workflow, brought to you in partnership with Sauce Labs.

Monitoring configuration is complicated, and the depths that you can configure alerts and tests seems endless. It may seem like a waste of time to invest in some options, but others can really help you eliminate states that send hundreds of alerts. Your end goal in your configuration is to narrow down any alert sent to the pager to be immediately actionable, and that all other issues are ignored. Certain Failure states like failed switches, routers, can cause a flood of alerts since they take down the network infrastructure, and obscure the true cause of an outage.

Defining the Right Config

The first step you can take to prevent a flood of pages is to define all you routers, switches, and other network equipment in your Nagios config. After you have that defined you simply need to define a parent on the config object.
For example:

    # Primary Switch in VRRP Group
    define host {
    use switch
    host_name switch-1
    hostgroups switches

    #Secondary Switch in VRRP Group
    define host {
    use switch
    host_name switch-2
    hostgroups switches

    define host {
    use server
    host_name apache-server-1
    hostgroups servers, www
    parents switch-1, switch-2

This will configure the host apache-server-1 such that if switch-1 and switch-2 fail, alerts will be silence from the client. The alerts will remain off until either switch-1 or switch-2 becomes available again.

A Few Things to Keep in Mind

Nagios is pretty smart, and can handle multiple parents so that alerts will only be silenced if both parents become unavailable.

The availability of parent hosts is determined by the host health check, most commonly ping. If you need some other test of availability, make sure to define this in the host object.

Parent all the objects you can or that make sense to parent. For example, a router or transport failure at a remote data center should only send a single alert. This means you should define your routers, switches, and possibly your providers gateways. Do whatever you think makes sense, and take it as far as your can. Remember your goal is to make the number of alerts manageable, so the better you define the topology the less likely you are to get a useless page, or several hundred useless pages.

Learn about the importance of automated testing as part of a healthy DevOps practice, brought to you in partnership with Sauce Labs.


Published at DZone with permission of Geoffrey Papilion, DZone MVB. See the original article here.

Opinions expressed by DZone contributors are their own.


Dev Resources & Solutions Straight to Your Inbox

Thanks for subscribing!

Awesome! Check your inbox to verify your email so you can start receiving the latest in tech news and resources.


{{ parent.title || parent.header.title}}

{{ parent.tldr }}

{{ parent.urlSource.name }}