Archive for the ‘nagios cluster’ Tag

Fault Tolerant Nagios Cluster

I’ve been searching for a while for a solution of “how to build a fault tolerant Nagios installation” or “how to build a Nagios cluster”. Nada.
The concept is very simple, but it seems like the implementation lacks a bit, so I’ve decided to write a post about how I am doing it.

Cross Site Monitoring

The concept of cross site monitoring is very simple. Say you have nagios01 and nagios02, all that you have to setup is 2 tests:

  • nagios01 monitors nagios02
  • nagios02 monitors nagios01

Assuming you have puppet or chef managing the show, just make nagios01 and nagios02 (or even more nagiosXX servers) identical. Meaning all of them have the same configuration and can monitor all of your systems. A clone of each other if you’d like to call it that way.
Lets check the common use cases:

  • If nagios01 goes down you get an alert from nagios02.
  • If nagios02 goes down you get an alert from nagios01.

Great, I didn’t invent any wheel over here.
The main problem in this configuration is that if there is a problem (any problem) – you are going to get X alerts. X being the number of nagios servers you have.

Avoiding Duplicate Alerts

For the sake of simplicity, we’ll assume again we have just 2 nagios servers, but this would obviously scale for more.
What we actually want to do is prevent both servers from sending duplicate alerts as they are both configured the same way and will monitor the exact same thing.
One solution is to obviously have an active/passive type of cluster and all sort of complicated shenanigans, my solution is simpler than that.
We’ll “chain” nagios02 behind nagios01, making nagios02 fire alerts only if nagios01 is down.
Login to nagios02 and change /etc/nagios/private/resource.cfg, adding the line:

$USER2$="/usr/lib64/nagios/plugins/check_nrpe -H nagios01 -c check_nagios"
$USER2$ will be the condition of whether or not nagios is up on nagios01.

Still on nagios02, edit /etc/nagios/objects/commands.cfg, replacing your current alerting command to depend on the condition. Here is an example for the default one:

define command{
        command_name    notify-host-by-email
        command_line    /usr/bin/printf "%b" ...
}

Change to:

define command{
        command_name    notify-host-by-email
        command_line    eval $USER2$ || /usr/bin/printf "%b" ...
}

What we have done here is simply configure nagios02 to query nagios01 nagios status before firing an alert. Easy as. No more duplicated emails.

For the sake of robustness, if you would like to configure also nagios01 with a $USER2$ variable, simply login to nagios01, change the alerting command like in nagios02 and have in /etc/nagios/private/resource.cfg:

$USER2$="/bin/false"

Assuming you have puppet or chef configuring all that, you can just assign a master ($USER2$=/bin/false) and multiple slaves that query themselves in a chain.
For example:

  • nagios01 – $USER2$=”/bin/false”
  • nagios02 – $USER2$=”/usr/lib64/nagios/plugins/check_nrpe -H nagios01 -c check_nagios”
  • nagios03 – $USER2$=”/usr/lib64/nagios/plugins/check_nrpe -H nagios01 -c check_nagios && /usr/lib64/nagios/plugins/check_nrpe -H nagios02 -c check_nagios”

Enjoy!