One of the things we take very seriously here at Kumina is our monitoring. We’ve always done so, but even we must admit that in the early years we sometimes forgot to include all possible checks for a new service or host. And it sucks when you forget to set up the monitoring for a specific item, because you generally only find out about it when it’s already down…
We like to check as much as possible (if not everything). For example, we check if a service is up and running, we check if a vhost is returning the expected response, if an SSL certificate is still valid or if it will expire within 30 days, we check if OpenVPN certificates are close to expiration and if all loaded Apache modules actually come from a Debian package. And we check often, generally every 30 seconds, but we would prefer to do it even more often. However, these are not things you want to configure manually over and over again.
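To make the certificate check a bit more concrete, here is a minimal sketch in Python of the kind of decision such a check makes. It is not our actual plugin: the function name, the messages and the choice to treat an expired certificate as critical are assumptions for illustration. Only the 30-day window comes from the description above, and the exit codes are the standard Nagios/Icinga plugin ones.

```python
from datetime import datetime, timedelta, timezone

# Standard Nagios/Icinga plugin exit codes.
OK, WARNING, CRITICAL = 0, 1, 2


def certificate_status(not_after, now, warn_days=30):
    """Classify a certificate by its expiry time (hypothetical helper).

    Assumption: an expired certificate is CRITICAL, one that expires
    within `warn_days` days is WARNING, anything else is OK.
    """
    if not_after <= now:
        return CRITICAL, "certificate expired on %s" % not_after.date()
    remaining = not_after - now
    if remaining <= timedelta(days=warn_days):
        return WARNING, "certificate expires in %d days" % remaining.days
    return OK, "certificate valid for %d more days" % remaining.days


if __name__ == "__main__":
    now = datetime(2014, 1, 1, tzinfo=timezone.utc)
    expiry = datetime(2014, 1, 15, tzinfo=timezone.utc)
    code, message = certificate_status(expiry, now)
    print(code, message)
```

Passing in the current time as an argument keeps the logic easy to test. A real check would read the expiry date from the certificate itself and exit with the returned code.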
We’re using Icinga in two datacenters in failover mode: the second node takes over if the primary is unreachable. We currently monitor 319 hosts (including some failover virtual hosts) with a grand total of nearly 10000 checks, although this fluctuates daily, since most changes on a server also add or remove checks. It is all done automatically. This prevents us from forgetting to set up monitoring for a specific item or host and also allows us to quickly deploy new checks on the entire infrastructure. Consider the Fokirtor check we created last year: it’s very easy for us to simply deploy it on all those machines.
Using the tools at hand
We’re currently pretty heavy Puppet users, so we leverage the infrastructure we already have in place for that.
Since a Puppet agent runs on our monitoring hosts every few hours, it’ll deploy new configuration a few times per day. It’s not exactly continuous delivery, but close enough for our needs for now. Equally important, it removes checks we no longer need. For instance, if we’ve created a redirect that is later changed into a full-fledged site, the check is automatically changed to no longer expect a 301 response but a 200 with a correct string (that we provided, of course, it’s not that automated).
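The redirect example can be sketched roughly as follows. This is an illustration in Python, not our Puppet code: the VhostCheck class, the build_vhost_check and response_matches functions and the example hostname are all made up for this sketch. The point is only that the expectation (a 301, or a 200 plus a string) is derived from how the site is defined, so changing the definition changes the check.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class VhostCheck:
    """What the monitoring expects from a vhost (hypothetical structure)."""
    hostname: str
    expected_status: int
    expected_string: Optional[str] = None  # only used for full sites


def build_vhost_check(hostname, is_redirect, expected_string=None):
    """Derive the check from the site definition."""
    if is_redirect:
        # A redirect only needs to answer with a 301.
        return VhostCheck(hostname, 301)
    if not expected_string:
        # The string has to be supplied by a human; it is not guessed.
        raise ValueError("a full site needs an expected string")
    return VhostCheck(hostname, 200, expected_string)


def response_matches(check, status, body):
    """Return True if a response satisfies the check."""
    if status != check.expected_status:
        return False
    return check.expected_string is None or check.expected_string in body


if __name__ == "__main__":
    before = build_vhost_check("www.example.com", is_redirect=True)
    after = build_vhost_check("www.example.com", is_redirect=False,
                              expected_string="Welcome")
    print(before)
    print(after)
```

Once the site stops being a redirect, a 301 response no longer satisfies the regenerated check, which is exactly the kind of stale expectation we don’t want to keep around by hand.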
We started out using the power of Puppet’s exported resources, but over time, as our config grew, it started to take way too long for Puppet to deploy new configuration on the monitoring hosts. We now deploy the configuration for both Icinga instances using a script that reads the stored config from the Puppet database.
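The general shape of such a script is simple, even if the details are not. The sketch below is in Python and only covers the rendering half: it assumes the stored config has already been fetched into a list of plain dictionaries, and it leaves out how to query the Puppet database altogether. The record layout, function names and hostnames are assumptions for illustration. The output uses the Nagios-style object syntax that Icinga 1 reads.

```python
def render_service(record):
    """Render one record as a Nagios-style service definition.

    Assumption: a real definition needs more directives than these
    three, or a `use` line pointing at a template that supplies them.
    """
    lines = ["define service {"]
    for key in ("host_name", "service_description", "check_command"):
        lines.append("    %-24s%s" % (key, record[key]))
    lines.append("}")
    return "\n".join(lines)


def render_config(records):
    """Render all records in a stable order.

    Sorting keeps the generated file identical between runs when
    nothing has changed, so a diff shows only real changes.
    """
    ordered = sorted(
        records,
        key=lambda r: (r["host_name"], r["service_description"]),
    )
    return "\n\n".join(render_service(r) for r in ordered) + "\n"


if __name__ == "__main__":
    # Hypothetical records, standing in for the stored config.
    stored = [
        {"host_name": "web2.example.com",
         "service_description": "HTTP",
         "check_command": "check_http"},
        {"host_name": "web1.example.com",
         "service_description": "HTTP",
         "check_command": "check_http"},
    ]
    print(render_config(stored))
```

Because the whole file is regenerated from the stored data on every run, a check that is no longer in that data simply disappears from the output. No separate cleanup step is needed.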
As you might imagine, we also do this for trending with Munin. We automatically deploy the Munin plugins on the clients when we deploy a new service, and we automatically deploy the host configuration on the Munin server, as well as the required firewall rules on the client side.