
What I am learning: System monitoring, incident management and alerting

Last week I gave a talk on DevOps at the Facebook Developer Circles meetup at Outbox. I was impressed by the reception and interest I got from the attendees. Because DevOps is already something I'm incorporating in our organization, it's something that's going to occupy me for a while.

One of the tenets of DevOps is monitoring and alerting.

Monitoring is, at its core, knowing what's going on with your infrastructure at any given time. With a good monitoring and alerting system, you can avoid many system and application outages, or at least catch them early.

When I joined the .ug registry, one of the very first things I implemented was a custom monitoring system. Because it was so urgent and critical, I quickly developed one from scratch using Python Flask as the web framework and MongoDB as the datastore. A combination of bash and Python scripts did the backend monitoring and reporting, which in turn fed data into MongoDB and displayed the results on a web interface.

What I was interested in monitoring was server uptime and specific services, depending on the purpose of each server.

Checking whether or not a server is up is fairly straightforward in bash. The trick, though, is preventing false alerts. I overcame that by retrying multiple times before finally registering a failure.

ip_up(){
    # returns 0 if the server responds to ping, 1 if all trials fail
    server_ip="$1"
    max_trials=20
    trials=1
    while (( trials <= max_trials ))
    do
        echo "server: $server_ip, trial $trials" >&2
        # three pings, one second apart; any reply means the host is up
        ping -i 1 -c 3 "$server_ip" > /dev/null 2>&1 && return 0
        echo "Trial $trials: $server_ip is down, checking again in 10 sec" >&2
        ((trials++))
        sleep 10
    done

    echo "$(date +'%Y-%m-%d %H:%M:%S'): Server $server_ip is DOWN" >&2
    return 1
}

And for services, I wrote a Python/bash script that checks whether the respective service ports are open or not. This determines whether a service is up or down.

+Web servers
-http
-https
-ftp

+Mail servers
-smtp
-submission
-imap(s)
-pop3(s)
-http

Something like this;

import subprocess
from time import sleep

def run_shell_cmd(cmd):
    # run a shell command and return the first line of its output
    out = subprocess.Popen(cmd, shell=True, stdout=subprocess.PIPE,
                           stderr=subprocess.PIPE).communicate()[0]
    return out.decode().split("\n")[0]

def get_service_status(ip, port):
    # nc exits with 0 when the port is open; "echo $?" turns that exit code into output we can read
    cmd = "nc -z -v -w3 %s %s; echo $?" % (ip, port)
    status = "down"
    res = run_shell_cmd(cmd)
    trials, max_trials = 1, 5
    while res != "0" and trials < max_trials:
        trials += 1
        print("service %s is still down, trying trial # %s" % (port, trials))
        sleep(2)
        res = run_shell_cmd(cmd)

    if res == "0":  # 0 means the port is open, i.e. the service is up
        status = "up"
    return status
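
In practice this gets called from a small driver script, roughly like the sketch below. The IP addresses and port maps here are made-up examples, not our real servers:

# Hypothetical example: map each server role to the ports it should expose
web_ports = {"http": 80, "https": 443, "ftp": 21}
mail_ports = {"smtp": 25, "submission": 587, "imaps": 993, "pop3s": 995}

for name, port in web_ports.items():
    print("web %s -> %s" % (name, get_service_status("192.0.2.10", port)))

for name, port in mail_ports.items():
    print("mail %s -> %s" % (name, get_service_status("192.0.2.20", port)))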

This custom monitoring system has been working for us for over two years. Now we want to take things a notch higher.

I have been researching a number of tools that could automate our monitoring and alerting in a more scalable and granular way.

After some good reading around, I'm spoilt for choice and somewhat confused too :).

I have tried to categorize the existing tools according to my understanding of what they do. This will hopefully help us narrow down to the right choice.

I'll be interested in something that I can run on-premise, not a SaaS solution: something that's scalable, has a great web UI and enjoys wide community support.

Some of the tools I have come across include the following in my own categorization;

+Free Monitoring & alerting
-Nagios
-Icinga
-Sensu
-Prometheus
-Zabbix
-OpsView
-Sysdig
-alerta

+Time series database
-Bosun
-OpenTSDB
-InfluxDB
-Graphite

+Cloud vendors
-Cloud Monitor – Rackspace
-Stackdriver – Google
-CloudWatch – AWS

+Enterprise solutions
-LogicMonitor
-Anturis

+SaaS Monitoring
-serverdensity.com
-Anturis
-Monitis
-Datadog
-Outlyer

+Uptime monitoring SaaS
-happyapps.io
-uptimerobot
-Pingdom

+SaaS incident management/response
-OpsGenie
-VictorOps
-Pagerduty
-BigPanda

+SaaS application monitoring
-New Relic

As you can tell, that's a lot of options. But so far, the one that stands out for me is Sensu.

Sensu is hard to install because it's composed of several independent components;

  • RabbitMQ as a message transport bus
  • Redis for preserving state
  • Sensu server which schedules monitoring checks
  • Sensu client/agent which executes checks and reports them to the server
  • Sensu API which provides an interface for third party apps to interact with Sensu
  • Uchiwa, which provides a web interface for all checks, clients and other functionality

While Sensu might be overwhelming to understand and install at first, I like its very modular architecture. It's compatible with Nagios plugins and many community plugins, and you can also write your own monitoring scripts in whatever language you wish and plug them into the system. This means it'll be easier for me to re-use the existing scripts I already spent a lot of time writing for our current monitoring system.
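
For example, as I understand it, wrapping one of my existing port-check scripts as a Sensu check is just a matter of dropping a JSON definition into Sensu's conf.d directory and having the script exit with the Nagios-style codes Sensu expects (0 = OK, 1 = warning, 2 = critical), which mine would need a small tweak to do. The script path, subscriber name and interval below are placeholders, not our actual setup:

{
  "checks": {
    "check_smtp_port": {
      "command": "/etc/sensu/plugins/check_service_port.py mail.example.com 25",
      "subscribers": ["mail-servers"],
      "interval": 60,
      "handlers": ["email"]
    }
  }
}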

It also implements "handlers", which let you plug in the alerting of your choice. In our existing system, all alerts go out via email, and the more critical ones like server outages also go out via SMS. The other thing I'll be interested in is real-time metrics monitoring -- specifically CPU and RAM for our web servers. Sensu integrates with Graphite/Grafana, which provide real-time graphing of system metrics, so this is something I am definitely excited about.
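
From what I've read so far, a Sensu "pipe" handler is just an executable that receives the event as JSON on stdin, so our existing email alerting could be wrapped in something like this rough sketch (the SMTP host and addresses are placeholders):

#!/usr/bin/env python3
# Rough sketch of a Sensu pipe handler: Sensu passes the event as JSON
# on stdin and this script turns it into an email alert.
import json
import sys
import smtplib
from email.mime.text import MIMEText

event = json.load(sys.stdin)
client = event["client"]["name"]    # which server raised the alert
check = event["check"]["name"]      # which check failed
output = event["check"]["output"]   # the check's output text

msg = MIMEText(output)
msg["Subject"] = "ALERT: %s failed on %s" % (check, client)
msg["From"] = "monitor@example.com"     # placeholder addresses
msg["To"] = "ops@example.com"

with smtplib.SMTP("localhost") as smtp: # assumes a local mail relay
    smtp.send_message(msg)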

I already have Sensu installed on my localhost, so I'll be learning more about it in the coming days and then see how I can roll out a whole new monitoring system for our organization. I will then share my experiences with the new system if all goes well.

Image: Pixabay.com
