Sunday, June 16, 2013

Metrics, metrics, metrics...

A lot has been said about the importance of system and application metrics. I'll not repeat it here, and will concentrate instead on the off-the-shelf options available for implementing a robust, usable and scalable metrics collection and monitoring system.

We will talk about three main areas:
  1. How to generate/create metrics
  2. How to collect and represent the metrics
  3. How to monitor the collected metrics

During the last 10 years I have used many different tools to collect and analyze metrics: Cacti (with many different plugins and graph templates), MRTG, Ganglia, Graphite, RTG, and Nagios (with the Nagiosgraph component). Eventually I ended up employing the following systems:
  • Cacti - primarily because of a nice set of graph templates for different applications and devices. Cacti is not flexible and scriptable enough to use for all hosts in the network, but it is still good enough because of the existing graph templates for MySQL, PGSQL and Cisco components
  • Graphite - primarily because of its simplicity, flexibility and scalability
  • Nagios - for great monitoring and alerting features
In this article I'll describe a solution based on Graphite, an open source real-time graphing system.

How to generate Graphite metrics

Any application metric can be easily exported from your code, as described here.
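As a rough illustration of what such an export can look like, here is a minimal Python sketch that sends one data point to Carbon using its plaintext protocol (one line per data point: metric path, value, Unix timestamp). The host name graphite.example.com, the metric path and the helper names are made up for the example. Port 2003 is Carbon's default plaintext listener, so adjust it if your setup differs.

```python
import socket
import time


def format_metric(path, value, timestamp=None):
    # Carbon's plaintext protocol expects "<path> <value> <timestamp>"
    # terminated by a newline, one data point per line.
    if timestamp is None:
        timestamp = int(time.time())
    return f"{path} {value} {int(timestamp)}\n"


def send_metric(path, value, host="graphite.example.com", port=2003, timestamp=None):
    # Assumption: host/port point at a Carbon plaintext listener.
    # Opens a short-lived TCP connection and writes a single line.
    line = format_metric(path, value, timestamp)
    with socket.create_connection((host, port), timeout=5) as sock:
        sock.sendall(line.encode("ascii"))
```

A call such as send_metric("app.web.requests", 42) would then record the value 42 under the hypothetical metric path app.web.requests. An application that reports often would likely keep the connection open or batch several lines per write rather than reconnecting for every data point.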

There are several Graphite-compatible third-party agents available to collect and report OS-related metrics (CPU/disk/RAM usage, TCP stats, etc.). I have had very good experience with the Collectl agent, which has supported the Graphite interface since version 3.6.1.

How to collect and represent Graphite metrics

A standard Graphite server with the Whisper and Carbon components will do the job just fine. To increase the availability of the service, it can be deployed on an active-passive pair of servers, with the data storage synchronized using DRBD and with health checking and automatic failover implemented using the standard Linux Heartbeat clustering software.

How to monitor the metrics

Nagios is my favorite monitoring tool, as I have already described in this and this article.

There are two Nagios plugins which can be effectively used to monitor metrics stored in a Graphite instance:
  1. check-graphite Ruby script https://github.com/pyr/check-graphite. The script is an interface for accessing any metric stored in the Graphite database, and it alerts when a metric exceeds a defined warning/critical threshold
  2. !NADA project https://github.com/ricardomaraschini/nada - a baseline monitoring plugin which can be used with any Nagios check script that returns performance data (and the check-graphite plugin does return such data)
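To make the idea behind such a threshold check more concrete, here is a minimal Python sketch. It is not how check-graphite itself is implemented. It only illustrates the general pattern: fetch a series from Graphite's render API in JSON format, take the latest non-null value, compare it against warning/critical thresholds, and report the result using the standard Nagios exit codes with performance data after the "|" separator. The function names and the perfdata label "value" are chosen for the example.

```python
import json
import urllib.parse
import urllib.request

# Standard Nagios plugin exit codes.
OK, WARNING, CRITICAL, UNKNOWN = 0, 1, 2, 3


def fetch_series(base_url, target, window="-5min"):
    # Graphite's render API returns a JSON list of
    # {"target": ..., "datapoints": [[value, timestamp], ...]} objects.
    query = urllib.parse.urlencode(
        {"target": target, "from": window, "format": "json"}
    )
    with urllib.request.urlopen(f"{base_url}/render?{query}", timeout=10) as resp:
        return json.load(resp)


def latest_value(series):
    # Walk each series backwards and return the newest non-null value.
    for entry in series:
        for value, _timestamp in reversed(entry["datapoints"]):
            if value is not None:
                return value
    return None


def evaluate(value, warn, crit):
    # Map a value to a Nagios status code and a status line with perfdata.
    if value is None:
        return UNKNOWN, "UNKNOWN - no data points returned"
    perfdata = f"value={value};{warn};{crit}"
    if value >= crit:
        return CRITICAL, f"CRITICAL - value is {value} | {perfdata}"
    if value >= warn:
        return WARNING, f"WARNING - value is {value} | {perfdata}"
    return OK, f"OK - value is {value} | {perfdata}"
```

A wrapper script would call fetch_series() with the base URL of your Graphite web interface, pass the result through latest_value() and evaluate(), print the status line and exit with the returned code. This sketch only handles "higher is worse" thresholds; a real plugin would also need to cover the reverse case and network errors.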

Summary

In a modern production system with a variety of deployed hardware and software components, it is not easy to do everything with a single graphing and monitoring tool, and you will most likely end up using several such tools. In my experience, Graphite, Nagios and Cacti together make a very suitable set of graphing and monitoring tools.
