Monday, June 17, 2013

What to monitor on a Linux box

This article is a reminder for me (and for anyone else managing a monitoring system) of which metrics should or can be monitored for different kinds of hardware and software components in a modern production system. The listed metrics primarily assume the use of the Nagios, Graphite, collectl, and logcheck monitoring tools.


System metrics

System metrics monitored on a Linux box using Nagios and SNMP-based plugins:
  • ICMP reachability (packet loss and delay)
  • CPU usage
  • System Load Average
  • RAM and swap memory usage
  • Incoming and outgoing traffic level on network interfaces
  • Number of dropped and bad packets on network interfaces
  • Disk space usage on all local partitions
  • The presence of required system processes (rsyslogd/sshd/snmpd/puppet/ntpd)
  • NTP synchronization status and current stratum level
  • The size of local mail queue
  • Server IPMI hardware parameters (power supplies, temperature, fans, etc)
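
As an illustration of how one of these checks reports to Nagios, here is a minimal sketch of a disk space check written as a local plugin. It is not one of the SNMP-based plugins mentioned above. The exit codes (0 OK, 1 WARNING, 2 CRITICAL, 3 UNKNOWN) follow the standard Nagios plugin convention; the function names and the 80%/90% thresholds are assumptions chosen for the example.

```python
import shutil

# Standard Nagios plugin exit codes.
OK, WARNING, CRITICAL, UNKNOWN = 0, 1, 2, 3


def classify(used_percent, warn=80.0, crit=90.0):
    """Map a usage percentage to a Nagios exit code (thresholds are assumed values)."""
    if used_percent >= crit:
        return CRITICAL
    if used_percent >= warn:
        return WARNING
    return OK


def check_disk(path="/", warn=80.0, crit=90.0):
    """Return (exit_code, status_line) for the partition that holds `path`."""
    try:
        usage = shutil.disk_usage(path)
    except OSError as exc:
        return UNKNOWN, f"DISK UNKNOWN - {path}: {exc}"
    if usage.total == 0:
        # Some pseudo filesystems report a size of zero.
        return UNKNOWN, f"DISK UNKNOWN - {path}: reported size is zero"
    used_percent = 100.0 * usage.used / usage.total
    code = classify(used_percent, warn, crit)
    label = ("OK", "WARNING", "CRITICAL")[code]
    # Text after "|" is Nagios performance data: label=value[unit];warn;crit
    perfdata = f"used_pct={used_percent:.1f}%;{warn};{crit}"
    return code, f"DISK {label} - {path} {used_percent:.1f}% used | {perfdata}"
```

A wrapper script would print the status line and pass the code to sys.exit(), since Nagios reads the first line of output and the process exit status.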

It is also important to monitor syslog messages using a centralized syslog server and the logcheck tool.

System metrics monitored on a Linux box using a combination of Nagios, collectl, and Graphite:
  • Disk I/O utilization level
  • Number of TCP connections
  • Number of TCP errors
  • Memory page faults
  • Number of CPU interrupts
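
To show what ends up in Graphite for a metric such as the number of TCP connections, the sketch below counts sockets from the text of /proc/net/tcp and formats the result in Graphite's plaintext protocol ("path value timestamp", one metric per line, usually sent to Carbon on TCP port 2003). This is only an illustration of the wire format, not how collectl exports its data; the metric path, the host name graphite.example.com, and the function names are placeholders.

```python
import socket
import time


def format_metric(path, value, timestamp=None):
    """Build one line of Graphite's plaintext protocol: '<path> <value> <timestamp>'."""
    if timestamp is None:
        timestamp = time.time()
    return f"{path} {value} {int(timestamp)}\n"


def count_tcp_sockets(proc_text):
    """Count IPv4 TCP sockets (in any state) from the contents of /proc/net/tcp."""
    lines = [line for line in proc_text.splitlines() if line.strip()]
    # The first non-empty line is the column header.
    return max(len(lines) - 1, 0)


def send_metrics(lines, host="graphite.example.com", port=2003, timeout=5.0):
    """Send preformatted lines to a Carbon listener (host is a placeholder)."""
    payload = "".join(lines).encode("ascii")
    with socket.create_connection((host, port), timeout=timeout) as conn:
        conn.sendall(payload)
```

A collector would read /proc/net/tcp, pass its contents to count_tcp_sockets(), and hand the formatted line to send_metrics() on a fixed interval.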

Cisco metrics monitored using Nagios and SNMP-based plugins:
  • ICMP reachability (packet loss and delay)
  • CPU usage
  • Outlet temperature
  • Status of BGP peers
  • NTP synchronization status and current stratum level
  • Up/down status of network interfaces
  • Incoming and outgoing traffic level on network interfaces
  • Number of dropped and bad packets on network interfaces
  • Status of chassis fans
  • Status of power supplies (for redundant power configurations)

Database (MySQL/PostgreSQL) metrics monitored using Nagios and custom plugins:
  • The availability of database service
  • Database replication status and delay
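
A custom replication check can be sketched as follows for the MySQL case. The function takes one row of SHOW SLAVE STATUS as a dictionary (fetching that row with a database driver is left out) and maps it to a Nagios exit code. The field names Slave_IO_Running, Slave_SQL_Running, and Seconds_Behind_Master are the ones MySQL reports for that statement (newer releases rename them under SHOW REPLICA STATUS); the function name and the 60/300 second thresholds are assumptions.

```python
# Standard Nagios plugin exit codes.
OK, WARNING, CRITICAL, UNKNOWN = 0, 1, 2, 3


def check_replication(status, warn_seconds=60, crit_seconds=300):
    """Evaluate one row of SHOW SLAVE STATUS, passed in as a dict."""
    if not status:
        return UNKNOWN, "REPLICATION UNKNOWN - no replica status returned"
    io_running = status.get("Slave_IO_Running")
    sql_running = status.get("Slave_SQL_Running")
    if io_running != "Yes" or sql_running != "Yes":
        return CRITICAL, (
            f"REPLICATION CRITICAL - IO thread: {io_running}, SQL thread: {sql_running}"
        )
    lag = status.get("Seconds_Behind_Master")
    if lag is None:
        # MySQL reports NULL when the delay cannot be determined.
        return CRITICAL, "REPLICATION CRITICAL - delay is unknown (NULL)"
    lag = int(lag)
    if lag >= crit_seconds:
        return CRITICAL, f"REPLICATION CRITICAL - {lag}s behind master"
    if lag >= warn_seconds:
        return WARNING, f"REPLICATION WARNING - {lag}s behind master"
    return OK, f"REPLICATION OK - {lag}s behind master"
```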

Application metrics

The list of monitored application metrics depends heavily on the particular application; the following are examples for some typical applications.

Web applications:
  • The availability of HTTP/HTTPS service or custom web service port
  • The expiration status of used SSL certificates
  • Number of served requests per second (RPS)
  • Application response time for test backend-light and backend-heavy requests
  • Number of requests in internal application queue
  • Number of 4xx HTTP responses
  • Number of 5xx HTTP responses
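
The certificate expiration check from the list above can be sketched with Python's standard ssl module. The helper names are assumptions; ssl.cert_time_to_seconds() and SSLSocket.getpeercert() are standard library calls. Note that with the default verification context an already expired certificate fails the TLS handshake, which a real check should catch and report as critical.

```python
import socket
import ssl
import time


def days_until_expiry(not_after, now=None):
    """Days left before a certificate's notAfter date, e.g. 'Jan  5 09:34:43 2018 GMT'."""
    if now is None:
        now = time.time()
    return (ssl.cert_time_to_seconds(not_after) - now) / 86400.0


def fetch_not_after(host, port=443, timeout=5.0):
    """Open a TLS connection and return the peer certificate's notAfter string."""
    context = ssl.create_default_context()
    with socket.create_connection((host, port), timeout=timeout) as sock:
        with context.wrap_socket(sock, server_hostname=host) as tls:
            return tls.getpeercert()["notAfter"]
```

A check would call days_until_expiry(fetch_not_after("www.example.com")) and raise a warning when the result falls below a chosen number of days, for instance 30.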

More complex web applications:

In many cases the application must be coded to serve a custom health check URL. When the URL is hit, the web application runs quick connectivity/status checks against all other components it depends on (databases, caches, API servers, disk storage access, etc.) and responds with a status code. The same health check URL can be used by local load balancers to check the availability of backend application web servers.
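
A minimal sketch of such a health check endpoint, using only the Python standard library, might look like this. The /health path, the JSON response body, and the use of 503 for an unhealthy state are assumptions made for the example; each dependency check is a caller-supplied function that returns True when the dependency is reachable.

```python
import json
from http.server import BaseHTTPRequestHandler, HTTPServer


def run_checks(checks):
    """Run each named check; a check passes if it returns a true value without raising."""
    results = {}
    for name, check in checks.items():
        try:
            results[name] = "ok" if check() else "failed"
        except Exception as exc:
            results[name] = f"error: {exc}"
    healthy = all(value == "ok" for value in results.values())
    # 200 when every dependency is fine, 503 otherwise.
    return (200 if healthy else 503), results


def make_handler(checks, path="/health"):
    """Build a request handler class that serves the health check URL."""

    class HealthHandler(BaseHTTPRequestHandler):
        def do_GET(self):
            if self.path != path:
                self.send_error(404)
                return
            status, results = run_checks(checks)
            body = json.dumps(results).encode("utf-8")
            self.send_response(status)
            self.send_header("Content-Type", "application/json")
            self.send_header("Content-Length", str(len(body)))
            self.end_headers()
            self.wfile.write(body)

    return HealthHandler
```

The handler could be served with HTTPServer(("", 8080), make_handler({"database": ping_database})).serve_forever(), where ping_database is a hypothetical function supplied by the application. Both Nagios and a load balancer can then treat any non-200 response as a failure.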

Monitoring of business logic:

It is important to monitor the application's business logic (a full end-to-end functionality test) for all critical functions of an application. In many cases such monitoring requires a custom Nagios check script that sends a test request to the monitored business logic and verifies that a proper result is produced within the defined time threshold.
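
The core of such a check script can be sketched as below. The run_request argument stands for whatever test request the application needs (it is a hypothetical callable supplied by the check author), and the 5 second default threshold is an assumption. The exit codes follow the standard Nagios plugin convention.

```python
import time

# Standard Nagios plugin exit codes.
OK, WARNING, CRITICAL, UNKNOWN = 0, 1, 2, 3


def check_business_function(run_request, expected, max_seconds=5.0):
    """Run a test request and verify both its result and its duration."""
    start = time.monotonic()
    try:
        result = run_request()
    except Exception as exc:
        return CRITICAL, f"E2E CRITICAL - test request failed: {exc}"
    elapsed = time.monotonic() - start
    # Performance data lets Nagios or Graphite graph the response time.
    perfdata = f"time={elapsed:.3f}s;;{max_seconds}"
    if result != expected:
        return CRITICAL, f"E2E CRITICAL - unexpected result {result!r} | {perfdata}"
    if elapsed > max_seconds:
        return CRITICAL, f"E2E CRITICAL - correct result but took {elapsed:.3f}s | {perfdata}"
    return OK, f"E2E OK - correct result in {elapsed:.3f}s | {perfdata}"
```

This sketch measures the duration after the fact and does not interrupt a request that hangs; in practice the request itself should carry a timeout, and Nagios also enforces its own plugin timeout.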

3 comments:

  1. Hi, Victor. How do you measure outliers that may be causing performance problems but are buried in the averages? With a heat map?

  2. https://github.com/harisekhon/nagios-plugins

  3. What's your standard monitoring software? I used Nagios, but today our company prefers a package from softinventive labs; it's cheaper and more comfortable to use.
