Sunday, June 16, 2013

What to monitor on a Linux box

This article is a reminder for me (and for anyone else managing a monitoring system) of which metrics should or can be monitored for the different kinds of hardware and software components in a modern production system. The listed metrics primarily assume the use of the Nagios, Graphite, collectl and logcheck monitoring tools.

System metrics

System metrics monitored on a Linux box using Nagios and SNMP-based plugins:
  • ICMP reachability (packet loss and delay)
  • CPU usage
  • System Load Average
  • RAM and swap memory usage
  • Incoming and outgoing traffic level on network interfaces
  • Number of dropped and bad packets on network interfaces
  • Disk space usage on all local partitions
  • The presence of required system processes (rsyslogd/sshd/snmpd/puppet/ntpd)
  • NTP synchronization status and current stratum level
  • The size of local mail queue
  • Server IPMI hardware parameters (power supplies, temperature, fans, etc.)
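
As an illustration of how one of these checks can be wired into Nagios, here is a minimal sketch of a local disk space plugin. It assumes the check runs on the monitored host itself (for example via an agent) rather than over SNMP; the function names and the default thresholds (80% and 90%) are made up for the example. The only fixed convention is the set of Nagios plugin exit codes: 0 = OK, 1 = WARNING, 2 = CRITICAL, 3 = UNKNOWN.

```python
import shutil

# Standard Nagios plugin exit codes.
OK, WARNING, CRITICAL, UNKNOWN = 0, 1, 2, 3
STATUS_NAMES = {OK: "OK", WARNING: "WARNING", CRITICAL: "CRITICAL", UNKNOWN: "UNKNOWN"}


def classify(value, warn, crit):
    """Map a 'higher is worse' value to a Nagios status code."""
    if value >= crit:
        return CRITICAL
    if value >= warn:
        return WARNING
    return OK


def check_disk(path="/", warn=80.0, crit=90.0):
    """Return (exit_code, status_line) for used space on one partition.

    The thresholds are illustrative defaults, expressed in percent used.
    """
    try:
        usage = shutil.disk_usage(path)
    except OSError as exc:
        return UNKNOWN, f"DISK UNKNOWN - cannot stat {path}: {exc}"
    if usage.total == 0:
        return UNKNOWN, f"DISK UNKNOWN - {path} reports zero total size"
    used_pct = 100.0 * usage.used / usage.total
    code = classify(used_pct, warn, crit)
    # Text after '|' is Nagios performance data: label=value[UOM];warn;crit
    line = (f"DISK {STATUS_NAMES[code]} - {path} is {used_pct:.1f}% used"
            f" | used_pct={used_pct:.1f}%;{warn};{crit}")
    return code, line
```

A plugin wrapper would print the status line and exit with the code, for example `code, line = check_disk("/var"); print(line); sys.exit(code)`.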

It is also important to monitor syslog messages using a centralized syslog server and the logcheck tool.

System metrics monitored on a Linux box using a combination of Nagios, collectl and Graphite:
  • Disk I/O utilization level
  • Number of TCP connections
  • Number of TCP errors
  • Memory page faults
  • Number of CPU interrupts
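
To show what ends up in Graphite, here is a rough sketch that formats and sends a data point using Graphite's plaintext protocol (one `path value timestamp` line per metric, by default on TCP port 2003). In the setup described above collectl does the collecting; reading `/proc/net/tcp` directly is a stand-in for this example only, and the metric path and host name used below are hypothetical.

```python
import socket
import time


def format_metric(path, value, timestamp=None):
    """Build one line of Graphite's plaintext protocol."""
    if timestamp is None:
        timestamp = time.time()
    return f"{path} {value} {int(timestamp)}\n"


def count_sockets(proc_text):
    """Count socket rows in the text of /proc/net/tcp (first line is a header).

    Note: this counts every IPv4 TCP socket, including LISTEN and TIME_WAIT.
    """
    lines = [line for line in proc_text.splitlines() if line.strip()]
    return max(len(lines) - 1, 0)


def read_tcp_connection_count(proc_file="/proc/net/tcp"):
    """Linux only: read the kernel's IPv4 TCP socket table and count rows."""
    with open(proc_file) as fh:
        return count_sockets(fh.read())


def send_to_graphite(lines, host, port=2003, timeout=5.0):
    """Send already formatted metric lines to a Graphite (carbon) listener."""
    payload = "".join(lines).encode("ascii")
    with socket.create_connection((host, port), timeout=timeout) as sock:
        sock.sendall(payload)
```

A cron job or daemon could then call something like `send_to_graphite([format_metric("servers.web01.tcp.connections", read_tcp_connection_count())], "graphite.example.com")`, where both the metric path and the host are placeholders.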

Cisco metrics monitored using Nagios and SNMP-based plugins:
  • ICMP reachability (packet loss and delay)
  • CPU usage
  • Outlet temperature
  • Status of BGP peers
  • NTP synchronization status and current stratum level
  • Up/down status of network interfaces
  • Incoming and outgoing traffic level on network interfaces
  • Number of dropped and bad packets on network interfaces
  • Status of chassis fans
  • Status of power supplies (for redundant power configurations)

Database (MySQL/PostgreSQL) metrics monitored using Nagios and custom plugins:
  • The availability of database service
  • Database replication status and delay
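
The decision logic of a custom replication plugin might look like the sketch below. It assumes MySQL and takes the row returned by `SHOW SLAVE STATUS` as a plain dictionary; how that row is fetched (which driver, which credentials) is left out on purpose. The column names `Slave_IO_Running`, `Slave_SQL_Running` and `Seconds_Behind_Master` are the ones MySQL reports; the function name and the thresholds are illustrative.

```python
# Standard Nagios plugin exit codes.
OK, WARNING, CRITICAL, UNKNOWN = 0, 1, 2, 3


def check_replication(status_row, warn=60, crit=300):
    """Evaluate one row of MySQL 'SHOW SLAVE STATUS' output.

    status_row: dict of column name -> value, or None if the query
    returned no rows. warn/crit are illustrative lag thresholds in seconds.
    """
    if not status_row:
        return UNKNOWN, "REPLICATION UNKNOWN - no replication status returned"

    io_running = status_row.get("Slave_IO_Running")
    sql_running = status_row.get("Slave_SQL_Running")
    if io_running != "Yes" or sql_running != "Yes":
        return CRITICAL, (f"REPLICATION CRITICAL - IO thread: {io_running}, "
                          f"SQL thread: {sql_running}")

    lag = status_row.get("Seconds_Behind_Master")
    if lag is None:
        # MySQL reports NULL when the delay cannot be determined.
        return CRITICAL, "REPLICATION CRITICAL - replication delay is unknown"

    lag = int(lag)
    if lag >= crit:
        return CRITICAL, f"REPLICATION CRITICAL - {lag}s behind master | lag={lag}s;{warn};{crit}"
    if lag >= warn:
        return WARNING, f"REPLICATION WARNING - {lag}s behind master | lag={lag}s;{warn};{crit}"
    return OK, f"REPLICATION OK - {lag}s behind master | lag={lag}s;{warn};{crit}"
```

Keeping the evaluation separate from the database query makes the thresholds easy to test without a running replica.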

Application metrics

The list of monitored application metrics depends heavily on the particular application; the following are examples for some typical applications.

Web applications:
  • The availability of HTTP/HTTPS service or custom web service port
  • The expiration status of the SSL certificates in use
  • Number of served requests per second (RPS)
  • Application response time for test backend-light and backend-heavy requests
  • Number of requests in internal application queue
  • Number of 4xx HTTP responses
  • Number of 5xx HTTP responses

More complex web applications:

In many cases the application must be coded to serve a custom health check URL. When that URL is hit, the web application runs quick connectivity/status checks against all the other components it depends on (databases, caches, API servers, disk storage, etc.) and responds with a status code. The same health check URL can be used by local load balancers to check the availability of the backend web servers.
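
A health check endpoint of this kind could be sketched as follows, using only the Python standard library. The `/health` path, the JSON body and the choice of 200 versus 503 are assumptions for the example; a real application would plug in its own dependency checks and most likely its own web framework.

```python
import json
from http.server import BaseHTTPRequestHandler


def run_checks(checks):
    """Run every dependency check and return (http_status, results).

    checks: dict of name -> zero-argument callable that returns a truthy
    value on success; a falsy result or any exception counts as a failure.
    """
    results = {}
    for name, check in checks.items():
        try:
            passed = bool(check())
        except Exception:
            passed = False
        results[name] = "ok" if passed else "fail"
    healthy = all(state == "ok" for state in results.values())
    return (200 if healthy else 503), results


def make_handler(checks, path="/health"):
    """Build a request handler class that serves the health check URL."""

    class HealthHandler(BaseHTTPRequestHandler):
        def do_GET(self):
            if self.path != path:
                self.send_error(404)
                return
            status, results = run_checks(checks)
            body = json.dumps(results).encode("utf-8")
            self.send_response(status)
            self.send_header("Content-Type", "application/json")
            self.send_header("Content-Length", str(len(body)))
            self.end_headers()
            self.wfile.write(body)

    return HealthHandler
```

It could be served with `http.server.HTTPServer(("", 8080), make_handler({"db": ping_db, "cache": ping_cache})).serve_forever()`, where `ping_db` and `ping_cache` are hypothetical functions supplied by the application. Both Nagios and a load balancer then only need to look at the HTTP status code.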

Monitoring of business logic:

It is important to monitor the application's business logic with full end-to-end functionality tests for all critical functions of the application. In many cases this requires a custom Nagios check script that sends a test request to the monitored business logic and verifies that the correct result is produced within a defined time threshold.
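
The skeleton of such a check script might look like this. It is only a sketch: `action` stands for whatever test request the application needs (placing a test order, running a search, and so on), and the function name and time thresholds are invented for the example.

```python
import time

# Standard Nagios plugin exit codes.
OK, WARNING, CRITICAL = 0, 1, 2


def check_business_logic(action, expected, warn_secs=2.0, crit_secs=5.0,
                         clock=time.monotonic):
    """Run one end-to-end test request and grade the result and its duration.

    action:   zero-argument callable that performs the test request
    expected: the result a healthy application must return
    clock:    time source, replaceable in tests
    """
    start = clock()
    try:
        result = action()
    except Exception as exc:
        return CRITICAL, f"E2E CRITICAL - test request failed: {exc}"
    elapsed = clock() - start

    if result != expected:
        return CRITICAL, f"E2E CRITICAL - unexpected result: {result!r}"
    perfdata = f"time={elapsed:.3f}s;{warn_secs};{crit_secs}"
    if elapsed >= crit_secs:
        return CRITICAL, f"E2E CRITICAL - correct result but took {elapsed:.2f}s | {perfdata}"
    if elapsed >= warn_secs:
        return WARNING, f"E2E WARNING - correct result but took {elapsed:.2f}s | {perfdata}"
    return OK, f"E2E OK - correct result in {elapsed:.2f}s | {perfdata}"
```

A wrong result is treated as CRITICAL regardless of timing, since a fast wrong answer is still a broken business function.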

4 comments:

  1. Hi, Victor. How do you measure outliers that may be causing performance problems but are buried in the averages? With a heat map?

  2. https://github.com/harisekhon/nagios-plugins

  3. What's your standard monitoring software? I used Nagios, but today our company prefers package from softinventive labs, it's cheaper and more comfortable in use.

  4. The creator has so delightfully enchanted the thought of group by this brilliant blog.