Building Technical Operations: What to monitor on a Linux box

This article is a reminder for me (and for anyone else managing a monitoring system) of which metrics should or can be monitored for the different kinds of hardware and software components in a modern production system. The metrics listed here primarily assume the use of the Nagios, Graphite, collectl and logcheck monitoring tools.

System metrics

System metrics monitored on a Linux box using Nagios and SNMP-based plugins:

ICMP reachability (packet loss and delay)
CPU usage
System Load Average
RAM and swap memory usage
Incoming and outgoing traffic level on network interfaces
Number of dropped and bad packets on network interfaces
Disk space usage on all local partitions
The presence of required system processes (rsyslogd/sshd/snmpd/puppet/ntpd)
NTP synchronization status and current stratum level
The size of local mail queue
Server IPMI hardware parameters (power supplies, temperature, fans, etc.)
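
As an illustration of how one of the checks above reports to Nagios, here is a minimal sketch of a disk space check. It follows the standard Nagios plugin convention (exit codes 0/1/2/3 and a single status line with performance data after the `|`). Note that it reads the local filesystem directly rather than going through SNMP, and the function names and the 80%/90% thresholds are assumptions made for this example, not values from any particular setup.

```python
import shutil

# Nagios plugin exit codes (standard plugin convention).
OK, WARNING, CRITICAL, UNKNOWN = 0, 1, 2, 3


def classify(used_pct, warn_pct=80.0, crit_pct=90.0):
    """Map a usage percentage to a Nagios state (thresholds are assumed defaults)."""
    if used_pct >= crit_pct:
        return CRITICAL
    if used_pct >= warn_pct:
        return WARNING
    return OK


def check_disk(path="/", warn_pct=80.0, crit_pct=90.0):
    """Return (exit_code, status_line) for the filesystem that holds `path`."""
    try:
        usage = shutil.disk_usage(path)
    except OSError as exc:
        return UNKNOWN, f"DISK UNKNOWN - {path}: {exc}"
    if usage.total == 0:
        return UNKNOWN, f"DISK UNKNOWN - {path}: filesystem reports zero size"

    used_pct = 100.0 * usage.used / usage.total
    state = classify(used_pct, warn_pct, crit_pct)
    label = {OK: "OK", WARNING: "WARNING", CRITICAL: "CRITICAL"}[state]
    # Text after '|' is performance data in the form label=value;warn;crit
    line = (
        f"DISK {label} - {path} {used_pct:.1f}% used"
        f" | used_pct={used_pct:.1f}%;{warn_pct};{crit_pct}"
    )
    return state, line
```

A wrapper script would print the status line and pass the code to `sys.exit()`, which is how Nagios learns the state of the check.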

It is also important to monitor syslog messages using a centralized syslog server and logcheck tool.

System metrics monitored on a Linux box using a combination of Nagios, collectl and Graphite:

Disk I/O utilization level
Number of TCP connections
Number of TCP errors
Memory page faults
Number of CPU interrupts
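
To show what feeding one of these metrics into Graphite can look like, here is a rough sketch that counts established TCP connections and formats the value for Graphite's plaintext protocol (`<path> <value> <timestamp>`, usually accepted by Carbon on port 2003). In the setup described above collectl does the collecting; parsing `/proc/net/tcp` by hand is a stand-in used only for this example. The metric path and the `graphite.example.com` host are placeholders.

```python
import socket
import time


def count_established(proc_net_tcp_text):
    """Count sockets in state 01 (ESTABLISHED) in /proc/net/tcp-formatted text."""
    count = 0
    for line in proc_net_tcp_text.splitlines()[1:]:  # skip the header row
        fields = line.split()
        # The fourth column ("st") holds the socket state as a hex code.
        if len(fields) > 3 and fields[3] == "01":
            count += 1
    return count


def format_metric(path, value, timestamp=None):
    """Build one line of the Graphite plaintext protocol."""
    if timestamp is None:
        timestamp = int(time.time())
    return f"{path} {value} {int(timestamp)}\n"


def send_metrics(lines, host="graphite.example.com", port=2003, timeout=5.0):
    """Send preformatted lines to a Carbon plaintext listener (host is a placeholder)."""
    payload = "".join(lines).encode("ascii")
    with socket.create_connection((host, port), timeout=timeout) as sock:
        sock.sendall(payload)
```

On a real box the input would come from `open("/proc/net/tcp").read()`, and the formatted line would be passed to `send_metrics()` on each collection interval.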

Cisco metrics monitored using Nagios and SNMP-based plugins:

ICMP reachability (packet loss and delay)
CPU usage
Outlet temperature
Status of BGP peers
NTP synchronization status and current stratum level
Up/down status of network interfaces
Incoming and outgoing traffic level on network interfaces
Number of dropped and bad packets on network interfaces
Status of chassis fans
Status of power supplies (for redundant power configurations)

Database (MySQL/PGSQL) metrics monitored using Nagios and custom plugins:

The availability of database service
Database replication status and delay
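
The replication check can be sketched as a small function that turns a replication status row into a Nagios state. The sketch below assumes MySQL and the field names returned by `SHOW SLAVE STATUS` (`Slave_IO_Running`, `Slave_SQL_Running`, `Seconds_Behind_Master`); fetching that row with a database driver is left out, and the 60/300 second thresholds are arbitrary example values.

```python
# Nagios plugin exit codes (standard plugin convention).
OK, WARNING, CRITICAL, UNKNOWN = 0, 1, 2, 3


def evaluate_replication(status, warn_s=60, crit_s=300):
    """Map a SHOW SLAVE STATUS row (as a dict) to (exit_code, status_line)."""
    if not status:
        # An empty result usually means replication is not configured at all.
        return CRITICAL, "REPLICATION CRITICAL - no replication status returned"

    io_running = status.get("Slave_IO_Running")
    sql_running = status.get("Slave_SQL_Running")
    if io_running != "Yes" or sql_running != "Yes":
        return CRITICAL, (
            f"REPLICATION CRITICAL - IO thread: {io_running}, SQL thread: {sql_running}"
        )

    lag = status.get("Seconds_Behind_Master")
    if lag is None:
        # MySQL reports NULL here when the delay cannot be determined.
        return UNKNOWN, "REPLICATION UNKNOWN - delay not reported"

    lag = int(lag)
    perfdata = f"lag={lag}s;{warn_s};{crit_s}"
    if lag >= crit_s:
        return CRITICAL, f"REPLICATION CRITICAL - {lag}s behind | {perfdata}"
    if lag >= warn_s:
        return WARNING, f"REPLICATION WARNING - {lag}s behind | {perfdata}"
    return OK, f"REPLICATION OK - {lag}s behind | {perfdata}"
```

Keeping the evaluation separate from the database query makes the thresholds easy to test without a live replica.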

Application metrics

The list of monitored application metrics depends heavily on the particular application; the following are examples for some typical applications.

Web applications:

The availability of HTTP/HTTPS service or custom web service port
The expiration status of the SSL certificates in use
Number of served requests per second (RPS)
Application response time for test backend-light and backend-heavy requests
Number of requests in internal application queue
Number of 4xx HTTP responses
Number of 5xx HTTP responses
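
One simple way to obtain the 4xx and 5xx counts is to tally status codes from the web server access log. The sketch below assumes the log uses the common or combined log format, where the status code follows the quoted request line; the function name is made up for this example, and the resulting counts could be shipped to Graphite or compared against thresholds in a Nagios check.

```python
import re
from collections import Counter

# In common/combined log format the status code follows the quoted request:
#   ... "GET /index.html HTTP/1.1" 200 2326 ...
STATUS_RE = re.compile(r'" (\d{3}) ')


def count_status_classes(lines):
    """Count responses per status class ('2xx', '4xx', '5xx', ...)."""
    counts = Counter()
    for line in lines:
        match = STATUS_RE.search(line)
        if match:
            counts[match.group(1)[0] + "xx"] += 1
    return counts
```

Lines that do not match the expected format are skipped rather than counted, so a malformed log produces low numbers instead of an exception.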

More complex web applications:

In many cases the application will need to serve a custom health check URL. When that URL is hit, the web application runs quick connectivity/status checks against all the other components it depends on (databases, caches, API servers, disk storage access, etc.) and responds with a status code. The same health check URL can be used by local load balancers to check the availability of the backend application web servers.
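
A health check URL of this kind can be sketched with nothing more than the Python standard library. In the example below the `/healthz` path, the JSON response body and the choice of 503 for a failed dependency are assumptions made for illustration; each dependency check is just a callable that raises an exception when the component is unavailable.

```python
import json
from http.server import BaseHTTPRequestHandler


def run_checks(checks):
    """Run every dependency check; return (http_status, per-check results)."""
    results = {}
    for name, check in checks.items():
        try:
            check()
            results[name] = "ok"
        except Exception as exc:  # any failure marks the dependency as down
            results[name] = f"failed: {exc}"
    healthy = all(value == "ok" for value in results.values())
    return (200 if healthy else 503), results


def make_handler(checks, path="/healthz"):
    """Build a request handler class that serves the health check at `path`."""

    class HealthHandler(BaseHTTPRequestHandler):
        def do_GET(self):
            if self.path != path:
                self.send_error(404)
                return
            code, results = run_checks(checks)
            body = json.dumps(results).encode("utf-8")
            self.send_response(code)
            self.send_header("Content-Type", "application/json")
            self.send_header("Content-Length", str(len(body)))
            self.end_headers()
            self.wfile.write(body)

        def log_message(self, format, *args):
            pass  # keep frequent health probes out of the server log

    return HealthHandler
```

To try it out, something like `HTTPServer(("", 8080), make_handler({"db": check_db})).serve_forever()` (with `HTTPServer` imported from `http.server` and `check_db` being your own function) would serve the endpoint. Both Nagios and a load balancer then only need to look at the HTTP status code.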

Monitoring of business logic:

It is important to monitor the application business logic (with a full end-to-end functionality test) for all critical functions of an application. In many cases such monitoring will require a custom Nagios check script that sends a test request to the monitored business logic and verifies that a proper result has been produced within the defined time threshold.
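
The core of such a check script might look like the sketch below. The `action` and `verify` callables are hypothetical placeholders for the real test request and the real result validation. Treating a wrong result as CRITICAL and a correct but slow result as WARNING is a choice made for this example, not a rule.

```python
import time

# Nagios plugin exit codes (standard plugin convention).
OK, WARNING, CRITICAL, UNKNOWN = 0, 1, 2, 3


def run_end_to_end_check(action, verify, max_seconds):
    """Run a test request, validate its result and time it.

    action      -- callable that performs the test request and returns a result
    verify      -- callable that returns True if the result is acceptable
    max_seconds -- time threshold for the whole request
    """
    start = time.monotonic()
    try:
        result = action()
    except Exception as exc:
        return CRITICAL, f"E2E CRITICAL - request failed: {exc}"
    elapsed = time.monotonic() - start

    perfdata = f"time={elapsed:.3f}s;{max_seconds}"
    if not verify(result):
        return CRITICAL, f"E2E CRITICAL - unexpected result | {perfdata}"
    if elapsed > max_seconds:
        return WARNING, f"E2E WARNING - correct result but took {elapsed:.3f}s | {perfdata}"
    return OK, f"E2E OK - correct result in {elapsed:.3f}s | {perfdata}"
```

The script wrapping this function would print the returned line and exit with the returned code, so the same function can cover several business functions by swapping in different `action` and `verify` callables.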

4 comments:

Hawaiirama, June 21, 2013 at 9:06 AM
Hi, Victor. How do you measure outliers that may be causing performance problems but are buried in the averages? With a heat map?
Thomas, October 23, 2013 at 2:13 PM
https://github.com/harisekhon/nagios-plugins
Unknown, December 14, 2016 at 5:59 AM
What's your standard monitoring software? I used Nagios, but today our company prefers a package from softinventive labs; it's cheaper and more comfortable to use.
Rocket, November 26, 2018 at 5:48 AM
The creator has so delightfully enchanted the thought of group by this brilliant blog.

Sunday, June 16, 2013
