Building Technical Operations: September 2011

Friday, September 30, 2011

Handling of external monitoring alerts

If you have an Internet-facing production system, it is always wise to use an external web availability and/or performance monitoring service such as Pingdom or Gomez/Keynote/Catchpoint to monitor the exposed resources. All of these services support email alerting, and some also provide SMS alerting. As I explained in an earlier post, I recommend using SMS alerting for critical monitoring events, but managing individual SMS alerting configurations in each external monitoring system you use does not scale.

Why you need an Operations Change Log

This post explains why you need to maintain an Operations Change Log, and describes a simple and effective method for doing so.

The goal of an Operations Change Log is to make it possible to trace, at a high level, all changes performed in the production environment. Change Log information is normally used during troubleshooting and post-mortem analysis.

Email and SMS alerting policy

In a regular production environment there are normally three types of monitoring events and corresponding notification methods:

Critical alerts requiring immediate attention; for example, a host is down, a web service is not responding, a disk is almost full, application response time is high, etc. Critical alerts should be dispatched to the responsible on-call engineers and the escalation list via email and SMS
Non-critical alerts which can be handled later; for example, alerts about problems with staging/pre-production servers, newly configured monitoring checks which are still being fine-tuned, etc. These alerts should be dispatched by email only
Informative events, like expected high server CPU usage during peak time or special events. These alerts may appear in the monitoring system's web interface (for those who are watching the screen), but should not be dispatched by email or SMS (or require any acknowledgment)
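
The three-level policy above can be written down as a small lookup table. The following is only a minimal sketch in Python; the names Severity, CHANNELS and channels_for are hypothetical and do not come from any particular monitoring system.

```python
from enum import Enum


class Severity(Enum):
    """The three types of monitoring events described above."""
    CRITICAL = "critical"
    NON_CRITICAL = "non-critical"
    INFORMATIVE = "informative"


# Notification channels per event type (hypothetical names).
CHANNELS = {
    Severity.CRITICAL: ("email", "sms"),   # on-call engineers and escalation list
    Severity.NON_CRITICAL: ("email",),     # can be handled later
    Severity.INFORMATIVE: (),              # visible in the web interface only
}


def channels_for(severity):
    """Return the tuple of channels an event of this type is dispatched to."""
    return CHANNELS[severity]
```

For example, channels_for(Severity.NON_CRITICAL) returns ("email",), while an informative event returns an empty tuple and is therefore never dispatched.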

Assigning DNS names to IP addresses

Why is it so important to configure proper DNS names for all used IP addresses? A few reasons:

To make people remember the names and not IP addresses
To simplify the communication with customers during a troubleshooting session
To easily detect and trace possible security events recorded in log files
To simplify the reading of ping and traceroute results

Or in one sentence - to simplify the troubleshooting process and system management.
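
One way to find addresses that still lack a name is to attempt a reverse DNS lookup for each of them. The sketch below uses Python's standard socket.gethostbyaddr; the function name find_unnamed_addresses and the resolve parameter are hypothetical, and the lookup only reflects what the local resolver can see.

```python
import socket


def find_unnamed_addresses(addresses, resolve=socket.gethostbyaddr):
    """Return the addresses for which no reverse DNS name can be resolved.

    `resolve` is injectable (an assumption made for this sketch) so the
    function can be exercised without real DNS queries.
    """
    unnamed = []
    for address in addresses:
        try:
            resolve(address)
        except (socket.herror, socket.gaierror):
            # The resolver found no name for this address.
            unnamed.append(address)
    return unnamed
```

Running it periodically over the list of addresses in use gives a simple report of the entries that still need a DNS name.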

My list of favorite Nagios check scripts

Nagios is a great monitoring tool; I have used it to monitor networks with hundreds of hosts and thousands of service checks. One of Nagios's biggest advantages is the wide set of service check scripts available for public use, and the following two sites provide a great collection of the scripts:

What is confusing in Nagios is that there are a lot of different scripts from different authors doing more or less the same thing (checking the same stuff), and a system administrator deploying a new instance of Nagios has to invest a lot of time in selecting the best scripts from the wide variety of available solutions.

The goal of the post is to share with you my list of favorite Nagios check scripts, so that the next time you need to deploy a Nagios instance you can use the page as a reference for the required check modules.
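
For readers who have not looked inside a check script: a Nagios check prints one status line and reports its state through the exit code (0 for OK, 1 for WARNING, 2 for CRITICAL, 3 for UNKNOWN). The following is a rough sketch of a disk usage check in Python, not one of the scripts from the list; the thresholds and the names evaluate and main are assumptions.

```python
import shutil

# Standard Nagios plugin exit codes.
OK, WARNING, CRITICAL, UNKNOWN = 0, 1, 2, 3


def evaluate(percent_used, warn=80.0, crit=90.0):
    """Map a disk usage percentage to an exit code and a status line.

    The default thresholds are illustrative assumptions.
    """
    if percent_used >= crit:
        return CRITICAL, f"DISK CRITICAL - {percent_used:.1f}% used"
    if percent_used >= warn:
        return WARNING, f"DISK WARNING - {percent_used:.1f}% used"
    return OK, f"DISK OK - {percent_used:.1f}% used"


def main(path="/"):
    """Check one filesystem, print the status line, return the exit code."""
    try:
        usage = shutil.disk_usage(path)
    except OSError as exc:
        print(f"DISK UNKNOWN - {exc}")
        return UNKNOWN
    code, message = evaluate(100.0 * usage.used / usage.total)
    print(message)
    return code
```

A real plugin would end by passing the value returned by main() to sys.exit() so that Nagios sees the exit code.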

Equipment naming convention

Why is it so important?
Having a clear and meaningful equipment naming convention will help you to:

minimize human errors such as executing the right commands on the wrong servers
spend less time analyzing monitoring alerts and log messages
have fewer problems while automating system management tasks
have a more understandable engineering and operational description of your network
make your network more manageable and scalable
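
To illustrate the automation point, here is a small Python sketch that parses host names following a purely hypothetical convention of the form site-roleNN (for example nyc-web01). The convention itself, the pattern and the names HostName and parse_hostname are assumptions made only for this example.

```python
import re
from typing import NamedTuple, Optional


class HostName(NamedTuple):
    site: str   # three-letter site code, e.g. "nyc"
    role: str   # function of the host, e.g. "web"
    index: int  # two-digit sequence number


# Hypothetical convention: <site>-<role><NN>
PATTERN = re.compile(r"(?P<site>[a-z]{3})-(?P<role>[a-z]+)(?P<index>\d{2})")


def parse_hostname(name: str) -> Optional[HostName]:
    """Return the parsed parts, or None if the name breaks the convention."""
    match = PATTERN.fullmatch(name)
    if match is None:
        return None
    return HostName(match["site"], match["role"], int(match["index"]))
```

With names like these, a script can select all hosts with the role "web" at one site without a separate inventory lookup, and names that return None can be flagged for renaming.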

Building Technical Operations

Friday, September 30, 2011

Handling of external monitoring alerts

Sunday, September 25, 2011

Why you need an Operations Change Log

Friday, September 23, 2011

Email and SMS alerting policy

Thursday, September 22, 2011

Assigning DNS names to IP addresses

Monday, September 12, 2011

My list of favorite Nagios check scripts

Saturday, September 3, 2011

Equipment naming convention