Building and running a large production system is difficult. This blog is for people who build and manage large Internet and intranet systems and services.
Friday, September 30, 2011
Handling of external monitoring alerts
If you have an Internet-facing production system, it is always wise to use an external web availability and/or performance monitoring service such as Pingdom, Gomez, Keynote, or Catchpoint to monitor the exposed resources. All of these services support email alerting, and some also provide SMS alerting. As I have already explained, I recommend using SMS alerting for critical monitoring events, but managing a separate SMS alerting configuration in each external monitoring system does not scale.
Labels: Monitoring, monitoring alerting, Pingdom, RT, SMS, SMS via email
Sunday, September 25, 2011
Why you need an Operations Change Log
This post explains why you need to maintain an Operations Change Log and describes a simple and effective method for doing so.
The goal of an Operations Change Log is to make it possible to trace back, at a high level, all changes performed in the production environment. Change Log information is normally used during troubleshooting and post-mortem analysis.
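To illustrate how such a log helps during troubleshooting, here is a minimal Python sketch. It is not the method described in the full post. The `ChangeEntry` fields, the `changes_before` helper, and the 24-hour default window are all assumptions made for this example.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta


@dataclass(frozen=True)
class ChangeEntry:
    """One high-level change log record (hypothetical field set)."""
    when: datetime   # when the change was performed
    who: str         # engineer who made the change
    target: str      # host or service that was changed
    summary: str     # short description of the change


def changes_before(log, incident_time, window=timedelta(hours=24)):
    """Return entries made within `window` before the incident, newest first."""
    start = incident_time - window
    recent = [e for e in log if start <= e.when <= incident_time]
    return sorted(recent, key=lambda e: e.when, reverse=True)


# Hypothetical entries, for illustration only.
log = [
    ChangeEntry(datetime(2011, 9, 20, 10, 0), "alice", "db01", "Upgraded kernel"),
    ChangeEntry(datetime(2011, 9, 24, 22, 30), "bob", "lb01", "Changed pool weights"),
    ChangeEntry(datetime(2011, 9, 25, 8, 15), "carol", "web03", "Deployed new build"),
]

for entry in changes_before(log, datetime(2011, 9, 25, 9, 0)):
    print(entry.when.isoformat(), entry.who, entry.target, entry.summary)
```

During a post-mortem, the first question is usually "what changed recently?", so the helper returns the newest changes first.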
Labels: Change Management, Operations Change Log, RT
Friday, September 23, 2011
Email and SMS alerting policy
In a regular production environment there are normally three types of monitoring events and corresponding notification methods:
- Critical alerts requiring immediate attention; for example, a host is down, a web service is not responding, a disk is almost full, application response time is high, etc. Critical alerts should be dispatched to the responsible on-call engineers and the escalation list via email and SMS
- Non-critical alerts which can be handled later; for example, alerts about problems with staging/pre-production servers, newly configured monitoring checks which are still being fine-tuned, etc. These alerts should be dispatched by email only
- Informative events like expected high server CPU usage during peak time or special events. These alerts may appear in the monitoring system's web interface (for those who are watching the screen), but should not be dispatched by email or SMS (or require any acknowledgment)
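The three-level policy above can be written down as a small lookup table. The following Python sketch is only an illustration. The `Severity` enum, the `DISPATCH_POLICY` table, and the `channels_for` helper are names invented for this example, not part of any monitoring product.

```python
from enum import Enum


class Severity(Enum):
    CRITICAL = "critical"
    NON_CRITICAL = "non-critical"
    INFORMATIVE = "informative"


# Dispatch channels per event type. Informative events are not dispatched
# at all; they are only visible in the monitoring system's web interface.
DISPATCH_POLICY = {
    Severity.CRITICAL: ("email", "sms"),
    Severity.NON_CRITICAL: ("email",),
    Severity.INFORMATIVE: (),
}


def channels_for(severity):
    """Return the notification channels an event of this severity uses."""
    return DISPATCH_POLICY[severity]


for level in Severity:
    print(level.value, "->", ", ".join(channels_for(level)) or "web interface only")
```

Keeping the policy in one table makes it easy to review and to apply consistently across monitoring checks.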
Labels: monitoring alerting, SMS, SMS via email
Thursday, September 22, 2011
Assigning DNS names to IP addresses
Why is it so important to configure proper DNS names for all IP addresses in use? A few reasons:
- To let people remember names rather than IP addresses
- To simplify communication with customers during a troubleshooting session
- To easily detect and trace possible security events recorded in log files
- To make ping and traceroute results easier to read
In one sentence: to simplify troubleshooting and system management.
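One way to make sure every address gets a name is to generate the reverse (PTR) records from an inventory and flag the addresses that are still unnamed. The Python sketch below assumes a simple address-to-name dictionary. The host names, the documentation-range addresses, and both helper functions are hypothetical.

```python
import ipaddress


def ptr_record(ip, fqdn):
    """Return a zone-file style PTR line for one IP address."""
    pointer = ipaddress.ip_address(ip).reverse_pointer
    return f"{pointer}. IN PTR {fqdn}."


def unnamed_addresses(addresses, names):
    """Return the addresses in use that have no DNS name assigned yet."""
    return [ip for ip in addresses if ip not in names]


# Hypothetical inventory: every interface address maps to a meaningful name.
names = {
    "192.0.2.1": "gw-eth0.example.com",
    "192.0.2.10": "web01.example.com",
}
in_use = ["192.0.2.1", "192.0.2.10", "192.0.2.20"]

for ip, fqdn in names.items():
    print(ptr_record(ip, fqdn))
print("Missing names:", unnamed_addresses(in_use, names))
```

With PTR records in place, tools such as ping and traceroute can show names instead of bare addresses.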
Labels: DNS, DNS name, interface DNS name
Monday, September 12, 2011
My list of favorite Nagios check scripts
Nagios is a great monitoring tool - I have used it to monitor networks with hundreds of hosts and thousands of service checks. One of Nagios's biggest advantages is the wide set of service check scripts available for public use, and the following two sites provide a great collection of these scripts:
What is confusing about Nagios is that there are a lot of different scripts from different authors doing more or less the same thing, and a system administrator deploying a new Nagios instance has to invest a lot of time selecting the best scripts from the wide variety of available solutions.
The goal of this post is to share my list of favorite Nagios check scripts, so the next time you need to deploy a Nagios instance you can use this page as a reference for the required check modules.
Labels: Monitoring, Nagios, Nagios plug-ins
Saturday, September 3, 2011
Equipment naming convention
Why is it so important?
Having a clear and meaningful equipment naming convention will help you to:
- minimize the human error of running the right commands on the wrong servers
- spend less time analyzing monitoring alerts and log messages
- have fewer problems while automating system management tasks
- have a more understandable engineering and operational description of your network
- make your network more manageable and scalable
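A convention is easiest to enforce when it can be checked by a script. The Python sketch below assumes a hypothetical `<site>-<role><nn>` scheme such as `nyc-web01`. The pattern and the `parse_hostname` helper are illustrations only, not the convention proposed in this post.

```python
import re

# Hypothetical convention: three-letter site code, a dash, a role name,
# and a two-digit index, all lowercase (for example "nyc-web01").
HOSTNAME_PATTERN = re.compile(r"(?P<site>[a-z]{3})-(?P<role>[a-z]+)(?P<index>\d{2})")


def parse_hostname(name):
    """Split a host name into its parts, or return None if it breaks the convention."""
    match = HOSTNAME_PATTERN.fullmatch(name)
    if match is None:
        return None
    return {
        "site": match.group("site"),
        "role": match.group("role"),
        "index": int(match.group("index")),
    }


for candidate in ["nyc-web01", "lon-db12", "Server3", "nyc-web1"]:
    print(candidate, "->", parse_hostname(candidate))
```

Because the name carries the site and role, automation scripts and alert handlers can select hosts by parsing names instead of keeping separate lists.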
Labels: Equipment Naming Convention, Host names