Saturday, November 5, 2011

Don’t hesitate to ask your R&D for software documentation!


If you, as an Operations Engineer, have experience working with third-party software, you most likely enjoy a full set of documentation for the products you use: deployment guide, installation manual, user’s guide, release notes, upgrade procedures, troubleshooting procedures, etc. When you switch to in-house R&D products, you find that in most cases they are not accompanied by any usable Operations documentation, which makes deploying and running the software challenging and hard to scale. This post will help you define minimal documentation requirements for in-house R&D products that should not place a significant burden on your R&D team.

Thursday, October 13, 2011

Remote Control for Your Production Site

Have you ever found yourself rushing to the office data center in the middle of the night just because a critical server is down and you have no means to remotely access the server's console or reset the power? How much time and money have you spent on the line with data center technicians trying to recover critical pieces of your production infrastructure located at a remote colocation facility?

This post will explain which remote management options are available to you, and how to organize remote management of your critical equipment.

Saturday, October 8, 2011

Have a problem managing your tasks? Read this post!


If you are lucky enough to handle all your tasks and requests on time, without needing to manage and prioritize a backlog, then you probably don't need to read this post. If you are a manager and all your direct reports are excellent at managing their task lists and setting the correct task priorities by themselves, then you can probably skip this post too. The rest of you should keep reading.

This is the situation:
  • You have a team of bright engineers
  • You have many requests coming from different sources: monitoring alerts, customer support tickets, internal change requests, deployment requests for new software releases, corporate IT requests (if you still wear that hat), internal requests from different departments like sales, marketing, etc.

Sunday, October 2, 2011

My monitoring tools set

Now it is time to talk about my set of monitoring tools for medium and large systems.

Nagios monitoring service
I just love it! See my other article for a list of Nagios plugins suitable for monitoring most popular services. You can easily write your own plugins, and this is what I did to monitor many application-specific services.
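To give an idea of how little a custom plugin needs, here is a minimal sketch in Python. It is not one of the plugins I actually run: the disk usage check, the function names and the 80/90 percent thresholds are all assumptions chosen for illustration. The only fixed part is the standard plugin contract of one status line on stdout plus an exit code of 0 (OK), 1 (WARNING), 2 (CRITICAL) or 3 (UNKNOWN).

```python
import shutil

# Standard Nagios plugin exit codes.
OK, WARNING, CRITICAL, UNKNOWN = 0, 1, 2, 3


def evaluate_disk(used_pct, warn=80.0, crit=90.0):
    """Map a disk usage percentage to an (exit_code, status_line) pair.

    The warn/crit defaults are illustrative assumptions, not recommendations.
    """
    if used_pct >= crit:
        return CRITICAL, f"DISK CRITICAL - {used_pct:.1f}% used"
    if used_pct >= warn:
        return WARNING, f"DISK WARNING - {used_pct:.1f}% used"
    return OK, f"DISK OK - {used_pct:.1f}% used"


def main(path="/"):
    """Check one file system, print the status line, return the exit code."""
    try:
        usage = shutil.disk_usage(path)
        used_pct = 100.0 * usage.used / usage.total
    except (OSError, ZeroDivisionError) as exc:
        # Anything we cannot measure is reported as UNKNOWN, not OK.
        print(f"DISK UNKNOWN - {exc}")
        return UNKNOWN
    code, line = evaluate_disk(used_pct)
    print(line)
    return code

# A deployed plugin would finish with: import sys; sys.exit(main())
```

Keeping the threshold logic in its own function (evaluate_disk here) makes it easy to test the plugin without touching a real server.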

Cacti graphing service
If you have the Nagiosgraph module deployed in your Nagios monitoring system, you may think that Cacti is redundant. But it is not, especially once you start using the many nice Cacti plugins and templates available for MySQL, Tomcat, Cisco hardware, disk I/O and others.

Friday, September 30, 2011

Handling of external monitoring alerts

If you have an Internet-facing production system, it is always wise to use an external web availability and/or performance monitoring service like Pingdom or Gomez/Keynote/Catchpoint to monitor the exposed resources. All of these services support email alerting, and some even provide SMS alerting. As I already explained here, I recommend using SMS alerting for critical monitoring events, but managing individual SMS alerting configurations in each external monitoring system you use does not scale.

Sunday, September 25, 2011

Why you need an Operations Change Log

This post will explain why you need to maintain an Operations Change Log, and will describe a simple and effective method to accomplish the task.

The goal of an Operations Change Log is to make it possible to trace back, at a high level, all changes performed in the production environment. Change Log information is normally used during troubleshooting and post-mortem analysis.

Friday, September 23, 2011

Email and SMS alerting policy

In a regular production environment there are normally three types of monitoring events and corresponding notification methods:
  1. Critical alerts requiring immediate attention; for example, a host is down, a web service is not responding, a disk is almost full, application response time is high, etc. Critical alerts should be dispatched to the responsible on-call engineers and the escalation list via email and SMS
  2. Non-critical alerts which can be handled later; for example, alerts about problems with staging/pre-production servers, newly configured monitoring checks which are still being fine-tuned, etc. These alerts should be dispatched by email only
  3. Informative events like expected high server CPU usage during peak time or special events. These alerts may appear in the monitoring system's web interface (for those who are watching the screen), but should not be dispatched by email or SMS (or require any acknowledgment)
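The three types above boil down to a small routing table. The following Python sketch shows one way to express it; the Severity enum, the CHANNELS mapping and the channels_for function are hypothetical names used only for illustration and do not belong to any particular monitoring system.

```python
from enum import Enum


class Severity(Enum):
    CRITICAL = "critical"
    NON_CRITICAL = "non-critical"
    INFORMATIVE = "informative"


# Notification channels per event type, following the three types above.
CHANNELS = {
    Severity.CRITICAL: ("email", "sms"),
    Severity.NON_CRITICAL: ("email",),
    Severity.INFORMATIVE: (),  # shown in the web interface only
}


def channels_for(severity):
    """Return the channels an event of the given severity is dispatched to."""
    return CHANNELS[severity]
```

Keeping the policy in one table means a change such as adding a new channel for critical alerts is a one-line edit rather than a hunt through individual check definitions.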

Thursday, September 22, 2011

Assigning DNS names to IP addresses

Why is it so important to configure proper DNS names for all the IP addresses you use? A few reasons:
  • To let people remember names instead of IP addresses
  • To simplify the communication with customers during a troubleshooting session
  • To easily detect and trace possible security events recorded in log files
  • To simplify the reading of ping and traceroute results
Or, in one sentence: to simplify the troubleshooting process and system management.
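A quick way to find addresses that still lack a name is to check their reverse DNS records. Here is a minimal Python sketch using only the standard library; the function names are my own, and the resolver parameter is an assumption added so the check can be tried without a live DNS server.

```python
import ipaddress
import socket


def ptr_name(ip):
    """Return the PTR record name that should exist for an IP address."""
    return ipaddress.ip_address(ip).reverse_pointer


def missing_reverse_dns(ips, resolver=socket.gethostbyaddr):
    """Return the addresses from `ips` that do not resolve to a host name."""
    missing = []
    for ip in ips:
        try:
            resolver(ip)
        except OSError:  # socket.herror and socket.gaierror are subclasses
            missing.append(ip)
    return missing
```

For example, ptr_name("192.0.2.10") gives "10.2.0.192.in-addr.arpa", which is the record to add to the reverse zone for that address.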

Monday, September 12, 2011

My list of favorite Nagios check scripts

Nagios is a great monitoring tool: I have used it to monitor networks with hundreds of hosts and thousands of service checks. One of the biggest advantages of Nagios is the wide set of service check scripts available for public use, and the following two sites provide a great collection of the scripts:

What is confusing about Nagios is that there are a lot of different scripts from different authors doing more or less the same thing (checking the same stuff), and a system administrator deploying a new instance of Nagios has to invest a lot of time selecting the best scripts from the wide variety of available solutions.

The goal of this post is to share my list of favorite Nagios check scripts, so the next time you need to deploy a Nagios instance you can just use this page as a reference for the required check modules.

Saturday, September 3, 2011

Equipment naming convention

Why is it so important?
Having a clear and meaningful equipment naming convention will help you to:
  • minimize human errors such as running the right command on the wrong server
  • spend less time analyzing monitoring alerts and log messages
  • have fewer problems while automating system management tasks
  • have a more understandable engineering and operational description of your network
  • make your network more manageable and scalable
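To show why a strict convention helps automation, here is a small Python sketch that parses host names. The pattern is purely hypothetical: it assumes names of the form site-roleNN (for example nyc1-web03) and is not the convention proposed in this post. The point is only that a regular naming scheme lets scripts extract the site and role instead of relying on a hand-maintained list.

```python
import re

# Hypothetical convention: <site>-<role><two-digit index>, e.g. "nyc1-web03".
HOSTNAME_RE = re.compile(r"(?P<site>[a-z]{3}\d)-(?P<role>[a-z]+)(?P<index>\d{2})")


def parse_hostname(name):
    """Split a host name into site, role and index, or return None."""
    match = HOSTNAME_RE.fullmatch(name)
    if match is None:
        return None
    return {
        "site": match.group("site"),
        "role": match.group("role"),
        "index": int(match.group("index")),
    }
```

A name that does not fit the pattern returns None, which makes it easy to flag equipment that slipped past the convention.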

Monday, August 29, 2011

How to start?

Some people ask me: how do you start building a production system and make sure that it stays reliable, scalable and manageable once it outgrows two cabinets in one location (my rough definition of a small system)?

Start by breaking the large task down into small pieces. The following are some examples of "small pieces":
  • Select the OS platform and distribution to go with (if your R&D leaves you a choice)
  • Select the hardware platform

Operations requirements for in-house R&D products

This post will help you define specific requirements from Operations to R&D for all in-house software provided for production deployment, if you have to deal with in-house R&D products.

Once the requirements are modified to suit your specific environment, they should be discussed with your company's R&D manager and approved by both parties. From that moment on, all new software products should comply with the standard, and a strict timetable should be defined to fix all existing products (expect that this will take some time).

Packaging and package names
All in-house software should be packaged using the standard packaging format supported by the operating system in use: RPM for RHEL/CentOS, DEB for Debian-like systems, etc.

Thursday, June 23, 2011

Knowledge and information management

Knowledge and information management is one of the building blocks in an effective Operations department. It is really important to get a content management system in place before starting to build a production system. If you DO NOT do this, you and your team will suffer from the following:
  • equipment inventory details will be stored only in your Purchase Orders, most likely kept in your mailbox
  • low-level design details can be found only in emails asking to perform cable wiring, or will be lost altogether
  • many versions of systems design documents will be managed in emails or in locally stored documents, making it almost impossible to trace down and identify the latest versions
  • the documentation for managed systems will simply not exist since there is no place to store it
  • and many other problems related to the lack of a central place to store and manage all required information about the system

Wednesday, June 22, 2011

Why have I started this blog?


My reason for starting this blog is to share my knowledge and experience in building and managing large production systems.