Saturday, March 29, 2014

7 Tips For Successful PCI DSS Audit

In light of my recent experience leading Instart Logic through the process of PCI DSS Level-1 certification, and some experience implementing a PCI environment at Cotendo, I would like to share my recommendations on how to successfully pass an on-site audit for PCI DSS compliance.

Saturday, November 30, 2013

Vendor Management Tips


This post is for technical operations folks managing vendor contracts. For technical people, dealing with vendors and legal paperwork can be a bit complicated, and I hope the information in this post will come in handy.

The vendor management process can be roughly outlined in the following steps (described in detail later in this post):
  1. Research
  2. Price negotiations and service testing
  3. Contract negotiation
  4. Management of vendor contracts

In this article I’ll primarily concentrate on vendors demanding a long-term commitment (6+ months) on small and medium volume contracts (from $1,000/month to $20,000/month), but some tips are also relevant to month-to-month agreements and large volume deals.

Sunday, June 16, 2013

What to monitor on a Linux box

This article is a kind of reminder for me (and for anyone else managing a monitoring system) about which metrics should or can be monitored for different kinds of hardware and software components in a modern production system. The listed metrics primarily assume the use of the Nagios, Graphite, collectl and logcheck monitoring tools.

Metrics, metrics, metrics...

A lot has been said about the importance of system and application metrics - I will not repeat it here, and will concentrate on the off-the-shelf options available for implementing a robust, usable and scalable metrics collection and monitoring system.

We will talk about three main areas:
  1. How to generate/create metrics
  2. How to collect and represent the metrics
  3. How to monitor the collected metrics
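As a small illustration of the first area, here is a minimal sketch of generating a metric, assuming Graphite is the collector and its plaintext protocol is used (one line per data point: metric path, value, Unix timestamp). The metric name and value are hypothetical, and actually sending the line to the Carbon listener is left out.

```python
import time


def graphite_line(path, value, timestamp=None):
    # Graphite's plaintext protocol expects "<metric path> <value> <timestamp>"
    # terminated by a newline. The timestamp is in seconds since the epoch.
    if timestamp is None:
        timestamp = int(time.time())
    return f"{path} {value} {int(timestamp)}\n"


# Hypothetical metric name and value, for illustration only.
line = graphite_line("servers.web01.load.shortterm", 0.42, timestamp=1371340800)
```

The resulting string can be written to a TCP connection to the Carbon plaintext listener; the host and port depend on your installation.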

Thursday, June 28, 2012

The art of crontab jobs monitoring, part II

In part I of the discussion I outlined some basic principles of crontab jobs monitoring. One of the biggest disadvantages of the described approach is the lack of a centralized control panel (console) which can be used to see the current status of all configured periodic jobs.

The following is a description of a more advanced crontab jobs monitoring system which will provide a status of:
  • successfully and timely executed jobs
  • failed jobs
  • delayed or missed jobs
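The three states listed above can be sketched as a simple classification function. This is only an illustration, not the system described in the post: it assumes each job reports the time and exit code of its last run to some central place, and the function and parameter names are hypothetical.

```python
from datetime import datetime, timedelta


def job_status(last_run, exit_code, now, max_age):
    """Classify a periodic job from its last reported run.

    last_run  -- datetime of the last reported run, or None if never reported
    exit_code -- exit code of that run (0 means success)
    now       -- current time
    max_age   -- timedelta after which a missing report counts as delayed or missed
    """
    if last_run is None or now - last_run > max_age:
        return "delayed or missed"
    if exit_code != 0:
        return "failed"
    return "ok"


# Example: an hourly job that last reported 30 minutes ago with exit code 0.
now = datetime(2012, 6, 28, 12, 0)
status = job_status(now - timedelta(minutes=30), 0, now, timedelta(hours=1))
```

Checking the age of the last report first means a job that stopped running is flagged even if its last recorded run was successful.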

Saturday, May 26, 2012

The art of application logging

There are three frequently met problems with logging functionality in in-house software:
  • Application developers don't really care about proper application logging
  • Development managers cannot really define coding standards for application logging, or in the worst case - they just don't care about logging at all
  • Operations guys cannot troubleshoot in-house software without simple and human-readable logs
The goal of this post is to collect in a single document all the critical points about application logging as seen from the Operations point of view.

Saturday, April 21, 2012

The art of crontab jobs monitoring

In a regular production or development environment there are normally a lot of crontab jobs configured on running servers. The jobs can be a part of deployed applications or can perform different system administration tasks like backups, reporting, etc. This post will describe several key points which should be considered while configuring crontab jobs in a large environment.

Automatic deployment of crontab configuration
Linux distributions provide a convenient way to automatically deploy new crontab jobs while installing a new software package or using centralized configuration management systems like Puppet or Chef. You simply drop a properly formatted crontab configuration file into the /etc/cron.d/ directory, and the running cron daemon will automatically read the new file and configure the jobs specified in it.
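To make "properly formatted" concrete: files in /etc/cron.d/ use the system crontab format, which has five time fields, then the user to run as, then the command. The sketch below renders one such line; the helper name, the user and the script path are hypothetical and only serve as an example.

```python
def cron_d_entry(schedule, user, command):
    # /etc/cron.d files use the system crontab format:
    # minute hour day-of-month month day-of-week user command
    fields = schedule.split()
    if len(fields) != 5:
        raise ValueError(
            "schedule must have five fields: "
            "minute hour day-of-month month day-of-week"
        )
    return f"{' '.join(fields)} {user} {command}\n"


# Hypothetical job: run a backup script as user "backup" at 02:30 every day.
entry = cron_d_entry("30 2 * * *", "backup", "/usr/local/bin/nightly-backup.sh")
```

A packaging script or a configuration management template could write this line to a file under /etc/cron.d/.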

Saturday, November 5, 2011

Don’t hesitate to ask your R&D for software documentation!

If you, as an Operations Engineer, have experience working with third-party software, you have most likely enjoyed a full set of documentation for the products you used: deployment guide, installation manual, user’s guide, release notes, upgrade procedures, troubleshooting procedure, etc. When you switch to in-house R&D products, you find that in most cases they are not accompanied by any usable Operations documentation, which makes deploying and running the software quite challenging and hard to scale. This post will help you define some minimal documentation requirements for in-house R&D products, and these requirements should not place any significant burden on your R&D team.

Thursday, October 13, 2011

Remote Control for Your Production Site

Have you ever found yourself rushing to the office data center in the middle of the night just because a critical server is down, and you don't have any means to remotely access the server's console or reset the power? How much time and money have you spent on the line with data center technicians trying to recover critical pieces of your production infrastructure located at a remote colocation facility?

This post will explain which remote management options are available to you, and how to organize remote management of your critical equipment.

Saturday, October 8, 2011

Have a problem managing your tasks? Read this post!

If you are lucky enough to handle all your tasks and requests in time, without a need to manage and prioritize a backlog, then you probably don't need to read this post. If you are a manager and all your direct reports are excellent at managing their task lists and setting the correct task priorities by themselves, then you can probably skip this post too. The rest of you should keep reading.

This is the situation:
  • You have a team of bright engineers
  • You have many requests coming from different sources: monitoring alerts, customer support tickets, internal change requests, deployment requests for new software releases, corporate IT requests (if you still wear the hat), internal requests from different departments like sales, marketing, etc.

Sunday, October 2, 2011

My monitoring tools set

Now it is time to talk about my set of monitoring tools suitable for monitoring medium and large systems.

Nagios monitoring service
I just love it! See my other article for a list of Nagios plugins suitable for monitoring most of the popular services. You can easily write your own plugins, and this is what I did to monitor many application-specific services.
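For readers who have not written one, a Nagios plugin is just a program that prints a one-line status message and exits with code 0 (OK), 1 (WARNING), 2 (CRITICAL) or 3 (UNKNOWN). The following is a minimal sketch of the threshold logic only; the function name, label and thresholds are hypothetical and are not taken from my own plugins.

```python
# Standard Nagios plugin exit codes.
OK, WARNING, CRITICAL, UNKNOWN = 0, 1, 2, 3


def check_threshold(value, warn, crit, label="usage"):
    """Return (exit_code, status_line) for a percentage-style metric."""
    if value is None:
        return UNKNOWN, f"UNKNOWN - {label} could not be measured"
    if value >= crit:
        return CRITICAL, f"CRITICAL - {label} is {value}%"
    if value >= warn:
        return WARNING, f"WARNING - {label} is {value}%"
    return OK, f"OK - {label} is {value}%"


# Example: 85% usage with hypothetical thresholds of 80% (warning) and 90% (critical).
code, message = check_threshold(85, warn=80, crit=90)
# A real plugin would now print(message) and call sys.exit(code).
```

Keeping the threshold logic in a function that returns the code, instead of exiting directly, makes the check easy to test outside Nagios.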

Cacti graphing service
If you have the Nagiosgraph module deployed in your Nagios monitoring system you may think that Cacti is redundant. But it is not, especially when you start to use the many nice Cacti plugins and templates available for MySQL, Tomcat, Cisco hardware, disk I/O and others.

Friday, September 30, 2011

Handling of external monitoring alerts

If you have an Internet-facing production system it is always wise to use an external web availability and/or performance monitoring service like Pingdom or Gomez/Keynote/Catchpoint to monitor the exposed resources. All the services support email alerting, and some even provide SMS alerting. As I already explained here, I recommend using SMS alerting for critical monitoring events, but managing individual SMS alerting configurations in each external monitoring system you use does not scale.

Sunday, September 25, 2011

Why you need an Operations Change Log

This post will explain why you need to maintain an Operations Change Log, and will describe a simple and effective method for doing so.

The goal of an Operations Change Log is to make it possible to trace back, at a high level, all changes performed in the production environment. Change Log information is normally used during troubleshooting and post-mortem analysis.

Friday, September 23, 2011

Email and SMS alerting policy

In a regular production environment there are normally three types of monitoring events and corresponding notification methods:
  1. Critical alerts requiring immediate attention; for example, a host is down, web service is not responding, disk is almost full, application response time is high, etc. Critical alerts should be dispatched to the responsible on-call engineers and the escalation list via email and SMS
  2. Non-critical alerts which can be handled later; for example, alerts about problems with staging/pre-production servers, newly configured monitoring checks which are still being fine-tuned, etc. The alerts should be dispatched by email only
  3. Informative events like expected high server CPU usage during peak time or special events. The alerts may appear in the monitoring system's web interface (for those who are watching the screen), but should not be dispatched by email or SMS (or require any acknowledgment)
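The policy above boils down to a small mapping from event type to notification channels. Here is a minimal sketch of that mapping; the severity labels and the function name are hypothetical, and real monitoring systems express the same idea in their own configuration syntax.

```python
# Notification channels per event type, following the three-level policy above.
CHANNELS = {
    "critical": ("email", "sms"),   # on-call engineers and escalation list
    "non-critical": ("email",),     # can be handled later
    "informative": (),              # visible in the web interface only
}


def notification_channels(severity):
    """Return the channels an event of the given severity is dispatched to."""
    try:
        return CHANNELS[severity]
    except KeyError:
        raise ValueError(f"unknown severity: {severity!r}") from None


channels = notification_channels("critical")
```

Rejecting unknown severities explicitly avoids silently dropping an alert because of a mistyped label.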

Thursday, September 22, 2011

Assigning DNS names to IP addresses

Why is it so important to configure proper DNS names for all IP addresses in use? A few reasons:
  • To let people remember names rather than IP addresses
  • To simplify the communication with customers during a troubleshooting session
  • To easily detect and trace possible security events recorded in log files
  • To simplify the reading of ping and traceroute results
Or in one sentence - to simplify the troubleshooting process and system management.

Monday, September 12, 2011

My list of favorite Nagios check scripts

Nagios is a great monitoring tool - I have used it to monitor networks with hundreds of hosts and thousands of service checks. One of the biggest advantages of Nagios is the wide set of service check scripts available for public use, and the following two sites provide a great collection of the scripts:

What is confusing about Nagios is that there are a lot of different scripts from different authors doing more or less the same thing (checking the same stuff), and a system administrator deploying a new instance of Nagios has to invest a lot of time selecting the best scripts from the wide variety of available solutions.

The goal of this post is to share with you my list of favorite Nagios check scripts, so the next time you need to deploy a Nagios instance you can just use this page as a reference for the required check modules.

Saturday, September 3, 2011

Equipment naming convention

Why is it so important?
Having a clear and meaningful equipment naming convention will help you to:
  • minimize human errors such as running the right commands on the wrong servers
  • spend less time analyzing monitoring alerts and log messages
  • have fewer problems while automating system management tasks
  • have a more understandable engineering and operational description of your network
  • make your network more manageable and scalable

Monday, August 29, 2011

How to start?

Some people ask me how to start building a production system and make sure that it will be reliable, scalable and manageable once it outgrows two cabinets in one location (which is roughly my definition of a small system).

Start by breaking the large task down into small pieces. The following are some examples of "small pieces":
  • Select the OS platform and distribution to go with (if your R&D leaves you a choice)
  • Select the hardware platform

Operations requirements for in-house R&D products

This post will help you define specific requirements from Operations to R&D for all in-house software provided for production deployment, if you have to deal with in-house R&D products.

Once the requirements are modified to suit the specific environment, they should be discussed with your company's R&D manager, and approved by both parties. From that moment on, all new software products should comply with the standard, and a strict timetable should be defined for fixing all existing products (expect this to take some time).

Packaging and package names
All in-house software should be packaged using the standard packaging format supported by the operating system in use: RPM for RHEL/CentOS, DEB for Debian-like systems, etc.

Thursday, June 23, 2011

Knowledge and information management

Knowledge and information management is one of the building blocks in an effective Operations department. It is really important to get a content management system in place before starting to build a production system. If you DO NOT do this, you and your team will suffer from the following:
  • equipment inventory details will be stored only in your Purchase Orders, most likely kept in your mailbox
  • low level design details can be found only in emails asking to perform cable wiring, or will be lost altogether
  • many versions of system design documents will be managed in emails or in locally stored documents, making it almost impossible to track down and identify the latest versions
  • the documentation for managed systems will simply not exist since there is no place to store it
  • and many other problems related to the lack of a central place to store and manage all required information about the system