Sunday, September 25, 2011

Why you need an Operations Change Log

This post explains why you need to maintain an Operations Change Log and describes a simple, effective method for doing so.

The goal of an Operations Change Log is to make it possible to trace, at a high level, every change made to the production environment. Change log information is typically used during troubleshooting and post-mortem analysis.

A Change Log should normally cover all changes in the production environment, including but not limited to:
  • Custom software installations and upgrades
  • OS and third-party package installations and upgrades
  • Changes in any software or hardware component configuration
  • Changes in network configuration and topology
  • Changes in firewall configuration and rules
  • Changes in traffic load balancing configuration
  • Installation of new server and network equipment
  • Decommissioning of obsolete equipment

A Change Log should NOT include transient changes in the system configuration or behavior; for example:
  • Temporary network/system failures and the workarounds applied for them
  • Traffic spikes
  • Performance and functionality issues with the service
  • Customer issues

Ideally, your operations team should also maintain an event log that registers every event happening in the system, including the transient issues listed above. In companies without a dedicated 24x7 NOC, however, such a procedure is very difficult to implement.

How to manage an Operations Change Log
A change log can be kept in a shared Word document, on a page of the corporate wiki, or even in email. However, these methods are difficult to control and do not scale well.

I have extensive experience managing an Operations Change Log as per-month tickets in a separate RT (Request Tracker) queue named, for example, "ChangeLog".

The following are some rules that should be applied when managing a change log in RT:
  • Each ticket's subject should refer to the month whose changes it records. For example, the ticket for September 2011 can be named "Change Log for September 2011", and all changes made in September 2011 should be added to it as replies or comments
  • Since RT already records the date and author of every message added to a ticket, in most cases it is not necessary to repeat this information in the record body. However, when a change was performed long before its log entry is created, or was not performed by the person recording it, the information should be included in the change log message. For example: "Jeff upgraded XYZ application to version 2.3.4 on server APP01-SANJ01 yesterday 20/06/2011 at 12:30"
  • Each change log entry should include at least the following information:
    • Who performed the change
    • When the change was performed
    • Which server/network equipment was involved in the change
    • Which software/OS component was affected by the change
    • For software installations/upgrades/rollbacks - what is the new version of software running on the server
    • For significant changes performed using an NMR procedure - a reference to the NMR document
  • All change log records should be registered as soon as the change has been confirmed to be stable, and no later than the end of the day (EOD)
  • A rollback of an already recorded change should also be documented in the change log as a new change log record
  • All members of the operations team should be configured as "watchers" of the change log queue, so that all of them automatically receive email notifications when there is a new record in the log
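The rules above can be sketched in code. The following is only a minimal illustration in Python, not part of RT or of any RT client library: the names `ChangeEntry`, `ticket_subject`, and `to_message` are hypothetical, and the message layout is an assumption. It builds the monthly ticket subject and checks that an entry carries the required fields before it is posted as a reply or comment.

```python
from dataclasses import dataclass
from datetime import datetime
from typing import Optional

# Month names are spelled out so the subject does not depend on the locale.
MONTHS = ("January", "February", "March", "April", "May", "June", "July",
          "August", "September", "October", "November", "December")


def ticket_subject(when: datetime) -> str:
    """Subject of the monthly ticket that collects all changes for that month."""
    return f"Change Log for {MONTHS[when.month - 1]} {when.year}"


@dataclass(frozen=True)
class ChangeEntry:
    """One change log record (hypothetical structure, field names are assumptions)."""
    performed_by: str                    # who performed the change
    performed_at: datetime               # when the change was performed
    host: str                            # server/network equipment involved
    component: str                       # software/OS component affected
    description: str                     # free-text summary of the change
    new_version: Optional[str] = None    # for installations/upgrades/rollbacks
    nmr_reference: Optional[str] = None  # for changes made under an NMR procedure

    def __post_init__(self) -> None:
        # Reject entries that leave a required text field blank.
        for name in ("performed_by", "host", "component", "description"):
            if not getattr(self, name).strip():
                raise ValueError(f"{name} must not be empty")

    def to_message(self) -> str:
        """Render the entry as the body of a ticket reply or comment."""
        lines = [
            self.description,
            f"Who: {self.performed_by}",
            f"When: {self.performed_at:%d/%m/%Y %H:%M}",
            f"Host: {self.host}",
            f"Component: {self.component}",
        ]
        if self.new_version:
            lines.append(f"New version: {self.new_version}")
        if self.nmr_reference:
            lines.append(f"NMR: {self.nmr_reference}")
        return "\n".join(lines)


entry = ChangeEntry(
    performed_by="Jeff",
    performed_at=datetime(2011, 6, 20, 12, 30),
    host="APP01-SANJ01",
    component="XYZ application",
    description="Upgraded XYZ application to version 2.3.4",
    new_version="2.3.4",
)
print(ticket_subject(entry.performed_at))
print(entry.to_message())
```

Keeping the who/when fields explicit in the structure covers the case where the person recording the change is not the one who performed it; when they are the same, the rendered lines simply duplicate what RT records on its own.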

The following are some examples of change log entries:
  • Solr application running on server SOLR02-ASHB01 has been upgraded to version 2.2.0
  • New Hadoop slave node HADOOPDN09-LOND01 has been added to the production cluster (Hadoop software version CDH3u1)
  • HAProxy software running on server LRT01-NYNY01 has been upgraded to version 1.4.3 due to a critical bug in the load balancing mechanism
  • Checkpoint timeout in the Secondary NameNode configuration on HADOOPNN02-NYNY01 has been decreased from 1 hour to 10 minutes
