Thursday, June 28, 2012

The art of crontab jobs monitoring, part II

In part I of the discussion I outlined some basic principles of crontab job monitoring. One of the biggest disadvantages of the approach described there is the lack of a centralized control panel (console) that shows the current status of all configured periodic jobs.

The following is a description of a more advanced crontab jobs monitoring system which will provide a status of:
  • successfully and timely executed jobs
  • failed jobs
  • delayed or missed jobs
The crontab jobs monitoring system is composed of the following components:
  • Crontab scripts reporting the execution status via the standard syslog facility
  • Centralized syslog server that collects the logs from all servers in one central location
  • Simple Event Correlator (http://simple-evcorr.sourceforge.net/) log monitoring software
  • Nagios/Zabbix monitoring system configured to accept passive check results


Crontab scripts
The monitored crontab scripts should be designed to:
  • Write the execution status (success/failure) to the local syslog facility
  • Provide the script name in all logged messages
For example, a script performing data sync to a remote server may produce the following syslog messages:
  • In case of success:
Jun 10 11:35:25 DB01-LGA01 sync-backup-files[14837]: Successfully completed data sync to server DB01-SFO01
  • In case of failure:
May 12 21:34:54 SOLR01-LGA01 run-daily-report[5332]: Failed to complete the report generation; reason: missing source log files
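
As a minimal sketch of these two requirements, assuming the job itself is written in Python, a small wrapper can run the work and write exactly one status line to the local syslog. The names `setup_logging` and `run_and_report`, the choice of the cron facility, and the message wording are my own illustrations, not part of any existing tool.

```python
import syslog


def setup_logging(script_name):
    """Tag every message with the script name and PID, e.g. 'run-daily-report[5332]:'."""
    # Assumption: the cron facility is used; any facility that is forwarded
    # to the central syslog server would work just as well.
    syslog.openlog(ident=script_name, logoption=syslog.LOG_PID,
                   facility=syslog.LOG_CRON)


def run_and_report(task, job, log=syslog.syslog):
    """Run job() and log one success or failure line; return True on success.

    task -- short description used in the message, e.g. "the report generation"
    job  -- callable that does the work and raises an exception on failure
    log  -- function taking (priority, message); defaults to the local syslog
    """
    try:
        job()
    except Exception as exc:
        log(syslog.LOG_ERR, "Failed to complete %s; reason: %s" % (task, exc))
        return False
    log(syslog.LOG_INFO, "Successfully completed %s" % task)
    return True
```

A script would call `setup_logging("run-daily-report")` once at start-up and then `run_and_report("the report generation", generate_report)`, where `generate_report` stands for whatever function does the real work.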

Other components of the described monitoring system will catch and analyze the messages.

Centralized syslog server
A centralized syslog server will receive syslog messages from all relevant servers in the network and store them in a local log file. This central log file is the source of information for another component (SEC), which will watch for the presence or absence of specific log messages and act accordingly.

The two most popular Linux syslog daemons, rsyslog and syslog-ng, both have good support for remote logging. This article describes how to configure remote logging for rsyslog, and this one covers syslog-ng.

Simple Event Correlator software
The SEC log monitoring software should run on the central syslog server and be configured to monitor the following events in the central log file:
  • Successful completion or failure of a specific job on a specific server within a defined time frame. For example, if server SOLR01-LGA01 runs the "run-daily-report" job every day at 09:00 and the job normally takes 10 minutes to complete, SEC should look for the message "Successfully completed the report generation process" from the "run-daily-report" process within a 15-minute window after 09:00 (10 minutes of run time plus 5 minutes of headroom). If the message arrives in time, SEC should send an UP alert to the central monitoring console, such as Nagios or Zabbix. A second rule should watch for the "Failed to complete the report generation" message and send a DOWN alert when it appears
  • Missing log records about the completion status of a crontab job (success or failure). The SEC rule should be activated at the job's scheduled start time, and if no status report is received within the expected time frame, the rule should send an UNKNOWN alert to the central monitoring console
A detailed description, with examples, of how to monitor for missing log events is provided in section 3.4 of the document at http://www.cs.umb.edu/~rouilj/sec/sec_paper_full.pdf.
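
To make the decision logic of these rules concrete, here is a small Python illustration. It is not SEC rule syntax and not how SEC works internally. It only shows the UP / DOWN / UNKNOWN classification for one job and one time window, and it assumes log lines in the traditional syslog format shown earlier. The function name `job_status` and the regular expression are my own.

```python
import re
from datetime import datetime, timedelta

# Traditional syslog line: "May 12 21:34:54 HOST tag[pid]: message"
LINE_RE = re.compile(
    r"^(?P<ts>\w{3}\s+\d{1,2} \d{2}:\d{2}:\d{2}) (?P<host>\S+) "
    r"(?P<tag>[^\[\s:]+)(?:\[\d+\])?: (?P<msg>.*)$"
)


def job_status(lines, host, job, start, window):
    """Classify one job run from the log lines seen in [start, start + window]."""
    deadline = start + window
    for line in lines:
        m = LINE_RE.match(line)
        if not m or m["host"] != host or m["tag"] != job:
            continue
        # Syslog timestamps carry no year, so borrow it from the window start.
        ts = datetime.strptime("%d %s" % (start.year, m["ts"]),
                               "%Y %b %d %H:%M:%S")
        if not (start <= ts <= deadline):
            continue
        if m["msg"].startswith("Successfully completed"):
            return "UP"
        if m["msg"].startswith("Failed to complete"):
            return "DOWN"
    # Neither message arrived in time: the job is delayed or was missed.
    return "UNKNOWN"
```

For the daily report example, `start` would be 09:00 on the day in question and `window` would be `timedelta(minutes=15)`.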


Nagios/Zabbix monitoring system

The Nagios monitoring system can provide a central reporting panel for the status of all monitored crontab jobs. For example, to monitor the "run-daily-report" job on the SOLR01-LGA01 server, configure a passive check called "run-daily-report job status" and have the SEC component submit the check results through the Nagios passive check submission mechanism.
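
One way to submit such a result is to write a PROCESS_SERVICE_CHECK_RESULT line to the Nagios external command file. The sketch below assumes the UP / DOWN / UNKNOWN alerts are mapped to the Nagios return codes 0 (OK), 2 (CRITICAL) and 3 (UNKNOWN). The function names and the default command file path are assumptions; the path differs between installations.

```python
import time

# Assumed mapping from the alerts used in this post to Nagios return codes.
NAGIOS_CODES = {"UP": 0, "DOWN": 2, "UNKNOWN": 3}


def passive_result_line(host, service, status, output, now=None):
    """Build one external command line for a passive service check result."""
    ts = int(time.time() if now is None else now)
    return "[%d] PROCESS_SERVICE_CHECK_RESULT;%s;%s;%d;%s\n" % (
        ts, host, service, NAGIOS_CODES[status], output)


def submit(line, command_file="/usr/local/nagios/var/rw/nagios.cmd"):
    """Write the line to the Nagios command file (normally a named pipe)."""
    # The default path is an assumption; check the command_file setting
    # in nagios.cfg for the real location.
    with open(command_file, "w") as fh:
        fh.write(line)
```

SEC could run a small script built around these two functions from a rule action, passing the host name, the service description "run-daily-report job status" and the alert type.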

Zabbix provides a similar facility: the "zabbix_sender" command can be used to submit check results to a Zabbix server.
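
For Zabbix, a comparable sketch can call the sender utility shipped with Zabbix (the binary is named zabbix_sender) with its -z (server), -s (host), -k (item key) and -o (value) options. The function names and the example item key are hypothetical, and the item has to exist on the Zabbix server as a "Zabbix trapper" item before values are accepted.

```python
import subprocess


def zabbix_sender_args(server, host, key, value):
    """Build the zabbix_sender command line for a single value."""
    return ["zabbix_sender", "-z", server, "-s", host, "-k", key,
            "-o", str(value)]


def send_to_zabbix(server, host, key, value):
    """Run zabbix_sender; return True if it exited with status 0."""
    result = subprocess.run(zabbix_sender_args(server, host, key, value),
                            capture_output=True, text=True)
    return result.returncode == 0
```

For example, `send_to_zabbix("zabbix.example.com", "SOLR01-LGA01", "cron.job.status", 0)` would report a successful run, assuming an item with the made-up key `cron.job.status` is defined for that host.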


Comments:

  1. Hi Victor. Thanks for the blog post. For the less technically inclined, or those who want something they can set up in a few minutes, I suggest they look at PushMon. PushMon will alert you of failed, delayed or missed jobs. All you need to do is create a PushMon URL and call this URL when your cron job successfully completes. The only requirement is internet connectivity so you can call the URL. Integration with Nagios passive checks can be implemented via the URL alert option.
