This post describes a more advanced monitoring system for crontab jobs, which reports the status of:
- successfully and timely executed jobs
- failed jobs
- delayed or missed jobs
The system consists of the following components:
- Crontab scripts reporting the execution status via the standard syslog facility
- Centralized syslog server used to ship the log to a central location
- Simple Event Correlator (http://simple-evcorr.sourceforge.net/) log monitoring software
- Nagios/Zabbix monitoring system configured to accept passive check results
Crontab scripts
The monitored crontab scripts should be designed to:
- Write the execution status (success/failure) to the local syslog facility
- Provide the script name in all logged messages
- In case of success:
- In case of failure:
Other components of the described monitoring system will catch and analyze the messages.
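The logging requirements above can be sketched in Python using the standard `syslog` module. This is only a minimal illustration: the `job=... status=...` message layout, the `LOG_USER` facility, and the function names are assumptions made for the example, not something the monitoring system prescribes.

```python
import syslog


def build_message(script_name: str, success: bool, detail: str) -> str:
    """Build a one-line status message that always names the script.

    The "job=... status=..." layout is a hypothetical convention.
    """
    status = "success" if success else "failure"
    return f"job={script_name} status={status} {detail}"


def report_status(script_name: str, success: bool, detail: str) -> None:
    """Send the status message to the local syslog daemon."""
    # The ident is prepended to every message, so the script name
    # also appears as the program tag of each logged line.
    syslog.openlog(ident=script_name, facility=syslog.LOG_USER)
    priority = syslog.LOG_INFO if success else syslog.LOG_ERR
    syslog.syslog(priority, build_message(script_name, success, detail))
    syslog.closelog()
```

A cron script would call `report_status("run-daily-report", True, "Successfully completed the report generation process")` as its last step, and the failure variant from its error handler.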
Centralized syslog server
A centralized syslog server will receive syslog messages from all relevant servers in the network and store them in a local log file. This central log file will be the source of information for another component (SEC), which will watch for the presence or absence of specific log messages and act accordingly.
The two most popular Linux syslog daemons, rsyslog and syslog-ng, both have good support for remote logging. This article describes how to configure remote logging for rsyslog, and this one covers syslog-ng.
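To illustrate what a consumer of the central log file has to work with, here is a small Python sketch that splits one line into host, program, and message. It assumes the traditional syslog file layout ("Mon DD HH:MM:SS host program[pid]: message"). The actual layout depends on how the syslog daemon is configured, and the names `SyslogRecord` and `parse_line` are hypothetical.

```python
import re
from typing import NamedTuple, Optional


class SyslogRecord(NamedTuple):
    timestamp: str
    host: str
    program: str
    message: str


# Assumed layout: "Mon DD HH:MM:SS host program[pid]: message"
LINE_RE = re.compile(
    r"^(?P<timestamp>\w{3}\s+\d{1,2}\s\d{2}:\d{2}:\d{2})\s"
    r"(?P<host>\S+)\s"
    r"(?P<program>[^\s\[:]+)(?:\[\d+\])?:\s"
    r"(?P<message>.*)$"
)


def parse_line(line: str) -> Optional[SyslogRecord]:
    """Return the parsed record, or None if the line does not match."""
    match = LINE_RE.match(line.rstrip("\n"))
    if match is None:
        return None
    return SyslogRecord(**match.groupdict())
```

The host and program fields are what allow a rule to be tied to a specific job on a specific server.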
Simple Event Correlator software
SEC log monitoring software should be running on the central syslog server and configured to monitor the following events in the central log file:
- Successful completion or failure of a specific job on a specific server within a defined time frame. For example, suppose server SOLR01-LGA01 runs the "run-daily-report" job every day at 09:00 and the job normally takes 10 minutes to complete. SEC should then look for the message "Successfully completed the report generation process" from the "run-daily-report" process within 15 minutes after 09:00 (10 minutes of run time plus 5 minutes of headroom). If the message arrives in time, SEC should send an UP alert to the central monitoring console, such as Nagios or Zabbix. A second rule should watch for the message "Failed to complete the report generation" and send a DOWN alert when it appears.
- Missing log records about the completion status of a crontab job (success or failure). The SEC rule should be activated at the crontab job's start time, and if no status report has been received within the expected time frame, the rule should send an UNKNOWN alert to the central monitoring console.
A detailed description (with examples) of how to monitor for missing log events is provided in section 3.4 of the document at http://www.cs.umb.edu/~rouilj/sec/sec_paper_full.pdf.
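The decision logic behind these two kinds of rules can be sketched in Python. This is not SEC rule syntax, only an illustration of the UP/DOWN/UNKNOWN outcome for one job run. It assumes the log events have already been parsed into (time, host, program, message) tuples, and the function name `job_status` is hypothetical.

```python
from datetime import datetime, timedelta
from typing import Iterable, Tuple

# (time, host, program, message) as read from the central log file
Event = Tuple[datetime, str, str, str]


def job_status(
    events: Iterable[Event],
    host: str,
    job: str,
    start: datetime,
    window: timedelta,
    success_text: str,
    failure_text: str,
) -> str:
    """Classify one scheduled run as UP, DOWN, or UNKNOWN."""
    deadline = start + window
    for when, ev_host, ev_program, message in events:
        if ev_host != host or ev_program != job:
            continue  # message from another server or another job
        if not (start <= when <= deadline):
            continue  # outside the expected time frame
        if success_text in message:
            return "UP"
        if failure_text in message:
            return "DOWN"
    # Neither message arrived in time: the job is delayed or missed.
    return "UNKNOWN"
```

For the "run-daily-report" example, `start` would be 09:00 and `window` would be `timedelta(minutes=15)`.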
Nagios/Zabbix monitoring system
The Nagios monitoring system can be used to provide a central reporting panel for the statuses of all monitored crontab jobs. For example, to monitor the status of the "run-daily-report" job on the SOLR01-LGA01 server, configure a passive check called "run-daily-report job status" and submit the check status from the SEC component using the Nagios passive check submission mechanism.
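As a rough sketch of the submission step, the Python below builds a `PROCESS_SERVICE_CHECK_RESULT` external command line and writes it to the Nagios command file. The command file path shown is a common default for source installs and is an assumption here, as are the helper names and the mapping of UP/DOWN/UNKNOWN onto the Nagios service states 0 (OK), 2 (CRITICAL), and 3 (UNKNOWN).

```python
import time
from typing import Optional

# Assumed mapping from the alerts described above to Nagios return codes.
NAGIOS_CODES = {"UP": 0, "DOWN": 2, "UNKNOWN": 3}


def passive_check_line(
    host: str,
    service: str,
    status: str,
    output: str,
    now: Optional[float] = None,
) -> str:
    """Format one passive service check result as an external command."""
    timestamp = int(time.time() if now is None else now)
    code = NAGIOS_CODES[status]
    return (
        f"[{timestamp}] PROCESS_SERVICE_CHECK_RESULT;"
        f"{host};{service};{code};{output}\n"
    )


def submit_passive_check(
    line: str,
    command_file: str = "/usr/local/nagios/var/rw/nagios.cmd",  # assumed path
) -> None:
    """Write the command line to the Nagios external command file."""
    with open(command_file, "w") as handle:
        handle.write(line)
```

SEC could invoke a small script built on these helpers from a rule's action, passing the host, the service description "run-daily-report job status", and the detected status.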
Zabbix provides a similar facility: the "zabbix_sender" command can be used to submit passive check results to a Zabbix server.
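For the Zabbix variant, a minimal Python sketch can wrap the `zabbix_sender` command-line utility. It assumes the utility is installed on the syslog server and that a trapper item with the given key exists on the Zabbix side. The server name and item key used in the usage note are hypothetical.

```python
import subprocess
from typing import List


def zabbix_sender_args(server: str, host: str, key: str, value: str) -> List[str]:
    """Build the zabbix_sender command line for a single value."""
    return [
        "zabbix_sender",
        "-z", server,  # Zabbix server to send to
        "-s", host,    # host name as registered in Zabbix
        "-k", key,     # item key (a trapper item is assumed)
        "-o", value,   # value to submit
    ]


def send_to_zabbix(server: str, host: str, key: str, value: str) -> bool:
    """Run zabbix_sender and report whether it exited successfully.

    Raises FileNotFoundError if zabbix_sender is not installed.
    """
    result = subprocess.run(
        zabbix_sender_args(server, host, key, value),
        capture_output=True,
        text=True,
    )
    return result.returncode == 0
```

A call such as `send_to_zabbix("zabbix.example.com", "SOLR01-LGA01", "cron.job.status[run-daily-report]", "UP")` would then play the same role as the Nagios passive check submission.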
Hi Victor. Thanks for the blog post. For the less technically inclined, or those who want something they can set up in a few minutes, I suggest they look at PushMon. PushMon will alert you to failed, delayed or missed jobs. All you need to do is create a PushMon URL and call this URL when your cron job completes successfully. The only requirement is internet connectivity so you can call the URL. Integration with Nagios passive checks can be implemented via the URL alert option.