Saturday, May 26, 2012

The art of application logging

There are three frequently met problems with logging functionality in in-house software:
  • Application developers don't really care about proper application logging
  • Development managers cannot really define coding standards for application logging, or in the worst case - they just don't care about logging at all
  • Operations guys cannot troubleshoot in-house software without simple and human-readable logs
The goal of the post is to formulate in a single document all critical points about application logging as it is seen from the Operations point of view.
Actually, there no need to re-invent the wheel - it is enough to see how many Linux applications are managing their logging functionality, and implement the same for in-house software. 

In this blog post I'll cover the following topics:
  • Where to log
  • What to log and which event severity to use
  • Which logging format to use
  • How to rotate the log files

Where to log
There are two primary destinations for application logs:
  1. Syslog subsystem. Syslog messages from many distributed servers can be easily aggregated in a single location and monitored using popular tools like logcheck and SEC.
  2. Local file; an example of standard for location of local log files is provided in this blog post, section "Logging"

You can use the following simple rules to select which logging destination to use in your application:
  • application activity related events should be logged using syslog subsystem specifying one of seven available "local" facilities. Examples of application activity events are application start/stop points, execution of major functional block, notifications about recoverable and unrecoverable events, etc.
  • during application development stage some programmers may find more convenient to log application events to local log files (instead or in addition to the syslog subsystem), and this is pretty logical once the application has a configuration option which switches the logging between local files and syslog
  • high volume audit/billing events like customer API or HTTP requests should be logged in a local log file which later may be processed by another billing/reporting application

What to log and which event severity to use

There are five widely used logging severity levels, and they are enough to cover almost all logging requirements in an average software product:
  1. DEBUG
  2. INFO
  4. ERROR

DEBUG means that the information is primary designated to software developers and is not useful for Operations. Examples of messages with DEBUG severity are function entry/exit points, status of variables, configuration details, etc.

INFO means regular informative details about the software activity. INFO events normally do not require any action from the Operations team. Examples are application start/stop points, software versions, entry/exit points for main procedures, major application events like configuration reload, access to major application functions, user sign in/up, etc.

WARNING means that something suspicious is happening in the system, but it can continue working without affecting the system functionality. Examples of WARNING messages are reports about increased response time of internal and external application components, wrong user input (like access credentials), drop of messages on recoverable connections, etc.

ERROR means that something really wrong is happening in the system and immediate attention is required, but the software can continue to run (even in functionality degraded mode). Examples are database or network errors resulting in service outage (but the application's re-try mechanism should try to restore the dropped connections), out of free disk space or memory, etc.

CRITICAL means that the application cannot function and will be terminated. Examples are missing application components or configuration files, low system resources on application startup, wrong configuration, etc.

Each application should have a configuration option which defines the minimum severity of logged messages (it is OK to provide only three options: DEBUG, INFO and WARNING). In large applications a separate logging severity switch can be used for all major application components (like Apache Tomcat is doing).

It is also recommended to implement a possibility for the operator to change application minimum severity level "on-the-fly" by sending Unix signals to the application. For example, signal USR1 may increase the logging level, and USR2 - decrease it.

Which logging format to use

Defining an R&D-wide standard for log messages format is not a difficult task, and it will help a lot to both parties (R&D and Operations) to easily read and understand logged messages.

I recommended to use the following format for log messages:

Month Day Time Server_Name Process_Name[Process_ID]: Message

For example:
May 26 14:35:28 ACCEL01-EC2WC DataStore[15391]: Completed data store scan process, total objects scanned 14336233, process run time 12 seconds.

For syslog messages the syslog subsystem will take care to put the current time stamp, and the rest of the parameters can be specified as parameters to the logging function.

Automatic rotation of log files

To prevent the disk to run out of free space the generated log files should be rotated on daily or weekly basis.

With syslog logging the system administrators will take care about log rotation (normally using standard and widely popular tool like logrotate).

For local logging the same logrotate tool can be used for log rotation, and it is recommended that R&D will supply custom logrotate configuration file (packaged with the released software) placed in local /etc/logrotate.d/ directory - see this link for some details about how to prepare a logrotate configuration file.

Some general application logging recommendations
  • In INFO logging mode I recommend to print out execution timing and statistics details for all major functions, and in DEBUG mode - for all functions. This information will be extremely useful while troubleshooting application performance issues
  • It is important to log the fact of execution and performance data for all internal processes (threads) initiated by the application - this will help a lot while troubleshooting strange application behavior, hiccups, timeouts, etc

This post does not cover all aspects of application logging, but the provided information should be enough for the most of R&D teams to design and implement a simple and useful logging service.

1 comment: