We now return you...

By Hal Stern

Many companies rely on pushing batch jobs through a fixed time window and are caught short when tasks have not completed. When everything runs as expected, managers and staff take pride in their ability to squeeze every cycle out of their batch resource pool. A single failure, however, can lead to a processing pileup. If your company is depending on a seven-hour nightly "sweep," and the host server crashes, there's no way you'll meet the deadline without violating a few laws of physics.

You need an insurance plan for critical, long-running jobs that lets you return to a previously scheduled program after a failure. The ability to restart a process midstream, known as checkpointing or restart/resume, is not a new idea. Implementing a checkpoint algorithm, though, requires planning and joint efforts between the sysadmin staff and application developers. We'll take a closer look at the kinds of failures that spring up on the batch freeway and then explore mechanisms for checkpointing Unix processes. Our usual discussion of the advantages and disadvantages of each approach follows, concluding with some suggestions for providing extremely high-quality application service in a transaction-processing environment.

Long train running
Checkpoints make the most sense when the cost of running a process from initialization to the point of its abnormal termination is prohibitively high in terms of the resources used or wall-clock time lost. Consider both components of the restart time: process initialization and the execution that must be retraced. If initialization time is long compared to execution time, checkpointing makes sense for relatively short jobs that are run frequently. Building a new executable at the post-setup phase could trim minutes off of each invocation of the process. This trick is used by popular software, including GNU Emacs.

In addition to the obvious fatal events, such as power failures and system panics, you need to protect against load-dependent problems that are simply impossible to test out of the software. No amount of rigorous testing will cover every combination of transactions, load, data values, and machine environment. Somebody -- namely the system management staff -- has to build the insurance policy for the subtle but equally mortal cases:

You run out of swap space, and your calls to malloc() fail.
Your database log disk fills up, and you can't journal update or insert transactions.
A network resource or a fileserver becomes unavailable, hanging the process.
One of your dynamically linked libraries has a memory corruption problem that you only hit after a few million random trips through the code.
An application begins to loop after digesting unusual input, grinding processing to a halt.

Long-lived processes with large, dynamically created address spaces are the best candidates for checkpointing. As a special case, look for processes that run in stages, utilizing compute cycles off-hours to crank out a few thousand iterations every night for a week. If you want a virtual batch environment that is available outside of normal business hours, you'll need to provide for those jobs that run past the end of the graveyard shift. When weighing the advantages of checkpointing, think about the service the application provides and if a "pause button" is useful. You might want to save the state of a large, resource-intensive process so that you can effectively halt it and release the memory and swap space it was holding from other processes. Simply putting the process in the background doesn't give you back the swap space, and the scheduler will frantically try to page your victim into memory during quiet times.

Checkpoints are not a necessity for service recovery. Starting the entire process over or providing multiple service streams sometimes costs less and is simpler to build. Video-on-demand uses staggered starts so a failure can be handled by jumping to the next feed of the same content. But if the cost of fast-forwarding to the point of failure is high enough to make checkpoints viable for your reliability scheme, it's time to think about developing a scheme for fast restarts.

I'll be back
Unix processes contain significant state information that needs to be recreated when resuming operation from a checkpoint. Stack frames, automatic (local) variables, changes to global data, dynamically allocated data structures, and any filesystem I/O are all transient pieces of information that must be saved (on disk) to rebuild a process. There are several alternatives, ranging from simple logging operations to a complete dump of the process image. We'll look at them in order of increasing complexity.

Tasks that contain little or no global state information can be checkpointed by logging a "placeholder" to indicate your progress through the input data. Bulk updates or inserts into a database or simple data reduction fit the placeholder profile. To implement the checkpoint, write a placeholder to disk every few minutes or after each group of records. For data reduction (averages, summations, or regressions), save the intermediate results along with the counters that indicate the amount of data processed to that point. Application changes should be fairly minimal:

Make sure your main() routine is re-entrant. You can't make assumptions about having a blank slate when you start the process, because you might be reloading state variables from files. Modify the application's initialization routine to look for a valid log file and to retrieve checkpoint data from it when restarting.
Keep track of pointers in databases and flat files. For database work, track a row number for sequential scans or a key value for an indexed parse. Flat-file pointers are harder to determine, because they are incremented automatically when you call read() and write(). Maintain a current offset for each file you manipulate, so you can lseek() back to that location during recovery. You'll have to modify code to update the offset after each operation, but this solves the problem of re-creating file I/O state during recovery -- reset the pointer from the saved value in the checkpoint log.
Add code to write placeholders at logical points in the main processing loop.
Periodically write out the key values, being sure to use synchronous writes to get the data onto disk reliably. If you rely on the Unix default asynchronous write policy, you might crash with inconsistent checkpoint information committed to disk. Open your logging file with the O_SYNC flag, or use the fcntl() system call to turn on synchronous writes.

At restart, a crashed application zips to the end of the log file, reloads some key variables, and picks up near where it left off. This recovery process mimics a database log roll, ideal for repetitive, easily logged transaction-oriented work.
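
The placeholder approach can be sketched roughly as follows for a simple data reduction. The record layout (struct placeholder), the function names, and the log path are illustrative assumptions rather than part of any real application; the sketch also assumes the log is read back on the machine that wrote it.

```c
#include <errno.h>
#include <fcntl.h>
#include <sys/types.h>
#include <unistd.h>

/* Hypothetical checkpoint record for a summation over one input file. */
struct placeholder {
    off_t  input_offset;   /* where to lseek() the input file on restart */
    long   records_done;   /* number of records folded into running_sum */
    double running_sum;    /* intermediate result of the reduction */
};

/* Append one placeholder to the log. O_SYNC makes write() return only
 * after the data has reached the disk. Returns 0 on success, -1 on error. */
int save_placeholder(const char *log_path, const struct placeholder *p)
{
    int fd = open(log_path, O_WRONLY | O_CREAT | O_APPEND | O_SYNC, 0600);
    if (fd < 0)
        return -1;
    ssize_t n = write(fd, p, sizeof(*p));
    close(fd);
    return n == (ssize_t)sizeof(*p) ? 0 : -1;
}

/* Fetch the last complete placeholder in the log.
 * Returns 1 if one was loaded, 0 if there is nothing to resume from,
 * and -1 on error. */
int load_placeholder(const char *log_path, struct placeholder *p)
{
    int fd = open(log_path, O_RDONLY);
    if (fd < 0)
        return errno == ENOENT ? 0 : -1;   /* no log file: fresh start */

    int found = 0;
    off_t size = lseek(fd, 0, SEEK_END);
    /* Integer division ignores a partial record left by a crash mid-write. */
    off_t complete = size < 0 ? 0 : size / (off_t)sizeof(*p);
    if (size < 0) {
        found = -1;
    } else if (complete > 0) {
        off_t last = (complete - 1) * (off_t)sizeof(*p);
        if (lseek(fd, last, SEEK_SET) >= 0 &&
            read(fd, p, sizeof(*p)) == (ssize_t)sizeof(*p))
            found = 1;
        else
            found = -1;
    }
    close(fd);
    return found;
}
```

At startup the application would call load_placeholder(); if it returns 1, it restores records_done and running_sum and calls lseek() on the input file with input_offset before entering the main loop.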

Mind meld, Unix style
If the placeholders sound trivial, you're probably dealing with involved processes that manipulate large address spaces. Most scientific apps iterate functions on large matrices, requiring that a partially cooked version of the data be built into the checkpoint. Save these values to disk using sequences of write() calls:

int a[1024];
int index;

write(checkfd, a, sizeof(a));
write(checkfd, &index, sizeof(index));

Knowing you'll restart the application on the same machine, you don't have to worry about data-formatting issues such as byte ordering. A binary snapshot suffices for the restarted application image.

Save global scalar variables and any key uninitialized data structures. If you don't change the value of initialized data, you can skip those items, but if you've modified them you'll need to save the last known values as well. When building the checkpoint with several write() calls, or splitting data structures out into separate files, use different files for each checkpoint. Separating the recovery data lets you take a crash in the midst of writing a checkpoint and still recover from the last complete disk image.
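
One way to keep a crash in mid-checkpoint from destroying the last good image is to alternate between two files and mark each one complete only at the very end. The sketch below is an assumption-laden illustration, not a prescribed format: the two-slot naming scheme, the sequence number written at both ends of the file, and the function names are all invented for the example, and very large images would need a loop around write() to handle short writes.

```c
#include <fcntl.h>
#include <stdio.h>
#include <sys/types.h>
#include <unistd.h>

#define NSLOTS 2   /* checkpoints alternate between base.0 and base.1 */

/* Write image number seq into slot seq % NSLOTS. The sequence number is
 * written first and last; the trailing copy acts as a commit marker. */
int write_checkpoint(const char *base, unsigned long seq,
                     const void *data, size_t len)
{
    char path[256];
    snprintf(path, sizeof(path), "%s.%lu", base, seq % NSLOTS);
    int fd = open(path, O_WRONLY | O_CREAT | O_TRUNC | O_SYNC, 0600);
    if (fd < 0)
        return -1;
    int ok = write(fd, &seq, sizeof(seq)) == (ssize_t)sizeof(seq)
          && write(fd, data, len) == (ssize_t)len
          && write(fd, &seq, sizeof(seq)) == (ssize_t)sizeof(seq);
    close(fd);
    return ok ? 0 : -1;
}

/* Return 0 and set *seq if the slot holds a complete image of len bytes. */
static int slot_seq(const char *path, size_t len, unsigned long *seq)
{
    unsigned long head, tail;
    int fd = open(path, O_RDONLY);
    if (fd < 0)
        return -1;
    int ok = read(fd, &head, sizeof(head)) == (ssize_t)sizeof(head)
          && lseek(fd, (off_t)(sizeof(head) + len), SEEK_SET) >= 0
          && read(fd, &tail, sizeof(tail)) == (ssize_t)sizeof(tail)
          && head == tail;
    close(fd);
    if (ok)
        *seq = head;
    return ok ? 0 : -1;
}

/* Load the newest complete image. Returns 0 on success, -1 if no slot
 * holds a complete checkpoint. */
int read_checkpoint(const char *base, void *data, size_t len,
                    unsigned long *seq_out)
{
    char path[256], best[256];
    unsigned long seq, best_seq = 0;
    int have = 0;
    for (int i = 0; i < NSLOTS; i++) {
        snprintf(path, sizeof(path), "%s.%d", base, i);
        if (slot_seq(path, len, &seq) == 0 && (!have || seq > best_seq)) {
            have = 1;
            best_seq = seq;
            snprintf(best, sizeof(best), "%s", path);
        }
    }
    if (!have)
        return -1;
    int fd = open(best, O_RDONLY);
    if (fd < 0)
        return -1;
    int ok = lseek(fd, (off_t)sizeof(seq), SEEK_SET) >= 0
          && read(fd, data, len) == (ssize_t)len;
    close(fd);
    if (ok)
        *seq_out = best_seq;
    return ok ? 0 : -1;
}
```

Because a new checkpoint always overwrites the older of the two slots, the most recent complete image is never the file being rewritten.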

Processes that rely heavily on dynamically allocated data structures or shared-memory regions complicate checkpointing. Global variables are easily saved because they can be located by name, while chunks of memory handed out by malloc() may only be referenced by pointers buried inside of other data structures. To capture portions of your heap in the checkpoint image, walk through dynamically allocated data structures and assign them ordinal numbers so you can rebuild pointer values later on. Once data structures are separated from their original address spaces, pointer values are meaningless, so the ordinal values become a way to identify items and later repair pointer values. When writing the checkpoint, save both the data items and a table cross-referencing pointer values and ordinal numbers so you can reset pointers when reloading data items from the checkpoint.
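
The ordinal-number idea might look something like this for a set of heap nodes that point at one another. The node layout, the on-disk record, and the function names are hypothetical; the sketch assumes the application can already enumerate its nodes into a table, and it uses a linear search where a real program with many nodes would want a hash table.

```c
#include <stdio.h>
#include <stdlib.h>

/* In-memory node: link may point at any other node in the set, or be NULL. */
struct node {
    int value;
    struct node *link;
};

/* On-disk form: the pointer is replaced by the ordinal of its target. */
struct disk_node {
    int  value;
    long link_ordinal;   /* -1 stands for NULL */
};

/* Cross-reference a pointer value to its ordinal in the table. */
static long ordinal_of(struct node *const *table, long n, const struct node *p)
{
    for (long i = 0; i < n; i++)
        if (table[i] == p)
            return i;
    return -1;   /* NULL, or not part of the checkpointed set */
}

/* Write the node count, then one disk_node per table entry. */
int save_nodes(FILE *fp, struct node *const *table, long n)
{
    if (fwrite(&n, sizeof(n), 1, fp) != 1)
        return -1;
    for (long i = 0; i < n; i++) {
        struct disk_node d = { table[i]->value,
                               ordinal_of(table, n, table[i]->link) };
        if (fwrite(&d, sizeof(d), 1, fp) != 1)
            return -1;
    }
    return 0;
}

/* Rebuild the nodes at fresh addresses, then repair every pointer from
 * its saved ordinal. Returns a malloc'ed table of n nodes, or NULL. */
struct node **load_nodes(FILE *fp, long *n_out)
{
    long n;
    if (fread(&n, sizeof(n), 1, fp) != 1 || n < 0)
        return NULL;
    struct node **table = calloc((size_t)n + 1, sizeof(*table));
    long *links = calloc((size_t)n + 1, sizeof(*links));
    if (!table || !links)
        goto fail;
    for (long i = 0; i < n; i++) {
        struct disk_node d;
        if (fread(&d, sizeof(d), 1, fp) != 1)
            goto fail;
        table[i] = malloc(sizeof(struct node));
        if (!table[i])
            goto fail;
        table[i]->value = d.value;
        links[i] = d.link_ordinal;
    }
    for (long i = 0; i < n; i++)
        table[i]->link = (links[i] >= 0 && links[i] < n) ? table[links[i]]
                                                         : NULL;
    free(links);
    *n_out = n;
    return table;
fail:
    if (table)
        for (long i = 0; i < n; i++)
            free(table[i]);
    free(table);
    free(links);
    return NULL;
}
```

The reloaded nodes live at new addresses, but every link points at the same logical node it did before the checkpoint.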

Saving dynamically allocated memory is difficult but necessary: The restarted process will have a nicely compacted address space, having survived a forced garbage collection. If a lack of swap space is contributing to the application failure in the first place, this approach buys you additional execution time by restarting a leaner executable.

The more your application resembles a typical C program (not to mention C++ and its heart of class), the more attractive a brute-force mind meld with the disk becomes. Ideally, you'd like to call a routine that writes out the entire process to disk, in executable format, so you can restart it later just by invoking the preserved process image. Fortunately, Spencer Thomas (of the University of Utah) solved this problem more than a decade ago with his unexec() routine. The code does the inverse of a Unix exec(), namely, it turns an executable back into a disk file. GNU Emacs uses unexec() to create a new executable reflecting all initialization, which can take several minutes with a reasonably large set of macros. Find the unexec() code in unexec.c, part of the Emacs distribution available for ftp from prep.ai.mit.edu.

The default values used by unexec() are reasonable enough to save the entire initialized and uninitialized (BSS) segments of your process, along with the text segment needed to create an executable. You could modify unexec() to save the entire heap and capture dynamically allocated data structures as well, but doing so requires knowing a few low-level details about the process image layout on your Unix variant.

Re-enter at your own risk
The implementation details regarding checkpoints have been left sketchy due to the numerous variations that need to be addressed. There is no single best solution; you have to weigh the complexity of the checkpoint mechanism against the benefits you'll reap from being able to perform CPR on a flat-lined application. Pulling off a Lazarus-type escapade is best done with safety guidelines. Above all, ensure that your checkpoint generates a consistent view of your data landscape. Typically, checkpoints are fired off using an alarm timer:

#define CHECKPT_TIMER 600

/* do_checkpoint() writes out the checkpoint image */
signal(SIGALRM, do_checkpoint);
alarm(CHECKPT_TIMER);

Signals are asynchronous events. You might take a signal while updating data, or between the instructions that modify a single value. You can surround noninterruptible code bits with sigprocmask() calls to hold signals pending, but you pay a high price in additional system calls.

Compute-intensive apps will probably find this intolerable, as most calculations would have to be protected. Alternatively, use the signal handler to set a flag that is checked periodically at points where the application can guarantee a consistent data view, typically at the entrance to a major loop or between blocks of database records. You can also take a more direct path and generate a checkpoint every n iterations, but if loop-execution times vary significantly you'll want to use the timer to meter out checkpoints a bit more evenly.

Safety check
Administrators with mainframe experience grin when the Unix data-center group talks checkpoints, because mainframes have tools that identify looping transactions, abort them, and identify the offending code segment on the fly. In 25 years of Unix, we have gotten better at hardening systems. Reliability and quality are achieved in layers. First ensure that the disks holding checkpoints are redundant, or a major system failure may stop your job and wipe out the checkpoint image. The net result: you are without data and without a prayer of completing the job. Layer checkpoint features above all other reliability and redundancy steps. Integrate them with machine- and system-availability features so you can always go back to a previously scheduled program.

About the Author
Hal Stern is an area technology manager for Sun. He can be reached at hal.stern@sunworld.com.


Seven safety guidelines

Once you've made main() re-entrant, make sure you can tell you are restarting from old state information. You don't want to overlay restored information with new initializations nor start a fresh invocation without proper setup.
Be sure to remove the checkpoint log or datafile when your process exits cleanly. As a safeguard, make sure you can recover from the case where you complete processing but crash before cleaning up the checkpoint data. If initialization recognizes the case where the last transaction in the checkpoint is the last to be executed, it can flag an error or ignore the checkpoint entirely.
Balance the time to write a checkpoint against the time to execute the process. Writing a 500-megabyte datafile may take two minutes or more, so checkpointing a large app every five minutes can cost roughly 30 to 40 percent of your wall-clock time. Get as much processing done as possible in between grand pauses for disk I/O.
Avoid recursive subroutines and the algorithms that depend on them. You lose all stack frames when your process terminates, and the checkpoint recovery starts you off with a brand-new, single-call-deep stack. Similarly, don't rely on the values of automatic (local) variables, because they are put on the stack and vaporize during a crash.
Identify the down sides of reprocessing input records or transactions. You'll repeat some work when you back up to the last checkpoint, so predetermine how to handle repeat transactions. The saved pointers should be consistent with each other. An output pointer should correlate to the input one, so you can safely repeat output steps associated with reprocessing the input records handled between the checkpoint and the process termination.
If possible, don't use absolute filenames, machine names, or TCP/IP port numbers. When the app is restarted, it may bind to different TCP/IP ports. You may also find it convenient to restart a checkpoint on a different machine with more swap space, more memory, and a better set of patches, but with a different naming convention. Accept as much configuration information from the environment as you can to maximize the portability of the resulting checkpoint images.
Force apps to participate in their own management. If you provide a restart mechanism, it's fair to insist on application suicide for poorly behaved or unterminated app transactions. Put "deadman timers" into key loops that should execute only a small, known number of times. Also put counters into critical code that should execute on every transaction. Examine these counters in the signal handler that drives the checkpoint routine or sets the flag, to be sure your app isn't looping or skipping important code chunks. If the counts suggest that an app is spinning its wheels, have the checkpoint mechanism do a "warm boot" and restart the app.
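
The last guideline could be approached along these lines. This sketch reduces the per-transaction counter to a one-bit progress flag (a simplification chosen here so the handler never has to worry about counter overflow) and adds a loop guard for loops with a known iteration bound. All names are hypothetical.

```c
#include <signal.h>

static volatile sig_atomic_t made_progress;   /* set once per transaction */
static volatile sig_atomic_t stalled;         /* set by the watchdog */

/* Call from the critical code executed on every transaction. */
void transaction_completed(void)
{
    made_progress = 1;
}

/* Body of the timer signal handler: if no transaction finished since the
 * previous tick, flag the application as stalled. */
void watchdog_tick(int signo)
{
    (void)signo;
    if (!made_progress)
        stalled = 1;
    made_progress = 0;
}

/* Checked at a safe point; nonzero means it is time for a warm boot. */
int app_is_stalled(void)
{
    return stalled;
}

/* Deadman check for a loop expected to run at most limit times.
 * Returns 1 while the loop is within bounds, 0 once it has overrun. */
int loop_guard(long *iterations, long limit)
{
    return ++*iterations <= limit;
}
```

A key loop would test loop_guard() on each pass and bail out to the restart path when it returns 0, while the main loop would consult app_is_stalled() at the same safe points where it checks for a pending checkpoint.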

URL: http://www.sunworld.com/asm-12-1994/asm-12-sysadmin.html
Last updated: 1 December 1994