Many companies rely on pushing batch jobs through a fixed time window and are caught short if tasks have not completed in time. When everything runs as expected, managers and staff take pride in their ability to squeeze every cycle out of their batch resource pool. A single failure, however, can lead to a processing pileup. If your company is depending on a seven-hour nightly "sweep," and the host server crashes, there's no way you'll meet the deadline without violating a few laws of physics.
You need an insurance plan for critical, long-running jobs that lets you return to a previously scheduled program after a failure. The ability to restart a process midstream, known as checkpointing or restart/resume, is not a new idea. Implementing a checkpoint algorithm, though, requires planning and joint efforts between the sysadmin staff and application developers. We'll take a closer look at the kinds of failures that spring up on the batch freeway and then explore mechanisms for checkpointing Unix processes. Our usual discussion of the advantages and disadvantages of each approach follows, concluding with some suggestions for providing extremely high-quality application service in a transaction-processing environment.
Long train running
Checkpoints make the most sense when the cost of running a process from initialization to the point of its abnormal termination is prohibitively high in terms of the resources used or wall-clock time lost. Consider both components of the restart time: process initialization and the execution that must be retraced. If initialization time is long compared to execution time, checkpointing makes sense for relatively short jobs that are run frequently. Building a new executable at the post-setup phase could trim minutes off of each invocation of the process. This trick is used by popular software, including GNU Emacs.
In addition to the obvious fatal events, such as power failures and system panics, you need to protect against load-dependent problems that are simply impossible to test out of the software. No amount of rigorous testing will cover every combination of transactions, load, data values, and machine environment. Somebody -- namely the system management staff -- has to build the insurance policy for the subtle but equally mortal cases:
Long-lived processes with large, dynamically created address spaces are the best candidates for checkpointing. As a special case, look for processes that run in stages, utilizing compute cycles off-hours to crank out a few thousand iterations every night for a week. If you want a virtual batch environment that is available outside of normal business hours, you'll need to provide for those jobs that run past the end of the graveyard shift. When weighing the advantages of checkpointing, think about the service the application provides and whether a "pause button" would be useful. You might want to save the state of a large, resource-intensive process so that you can effectively halt it and release the memory and swap space it was holding from other processes. Simply putting the process in the background doesn't give you back the swap space, and the scheduler will frantically try to page your victim into memory during quiet times.
Checkpoints are not a necessity for service recovery. Starting the entire process over or providing multiple service streams sometimes offers cost and complexity savings. Video-on-demand uses staggered starts so a failure can be handled by jumping to the next feed of the same content. But if the cost of fast-forwarding to the point of failure makes checkpoints viable for your reliability scheme, it's time to think about developing a scheme for fast restarts.
I'll be back
Unix processes contain significant state information that needs to be recreated when resuming operation from a checkpoint. Stack frames, automatic (local) variables, changes to global data, dynamically allocated data structures, and any filesystem I/O are all transient pieces of information that must be saved (on disk) to rebuild a process. There are several alternatives, ranging from simple logging operations to a complete dump of the process image. We'll look at them in order of increasing complexity.
Tasks that contain little or no global state information can be checkpointed by logging a "placeholder" to indicate your progress through the input data. Bulk updates or inserts into a database or simple data reduction fit the placeholder profile. To implement the checkpoint, write a placeholder to disk every few minutes or after each group of records. For data reduction (averages, summations, or regressions), save the intermediate results along with the counters that indicate the amount of data processed to that point. Application changes should be fairly minimal:
At restart, a crashed application zips to the end of the log file, reloads some key variables, and picks up near where it left off. This recovery process mimics a database log roll, ideal for repetitive, easily logged transaction-oriented work.
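A minimal sketch of the placeholder approach for a data-reduction job follows. The record layout and the names `struct placeholder`, `save_placeholder()`, and `load_placeholder()` are assumptions made for illustration, not code from a particular application.

```c
#include <stdio.h>

/* Hypothetical placeholder record: progress through the input plus the
 * intermediate result needed to resume a summation. */
struct placeholder {
    long records_done;   /* number of input records fully processed */
    double running_sum;  /* intermediate result of the data reduction */
};

/* Append one placeholder to the log file. fclose() flushes the record to
 * the operating system; it does not force it to the disk platter.
 * Returns 0 on success, -1 on failure. */
int save_placeholder(const char *path, const struct placeholder *p)
{
    FILE *fp = fopen(path, "ab");
    if (fp == NULL)
        return -1;
    size_t n = fwrite(p, sizeof(*p), 1, fp);
    if (fclose(fp) != 0 || n != 1)
        return -1;
    return 0;
}

/* At restart, read to the end of the log and keep the last complete record.
 * Returns 0 if a placeholder was found, -1 if the job must start over. */
int load_placeholder(const char *path, struct placeholder *out)
{
    struct placeholder p;
    int found = -1;
    FILE *fp = fopen(path, "rb");
    if (fp == NULL)
        return -1;
    while (fread(&p, sizeof(p), 1, fp) == 1) {
        *out = p;
        found = 0;
    }
    fclose(fp);
    return found;
}
```

The application would call `save_placeholder()` after each group of records and `load_placeholder()` once at startup, skipping ahead by `records_done` input records. A partially written final record is ignored on reload; a hardened version would also trim it before appending again.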
Mind meld, Unix style
If the placeholders sound trivial, you're probably dealing with involved processes that manipulate large address spaces. Most scientific apps iterate functions on large matrices, requiring that a partially cooked version of the data be built into the checkpoint. Save these values to disk using sequences of write() calls:
write(checkfd, a, sizeof(a));
Knowing you'll restart the application on the same machine, you don't have to worry about data-formatting issues such as byte ordering. A binary snapshot suffices for the restarted application image.
write(checkfd, &index, sizeof(index));
write(checkfd, a, sizeof(a));
Save global scalar variables and any key uninitialized data structures. If you don't change the value of initialized data, you can skip those items, but if you've modified them you'll need to save the last known values as well. When building the checkpoint with several write() calls, or splitting data structures out into separate files, use different files for each checkpoint. Separating the recovery data lets you take a crash in the midst of writing a checkpoint and still recover from the last complete disk image.
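One way to keep a crash in mid-checkpoint from ruining your day is to alternate between two files and mark each one complete with a trailer. The sketch below assumes that scheme; the function names, the two-slot file naming, and the sequence-number trailer are all illustrative choices rather than a prescribed format.

```c
#define _POSIX_C_SOURCE 200809L
#include <fcntl.h>
#include <stdio.h>
#include <sys/types.h>
#include <unistd.h>

/* Write sequence number, loop counter, array, then the sequence number again
 * as a trailer. Checkpoint seq goes to "<base>.0" or "<base>.1" in turn, so
 * the previous checkpoint is never overwritten by the one in progress.
 * Returns 0 on success, -1 on any failed or short write. */
int write_checkpoint(const char *base, unsigned seq, int iter,
                     const double *a, size_t count)
{
    char path[256];
    size_t bytes = count * sizeof(*a);
    snprintf(path, sizeof(path), "%s.%u", base, seq % 2);
    int checkfd = open(path, O_WRONLY | O_CREAT | O_TRUNC, 0600);
    if (checkfd < 0)
        return -1;
    int ok = write(checkfd, &seq, sizeof(seq)) == (ssize_t)sizeof(seq)
          && write(checkfd, &iter, sizeof(iter)) == (ssize_t)sizeof(iter)
          && write(checkfd, a, bytes) == (ssize_t)bytes
          && write(checkfd, &seq, sizeof(seq)) == (ssize_t)sizeof(seq)
          && fsync(checkfd) == 0;
    if (close(checkfd) != 0)
        ok = 0;
    return ok ? 0 : -1;
}

/* Read one checkpoint file. It counts as complete only if every field is
 * present and the trailer matches the header. On failure, *iter and a may
 * hold partial data, so read into scratch space. Returns 0 or -1. */
int read_checkpoint(const char *path, unsigned *seq, int *iter,
                    double *a, size_t count)
{
    unsigned head = 0, tail = 0;
    size_t bytes = count * sizeof(*a);
    int fd = open(path, O_RDONLY);
    if (fd < 0)
        return -1;
    int ok = read(fd, &head, sizeof(head)) == (ssize_t)sizeof(head)
          && read(fd, iter, sizeof(*iter)) == (ssize_t)sizeof(*iter)
          && read(fd, a, bytes) == (ssize_t)bytes
          && read(fd, &tail, sizeof(tail)) == (ssize_t)sizeof(tail)
          && head == tail;
    close(fd);
    if (ok)
        *seq = head;
    return ok ? 0 : -1;
}
```

At restart, try both files and resume from the complete one with the higher sequence number.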
Processes that rely heavily on dynamically allocated data structures or shared-memory regions complicate checkpointing. Global variables are easily saved because they can be located by name, while chunks of memory handed out by malloc() may only be referenced by pointers buried inside of other data structures. To capture portions of your heap in the checkpoint image, walk through dynamically allocated data structures and assign them ordinal numbers so you can rebuild pointer values later on. Once data structures are separated from their original address spaces, pointer values are meaningless, so the ordinal values become a way to identify items and later repair pointer values. When writing the checkpoint, save both the data items and a table cross-referencing pointer values and ordinal numbers so you can reset pointers when reloading data items from the checkpoint.
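Here is a rough sketch of the ordinal-number idea applied to a linked list whose nodes also point at one another. The `struct node` layout and the function names are hypothetical. As a simplification, the pointer-to-ordinal table lives only in memory during the save, and each saved node carries ordinals in place of its pointers.

```c
#include <stdio.h>
#include <stdlib.h>

struct node {
    int value;
    struct node *next;   /* list link */
    struct node *peer;   /* cross-reference to another node, or NULL */
};

/* On-disk form: pointers replaced by ordinal numbers (-1 means NULL). */
struct disk_node {
    int value;
    long next_ord;
    long peer_ord;
};

/* Look up a pointer in the cross-reference table; returns its ordinal,
 * or -1 for NULL or any pointer that is not in the table. */
static long ordinal_of(struct node **table, long n, const struct node *p)
{
    for (long i = 0; i < n; i++)
        if (table[i] == p)
            return i;
    return -1;
}

/* Walk the list, number the nodes, and write them with ordinals. */
int save_list(FILE *fp, struct node *head)
{
    long n = 0, i = 0;
    for (struct node *p = head; p; p = p->next)
        n++;
    struct node **table = malloc((n ? n : 1) * sizeof(*table));
    if (table == NULL)
        return -1;
    for (struct node *p = head; p; p = p->next)
        table[i++] = p;
    int ok = fwrite(&n, sizeof(n), 1, fp) == 1;
    for (i = 0; ok && i < n; i++) {
        struct disk_node d = { table[i]->value,
                               ordinal_of(table, n, table[i]->next),
                               ordinal_of(table, n, table[i]->peer) };
        ok = fwrite(&d, sizeof(d), 1, fp) == 1;
    }
    free(table);
    return ok ? 0 : -1;
}

/* Reload into one compact block and turn ordinals back into pointers.
 * Returns the head (ordinal 0), or NULL on error; free() it when done. */
struct node *load_list(FILE *fp)
{
    long n;
    if (fread(&n, sizeof(n), 1, fp) != 1 || n <= 0)
        return NULL;
    struct node *nodes = malloc(n * sizeof(*nodes));
    if (nodes == NULL)
        return NULL;
    for (long i = 0; i < n; i++) {
        struct disk_node d;
        if (fread(&d, sizeof(d), 1, fp) != 1 ||
            d.next_ord >= n || d.peer_ord >= n) {
            free(nodes);
            return NULL;
        }
        nodes[i].value = d.value;
        nodes[i].next = d.next_ord >= 0 ? &nodes[d.next_ord] : NULL;
        nodes[i].peer = d.peer_ord >= 0 ? &nodes[d.peer_ord] : NULL;
    }
    return nodes;
}
```

The linear lookup in `ordinal_of()` is fine for a sketch but quadratic overall; a large heap would want a hash table keyed on the pointer value.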
Saving dynamically allocated memory is difficult but necessary: The restarted process will have a nicely compacted address space, having survived a forced garbage collection. If a lack of swap space is contributing to the application failure in the first place, this approach buys you additional execution time by restarting a leaner executable.
The more your application resembles a typical C program (not to mention C++ and its heart of class), the more attractive a brute-force mind meld with the disk becomes. Ideally, you'd like to call a routine that writes out the entire process to disk, in executable format, so you can restart it later just by invoking the preserved process image. Fortunately, Spencer Thomas (of the University of Utah) solved this problem more than a decade ago with his unexec() routine. The code does the inverse of a Unix exec(), namely, it turns an executable back into a disk file. GNU Emacs uses unexec() to create a new executable reflecting all initialization, which can take several minutes with a reasonably large set of macros. Find the unexec() code in unexec.c, part of the Emacs distribution available for ftp from prep.ai.mit.edu.
The default values used by unexec() are reasonable enough to save the entire initialized and uninitialized (BSS) segments of your process, along with the text segment needed to create an executable. You could modify unexec() to save the entire heap and capture dynamically allocated data structures as well, but doing so requires knowing a few low-level details about the process image layout on your Unix variant.
Re-enter at your own risk
The implementation details regarding checkpoints have been left sketchy due to the numerous variations that need to be addressed. There is no single best solution; you have to weigh the complexity of the checkpoint mechanism against the benefits you'll reap from being able to perform CPR on a flat-lined application. Pulling off a Lazarus-type escapade is best done with safety guidelines. Above all, ensure that your checkpoint generates a consistent view of your data landscape. Typically, checkpoints are fired off using an alarm timer:
/* do_checkpoint () writes out the checkpoint image */
#define CHECKPT_TIMER 600
Signals are asynchronous events. You might take a signal while updating data, or between the instructions that modify a single value. You can surround noninterruptible sections of code with sigprocmask() calls that block the signal and hold it pending, but you pay a high price in additional system calls.
Compute-intensive apps will probably find this intolerable, as most calculations would have to be protected. Alternatively, use the signal handler to set a flag that is checked periodically at points where the application can guarantee a consistent data view, typically at the entrance to a major loop or between blocks of database records. You could take a more direct path and generate a checkpoint every n iterations, but if loop-execution times vary significantly you'll want to use the timer to meter out checkpoints a bit more evenly.
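The flag-based variation might look something like the following sketch. CHECKPT_TIMER and do_checkpoint() come from the fragment above, though the do_checkpoint() body here is only a counting stand-in; the handler name, the flag, and the two helper functions are assumptions for illustration.

```c
#define _POSIX_C_SOURCE 200809L
#include <signal.h>
#include <string.h>
#include <unistd.h>

#define CHECKPT_TIMER 600   /* seconds between checkpoints */

static volatile sig_atomic_t checkpoint_due = 0;
static int checkpoints_taken = 0;

/* Stand-in for do_checkpoint(); a real one writes the checkpoint image. */
static void do_checkpoint(void)
{
    checkpoints_taken++;
}

/* Signal handler: only record that a checkpoint is wanted. */
static void on_alarm(int signo)
{
    (void)signo;
    checkpoint_due = 1;
}

/* Install the handler and start the first timer. Returns 0 on success. */
int start_checkpoint_timer(unsigned seconds)
{
    struct sigaction sa;
    memset(&sa, 0, sizeof(sa));
    sa.sa_handler = on_alarm;
    sigemptyset(&sa.sa_mask);
    sa.sa_flags = SA_RESTART;
    if (sigaction(SIGALRM, &sa, NULL) != 0)
        return -1;
    alarm(seconds);
    return 0;
}

/* Call where the data is consistent, such as the top of the main loop.
 * If the timer has fired, take the checkpoint and re-arm the timer.
 * Returns 1 if a checkpoint was taken, 0 otherwise. */
int checkpoint_if_due(unsigned seconds)
{
    if (!checkpoint_due)
        return 0;
    checkpoint_due = 0;
    do_checkpoint();
    alarm(seconds);
    return 1;
}
```

The main loop calls `start_checkpoint_timer(CHECKPT_TIMER)` once and `checkpoint_if_due(CHECKPT_TIMER)` at the top of every iteration. The per-iteration cost is a single test of a flag instead of a pair of system calls.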
Administrators with mainframe experience grin when the Unix data-center group talks checkpoints, because mainframes have tools that identify looping transactions, abort them, and identify the offending code segment on the fly. In 25 years of Unix, we have gotten better at hardening systems. Reliability and quality are achieved in layers. First ensure that the disks holding checkpoints are redundant, or a major system failure may stop your job and wipe out the checkpoint image. The net result: you are without data and without a prayer of completing the job. Layer checkpoint features above all other reliability and redundancy steps. Integrate them with machine- and system-availability features so you can always go back to a previously scheduled program.
About the Author
Hal Stern is an area technology manager for Sun. He can be reached at email@example.com.
Last updated: 1 December 1994