Inside Solaris by Jim Mauro

It's time to wake up!

Into another dimension of kernel thread support -- the mechanics of sleep and wakeup in Solaris

July 1999

The typical lifetime of a kernel thread includes not only execution time on a processor, but time spent waiting for requested resources to become available. The most obvious example is a read or write from disk, where the kernel thread will issue the read(2) or write(2) system call, then sleep so that another thread can make use of the processor while the I/O is being processed by the kernel. Once the I/O has been completed, the kernel will wake up the thread so it can continue its work.

Other examples involve a kthread blocking while it waits for a lock or another type of synchronization object that is currently held by a different kernel thread.

This month Jim examines kernel sleep queues and the synchronization objects used to manage kernel thread sleep and wakeup. (3,300 words)

We've spent a lot of time during the last several months talking about dispatch queues, which exist on a per-processor basis and queue runnable kernel threads (or kthreads) waiting to be scheduled on a processor. There are other queues besides the dispatch queues that are maintained in the operating system and designed to hold kernel threads. One of these is the sleep queue (sleepq). A kernel thread is placed on a sleep queue when it needs to sleep and wait for the availability of a resource (e.g., a mutex lock, reader/writer lock, etc.) or for the kernel to perform some service (e.g., make a system call). There are a couple of variations on sleep queues as implemented in the kernel, although they all use the same underlying sleep queue structures. Turnstiles are implemented using sleep queues and are used specifically for sleep/wakeup support in the context of priority inheritance, mutex locks, and reader/writer locks. If a kernel thread is put to sleep for something beyond those items covered by turnstiles, it uses a separate sleep queue. Turnstiles and priority inheritance will be the subject of next month's column.

The underlying synchronization mechanism used for sleep/wakeup in Solaris is known as the condition variable. Condition variables are data structures that identify an event for which a kernel thread is waiting, and they are used in many places around the operating system. The structure itself is quite small and can be examined in /usr/include/sys/condvar.h and /usr/include/sys/condvar_impl.h:

typedef struct _condvar_impl {
        ushort_t        cv_waiters;
} condvar_impl_t;

The condition variable itself is simply a 2-byte (16-bit) data type. When initialized for a specific event, it stores the number of threads waiting on that event. The implementation is such that the various kernel subsystems that use condition variables declare a condition variable data type with a unique name, either as a standalone data item or as an embedded element in a data structure. A generic kernel cv_init() function is used to set the condition variable to all zeros during the initialization phase of the kernel module. Other kernel-level condition variable interfaces are defined and called by different areas of the operating system in order to set up a thread to block on a particular event and insert the kernel thread on a sleep queue.
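
As a minimal sketch, a kernel subsystem might embed and initialize a condition variable like this (foo_softc and its fields are hypothetical names for illustration, not actual kernel code; cv_init() and mutex_init() are as documented in the DDI):

#include <sys/condvar.h>
#include <sys/mutex.h>

struct foo_softc {
        kmutex_t        foo_lock;       /* protects foo_busy */
        kcondvar_t      foo_cv;         /* announces changes to foo_busy */
        int             foo_busy;       /* the "event": resource in use */
};

void
foo_init(struct foo_softc *sc)
{
        mutex_init(&sc->foo_lock, NULL, MUTEX_DEFAULT, NULL);
        cv_init(&sc->foo_cv, NULL, CV_DEFAULT, NULL);   /* zeroes cv_waiters */
        sc->foo_busy = 0;
}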

The basic flow goes as follows: If there's a point in an operating system code segment that warrants blocking a kernel thread, the code will call any one of several condition variable interfaces, such as cv_wait(), cv_wait_sig(), cv_timedwait(), cv_wait_stop(), etc., passing a pointer to the condition variable.
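
Continuing the hypothetical foo_softc example, a minimal sketch of the canonical usage looks like this (cv_wait() atomically drops the mutex while the thread sleeps and reacquires it before returning):

void
foo_acquire(struct foo_softc *sc)
{
        mutex_enter(&sc->foo_lock);
        while (sc->foo_busy)                            /* recheck after every wakeup */
                cv_wait(&sc->foo_cv, &sc->foo_lock);    /* sleep until signaled */
        sc->foo_busy = 1;                               /* claim the resource */
        mutex_exit(&sc->foo_lock);
}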

These interfaces provide some flexibility in altering behavior based on the condition for which the kernel thread must wait. Ultimately, the cv_block() interface -- the kernel routine that actually sets the t_wchan value in the kernel thread and calls sleepq_insert() to place the thread on a sleep queue -- is called. The t_wchan, or wait channel, contains the address of the condition variable that the thread is blocked on. This address appears in the output of a ps -efl command:

sunsys> ps -efl | more
19 T     root     0     0  0   0 SY f0274e38      0            Jun 12 ?        0:01 sched
 8 S     root     1     0  0  41 20 f5c9c888    162 f5c9ca80   Jun 12 ?        0:02 /etc/init -
19 S     root     2     0  0   0 SY f5c9c1c8      0 f02886a4   Jun 12 ?        0:00 pageout
.						    ^^^^^^^^

The notion of a wait channel, or wchan, is nothing new or unique to Solaris. Traditional implementations of Unix had a wchan maintained at the process level, and the wchan was directly related to the process sleep state. Naturally, in the Solaris multithreaded model, we moved the wait channel into the kernel thread, because kernel threads execute independently of other kernel threads in the same process and can execute system calls and block. Let's move on to sleep queues, and then we'll get back to condition variables and sleep queue insertion.

Sleep queues
Sleep queues are organized as linked lists of kernel threads, with each list rooted in an entry of the kernel's sleepq_head array. Some changes were made in Solaris 7 to facilitate faster sleepq operations (insertion, removal, and queue traversal). In Solaris releases prior to Solaris 7, an entry in the sleepq_head array begins a linked list of kernel threads waiting on the same condition variable. The list is singly linked via the kernel thread's t_link pointer. In Solaris 7, several additional pointers were added to the kernel thread to support a doubly linked sublist of threads at the same priority. Figure 1 illustrates both implementations.

Figure 1. Sleep queues in Solaris 2.x and Solaris 7
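
A simplified sketch of the underlying structures, modeled on /usr/include/sys/sleepq.h (field and constant names are approximate, not verbatim; kthread_t comes from sys/thread.h):

typedef struct sleepq {
        kthread_t       *sq_first;      /* first kthread on this queue */
} sleepq_t;

typedef struct sleepq_head {
        disp_lock_t     sq_lock;        /* protects this sleep queue */
        sleepq_t        sq_queue;       /* list of sleeping kthreads */
} sleepq_head_t;

extern sleepq_head_t sleepq_head[NSLEEPQ];      /* indexed by hashing the cv address */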

In all implementations (pre- and post-Solaris 7, that is), a hashing function is used to index the sleepq_head array, hashing on the address of the condition variable. In the pre-Solaris 7 implementation, kernel threads are inserted on the list in descending order of their priority. In Solaris 7, the singly linked list, from which the doubly linked sublists derive, is also maintained in descending order of priority. (The doubly linked list is the one drawn along the horizontal line in Figure 1.) The sublist is implemented using a t_priforw (forward pointer) and t_priback (previous pointer) in the kernel thread. Also, a t_sleepq pointer was added that points back to the array entry in sleepq_head, identifying which sleepq the thread is on and providing a quick way to determine whether a thread is on a sleep queue at all. (If t_sleepq == NULL, the thread is not on a sleep queue.)
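
The hash itself is cheap. A hypothetical version (the real kernel macro differs in its constants) simply folds the low-order bits of the condition variable's address into an array index:

#define SQHASH(cvp) \
        ((((uintptr_t)(cvp) >> 2) ^ ((uintptr_t)(cvp) >> 9)) & (NSLEEPQ - 1))

/* locating the sleep queue for a given condition variable address cvp: */
sleepq_head_t *sqh = &sleepq_head[SQHASH(cvp)];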

The number of kernel interfaces to the sleep queue facility is minimal, as there are really only a few operations that are performed on sleep queues: the insertion of a kernel thread on a sleep queue (putting a thread to sleep), the removal of a thread from a sleep queue (waking a thread up), and the traversal of the sleep queue in search of a kernel thread. There are interfaces that provide for waking up only one thread or all threads sleeping on the same condition variable. Insertion simply involves indexing into the sleepq_head array to find the appropriate sleep queue based on the condition variable address, then traversing the list, checking thread priorities along the way to determine the proper insertion point. Conceptually, the same process happens on Solaris 7; it's just slightly more work, because the kernel maintains sublists of kthreads at the same priority. The kernel thread is inserted, and the pointers are set up, under one of two conditions: either the appropriate sublist has been found, with at least one kernel thread at the same priority, or there are no other threads on the sleep queue at that priority, in which case a new sublist is started.
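
A hedged sketch of the pre-Solaris 7 insertion walk described above (singly linked via t_link, highest priority first) -- an illustration of the technique, not the actual sleepq_insert() source:

static void
sleepq_insert_sketch(sleepq_t *sqp, kthread_t *t)
{
        kthread_t       **tpp = &sqp->sq_first;
        kthread_t       *tp;

        /* walk past sleepers of equal or higher priority */
        while ((tp = *tpp) != NULL && tp->t_pri >= t->t_pri)
                tpp = &tp->t_link;

        t->t_link = tp;         /* splice the new sleeper in */
        *tpp = t;
}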

The removal of a kthread will involve either searching for and removing a specific thread that has been specified by the code calling into sleepq_dequeue() or sleepq_unsleep(), waking up all the threads blocking on a particular condition variable, or waking up one thread (a kthread not specified by the caller) that is blocked on a condition variable. Waking up all threads or a specified thread is (hopefully) straightforward -- hash into the sleepq_head array based on the address of the condition variable and walk the list. You can either wake each thread up or search for a particular kthread and wake that specific one. If you need to remove a single, unspecified kthread, the code implements the list as a FIFO (first in, first out), so the kthread that has been sleeping the longest on a condition variable will be selected for wakeup.


Putting a kernel thread to sleep
Now that we've introduced condition variables and sleep queues, let's tie them together to form the complete sleep/wakeup picture in Solaris. The interfaces to the sleep queue (sleepq_insert(), etc.) are, for the most part, called only from the condition variable and turnstile subsystems. From a hierarchical perspective, you can visualize the sleep/wakeup implementation as it is depicted in Figure 2, where a well-defined and limited set of interfaces are used to manipulate kernel threads on the sleep queues.

Figure 2. Conceptual view of sleep/wakeup interfaces

To illustrate the sequence of events and code flow, we'll use an example from the UFS (Unix filesystem) code segment involved in file writes. When a kernel thread issues a write(2) system call to a UFS file, the code will enter the ufs_write() routine through the VFS (Virtual filesystem) switch table (see links to Richard McDougall's series of articles on filesystems in the Resources section below). The UFS code implements a write throttle intended to prevent an inordinate amount of write data from overrunning the page cache. The throttle mechanism uses a couple of fields in the file's inode: the i_writes field, which maintains the number of bytes outstanding to be written to the file, and i_wrcv, the inode's write condition variable for the file. Inside the ufs_write() code, a check is made to determine if the number of bytes to be written to the file exceeds the high-water mark for the throttle (ufs_HW); if that condition exists, the thread is put to sleep until some of the write data drains and the number of outstanding bytes drops below the low-water mark (ufs_LW). Inside the throttle code, a call is made to cv_wait(), passing the address of the i_wrcv condition variable and a mutex lock (i_tlock), which synchronizes access to fields in the inode.

if (write throttling enabled AND (i_writes > ufs_HW)) {
	get the i_tlock mutex
	while (i_writes > ufs_HW) {
		cv_wait(address of i_wrcv, address of i_tlock)
	}
	release the i_tlock mutex
}

The kernel thread that issues the write(2) call will be put to sleep in the cv_wait() call within the while loop. Let's have a look.

The cv_wait() function does just a little housekeeping when entered, then calls the cv_block() code, passing cv_block() the address of the condition variable (i_wrcv in this example). cv_block() does some additional checking of various state flags and invokes the scheduling class-specific sleep function via the CL_SLEEP() macro, which will resolve to ts_sleep() for a TS or IA class thread (there is no RT class sleep function). The intent of the ts_sleep() code is to boost the priority of the sleeping thread to a SYS priority, which will result in the kthread being placed in an optimal position on the sleep queue for early wakeup, and in quick rescheduling when the wakeup occurs. The assignment of a SYS priority to the kernel thread is not guaranteed every time ts_sleep() is entered. There are flags in the kthread structure, along with the kthread's class-specific data (ts_data in the case of a TS class thread), that specify whether a kernel mode (SYS) priority is required. When we talk about turnstiles next month, you'll see an example of a kthread getting a SYS class priority because it's holding either a reader/writer lock or a page lock on a memory page. For most other cases, such as our example of the UFS write throttle condition variable, a SYS class priority is not required, so one is not assigned to the thread.

The code will return to cv_block() from ts_sleep(), not having done much in ts_sleep() because we didn't need a kernel mode priority. Back in cv_block(), the kthread's t_wchan is set to the address of the condition variable (in our example, the inode's i_wrcv address), and the kthread's t_sobj_ops is set to the address of the condition variable's operations structure. If you take a look at /usr/include/sys/sobject.h, you'll find the definition of a synchronization object operations structure:

typedef struct _sobj_ops {
        syncobj_t       sobj_type;
        kthread_t       *(*sobj_owner)();
        void            (*sobj_unsleep)(kthread_t *);
        void            (*sobj_change_pri)(kthread_t *, pri_t, pri_t *);
} sobj_ops_t;

This is a generic structure that is used for all types of synchronization objects supported by the operating system. Note the enumerated types in the same header file, which describe mutex locks, reader/writer locks, semaphores, condition variables, etc. Essentially, this object provides a placeholder for a few routines specific to the synchronization object that may require invocation while the kernel thread is sleeping. In the case of condition variables (our example), the sobj_ops structure is populated with the address of the cv_owner(), cv_unsleep(), and cv_change_pri() functions, with the sobj_type field set to SOBJ_CV. t_sobj_ops is set to the address of this structure in the cv_block() code.
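
Based on that description, the populated structure for condition variables would look something like this (a sketch reconstructed from the text above, not verbatim kernel source):

static sobj_ops_t cv_sobj_ops = {
        SOBJ_CV,                /* sobj_type */
        cv_owner,               /* sobj_owner */
        cv_unsleep,             /* sobj_unsleep */
        cv_change_pri,          /* sobj_change_pri */
};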

Next, the correct sleep queue is located using the hashing function on the condition variable address to index into the sleepq_head array; the cv_waiters field in the condition variable is incremented; the thread state is set to TS_SLEEP; and the sleepq_insert() function is called to insert the kernel thread into the correct position (based on priority) in the sleep queue. The kthread is now sitting on a sleep queue in a TS_SLEEP state, waiting for a wakeup.
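
Pulling the sleep half together, here's a hedged outline of that sequence (simplified, with locking and the class-specific calls omitted; SQHASH is the hypothetical hash macro sketched earlier):

t->t_wchan = (caddr_t)cvp;              /* record what we're waiting on */
t->t_sobj_ops = &cv_sobj_ops;           /* condition variable ops vector */
cvp->cv_waiters++;                      /* one more sleeper on this cv */
t->t_state = TS_SLEEP;
sleepq_insert(&sleepq_head[SQHASH(cvp)].sq_queue, t);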

The wakeup mechanism
For every cv_wait() (or equivalent) call on a condition variable, there will be a corresponding wakeup call using cv_signal(), cv_broadcast(), or cv_unsleep(). We'll continue with our example using the i_wrcv condition variable from a UFS file inode, which implements the aforementioned write throttle scheme. The UFS support code has a function, ufs_iodone(), which is called when a UFS I/O is completed. ufs_iodone() will be called from the UFS function that hands physical writes down to the device driver layer (e.g., calling the SCSI driver's block device strategy routine for a write to a SCSI disk). Inside ufs_iodone(), a check is made to determine if write throttles are enabled and if the number of bytes ready to be written to the file has dropped below the low-water mark (ufs_LW). If those conditions are met, ufs_iodone() calls cv_broadcast(), passing the address of the i_wrcv condition variable. Naturally, there will probably be multiple threads blocked on UFS file writes, and UFS I/Os are completed system-wide potentially hundreds or thousands of times a minute. The address of the block buffer used for the I/O is passed to ufs_iodone() and is converted to the vnode, and ultimately the inode, for the file. Once the inode is known, we have the address of the i_wrcv condition variable -- and that's all we need to locate the correct sleep queue.
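
A hedged sketch of that throttle-release logic, modeled on the description above (the function name and the bytes_done parameter are hypothetical, not verbatim ufs_iodone()):

static void
ufs_throttle_release_sketch(struct inode *ip, ssize_t bytes_done)
{
        mutex_enter(&ip->i_tlock);
        ip->i_writes -= bytes_done;             /* this write has completed */
        if (ip->i_writes <= ufs_LW)             /* drained below low water? */
                cv_broadcast(&ip->i_wrcv);      /* wake all throttled writers */
        mutex_exit(&ip->i_tlock);
}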

The cv_broadcast() function simply locates the address of the sleep queue by invoking the hash function on the address of the condition variable (which was passed to cv_broadcast() as an argument), and calls sleepq_wakeall_chan(). sleepq_wakeall_chan() walks through the linked list of kernel threads waiting on that particular condition variable and calls sleepq_unlink() for each kthread. sleepq_unlink() removes the kthread from the sleep queue linked list, adjusts the pointers (setting t_priforw, t_priback, etc., to NULL), and returns to sleepq_wakeall_chan(). Back in sleepq_wakeall_chan(), the kthread's t_wchan and t_sobj_ops fields are cleared, and the scheduling class-specific wakeup code is called -- ts_wakeup() for IA/TS threads and rt_wakeup() for RT class threads.
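
In outline form, the walk just described looks roughly like this (a simplified sketch with locking and error handling omitted; the sleepq_unlink() argument style is an assumption):

kthread_t *t;

while ((t = sqp->sq_first) != NULL) {
        sleepq_unlink(&sqp->sq_first, t);       /* take t off the sleep queue */
        t->t_wchan = NULL;                      /* no longer waiting on anything */
        t->t_sobj_ops = NULL;
        CL_WAKEUP(t);                   /* ts_wakeup() or rt_wakeup(), by class */
}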

The ts_wakeup() code is responsible for putting the kernel thread back on a dispatch queue so it can be scheduled for execution on a processor. Threads that have a kernel mode priority (as indicated by the TSKPRI flag in the class-specific data structure) are placed on the front of the appropriate dispatch queue if the thread is in the IA scheduling class. If it is a TS class thread, the setbackdq() function is called to place the thread on the back of the appropriate dispatch queue. Threads that are not at a SYS priority are tested to see if they've waited longer than the ts_maxwait value in the dispatch table entry for the thread's priority level. (See Inside Solaris, October 1998 for a description of the dispatch table fields.) If the thread has been waiting an inordinate amount of time to run (that is, if ts_dispwait > ts_maxwait), the thread's priority will be recalculated, using the ts_slpret value from the dispatch table. (See Inside Solaris, June 1999 for a description of how kernel thread priorities are calculated.)

Whether or not the thread's priority is recalculated, the dispatcher's setfrontdq() or setbackdq() function will be called to place the thread on the front or back of the appropriate dispatch queue. A few conditions determine front versus back: if the thread is interactive (IA), or if it has waited a long time to run (based on the kernel thread's t_disp_wait value as compared to the current kernel lbolt value), it gets placed on the front of a queue. The thread state is switched from TS_SLEEP to TS_RUN in the dispatcher queue insertion code (setfrontdq(), setbackdq()). It's also in these functions that we determine whether the thread we just placed on a queue is of a higher priority than the currently running thread; if so, a preemption is forced. At this point, the kthread has awakened and is sitting on a dispatch queue, ready to be context-switched onto a processor.

On a related note, operating system texts that discuss sleep/wakeup often refer to something known as the thundering herd problem. This is a situation in which you have multiple threads all sleeping and waiting for the same resource. The resource is made available, and all the sleeping threads are awakened, scheduled, and executed -- they all race for the resource that was just freed. Only one thread will get the resource, however, forcing the remaining threads to block again.

The Solaris kernel does not use cv_broadcast() every time an event warrants the wakeup of sleeping threads. There are interfaces, such as cv_signal(), that are used to wake up only one thread blocked on the condition variable. cv_signal() will wake up the highest priority thread; in the event of multiple threads at the same priority, the thread at the front of the queue is selected.
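
Continuing the hypothetical foo_softc example, when only one waiter can make use of the freed resource, waking a single thread sidesteps the thundering herd:

void
foo_release(struct foo_softc *sc)
{
        mutex_enter(&sc->foo_lock);
        sc->foo_busy = 0;               /* resource is free */
        cv_signal(&sc->foo_cv);         /* wake exactly one waiter */
        mutex_exit(&sc->foo_lock);
}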

The sleep/wakeup mechanism in Solaris is built around synchronization objects known as condition variables and sleep queues. The current implementation is fast and efficient, scaling up to support hundreds to thousands of kernel threads on a large Solaris system. The general flow is depicted below.

Table 1. Sleep/wakeup mechanism in Solaris

Sleep:
  1. Kernel thread executing in user mode
  2. System call is invoked
  3. Enter a kernel segment that requires waiting for a resource or event
  4. Kernel calls cv_wait()
  5. cv_wait() calls sleepq_insert()
  6. Thread is now sleeping on a sleepq

Wakeup:
  1. Event completes or resource is made available
  2. Kernel calls cv_broadcast() or cv_signal()
  3. cv_broadcast() calls sleepq_wakeall_chan()
  4. Kernel thread is removed from sleep queue
  5. Dispatcher queue insertion code is called
  6. Kernel thread is inserted on a dispatch queue
  7. Kernel thread is now ready to run

Next month, we'll get into another area of sleep/wakeup in Solaris -- turnstiles and the notion of priority inheritance.


Additional SunWorld resources

About the author
Jim Mauro is currently an area technology manager for Sun Microsystems in the Northeast, focusing on server systems, clusters, and high availability. He has a total of 18 years of industry experience, working in educational services (he developed and delivered courses on Unix internals and administration) and software consulting.

