It's time to wake up! Into another dimension of kernel thread support -- the mechanics of sleep and wakeup in Solaris
The typical lifetime of a kernel thread includes not only execution time on a processor, but also time spent waiting for requested resources to become available. The most obvious example is a read or write from disk, where the kernel thread will issue the read(2) or write(2) system call, then sleep so that another thread can make use of the processor while the I/O is being processed by the kernel. Once the I/O has been completed, the kernel will wake up the thread so it can continue its work. There are other examples as well that involve a kthread blocking, such as waiting for a lock or other type of synchronization object that is currently being held by another kernel thread. This month Jim examines kernel sleep queues and the synchronization objects used to manage kernel thread sleep and wakeup. (3,300 words)
Previous columns examined the Solaris dispatcher and its dispatch queues -- the queues of runnable kernel threads (kthreads) waiting to be scheduled on a processor.
There are other queues besides the dispatch queues that are maintained in the operating system and designed to hold kernel threads. One of these is the sleep queue (sleepq). A kernel thread is placed on a sleep queue when it needs to sleep and wait for the availability of a resource (e.g., a mutex lock, reader/writer lock, etc.) or for the kernel to perform some service (e.g., as part of a system call). There are a couple of variations on sleep queues as implemented in the kernel, although they all use the same underlying sleep queue structures.
Turnstiles are implemented using sleep queues. They are used
specifically for sleep/wakeup support in the context of priority
inheritance, mutex locks, and reader/writer locks. If a kernel thread is
put to sleep for something beyond those items covered by turnstiles,
it uses a separate sleep queue. Turnstiles and priority inheritance will
be the subject of next month's column.
The underlying synchronization mechanism used for sleep/wakeup in Solaris is known as the condition variable. Condition variables are data structures that identify an event for which a kernel thread is waiting, and they are used in many places around the operating system. The structure itself is quite small and can be examined in /usr/include/sys/condvar.h and /usr/include/sys/condvar_impl.h:
typedef struct _condvar_impl {
        ushort_t        cv_waiters;
} condvar_impl_t;
The condition variable itself is simply a 2-byte (16-bit) data type. When initialized for a specific event, it stores the number of threads waiting on the event. The implementation is such that the various kernel subsystems that use condition variables declare a condition variable data type with a unique name, either as a standalone data item or as an embedded element in a data structure.
A generic kernel cv_init() function is used to set the condition variable to all zeros during the initialization phase of the kernel module. Other kernel-level condition variable interfaces are defined and called by different areas of the operating system in order to set up a thread to block on a particular event and insert the kernel thread on a sleep queue.
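For a sense of how a subsystem typically uses these interfaces, here's a minimal sketch of declaring and initializing a condition variable. The foo names are made up for illustration; the cv_init() and mutex_init() calls follow the standard DDI/DKI forms.

#include <sys/ksynch.h>

/*
 * Hypothetical kernel subsystem embedding a condition variable in
 * its own data structure, alongside the mutex that protects the
 * state associated with the event.
 */
struct foo_state {
        kmutex_t        foo_lock;       /* protects foo_busy */
        kcondvar_t      foo_cv;         /* event: foo becomes idle */
        int             foo_busy;
};

static void
foo_init(struct foo_state *fp)
{
        mutex_init(&fp->foo_lock, NULL, MUTEX_DEFAULT, NULL);
        /* zeroes the condition variable (cv_waiters) before first use */
        cv_init(&fp->foo_cv, NULL, CV_DEFAULT, NULL);
        fp->foo_busy = 0;
}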
The basic flow goes as follows: If there's a point in an operating system code segment that warrants blocking a kernel thread, the code will call any one of several condition variable interfaces, such as cv_wait(), cv_wait_sig(), cv_timedwait(), cv_wait_stop(), etc., passing a pointer to the condition variable.
These interfaces provide some flexibility in altering behavior based on the condition for which the kernel thread must wait. Ultimately, the cv_block() interface -- the kernel routine that actually sets the t_wchan value in the kernel thread and calls sleepq_insert() to place the thread on a sleep queue -- is called. The t_wchan, or wait channel, contains the address of the condition variable on which the thread is blocking. This address appears in the WCHAN column of the output of a ps -efl command:
sunsys> ps -efl | more
 F S      UID   PID  PPID  C PRI NI     ADDR  SZ    WCHAN STIME TTY      TIME CMD
19 T     root     0     0  0   0 SY f0274e38   0          Jun 12 ?       0:01 sched
 8 S     root     1     0  0  41 20 f5c9c888 162 f5c9ca80 Jun 12 ?       0:02 /etc/init -
19 S     root     2     0  0   0 SY f5c9c1c8   0 f02886a4 Jun 12 ?       0:00 pageout
 .                                              ^^^^^^^^
 .
 .
The notion of a wait channel, or wchan, is nothing new or unique to Solaris. Traditional implementations of Unix maintained a wchan at the process level, and the wchan was directly related to the process sleep state. Naturally, in the Solaris multithreaded model, we moved the wait channel into the kernel thread, because kernel threads execute independently of other kernel threads in the same process and can execute system calls and block. Let's move on to sleep queues, and then we'll get back to condition variables and sleep queue insertion.
Sleep queues
Sleep queues are organized as a linked list of kernel threads -- each linked list is rooted in an array referenced via a sleepq_head kernel pointer. Some changes were made in Solaris 7 to facilitate faster sleepq operations (insertion, removal, and queue traversal). In Solaris releases prior to Solaris 7, an entry in the sleepq_head array begins a linked list of kernel threads waiting on the same condition variable. The list is singly linked via the kernel thread's t_link pointer. In Solaris 7, several additional pointers were added to the kernel thread to support a doubly linked sublist of threads at the same priority. Figure 1 illustrates both implementations.
Figure 1. Sleep queues in Solaris 2.x and Solaris 7
In all implementations (pre- and post-Solaris 7, that is), a hashing function is used to index into the sleepq_head array, hashing on the address of the condition variable. In the pre-Solaris 7 implementation, kernel threads are inserted on the list in order of their priority (in descending order). In Solaris 7, the singly linked list from which the doubly linked sublists derive is likewise maintained in descending priority order. (The doubly linked list is the one drawn along the horizontal line in Figure 1.)
The sublist is implemented using a t_priforw (forward pointer) and t_priback (previous pointer) in the kernel thread. Also, a t_sleepq pointer was added that points back to the array entry in sleepq_head, identifying which sleepq the thread is on and also providing a quick method to determine if a thread is on a sleep queue at all. (If t_sleepq == NULL, the thread is not on a sleep queue.)
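To make the layout concrete, here's a hedged sketch of the pieces just described. The field names match the article, but the table size, the SQHASH() shift-and-mask hash, and the structure shapes are illustrative assumptions, not the actual Solaris definitions.

#include <sys/types.h>
#include <sys/ksynch.h>

typedef struct sleepq {
        kthread_t       *sq_first;      /* first thread on this sleep queue */
} sleepq_t;

typedef struct sleepq_head {
        kmutex_t        sq_lock;        /* protects this hash chain */
        sleepq_t        sq_queue;
} sleepq_head_t;

/* assumed table size and hash; the kernel's actual values may differ */
#define NSLEEPQ         512
#define SQHASH(cv)      ((((uintptr_t)(cv)) >> 2) & (NSLEEPQ - 1))

static sleepq_head_t    sleepq_head[NSLEEPQ];

/*
 * Per-thread fields supporting the Solaris 7 scheme (sketch):
 *      t_link          singly linked list of all threads on the sleepq
 *      t_priforw,      doubly linked sublist of threads at the
 *      t_priback       same priority
 *      t_sleepq        back pointer to the sleepq_head[] entry, or
 *                      NULL if the thread is not on a sleep queue
 */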
The number of kernel interfaces to the sleep queue facility is minimal, as there are really only a few operations that are performed on sleep queues: the insertion of a kernel thread on a sleep queue (putting a thread to sleep); the removal of a thread from a sleep queue (waking a thread up); and the traversal of the sleep queue in search of a kernel thread. There are interfaces that provide for waking up only one thread or all threads sleeping on the same condition variable. Insertion simply involves indexing into the sleepq_head array to find the appropriate sleep queue based on the condition variable address, then traversing the list, checking thread priorities along the way to determine the proper insertion point. Conceptually, the same process happens on Solaris 7; it's just slightly more work because we maintain sublists of kthreads at the same priority. The kernel thread is inserted, and its pointers properly set up, on one of two conditions: either the appropriate sublist has been found, with at least one kernel thread at the same priority, or it has been determined that there are no other threads on the sleep queue at the same priority, in which case a new sublist is started.
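As a rough sketch of that insertion logic (using the simpler pre-Solaris 7 singly linked form, and the sleepq_head/SQHASH definitions sketched earlier; sleepq_insert() is the real entry point, but this body is an illustrative reconstruction, not the kernel source):

/*
 * Illustrative sketch: insert kthread t on the sleep queue for
 * condition variable cv, keeping the list in descending priority
 * order so the highest-priority sleeper sits at the front.
 */
static void
sleepq_insert_sketch(void *cv, kthread_t *t)
{
        sleepq_head_t   *sqh = &sleepq_head[SQHASH(cv)];
        kthread_t       **tpp;

        mutex_enter(&sqh->sq_lock);
        tpp = &sqh->sq_queue.sq_first;
        while (*tpp != NULL && (*tpp)->t_pri >= t->t_pri)
                tpp = &(*tpp)->t_link;  /* walk past higher/equal priority */
        t->t_link = *tpp;               /* splice into the singly linked list */
        *tpp = t;
        t->t_sleepq = &sqh->sq_queue;   /* mark the thread as on this sleepq */
        mutex_exit(&sqh->sq_lock);
}

Note that inserting a new thread after all threads of equal priority is what gives the FIFO-within-a-priority behavior described next.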
The removal of a kthread will involve one of three operations: searching for and removing a specific thread that has been specified by the code calling into sleepq_dequeue() or sleepq_unsleep(); waking up all the threads blocking on a particular condition variable; or waking up one thread (a kthread not specified by the caller) that is blocked on a condition variable. Waking up all threads or a specified thread is (hopefully) straightforward -- hash into the sleepq_head array based on the address of the condition variable and walk the list, either waking each thread up or searching for a particular kthread and waking that specific one. If you need to remove a single, unspecified kthread, the code treats the list as a FIFO (first in, first out), so the kthread that has been sleeping the longest on a condition variable will be selected for wakeup.
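A hedged sketch of that wake-one case, given the descending-priority insertion above (the function name and body are illustrative; the kernel's actual routine differs in detail):

/*
 * Illustrative sketch: select the single thread to wake on cv's
 * sleep queue. Because insertion places a new thread after all
 * threads of equal priority, the front of the list is the
 * highest-priority thread that has slept the longest -- FIFO
 * within a priority.
 */
static kthread_t *
sleepq_wakeone_sketch(void *cv)
{
        sleepq_head_t   *sqh = &sleepq_head[SQHASH(cv)];
        kthread_t       *t;

        mutex_enter(&sqh->sq_lock);
        t = sqh->sq_queue.sq_first;
        if (t != NULL) {
                sqh->sq_queue.sq_first = t->t_link;     /* unlink from list */
                t->t_link = NULL;
                t->t_sleepq = NULL;     /* no longer on a sleep queue */
        }
        mutex_exit(&sqh->sq_lock);
        return (t);     /* caller hands t to the class wakeup code */
}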
Putting a kernel thread to sleep
Now that we've introduced condition variables and sleep queues, let's tie them together to form the complete sleep/wakeup picture in Solaris. The interfaces to the sleep queue (sleepq_insert(), etc.) are, for the most part, called only from the condition variable and turnstile subsystems. From a hierarchical perspective, you can visualize the sleep/wakeup implementation as it is depicted in Figure 2, where a well-defined and limited set of interfaces is used to manipulate kernel threads on the sleep queues.
Figure 2. Conceptual view of sleep/wakeup interfaces
To illustrate the sequence of events and code flow, we'll use an example from the UFS (Unix filesystem) code segment involved in file writes. When a kernel thread issues a write(2) system call to a UFS file, the code will enter the ufs_write() routine through the VFS (Virtual filesystem) switch table (see links to Richard McDougall's series of articles on filesystems in the Resources section below). The UFS code implements a write throttle intended to prevent an inordinate amount of write data from overrunning the page cache. The throttle mechanism uses a couple of fields in the file's inode: the i_writes field, which maintains the number of bytes outstanding to be written to the file, and i_wrcv, the inode's write condition variable for the file. Inside the ufs_write() code, a check is made to determine if the number of bytes to be written to the file exceeds the high-water mark for the throttle (ufs_HW); if that condition exists, the thread is put to sleep until some of the write data is drained from the queue and the number of outstanding bytes drops below the low-water mark (ufs_LW). Inside the throttle code, a call is made to cv_wait(), passing the address of the i_wrcv condition variable and a mutex lock (i_tlock), which synchronizes access to fields in the inode.
if (write throttles enabled AND (i_writes > ufs_HW)) {
        get the i_tlock mutex
        while (write_bytes_outstanding > ufs_HW)
                cv_wait(address of i_wrcv, address of i_tlock)
        release the i_tlock mutex
}
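In C terms, the same logic looks roughly like the sketch below -- a reconstruction of the pattern, assuming an inode pointer ip, not the literal ufs_write() source. Note that cv_wait() releases the mutex while the thread sleeps and reacquires it before returning, which is why i_tlock is passed along with the condition variable.

/* sketch of the UFS write-throttle wait, assuming inode pointer ip */
mutex_enter(&ip->i_tlock);
while (ip->i_writes > ufs_HW)
        cv_wait(&ip->i_wrcv, &ip->i_tlock);     /* i_tlock is dropped while
                                                   asleep, reacquired on
                                                   wakeup */
mutex_exit(&ip->i_tlock);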
The kernel thread that issues the write(2) call will be put to sleep in the cv_wait() call within the while loop. Let's have a look.
The cv_wait() function just does a little housekeeping when entered, then calls the cv_block() code, passing cv_block() the address of the condition variable (i_wrcv in this example). cv_block() does some additional checking of various state flags and invokes the scheduling class-specific sleep function via the CL_SLEEP() macro, which will resolve to ts_sleep() for a TS or IA class thread (there is no RT class sleep function). The intent of the ts_sleep() code is to boost the priority of the sleeping thread to a SYS priority, which will result in the kthread being placed in an optimal position on the sleep queue for early wakeup, and in quick rescheduling when the wakeup occurs. The assignment of a SYS priority to the kernel thread is not guaranteed every time ts_sleep() is entered. There are flags in the kthread structure, along with the kthread's class-specific data (ts_data in the case of a TS class thread), that specify whether or not a kernel mode (SYS) priority is required. When we talk about turnstiles next month, you'll see an example of a kthread getting a SYS class priority because it's holding either a reader/writer lock or a page lock on a memory page. For most other cases, such as our example of the UFS write throttle condition variable, a SYS class priority is not required and will not be assigned to the thread.
The code will return to cv_block() from ts_sleep(), not having done much in ts_sleep() because we didn't need a kernel mode priority. Back in cv_block(), the kthread's t_wchan is set to the address of the condition variable (in our example, the inode's i_wrcv address), and the kthread's t_sobj_ops is set to the address of the condition variable's operations structure. If you take a look at /usr/include/sys/sobject.h, you'll find a definition for a synchronization object operations structure:
typedef struct _sobj_ops {
        syncobj_t       sobj_type;
        kthread_t       *(*sobj_owner)();
        void            (*sobj_unsleep)(kthread_t *);
        void            (*sobj_change_pri)(kthread_t *, pri_t, pri_t *);
} sobj_ops_t;
This is a generic structure that is used for all types of synchronization objects supported by the operating system. Note the enumerated types in the same header file, which describe mutex locks, reader/writer locks, semaphores, condition variables, etc. Essentially, this object provides a placeholder for a few routines specific to the synchronization object that may require invocation while the kernel thread is sleeping. In the case of condition variables (our example), the sobj_ops structure is populated with the addresses of the cv_owner(), cv_unsleep(), and cv_change_pri() functions, with the sobj_type field set to SOBJ_CV. t_sobj_ops is set to the address of this structure in the cv_block() code.
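Concretely, that population likely amounts to a statically initialized instance along these lines -- a sketch inferred from the structure definition above, not a verbatim excerpt from the kernel source:

static sobj_ops_t cv_sobj_ops = {
        SOBJ_CV,        /* sobj_type: this is a condition variable */
        cv_owner,       /* condition variables have no meaningful owner */
        cv_unsleep,     /* remove a sleeping thread from the sleepq */
        cv_change_pri   /* reposition a sleeper after a priority change */
};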
Next, the correct sleep queue is located using the hashing function on the condition variable address to index into the sleepq_head array; the cv_waiters field in the condition variable is incremented; the thread state is set to TS_SLEEP; and the sleepq_insert() function is called to insert the kernel thread into the correct position (based on priority) in the sleep queue. The kthread is now sitting on a sleep queue in a TS_SLEEP state, waiting for a wakeup.
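Pulling those steps together, and reusing the sketched definitions from earlier, the tail of cv_block() behaves roughly like this (illustrative names and ordering; the real routine carries additional state checks):

/*
 * Illustrative sketch of the final steps of cv_block() for
 * condition variable cvp and the current thread t.
 */
t->t_wchan = (caddr_t)cvp;              /* record what we're waiting on */
t->t_sobj_ops = &cv_sobj_ops;           /* CV-specific operations vector */
((condvar_impl_t *)cvp)->cv_waiters++;  /* one more sleeper on this event */
t->t_state = TS_SLEEP;
sleepq_insert(&sleepq_head[SQHASH(cvp)].sq_queue, t);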
The wakeup mechanism
For every cv_wait() or equivalent call on a condition variable, there will be a corresponding wakeup call using cv_signal(), cv_broadcast(), or cv_unsleep(). We'll continue along with our example using the i_wrcv condition variable from a UFS file inode that's implemented with the aforementioned write throttle scheme. The UFS support code has a function, ufs_iodone(), which is called when a UFS I/O is completed. ufs_iodone() will be called from the UFS function that hands physical writes down to the device driver layer (e.g., calling the SCSI driver's block device strategy routine for a write to a SCSI disk). Inside ufs_iodone(), a check is made to determine if write throttles are enabled and if the number of bytes ready to be written to the file has dropped below the low-water mark (ufs_LW). If those conditions are met, ufs_iodone() calls cv_broadcast(), passing the address of the i_wrcv condition variable. Naturally, there will probably be multiple threads blocked on UFS file writes, and UFS I/Os are completed system-wide potentially hundreds or thousands of times a minute. The address of the block buffer used for the I/O is passed to ufs_iodone() and is converted to the vnode, and ultimately the inode, for the file. Once the inode is known, we have the address of the i_wrcv condition variable -- which is all we need to locate the correct sleep queue.
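The wakeup side of the throttle then reduces to something like the following sketch -- again assuming an inode pointer ip; a reconstruction of the pattern, not the literal ufs_iodone() source:

/* sketch of the write-throttle wakeup inside ufs_iodone() */
mutex_enter(&ip->i_tlock);
if (ip->i_writes <= ufs_LW)
        cv_broadcast(&ip->i_wrcv);      /* wake every thread blocked on
                                           this inode's write throttle */
mutex_exit(&ip->i_tlock);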
The cv_broadcast() function simply locates the address of the sleep queue by invoking the hash function on the address of the condition variable (which was passed to cv_broadcast() as an argument), and calls sleepq_wakeall_chan(). sleepq_wakeall_chan() will walk through the linked list of kernel threads waiting on that particular condition variable and, for each kthread, call sleepq_unlink(). sleepq_unlink() simply removes the kthread from the sleep queue linked list, adjusts the pointers (setting t_priforw, t_priback, etc. to NULL), and returns to sleepq_wakeall_chan(). On the return to sleepq_wakeall_chan(), the kthread's t_wchan and t_sobj_ops fields are cleared, and the scheduling class-specific wakeup code is called -- ts_wakeup() for IA/TS threads and rt_wakeup() for RT class threads.
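A hedged sketch of that walk, built on the earlier illustrative definitions (the real routine dispatches through the class-operations CL_WAKEUP() macro rather than naming ts_wakeup() or rt_wakeup() directly):

/*
 * Illustrative sketch: wake every thread sleeping on cv.
 */
static void
sleepq_wakeall_sketch(void *cv)
{
        sleepq_head_t   *sqh = &sleepq_head[SQHASH(cv)];
        kthread_t       *t, *next;

        mutex_enter(&sqh->sq_lock);
        for (t = sqh->sq_queue.sq_first; t != NULL; t = next) {
                next = t->t_link;
                t->t_link = NULL;       /* sleepq_unlink() work: drop the */
                t->t_sleepq = NULL;     /* thread from the queue's lists */
                t->t_wchan = NULL;      /* no longer waiting on anything */
                t->t_sobj_ops = NULL;
                CL_WAKEUP(t);           /* class-specific wakeup: ts_wakeup()
                                           or rt_wakeup() moves t toward a
                                           dispatch queue */
        }
        sqh->sq_queue.sq_first = NULL;
        mutex_exit(&sqh->sq_lock);
}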
The ts_wakeup() code is responsible for putting the kernel thread back on a dispatch queue so it can be scheduled for execution on a processor. Threads that have a kernel mode priority (as indicated by the TSKPRI flag in the class-specific data structure) are placed at the front of the appropriate dispatch queue if the thread is in the IA scheduling class. If it is a TS class thread, the setbackdq() function is called to place the thread at the back of the appropriate dispatch queue. Threads that are not at a SYS priority are tested to see if they've waited longer to be scheduled than the ts_maxwait value in the dispatch table entry for the thread's priority level. (See Inside Solaris, October 1998 for a description of the dispatch table fields.) If the thread has been waiting an inordinate amount of time to run (that is, if its ts_dispwait exceeds the ts_maxwait for its priority), the thread's priority will be recalculated using the ts_slpret value from the dispatch table. (See Inside Solaris, June 1999 for a description of how kernel thread priorities are calculated.)
Whether or not the thread's priority is recalculated, the dispatcher setfrontdq() or setbackdq() function will be called to place the thread at the front or back of the appropriate dispatch queue. There are conditions that determine queue front versus queue back: if the thread is interactive (IA), or if the thread has waited a long time to run (based on the kernel thread's t_disp_wait value as compared to the current kernel lbolt value), it gets placed at the front of a queue. The thread state will be switched from TS_SLEEP to TS_RUN in the dispatcher queue insertion code (setfrontdq(), setbackdq()). It's also in these functions that we determine if the thread we just placed on a queue is of a higher priority than the currently running thread; if so, we force a preemption. At this point, the kthread has awakened and is sitting on a dispatch queue, ready to be context-switched onto a processor.
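In outline, the insertion-side decisions read like this sketch -- a condensation of the behavior just described, with assumed helper predicates, not a reproduction of setfrontdq()/setbackdq():

/*
 * Illustrative outline of the dispatch-queue insertion decisions
 * for a thread t waking from sleep.
 */
t->t_state = TS_RUN;
if (is_ia_class(t) || waited_long(t))   /* assumed helper predicates */
        setfrontdq(t);                  /* favor interactive or starved
                                           threads with the queue front */
else
        setbackdq(t);
if (t->t_pri > curthread->t_pri)
        cpu_surrender(t);               /* force a preemption */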
On a related note, operating system texts that discuss sleep/wakeup often refer to something known as the thundering herd problem. This is a situation in which you have multiple threads all sleeping and waiting for the same resource. The resource is made available, and all the sleeping threads are awakened, scheduled, and executed -- they all race for the resource that was just freed. Only one thread will get the resource, however, forcing the remaining threads to block again.
The Solaris kernel does not use cv_broadcast() every time an event warrants the wakeup of sleeping threads. There are interfaces, such as cv_signal(), that are used to wake up only one thread blocked on the condition variable. cv_signal() will wake up the highest-priority thread; in the event of multiple threads at the same priority, the thread at the front of the queue is selected.
Summary
The sleep/wakeup mechanism in Solaris is built around synchronization objects known as condition variables and sleep queues. The current implementation is fast and efficient, scaling up to support hundreds to thousands of kernel threads on a large Solaris system. The general flow is depicted below.
Sleep                                           | Wakeup
------------------------------------------------|------------------------------------------------
Kernel thread executing in user mode            | Event completes or resource is made available
System call is invoked                          | Kernel calls cv_broadcast() or cv_signal()
Enter kernel segment that requires waiting      | cv_broadcast() calls sleepq_wakeall_chan()
  for a resource or event                       |
Kernel calls cv_wait()                          | Kernel thread is removed from sleep queue
cv_wait() calls sleepq_insert()                 | Dispatcher queue insertion code is called
Thread is now sleeping on a sleepq              | Kernel thread is inserted on a dispatch queue
                                                | Kernel thread is now ready to run
Next month, we'll get into another area of sleep/wakeup in Solaris -- turnstiles and the notion of priority inheritance.
Resources
About the author
Jim Mauro is currently an area technology manager for Sun Microsystems in the Northeast, focusing on server systems, clusters, and high availability. He has a total of 18 years of industry experience, working in educational services (he developed and delivered courses on Unix internals and administration) and software consulting. Reach Jim at jim.mauro@sunworld.com.