Inside Solaris by Jim Mauro

The Solaris process model, Part 4: More on the kernel dispatcher

An examination of the realtime, system, and interactive scheduling classes, and a look at how dispatch queues are put together

SunWorld
December  1998

Abstract
Last month Jim switched gears to cover the announcement of Solaris 7 and highlight some of the technical features of that release. This month Inside Solaris is back on track, with coverage of the Solaris multithreaded architecture.

Jim began his coverage of the kernel dispatcher in October, with an overview of the dispatcher and the timeshare dispatch table. This month he takes a closer look at the dispatch tables and the major data structures that make up the queues of runnable kthreads. (2,200 words)




The October column introduced the realtime scheduling class and several additional features that need to be added to the kernel (and to the scheduling class itself) in order to provide some degree of bounded dispatch latency for realtime applications. As we left it, the issues of kernel preemption and page-fault-imposed latency had yet to be addressed. We'll get into the specifics of kernel preemption when we go through the dispatcher algorithms. The page fault solution comes in the form of application programming interfaces (APIs), mlock(3C) and memcntl(2), which allow the developer of realtime applications to lock pages in memory.
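
Here's a minimal sketch of the memory-locking side, assuming a simple realtime process that wants its entire address space resident. It uses mlockall(3C), which locks all current (and, optionally, future) mappings; mlock(3C) or memcntl(2) can lock a specific address range instead. Locking memory requires superuser privileges, and the code below is an illustration rather than a recipe.

/*
 * Minimal sketch: lock a realtime process's pages in memory to avoid
 * page-fault-induced latency. Error handling is kept to a bare minimum,
 * and superuser privileges are required to lock memory.
 */
#include <sys/mman.h>
#include <stdio.h>
#include <stdlib.h>

int
main(void)
{
        /*
         * Lock everything currently mapped, plus any mappings added
         * later (heap growth, stack growth, shared libraries).
         */
        if (mlockall(MCL_CURRENT | MCL_FUTURE) == -1) {
                perror("mlockall");
                exit(1);
        }

        /* ... time-critical work runs here without paging stalls ... */

        (void) munlockall();
        return (0);
}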

The dispatch table for the realtime scheduling class has considerably fewer columns than the TS or IA classes. This is because RT threads run at a fixed priority -- there is no priority readjustment done during the execution and lifetime of realtime threads. They remain fixed at their original priority unless the priority is explicitly changed via the priocntl(1) command or priocntl(2) system call. The priocntl(1) command only allows for altering priorities and scheduling classes at the process level. This means any change to the priority or scheduling class made via priocntl(1) affects all the threads in the process, and there's no way to specify a particular thread within the process from the command line. In practice, it's possible for different threads within the same process to execute in different scheduling classes; you can have TS and RT threads execute within the same process. Using the priocntl(2) system call, you can specify an LWP ID, and thus programmatically alter the scheduling class of a thread within a process.
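
To make the programmatic path concrete, here's a minimal sketch of using priocntl(2) with the P_LWPID idtype to move a single LWP in the calling process into the RT class. The function name and its arguments are illustrative assumptions, not anything from the kernel sources; the priocntl(2) man page describes the interface in full. Note that placing a thread in the RT class requires superuser privileges.

/*
 * Sketch: place one LWP of the calling process into the RT class at a
 * given priority, using the default time quantum. The function name and
 * parameters are made up for illustration; see priocntl(2) for details.
 */
#include <sys/types.h>
#include <sys/priocntl.h>
#include <sys/rtpriocntl.h>
#include <string.h>
#include <stdio.h>

int
set_lwp_realtime(id_t target_lwp, pri_t rt_pri)
{
        pcinfo_t        pcinfo;
        pcparms_t       pcparms;
        rtparms_t       *rtp;

        /* Look up the class ID assigned to the RT class at load time. */
        (void) strcpy(pcinfo.pc_clname, "RT");
        if (priocntl(0, 0, PC_GETCID, (caddr_t)&pcinfo) == -1) {
                perror("PC_GETCID");
                return (-1);
        }

        /* Fill in the RT class-specific parameters. */
        pcparms.pc_cid = pcinfo.pc_cid;
        rtp = (rtparms_t *)pcparms.pc_clparms;
        rtp->rt_pri = rt_pri;           /* RT priority, 0 through 59 */
        rtp->rt_tqnsecs = RT_TQDEF;     /* accept the default quantum */

        /* Apply the change to a single LWP within the calling process. */
        if (priocntl(P_LWPID, target_lwp, PC_SETPARMS,
            (caddr_t)&pcparms) == -1) {
                perror("PC_SETPARMS");
                return (-1);
        }
        return (0);
}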

Here's a partial listing of the default RT table:

sunsys> dispadmin -c RT -g
# Real Time Dispatcher Configuration
RES=1000

# TIME QUANTUM                    PRIORITY
# (rt_quantum)                      LEVEL
      1000                    #        0
      1000                    #        1
       800                    #       10
       800                    #       11
       600                    #       20
       600                    #       21
       400                    #       30
       400                    #       31
       200                    #       40
       200                    #       41
       100                    #       50
       100                    #       51
       100                    #       59
sunsys> 

Each entry in the output corresponds to an entry in the kernel's RT dispatch table, as defined by an rtdpent_t structure.

/*
 * Realtime dispatcher parameter table entry
 */
typedef struct  rtdpent {
        pri_t   rt_globpri;     /* global (class independent) priority */
        long    rt_quantum;     /* default quantum associated with this level */
} rtdpent_t;

A realtime thread at a given priority is assigned a time quantum, similar to the ts_quantum field in the TS dispatch table for timeshare threads. The unit of time described by the value in the rt_quantum field is determined by the RES value. We covered the meaning of RES, and how the time unit is determined, in October. With the default RES value of 1000, the rt_quantum units are in milliseconds. For example, RT threads at priorities 0 through 9 get a time quantum of one second before the dispatcher moves the thread off the processor and puts it at the end of the dispatch queue.
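
The arithmetic is straightforward: the quantum in seconds equals rt_quantum divided by RES. A trivial snippet makes the conversion explicit, using the default values from the table above.

/*
 * Illustrative arithmetic only: convert an rt_quantum table value to
 * wall-clock time. RES gives the number of quantum units per second.
 */
#include <stdio.h>

int
main(void)
{
        long res = 1000;        /* RES value from the dispadmin output */
        long rt_quantum = 1000; /* table entry for RT priorities 0-9   */

        /* Prints: quantum = 1.000 seconds */
        (void) printf("quantum = %.3f seconds\n",
            (double)rt_quantum / (double)res);
        return (0);
}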

The Solaris kernel implements a notion of global priorities, such that every priority level for each of the loaded scheduling classes has a unique ordinal value assigned to it. The global priority value is determined at boot time and recalculated if another scheduling class is loaded. Both the realtime and timeshare scheduling classes have 60 priority levels (0 to 59), and the system class has 40. (The IA class doesn't have a separate dispatch table of its own -- it uses the TS table.) As the classes are loaded into the system, the kernel calculates the global priority based on each class's position relative to the others. By default, the TS class gets global priorities 0 to 59. The next higher level class is SYS, which gets global priorities 60 to 99, followed by 10 levels for interrupt threads (100 to 109). If the RT class is loaded, priorities are recalculated, such that RT global priorities are set to levels 100 to 159 and interrupt levels are pushed up to 160 to 169.
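
To summarize the layout just described, here's how the default global priority ranges work out, with and without the RT class loaded:

# CLASS            WITHOUT RT LOADED     WITH RT LOADED
  TS / IA               0 -  59              0 -  59
  SYS                  60 -  99             60 -  99
  RT                       -               100 - 159
  Interrupt           100 - 109            160 - 169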


As I mentioned, the IA class uses the TS dispatch table. There is no dispatch table for the SYS class. Threads running in the SYS class don't have a time quantum; there is no time-slicing, and no reprioritization is applied. The priorities for the SYS class are defined in a simple array:

/*
 * array of global priorities used by ts procs sleeping or
 * running in kernel mode after sleep. Must have at least
 * 40 values.
 */

pri_t config_ts_kmdpris[] = {
       60,61,62,63,64,65,66,67,68,69,
       70,71,72,73,74,75,76,77,78,79,
       80,81,82,83,84,85,86,87,88,89,
       90,91,92,93,94,95,96,97,98,99,
};

With all the scheduling classes loaded and global priorities calculated, the system sets the values for the total number of global priorities and maximum system priority in the var structure, which stores various bits of information related to the kernel configuration. The v function in /etc/crash dumps the kernel var structure:

# /etc/crash
dumpfile = /dev/mem, namelist = /dev/ksyms, outfile = stdout
> v
v_buf: 100
v_call:   0
v_proc: 4058
v_nglobpris: 170
v_maxsyspri:  99
v_clist:   0
v_maxup: 4053
v_hbuf: 256
v_hmask: 255
v_pbuf:   0
v_sptmap:   0
v_maxpmem: 0
v_autoup: 30
v_bufhwm: 5196
>

The v_nglobpris value is the number of global priorities, and v_maxsyspri is the highest (best) system priority. Because this system has the RT class loaded, there are 170 total global priorities, with the highest SYS class priority being the default value of 99. If the RT class wasn't loaded, v_nglobpris would be 110 and v_maxsyspri would still be 99.
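
The arithmetic behind those numbers follows directly from the per-class priority counts given earlier:

  60 (TS/IA) + 40 (SYS) + 60 (RT) + 10 (interrupt) = 170 global priorities
  60 (TS/IA) + 40 (SYS)           + 10 (interrupt) = 110 without the RT class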

The default values have been tested extensively under a variety of workloads and, in general, provide well-balanced performance and a reasonable distribution of processor execution time. Nonetheless, it may at times be desirable to alter the default dispatch table values for better application performance. The simplest way to alter dispatch table values is to first get the current table and save it to a file. Then, edit the table as required and reload it using dispadmin(1M). The dispadmin -c TS -g > /tmp/ts.tbl command line will dump the TS table and save it in a file under /tmp, called ts.tbl.

After editing /tmp/ts.tbl, load it into the kernel with dispadmin -c TS -s /tmp/ts.tbl. Any alteration to the default dispatch table values should be done with caution and should be tested extensively before being put into production to ensure desired application behavior is achieved. Note also that the kernel only performs very rudimentary checking of dispatch table values when a new table is loaded (values less than 0, values larger than the maximum number of priority entries, etc.). Be extra careful about checking the values before loading the table.



Dispatcher data and dispatch queues
Every kernel thread maintains a pointer to a scheduling class-specific data structure. (This was shown in Figure 1 in the September column and in Figure 2 in the October column.) The TS, IA, and RT classes each require that some class-specific data be maintained for use by the dispatcher to track per-thread data, such as execution time, wait time, nice value, status flags, and thread priority. You can examine the header file definition of these structures for each scheduling class (except the SYS class -- there is no class-specific structure for the SYS class) in /usr/include/sys/ts.h, /usr/include/sys/ia.h, and /usr/include/sys/rt.h. The structure name is xxproc, where xx is either ts, ia, or rt. As we move through the dispatcher kernel code, we'll see how various members of the class-specific structures are used.
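
As a concrete example of what these class-specific structures hold, here's an abridged sketch of the RT version. The member list and types are paraphrased from memory, so treat /usr/include/sys/rt.h as the authoritative reference.

/*
 * Abridged sketch of the RT class's per-thread structure (see
 * /usr/include/sys/rt.h for the real definition; field types and the
 * full member list differ in detail).
 */
typedef struct rtproc {
        long            rt_pquantum;    /* time quantum given to this thread */
        long            rt_timeleft;    /* time remaining in the quantum */
        pri_t           rt_pri;         /* priority within the RT class */
        ushort_t        rt_flags;       /* class-internal status flags */
        kthread_t       *rt_tp;         /* pointer back to the kthread */
        struct rtproc   *rt_next;       /* next rtproc on the class list */
        struct rtproc   *rt_prev;       /* previous rtproc on the class list */
} rtproc_t;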

The kernel dispatch queue model is essentially a series of arrays of linked lists of threads, where each linked list comprises threads at the same priority level. Older versions of Solaris 2 implemented a simple dispatch queue structure, maintaining a single, systemwide queue. Access to the queue was synchronized through a single kernel lock; inserting a thread onto the run queue or removing a thread for execution required acquisition of the queue lock. This coarse-grained locking resulted in poor scalability on larger multiprocessor systems. In Solaris 2.3, a new design was implemented where every CPU on the system had its own dispatch queue with more fine-grained locking, providing significantly improved scalability. In Solaris 2.5, an algorithmic change was made to the dispatcher code that simplified the process of selecting threads for execution.

The dispatcher queue model in Solaris involves not only per-processor dispatch queues, but also a global queue of threads that run at a priority level high enough to cause kernel preemption. Threads in the RT class and interrupt threads meet these criteria on typical Solaris systems and are placed on the kernel preemption queue. In Solaris releases up to 2.6, the kernel preemption queue is referenced through a disp_kp_queue pointer, accessed by all processors on the system. Solaris 2.6 introduced processor sets, which allow the processors on a Solaris system to be partitioned into multiple sets, each made up of one or more processors. A system administrator can bind processes or applications to a given processor set such that exclusive use of the processors in that set is guaranteed by the kernel dispatcher. Because the kernel effectively isolates dispatch queues within a processor set from queues in other sets, a kernel preempt queue is created for each processor set configured. There is one default partition on Solaris 2.6 and Solaris 7 systems; it's created at initialization time and is made up of all the processors on the system. The kernel preemption queue is referenced through this default partition. See the psrset(1M) man page for the creation and management of processor sets. Figure 2 below illustrates the per-processor queue model.


The kernel preemption queue looks similar to the per-processor queues shown in Figure 2. An array of dispq structures is the base for linked lists of threads at each priority level. (kthreads at the same priority are linked on the same dispq.) If processor sets haven't been explicitly created with psrset(1M), the CPU structures for all the processors in a multiprocessor system will link to the same cpu_part (CPU partition) structure, and will reference the kernel preemption queue through the cp_kp_queue pointer in the CPU partition structure.

When Solaris boots, a linked list of CPU structures is created, with one structure for every processor on the system. Additionally, the dispatcher initialization routine is executed to build and link the necessary structures together for the dispatch queues shown in Figure 2. As the diagram depicts, the linkage to the per-processor queues is rooted in a disp data structure, which is embedded in each CPU structure. Along with scheduling information used by the dispatcher, a link to an array of dispq (dispatcher queue) structures connects the CPU to its array of linked kthreads. The array holds an entry for every priority level that isn't subject to kernel preemption (threads at kernel-preemption priorities go on the kernel preempt queue described above). Kernel threads at the same priority are linked together through each thread's t_link pointer. A subset of the CPU structure (from /usr/include/sys/cpuvar.h) is included below.

/*
 * Scheduling variables.
 */
disp_t          cpu_disp;         /* dispatch queue data */
char            cpu_runrun;       /* scheduling flag - set to preempt */
char            cpu_kprunrun;     /* force kernel preemption */
pri_t           cpu_chosen_level; /* priority level at which cpu */
                                  /* was chosen for scheduling */
kthread_id_t    cpu_dispthread;   /* thread selected for dispatch */
disp_lock_t     cpu_thread_lock;  /* dispatcher lock on current thread */
clock_t         cpu_last_swtch;   /* last time switched to new thread */

struct cpupart  *cpu_part;        /* partition with this CPU */
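
The cpu_disp member shown above is the processor's embedded disp structure; among other things, it references the array of per-priority queue heads depicted in Figure 2. An abridged sketch of one queue entry follows. The field names are recalled from memory, so consult /usr/include/sys/disp.h for the authoritative definition.

/*
 * Abridged sketch of a dispatch queue entry. Each CPU's disp structure
 * references an array of these, one per priority level; see
 * /usr/include/sys/disp.h for the authoritative definition.
 */
typedef struct dispq {
        kthread_t       *dq_first;      /* first runnable kthread at this priority */
        kthread_t       *dq_last;       /* last runnable kthread at this priority */
        int             dq_sruncnt;     /* number of runnable kthreads on this queue */
} dispq_t;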

As we've shown, the dispatcher in Solaris maintains scheduling-related data in several places: in the per-class dispatch tables, in the class-specific structure attached to each kernel thread, in the per-processor disp structure and its dispatch queues, and in the kernel preempt queue referenced through the CPU partition structure.

There are two primary user-level commands that apply to the dispatcher: the dispadmin(1M) command for dispatch table administration and priocntl(1) for priority and scheduling class control for processes.

Next month, we'll get into the flow of the dispatcher code and the execution phases of some sample threads. Several features implemented in the dispatcher, kernel preemption among them, make for an interesting study in operating system scheduling.

That's it for now. We'll have fun next month seeing how all the pieces discussed thus far fit together and how the Solaris kernel manages the complex task of scheduling large numbers of threads on large, multiprocessor systems.


About the author
Jim Mauro is currently an area technology manager for Sun Microsystems in the Northeast, focusing on server systems, clusters, and high availability. He has a total of 18 years of industry experience, working in educational services (he developed and delivered courses on Unix internals and administration) and software consulting. Reach Jim at jim.mauro@sunworld.com.


URL: http://www.sunworld.com/swol-12-1998/swol-12-insidesolaris.html