The Solaris process model, Part 5

All the pieces start to come together this month, as Jim embarks on a tour through the kernel dispatcher
Jim began discussing the Solaris multithreaded process architecture in the August 1998 column. A great deal of ground has been covered since then, including the major kernel data structures that represent processes, lightweight processes, and kernel threads; the different scheduling classes; dispatch tables; and other kernel components that relate to the priority and scheduling of threads. This month he brings all the pieces together as he begins moving through the flow of the kernel dispatcher. (2,600 words)

Error note: The August 1998 column indicated that Solaris 7 supports up to 999,999 processes. That is incorrect. The maximum number of processes in Solaris 7 is 30,000, which applies to Solaris 2.5, 2.5.1, and 2.6 as well.
The interactive class was created to provide snappier interactive performance for GUI-based desktop Solaris systems. All processes that start under the windowing system, either OpenWindows or CDE, will be placed in the IA class. Here's a partial snapshot of my Solaris 2.6 desktop:
sunsys> ps -ec | grep IA
   322  IA  59 ?        1:14 Xsun
   325  IA  39 ?        0:00 fbconsol
   323  IA  36 ?        0:00 dtlogin
   355  IA  10 ?        0:00 speckeys
   341  IA  59 ?        0:01 Xsession
   413  IA  59 ??       0:01 dtterm
   351  IA  10 ?        0:00 fbconsol
   412  IA  59 ?        0:11 dtwm
   389  IA  39 pts/2    0:02 dtsessio
   387  IA  59 ?        0:00 dsdm
   415  IA  59 ??       0:06 dtterm
   416  IA  49 ?        0:02 dtfile
   426  IA  58 pts/4    0:00 ksh
   418  IA  59 ?        0:00 sdtvolch
   427  IA  20 pts/3    0:00 ksh
   459  IA  48 ?        0:00 dtfile
sunsys>
Processes are put in the IA class via an IA class-specific routine, ia_set_process_group(), which is called from a STREAMS I/O control call when a tty is taken over by a new process group. The call into ia_set_process_group() originates from the STREAMS-based terminal driver code in the kernel. If you were to boot your Solaris desktop system and not start a windowing system, you wouldn't have any IA processes running (just TS and the SYS class Solaris daemons). When the windowing system starts, it takes over control of the "terminal," which in the case of a desktop is your keyboard, mouse, and graphics card interface to your monitor. This will generate the set-process-group ioctl call, which ultimately calls the CL_SET_PROCESS_GROUP macro. This macro will resolve to the ia_set_process_group() IA class-specific function because the caller will be an interactive process (the windowing system software sees to that). All the processes associated with the windowing system are put in the IA class. And because processes and threads inherit their scheduling class from the parent, newly created processes (terminal windows, applications) are also put in the IA class. Here's a quick, simple example:
sunsys> ps -c
   PID  CLS PRI TTY      TIME CMD
 21621   IA  58 pts/4    0:01 ksh
sunsys> p2 &
[1]     22162
Parent PID: 22162
Child PID: 22163
sunsys> ps -c
   PID  CLS PRI TTY      TIME CMD
 22162   IA  46 pts/4    0:00 p2
 22163   IA  56 pts/4    0:00 c1
 21621   IA  58 pts/4    0:01 ksh
The Korn shell running in my terminal window is a priority 58 IA class process. I start a process called p2, which simply outputs its PID, then forks/execs a child process called c1, which also dumps its PID. Running the ps(1) command for the second time, we see that the processes we started are also in the IA class.
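For reference, here's a minimal sketch of what a program like p2 might look like. The original source wasn't shown, so the program and file names here are reconstructions for illustration:

/*
 * p2.c -- a hypothetical reconstruction of the p2 test program used
 * above (the original source was not shown). It prints its own PID,
 * then forks and execs a child program, c1, which prints its PID.
 * Because the scheduling class is inherited, the child runs in the
 * same class (IA here) as the parent.
 */
#include <sys/types.h>
#include <stdio.h>
#include <stdlib.h>
#include <unistd.h>

int
main(void)
{
        pid_t pid;

        (void) printf("Parent PID: %ld\n", (long)getpid());

        pid = fork();
        if (pid == (pid_t)-1) {
                perror("fork");
                exit(1);
        }
        if (pid == 0) {
                /* child: overlay ourselves with the c1 program */
                (void) execl("./c1", "c1", (char *)0);
                perror("execl");        /* reached only if exec fails */
                exit(1);
        }
        return (0);
}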
A process in the IA class has its priority boosted when it is the foreground process for the window that has the current input focus. Basically, the boost involves increasing the thread's priority by 10. This can happen either in the aforementioned ia_set_process_group() code or in ia_parmsset(), which is called via an internal (kernel) version of the priocntl(2) system call (issued from the windowing system when the input focus changes). The class-specific data structure, tsproc in the case of IA and TS class processes, contains a ts_boost member, which is used as part of the priority recalculation procedure and is set to 10 (IA_BOOST in /usr/include/sys/iapriocntl.h) for foreground IA processes in the ia_set_process_group() routine. Non-IA threads have their priority lowered in the same kernel code segment by setting ts_boost to -10 (tspp->ts_boost = -ia_boost).
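Conceptually, the boost assignment amounts to the following sketch. This is illustrative, not the actual ia_set_process_group() source; the foreground test is a hypothetical stand-in for the real process-group check:

/*
 * Illustrative sketch of the boost assignment described above.
 * ia_boost corresponds to IA_BOOST (10) from
 * /usr/include/sys/iapriocntl.h.
 */
if (thread_is_in_foreground_pgrp)       /* hypothetical condition */
        tspp->ts_boost = ia_boost;      /* boost foreground IA threads */
else
        tspp->ts_boost = -ia_boost;     /* lower everything else */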
This priority recalculation involves adding several values in the tsproc structure: the ts_cpupri, ts_upri, and ts_boost values are summed. A boundary check ensures that the resulting sum does not exceed the maximum allowed priority for the class (59 for TS and IA) and is not less than zero. The ts_cpupri member in tsproc is defined as the "kernel portion" of the thread priority. This field is set to one of three possible values from the TS/IA dispatch table during the execution life of a thread. Recall from the October 1998 column's discussion of the TS dispatch table that the table contains priority values for events such as a thread returning from a sleep (ts_slpret), using up its time quantum (ts_tqexp), and waiting a long time for a processor (ts_lwait). The ts_cpupri field is set to one of these three priority values, depending on which condition the kernel is handling at a particular point in time and, of course, on where we index into the dispatch table based on the current value of ts_cpupri. For example, during regular clock tick processing, if a thread has used up its time quantum, ts_cpupri is set as follows:
struct tsproc *tspp;

tspp->ts_cpupri = ts_dptbl[tspp->ts_cpupri].ts_tqexp;
In the above example, ts_dptbl is the TS/IA dispatch table, and tspp is a pointer to the tsproc structure for the kernel thread. The ts_upri field is for user priorities. That is, Solaris provides some level of support for users tweaking the priorities of threads via the priocntl(1) command or priocntl(2) system call. There's support for setting a user priority limit, which is implemented through the ts_uprilim member of the tsproc structure. The actual user priority (set using priocntl(1)) is checked against the limit and maintained in the ts_upri field of tsproc. These user priorities do not map directly to the kernel global priorities defined in the dispatch tables. Rather, they allow a user to coarsely adjust the priority of a thread within the boundaries of the global priorities. Depending on the variety of conditions used by the kernel to determine a thread's priority, a user priority change may or may not significantly alter the execution characteristics of a thread. Use the priocntl(1) command to display and change user priority limits and user priorities:
 1 sunsys> priocntl -l
 2 CONFIGURED CLASSES
 3 ==================
 4
 5 SYS (System Class)
 6
 7 TS (Time Sharing)
 8         Configured TS User Priority Range: -60 through 60
 9
10 IA (Interactive)
11         Configured IA User Priority Range: -60 through 60
12 sunsys> ps -c
13    PID  CLS PRI TTY      TIME CMD
14   2373   IA  58 pts/10   0:00 ksh
15 sunsys> priocntl -s -p -50 -i pid 2373
16 sunsys> ps -c
17    PID  CLS PRI TTY      TIME CMD
18   2373   IA   8 pts/10   0:00 ksh
19 sunsys> ps -c
20    PID  CLS PRI TTY      TIME CMD
21   2373   IA   8 pts/10   0:00 ksh
22 sunsys> priocntl -s -p 30 -i pid 2373
23 sunsys> ps -c
24    PID  CLS PRI TTY      TIME CMD
25   2373   IA  58 pts/10   0:00 ksh
26 sunsys>
The priocntl(1) command with the -l flag lists the configured classes and their user priority ranges. We examine the class and priority of the ksh running in my window (line 12) and see that it's an IA class process/thread at priority 58 (lines 13 to 14). We lower the priority with a user priority value of -50 (line 15) and take another look (lines 16 to 18). The new priority reflects a difference of 50, although that won't always be the case. We then set the user priority to 30 (line 22). When we examine the process priority again, we see it's back at 58 (lines 23 to 25), which, of course, isn't the sum of 8 and 30. As I said earlier, the user priority settings are, at best, "hints" to the operating system, so the actual thread priority may not precisely reflect user-defined values. Once a user priority value is set via priocntl(1), it's plugged into the ts_upri field and used for subsequent priority calculations. Once the priority has been calculated (the sum of ts_cpupri, ts_upri, and ts_boost) and bounded to the limits for the scheduling class (a priority greater than the maximum is set to the maximum; a priority less than zero is set to zero), the kernel ts_change_pri() function plugs the new priority into the kthread t_pri field, which is what the dispatcher uses to place the thread on the appropriate dispatch queue and schedule it. Note that the priocntl(1) command doesn't provide for specifying an individual kthread/LWP; it applies the requested change to all the threads in a process, or in multiple processes if a process group, UID, etc., is specified on the command line.
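Pulling those pieces together, the recalculation described above reduces to something like the following sketch. The tsproc field names are real; the function name and the ts_max_pri limit variable are assumptions for illustration, and this is not the actual kernel source:

/*
 * A simplified sketch of the TS/IA priority recalculation described
 * above -- not the actual kernel source.
 */
static pri_t
ts_recalc_pri(struct tsproc *tspp)
{
        pri_t pri;

        /* sum the kernel, user, and boost components */
        pri = tspp->ts_cpupri + tspp->ts_upri + tspp->ts_boost;

        /* clamp to the class's priority range (0 to 59 for TS/IA) */
        if (pri > ts_max_pri)           /* assumed limit variable */
                pri = ts_max_pri;
        else if (pri < 0)
                pri = 0;

        return (pri);   /* ts_change_pri() plugs this into t_pri */
}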
While we're on the subject of user-level priority adjustment, Solaris includes and supports the traditional nice(1) command for compatibility purposes. The semantics of nice comply with the traditional Unix priority scheme, in which lower priority values are better, and using a negative argument to nice(1) to improve a priority requires superuser privileges. At the kernel level, the TS code includes a ts_donice() function to support the nice(1) command. The user-supplied nice value is applied indirectly, through the ts_upri value, in a priority recalculation. Naturally, the sign of the argument to nice is flipped in the ts_donice() code to account for the priority scheme (higher number, better priority) used in Solaris.
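The sign flip might look something like this sketch (illustrative only, not the actual ts_donice() source; newnice and incr are hypothetical local variables):

/*
 * Illustrative sketch of the nice(1) handling described above.
 * A positive nice increment (a lower priority in traditional Unix
 * terms) becomes a negative user priority in Solaris, where a
 * higher number is a better priority.
 */
newnice = curnice + incr;               /* traditional nice value */
tspp->ts_upri = -(newnice);             /* sign flipped for Solaris */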
System class (SYS) priorities, in addition to being used for Solaris daemons, will be assigned to TS class threads under limited conditions. Specifically, if a thread is holding a critical resource, such as a reader/writer lock or an exclusive lock on a page structure (memory page), a SYS priority will be assigned to the thread. The kernel thread's t_kpri_req (kernel priority request) field is a flag used in the TS/IA class support code to signal that the thread's priority should be set to a kernel priority, which results in the thread priority (t_pri) being set from an indexed entry in the ts_kmdpri[] array of system priorities. This provides a very short-term boost that gets the thread running before all other TS/IA threads. Once the thread begins execution, its priority is readjusted back into the TS/IA priority range.
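Conceptually, the check reduces to something like this sketch (illustrative; the surrounding logic in the real TS/IA class code is more involved):

/*
 * Illustrative sketch of the kernel priority request described
 * above; not the actual TS/IA class source. When t_kpri_req is set,
 * the thread gets a SYS-range priority from the ts_kmdpri[] table
 * instead of a normal TS/IA priority.
 */
if (t->t_kpri_req)
        t->t_pri = ts_kmdpri[tspp->ts_cpupri];  /* short-term SYS boost */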
The dispatcher
As I mentioned last month, the dispatcher uses data stored in several key structures during the course of scheduling and managing thread priorities.
The dispatcher performs two basic functions: it finds the best priority runnable thread and places it on a processor for execution, and it manages the insertion and removal of threads on the dispatch queues. At a high level, the dispatcher looks for work on the queues at regular intervals. The kernel preempt queue is searched first for high-priority threads, followed by the per-CPU dispatch queues. Each CPU searches its own dispatch queues for work. This provides very good scalability, as multiple CPUs can be in the dispatcher code searching their queues concurrently, and it reduces potential CPU-to-CPU synchronization. If there are no waiting threads on a CPU's queue, it will search the queues of other processors for work to do, which may result in thread migrations. The dispatcher also attempts to keep queue depths relatively balanced across processors, so that there won't be a large difference in the lengths of the per-processor dispatch queues.
There are three kernel functions that place threads on a dispatch queue: setfrontdq(), setbackdq(), and setkpdq(). The first two place a kthread on the front or back of a dispatch queue, respectively, and the third puts unbound RT kthreads on the kernel preempt queue. (Bound threads are placed on the dispatch queue of the processor they've been bound to.) These functions are called from several places in the dispatcher and in the scheduling class-specific kernel code: for example, when a thread is made runnable or woken up, or when a scheduling decision isn't ratified. The kernel cpu_choose() function, which is called from the queue placement functions, determines which processor's dispatch queue a thread is placed on. Solaris attempts some level of affinity when placing threads on a queue, such that a thread will be put on the queue of the processor it last ran on, in order to get some potential performance benefit from a "warm" processor cache. If it has been too long since the thread last ran, the likelihood of any of the thread's data or instructions still being in the cache diminishes, and a processor running the lowest priority thread is chosen instead. Obviously, in the case of bound threads, the processor the thread is bound to is always chosen. You may recall from my description of the dispatch queues that each processor queue maintains a linked list of threads for each priority. Thus, once the processor has been determined, the thread's priority (t_pri) determines which queue the thread is placed on.
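The selection logic can be summarized in a sketch like the following. The kthread fields t_bound_cpu and t_cpu are real; thread_ran_recently() and cpu_with_lowest_pri() are hypothetical helpers standing in for the real checks, and this is not the actual cpu_choose() source:

/*
 * A simplified sketch of the processor selection described above.
 */
cpu_t *
choose_cpu(kthread_t *t)
{
        /* bound threads always go to their bound processor */
        if (t->t_bound_cpu != NULL)
                return (t->t_bound_cpu);

        /* favor the last processor while its cache may still be warm */
        if (thread_ran_recently(t))
                return (t->t_cpu);

        /* otherwise, pick the CPU running the lowest priority thread */
        return (cpu_with_lowest_pri());
}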
The process of selecting which thread runs next is implemented as a "select and ratify" algorithm. Given the level of concurrency in the current Solaris implementation, with its per-CPU dispatch queues and fine-grained kernel locking, a given scheduling decision is based on a quick, point-in-time snapshot of the dispatcher state. The dispatcher finds the highest priority loaded, runnable thread by checking a few structure members and scanning the dispatch queues. Each dispatch queue maintains a disp_maxrunpri field in the disp structure, which contains the priority of the highest priority thread on the queue. The dispatcher first checks the kernel preempt queue (RT and interrupt threads). Recall that there is one such queue systemwide, unless processor sets have been created, in which case there is a kernel preempt queue for each processor set. If the kernel preempt queues are empty, the per-CPU queues are examined and the highest priority thread is selected. Once the selection has been made, a ratify() routine is entered to verify that the selection is in fact still the best for the processor. The ratify code checks the maximum priority on the kernel preempt queue and the disp_maxrunpri value on the CPU's queue; if the selected thread has a priority greater than both, the choice is ratified and the thread is placed on a processor for execution. During the ratification process, the processor's cpu_runrun and cpu_kprunrun flags are cleared. These flags are stored in each processor's cpu structure and indicate when a thread preemption is required. Thread preemption can occur when a higher priority thread is placed on a dispatch queue, or when a thread has used up its time quantum.
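The ratification step reduces to a comparison along these lines. In this sketch, the disp_maxrunpri, cpu_runrun, cpu_kprunrun, and setbackdq() names are real; the local variables and run_thread() helper are assumptions, and this is not the actual dispatcher source:

/*
 * Illustrative sketch of the select-and-ratify check described
 * above. tpri is the priority of the selected thread tp, kpq is the
 * kernel preempt queue, dq is this CPU's dispatch queue, and cp is
 * this CPU's cpu structure.
 */
if (tpri > kpq->disp_maxrunpri && tpri > dq->disp_maxrunpri) {
        /* ratified: clear the preemption flags and run the thread */
        cp->cpu_runrun = 0;
        cp->cpu_kprunrun = 0;
        run_thread(tp);                 /* hypothetical helper */
} else {
        /* a better thread arrived since the snapshot; select again */
        setbackdq(tp);
}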
That's a wrap for this month. We still have a fair amount of ground
to cover with regard to the dispatcher and related topics, such as
turnstiles, the sleep/wakeup mechanism, and preemption and priority
inheritance.
About the author
Jim Mauro is currently an area technology manager for Sun
Microsystems in the Northeast, focusing on server systems,
clusters, and high availability. He has a total of 18 years of industry
experience, working in educational services (he developed
and delivered courses on Unix internals and administration) and
software consulting.
Reach Jim at jim.mauro@sunworld.com.
URL: http://www.sunworld.com/swol-01-1999/swol-01-insidesolaris.html