The Solaris process model, Part 5

All the pieces start to come together this month, as Jim embarks on a tour through the kernel dispatcher
Jim began discussing the Solaris multithreaded process architecture in the August 1998 column. A great deal of ground has been covered since then, including the major kernel data structures that represent processes, lightweight processes, and kernel threads; the different scheduling classes; dispatch tables; and other kernel components that relate to the priority and scheduling of threads. This month he brings all the pieces together as he begins moving through the flow of the kernel dispatcher. (2,600 words)

Error note: The August 1998 column indicated that Solaris 7 supports up to 999,999 processes. That is incorrect. The maximum number of processes in Solaris 7 is 30,000, which applies to Solaris 2.5, 2.5.1, and 2.6 as well.
The interactive class was created to provide snappier interactive performance for GUI-based desktop Solaris systems. All processes that start under the windowing system, either OpenWindows or CDE, will be placed in the IA class. Here's a partial snapshot of my Solaris 2.6 desktop:
sunsys> ps -ec | grep IA
   322  IA  59 ?        1:14 Xsun
   325  IA  39 ?        0:00 fbconsol
   323  IA  36 ?        0:00 dtlogin
   355  IA  10 ?        0:00 speckeys
   341  IA  59 ?        0:01 Xsession
   413  IA  59 ??       0:01 dtterm
   351  IA  10 ?        0:00 fbconsol
   412  IA  59 ?        0:11 dtwm
   389  IA  39 pts/2    0:02 dtsessio
   387  IA  59 ?        0:00 dsdm
   415  IA  59 ??       0:06 dtterm
   416  IA  49 ?        0:02 dtfile
   426  IA  58 pts/4    0:00 ksh
   418  IA  59 ?        0:00 sdtvolch
   427  IA  20 pts/3    0:00 ksh
   459  IA  48 ?        0:00 dtfile
sunsys>
Processes are put in the IA class via an IA class-specific routine, ia_set_process_group(), which is called from a STREAMS I/O control call when a tty is taken over by a new process group. The call into ia_set_process_group() originates from the STREAMS-based terminal driver code in the kernel. If you were to boot your Solaris desktop system and not start a windowing system, you wouldn't have any IA processes running (just TS and the SYS class Solaris daemons). When the windowing system starts, it takes over control of the "terminal," which in the case of a desktop is your keyboard, mouse, and graphics card interface to your monitor. This will generate the set-process-group ioctl call, which ultimately calls the CL_SET_PROCESS_GROUP macro. This macro will resolve to the ia_set_process_group() IA class-specific function because the caller will be an interactive process (the windowing system software sees to that). All the processes associated with the windowing system are put in the IA class. And because processes and threads inherit their scheduling class from the parent, newly created processes (terminal windows, applications) are also put in the IA class. Here's a quick, simple example:
sunsys> ps -c
   PID  CLS PRI TTY      TIME CMD
 21621   IA  58 pts/4    0:01 ksh
sunsys> p2 &
[1]     22162
Parent PID: 22162
Child PID: 22163
sunsys> ps -c
   PID  CLS PRI TTY      TIME CMD
 22162   IA  46 pts/4    0:00 p2
 22163   IA  56 pts/4    0:00 c1
 21621   IA  58 pts/4    0:01 ksh
The Korn shell running in my terminal window is a priority 58 IA class process. I start a process called p2, which simply outputs its PID, then forks/execs a child process called c1, which also dumps its PID. Running the ps(1) command for the second time, we see that the processes we started are also in the IA class.
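For reference, here's a minimal sketch of what a program like p2 might look like. The original source wasn't shown, so the program and file names here are reconstructions for illustration:

/*
 * p2.c -- a hypothetical reconstruction of the p2 test program used
 * above (the original source was not shown). It prints its own PID,
 * then forks and execs a child program, c1, which prints its PID.
 * Because the scheduling class is inherited, the child runs in the
 * same class (IA here) as the parent.
 */
#include <sys/types.h>
#include <stdio.h>
#include <stdlib.h>
#include <unistd.h>

int
main(void)
{
        pid_t pid;

        (void) printf("Parent PID: %ld\n", (long)getpid());

        pid = fork();
        if (pid == (pid_t)-1) {
                perror("fork");
                exit(1);
        }
        if (pid == 0) {
                /* child: overlay ourselves with the c1 program */
                (void) execl("./c1", "c1", (char *)0);
                perror("execl");        /* reached only if exec fails */
                exit(1);
        }
        return (0);
}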
A process in the IA class has its priority boosted when it is the foreground process for the window that has the current input focus. Basically, the boost involves increasing the thread's priority by 10. This can happen either in the aforementioned ia_set_process_group() code or in ia_parmsset(), which is called via an internal (kernel) version of the priocntl(2) system call (issued from the windowing system when the input focus changes). The class-specific data structure, tsproc in the case of IA and TS class processes, contains a ts_boost member, which is used as part of the priority recalculation procedure and is set to 10 (IA_BOOST in /usr/include/sys/iapriocntl.h) for foreground IA processes in the ia_set_process_group() routine. Non-IA threads have their priority lowered in the same kernel code segment by setting ts_boost to -10 (tspp->ts_boost = -ia_boost).
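Conceptually, the boost assignment amounts to the following sketch. This is illustrative, not the actual ia_set_process_group() source; the foreground test is a hypothetical stand-in for the real process-group check:

/*
 * Illustrative sketch of the boost assignment described above.
 * ia_boost corresponds to IA_BOOST (10) from
 * /usr/include/sys/iapriocntl.h.
 */
if (thread_is_in_foreground_pgrp)       /* hypothetical condition */
        tspp->ts_boost = ia_boost;      /* boost foreground IA threads */
else
        tspp->ts_boost = -ia_boost;     /* lower everything else */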
This priority recalculation involves adding several values in the tsproc structure: the ts_cpupri, ts_upri, and ts_boost values are summed. A boundary check ensures that the resulting sum does not exceed the maximum allowed priority for the class (59 for TS and IA) and is not less than zero. The ts_cpupri member in tsproc is defined as the "kernel portion" of the thread priority. This field is set to one of three possible values from the TS/IA dispatch table during the execution life of a thread. Recall from the October 1998 column's discussion of the TS dispatch table that the table contains priority values for events such as a thread returning from a sleep (ts_slpret), using up its time quantum (ts_tqexp), and waiting a long time for a processor (ts_lwait). The ts_cpupri field is set to one of these three priority values, depending on which condition the kernel is handling at a particular point in time and, of course, on where we index into the dispatch table based on the current value of ts_cpupri. For example, during regular clock tick processing, if a thread has used up its time quantum, ts_cpupri is set as follows:
struct tsproc *tspp;

tspp->ts_cpupri = ts_dptbl[tspp->ts_cpupri].ts_tqexp;
In the above example, ts_dptbl is the TS/IA dispatch table, and tspp is a pointer to the tsproc structure for the kernel thread. The ts_upri field is for user priorities. That is, Solaris provides some level of support for users tweaking the priorities of threads via the priocntl(1) command or priocntl(2) system call. There's support for setting a user priority limit, which is implemented through the ts_uprilim member of the tsproc structure. The actual user priority (set using priocntl(1)) is checked against the limit and maintained in the ts_upri field of tsproc. These user priorities do not map directly to the kernel global priorities defined in the dispatch tables. Rather, they allow a user to coarsely adjust the priority of a thread within the boundaries of the global priorities. Depending on the variety of conditions used by the kernel to determine a thread's priority, a user priority change may or may not significantly alter the execution characteristics of a thread. Use the priocntl(1) command to display and change user priority limits and user priorities:
 1 sunsys> priocntl -l
 2 CONFIGURED CLASSES
 3 ==================
 4
 5 SYS (System Class)
 6
 7 TS (Time Sharing)
 8         Configured TS User Priority Range: -60 through 60
 9
10 IA (Interactive)
11         Configured IA User Priority Range: -60 through 60
12 sunsys> ps -c
13    PID  CLS PRI TTY      TIME CMD
14   2373   IA  58 pts/10   0:00 ksh
15 sunsys> priocntl -s -p -50 -i pid 2373
16 sunsys> ps -c
17    PID  CLS PRI TTY      TIME CMD
18   2373   IA   8 pts/10   0:00 ksh
19 sunsys> ps -c
20    PID  CLS PRI TTY      TIME CMD
21   2373   IA   8 pts/10   0:00 ksh
22 sunsys> priocntl -s -p 30 -i pid 2373
23 sunsys> ps -c
24    PID  CLS PRI TTY      TIME CMD
25   2373   IA  58 pts/10   0:00 ksh
26 sunsys>
The priocntl(1) command with the -l flag lists the configured classes and their user priority ranges. We examine the class and priority of the ksh running in my window (line 12) and see that it's an IA class process/thread at priority 58 (lines 13 to 14). We lower the priority with a user priority value of -50 (line 15) and take another look (lines 16 to 18). The new priority reflects a difference of 50, although that won't always be the case. We then set the user priority to 30 (line 22). When we examine the process priority again, we see it's back at 58 (lines 23 to 25), which, of course, isn't the sum of 8 and 30. As I said earlier, the user priority settings are, at best, "hints" to the operating system, so the actual thread priority may not precisely reflect user-defined values. Once a user priority value is set via priocntl(1), it's plugged into the ts_upri field and used for subsequent priority calculations. Once the priority has been calculated (the sum of ts_cpupri, ts_upri, and ts_boost) and bounded to the limits for the scheduling class (a priority greater than the maximum is set to the maximum; a priority less than zero is set to zero), the kernel ts_change_pri() function plugs the new priority into the kthread t_pri field, which is what the dispatcher uses to place the thread on the appropriate dispatch queue and schedule it. Note that the priocntl(1) command doesn't provide for specifying an individual kthread/LWP; it applies the requested change to all the threads in a process, or in multiple processes if a process group, UID, etc., is specified on the command line.
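Pulling those pieces together, the recalculation described above reduces to something like the following sketch. The tsproc field names are real; the function name and the ts_max_pri limit variable are assumptions for illustration, and this is not the actual kernel source:

/*
 * A simplified sketch of the TS/IA priority recalculation described
 * above -- not the actual kernel source.
 */
static pri_t
ts_recalc_pri(struct tsproc *tspp)
{
        pri_t pri;

        /* sum the kernel, user, and boost components */
        pri = tspp->ts_cpupri + tspp->ts_upri + tspp->ts_boost;

        /* clamp to the class's priority range (0 to 59 for TS/IA) */
        if (pri > ts_max_pri)           /* assumed limit variable */
                pri = ts_max_pri;
        else if (pri < 0)
                pri = 0;

        return (pri);   /* ts_change_pri() plugs this into t_pri */
}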
While we're on the subject of user-level priority adjustment, Solaris includes and supports the traditional nice(1) command for compatibility purposes. The semantics of nice comply with the traditional Unix priority scheme, in which lower priority values are better, and using a negative argument to nice(1) to improve a priority requires superuser privileges. At the kernel level, the TS code includes a ts_donice() function to support the nice(1) command. The user-supplied nice value is applied indirectly, through the ts_upri value, in a priority recalculation. Naturally, the sign of the argument to nice is flipped in the ts_donice() code to account for the priority scheme (higher number, better priority) used in Solaris.
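The sign flip might look something like this sketch (illustrative only, not the actual ts_donice() source; newnice and incr are hypothetical local variables):

/*
 * Illustrative sketch of the nice(1) handling described above.
 * A positive nice increment (a lower priority in traditional Unix
 * terms) becomes a negative user priority in Solaris, where a
 * higher number is a better priority.
 */
newnice = curnice + incr;               /* traditional nice value */
tspp->ts_upri = -(newnice);             /* sign flipped for Solaris */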
System class (SYS) priorities, in addition to being used for Solaris daemons, will be assigned to TS class threads under limited conditions. Specifically, if a thread is holding a critical resource, such as a reader/writer lock or an exclusive lock on a page structure (memory page), a SYS priority will be assigned to the thread. The kernel thread's t_kpri_req (kernel priority request) field is a flag used in the TS/IA class support code to signal that the thread's priority should be set to a kernel priority, which results in the thread priority (t_pri) being set from an indexed entry in the ts_kmdpri[] array of system priorities. This provides a very short-term boost that gets the thread running before all other TS/IA threads. Once the thread begins execution, its priority is readjusted back into the TS/IA priority range.
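Conceptually, the check reduces to something like this sketch (illustrative; the surrounding logic in the real TS/IA class code is more involved):

/*
 * Illustrative sketch of the kernel priority request described
 * above; not the actual TS/IA class source. When t_kpri_req is set,
 * the thread gets a SYS-range priority from the ts_kmdpri[] table
 * instead of a normal TS/IA priority.
 */
if (t->t_kpri_req)
        t->t_pri = ts_kmdpri[tspp->ts_cpupri];  /* short-term SYS boost */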
The dispatcher
As I mentioned last month, the dispatcher uses data stored in several key structures during the course of scheduling and managing thread priorities.
The dispatcher performs two basic functions: it finds the best priority runnable thread and places it on a processor for execution, and it manages the insertion and removal of threads on the dispatch queues. At a high level, the dispatcher looks for work on the queues at regular intervals. The kernel preempt queue is searched first for high-priority threads, followed by the per-CPU dispatch queues. Each CPU searches its own dispatch queues for work. This provides very good scalability, as multiple CPUs can be in the dispatcher code searching their queues concurrently, and it reduces potential CPU-to-CPU synchronization. If there are no waiting threads on a CPU's queue, it will search the queues of other processors for work to do, which may result in thread migrations. The dispatcher also attempts to keep queue depths relatively balanced across processors, so that there won't be a large difference in the lengths of the per-processor dispatch queues.
There are three kernel functions that place threads on a dispatch queue: setfrontdq(), setbackdq(), and setkpdq(). The first two place a kthread on the front or back of a dispatch queue, respectively, and the third puts unbound RT kthreads on the kernel preempt queue. (Bound threads are placed on the dispatch queue of the processor they've been bound to.) These functions are called from several places in the dispatcher and in the scheduling class-specific kernel code: for example, when a thread is made runnable or woken up, or when a scheduling decision isn't ratified. The kernel cpu_choose() function, which is called from the queue placement functions, determines which processor's dispatch queue a thread is placed on. Solaris attempts some level of affinity when placing threads on a queue, such that a thread will be put on the queue of the processor it last ran on, in order to get some potential performance benefit from a "warm" processor cache. If it has been too long since the thread last ran, the likelihood of any of the thread's data or instructions still being in the cache diminishes, and a processor running the lowest priority thread is chosen instead. Obviously, in the case of bound threads, the processor the thread is bound to is always chosen. You may recall from my description of the dispatch queues that each processor queue maintains a linked list of threads for each priority. Thus, once the processor has been determined, the thread's priority (t_pri) determines which queue the thread is placed on.
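The selection logic can be summarized in a sketch like the following. The kthread fields t_bound_cpu and t_cpu are real; thread_ran_recently() and cpu_with_lowest_pri() are hypothetical helpers standing in for the real checks, and this is not the actual cpu_choose() source:

/*
 * A simplified sketch of the processor selection described above.
 */
cpu_t *
choose_cpu(kthread_t *t)
{
        /* bound threads always go to their bound processor */
        if (t->t_bound_cpu != NULL)
                return (t->t_bound_cpu);

        /* favor the last processor while its cache may still be warm */
        if (thread_ran_recently(t))
                return (t->t_cpu);

        /* otherwise, pick the CPU running the lowest priority thread */
        return (cpu_with_lowest_pri());
}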
The process of selecting which thread runs next is implemented as a "select and ratify" algorithm. Given the level of concurrency in the current Solaris implementation, with its per-CPU dispatch queues and fine-grained kernel locking, a given scheduling decision is based on a quick, point-in-time snapshot of the dispatcher state. The dispatcher finds the highest priority loaded, runnable thread by checking a few structure members and scanning the dispatch queues. Each dispatch queue maintains a disp_maxrunpri field in the disp structure, which contains the priority of the highest priority thread on the queue. The dispatcher first checks the kernel preempt queue (RT and interrupt threads). Recall that there is one such queue systemwide, unless processor sets have been created, in which case there is a kernel preempt queue for each processor set. If the kernel preempt queues are empty, the per-CPU queues are examined and the highest priority thread is selected. Once the selection has been made, a ratify() routine is entered to verify that the selection is in fact still the best for the processor. The ratify code checks the maximum priority on the kernel preempt queue and the disp_maxrunpri value on the CPU's queue; if the selected thread has a priority greater than both, the choice is ratified and the thread is placed on a processor for execution. During the ratification process, the processor's cpu_runrun and cpu_kprunrun flags are cleared. These flags are stored in each processor's cpu structure and indicate when a thread preemption is required. Thread preemption can occur when a higher priority thread is placed on a dispatch queue, or when a thread has used up its time quantum.
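The ratification step reduces to a comparison along these lines. In this sketch, the disp_maxrunpri, cpu_runrun, cpu_kprunrun, and setbackdq() names are real; the local variables and run_thread() helper are assumptions, and this is not the actual dispatcher source:

/*
 * Illustrative sketch of the select-and-ratify check described
 * above. tpri is the priority of the selected thread tp, kpq is the
 * kernel preempt queue, dq is this CPU's dispatch queue, and cp is
 * this CPU's cpu structure.
 */
if (tpri > kpq->disp_maxrunpri && tpri > dq->disp_maxrunpri) {
        /* ratified: clear the preemption flags and run the thread */
        cp->cpu_runrun = 0;
        cp->cpu_kprunrun = 0;
        run_thread(tp);                 /* hypothetical helper */
} else {
        /* a better thread arrived since the snapshot; select again */
        setbackdq(tp);
}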
That's a wrap for this month. We still have a fair amount of ground
to cover with regard to the dispatcher and related topics, such as
turnstiles, the sleep/wakeup mechanism, and preemption and priority
inheritance.
About the author
Jim Mauro is currently an area technology manager for Sun
Microsystems in the Northeast, focusing on server systems,
clusters, and high availability. He has a total of 18 years of industry
experience, working in educational services (he developed
and delivered courses on Unix internals and administration) and
software consulting.
Reach Jim at jim.mauro@sunworld.com.
URL: http://www.sunworld.com/swol-01-1999/swol-01-insidesolaris.html