Inside Solaris by Jim Mauro

The Solaris process model, Part 3

Jim reveals the inner workings of the process, kernel thread, and lightweight process relationship, and begins a discussion of process scheduling classes, dispatch tables, and the kernel dispatcher.

October 1998

The multilayer thread model implemented in Solaris is a drastic departure from the traditional Unix process model. It's useful to have a solid understanding of the rationale behind this design, and how the key structures are related. Additionally, Solaris offers a redesign of the original Unix SVR4 scheduling model. Process/thread scheduling in Solaris involves the management of potentially thousands (or even tens of thousands) of executable threads in different scheduling classes. In this column, we'll further examine the process, kernel thread, and lightweight process (LWP) model, and venture forth into scheduling classes, priorities, and dispatch tables. (3,800 words)


In last month's column, we introduced several data structures that make up the Solaris process model: the process (proc) structure, which has the user area (uarea) embedded in it in Solaris 2.X; the lightweight process (LWP); and the kernel thread (kthread). A nonthreaded process running on a Solaris system will have exactly one LWP and one kthread associated with it, even though the code doesn't make an explicit thr_create(3T) or pthread_create(3T) call. The kernel creates and links the structures when the process is created. For multithreaded processes, the operating system creates additional LWPs and kthreads, either by explicit request in the thr_create(3T)/pthread_create(3T) call or through the use of the thr_setconcurrency(3T) call. Additionally, user threads can be bound to an LWP for their entire existence, using the appropriate flag in the create library calls. Once a user thread is bound to an LWP, that LWP will not be available for use by any other user threads.
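For the POSIX interface, the binding request is expressed through the thread's contention-scope attribute rather than a flag argument. The following is a minimal sketch, not code from the threads library: worker() and run_bound_thread() are hypothetical names chosen for illustration, and the only library calls used are the standard pthread attribute and create/join routines.

```c
#include <assert.h>
#include <pthread.h>
#include <stddef.h>

/* Hypothetical thread body: doubles the integer it is handed. */
void *worker(void *arg)
{
    int *value = arg;

    *value *= 2;
    return arg;
}

/*
 * Run worker() in a system-scope thread, which is how a POSIX program
 * asks for a thread that stays bound to its own LWP.
 * Returns 0 on success, or the error number from the failing call.
 */
int run_bound_thread(int *value)
{
    pthread_attr_t attr;
    pthread_t tid;
    int rc;

    assert(value != NULL);

    rc = pthread_attr_init(&attr);
    if (rc != 0)
        return rc;

    /* PTHREAD_SCOPE_SYSTEM requests a bound thread. */
    rc = pthread_attr_setscope(&attr, PTHREAD_SCOPE_SYSTEM);
    if (rc == 0)
        rc = pthread_create(&tid, &attr, worker, value);

    pthread_attr_destroy(&attr);
    if (rc != 0)
        return rc;

    return pthread_join(tid, NULL);
}
```

A caller would pass the address of an int and check the return value; a thread created with default attributes is left for the library to multiplex onto whatever LWPs are available.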

The intention of the multilayer threads model is to decouple the scheduling and management of user threads from the kernel. User-level threads have their own priority scheme and are scheduled by a scheduling thread that the library creates when a multithreaded program is compiled and executed. This way, multithreaded applications can have literally thousands of threads and not impose any significant load on the kernel. In fact, user threads aren't visible to the kernel; they become visible only when they're bound to an LWP, which has some state in the operating system. User-level threads are made executable when the threads library scheduler maps them onto an LWP, which is associated with a kernel thread for execution on a processor.

In Solaris, every LWP has a kernel thread, but not every kernel thread has an LWP. The lwp_create() code in the kernel will also invoke the kernel thread_create() routine (which creates kernel threads, not user threads), and set the pointers such that the newly created LWP and kernel thread are pointing to each other. This establishes a one-to-one relationship that will be maintained throughout the life of the LWP. As we said, not every kernel thread is associated with an LWP. Solaris is a multithreaded operating system, and as such creates kernel threads to perform operating system-related tasks.
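The pointer linkage described above can be sketched in a few lines. This is an illustrative user-space model only, not the kernel source: the structures are trimmed to the two cross-pointers, the field names t_lwp and lwp_thread are my assumption about how the links are named, and the sketch_ functions are hypothetical stand-ins for lwp_create() and the kernel thread_create().

```c
#include <assert.h>
#include <stdlib.h>

struct klwp;

/* Trimmed stand-in for the kernel thread structure. */
struct kthread {
    struct klwp *t_lwp;        /* NULL for a kernel-only thread */
};

/* Trimmed stand-in for the LWP structure. */
struct klwp {
    struct kthread *lwp_thread;
};

/* Stand-in for the kernel thread_create(): a thread with no LWP. */
struct kthread *sketch_thread_create(void)
{
    return calloc(1, sizeof(struct kthread));
}

/* Stand-in for lwp_create(): makes a thread too, then cross-links. */
struct klwp *sketch_lwp_create(void)
{
    struct klwp *lwp = calloc(1, sizeof(*lwp));

    if (lwp == NULL)
        return NULL;

    lwp->lwp_thread = sketch_thread_create();
    if (lwp->lwp_thread == NULL) {
        free(lwp);
        return NULL;
    }
    lwp->lwp_thread->t_lwp = lwp;

    /* The one-to-one relationship: each side points at the other. */
    assert(lwp->lwp_thread->t_lwp == lwp);
    return lwp;
}

void sketch_lwp_destroy(struct klwp *lwp)
{
    if (lwp != NULL) {
        free(lwp->lwp_thread);
        free(lwp);
    }
}
```

A thread obtained directly from sketch_thread_create() keeps a NULL t_lwp, which models the kernel threads that exist without any LWP.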

Generally, multithreaded applications will have many more user threads than LWPs. An interface, thr_setconcurrency(3T), can be used to allow the application to advise the kernel as to how many concurrent threads can be run, such that the operating system will maintain a sufficient number of LWPs for thread execution. We'll cover what goes on in user thread land (user thread priorities, scheduling of user threads, and LWP management) in the next month or two. Our discussion this month will focus on the layers below user threads: the LWP and kthreads.

Now and again, in various books and whitepapers that discuss the Solaris threads model, LWPs and kthreads appear to be the same thing. I've seen the term LWP used where kthread would have been more technically accurate, and vice versa. The fact is that LWPs and kernel threads are distinctly different structures, each defined by its own header file (sys/klwp.h and sys/thread.h). However, in the context of user processes (as opposed to the kernel), it's harmless to think of them as one large entity, because there is, without exception, a one-to-one relationship between LWPs and kthreads.

The LWP structure contains, among other things, an embedded structure that maintains various bits of resource utilization information:

struct lrusage {
        u_long  minflt;         /* minor page faults */
        u_long  majflt;         /* major page faults */
        u_long  nswap;          /* swaps */
        u_long  inblock;        /* input blocks */
        u_long  oublock;        /* output blocks */
        u_long  msgsnd;         /* messages sent */
        u_long  msgrcv;         /* messages received */
        u_long  nsignals;       /* signals received */
        u_long  nvcsw;          /* voluntary context switches */
        u_long  nivcsw;         /* involuntary context switches */
        u_long  sysc;           /* system calls */
        u_long  ioch;           /* chars read and written */
};

The information is updated during execution per LWP. Note that the process structure also contains an lrusage structure, which gets updated from each LWP as the LWP exits. In other words, the lrusage data at the process level is the sum of all the LWPs' resource usage data. According to the proc(4) man page, this usage data is available when microstate accounting has been enabled for the process, which is done in code using PCSET and the PR_MSACCT flag. The resource usage information can be examined using the /proc file system. The /proc directory hierarchy contains a subdirectory for each LWP in the process, where the lwpusage data is accessible using open(2) and read(2) on the /proc path (e.g., /proc//lwp//lwpusage). Unfortunately, I'm not aware of any bundled commands or utilities that allow a Solaris user to examine this data.
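The roll-up from LWP to process is simple addition, field by field. The sketch below only illustrates that idea: struct usage_sketch is a hypothetical, shortened stand-in for lrusage (a handful of its counters), and usage_fold() is an invented name, not a kernel routine.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical, shortened stand-in for the lrusage counters. */
struct usage_sketch {
    unsigned long minflt;      /* minor page faults */
    unsigned long majflt;      /* major page faults */
    unsigned long nvcsw;       /* voluntary context switches */
    unsigned long nivcsw;      /* involuntary context switches */
    unsigned long sysc;        /* system calls */
    unsigned long ioch;        /* chars read and written */
};

/* Fold an exiting LWP's counters into the process-level totals. */
void usage_fold(struct usage_sketch *proc_total,
                const struct usage_sketch *lwp)
{
    assert(proc_total != NULL && lwp != NULL);

    proc_total->minflt += lwp->minflt;
    proc_total->majflt += lwp->majflt;
    proc_total->nvcsw  += lwp->nvcsw;
    proc_total->nivcsw += lwp->nivcsw;
    proc_total->sysc   += lwp->sysc;
    proc_total->ioch   += lwp->ioch;
}
```

Calling usage_fold() once per exiting LWP leaves the process-level structure holding the sum for every LWP that has exited so far.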

Other interesting bits of information maintained on the LWP include:

The kernel thread is the entity that is actually put on a dispatch queue and scheduled. This fact is probably the most salient departure from traditional Unix implementations, where processes maintain a priority and are put on run queues and scheduled. Here, it's the kthread, not the process, that is assigned a scheduling class and priority. You can examine this on a running system using the -L and -c flags to the ps(1) command:

sunsys> ps -Lc
   449     1   IA  58 pts/6    0:04 ksh

Or, for all processes running on the system:

sunsys> ps -eLc
   179     1   TS  58 ?        0:00 syslogd
   179     2   TS  58 ?        0:00 syslogd
   179     3   TS  58 ?        0:00 syslogd
   179     4   TS  58 ?        0:00 syslogd
   179     5   TS  58 ?        0:00 syslogd
   179     6   TS  58 ?        0:00 syslogd
   350     1   IA  59 ?        0:01 Xsession
   230     1   TS  58 ?        0:00 sendmail
   255     1   TS  58 ?        0:00 vold
   255     2   TS  58 ?        0:00 vold
   255     3   TS  58 ?        0:00 vold
   255     4   TS  58 ?        0:00 vold
   255     5   TS  58 ?        0:00 vold
   227     1   TS  58 ?        0:00 powerd
   240     1   TS  58 ?        0:00 utmpd
   319     1   IA  54 ?       12:42 Xsun
   317     1   TS  59 ?        0:00 ttymon

(In the interest of space I've removed most of the lines from the above listing.) The columns in the ps(1) output above provide the process ID (PID), the LWP number within the process (LWP), the scheduling class the LWP is in (CLS), and the priority. It's interesting that the output indicates that the LWP has a priority and scheduling class, when technically it's the kthread associated with the LWP that actually maintains this information (note the above comment on how we often think of LWPs and kthreads as the same thing).

Other interesting bits in the kthread include:

Reference sys/thread.h for a complete definition of the kernel thread structure.


Scheduling with class
The framework on which the Solaris 2.X scheduler is built has its roots in the Unix SVR4 scheduler, which represents a complete rework of the traditional Unix scheduler. The SVR4 scheduler introduced the notion of the scheduling class: the policies and algorithms applied to a process or thread are determined by the scheduling class it belongs to, based on values established in a dispatch table. For each scheduling class, there exists a table of values and parameters the dispatcher code uses for selecting a thread to run on a processor, in addition to setting priorities based on wait time and how recently the thread had execution time on a processor. Solaris 2.X originally shipped with three scheduling classes by default: System (SYS), Timesharing (TS), and Realtime (RT). Somewhere around the Solaris 2.4 timeframe, the Interactive (IA) scheduling class was added to provide snappier interactive desktop performance (more on this in a bit). The TS class provides the traditional resource sharing behavior, such that all processes on the system get their fair share of execution time. The SYS class exists for kernel threads (e.g., the page daemon, clock thread, etc.).

The RT class provides realtime scheduling behavior, meaning threads in this class have a higher global priority than TS and SYS threads. Even with these enhancements, additional changes needed to be made in order to provide realtime application support. First, the kernel had to be made preemptable, such that a realtime thread in need of a processor could preempt the kernel. The second thing that needed to be addressed was memory locking. A realtime thread cannot afford to be exposed to the latency involved in resolving a page fault. This was addressed by providing a memory locking facility, so that the developer of a realtime application can use the memcntl(2) system call or mlock(3C) library function to lock a range of pages in physical memory.
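As a rough illustration of the memory locking step, here is a sketch that locks a page-aligned buffer with mlock(3C), touches it, and unlocks it. The function name lock_and_touch() is hypothetical, and the call can legitimately fail when the caller lacks the privilege or resource limit needed to lock memory, so the sketch reports failure rather than assuming success.

```c
#ifndef _POSIX_C_SOURCE
#define _POSIX_C_SOURCE 200112L
#endif

#include <assert.h>
#include <stdlib.h>
#include <string.h>
#include <sys/mman.h>
#include <unistd.h>

/*
 * Allocate npages of page-aligned memory, lock it in physical memory,
 * touch every byte, then unlock and free it.
 * Returns 0 if the range was locked, -1 if any step failed.
 */
int lock_and_touch(size_t npages)
{
    long pagesize = sysconf(_SC_PAGESIZE);
    void *buf = NULL;
    size_t len;

    assert(npages > 0);
    if (pagesize <= 0)
        return -1;

    len = npages * (size_t)pagesize;
    if (posix_memalign(&buf, (size_t)pagesize, len) != 0)
        return -1;

    /* Fails without sufficient privilege or locked-memory limit. */
    if (mlock(buf, len) != 0) {
        free(buf);
        return -1;
    }

    memset(buf, 0, len);       /* no page faults expected on this range */

    munlock(buf, len);
    free(buf);
    return 0;
}
```

A realtime application would normally keep the range locked for as long as its time-critical code runs; the immediate unlock here just keeps the example self-contained.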

The TS, IA, and RT scheduling classes are implemented as dynamically loadable kernel modules. The kernel binaries for the TS class are in /kernel/sched. The IA and RT classes reside in the /usr/kernel/sched directory. The SYS class is an integral part of the kernel, and thus not built as a dynamically loadable module.

The Solaris scheduler adds several features that enhance the overall usability and flexibility of the operating system. Among them are:

From an architectural perspective, the scheduling class implementation follows an object-oriented design similar to the VFS/vnode architecture described in the February 1998 Inside Solaris column (see Resources below). Just as the VFS/vnode implementation defined filesystem-dependent and filesystem-independent routines (e.g. file open(2) -> ufs_open() for a UFS file), the scheduler code is organized such that the kernel will make a generic call that maps to a class-specific function. In Solaris, a system class array is initialized at boot time, where each entry in the array is a sclass structure, one for each of the loaded scheduling classes. The sclass structure contains the following structure members:

The class init routine is called at boot time for all the preloaded scheduling classes. By default, the timeshare, system, and interactive classes are loaded. Only the realtime class is not. When a thread is set to the realtime class, via the priocntl(1) command or priocntl(2) system call, the realtime class module is dynamically loaded by the operating system. You can examine which scheduling classes are currently loaded on the running system via the dispadmin(1M) command:

sunsys> dispadmin -l

SYS     (System Class)
TS      (Time Sharing)
IA      (Interactive)

Or, for a bit more information, there's a class function in the /etc/crash(1M) utility:

# /etc/crash
dumpfile = /dev/mem, namelist = /dev/ksyms, outfile = stdout
> class

0       SYS     f00f7a50        f027b854
1       TS      f5b2db64        f5b302f8
2       IA      f5b2dc2c        f5b30358
> q

Let's put a process in the realtime scheduling class (process PID 833, my test program, sleeps for a minute then exits):

# priocntl -s -c RT -i pid 833

Now take another look at which scheduling classes are loaded on the system:

# dispadmin -l

SYS     (System Class)
TS      (Time Sharing)
IA      (Interactive)
RT      (Real Time)
# /etc/crash
dumpfile = /dev/mem, namelist = /dev/ksyms, outfile = stdout
> class

0       SYS     f00f7a50        f027b854
1       TS      f5b2db64        f5b302f8
2       IA      f5b2dc2c        f5b30358
3       RT      f67e5034        f67e5b60
> q

The class function in /etc/crash(1M) provides the class name, init function address, and the address of the class-operations table, also known as the class-functions structure or class-operations vector table. (Operations-vector table is a generic term that describes an array of pointers to functions, which is precisely what the class-functions [cl_funcs] structure is.)

The class-functions structure bundles together function pointers for three different classes of objects that make use of these routines: the scheduler class manager, processes, and threads. Operations used by the class manager include a cl_admin() function used to alter the values in the dispatch table for the class, in addition to functions for gathering information, setting and getting parameters, and retrieving the global priority value. Thread routines include cl_swapin() and cl_swapout(), cl_fork(), cl_preempt(), cl_sleep(), cl_wakeup(), and cl_donice(). Like the VFS/vnode implementation, the kernel defines a set of macros that are used to resolve a generic interface call to the appropriate class-specific routine. Just as the VOP_OPEN() macro will resolve to the ufs_open() routine to open a UFS file, the CL_FORK() macro will resolve to the correct class-specific fork support code, such as ts_fork() for the timesharing class. Linkage to the cl_funcs structure is also maintained in the kernel thread. There are actually two pointers in the kernel thread directly related to scheduling: the t_clfuncs pointer, which points to the cl_funcs structure, and the t_cldata pointer, which points to a class-specific data structure.
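To make the operations-vector idea concrete, here is a small user-space sketch of how a CL_FORK()-style macro can resolve through t_clfuncs to a class-specific routine. It is a simplified model, not the kernel definitions: the cl_funcs structure is cut down to a single entry, the function signatures and the t_cid field are assumptions made for illustration, and the class index constants simply reuse the slot numbers from the crash listing above.

```c
#include <assert.h>
#include <stddef.h>

struct kthread;

/* Cut-down class-operations vector: one of the thread routines. */
struct cl_funcs {
    int (*cl_fork)(struct kthread *parent, struct kthread *child);
};

/* Cut-down kernel thread: only the scheduling-related linkage. */
struct kthread {
    const struct cl_funcs *t_clfuncs;  /* class-operations vector */
    void *t_cldata;                    /* class-specific data */
    int t_cid;                         /* class index (assumed field) */
};

/* Generic call that resolves to the class-specific routine. */
#define CL_FORK(tp, cp) ((tp)->t_clfuncs->cl_fork((tp), (cp)))

enum { TS_CID = 1, RT_CID = 3 };       /* slots from the crash output */

/* Timesharing fork support: the child joins the parent's class. */
int ts_fork(struct kthread *parent, struct kthread *child)
{
    assert(parent != NULL && child != NULL);
    child->t_clfuncs = parent->t_clfuncs;
    child->t_cldata = NULL;            /* real code would allocate this */
    child->t_cid = TS_CID;
    return 0;
}

/* Realtime fork support, same shape, different class. */
int rt_fork(struct kthread *parent, struct kthread *child)
{
    assert(parent != NULL && child != NULL);
    child->t_clfuncs = parent->t_clfuncs;
    child->t_cldata = NULL;
    child->t_cid = RT_CID;
    return 0;
}

const struct cl_funcs ts_classfuncs = { ts_fork };
const struct cl_funcs rt_classfuncs = { rt_fork };
```

The caller never names ts_fork() or rt_fork() directly; which one runs depends only on the vector the parent thread points to, which is the same trick the vnode layer uses for filesystem operations.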

The scheduling class-related structures are depicted in Figure 2 below. Every kernel thread is linked to a class-specific data structure that maintains bits of information the dispatcher uses for managing the execution time and scheduling of the kthread. We'll cover the various fields involved when we discuss the dispatcher algorithms.

Dispatch tables
Each scheduling class loads with a corresponding dispatch table containing default values for priorities and priority readjustment. The dispatch tables can be examined and modified using the dispadmin(1M) command:

fawlty> dispadmin -c TS -g
# Time Sharing Dispatcher Configuration

# ts_quantum  ts_tqexp  ts_slpret  ts_maxwait ts_lwait  PRIORITY LEVEL
       200         0        50           0        50        #     0
       200         0        50           0        50        #     1
       200         0        50           0        50        #     2
       200         0        50           0        50        #     3
       200         0        50           0        50        #     4
       200         0        50           0        50        #     5

(In the interest of space, we haven't replicated the whole table.) The dispatch table headers are defined in /usr/include/sys/ts.h (along with the other TS-related data structures). The values in the table have the following meaning:

An interesting point is that the default values in the TS and IA dispatch tables inject a 0 value in ts_maxwait for every priority except the highest priority (59). So, just one increment in the ts_dispwait field will cause the thread priority to be readjusted to ts_lwait, except for priority-59 threads. The net effect of this is that all but the highest priority (59) timeshare threads will have their priority bumped to the 50 to 59 range (ts_lwait) every second. This has the desirable effect of not penalizing a thread that is CPU-bound for an extended period of time. Threads that are CPU hogs will, over time, end up in the low 0 to 9 priority range as they keep using up their time quantum, based on priority readjustments using ts_tqexp. Once each second, they'll be bumped back up to the 50 to 59 range, and will only migrate back down if they sustain their CPU-bound behavior.

Priority-59 threads are handled differently. These threads are already at the maximum (highest) priority for a timeshare thread, so there's no way to bump their priority via ts_maxwait and make it better. The ts_update() routine, which is the kernel code segment that increments the ts_dispwait value and readjusts thread priorities using ts_lwait, reorders the linked list of threads on the dispatch queues after adjusting the priority. (For each global priority in the kernel, a linked list of threads is maintained -- we'll get into this in detail next month.) The reordering after the priority adjustment puts threads at the front of their new dispatch queue for that priority. The threads on the priority 59 linked list would end up reordered, but still at the same priority. Experimentation has shown that this results in some percentage of priority-59 threads never getting processor time. Also, the threads that ts_update() puts on the front of the dispatch queue following adjustment get scheduled more frequently. Setting the ts_maxwait value to 32000 for priority 59 ensures the queue is never reordered by ts_update(), so every thread gets a fair shot at being scheduled.
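The interplay of ts_tqexp, ts_maxwait, and ts_lwait is easier to follow in a toy model. The sketch below is a simulation under stated assumptions, not the kernel's ts_update() code: the table is filled by a made-up formula that happens to reproduce the rows shown earlier for priorities 0 through 5 and uses 32000 for priority 59, the struct and function names are hypothetical, and I assume that using up a quantum also resets the wait counter.

```c
#include <assert.h>

#define TS_LEVELS 60               /* timeshare priorities 0..59 */

/* One dispatch table row, named after the dispadmin column headers. */
struct ts_row {
    int ts_quantum;
    int ts_tqexp;                  /* priority after using a full quantum */
    int ts_slpret;
    int ts_maxwait;                /* seconds of waiting tolerated */
    int ts_lwait;                  /* priority given once that is exceeded */
};

/* Hypothetical per-thread state for the simulation. */
struct ts_thread {
    int pri;
    int ts_dispwait;
};

/* Illustrative values only; these are not the shipped defaults. */
void ts_table_fill(struct ts_row tbl[TS_LEVELS])
{
    for (int i = 0; i < TS_LEVELS; i++) {
        int top = (i == TS_LEVELS - 1);
        int bump = top ? i : 50 + i / 10;

        tbl[i].ts_quantum = 200;
        tbl[i].ts_tqexp = (i >= 10) ? i - 10 : 0;
        tbl[i].ts_slpret = bump;
        tbl[i].ts_maxwait = top ? 32000 : 0;
        tbl[i].ts_lwait = bump;
    }
}

/* The thread burned its whole quantum: drop to ts_tqexp. */
void ts_quantum_expired(const struct ts_row tbl[TS_LEVELS],
                        struct ts_thread *t)
{
    assert(t->pri >= 0 && t->pri < TS_LEVELS);
    t->pri = tbl[t->pri].ts_tqexp;
    t->ts_dispwait = 0;            /* assumption: counter starts over */
}

/* Once-per-second pass: count the wait, bump priority if exceeded. */
void ts_second_elapsed(const struct ts_row tbl[TS_LEVELS],
                       struct ts_thread *t)
{
    assert(t->pri >= 0 && t->pri < TS_LEVELS);
    t->ts_dispwait++;
    if (t->ts_dispwait > tbl[t->pri].ts_maxwait) {
        t->pri = tbl[t->pri].ts_lwait;
        t->ts_dispwait = 0;
    }
}
```

With this table, a CPU-bound thread starting at priority 30 slides to 0 after three expired quanta, and the next one-second pass lifts it back to 50. A priority-59 thread is left alone, because its wait counter would have to pass 32000 before any readjustment took place.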

The dispatch tables can have user-supplied values applied, using the dispadmin(1M) command. This should be done with extreme caution, as I've seen dispatch table "tweaks" drastically affect the performance of a system and application behavior, not always for the better.

That's a wrap for this month. Next month, we'll cover the RT and IA dispatch tables, get more into priorities, and discuss how the dispatcher works. Following that, we'll examine interrupts, callout processing, and turnstiles.

Stay tuned.


About the author
Jim Mauro is currently an area technology manager for Sun Microsystems in the northeast, focusing on server systems, clusters, and high availability. He has a total of 18 years of industry experience, working in educational services (he developed and delivered courses on Unix internals and administration) and software consulting. Reach Jim at


[(c) Copyright  Web Publishing Inc., and IDG Communication company]
