Inside Solaris by Jim Mauro

The Solaris process model, Part 6: Inside the dispatcher

A deeper look into the Solaris dispatcher

SunWorld
February  1999

Abstract
The implementation of process/kthread priorities in Solaris is quite a complex topic. Up to this point, Jim has discussed the scheduling classes, associated dispatch tables, and global prioritization scheme. Last month, he started looking at some of the core dispatcher functions, such as queue management and the "select and ratify" algorithm used to determine which kernel thread executes next.

This month, Jim completes his coverage of dispatcher internals. Where appropriate, he revisits information provided in previous columns on the dispatcher in order to clarify his explanations. (3,200 words)



Let's begin our discussion with a look at the dispatcher functions.

The many kernel functions that make up the dispatcher can be broken into two categories: those that are scheduling class specific, and those that are class independent. There are more than 20 functions specific to the TS and RT classes, five routines specific to the IA class (most IA support is done via the TS functions), and eight or so SYS class functions. These class-specific functions provide support for the user interface to priority setting [priocntl(2)], dispatch table and scheduling class administration [dispadmin(1M)], and kernel-level thread support for creation, execution, and exit.

The system uses macros defined in the class.h header file to resolve calls into class-specific functions. The macro will resolve to the appropriate function by indexing through either the system class array, or the class functions table link from the kernel thread structure. For example, below is the macro for a scheduling class "enterclass" routine.

#define CL_ENTERCLASS(t, cid, clparmsp, credp, bufp) \
        (sclass[cid].cl_funcs->thread.cl_enterclass) (t, cid, \
            (void *)clparmsp, credp, bufp)

The class-specific enterclass routine is invoked via indexing through the system class array, using the class ID as the array index (sclass[cid]). Each entry in the system class array is a sclass structure, which stores basic information on a scheduling class. There is one entry for each scheduling class (from /usr/include/sys/class.h):

typedef struct sclass {
        char    *cl_name;       		/* class name */
        void    (*cl_init)(id_t, ...);	        /* pointer to class init routine */
        struct classfuncs *cl_funcs;    	/* pointer to classfuncs structure */
        krwlock_t *cl_lock;     		/* read/write lock for class structure */
        int     cl_count;       		/* number of threads trying to load class */
        size_t  cl_size;        		/* size of per-thread data */
} sclass_t;

As shown in the structure declaration above, the macro will resolve to the class "enterclass" function through the cl_funcs functions table pointer. Here's a second example:

#define CL_FORK(tp, ctp, bufp) \
        (*(tp)->t_clfuncs->cl_fork) (tp, ctp, bufp)

This class "fork" code macro is an example of a macro resolving through the class functions pointer located in the kernel thread (where "tp" is a thread pointer, passed as the first argument in the macro, and "t_clfuncs" is the thread structure member that points to the class functions table).

The generic (class-independent) dispatcher routines provide the low-level dispatch queue manipulation, priority management, and thread selection functions. They are invoked directly by their function names (for example, setfrontdq(), setbackdq(), disp(), or swtch()), like most of the other kernel routines.

As I mentioned in a previous column, a thread will inherit its scheduling class and priority upon creation. It's in the tail end of the fork(2) kernel code flow that the class data is established for the thread. A fork(2) system call is supported in the kernel by several low-level functions that handle creating and initializing the different pieces that make up a process in Solaris: cfork() is the common fork code for all flavors of fork (fork1(2), vfork(2), and fork(2)); forklwp() and lwp_create() manage the LWP piece; and thread_create() creates the kernel thread. The forklwp() segment, just prior to returning back to cfork(), is where space for a class structure is allocated (for example, a tsproc structure for TS threads), and the class-specific fork routine is called via the CL_FORK macro, where the class structure is initialized.

Back in cfork(), some additional finishing up is done for the newly created process/lwp/thread (for example, the setting of the process group), and the class fork_return code is invoked, where CL_FORKRET is the macro, and ts_forkret() is the routine name in the case of the TS class. It's in the fork return code that the thread is placed on a dispatch queue, based on the CPU and the priority -- the CPU was set based on whichever processor was running the thread_create() code, and the priority, like the class, is inherited.

The final thing to take care of before asking the dispatcher to select and execute a thread is to make sure that the new thread (or child -- it's basically the child process in traditional Unix terms, but technically the kernel thread in Solaris) will run before the parent does. The new thread ends up on a dispatch queue in front of the parent (by design), so the child executes first, typically calling exec(2) to overlay itself with a new executable and get its own address space mappings.
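
As an aside, this child-runs-first design pays off in the common user-level fork-then-exec sequence: because the child typically execs right away, it overlays itself before the parent resumes and starts touching (and thus copying) shared pages. Here's a minimal user-level sketch of that sequence:

#include <stdio.h>
#include <stdlib.h>
#include <unistd.h>
#include <sys/types.h>
#include <sys/wait.h>

int
main(void)
{
        pid_t pid = fork();

        if (pid == -1) {
                perror("fork");
                exit(1);
        }
        if (pid == 0) {
                /* Child: runs first by design, and immediately overlays
                 * itself with a new executable */
                execl("/bin/date", "date", (char *)NULL);
                perror("execl");        /* reached only if exec fails */
                _exit(1);
        }
        /* Parent: waits for the child to finish */
        waitpid(pid, NULL, 0);
        return (0);
}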



Queue insertion
It's interesting to note that in the very first releases of Solaris 2 (SunOS 5), there was one dispatch queue for all the processors, and processor selection was done after a thread was pulled off the queue for execution. This changed beginning with Solaris 2.3 (SunOS 5.3), where CPU selection is done as part of queue selection. It's still possible for a thread to migrate off the queue it's sitting on: if a processor executing the dispatcher code finds no runnable threads on its own dispatch queue, it will go look for work on the queues of other processors.

The creation of a new process/LWP/kthread makes for a good framework for discussing how a thread's class and priority are initially established, and how the dispatch queue is selected. However, whether we're dealing with a newly created thread or just picking a point in time during the life of a thread on a running system, there are basically four possible scenarios that drive queue selection, based on whether the thread is bound to a processor and whether its priority is above or below kpreemptpri; these are illustrated in the pseudocode flow a bit further on.

If you remember from some of our previous discussions, threads in the RT class and interrupt threads will cause a kernel preemption, where the currently running thread is forced to surrender the processor it is running on prior to its time quantum expiring or the thread blocking. The kernel counts the occurrences of such preemptions in the inv_swtch member of the cpu_sysinfo structure (involuntary switch). This statistic is reported via mpstat(1M), in the icsw column.

Note that this will also include user preemptions, which occur when a thread placed on a dispatch queue has a higher priority than the currently running thread. User preemptions are less urgent: the preemption waits until the running thread returns from kernel mode to user mode. A kernel preemption will preempt a thread while it's still in kernel mode.

Preemptions are flagged using fields in the per-processor cpu structure: cpu_runrun and cpu_kprunrun. The first flags a user preemption; it gets set if a thread inserted into a dispatch queue has a higher priority than the running thread, but a lower priority than kpreemptpri. The second, cpu_kprunrun, flags a kernel preemption. These flags are checked frequently in the kernel, such as on returns from system calls and traps.
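
The shape of that check can be sketched in C. cpu_runrun and cpu_kprunrun are the real cpu structure field names, but the structure, helper, and logic below are an invented approximation of what happens on the way back from a trap or system call -- not the actual kernel source.

#include <stdio.h>

/* cpu_runrun and cpu_kprunrun are the real cpu structure field names;
 * this structure and the logic below are an invented sketch. */
struct cpu_demo {
        int cpu_runrun;         /* user preemption pending */
        int cpu_kprunrun;       /* kernel preemption pending */
};

static void
preempt_demo(void)
{
        printf("surrendering the processor\n");
}

static void
check_preempt_demo(struct cpu_demo *cp, int returning_to_user)
{
        if (cp->cpu_kprunrun) {
                cp->cpu_kprunrun = 0;
                preempt_demo();         /* kernel preemption: act immediately */
        } else if (returning_to_user && cp->cpu_runrun) {
                cp->cpu_runrun = 0;
                preempt_demo();         /* user preemption: only on the way
                                           back out to user mode */
        }
}

int
main(void)
{
        struct cpu_demo c = { 1, 0 };

        check_preempt_demo(&c, 0);      /* nothing: user preemption waits */
        check_preempt_demo(&c, 1);      /* now the thread is preempted */
        return (0);
}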

Threads at high priorities (RT threads) that are not bound to a processor are placed on a system-wide kernel preempt queue, cp_kp_queue, rather than on one of the per-processor dispatch queues. For systems that support processor sets (Solaris 2.6 and beyond), a kernel preempt queue exists for each processor set. The kernel sets the kpreemptpri value as scheduling classes get loaded, either during system initialization or while the system is running. By default, kpreemptpri is set to a value of 100, which is the maximum system priority (99) plus 1. That will change only if a custom scheduling class is created and loaded into the kernel, causing the maximum system global priority value of 99 to change.

Going back to determining which dispatch queue to place a thread on for each of those four possible scenarios, we can illustrate it simply with a pseudocode flow:

if (unbound && priority < kpreemptpri)
	insert on the queue of the CPU set in the thread's t_cpu field;
if (unbound && priority >= kpreemptpri)
	insert on cp_kp_queue;
if (bound && priority < kpreemptpri)
	insert on the queue of the CPU the thread is bound to;
if (bound && priority >= kpreemptpri)
	insert on the queue of the CPU the thread is bound to;

In the first case, the t_cpu field in the kthread was set during the thread creation process, based on whichever processor was running the code. In the hopes of hitting a warm cache, the dispatcher will attempt to schedule a thread back on the processor it ran on previously, but it's possible the thread will have migrated during its lifetime.

In the second case, the thread will be placed on the single cp_kp_queue if no processor sets have been created; or the cp_kp_queue for the default processor set if processor sets have been created, but the thread has not explicitly been attached to one; or the cp_kp_queue for the processor set that the thread belongs to if it has been explicitly attached to a processor set. Bound threads are always placed on the queue for the processor they are bound to -- even high-priority RT threads.

Note that interrupt threads are linked to the processor's cpu structure through the cpu_intr_thread member. During system startup, nine interrupt threads are created for each processor and maintained on a linked list rooted in the aforementioned processor cpu_intr_thread pointer.

The insertion of a thread on a dispatch queue is accomplished using the setfrontdq() and setbackdq() routines for the per-processor dispatch queues, and setkpdq() for the kernel preemption queue (or queues). The algorithm for setfrontdq(), which places a thread at the front of a dispatch queue, is represented below in a pseudo code flow:

/* CPU below refers to selected CPU for the thread */
if (thread is swapped out or on the swap queue)
	call disp_swapped_setrun(thread);
	nudge the memory scheduler;
	return;
if (not a multiprocessor)
	CPU = t_cpu;    /* set cpu to cpu thread last ran on. it's a uniprocessor
			   so this can't be wrong */
else 			/* it is a multiprocessor */
	if (only 1 CPU online)
		CPU = t_cpu; 	/* set cpu to cpu thread last ran on */
	else if (thread is not bound)
		if (thread priority >= kpreemptpri)
			call setkpdq();
		else
			CPU = t_cpu;	/* set cpu to cpu thread last ran on */
			if (on correct cpu partition (processor set))
			if (thread priority < highest priority thread on this processor)
					cpu_choose(thread);
				else
					we'll use this processor.
			else
				CPU=disp_lowpri_cpu();
				/* migrate thread to a CPU in the correct processor set */

/* at this point, a processor has been selected */
set the thread state to runnable (TS_RUN);
increment disp_nrunnable; /* count of runnable threads for all the queues processor-wide */
increment dq_sruncnt; 	  /* count of runnable threads on the queue the thread was placed on */
if (dq_sruncnt > 1)	/* other threads already on this queue */
	set the dq first & last pointers appropriately;
else	/* this thread must be the only thread on the queue */
	set first and last pointing to this thread;
	set the queue bitmap to reflect the existence of a runnable thread on this queue;
	
if (thread priority > highest priority thread on the queue) /* disp_maxrunpri */
	disp_maxrunpri = thread priority;
	call cpu_resched();

Hopefully, most of the flow above is self-explanatory. Basically, we put the thread on the dispatch queue of the processor it last ran on, which is maintained in the thread's t_cpu field. Exceptions occur if the thread's priority puts it on the kernel preemption queue, or if its priority is lower than that of the highest priority thread on the processor (in which case we can't put it on the front of the queue).

The cpu_choose() routine looks for the best processor to put the thread on. In cpu_choose(), the kernel compares the number of clock ticks that have transpired since the thread last ran against rechoose_interval, which defaults to 3. This is essentially a means of figuring out whether it's worth the effort to put the thread back on the CPU it last ran on in the hopes of hitting a warm hardware cache.

If it's been 3 or more clock ticks (30 milliseconds, with the default 100Hz clock) since the thread last ran, we're probably not going to hit a warm cache. In that case, the processor running the lowest priority thread is located and selected. The thread's t_disp_time field gets incremented in the clock tick handler if the thread is not on a queue, and it's this value that the kernel compares against rechoose_interval.
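
Here's a compilable sketch of that heuristic. rechoose_interval, t_cpu, t_disp_time, and disp_lowpri_cpu() are real kernel names, but the structures and logic below are simplified inventions for illustration.

#include <stdio.h>

#define RECHOOSE_INTERVAL       3       /* default rechoose_interval */

struct cpu_demo {
        int cpu_id;
};

struct kthread_demo {
        struct cpu_demo *t_cpu;         /* CPU the thread last ran on */
        int t_disp_time;                /* clock ticks since the thread last ran */
};

static struct cpu_demo cpus[2] = { { 0 }, { 1 } };

/* Stand-in for disp_lowpri_cpu(): pretend CPU 1 is running the lowest
 * priority thread */
static struct cpu_demo *
disp_lowpri_cpu_demo(void)
{
        return (&cpus[1]);
}

static struct cpu_demo *
cpu_choose_demo(struct kthread_demo *t)
{
        /* Ran recently? Hope for a warm cache on the same CPU. */
        if (t->t_disp_time < RECHOOSE_INTERVAL)
                return (t->t_cpu);
        /* Probably cold; take the CPU running the lowest priority thread. */
        return (disp_lowpri_cpu_demo());
}

int
main(void)
{
        struct kthread_demo t = { &cpus[0], 1 };

        printf("chose cpu %d\n", cpu_choose_demo(&t)->cpu_id);  /* cpu 0 */
        t.t_disp_time = 5;
        printf("chose cpu %d\n", cpu_choose_demo(&t)->cpu_id);  /* cpu 1 */
        return (0);
}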

The dispatch queue structures maintain counts of the number of runnable threads on the queues. The disp_nrunnable counter reflects all the queues for the processor (remember that each processor has a queue for every priority -- it's a queue of queues), while the dq_sruncnt counter is specific to a given per-priority queue. A field in the dispatch queue structure, disp_qactmap, is a bitmap indicating which of the per-priority queues contain runnable threads. This field is used when searching the queues to select the next thread to run, and it greatly speeds up the process (bit operations are fast, and we don't waste time looking in empty queues).
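
The bitmap trick is easy to demonstrate. The following self-contained C sketch (invented names, with a single machine word standing in for the kernel's larger disp_qactmap) shows how setting one bit per non-empty queue makes finding the highest priority runnable queue cheap:

#include <stdio.h>

#define NPRI 64         /* demo: one queue per priority, 0..63 */

/* One bit per per-priority queue; bit i set means queue i is non-empty.
 * The kernel's disp_qactmap is an array of words; one word is enough here. */
static unsigned long long qactmap;

static void
queue_insert_demo(int pri)
{
        qactmap |= (1ULL << pri);       /* mark queue as having work */
}

/* Find the highest priority non-empty queue, or -1 if all are empty.
 * Bit scans beat walking every queue header looking for work. */
static int
highest_pri_demo(void)
{
        int pri;

        if (qactmap == 0)
                return (-1);
        for (pri = NPRI - 1; pri >= 0; pri--)
                if (qactmap & (1ULL << pri))
                        return (pri);
        return (-1);
}

int
main(void)
{
        queue_insert_demo(10);
        queue_insert_demo(59);
        printf("best queue: %d\n", highest_pri_demo()); /* prints 59 */
        return (0);
}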

Finally, we need to figure out if a thread preemption is required. If the priority of the thread we just inserted into the queue is greater than that of the highest priority thread already on the queue, we'll set up either a user preemption or a kernel preemption. The cpu_resched() function figures that out, setting cpu_runrun or cpu_kprunrun, as I described earlier.

The setbackdq() routine, which puts a thread on the back of a dispatch queue, is algorithmically similar to setfrontdq(), but with a few differences. First, in setbackdq() the kernel attempts to maintain a balance in queue depths across processors. Once a CPU has been selected via cpu_choose(), the number of runnable threads on the processor's queue for the corresponding thread priority is examined. If it is greater than MAX_RUNQ_DIFF (2), we'll try the next CPU in the same partition, as sketched below. Most of the subsequent code follows what happens in setfrontdq().
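
A rough sketch of that balancing check, assuming just two CPUs in the partition; MAX_RUNQ_DIFF is the real kernel constant, but everything else here is invented for illustration.

#include <stdio.h>

#define MAX_RUNQ_DIFF   2       /* real kernel constant */
#define NCPU            2

/* Invented per-CPU runnable counts for one priority's queue */
static int dq_sruncnt[NCPU] = { 4, 1 };

/* Per the text above: if the chosen CPU's queue for this priority is
 * deeper than MAX_RUNQ_DIFF, try the next CPU in the partition. */
static int
balance_demo(int chosen)
{
        if (dq_sruncnt[chosen] > MAX_RUNQ_DIFF)
                return ((chosen + 1) % NCPU);
        return (chosen);
}

int
main(void)
{
        printf("insert on cpu %d\n", balance_demo(0));  /* prints 1 */
        return (0);
}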

Whether setfrontdq() or setbackdq() is called at the various points in the kernel where queue insertion is required is driven by a few different factors, such as how long a thread has been waiting to run, whether the thread is in the IA class, and so on.

The setkpdq() code is pretty straightforward. It puts a kernel thread on either the front or back of the kernel preempt queue, depending on where it was called from. If it was called from setfrontdq(), it puts the thread on the front of the kernel preempt queue. Once the thread is inserted into a queue, cpu_choose() is called to select a processor, and cpu_resched() is invoked to set up a preemption.

Thread selection
The kernel swtch() routine initiates the context switching of a thread off of a processor, figures out which thread should run next, and context switches the selected thread onto a processor for execution. It's called from many places within the operating system, such as the class fork return function (discussed above), the idle thread (executed by processors when there are no runnable threads on a dispatch queue), the preemption code, interrupt and trap handlers, thread sleep management, kernel synchronization support code (mutexes, reader/writer locks, condition variables, etc.), and so on.

Entering the swtch() routine causes the cpu_sysinfo.pswtch counter to be incremented; this is the number of context switches per second reported by mpstat(1M) in the csw column. The swtch() function first checks to see if the current thread is an interrupt thread. When an interrupt happens, the thread stack is switched to that of an interrupt thread taken from the linked list of interrupt threads off the processor's cpu structure. If we entered swtch() with an interrupt thread as the current thread, the kernel must restore the interrupted thread's state so it can be resumed.

That aside, swtch() will call the kernel disp() function, which is the code segment that actually looks for the highest priority thread to run. The search for the highest priority thread is made relatively simple with the implementation of the dispatch queue bitmap field, disp_qactmap, and the disp_maxrunpri field.

The disp_qactmap will have a bit set in its mask for every queue that is nonzero in depth, meaning there's at least one kernel thread on the queue. The disp_maxrunpri field is updated every time a thread is placed on a queue with a higher priority than any of the existing threads on the queue; in other words, it holds the priority of the highest priority thread on the queues for the processor.

The kernel preempt queue is searched first, followed by the per-priority queues for TS/IA/SYS class threads. If there's no work on either the kernel preempt queue or any of the processor's own queues, the dispatcher will go look for work on the queues of other processors. If there's nothing runnable on any queue in the system, the idle thread is invoked.

The idle loop is quite simple. The code spins, examining disp_nrunnable on each pass through the loop, waiting for something to be inserted into a queue (disp_nrunnable gets incremented when a thread is placed on a queue).
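
The shape of the idle loop can be shown with a toy simulation -- disp_nrunnable is the real field name, but the loop below fakes the queue insertion that would normally come from another processor:

#include <stdio.h>

/* Count of runnable threads across all of this CPU's queues; real
 * kernel field name, toy simulation otherwise. */
static volatile int disp_nrunnable = 0;

/* setfrontdq()/setbackdq() bump disp_nrunnable on insertion; here we
 * fake that after a number of idle spins. */
static void
simulate_queue_insert(void)
{
        disp_nrunnable++;
}

int
main(void)
{
        int spins = 0;

        /* The idle thread's shape: spin until a queue insertion makes
         * disp_nrunnable nonzero, then go run disp()/swtch(). */
        while (disp_nrunnable == 0) {
                if (++spins == 1000000)         /* demo only: create work */
                        simulate_queue_insert();
        }
        printf("work arrived after %d spins; calling swtch()\n", spins);
        return (0);
}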

Once the thread selection has been made, the kernel runs a "ratify" routine to ensure that the selected thread was in fact the best candidate to run next. The fine-grained locking used within the dispatcher routines allows simultaneous changes to be made to the queues and the queue state by, potentially, many processors. For this reason, the "select and ratify" algorithm was chosen for implementation.

The dispatcher state is updated to reflect the selection in the first pass through the swtch(), disp(), and associated routines, and then disp_ratify() is called. The ratify code simply compares the priority of the selected thread to the disp_maxrunpri values of the processor and kernel preempt queue. If the selected thread priority is greater than maxrunpri, the selection is "ratified" and the context switch is done. If not, we re-enter the code loop to find the best runnable thread.
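
Stripped of locking, the ratify step reduces to a priority comparison. A minimal sketch, where disp_maxrunpri is the real field name and the rest is invented:

#include <stdio.h>

/* Priority of the best runnable thread on this CPU's queues; real
 * field name (disp_maxrunpri), invented surroundings. On a real MP
 * system this can change underneath the selecting CPU. */
static volatile int disp_maxrunpri = 30;

/* Sketch of the ratify test: the selection stands only if nothing of
 * higher priority appeared on the queues while we were choosing. */
static int
disp_ratify_demo(int selected_pri)
{
        return (selected_pri > disp_maxrunpri);
}

int
main(void)
{
        int pri = 40;   /* priority of the thread we just selected */

        if (disp_ratify_demo(pri))
                printf("ratified: context switch to priority %d\n", pri);
        else
                printf("beaten: re-enter disp() and select again\n");
        return (0);
}

The design choice is deliberate: rather than holding queue locks across the entire selection, the dispatcher selects optimistically and pays the cost of an occasional re-selection when it loses the race.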

That's about it for this month. Next month we'll look at what happens in the clock handler with respect to thread scheduling and the dispatcher. Stay tuned!


About the author
Jim Mauro is currently an area technology manager for Sun Microsystems in the Northeast, focusing on server systems, clusters, and high availability. He has a total of 18 years of industry experience, working in educational services (he developed and delivered courses on Unix internals and administration) and software consulting. Reach Jim at jim.mauro@sunworld.com.
