Inside Solaris by Jim Mauro

The Solaris process model, Part 7: Managing thread execution and wait times in the system clock handler

A look at the dispatcher-related functions driven by the clock handler in Solaris

SunWorld
March  1999

Abstract
In last month's column, Jim discussed how a thread inherits its priority and scheduling class during the creation process, the dispatcher queue selection and insertion process, and the "select and ratify" algorithm that is implemented when selecting a thread to be context switched onto a processor. This month, he'll examine the Solaris clock interrupt handler in the context of the dispatcher support functions it drives. (1,800 words)


Timing, as they say, is everything. When it comes to the prioritization and scheduling of kernel threads in Solaris, time is everything. Specifically, how much time a thread has been running on a processor, and how much time it has waited to run on one, drives the priority recalculation of timeshare (TS) and interactive (IA) class threads. Realtime (RT) threads run at a fixed priority, so no priority adjustment is necessary, although execution time is still (of course) tracked, because a time quantum is applied to RT threads. System (SYS) class threads are even simpler as far as the kernel is concerned: they execute until they voluntarily surrender the processor.

The Solaris kernel handles the updating and tracking of thread execution and wait times in the system clock handler, which runs at regular intervals. On all current UltraSPARC-based systems, a clock interrupt is generated 100 times a second, or every 10 milliseconds. The kernel clock handler does two passes through the list of CPUs on the system by walking the linked list of CPU structures. The first pass simply checks for wait I/O or swap wait by examining the per-processor status information maintained in the cpu_stat structures. The count of runnable threads is also determined by summing the disp_nrunnable values in each processor's dispatch structure, where a total count of runnable threads on all the dispatch queues for the processor is maintained.
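To make the bookkeeping concrete, here's a rough sketch of that first pass in C. This is an illustrative model, not the actual kernel source: model_cpu, model_disp, cpu_wait_io, and cpu_wait_swap are simplified stand-ins for the real cpu, dispatcher, and cpu_stat structures, and the loop simply walks a circular list of CPUs, totaling up the counts.

    /*
     * Simplified model of the clock handler's first pass over the CPU list.
     * The structures below are illustrative stand-ins; the field names are
     * borrowed from the kernel, but the layout is not the real one.
     */
    struct model_disp {
        int disp_nrunnable;             /* runnable threads on this CPU's queues */
    };

    struct model_cpu {
        struct model_cpu *cpu_next;     /* circular linked list of CPUs */
        struct model_disp cpu_disp;     /* per-CPU dispatch structure */
        int cpu_wait_io;                /* nonzero if waiting on I/O */
        int cpu_wait_swap;              /* nonzero if waiting on swap */
    };

    void
    first_pass(struct model_cpu *cpu_list, int *nrunnable, int *wio, int *wswap)
    {
        struct model_cpu *cp = cpu_list;

        *nrunnable = *wio = *wswap = 0;
        do {
            if (cp->cpu_wait_io)
                (*wio)++;               /* wait I/O accounting */
            if (cp->cpu_wait_swap)
                (*wswap)++;             /* swap wait accounting */
            *nrunnable += cp->cpu_disp.disp_nrunnable;
            cp = cp->cpu_next;
        } while (cp != cpu_list);
    }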

The second loop through the processor list does a bit more work. This is where the kernel gathers data on system-wide processor utilization by determining whether a processor is idle, running a thread in user mode, or running in kernel mode. For every processor running a thread that is not an interrupt thread, the system does the necessary "tick" processing on the thread, which involves updating the appropriate fields in the thread structure to track execution time. Two members of the thread structure are checked and updated up front: the t_lbolt field stores the lbolt value from the last clock tick, and t_pctcpu stores the percentage of CPU time the thread has recently used. The kernel maintains a counter called lbolt, which counts the number of clock ticks since boot time; it is incremented on every clock tick (clock interrupt). The code compares the current lbolt value with the thread's t_lbolt value, and if t_lbolt is less than lbolt, tick processing is needed for the thread. Before the kernel clock_tick() code is actually called, the thread's t_pctcpu value is recalculated, and the current lbolt value is stored in the thread structure (thread_structure.t_lbolt = lbolt).
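In pseudo-C, the per-thread check looks roughly like the sketch below. Again, this is a simplified model rather than kernel source; model_thread, recompute_pctcpu(), and the clock_tick() prototype are stand-ins for the real thread structure and per-tick work.

    struct model_thread {
        long   t_lbolt;                 /* lbolt value at last tick processing */
        double t_pctcpu;                /* recent CPU usage estimate */
    };

    extern long lbolt;                  /* clock ticks since boot (model) */

    /* placeholders for the real per-tick work */
    double recompute_pctcpu(double oldpct);
    void   clock_tick(struct model_thread *t);

    void
    maybe_tick(struct model_thread *t)
    {
        if (t->t_lbolt < lbolt) {       /* the thread is due for tick processing */
            t->t_pctcpu = recompute_pctcpu(t->t_pctcpu);
            t->t_lbolt = lbolt;         /* remember that we did it this tick */
            clock_tick(t);              /* charge time, invoke CL_TICK(), etc. */
        }
    }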


Tick, tick, tick, tick...
The clock_tick() function is a routine defined in the kernel clock handler and is used to update the user or system time charged to the thread, depending, of course, on which mode it is in. Before that happens, however, the scheduling class-specific clock tick handler is invoked via the CL_TICK(t) macro, where "t" is a pointer to the kernel thread. Threads in the TS or IA scheduling class resolve the macro to the TS class ts_tick() routine, and RT class threads resolve to rt_tick(). There is no clock tick handler for SYS class threads. We'll look first at the ts_tick() code.
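Before we do, it helps to picture how that class-specific dispatch works. The sketch below is a simplified model of the idea (a per-class vector of function pointers hung off the thread); model_thread, class_ops, ts_classops, and rt_classops are illustrative names, not the kernel's actual declarations.

    struct model_thread;                /* forward declaration */

    /* per-class operations vector: one set of entry points per scheduling class */
    struct class_ops {
        void (*cl_tick)(struct model_thread *);     /* clock tick handler */
        /* ... other class entry points ... */
    };

    struct model_thread {
        struct class_ops *t_clfuncs;    /* which class this thread belongs to */
    };

    /* model of the CL_TICK(t) macro: an indirect call through the class vector */
    #define CL_TICK(t)  ((t)->t_clfuncs->cl_tick(t))

    void ts_tick(struct model_thread *);    /* TS/IA tick handler */
    void rt_tick(struct model_thread *);    /* RT tick handler */

    struct class_ops ts_classops = { ts_tick };
    struct class_ops rt_classops = { rt_tick };

With this picture, CL_TICK(t) lands in ts_tick() or rt_tick() simply because the thread carries a pointer to its class's operations vector.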

The ts_tick() routine first checks whether the kernel thread is running in kernel mode (running at a SYS priority). If it is, the lion's share of ts_tick() processing is circumvented; we'll get back to what happens in this situation in a little bit. First, we'll walk through the code path for threads not running in kernel mode. Remember the class-specific data structures that every kernel thread has a link (pointer) to? We covered them in a previous column. The class data structure, ts_data, which is used for both TS and IA threads, maintains (among other things) a "timeleft" field that tracks how much time the thread has left in its time quantum. If the thread has used up its time quantum (after decrementing ts_timeleft, the value is <= 0), the kernel must set things up for the thread to relinquish the processor and get context switched off. But first, the operating system must determine if preemption control has been applied to this thread, and if so, whether the thread has been given enough extra CPU cycles.
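Here's a rough model of that flow. The helper functions (thread_at_sys_priority(), preempt_control_active(), preempt_control_overdue(), arrange_preemption()) are made-up stand-ins for the real checks, and model_tsproc is a simplified version of the class data structure.

    struct model_thread;                /* incomplete; details not needed here */

    /* simplified model of the TS/IA class-specific data */
    struct model_tsproc {
        int ts_timeleft;                /* clock ticks left in the time quantum */
    };

    /* made-up helpers standing in for the real checks */
    int  thread_at_sys_priority(struct model_thread *);
    int  preempt_control_active(struct model_thread *);
    int  preempt_control_overdue(struct model_thread *);
    void arrange_preemption(struct model_thread *);

    void
    ts_tick_model(struct model_thread *t, struct model_tsproc *ts)
    {
        if (thread_at_sys_priority(t))
            return;                     /* most of the ts_tick() work is skipped */

        if (--ts->ts_timeleft <= 0) {
            /*
             * The quantum is used up. Unless preemption control has been
             * applied (and hasn't overstayed its welcome), set the thread
             * up to be context switched off the processor.
             */
            if (!preempt_control_active(t) || preempt_control_overdue(t))
                arrange_preemption(t);
        }
    }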

Preemption control is an implementation of scheduler activations in Solaris, and is a fairly recent addition (release 2.6). The preemption control interfaces are now part of the standard Solaris APIs and, as such, are documented via man pages: schedctl_init(3X), schedctl_start(3X), etc. Scheduler activations give an application the ability to "ask" the operating system not to remove a thread from a processor (context switch it off) even if the thread's time quantum has expired. The purpose of such an interface is to provide a little hint to the operating system for threads that are holding a critical resource (such as a semaphore or mutex lock). Such threads should execute until they're done with whatever they're doing, so they can release the resource.

Consider this simple scenario: An application uses a mutex lock to synchronize access to a shared memory segment. The shared segment is a critical resource, in that much of the application's work requires that threads be able to read and write data in the shared segment at some point during execution. A thread executes, grabs the mutex, and, before finishing, gets context switched out because it has used up its time quantum. Other threads get scheduled to do work; they start running and attempt to get the mutex, but it's being held, so they block. The thread holding the mutex is sitting on a dispatch queue, waiting for its turn to run again. Only when that happens will it (hopefully) have a chance to finish what it's doing with the shared segment and release the lock. With scheduler activations, you can issue a schedctl_start(3X) to notify the dispatcher that you're entering a critical code section, and do a subsequent schedctl_stop(3X) when you leave it. Note that there's some setup code to do prior to issuing a schedctl_start(3X); see the man pages for details.
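To make that concrete, here's what the preemption control calls look like wrapped around a critical section. This assumes the schedctl_init(3X) family of interfaces behaves as documented in the man pages cited above, and it keeps error handling to a bare minimum.

    #include <schedctl.h>
    #include <synch.h>
    #include <stdio.h>

    static mutex_t shm_lock = DEFAULTMUTEX;     /* protects the shared segment */

    int
    main(void)
    {
        schedctl_t *sc;

        /* one-time setup; each LWP using preemption control needs its own */
        if ((sc = schedctl_init()) == NULL) {
            fprintf(stderr, "schedctl_init failed\n");
            return (1);
        }

        schedctl_start(sc);             /* hint: please don't preempt me now */
        mutex_lock(&shm_lock);
        /* ... read/write the shared memory segment ... */
        mutex_unlock(&shm_lock);
        schedctl_stop(sc);              /* critical section done; normal rules apply */

        return (0);
    }

See the schedctl_init(3X) man page for the compile and link details on your release.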

Going back to the ts_tick() code, the kernel checks to see if a scheduler activation has been issued for the thread, and if it has, it will not force the thread to be preempted even if it has used up its time quantum. As a fail-safe mechanism, this is not allowed to go on indefinitely; if the thread still has not been preempted after a couple of clock ticks, the kernel will set it up for preemption anyway. In the absence of any scheduler activations, the thread gets a new priority based on the tqexp field of the TS dispatch table, indexed by the thread's current priority. In other words, if the thread is currently at priority 50, the corresponding tqexp value in the table is 40, resulting in the thread's ts_cpupri field being set to 40 (a worse priority, because the thread ran through its time quantum). That done, the thread's new ts_umdpri is determined, factoring in any priocntl(2) hints that may have been set for the thread. The kernel code will adjust the thread's position on the run queue if the new priority requires it.
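The lookup itself is just a table index. Here's a minimal model of it; ts_tqexp_model is an illustrative array standing in for the tqexp column of the real TS dispatch table (which you can display with dispadmin -c TS -g).

    /*
     * Simplified model of the tqexp lookup. The real TS dispatch table
     * carries several values per priority level (0 through 59); only the
     * tqexp column is modeled here.
     */
    static int ts_tqexp_model[60] = {
        /* ... */
        [50] = 40,      /* the example from the text: priority 50 expires to 40 */
        /* ... */
    };

    int
    new_ts_cpupri_after_quantum(int cur_pri)
    {
        return (ts_tqexp_model[cur_pri]);   /* a worse priority after burning the quantum */
    }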

If, upon entering the ts_tick() routine, the thread has not yet used up its time quantum, none of the above code is applied. Rather, the kernel checks to see if the thread's current priority is less than the highest priority thread sitting on any of the processor's dispatch queues (the queue's disp_maxrunpri field), and if it is, the thread is forced to surrender the CPU. This can happen if a higher priority thread was placed on the dispatch queue at some point after this thread started running on the processor, before the clock interrupt came in.
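That check can be modeled in a few lines; the structures and the surrender_cpu() helper below are illustrative stand-ins, not kernel declarations.

    /* illustrative model: should the running thread yield to a queued thread? */
    struct model_disp   { int disp_maxrunpri; };    /* best priority waiting on the queues */
    struct model_cpu    { struct model_disp cpu_disp; };
    struct model_thread { int t_pri; struct model_cpu *t_cpu; };

    void surrender_cpu(struct model_thread *);      /* stand-in for the preemption setup */

    void
    check_back_of_queue(struct model_thread *t)
    {
        if (t->t_pri < t->t_cpu->cpu_disp.disp_maxrunpri)
            surrender_cpu(t);   /* a better-priority thread is waiting; give up the CPU */
    }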

At this point, we return to the clock routine to do a little additional housekeeping. The user mode time or system mode time fields are incremented (depending, of course, on which mode the thread is running in) in the lwp (lwp_utime, lwp_stime) and at the process level (p_utime, p_stime). If interval timers have been enabled (via the setitimer(2) system call) for virtual time or profiling, timer expiration is checked and the appropriate signal is sent: SIGVTALRM for virtual timer expiration or SIGPROF for profiling timer expiration. Lastly, the CPU rlimit is enforced; if the thread has exceeded its CPU limit, a SIGXCPU signal is sent, which will cause the process to exit and dump core (a "CPU Limit Exceeded (core dumped)" message is sent to stderr).
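From userland, you can watch the tick-driven timer expiration checks at work with a standard profiling interval timer. The short program below isn't Solaris-specific; it simply burns CPU with an ITIMER_PROF timer armed and counts the SIGPROF deliveries that result.

    #include <signal.h>
    #include <stdio.h>
    #include <sys/time.h>

    static volatile sig_atomic_t profs;         /* number of SIGPROF deliveries */

    static void
    prof_handler(int sig)
    {
        profs++;
    }

    int
    main(void)
    {
        struct sigaction sa;
        struct itimerval it;
        volatile double x = 0.0;
        long i;

        sa.sa_handler = prof_handler;
        sa.sa_flags = 0;
        sigemptyset(&sa.sa_mask);
        sigaction(SIGPROF, &sa, NULL);

        /* fire every 100 milliseconds of consumed user + system CPU time */
        it.it_interval.tv_sec = 0;
        it.it_interval.tv_usec = 100000;
        it.it_value = it.it_interval;
        setitimer(ITIMER_PROF, &it, NULL);

        for (i = 0; i < 100000000; i++)         /* consume some CPU time */
            x += (double)i;

        printf("SIGPROF delivered %d times\n", (int)profs);
        return (0);
    }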

For RT threads, the tick handler is quite simple. If an RT thread does not have an infinite time quantum and it has used up its allotted time quantum, it is forced to surrender the processor. Likewise, if a higher priority thread is now on the dispatch queue (disp_maxrunpri is greater than the current thread's priority), it will surrender the processor. If neither of these conditions is true, the thread remains on the processor. The subsequent housekeeping described above is done when rt_tick() returns to the clock handler. There's no priority recalculation done, as RT threads are fixed priority.
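Modeled in the same style as the earlier sketches (illustrative structures and made-up helper names, not kernel source), rt_tick() boils down to roughly this:

    /* simplified model of the RT class per-thread data */
    struct model_rtproc {
        int rt_timeleft;        /* ticks left in the quantum */
        int rt_tqinf;           /* nonzero if the thread has an infinite quantum */
    };

    struct model_thread { int t_pri; int t_disp_maxrunpri; };

    void surrender_cpu(struct model_thread *);      /* stand-in for the preemption setup */

    void
    rt_tick_model(struct model_thread *t, struct model_rtproc *rt)
    {
        /* no priority recalculation here: RT priorities are fixed */
        if (!rt->rt_tqinf && --rt->rt_timeleft <= 0)
            surrender_cpu(t);   /* allotted quantum is used up */
        else if (t->t_pri < t->t_disp_maxrunpri)
            surrender_cpu(t);   /* a higher priority thread is now queued */
    }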

That's it for this month. Next month we'll get into signals in Solaris.


About the author
Jim Mauro is currently an area technology manager for Sun Microsystems in the Northeast, focusing on server systems, clusters, and high availability. He has a total of 18 years of industry experience, working in educational services (he developed and delivered courses on Unix internals and administration) and software consulting.
