Performance Q & A by Adrian Cockcroft

Unlocking the kernel

How you can monitor mutex contention, and decide what to do when it becomes a problem

SunWorld
August  1998

Abstract
For many years, developers have worked to make Solaris the most scalable shared memory multiprocessor operating system. For many applications, contention for locks inside the kernel is not an issue -- however, extreme workloads and new kinds of applications can still cause problems. It is possible to monitor lock contention and determine whether or not an application is causing contention. This monitoring will also help you decide whether it's appropriate to scale a workload by adding more CPUs, or to upgrade to a more recent Solaris release. (3,500 words)


Q: I can see mutex contention with mpstat, but what level constitutes a problem -- and what can I do about it?

A: To answer this question I have taken an excerpt from the second edition of my book, Sun Performance and Tuning -- Java and the Internet (©Sun Press/PTR Prentice Hall 1998), that explains the issues and how to analyze them. (The text from the subhead "Critical regions" up to the subhead "Monitoring using mpstat" is excerpted from my book). I have also taken the mutex rule in the SE toolkit and upgraded it to be a much more sophisticated check; this is explained in detail and will be added to the SE3.0 patch.

Critical regions
The Unix kernel has many critical regions, or sections of code, where a data structure is being created or updated. These regions must not be interrupted by a higher-priority interrupt service routine. A uniprocessor Unix kernel manages these regions by setting the interrupt mask to a high value while executing in the region. On a multiprocessor, there are other processors with their own interrupt masks, so a different technique must be used to manage critical regions.

The spin lock or mutex
One key capability in shared-memory multiprocessor systems is the ability to perform interprocessor synchronization by means of atomic load/store or swap instructions. All SPARC chips have an instruction called LDSTUB, which means load-store-unsigned-byte. The instruction reads a byte from memory into a register, then writes 0xFF into memory in a single, indivisible operation. The value in the register can then be examined to see if it was already 0xFF (which means that another processor got there first), or if it was 0x00 (which means that this processor is in charge). This instruction is used to make mutual exclusion locks (known as mutexes) that make sure only one processor at a time can hold the lock. The lock is acquired through LDSTUB and cleared by storing 0x00 back to memory.

If a processor does not get the lock, then it may decide to spin by sitting in a loop and testing the lock until it becomes available. By checking with a normal load instruction in a loop before issuing an LDSTUB, the processor performs the spin within the cache, and the bus snooping logic watches for the lock being cleared. In this way, spinning causes no bus traffic -- so processors that are waiting do not slow down those that are working. A spin lock is appropriate when the wait is expected to be short. If a long wait is expected, the process should sleep for a while so that a different job can be scheduled onto the CPU. The extra SPARC V9 instructions introduced with UltraSPARC include an atomic compare-and-swap operation, which is used by the UltraSPARC-specific version of the kernel to make locking more efficient.
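
As an illustration of the technique just described, here is a minimal sketch of a test-and-test-and-set spin lock in C. It uses the C11 atomic exchange as a portable stand-in for LDSTUB, and the names spin_lock_t, spin_acquire, and spin_release are made up for this example rather than taken from the Solaris kernel.

/*
 * Minimal sketch of a test-and-test-and-set spin lock (not kernel source).
 * atomic_exchange stands in for the SPARC LDSTUB instruction: it reads the
 * old byte and writes 0xFF in a single indivisible operation.
 */
#include <stdatomic.h>

typedef struct {
    atomic_uchar locked;                /* 0x00 = free, 0xFF = held */
} spin_lock_t;

void
spin_acquire(spin_lock_t *lp)
{
    for (;;) {
        /* The LDSTUB equivalent: if the old value was 0x00, we own the lock. */
        if (atomic_exchange(&lp->locked, 0xFF) == 0x00)
            return;
        /*
         * Spin with ordinary loads so the wait stays in the cache; the
         * bus snooping logic makes the loop fall through when the holder
         * stores 0x00 back to memory.
         */
        while (atomic_load_explicit(&lp->locked, memory_order_relaxed) != 0x00)
            ;
    }
}

void
spin_release(spin_lock_t *lp)
{
    atomic_store(&lp->locked, 0x00);    /* clear the lock */
}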

Code locking
The simplest way to convert a Unix kernel that is using interrupt levels to control critical regions for use with multiprocessors is to replace the call that sets interrupt levels high with a call to acquire a mutex lock. At the point where the interrupt level was lowered, the lock is cleared. In this way, the same regions of code are locked for exclusive access. This method has been used to a greater or lesser extent by most MP Unix implementations, including SunOS 4 on the SPARCserver 600MP machines. The amount of actual concurrency that can take place in the kernel is controlled by the number and position of the locks.

In SunOS 4 there is effectively a single lock around the entire kernel. The reason for using a single lock is to make these MP systems totally compatible with user programs and device drivers written for uniprocessor systems. User programs can take advantage of the extra CPUs, but only one CPU at a time can be executing kernel code.

When code locking is used, there are a fixed number of locks in the system. This number can be used to characterize how much concurrency is available. On a very busy, highly configured system, the code locks are likely to become bottlenecks. In this case, adding extra processors will not help performance, and may actually reduce it.
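
A minimal sketch of the conversion, with a hypothetical function and lock name (the kmutex_t interfaces are the Solaris DDI mutex calls, and the splhigh/splx pair in the comment stands for the uniprocessor interrupt-masking calls): the same critical region is guarded, but by a single code lock instead of the interrupt mask.

/*
 * Sketch only: how a critical region changes from interrupt masking on a
 * uniprocessor to a code lock on a multiprocessor. The table_update name
 * is hypothetical; mutex_enter/mutex_exit are the Solaris DDI interfaces.
 */
#include <sys/mutex.h>                  /* kmutex_t, mutex_enter, mutex_exit */

static kmutex_t table_code_lock;        /* one lock guards this code path */

void
table_update(void)
{
    /* Uniprocessor version:  s = splhigh(); ...critical region... splx(s); */
    mutex_enter(&table_code_lock);      /* only one CPU at a time in here */
    /* ... create or update the shared data structure ... */
    mutex_exit(&table_code_lock);
}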

Data locking and Solaris 2
The problem with code locks is that different processors often want to use the same code to work on different data. To allow this use, locks must be placed in data structures rather than in code. Unfortunately, this solution requires an extensive rewrite of the kernel -- one reason why Solaris 2 took several years to create. The result is that the kernel has a lot of concurrency available, and can be tuned to scale well with large numbers of processors. The same kernel is used on uniprocessors and multiprocessors, so that all device drivers and user programs must be written to work in the same environment, and there is no need to constrain concurrency for compatibility with uniprocessor systems. The locks are still needed in a uniprocessor because the kernel can switch between kernel threads at any time to service an interrupt.

With data locking, there is a lock for each instance of a data structure. Since table sizes vary dynamically, the total number of locks grows as the tables grow, and the amount of concurrency that is available to exploit is greater on a very busy, highly configured system. Adding extra processors to such a system is likely to be beneficial. Solaris 2 has hundreds of different data locks; multiplied by the number of instances of the data, there are typically many thousands of locks in existence on a running system. As Solaris 2 is tuned for more concurrency, some of the remaining code locks are turned into data locks, and "large" locks are broken down into finer-grained locks. There is a trade-off between having lots of mutexes for good MP scalability and few mutexes for reduced CPU overhead on uniprocessors.
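
For contrast, here is the same region converted to data locking, again as a sketch with a hypothetical structure and field names: the mutex moves into each instance of the data, so different CPUs can run the same code on different instances at the same time.

/*
 * Sketch only: data locking puts a mutex in each instance of the structure.
 * The foo_instance type and foo_hold function are hypothetical.
 */
#include <sys/mutex.h>

typedef struct foo_instance {
    kmutex_t    fi_lock;                /* per-instance lock, not per-code */
    int         fi_refcnt;
    /* ... the rest of the instance data ... */
} foo_instance_t;

void
foo_hold(foo_instance_t *fip)
{
    mutex_enter(&fip->fi_lock);         /* contends only with users of this instance */
    fip->fi_refcnt++;
    mutex_exit(&fip->fi_lock);
}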


Monitoring Solaris 2.6 lock statistics
In Solaris 2.6, the lockstat(1M) command offers a kernel lock monitoring capability that is very powerful and easy to use. The kernel can dynamically change its lock code to start collecting information, then return to the normal, highly optimized lock code when the collection is complete. There is some extra overhead while the locks are instrumented, but useful measurements can be taken in a few seconds without disrupting work on production machines. Data is presented clearly with several summary options. However, you still need a very good understanding of Unix kernel architecture -- and Solaris 2 in particular -- to really understand what is happening.

You should read the lockstat(1M) manual page to see all of its options -- but in normal operation, the root user runs it wrapped around a command, and it monitors the system for as long as that command runs. If you want to measure for a fixed interval, use the sleep command. I had trouble finding an example that would cause lock contention in Solaris 2.6. Eventually, I found that putting a C shell into an infinite loop, as shown below, generates 120,000 or more system calls per second, duplicating and closing file descriptors. When two shells are run at the same time on a dual-CPU system, mpstat shows hundreds of mutex spins per second on each CPU. This test is not intended to be serious! The system call mix was monitored with truss -c, as shown in Code Example 1. With truss running, the system call rate dropped to about 12,000 calls per second, so I ran lockstat separately from truss.

Code Example 1

% while 1
? end
% truss -c -p 351
^C
syscall      seconds   calls  errors
open             .26    2868
close            .43   17209
getpid           .44   17208
dup              .31   14341
sigprocmask      .65   28690
                ----     ---    ---
sys totals:     2.09   80316      0
usr time:        .39
elapsed:        6.51

The purpose of this experiment was to see what system calls were causing the contention. It appears to be a combination of dup and close, and possibly sigprocmask as well. Although there are a lot of calls to getpid, they should not cause any contention.

The example output shows that there were 3,318 adaptive mutex spins in 5 seconds, which is 663 mutex stalls per second in total and matches the data reported by mpstat. The locks and callers are shown; the top lock seems to be flock_lock, which is a file-related lock. Some other locks that are not named in this display may be members of hashed lock groups. Several of the calling routines mention file or filesystem operations and sigprocmask, which ties up with the system call trace.

If you see high levels of mutex contention, you need to identify both the locks that are being contended on and the component of the workload that is causing the contention -- for example, system calls or a network protocol stack. You can then try to change the workload to avoid the problem. You should also make a service call to Sun to report the contention. As Sun pushes multiprocessor scalability to 64 processors and beyond, there are bound to be new workloads and new areas of contention that do not occur on smaller configurations. Each new release of Solaris 2 further reduces contention to allow even higher scalability and new workloads. There is no point reporting high mutex contention levels in anything but the latest release of Solaris 2.

Code Example 2

# lockstat sleep 5

Adaptive mutex spin: 3318 events
Count indv cuml rcnt     spin Lock                   Caller                  
-------------------------------------------------------------------------------
  601  18%  18% 1.00        1 flock_lock             cleanlocks+0x10         
  302   9%  27% 1.00        7 0xf597aab0             dev_get_dev_info+0x4c   
  251   8%  35% 1.00        1 0xf597aab0             mod_rele_dev_by_major+0x2c
  245   7%  42% 1.00        3 0xf597aab0             cdev_size+0x74          
  160   5%  47% 1.00        7 0xf5b3c738             ddi_prop_search_common+0x50
  157   5%  52% 1.00        1 0xf597aab0          ddi_hold_installed_driver+0x2c
  141   4%  56% 1.00        2 0xf5c138e8             dnlc_lookup+0x9c        
  129   4%  60% 1.00        6 0xf5b1a790             ufs_readlink+0x120      
  128   4%  64% 1.00        1 0xf5c46910             cleanlocks+0x50         
  118   4%  67% 1.00        7 0xf5b1a790             ufs_readlink+0xbc       
  117   4%  71% 1.00        6 stable_lock            specvp+0x8c             
  111   3%  74% 1.00        1 0xf5c732a8             spec_open+0x54          
  107   3%  77% 1.00        3 stable_lock            stillreferenced+0x8     
   97   3%  80% 1.00        1 0xf5b1af00             vn_rele+0x24            
   92   3%  83% 1.00       19 0xf5c732a8             spec_close+0x104        
   84   3%  86% 1.00        1 0xf5b0711c             sfind+0xc8              
   57   2%  87% 1.00        1 0xf5c99338             vn_rele+0x24            
   53   2%  89% 1.00       54 0xf5b5b2f8             sigprocmask+0xa0        
   46   1%  90% 1.00        1 0xf5b1af00             dnlc_lookup+0x12c       
   38   1%  91% 1.00        1 0xf5b1af00             lookuppn+0xbc           
   30   1%  92% 1.00        2 0xf5c18478             dnlc_lookup+0x9c        
   28   1%  93% 1.00       33 0xf5b5b4a0             sigprocmask+0xa0        
   24   1%  94% 1.00        4 0xf5c120b8             dnlc_lookup+0x9c        
   20   1%  95% 1.00        1 0xf5915c58             vn_rele+0x24            
   19   1%  95% 1.00        1 0xf5af1e58             ufs_lockfs_begin+0x38   
   16   0%  96% 1.00        1 0xf5915c58             dnlc_lookup+0x12c       
   16   0%  96% 1.00        1 0xf5c732a8             spec_close+0x7c         
   15   0%  97% 1.00        1 0xf5915b08             vn_rele+0x24            
   15   0%  97% 1.00       17 kmem_async_lock        kmem_async_dispatch+0xc 
   15   0%  97% 1.00        1 0xf5af1e58             ufs_lockfs_end+0x24     
   12   0%  98% 1.00        1 0xf5c99338             dnlc_lookup+0x12c       
   11   0%  98% 1.00        1 0xf5c8ab10             vn_rele+0x24            
    9   0%  98% 1.00        1 0xf5b0711c             vn_rele+0x24            
    8   0%  99% 1.00        3 0xf5c0c268             dnlc_lookup+0x9c        
    8   0%  99% 1.00        5 kmem_async_lock        kmem_async_thread+0x290 
    7   0%  99% 1.00        2 0xf5c11128             dnlc_lookup+0x9c        
    7   0%  99% 1.00        1 0xf5b1a720             vn_rele+0x24            
    5   0%  99% 1.00        2 0xf5b5b2f8             clock+0x308             
    5   0% 100% 1.00        1 0xf5b1a720             dnlc_lookup+0x12c       
    3   0% 100% 1.00        1 0xf5915b08             dnlc_lookup+0x12c       
    2   0% 100% 1.00        1 0xf5b5b4a0             clock+0x308             
    2   0% 100% 1.00        1 0xf5c149b8             dnlc_lookup+0x9c        
    2   0% 100% 1.00        1 0xf5c8ab10             dnlc_lookup+0x12c       
    1   0% 100% 1.00        1 0xf5b183d0             esp_poll_loop+0xcc      
    1   0% 100% 1.00        1 plocks+0x110           polldel+0x28            
    1   0% 100% 1.00        2 kmem_async_lock        kmem_async_thread+0x1ec 
    1   0% 100% 1.00        6 mml_table+0x18         srmmu_mlist_enter+0x18  
    1   0% 100% 1.00        3 mml_table+0x10         srmmu_mlist_enter+0x18  
-------------------------------------------------------------------------------

Adaptive mutex block: 24 events

Count indv cuml rcnt     nsec Lock                   Caller                  
-------------------------------------------------------------------------------
    3  12%  12% 1.00    90333 0xf5b3c738             ddi_prop_search_common+0x50
    3  12%  25% 1.00   235500 flock_lock             cleanlocks+0x10         
    2   8%  33% 1.00   112250 0xf5c138e8             dnlc_lookup+0x9c        
    2   8%  42% 1.00   281250 stable_lock            specvp+0x8c             
    2   8%  50% 1.00   107000 0xf597aab0          ddi_hold_installed_driver+0x2c
    2   8%  58% 1.00    89750 0xf5b5b2f8             clock+0x308             
    1   4%  63% 1.00    92000 0xf5c120b8             dnlc_lookup+0x9c        
    1   4%  67% 1.00   174000 0xf5b0711c             sfind+0xc8              
    1   4%  71% 1.00   238500 stable_lock            stillreferenced+0x8     
    1   4%  75% 1.00   143500 0xf5b17ab8             esp_poll_loop+0x8c      
    1   4%  79% 1.00    44500 0xf5b183d0             esp_poll_loop+0xcc      
    1   4%  83% 1.00    81000 0xf5c18478             dnlc_lookup+0x9c        
    1   4%  88% 1.00   413000 0xf5c0c268             dnlc_lookup+0x9c        
    1   4%  92% 1.00  1167000 0xf5c46910             cleanlocks+0x50         
    1   4%  96% 1.00    98500 0xf5c732a8             spec_close+0x7c         
    1   4% 100% 1.00    94000 0xf5b5b4a0             clock+0x308             
-------------------------------------------------------------------------------

Spin lock spin: 254 events

Count indv cuml rcnt     spin Lock                   Caller                  
-------------------------------------------------------------------------------
   48  19%  19% 1.00        3 atomic_lock+0x29e      rw_exit+0x34            
   45  18%  37% 1.00       25 atomic_lock+0x24e      rw_exit+0x34            
   44  17%  54% 1.00        1 atomic_lock+0x29d      rw_exit+0x34            
   35  14%  68% 1.00        1 atomic_lock+0x24e      rw_enter+0x34           
   32  13%  80% 1.00       19 cpus+0x44              disp+0x78               
   22   9%  89% 1.00        1 atomic_lock+0x29d      rw_enter+0x34           
   17   7%  96% 1.00       52 cpu[2]+0x44            disp+0x78               
    6   2%  98% 1.00       10 cpu[2]+0x44            setbackdq+0x15c         
    5   2% 100% 1.00        1 atomic_lock+0x29e      rw_enter+0x34           
-------------------------------------------------------------------------------

Thread lock spin: 3 events

Count indv cuml rcnt     spin Lock                   Caller                  
-------------------------------------------------------------------------------
    3 100% 100% 1.00       19 0xf68ecf5b             ts_tick+0x8             
-------------------------------------------------------------------------------

Monitoring using mpstat
The mpstat command ships with Solaris 2 and shows each CPU individually, one line per CPU, as shown below.

CPU minf mjf xcal  intr ithr  csw icsw migr smtx  srw syscl  usr sys  wt idl
  0    0   0    0     1    0  922    1    1  312    0   691    1  11   0  89
  1    0   0  121     3    0 1638    2    0  531    0  1294    1  21   0  78
  8   56   0    8     5    3 1018    1    1  364    0   940    1  15   0  84
  9   32   0  316   213   13 1082    2    2  373    0   906    1  16   0  83
 12   37   0   19     1    0  723    1    1  253    0   598    0  11   0  89
 13   18   0   25     3    0 1734    3    1  552    0  1318    0  22   0  77
 16    0   0    4  2969 2901   36    0    1 2470    0    30    0   0   0 100
 20    0   0    0     2    0 1547    2    0  531    0  1138    0  20   0  80
 21    0   0    0     2    2   25    0    0    1    0    11    0   0   0 100

If you watch the numbers carefully you will start to see some patterns. One thing that is apparent is that interrupts are not evenly distributed over all the CPUs. This is because Solaris binds each source of interrupts to a different CPU. This keeps related data structures local to a single CPU cache, and is more efficient than sending interrupts to CPUs at random. The only downside is when too many interrupts beat on one CPU and it can't cope with the load. This can happen with slower processors and Gigabit Ethernet or Quad Fast Ethernet cards. Device interrupt time cannot be seen by the system, since device interrupts are higher priority than the clock interrupt. So, you may see a CPU with a very high interrupt rate, but less CPU usage than you might expect. In our example, CPU 16 is taking interrupts from a Gigabit Ethernet saturation test. It is showing no user or system CPU time although it is probably 100 percent busy running interrupts, and has lots of interrupt threads and a high number of mutex stalls, shown as smtx.

Another thing you may see is high levels of inter-processor cross-calls. If you are running Solaris 2.5.1 with kernel patch 103640, a cross-call-related bug was introduced in 103640-07 and fixed in 103640-13. These patches are quite old, so it is worth upgrading the patch level to 103640-20 or later. Here is some example mpstat output showing the system before and after the problem was found and fixed.

Figure 1	mpstat before
CPU minf mjf xcal  intr ithr  csw icsw migr smtx  srw syscl  usr sys  wt idl
  0  119   0 25734   271   70   51    2    1    2    0   360    4   8   0  87
  4   39   0 10373     3    1   62    0    1    2    0   263    1   4   0  95
  8   25   0 10363     4    2   34    0    1    1    0    56    0   2   0  97
 12  141   0 7072      3    1   35    1    1    2    0   208    1   5   0  94

Figure 2	mpstat after
CPU minf mjf xcal  intr ithr  csw icsw migr smtx  srw syscl  usr sys  wt idl
  0  445   0  184   785  568  280    6   10   15    0  1531    8  10  14  68
  4  435   0  169    16    3  391    7   12   13    0  2093    9   7  15  68
  8  337   0  531     7    2  310    5   11   12    0  1623    8   6  16  71
 12  335   0  153    12    8  273    5   10   10    0  1300    6   6  15  74

Mutex contention
You should look at the smtx level (number of mutex stalls) in conjunction with the user and system CPU time. If there appears to be a correlation between high system time and high smtx levels, then you have some contention.

My original mutex contention rule, used in SE3.0, took the total of smtx over all CPUs and divided by the number of CPUs to get an average rate. It then raised a flag if the rate was over 300 or 500 per second per CPU. With the latest CPUs, these levels are too low, and the rule is too simplistic. It is important to look at each CPU individually, and to look at the CPU user and system time as well. So, I have made a new rule:

For each CPU, if smtx is less than 500, or system time is less than 20 percent, there is no problem. If smtx is over 500, and system time is over 20 percent, then I look at the ratio of user-to-system time. If user time is greater than system time it is only a warning (amber) level. If system time is higher, it is a problem (red) level.
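
Expressed as code (the real rule is written in the SE toolkit's own C-like language; this is just a sketch of the logic, with the thresholds and state names taken from the description above):

/*
 * Sketch of the per-CPU mutex rule described above; the thresholds and
 * state names follow the text, not the actual SE toolkit source.
 */
typedef enum { STATE_GREEN, STATE_AMBER, STATE_RED } mutex_state_t;

#define SMTX_LIMIT  500                 /* mutex stalls per second per CPU */
#define SYS_LIMIT   20.0                /* percent system CPU time */

mutex_state_t
mutex_rule(double smtx_per_sec, double usr_pct, double sys_pct)
{
    if (smtx_per_sec < SMTX_LIMIT || sys_pct < SYS_LIMIT)
        return (STATE_GREEN);           /* no problem */
    if (usr_pct > sys_pct)
        return (STATE_AMBER);           /* warning: user time still dominates */
    return (STATE_RED);                 /* problem: system time dominates */
}

The check is applied to each CPU for every measurement interval, and the per-CPU states appear in brackets in the live_test.se output below.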

The mutex rule has been upgraded to include an array of states -- one per CPU -- and maintains the old interface so that existing tools that use it do not need to be modified. I upgraded the test programs pure_test.se and live_test.se to check that the rule is working properly, and live_test.se now prints out all the per-CPU data. The GUI-based zoom.se could also be extended to show per-CPU data, but it works with the new rule with no changes. I'm pleased that I could make this change with no impact on existing tools. All the rules were deliberately constructed in an object-oriented fashion to hide their implementation details and allow this kind of change.

% /opt/RICHPse/bin/se live_test.se
...
Overall current state is green at: Wed Jul 29 00:12:12 1998
Subsystem  State Action
Disk       white No activity
   c1t10d0 white   c1t11d0 white
Networks   white No activity
      hme0 white      hme1 white      hme2 white      hme3 white      hme4 white
      hme5 white      hme6 white      hme7 white      hme8 white
NFS client white No client NFS/RPC activity
Swap space white There is a lot of unused swap space
RAM demand white RAM available
Kernel mem white No worries, mate
CPU power  white CPU idling
Mutex      green No worries, mate
No significant kernel contention
[cpu0     39 green] [cpu1      8 green] [cpu8      0 white] [cpu9      3 green] [cpu12    11 green] 
[cpu13     0 white] [cpu16   155 green] [cpu20    37 green] [cpu21    34 green] 
DNLC       white No activity
Inode      white No activity
TCP        green No activity

Overall current state is red   at: Wed Jul 29 00:12:32 1998
Subsystem  State Action
Disk       white No activity
   c1t10d0 white   c1t11d0 white
Networks   white No activity
      hme0 white      hme1 white      hme2 white      hme3 white      hme4 white
      hme5 white      hme6 white      hme7 white      hme8 white
NFS client white No client NFS/RPC activity
Swap space white There is a lot of unused swap space
RAM demand white RAM available
Kernel mem white No worries, mate
CPU power  white CPU idling
Mutex      red   MP Kernel overloaded
Serious mutex contention, multiprocessor kernel overloaded,
	use the lockstat(1M) command to diagnose the cause,
	faster CPUs will help, more CPUs will make it worse
[cpu0    350 green] [cpu1    441 green] [cpu8    416 green] [cpu9    637 red  ] [cpu12   506 green] 
[cpu13   335 green] [cpu16  5513 green] [cpu20   489 green] [cpu21   161 green] 
DNLC       white No activity
Inode      white No activity
TCP        green No activity

Wrap up
The new rules code will be added to the SE patch that is available from the SE download page. I've been meaning to update this rule for a very long time, and I hope to get around to updating some of the other rules soon. While the rules are better than those found in most tools, they could still be improved. If you try this updated rule, let me know how well it works for you -- it may be necessary to fine-tune the default threshold values to prevent false alarms.

I'd also like to take this opportunity to mention a training course that I'm co-presenting with Dr. Neil Gunther on "Practical Performance Methods." It is taking place at Stanford University, California, August 17-21, 1998, as part of the Western Institute of Computer Science (WICS) summer seminar series. More details and a draft agenda can be found at http://wics.stanford.edu/performance.html and at http://members.aol.com/CoDynamo/Training.wics.htm.




Resources

Other Cockcroft columns at www.sun.com

About the author
Adrian Cockcroft joined Sun Microsystems in 1988, and currently works as a performance specialist for the Server Division of SMCC. He wrote Sun Performance and Tuning: SPARC and Solaris and Sun Performance and Tuning -- Java and the Internet, both published by SunSoft Press PTR Prentice Hall. Reach Adrian at adrian.cockcroft@sunworld.com.

