Unlocking the kernel
How you can monitor mutex contention, and decide what to do when it becomes a problem
For many years, developers have worked to make Solaris the most scalable shared-memory multiprocessor operating system. For many applications, contention for locks inside the kernel is not an issue -- however, extreme workloads and new kinds of applications can still cause problems. It is possible to monitor lock contention and determine whether an application is causing it. This monitoring will also help you decide whether it's appropriate to scale a workload by adding more CPUs, or to upgrade to a more recent Solaris release. (3,500 words)
Q: I can see mutex contention with mpstat, but what level constitutes a problem -- and what can I do about it?
A: To answer this question I have taken an excerpt from the second edition of my book, Sun Performance and Tuning -- Java and the Internet (©Sun Press/PTR Prentice Hall 1998), that explains the issues and how to analyze them. (The text from the subhead "Critical regions" up to the subhead "Monitoring using mpstat" is excerpted from my book). I have also taken the mutex rule in the SE toolkit and upgraded it to be a much more sophisticated check; this is explained in detail and will be added to the SE3.0 patch.
Critical regions
The Unix kernel has many critical regions, or sections of code,
where a data structure is being created or updated. These regions
must not be interrupted by a higher-priority interrupt service
routine. A uniprocessor Unix kernel manages these regions by setting
the interrupt mask to a high value while executing in the region. On
a multiprocessor, there are other processors with their own
interrupt masks, so a different technique must be used to manage
critical regions.
The spin lock or mutex
One key capability in shared-memory multiprocessor systems is the
ability to perform interprocessor synchronization by means of atomic
load/store or swap instructions. All SPARC chips have an instruction
called LDSTUB, which means load-store-unsigned-byte. The instruction
reads a byte from memory into a register, then writes 0xFF into
memory in a single, indivisible operation. The value in the register
can then be examined to see if it was already 0xFF (which means that
another processor got there first), or if it was 0x00 (which means
that this processor is in charge). This instruction is used to make
mutual exclusion locks (known as mutexes) that make sure
only one processor at a time can hold the lock. The lock is acquired
through LDSTUB and cleared by storing 0x00 back to memory.
If a processor does not get the lock, then it may decide to spin by sitting in a loop and testing the lock until it becomes available. By checking with a normal load instruction in a loop before issuing an LDSTUB, the processor performs the spin within the cache, and the bus snooping logic watches for the lock being cleared. In this way, spinning causes no bus traffic -- so processors that are waiting do not slow down those that are working. A spin lock is appropriate when the wait is expected to be short. If a long wait is expected, the process should sleep for a while so that a different job can be scheduled onto the CPU. The extra SPARC V9 instructions introduced with UltraSPARC include an atomic compare-and-swap operation, which is used by the UltraSPARC-specific version of the kernel to make locking more efficient.
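To make the spin-within-the-cache idea concrete, here is a minimal user-level sketch of a test-and-test-and-set spin lock in C. This is not the Solaris kernel code: it stands in the C11 atomic_exchange operation for the SPARC LDSTUB instruction, and the names are my own, but it follows the same convention of 0x00 for free and 0xFF for held.

#include <stdatomic.h>

/* 0x00 means the lock is free, 0xFF means it is held,
 * mirroring the LDSTUB convention described above. */
typedef struct {
    atomic_uchar byte;
} spin_mutex_t;

static void spin_mutex_init(spin_mutex_t *m)
{
    atomic_init(&m->byte, 0x00);
}

static void spin_mutex_lock(spin_mutex_t *m)
{
    for (;;) {
        /* Atomic swap: like LDSTUB, read the old value and write 0xFF. */
        if (atomic_exchange(&m->byte, 0xFF) == 0x00)
            return;            /* it was free -- this processor is in charge */

        /* Spin with ordinary loads so the wait stays in the cache and
         * generates no bus traffic until the holder clears the byte. */
        while (atomic_load_explicit(&m->byte, memory_order_relaxed) != 0x00)
            ;
    }
}

static void spin_mutex_unlock(spin_mutex_t *m)
{
    /* The lock is cleared by storing 0x00 back to memory. */
    atomic_store(&m->byte, 0x00);
}

A real Solaris adaptive mutex goes one step further: it spins only while the lock holder is running on another CPU and blocks the thread otherwise, which is why lockstat reports both "spin" and "block" events for adaptive mutexes.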
Code locking
The simplest way to convert a Unix kernel that is using interrupt
levels to control critical regions for use with multiprocessors is
to replace the call that sets interrupt levels high with a call to
acquire a mutex lock. At the point where the interrupt level was
lowered, the lock is cleared. In this way, the same regions of code
are locked for exclusive access. This method has been used to a
greater or lesser extent by most MP Unix implementations, including
SunOS 4 on the SPARCserver 600MP machines. The amount of actual
concurrency that can take place in the kernel is controlled by the
number and position of the locks.
In SunOS 4 there is effectively a single lock around the entire kernel. The reason for using a single lock is to make these MP systems totally compatible with user programs and device drivers written for uniprocessor systems. User programs can take advantage of the extra CPUs, but only one CPU at a time can be executing kernel code.
When code locking is used, there are a fixed number of locks in the system. This number can be used to characterize how much concurrency is available. On a very busy, highly configured system, the code locks are likely to become bottlenecks. In this case, adding extra processors will not help performance, and may actually reduce it.
Data locking and Solaris 2
The problem with code locks is that different processors often want
to use the same code to work on different data. To allow this use,
locks must be placed in data structures rather than in code.
Unfortunately, this solution requires an extensive rewrite of the
kernel -- one reason why Solaris 2 took several years to create. The
result is that the kernel has a lot of concurrency available, and can
be tuned to scale well with large numbers of processors. The same
kernel is used on uniprocessors and multiprocessors, so that all
device drivers and user programs must be written to work in the same
environment, and there is no need to constrain concurrency for
compatibility with uniprocessor systems. The locks are still needed
in a uniprocessor because the kernel can switch between kernel
threads at any time to service an interrupt.
With data locking, there is a lock for each instance of a data structure. Since table sizes vary dynamically, the total number of locks grows as the tables grow -- and the amount of concurrency that is available to exploit is greater on a very busy, highly configured system. Adding extra processors to such a system is likely to be beneficial. Solaris 2 has hundreds of different data locks, and multiplied by the number of instances of the data, there will typically be many thousand locks in existence on a running system. As Solaris 2 is tuned for more concurrency, some of the remaining code locks are turned into data locks, and "large" locks are broken down into finer-grained locks. There is a trade-off between having lots of mutexes for good MP scalability and few mutexes for reduced CPU overhead on uniprocessors.
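As a sketch of the difference, the fragment below embeds a lock in each instance of a made-up file_entry structure, alongside a single global lock of the kind a code-locked kernel would use. It is written with POSIX threads purely for illustration -- the kernel uses its own mutex primitives, and these names are invented -- but the pattern is the point: with data locking, two processors working on different entries never touch the same lock.

#include <pthread.h>

/* Code locking: one global lock guards the whole table, so only one
 * CPU at a time can be executing any of the code that uses it. */
pthread_mutex_t file_table_lock = PTHREAD_MUTEX_INITIALIZER;

/* Data locking: every instance carries its own lock, so the number of
 * locks grows with the table and unrelated updates run in parallel. */
struct file_entry {
    pthread_mutex_t lock;          /* per-instance lock */
    int             refcount;
    /* ... other per-file fields ... */
};

void file_entry_init(struct file_entry *fp)
{
    pthread_mutex_init(&fp->lock, NULL);
    fp->refcount = 0;
}

void file_entry_hold(struct file_entry *fp)
{
    pthread_mutex_lock(&fp->lock); /* contends only on this one entry */
    fp->refcount++;
    pthread_mutex_unlock(&fp->lock);
}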
Monitoring Solaris 2.6 lock statistics
In Solaris 2.6, the lockstat(1)
command offers a kernel
lock monitoring capability that is very powerful and easy to use.
The kernel can dynamically change its lock code to start collecting
information, and return to the normal, highly-optimized lock code
when the collection is complete. There is some extra overhead while
the locks are instrumented, but useful measurements can be taken in
a few seconds without disruption of the work on production machines.
Data is presented clearly with several summary options. However,
you still must have a very good understanding of
Unix kernel architecture -- and Solaris 2 in particular -- to really
understand what is happening.
You should read the lockstat(1) manual page to see all of its options -- but in normal operation, it is run by the root user for the duration of a process for which you want to monitor the effects. If you want to run it for an interval, use the sleep command. I had trouble finding an example that would cause lock contention in Solaris 2.6. Eventually, I found that putting a C shell into an infinite loop, as shown below, generates 120,000 or more system calls per second, duplicating and closing file descriptors. When two shells are run at the same time on a dual-CPU system, mpstat shows that hundreds of mutex spins per second occur on each CPU. This test is not intended to be serious! The system call mix was monitored with truss -c, as shown in Code Example 1. With truss running, the system call rate dropped to about 12,000 calls per second, so I ran lockstat separately from truss.
Code Example 1

% while 1
? end

% truss -c -p 351
^C
syscall        seconds   calls  errors
open               .26    2868
close              .43   17209
getpid             .44   17208
dup                .31   14341
sigprocmask        .65   28690
              --------  ------    ----
sys totals:       2.09   80316       0
usr time:          .39
elapsed:          6.51
The purpose of this experiment was to see what system calls were causing the contention. It appears to be a combination of dup and close, and possibly sigprocmask as well. Although there are a lot of calls to getpid, they should not cause any contention.
The example output shows that there were 3,318 adaptive mutex spins in 5 seconds, which is 663 mutex stalls per second in total and matches the data reported by mpstat. The locks and callers are shown; the top lock seems to be flock_lock, which is a file-related lock. Some other locks that are not named in this display may be members of hashed lock groups. Several of the calling routines mention file or filesystem operations and sigprocmask, which ties up with the system call trace.
If you see high levels of mutex contention, you need to identify both the locks that are being contended on and the component of the workload that is causing the contention -- for example, system calls or a network protocol stack. You can then try to change the workload to avoid the problem. You should also make a service call to Sun to report the contention. As Sun pushes multiprocessor scalability to 64 processors and beyond, there are bound to be new workloads and new areas of contention that do not occur on smaller configurations. Each new release of Solaris 2 further reduces contention to allow even higher scalability and new workloads. There is no point reporting high mutex contention levels in anything but the latest release of Solaris 2.
Code Example 2
# lockstat sleep 5

Adaptive mutex spin: 3318 events

Count indv cuml rcnt     spin Lock                Caller
-------------------------------------------------------------------------------
  601  18%  18% 1.00        1 flock_lock          cleanlocks+0x10
  302   9%  27% 1.00        7 0xf597aab0          dev_get_dev_info+0x4c
  251   8%  35% 1.00        1 0xf597aab0          mod_rele_dev_by_major+0x2c
  245   7%  42% 1.00        3 0xf597aab0          cdev_size+0x74
  160   5%  47% 1.00        7 0xf5b3c738          ddi_prop_search_common+0x50
  157   5%  52% 1.00        1 0xf597aab0          ddi_hold_installed_driver+0x2c
  141   4%  56% 1.00        2 0xf5c138e8          dnlc_lookup+0x9c
  129   4%  60% 1.00        6 0xf5b1a790          ufs_readlink+0x120
  128   4%  64% 1.00        1 0xf5c46910          cleanlocks+0x50
  118   4%  67% 1.00        7 0xf5b1a790          ufs_readlink+0xbc
  117   4%  71% 1.00        6 stable_lock         specvp+0x8c
  111   3%  74% 1.00        1 0xf5c732a8          spec_open+0x54
  107   3%  77% 1.00        3 stable_lock         stillreferenced+0x8
   97   3%  80% 1.00        1 0xf5b1af00          vn_rele+0x24
   92   3%  83% 1.00       19 0xf5c732a8          spec_close+0x104
   84   3%  86% 1.00        1 0xf5b0711c          sfind+0xc8
   57   2%  87% 1.00        1 0xf5c99338          vn_rele+0x24
   53   2%  89% 1.00       54 0xf5b5b2f8          sigprocmask+0xa0
   46   1%  90% 1.00        1 0xf5b1af00          dnlc_lookup+0x12c
   38   1%  91% 1.00        1 0xf5b1af00          lookuppn+0xbc
   30   1%  92% 1.00        2 0xf5c18478          dnlc_lookup+0x9c
   28   1%  93% 1.00       33 0xf5b5b4a0          sigprocmask+0xa0
   24   1%  94% 1.00        4 0xf5c120b8          dnlc_lookup+0x9c
   20   1%  95% 1.00        1 0xf5915c58          vn_rele+0x24
   19   1%  95% 1.00        1 0xf5af1e58          ufs_lockfs_begin+0x38
   16   0%  96% 1.00        1 0xf5915c58          dnlc_lookup+0x12c
   16   0%  96% 1.00        1 0xf5c732a8          spec_close+0x7c
   15   0%  97% 1.00        1 0xf5915b08          vn_rele+0x24
   15   0%  97% 1.00       17 kmem_async_lock     kmem_async_dispatch+0xc
   15   0%  97% 1.00        1 0xf5af1e58          ufs_lockfs_end+0x24
   12   0%  98% 1.00        1 0xf5c99338          dnlc_lookup+0x12c
   11   0%  98% 1.00        1 0xf5c8ab10          vn_rele+0x24
    9   0%  98% 1.00        1 0xf5b0711c          vn_rele+0x24
    8   0%  99% 1.00        3 0xf5c0c268          dnlc_lookup+0x9c
    8   0%  99% 1.00        5 kmem_async_lock     kmem_async_thread+0x290
    7   0%  99% 1.00        2 0xf5c11128          dnlc_lookup+0x9c
    7   0%  99% 1.00        1 0xf5b1a720          vn_rele+0x24
    5   0%  99% 1.00        2 0xf5b5b2f8          clock+0x308
    5   0% 100% 1.00        1 0xf5b1a720          dnlc_lookup+0x12c
    3   0% 100% 1.00        1 0xf5915b08          dnlc_lookup+0x12c
    2   0% 100% 1.00        1 0xf5b5b4a0          clock+0x308
    2   0% 100% 1.00        1 0xf5c149b8          dnlc_lookup+0x9c
    2   0% 100% 1.00        1 0xf5c8ab10          dnlc_lookup+0x12c
    1   0% 100% 1.00        1 0xf5b183d0          esp_poll_loop+0xcc
    1   0% 100% 1.00        1 plocks+0x110        polldel+0x28
    1   0% 100% 1.00        2 kmem_async_lock     kmem_async_thread+0x1ec
    1   0% 100% 1.00        6 mml_table+0x18      srmmu_mlist_enter+0x18
    1   0% 100% 1.00        3 mml_table+0x10      srmmu_mlist_enter+0x18
-------------------------------------------------------------------------------

Adaptive mutex block: 24 events

Count indv cuml rcnt     nsec Lock                Caller
-------------------------------------------------------------------------------
    3  12%  12% 1.00    90333 0xf5b3c738          ddi_prop_search_common+0x50
    3  12%  25% 1.00   235500 flock_lock          cleanlocks+0x10
    2   8%  33% 1.00   112250 0xf5c138e8          dnlc_lookup+0x9c
    2   8%  42% 1.00   281250 stable_lock         specvp+0x8c
    2   8%  50% 1.00   107000 0xf597aab0          ddi_hold_installed_driver+0x2c
    2   8%  58% 1.00    89750 0xf5b5b2f8          clock+0x308
    1   4%  63% 1.00    92000 0xf5c120b8          dnlc_lookup+0x9c
    1   4%  67% 1.00   174000 0xf5b0711c          sfind+0xc8
    1   4%  71% 1.00   238500 stable_lock         stillreferenced+0x8
    1   4%  75% 1.00   143500 0xf5b17ab8          esp_poll_loop+0x8c
    1   4%  79% 1.00    44500 0xf5b183d0          esp_poll_loop+0xcc
    1   4%  83% 1.00    81000 0xf5c18478          dnlc_lookup+0x9c
    1   4%  88% 1.00   413000 0xf5c0c268          dnlc_lookup+0x9c
    1   4%  92% 1.00  1167000 0xf5c46910          cleanlocks+0x50
    1   4%  96% 1.00    98500 0xf5c732a8          spec_close+0x7c
    1   4% 100% 1.00    94000 0xf5b5b4a0          clock+0x308
-------------------------------------------------------------------------------

Spin lock spin: 254 events

Count indv cuml rcnt     spin Lock                Caller
-------------------------------------------------------------------------------
   48  19%  19% 1.00        3 atomic_lock+0x29e   rw_exit+0x34
   45  18%  37% 1.00       25 atomic_lock+0x24e   rw_exit+0x34
   44  17%  54% 1.00        1 atomic_lock+0x29d   rw_exit+0x34
   35  14%  68% 1.00        1 atomic_lock+0x24e   rw_enter+0x34
   32  13%  80% 1.00       19 cpus+0x44           disp+0x78
   22   9%  89% 1.00        1 atomic_lock+0x29d   rw_enter+0x34
   17   7%  96% 1.00       52 cpu[2]+0x44         disp+0x78
    6   2%  98% 1.00       10 cpu[2]+0x44         setbackdq+0x15c
    5   2% 100% 1.00        1 atomic_lock+0x29e   rw_enter+0x34
-------------------------------------------------------------------------------

Thread lock spin: 3 events

Count indv cuml rcnt     spin Lock                Caller
-------------------------------------------------------------------------------
    3 100% 100% 1.00       19 0xf68ecf5b          ts_tick+0x8
-------------------------------------------------------------------------------
Monitoring using mpstat
There is a command called mpstat that ships with Solaris 2. It shows each CPU individually, one per line as shown below.
CPU minf mjf xcal  intr ithr  csw icsw migr smtx  srw syscl  usr sys  wt idl
  0    0   0    0     1    0  922    1    1  312    0   691    1  11   0  89
  1    0   0  121     3    0 1638    2    0  531    0  1294    1  21   0  78
  8   56   0    8     5    3 1018    1    1  364    0   940    1  15   0  84
  9   32   0  316   213   13 1082    2    2  373    0   906    1  16   0  83
 12   37   0   19     1    0  723    1    1  253    0   598    0  11   0  89
 13   18   0   25     3    0 1734    3    1  552    0  1318    0  22   0  77
 16    0   0    4  2969 2901   36    0    1 2470    0    30    0   0   0 100
 20    0   0    0     2    0 1547    2    0  531    0  1138    0  20   0  80
 21    0   0    0     2    2   25    0    0    1    0    11    0   0   0 100
If you watch the numbers carefully you will start to see some patterns. One thing that is apparent is that interrupts are not evenly distributed over all the CPUs. This is because Solaris binds each source of interrupts to a different CPU. This keeps related data structures local to a single CPU cache, and is more efficient than sending interrupts to CPUs at random. The only downside is when too many interrupts beat on one CPU and it can't cope with the load. This can happen with slower processors and Gigabit Ethernet or Quad Fast Ethernet cards. Device interrupt time cannot be seen by the system, since device interrupts are higher priority than the clock interrupt. So, you may see a CPU with a very high interrupt rate, but less CPU usage than you might expect. In our example, CPU 16 is taking interrupts from a Gigabit Ethernet saturation test. It is showing no user or system CPU time although it is probably 100 percent busy running interrupts, and has lots of interrupt threads and a high number of mutex stalls, shown as smtx.
Another thing you may see is high levels of inter-processor cross-calls. If you are running Solaris 2.5.1 with kernel patch 103640, a cross-call-related bug was introduced in 103640-07 and fixed in 103640-13. These patches are quite old, so it is worth upgrading the patch level to 103640-20 or later. Here is some example mpstat output showing the system before and after the problem was found and fixed.
Figure 1  mpstat before

CPU minf mjf  xcal intr ithr  csw icsw migr smtx  srw syscl  usr sys  wt idl
  0  119   0 25734  271   70   51    2    1    2    0   360    4   8   0  87
  4   39   0 10373    3    1   62    0    1    2    0   263    1   4   0  95
  8   25   0 10363    4    2   34    0    1    1    0    56    0   2   0  97
 12  141   0  7072    3    1   35    1    1    2    0   208    1   5   0  94

Figure 2  mpstat after

CPU minf mjf  xcal intr ithr  csw icsw migr smtx  srw syscl  usr sys  wt idl
  0  445   0   184  785  568  280    6   10   15    0  1531    8  10  14  68
  4  435   0   169   16    3  391    7   12   13    0  2093    9   7  15  68
  8  337   0   531    7    2  310    5   11   12    0  1623    8   6  16  71
 12  335   0   153   12    8  273    5   10   10    0  1300    6   6  15  74
Mutex contention
You should look at the smtx level (number of mutex stalls)
in conjunction with the user and system CPU time. If there appears
to be a correlation between high system time and high smtx levels,
then you have some contention.
My original mutex contention rule, used in SE3.0, took the total of smtx over all CPUs and divided by the number of CPUs to get an average rate. It then raised a flag if the rate was over 300 or 500 per second per CPU. With the latest CPUs, these levels are too low, and the rule is too simplistic. It is important to look at each CPU individually, and to look at the CPU user and system time as well. So, I have made a new rule:
For each CPU, if smtx is less than 500, or system time is less than 20 percent, there is no problem. If smtx is over 500, and system time is over 20 percent, then I look at the ratio of user-to-system time. If user time is greater than system time it is only a warning (amber) level. If system time is higher, it is a problem (red) level.
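Expressed as code, the per-CPU check looks something like this. It is a plain C sketch of the logic just described, not the actual SE toolkit rule (which is written in the SE language); the thresholds are the ones given above, and the type and function names are invented for illustration.

/* Per-CPU mutex contention rule -- sketch only. */
typedef enum { STATE_WHITE, STATE_GREEN, STATE_AMBER, STATE_RED } state_t;

typedef struct {
    double smtx_per_sec;   /* mutex stalls per second (mpstat smtx) */
    double usr_pct;        /* user CPU time, percent */
    double sys_pct;        /* system CPU time, percent */
} cpu_stats_t;

#define SMTX_LIMIT 500.0   /* stalls per second per CPU */
#define SYS_LIMIT   20.0   /* system time, percent */

state_t mutex_rule(const cpu_stats_t *c)
{
    if (c->smtx_per_sec == 0.0)
        return STATE_WHITE;            /* no stalls at all: nothing to report */

    /* Low stall rate or low system time: no problem. */
    if (c->smtx_per_sec < SMTX_LIMIT || c->sys_pct < SYS_LIMIT)
        return STATE_GREEN;

    /* High stalls and high system time: compare user and system time. */
    if (c->usr_pct > c->sys_pct)
        return STATE_AMBER;            /* warning: useful work still dominates */

    return STATE_RED;                  /* system time dominates: real contention */
}

Note that a high smtx count alone is not enough to trigger the amber or red states; the system time condition has to hold as well. That is why, in the live_test.se output below, cpu16 stays green despite a very high stall rate while cpu9 goes red.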
The mutex rule has been upgraded to include an array of states -- one per CPU -- and maintains the old interface so that existing tools that use it do not need to be modified. I upgraded the test programs pure_test.se and live_test.se to check that the rule is working properly, and to print out all the per-CPU data using live_test.se. The GUI-based zoom.se could also be extended to show per-CPU data, but it works with the new rule with no changes. I'm pleased that I could make this change with no impact on existing tools. All the rules were deliberately constructed in an object-oriented fashion to hide their implementation details and allow this kind of change.
% /opt/RICHPse/bin/se live_test.se
...
Overall current state is green at: Wed Jul 29 00:12:12 1998

Subsystem     State   Action
Disk          white   No activity
  c1t10d0     white
  c1t11d0     white
Networks      white   No activity
  hme0        white
  hme1        white
  hme2        white
  hme3        white
  hme4        white
  hme5        white
  hme6        white
  hme7        white
  hme8        white
NFS client    white   No client NFS/RPC activity
Swap space    white   There is a lot of unused swap space
RAM demand    white   RAM available
Kernel mem    white   No worries, mate
CPU power     white   CPU idling
Mutex         green   No worries, mate
  No significant kernel contention
  [cpu0   39 green] [cpu1    8 green] [cpu8    0 white] [cpu9    3 green]
  [cpu12  11 green] [cpu13   0 white] [cpu16 155 green] [cpu20  37 green]
  [cpu21  34 green]
DNLC          white   No activity
Inode         white   No activity
TCP           green   No activity

Overall current state is red at: Wed Jul 29 00:12:32 1998

Subsystem     State   Action
Disk          white   No activity
  c1t10d0     white
  c1t11d0     white
Networks      white   No activity
  hme0        white
  hme1        white
  hme2        white
  hme3        white
  hme4        white
  hme5        white
  hme6        white
  hme7        white
  hme8        white
NFS client    white   No client NFS/RPC activity
Swap space    white   There is a lot of unused swap space
RAM demand    white   RAM available
Kernel mem    white   No worries, mate
CPU power     white   CPU idling
Mutex         red     MP Kernel overloaded
  Serious mutex contention, multiprocessor kernel overloaded, use the
  lockstat(1M) command to diagnose the cause, faster CPUs will help,
  more CPUs will make it worse
  [cpu0  350 green] [cpu1  441 green] [cpu8  416 green] [cpu9  637 red  ]
  [cpu12 506 green] [cpu13 335 green] [cpu16 5513 green] [cpu20 489 green]
  [cpu21 161 green]
DNLC          white   No activity
Inode         white   No activity
TCP           green   No activity
Wrap up
The new rules code will be added into the SE patch that is available from the SE download page. I've been meaning to update this rule for a very long time. I hope to get around to updating some of the other rules soon. While the rules are better than those found in most tools, they could be improved.
If you try this updated rule out, let me know how well it works for you -- it may be necessary to fine tune the default values of the thresholds to prevent false alarms.
I'd also like to take this opportunity to mention a training course that I'm co-presenting with Dr. Neil Gunther on "Practical Performance Methods." It is taking place at Stanford University, California, August 17-21 1998, as part of the Western Institute of Computer Science (WICS) summer seminar series. More details and a draft agenda can be found at http://wics.stanford.edu/performance.html and at http://members.aol.com/CoDynamo/Training.wics.htm.
About the author
Adrian Cockcroft joined Sun Microsystems in 1988,
and currently works as a performance specialist for the Server Division
of SMCC. He wrote
Sun Performance and Tuning: SPARC and Solaris and Sun Performance and Tuning -- Java and the Internet, both
published by SunSoft Press
PTR Prentice Hall.
Reach Adrian at adrian.cockcroft@sunworld.com.