Clearing up swap space confusion
Why don't swap space numbers add up?
Solaris 2 has a unique and very perplexing implementation of swap space, and those incorrectly labeled monitoring commands aren't helping matters any. It's time to put the pieces together and find out exactly what's going on. This month, Adrian offers a detailed explanation for why swap space numbers don't make sense. (2,200 words)
Q: The swap space numbers reported by swap -s and swap -l and sar and vmstat don't add up. Why not?
A: When I tried to summarize the way that swap space works for the second edition of my book, I discovered that I didn't really understand it myself. After looking back through my April 1996 Performance Q&A column, "How does swap space work?" and reading Inside Solaris columnist Jim Mauro's two-part look at swap space implementation (see Resources below), I decided that Jim didn't quite have the full picture either. After many hours of deep detective work I figured it all out, and discovered some minor bugs in the tools that had us both confused.
Since then I've seen this question many times. The answer is too complex to include in an e-mail, so I've decided to base this month's column on a section of my book. In the future, I'll be able to point readers with swap-space queries in the direction of this column.
For all practical purposes, the swapping out of entire processes can be ignored in Solaris 2, which no longer implements the time-based soft swap-outs that occur in SunOS 4. vmstat -s reports total numbers of swap-ins and swap-outs, and they're almost always zero. Prolonged memory shortages can, however, still trigger swap-outs of inactive processes. Swapping out idle processes helps the performance of machines with less than 32 megabytes (MB) of RAM. The number of idle swapped-out processes is reported as the swap queue length by vmstat. This measurement is not explained properly in the manual page, because it used to be the number of active swapped-out processes waiting to be swapped back in. As soon as a swapped-out process wakes up again, it swaps its basic data structures back into the kernel and pages in its code and data as they are accessed. This activity requires so little memory that it can always happen immediately.
Code Example 1
% vmstat 5
 procs     memory              page                  disk           faults       cpu
 r b w   swap  free  re  mf  pi  po  fr  de  sr  f0  s0  s1  s5   in   sy   cs us sy id
 ...
 0 0 4 314064  5728   0   7   2   1   1   0   0   0 177  91  22  132  514   94  1  0 98
If you come across a system with a non-zero swap queue reported by vmstat, take it as a sign that at some time in the past, free memory went low for long enough to trigger the swapping out of idle processes. This is the only useful conclusion you can draw from such a measure.
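As a rough illustration of that check, the sketch below pulls the w column out of a vmstat sample line and flags a non-zero swap queue. The function name swap_queue_length is made up for this example, and the header and sample strings are copied from Code Example 1 rather than read from a live system.

```python
def swap_queue_length(header: str, sample: str) -> int:
    """Return the value under the 'w' column of one vmstat sample line."""
    columns = header.split()
    values = sample.split()
    # 'w' appears only once in the header, so index() finds the right column.
    return int(values[columns.index("w")])


# Header and sample line taken from Code Example 1.
header = "r b w swap free re mf pi po fr de sr f0 s0 s1 s5 in sy cs us sy id"
sample = "0 0 4 314064 5728 0 7 2 1 1 0 0 0 177 91 22 132 514 94 1 0 98"

w = swap_queue_length(header, sample)
if w > 0:
    print(f"{w} idle processes are swapped out; free memory ran low at some point")
```

A non-zero result says nothing about current memory pressure, only that a shortage happened at some time in the past.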
Swap space operations
Swap space is really a misnomer for paging space, since almost all
accesses are page related. Swap space holds anonymous memory, so
called because it's not associated with a named file in the file
system. The kernel uses anon in the names of many of the
swap-related variables.
Swap space is allocated from spare RAM and from a swap disk. The measures provided are based on two sets of underlying numbers. One set relates to a physical swap disk, while the other set relates to RAM used as swap space by pages in memory.
Swap space is used in two stages: reservation and allocation. When memory is requested (for example, as a result of a malloc call), swap is reserved, and a mapping is made against the /dev/zero device. To start with, reservations are made against available disk-based swap. When disk swap is all used up, RAM is reserved. When these pages are first accessed, physical pages are obtained from the free list and filled with zeros, and the pages of swap become allocated rather than reserved. In effect, reservations are initially taken out of disk-based swap, but allocations are initially taken out of RAM-based swap. When a page of anonymous RAM is stolen by the page scanner, the data is written out to the swap space (i.e., the swap allocation is moved from memory to disk, and the memory is freed).
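The two stages can be sketched as a toy accounting model. This is not kernel code: the class SwapModel and its method names are invented for illustration, it only counts pages, and it does not model physical memory pressure (page-outs happen only when page_out is called).

```python
class SwapModel:
    """Toy page-count model of the reserve/allocate stages (not kernel code)."""

    def __init__(self, disk_pages, ram_pages):
        self.disk_pages = disk_pages  # disk-based swap
        self.ram_pages = ram_pages    # RAM that may be used as swap
        self.reserved = 0             # pages reserved, disk first and then RAM
        self.alloc_ram = 0            # allocated pages resident in RAM
        self.alloc_disk = 0           # allocated pages written out to swap disk

    def reserve(self, pages):
        """Stage 1: a request such as malloc reserves swap, or fails."""
        if self.reserved + pages > self.disk_pages + self.ram_pages:
            return False
        self.reserved += pages
        return True

    def touch(self, pages):
        """Stage 2: first access turns reserved pages into allocated RAM pages."""
        unallocated = self.reserved - self.alloc_ram - self.alloc_disk
        if pages > unallocated:
            raise ValueError("cannot touch more pages than are reserved")
        self.alloc_ram += pages

    def page_out(self, pages):
        """Page scanner steals pages: the allocation moves from RAM to disk."""
        moved = min(pages, self.alloc_ram)
        self.alloc_ram -= moved
        self.alloc_disk += moved

    def reserved_from_ram(self):
        """Reservations spill into RAM only once disk swap is used up."""
        return max(self.reserved - self.disk_pages, 0)


model = SwapModel(disk_pages=100, ram_pages=20)
ok = model.reserve(110)      # 100 pages come from disk, 10 from RAM
model.touch(5)               # 5 pages become allocated, resident in RAM
model.page_out(3)            # 3 of them are written to the swap disk
too_big = model.reserve(11)  # fails: only 10 pages of reservation remain
print(ok, model.reserved_from_ram(), model.alloc_ram, model.alloc_disk, too_big)
```

Note that the reserved total never changes when pages are touched or paged out; only the split between RAM and disk allocations moves.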
Memory space that is mapped but never used stays in the reserved state, and the reservation consumes swap space. This behavior is common for large database systems and is the reason why large amounts of swap disk must be configured to run applications like Oracle and SAP R/3, even though they are unlikely to allocate all the reserved space. Before Solaris 2.6, the shared memory area used by databases such as Oracle and Sybase also required a swap reservation. This memory is normally allocated as intimate shared memory (ISM) and is locked in RAM. The swap reservation can never be used, so in Solaris 2.6 a special case was made and ISM no longer reserves swap space. If you have a 2-gigabyte (GB) shared memory segment, 2 GB of swap disk will be freed for other purposes when you upgrade from Solaris 2.5.1 to 2.6.
The first swap partition allocated is also used as the system dump space for storage of a kernel crash dump. It's a good idea to have plenty of disk space set aside in /var/crash and to enable savecore by un-commenting the commands in /etc/rc2.d/S20sysetup. If you forget and think that there may have been an unsaved crash dump, you can try running savecore long after the system has rebooted. The crash dump is stored at the very end of the swap partition, and the savecore command will tell you if it has been overwritten yet.
Swap space calculations
Please refer to Jim Mauro's December 1997 and January 1998
Inside Solaris columns in SunWorld for a
detailed explanation of how swap space works (see Resources below).
Disk space used for swap is listed by the swap -l command. All swap space segments must be 2 GB or less in size; any extra space is ignored. The anoninfo structure in the kernel keeps track of anonymous memory. In Solaris 2.6, the name of this structure changed to k_anoninfo, but the three values it holds are the same. This illustrates why it's best to rely on the more stable kstat interface rather than the raw kernel data. In this case, though, the data provided is so confusing that I feel the need to see how the kstat data is derived from these values:
anoninfo.ani_max is the total amount of disk-based swap space.
anoninfo.ani_resv is the amount reserved thus far from both disk and RAM.
anoninfo.ani_free is the amount of unallocated physical space plus the amount of reserved unallocated RAM.
If ani_resv is greater than ani_max, we've reserved all the disk swap and some RAM-based swap. Otherwise, the amount of disk-resident swap space available to be reserved is calculated by subtracting ani_resv from ani_max.
swapfs_minfree is set to physmem/8 (with a minimum of 3.5 megabytes) and acts as a limit on the amount of memory used to hold anonymous data.
availrmem is the amount of resident, unswappable memory in the system. It varies and can be read from the system_pages kstat shown in Code Example 2.
The amount of swap space that can be reserved from memory is calculated by subtracting swapfs_minfree from availrmem. The total amount available for reservation is thus
MAX(ani_max - ani_resv, 0) + (availrmem - swapfs_minfree)
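To check the formula against this column's own figures, here is a small sketch. The function names are made up for the example, and the 8 KB page size is an assumption inferred from the sample output (16833 pages are printed as 134664 K). physmem comes from Code Example 2; the other values come from Code Example 3.

```python
PAGE_KB = 8  # assumption: 8 KB pages, as the example output in this column implies


def swapfs_minfree(physmem_pages, page_kb=PAGE_KB):
    """physmem/8, with a floor of 3.5 MB expressed in pages."""
    floor_pages = int(3.5 * 1024) // page_kb
    return max(physmem_pages // 8, floor_pages)


def reservable_pages(ani_max, ani_resv, availrmem, minfree):
    """MAX(ani_max - ani_resv, 0) + (availrmem - swapfs_minfree), in pages."""
    return max(ani_max - ani_resv, 0) + (availrmem - minfree)


minfree = swapfs_minfree(15778)  # physmem from Code Example 2
total = reservable_pages(ani_max=54814, ani_resv=19429,
                         availrmem=13859, minfree=minfree)
print(minfree, total, total * PAGE_KB)
```

With these inputs the sketch gives 1972 pages for swapfs_minfree and 47272 pages (378176 K) available for reservation, which agrees with the swapfs_minfree and swap_avail figures in Code Example 3.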
A reservation failure will prevent a process from starting or growing. Allocations aren't really interesting. The counters provided by the kernel to commands such as vmstat and sar are part of the vminfo kstat structure. These counters accumulate once per second, so average swap usage over a measured interval can be determined. The swap -s command reads the kernel directly to obtain a snapshot of the current anoninfo values, so the numbers will never match exactly. Also, the simple act of running a program changes the values, so you can't get an exact match. The vminfo calculations are as follows:
swap_resv += ani_resv
swap_alloc += MAX(ani_resv, ani_max) - ani_free
swap_avail += MAX(ani_max - ani_resv, 0) + (availrmem - swapfs_minfree)
swap_free += ani_free + (availrmem - swapfs_minfree)
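Because the counters accumulate once per second, a monitoring tool has to difference two readings and divide by the elapsed seconds to get an average. The sketch below simulates that. The helper names accumulate and interval_average are invented, and the small change in ani_resv halfway through is a hypothetical input chosen to make the averaging visible; the other values are from Code Example 3.

```python
def accumulate(counters, ani_max, ani_resv, ani_free, availrmem, swapfs_minfree):
    """One per-second update, following the four vminfo calculations above."""
    ramres = availrmem - swapfs_minfree
    counters["swap_resv"] += ani_resv
    counters["swap_alloc"] += max(ani_resv, ani_max) - ani_free
    counters["swap_avail"] += max(ani_max - ani_resv, 0) + ramres
    counters["swap_free"] += ani_free + ramres


def interval_average(start, end, seconds):
    """Average value of each counter over the interval between two readings."""
    return {name: (end[name] - start[name]) / seconds for name in start}


counters = {"swap_resv": 0, "swap_alloc": 0, "swap_avail": 0, "swap_free": 0}
start = dict(counters)  # first reading

# Four one-second ticks; ani_resv grows by 2 pages halfway (hypothetical).
for ani_resv in (19429, 19429, 19431, 19431):
    accumulate(counters, ani_max=54814, ani_resv=ani_resv, ani_free=37981,
               availrmem=13859, swapfs_minfree=1972)

average = interval_average(start, counters, seconds=4)
print(average)
```

The averages land between the per-second values, which is one reason a snapshot from swap -s will not match them exactly.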
Code Example 2
system_pages:
physmem 15778 nalloc 7745990 nfree 5600412 nalloc_calls 2962 nfree_calls 2047
kernelbase 268461504 econtig 279511040 freemem 4608 availrmem 13849
lotsfree 256 desfree 100 minfree 61 fastscan 7884 slowscan 500 nscan 0 desscan 125
pp_kernel 1920 pagesfree 4608 pageslocked 1929 pagesio 0 pagestotal 15769
To figure out how the numbers really do add up, I wrote a short program in SE and compared its output with that of swap -s and sar -r, as shown in Code Example 3. To get the numbers to match, I needed some odd combinations for sar and swap -s. In summary, the only useful measure is swap_avail, as printed by swap -s, vmstat, and sar -r (though sar labels it freeswap, and before Solaris 2.5 sar actually displayed swap_free rather than swap_avail). The other measures are mislabelled and confusing. The code for the SE program in Code Example 4 shows how the data is calculated and suggests a more useful display that is also simpler to calculate.
Code Example 3
# se swap.se
ani_max 54814 ani_resv 19429 ani_free 37981 availrmem 13859 swapfs_minfree 1972
ramres 11887 swap_resv 19429 swap_alloc 16833 swap_avail 47272 swap_free 49868

Misleading data printed by swap -s
134664 K allocated + 20768 K reserved = 155432 K used, 378176 K available
Corrected labels:
134664 K allocated + 20768 K unallocated = 155432 K reserved, 378176 K available

Mislabelled sar -r 1
freeswap (really swap available) 756352 blocks

Useful swap data: Total swap 520 M
available 369 M reserved 151 M Total disk 428 M Total RAM 92 M
# swap -s
total: 134056k bytes allocated + 20800k reserved = 154856k used, 378752k available
# sar -r 1
18:40:51 freemem freeswap
18:40:52    4152   756912
The only thing you need to know about SE to read this code is that reading kvm$name causes the current value of the kernel variable name to be read. The preprocessor variable MINOR_RELEASE is set to 51 for Solaris 2.5.1 and 60 for Solaris 2.6.
Code Example 4
/* extract all the swap data and generate the numbers */
/* must be run as root to read kvm variables */

struct anon {
  int ani_max;
  int ani_free;
  int ani_resv;
};

int max(int a, int b) {
  if (a > b) {
    return a;
  } else {
    return b;
  }
}

main() {
#if MINOR_RELEASE < 60
  anon kvm$anoninfo;
#else
  anon kvm$k_anoninfo;
#endif
  anon tmpa;
  int kvm$availrmem;
  int availrmem;
  int kvm$swapfs_minfree;
  int swapfs_minfree;
  int ramres;
  int swap_alloc;
  int swap_avail;
  int swap_free;
  int kvm$pagesize;
  int ptok = kvm$pagesize/1024;
  int res_but_not_alloc;

#if MINOR_RELEASE < 60
  tmpa = kvm$anoninfo;
#else
  tmpa = kvm$k_anoninfo;
#endif
  availrmem = kvm$availrmem;
  swapfs_minfree = kvm$swapfs_minfree;
  ramres = availrmem - swapfs_minfree;
  swap_alloc = max(tmpa.ani_resv, tmpa.ani_max) - tmpa.ani_free;
  swap_avail = max(tmpa.ani_max - tmpa.ani_resv, 0) + ramres;
  swap_free = tmpa.ani_free + ramres;
  res_but_not_alloc = tmpa.ani_resv - swap_alloc;

  printf("ani_max %d ani_resv %d ani_free %d availrmem %d swapfs_minfree %d\n",
    tmpa.ani_max, tmpa.ani_resv, tmpa.ani_free, availrmem, swapfs_minfree);
  printf("ramres %d swap_resv %d swap_alloc %d swap_avail %d swap_free %d\n",
    ramres, tmpa.ani_resv, swap_alloc, swap_avail, swap_free);

  printf("\nMisleading data printed by swap -s\n");
  printf("%d K allocated + %d K reserved = %d K used, %d K available\n",
    swap_alloc * ptok, res_but_not_alloc * ptok,
    tmpa.ani_resv * ptok, swap_avail * ptok);
  printf("Corrected labels:\n");
  printf("%d K allocated + %d K unallocated = %d K reserved, %d K available\n",
    swap_alloc * ptok, res_but_not_alloc * ptok,
    tmpa.ani_resv * ptok, swap_avail * ptok);

  printf("\nMislabelled sar -r 1\n");
  printf("freeswap (really swap available) %d blocks\n", swap_avail * ptok * 2);

  printf("\nUseful swap data: Total swap %d M\n",
    swap_avail * ptok / 1024 + tmpa.ani_resv * ptok / 1024);
  printf("available %d M reserved %d M Total disk %d M Total RAM %d M\n",
    swap_avail * ptok / 1024, tmpa.ani_resv * ptok / 1024,
    tmpa.ani_max * ptok / 1024, ramres * ptok / 1024);
}
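For readers without SE installed, the unit conversions and relabelling at the end of Code Example 4 can be reproduced with a short sketch. It takes page counts as plain arguments instead of reading kernel variables; the function name swap_report is invented, and the 8 KB page size is an assumption that matches the sample output. The inputs are the page counts printed in Code Example 3.

```python
def swap_report(ani_max, ani_resv, swap_alloc, swap_avail, ramres, page_kb=8):
    """Convert page counts into the corrected labels and summary figures."""
    def to_kb(pages):
        return pages * page_kb

    def to_mb(pages):
        return pages * page_kb // 1024  # integer division, as in the SE code

    unallocated = ani_resv - swap_alloc  # reserved but not yet allocated
    corrected = (f"{to_kb(swap_alloc)} K allocated + {to_kb(unallocated)} K unallocated"
                 f" = {to_kb(ani_resv)} K reserved, {to_kb(swap_avail)} K available")
    summary = {
        "total_mb": to_mb(swap_avail) + to_mb(ani_resv),
        "available_mb": to_mb(swap_avail),
        "reserved_mb": to_mb(ani_resv),
        "disk_mb": to_mb(ani_max),
        "ram_mb": to_mb(ramres),
    }
    sar_blocks = to_kb(swap_avail) * 2  # sar reports 512-byte blocks
    return corrected, summary, sar_blocks


corrected, summary, sar_blocks = swap_report(
    ani_max=54814, ani_resv=19429, swap_alloc=16833,
    swap_avail=47272, ramres=11887)
print(corrected)
print(summary)
print(sar_blocks)
```

Fed the page counts from Code Example 3, this reproduces the "Corrected labels" line, the 756352-block freeswap figure, and the "Useful swap data" summary shown there.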
Wrap up
Over the years, many people have struggled to understand the Solaris
2 swap system. If you've experienced confusion in trying to add up
the numbers from the commands, it's not your fault. It
really is confusing, and the numbers don't add up unless you know
how they were calculated in the first place!
I've started a new alias specifically for people developing code using the SE toolkit. We now have a few thousand people using SE to monitor their systems, and relatively few people writing code in SE. Rich Pettit and I would like to talk about upcoming language features with a core group of people who have their own code to contribute or maintain. Send e-mail to the regular se-feedback@chessie.eng.sun.com alias asking to be added to the developers' alias, and tell us what kind of things you've developed.
Finally, thanks for buying the book. The first print run sold out, and the second is now shipping. We made a few corrections to the second print run. Most of the problems were format- and layout-related, and the index page numbers should now be properly synchronized throughout. The references chapter of the book is now online at http://www.sun.com/sun-on-net/performance/book2ref.html.
Resources
About the author
Adrian Cockcroft joined Sun Microsystems in 1988,
and currently works as a performance specialist for the Server Division
of SMCC. He wrote
Sun Performance and Tuning: SPARC and Solaris and Sun Performance and Tuning -- Java and the Internet, both
published by SunSoft Press
PTR Prentice Hall.
Reach Adrian at adrian.cockcroft@sunworld.com.
URL: http://www.sunworld.com/swol-07-1998/swol-07-perf.html