The memory go-round

Confused by "free memory"? We explain its role, how it flows, when to tune it, and what to change


Abstract
Pages of physical memory circulate through the system via something called "free memory" that seems to confuse a lot of people. This month the basic memory flows are explained, and the latest changes in the algorithm are described. Some guidance on when it may need tuning and what to change is included. (2,600 words)

Q: The free memory reported by vmstat is confusing: it doesn't seem to behave the way I'd expect it to, and it seems to vary depending upon which operating system version I've got installed as well. What's going on?

A: I've tried to explain this before, in my book and in my very first SunWorld Online column, entitled "Help, I've lost my memory!" (see Resources). This time I'll try a new approach, and I'll also try to indicate the changes that have occurred in recent releases of Solaris. First, let's look at some vmstat output.

% vmstat 5
 procs     memory            page            disk          faults      cpu
 r b w   swap  free  re  mf pi po fr de sr s0 s1 s5 --   in   sy   cs us sy id
 0 0 0 480528 68056   0   3  5  2  2  0  0 65 12  9  0  165  968  101  2  2 95
 0 0 0 476936 85768   0  15 107 0  0  0  0  3  4  7  0  465 1709  231  6  3 91
 0 0 0 476832 85160   0  31 144 0  0  0  0  7  0  9  0  597 3558  367  8  6 87
 0 0 0 476568 83840   0   7 168 0  0  0  0  4  1  6  0  320  796  155  6  1 93
 0 0 0 476544 83368   0   0 28  0  0  0  0  1  2  0  0  172 1739  166 10  5 85

I described how vmstat works in a previous article. The first thing to remember is that vmstat prints averaged rates, based on the difference between two snapshots of a number of kernel counters. The first line of output is printed immediately, so it is the difference between the current values of the counters and zero, which works out to be the average since the system was booted; it is best to disregard the first line. Secondly, if you run vmstat with a small time interval you tend to catch all the short-term peaks and get results that are much more variable from one line to the next. I typically use five or ten seconds, but I may use one second to try to catch peaks, or as much as 30 to 60 seconds to just "keep an eye" on a system. Vmstat is a useful summary of what is happening on a system.
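
To make the averaging concrete, here is a minimal sketch in C of the snapshot-and-subtract scheme. This is not vmstat source; read_counter() is a hypothetical stand-in for reading a cumulative kernel statistic, such as the total number of pages scanned.

#include <stdio.h>
#include <unistd.h>

/* Hypothetical stand-in for reading an ever-increasing kernel counter. */
static unsigned long read_counter(void)
{
    static unsigned long total;
    total += 500;                     /* pretend 500 events occurred */
    return total;
}

int main(void)
{
    const int interval = 5;           /* seconds, as in "vmstat 5" */
    unsigned long prev = 0;           /* counters start at zero at boot */
    int i;

    for (i = 0; i < 5; i++) {
        unsigned long now = read_counter();
        /* On the first pass prev is zero, so the entire history of the
         * counter is averaged in; this is why the first line of vmstat
         * output should be disregarded. */
        printf("rate = %lu per second\n", (now - prev) / interval);
        prev = now;
        sleep(interval);
    }
    return 0;
}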

The basic unit of memory is a page, which is typically four kilobytes or eight kilobytes in size. UltraSPARC-based systems use an eight-kilobyte page, while the previous generation of SPARC systems, Intel x86, and many other architectures use four kilobytes. The first source of confusion when looking at vmstat output is that some measurements are in pages, and others are in kilobytes, so the numbers will not look the same on systems with four-kilobyte versus eight-kilobyte pages. The units of swap, free, pi, po, fr, and de are in kilobytes, and re, mf, and sr are in pages.
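
You can check the page size on any system with the standard sysconf interface; a one-line C program is all it takes:

#include <stdio.h>
#include <unistd.h>

int main(void)
{
    /* Reports 8192 on UltraSPARC; 4096 on earlier SPARC and x86. */
    printf("page size = %ld bytes\n", sysconf(_SC_PAGESIZE));
    return 0;
}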

The role of the free list
Memory is one of the main resources in the system. There are four main consumers of memory: the kernel, the filesystem cache, process private memory, and System V shared memory. The sections below look at each in turn and at how it relates to the free list. When memory is needed it is taken from the head of the free list. When memory is put back on the free list there are two choices: if the page still contains valid data it is put on the tail of the list, so it will not be reused for as long as possible; if the page has no useful content it is put at the head of the list for immediate reuse. The kernel keeps track of valid pages in the free list so that they can be reclaimed if their contents are requested, thereby saving a disk I/O.
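
As a minimal sketch of this head/tail policy in C (illustrative only, not kernel source):

struct page {
    struct page *next;
    int          has_valid_data;     /* still caches useful contents */
};

static struct page *head, *tail;     /* the free list */

/* Pages with valid contents go on the tail so they survive as long as
 * possible and can be reclaimed; useless pages go on the head. */
void free_page(struct page *p)
{
    p->next = 0;
    if (p->has_valid_data && tail) {
        tail->next = p;              /* tail: reuse as late as possible */
        tail = p;
    } else {
        p->next = head;              /* head: reuse immediately */
        head = p;
        if (!tail)
            tail = p;
    }
}

/* Consumers always take memory from the head of the free list. */
struct page *alloc_page(void)
{
    struct page *p = head;
    if (p) {
        head = p->next;
        if (!head)
            tail = 0;
    }
    return p;
}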

The vmstat reclaim counter is double-edged. On one hand, it is good that a page fault was serviced by a reclaim rather than a page-in, which would cause a disk read. On the other hand, you don't want active pages to be stolen and end up on the free list in the first place. The vmstat free value is simply the size of the free list, in kilobytes. The way the size varies is what tends to confuse people. The most important value reported by vmstat is the scan rate, sr. If it is zero or close to zero, you can be sure that the system has sufficient memory. If it is always high (hundreds to thousands of pages per second), adding more memory is likely to help.

Kernel memory
At the start of the boot process, the kernel takes two or three megabytes of initial memory and puts all the rest on the free list. As the kernel dynamically sizes itself and loads device drivers and filesystems, it takes memory from the free list. Kernel memory is normally locked and cannot be paged out in a memory shortage, but the kernel does free memory if there is a lot of demand for it: unused device drivers are unloaded, and unused slabs of kernel memory data structures are freed. I sometimes notice this on a desktop system when, during heavy paging, there is a pop from the speaker as the audio device driver is unloaded. If you run a program that keeps the device open, the driver cannot be unloaded.
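
That trick is trivial to implement. This sketch just holds the audio device open so its driver stays loaded, assuming /dev/audio is the device path on your desktop:

#include <fcntl.h>
#include <stdio.h>
#include <unistd.h>

int main(void)
{
    int fd = open("/dev/audio", O_RDONLY);   /* hold the device open */
    if (fd < 0) {
        perror("open /dev/audio");
        return 1;
    }
    pause();    /* sleep forever; the driver cannot unload meanwhile */
    return 0;
}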

One problem that can occur is reported as a kernel memory allocation error by sar -k and by the kernel rule in the SE toolkit. There is a limit on the size of kernel memory, but at 3.75 gigabytes on UltraSPARC systems it is unlikely to be reached. The most common cause of this problem is that the kernel tried to get some memory while the free list was completely empty. Since the kernel cannot always wait, operations may fail rather than be delayed. The streams subsystem cannot wait, and I have seen remote logins fail for this reason when a large number of users try to log in at the same time. In Solaris 2.5.1, and in the current kernel jumbo patches for Solaris 2.3, 2.4, and 2.5, changes were made to make the free list larger on big systems and to try to prevent it from ever being completely empty.

Filesystem cache
The second consumer of memory is files. In order to read or write a file, memory must be obtained to hold the data for the I/O. While the I/O is happening, the pages are temporarily locked in memory. After the I/O is complete, the pages are unlocked, but are retained in case the contents of the file are needed again. The filesystem cache is often one of the biggest users of memory in a system. Note that in all versions of Solaris 1 and Solaris 2 the data in the filesystem cache is separate from kernel memory. It does not appear in any address space, so there is no limit on its size. If you want to cache many gigabytes of filesystem data in memory, you just need to buy a 30-gigabyte E6000 or a 64-gigabyte E10000. We once cached an entire nine-gigabyte database on a 16-gigabyte E6000, just to show it could be done, and read-intensive queries ran 350 times faster than reading from disk. As I've mentioned previously, SPARC-based systems use either a 36-bit (SuperSPARC) or 41-bit (UltraSPARC) physical address, unlike many other systems that limit both virtual and physical addresses to 32 bits and so stop at four gigabytes.

If you delete or truncate a file, the cached data becomes invalid, so it is returned to the head of the free list. If the kernel decides to stop caching the inode for a file, any pages of cached data attached to that inode are also freed.

For most files, the only way they become uncached is by having inactive pages stolen by the pageout scanner. The scanner only runs when the free list shrinks below a threshold (lotsfree in pages, usually set at a few megabytes), so eventually most of the memory in the system ends up in the filesystem cache. When there is a new, large demand for memory, the scanner will steal most of the required pages from the inactive files in the filesystem cache.
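
As a sketch of the trigger logic, assuming the classic SVR4-style linear ramp between the slowscan and fastscan tunables (an illustration, not kernel source):

/* No scanning while free memory is above lotsfree; below it, the scan
 * rate ramps linearly from slowscan up toward fastscan as the free
 * list approaches empty.  Values are in pages and pages per second. */
long scan_rate(long freemem, long lotsfree, long slowscan, long fastscan)
{
    if (freemem >= lotsfree)
        return 0;                    /* enough free memory: no scanning */
    return slowscan +
        (fastscan - slowscan) * (lotsfree - freemem) / lotsfree;
}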

Files are also mapped directly by processes using mmap. This is used to map in the code and shared libraries for a process. Pages may have multiple references from several processes while also being resident in the filesystem cache. A recent optimization is that pages with eight or more references are skipped by the scanner, even if they are inactive. This helps shared libraries and multiprocess server code stay resident in memory. There is currently no measure provided to tell you how much memory is being used by the filesystem cache or which files are cached.
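
Here is a bare-bones example of mapping a file with mmap. MAP_PRIVATE gives the copy-on-write behavior used for code and libraries, where pages are shared until someone writes to them:

#include <sys/mman.h>
#include <sys/stat.h>
#include <fcntl.h>
#include <stdio.h>
#include <unistd.h>

int main(int argc, char **argv)
{
    struct stat st;
    char *p;
    int fd;

    if (argc != 2) {
        fprintf(stderr, "usage: %s file\n", argv[0]);
        return 1;
    }
    fd = open(argv[1], O_RDONLY);
    if (fd < 0 || fstat(fd, &st) < 0) {
        perror(argv[1]);
        return 1;
    }
    p = mmap(0, (size_t)st.st_size, PROT_READ, MAP_PRIVATE, fd, 0);
    if (p == MAP_FAILED) {
        perror("mmap");
        return 1;
    }
    printf("first byte: %c\n", p[0]);  /* touching the page faults it in */
    munmap(p, (size_t)st.st_size);
    close(fd);
    return 0;
}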

Process private memory
Processes also have private memory to hold their stack space, modified data areas, and heap. Stack and heap memory is always initialized to zero before it is given to the process. The stack grows dynamically as required, and the heap is grown using the brk system call, usually when the malloc library routine needs some more space. Data and code areas are initialized by mapping from a file, but as soon as they are written, a private copy of the page is made. There is currently no way to see how much resident memory is used by private or shared pages. All that is reported is the total number of resident mapped pages, as the RSS field in some forms of ps output (e.g. /usr/ucb/ps uax). The SIZE or SZ field indicates the total size of the address space, which includes memory mapped devices like framebuffers as well as pages that are not resident. SIZE really indicates how much virtual memory the process needs, and is more closely related to swap space requirements.
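
You can watch the heap grow with a sketch like this. It assumes, as was true of the Solaris malloc of this era, that large requests are satisfied by moving the break with sbrk; some other allocators use mmap instead:

#include <stdio.h>
#include <stdlib.h>
#include <unistd.h>

int main(void)
{
    char *before = sbrk(0);          /* current top of the heap */
    void *p = malloc(1024 * 1024);   /* force the heap to grow */
    char *after = sbrk(0);

    printf("break moved by %ld bytes\n", (long)(after - before));
    free(p);
    return 0;
}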

When a process first starts up it consumes memory very rapidly until it reaches its working set. If it is a user-driven tool, it may also need to grab a lot more memory to perform operations like opening a new window or processing an input file. In some cases the response time to the user request is affected by how long it takes to obtain a large amount of memory. If the process needs more than is currently in the free list, it goes to sleep until the pageout scanner has obtained more memory for it. In many cases additional memory is requested one page at a time -- on a uniprocessor, the process will eventually be interrupted by the scanner as it wakes up to replenish the free list.

Free list performance problems and deficit
This behavior manifests itself in a common problem that has a non-obvious solution. After a reboot, the free list is very large, and memory-intensive programs have good response times. After a while the free list is consumed by the file cache, and the page scanner cuts in. At that point the response time may worsen, as large amounts of memory are no longer immediately available. System administrators may watch this happening with vmstat and see that free memory decreases; when it gets low, paging starts, and performance worsens. An initial reaction is to add some more RAM to the system, but this does not usually solve the problem. It may postpone the problem, but it may also make paging more intense, as some paging parameters are scaled up as you add RAM. The kernel tries to counteract this effect by calculating running averages of the memory demand over five-second and 30-second periods.

If the average demand is high the kernel expects that more memory will be needed in the future, so it sets up a deficit which makes the target size of the free list up to twice as big as normal. This is the de column reported by vmstat. The deficit decays over a few seconds back to zero, so you often see a large deficit suddenly appear, then decay away again. With the latest kernel code the target size of the free list (set by lotsfree) increases on big memory systems, and since the deficit is limited to the same value, you should expect to see larger peak values of de on large memory systems.
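
A rough sketch of the effect, in C; this is illustrative only (the real calculation lives in the kernel's page daemon), but it captures the relationships the text describes:

/* High recent demand pushes the free list target up to twice lotsfree,
 * and the deficit then decays back toward zero over a few seconds. */
static long deficit;                 /* extra pages wanted; vmstat "de" */

long free_target(long lotsfree)
{
    return lotsfree + deficit;       /* at most 2 * lotsfree */
}

void note_demand(long recent_pages, long lotsfree)
{
    deficit += recent_pages;         /* demand averaged over 5s and 30s */
    if (deficit > lotsfree)
        deficit = lotsfree;          /* deficit is capped at lotsfree */
}

void decay_deficit(void)             /* called every few seconds */
{
    deficit /= 2;                    /* decays away back to zero */
}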

The real problem is that the free list is too small, and it is being replenished too aggressively, too late. The simplest fix is to increase lotsfree, but remember that the free list is unused memory: if you make it too big you are wasting RAM. If you think the scanner is being too aggressive, you can also try reducing fastscan, which is the maximum page scanner rate in pages per second. Increasing lotsfree also increases the maximum value of the deficit. You also have to let the system stabilize for a while after the first time the page scanner cuts in; it needs to go right round the whole of memory once or twice before it settles down (vmstat -s tells you the number of revolutions it has done). It should then run on a "little and often" basis. As long as you always have enough free RAM to handle the short-term demand and don't have to scan hard all the time, performance should be good.

System V shared memory
There are a few applications that make trivial use of System V shared memory, but the big, important applications are the database servers. Databases benefit from very large shared caches of data in some cases, and use System V shared memory to allocate as much as 3.5 gigabytes of RAM. By default, applications such as Oracle, Informix, and Sybase use a special flag to specify that they want intimate shared memory (ISM). In this case two special changes are made. First, all the memory is locked and cannot be paged out. Second, the memory management data structures that are normally created on a per-process basis are created once, then shared by every process.
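
For the curious, here is a sketch of how a server requests ISM on Solaris. The SHM_SHARE_MMU flag to shmat is the special flag mentioned above; the 16-megabyte size is purely illustrative:

#include <sys/ipc.h>
#include <sys/shm.h>
#include <stdio.h>

int main(void)
{
    size_t size = 16 * 1024 * 1024;     /* illustrative segment size */
    void *addr;
    int id;

    id = shmget(IPC_PRIVATE, size, IPC_CREAT | 0600);
    if (id < 0) {
        perror("shmget");
        return 1;
    }
    /* SHM_SHARE_MMU asks for intimate shared memory: the segment is
     * locked in RAM and its address translation structures are shared
     * by every attaching process. */
    addr = shmat(id, 0, SHM_SHARE_MMU);
    if (addr == (void *)-1) {
        perror("shmat");
        return 1;
    }
    printf("ISM segment attached at %p\n", addr);
    shmdt(addr);
    shmctl(id, IPC_RMID, 0);
    return 0;
}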

Kernel values, tunables and defaults
I'll end with the most important kernel variables that can be used to tune the virtual memory system: lotsfree, the target size of the free list, and fastscan, the maximum page scanner rate, both discussed above. They are normally set in the /etc/system file, but some can be changed online using adb -kw if you really know what you are doing. If in any doubt, stick with the defaults.
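
Purely as an illustration of the format (the values here are made up; check the defaults for your release before changing anything), /etc/system entries look like this and take effect at the next reboot:

* Target size of the free list, in pages (value is illustrative):
set lotsfree=1024
* Maximum page scanner rate, in pages per second (value is illustrative):
set fastscan=16384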

Wrap up
Although I've tried to explain how memory usage works on several occasions, I hope this different approach has made the behavior, and the problems, clearer than before.


Resources

Other Cockcroft columns at www.sun.com


About the author
Adrian Cockcroft joined Sun in 1988, and currently works as a performance specialist for the Server Division of SMCC. He wrote Sun Performance and Tuning: SPARC and Solaris, published by SunSoft Press PTR Prentice Hall. Reach Adrian Cockcroft at adrian.cockcroft@sunworld.com.

You can buy Sun Performance and Tuning: SPARC and Solaris at Amazon.com Books: http://www.amazon.com/exec/obidos/ISBN=0131496425/sunworldonlineA/