The memory go-round
Confused by "free memory"? We explain its role, how it flows, when to tune it, and what to change
Pages of physical memory circulate through the system via something called "free memory" that seems to confuse a lot of people. This month the basic memory flows are explained, and the latest changes in the algorithm are described. Some guidance on when it may need tuning and what to change is included. (2,600 words)
Q: The free memory reported by vmstat is confusing, it doesn't seem to behave the way I'd expect it to, and it seems to vary depending upon the operating system version I've got installed as well. What's going on?
A: I've tried to explain this before, in my book and in my very first SunWorld Online column, entitled "Help, I've lost my memory!" (see Resources). This time I'll try a new approach, and I'll also try to indicate the changes that have occurred in recent releases of Solaris. First, let's look at some vmstat output.
% vmstat 5
 procs     memory            page             disk          faults      cpu
 r b w   swap  free  re  mf  pi po fr de sr s0 s1 s5 --   in   sy  cs us sy id
 0 0 0 480528 68056   0   3   5  2  2  0  0 65 12  9  0  165  968 101  2  2 95
 0 0 0 476936 85768   0  15 107  0  0  0  0  3  4  7  0  465 1709 231  6  3 91
 0 0 0 476832 85160   0  31 144  0  0  0  0  7  0  9  0  597 3558 367  8  6 87
 0 0 0 476568 83840   0   7 168  0  0  0  0  4  1  6  0  320  796 155  6  1 93
 0 0 0 476544 83368   0   0  28  0  0  0  0  1  2  0  0  172 1739 166 10  5 85
I described how vmstat works in a previous article. The first thing to remember is that vmstat prints out averaged rates, based on the difference between two snapshots of a number of kernel counters. The first line of output is printed immediately, hence it is the difference between the current value of the counters and zero. This works out to be the average since the system was booted. It is best to disregard the first line. Secondly, if you run vmstat with a small time interval you tend to catch all the short-term peaks and get results that are much more variable from one line to the next. I typically use five or 10 seconds, but may use one second to try and catch peaks or as much as 30 to 60 seconds to just "keep an eye" on a system. Vmstat is a useful summary of what is happening on a system.
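The snapshot-and-difference behavior described above can be sketched in a few lines of Python. This is purely illustrative: the counter names are hypothetical, not the real kernel statistics names.

```python
# Sketch of how vmstat derives its per-second rates: each column is the
# difference between two snapshots of cumulative kernel counters, divided
# by the measurement interval. Counter names here are made up.
def vmstat_rates(prev, curr, interval_secs):
    """Per-second rates from two snapshots of cumulative counters."""
    return {name: (curr[name] - prev[name]) / interval_secs for name in curr}

# The first line vmstat prints compares the current counters against zero,
# so it is the average since boot -- which is why it should be disregarded:
at_boot = {"pages_scanned": 0, "pages_in": 0}
now = {"pages_scanned": 864000, "pages_in": 172800}
since_boot = vmstat_rates(at_boot, now, 86400)   # averaged over one day
```

Subsequent lines use the previous snapshot rather than zero, so they reflect activity during the chosen interval only.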
The basic unit of memory is a page, which is typically four
kilobytes or eight kilobytes in size. UltraSPARC-based systems use
an eight-kilobyte page, while the previous generation of SPARC
systems, Intel x86, and many other architectures use four kilobytes.
The first source of confusion when looking at vmstat output is that
some measurements are in pages, and others are in kilobytes, so the
numbers will not look the same on systems with four-kilobyte versus
eight-kilobyte pages. The units of
de are in kilobytes, and
sr are in pages.
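To see why the same activity looks different on four-kilobyte and eight-kilobyte page systems, here is a small helper. It is an illustration only, not part of any Solaris tool.

```python
# Converting a page count to kilobytes requires knowing the page size:
# 4 KB on older SPARC and x86 systems, 8 KB on UltraSPARC.
def pages_to_kb(pages, page_size_bytes):
    return pages * page_size_bytes // 1024

# The same scan rate in pages per second represents twice as much memory
# on an UltraSPARC system as on a 4 KB page system:
sr = 200                          # pages per second, from the vmstat sr column
kb_4k = pages_to_kb(sr, 4096)     # 800 KB/s on a 4 KB page system
kb_8k = pages_to_kb(sr, 8192)     # 1600 KB/s on an 8 KB page system
```

So an identical sr value means the scanner is covering twice as many kilobytes per second on an UltraSPARC machine.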
The role of the free list
Memory is one of the main resources in the system. There are four main consumers of memory, and when memory is needed, they all obtain it from the free list. The diagram below shows these consumers and how they relate to the free list. When memory is needed it is taken from the head of the free list. When memory is put back on the free list there are two choices. If the page still contains valid data it is put on the tail of the list, so it will not be reused for as long as possible. If the page has no useful content, it is put at the head of the list for immediate reuse. The kernel keeps track of valid pages in the free list so that they can be reclaimed if their content is requested, thereby saving a disk I/O.
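As an illustration only (not the Solaris implementation), the head/tail behavior and the reclaim path can be modeled with a Python deque:

```python
# Toy model of the free list: pages whose contents are still valid go on
# the tail (reused last, so they stay reclaimable as long as possible);
# pages with no useful content go on the head for immediate reuse.
from collections import deque
from dataclasses import dataclass

@dataclass(eq=False)
class Page:
    ident: str   # identity of the cached data, e.g. a (file, offset) pair

free_list = deque()
reclaimable = {}   # ident -> page, for valid pages sitting on the free list

def free_page(page, still_valid):
    if still_valid:
        free_list.append(page)        # tail: contents kept around longest
        reclaimable[page.ident] = page
    else:
        free_list.appendleft(page)    # head: reused first

def allocate_page():
    page = free_list.popleft()        # allocation always takes the head
    reclaimable.pop(page.ident, None)
    return page

def reclaim(ident):
    """Service a fault from the free list, saving a disk read."""
    page = reclaimable.pop(ident, None)
    if page is not None:
        free_list.remove(page)
    return page
```

A fault on data that is still on the free list is satisfied by reclaim() without any disk I/O, which is exactly what the vmstat re column counts.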
The vmstat reclaim counter is double-edged. On one hand, it is good that
a page fault was serviced by a reclaim rather than a page-in, which
would cause a disk read. On the other hand, you don't want active
pages to be stolen and end up on the free list in the first place.
The vmstat free value is simply the size of the free list in
kilobytes. The way the size varies is what tends to confuse people.
The most important value reported by vmstat is the scan rate
sr. If it is zero or close to zero then you can be sure that
the system does have sufficient memory. If it is always high
(hundreds to thousands of pages per second) then adding more memory
is likely to help.
At the start of the boot process, the kernel takes two or three megabytes of initial memory and puts all the rest on the free list. As the kernel dynamically sizes itself and loads device drivers and filesystems, it takes memory from the free list. Kernel memory is normally locked and cannot be paged out in a memory shortage, but the kernel does free memory if there is a lot of demand for it: unused device drivers are unloaded, and unused slabs of kernel memory data structures are freed. I sometimes notice this on a desktop system when, during heavy paging, there is a pop from the speaker as the audio device driver is unloaded. If you run a program that keeps the device open, the driver cannot be unloaded.
One problem that can occur is a kernel memory allocation failure, which
is reported by sar -k and by the kernel rule in the SE
toolkit. There is a limit on the size of kernel memory, but at 3.75
gigabytes on UltraSPARC systems this is unlikely to be reached. The
most common cause of this problem is that the kernel tried to get
some memory while the free list was completely empty. Since the
kernel cannot always wait this may cause operations to fail rather
than be delayed. The streams subsystem cannot wait, and I have seen
remote logins fail due to this issue when a large number of users
try to login at the same time. In Solaris 2.5.1, and the current
kernel jumbo patch for Solaris 2.3, 2.4, and 2.5, changes were made
to make the free list larger on big systems and to try and prevent
it from ever being completely empty.
The second consumer of memory is files. In order to read or write a file, memory must be obtained to hold the data for the I/O. While the I/O is happening, the pages are temporarily locked in memory. After the I/O is complete, the pages are unlocked but are retained in case the contents of the file are needed again. The filesystem cache is often one of the biggest users of memory in a system. Note that in all versions of Solaris 1 and Solaris 2 the data in the filesystem cache is separate from kernel memory. It does not appear in any address space, so there is no limit on its size. If you want to cache many gigabytes of filesystem data in memory, you just need to buy a 30-gigabyte E6000 or a 64-gigabyte E10000. We cached an entire nine-gigabyte database on a 16-gigabyte E6000 once, just to show it could be done, and read-intensive queries ran 350 times faster than reading from disk. As I've mentioned previously, SPARC-based systems use either a 36-bit (SuperSPARC) or 41-bit (UltraSPARC) physical address, unlike many other systems that limit both virtual and physical addresses to 32 bits and so stop at four gigabytes.
If you delete or truncate a file, the cached data becomes invalid, so it is returned to the head of the free list. If the kernel decides to stop caching the inode for a file, any pages of cached data attached to that inode are also freed.
For most files, the only way they become uncached is by having inactive pages stolen by the pageout scanner. The scanner only runs when the free list shrinks below a threshold (lotsfree in pages, usually set at a few megabytes), so eventually most of the memory in the system ends up in the filesystem cache. When there is a new, large demand for memory, the scanner will steal most of the required pages from the inactive files in the filesystem cache.
Files are also mapped directly by processes using mmap. This is used to map in the code and shared libraries for a process. Pages may have multiple references from several processes while also being resident in the filesystem cache. A recent optimization is that pages with eight or more references are skipped by the scanner, even if they are inactive. This helps shared libraries and multiprocess server code stay resident in memory. There is currently no measure provided to tell you how much memory is being used by the filesystem cache or which files are cached.
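The shared-page optimization can be sketched as follows. This is an illustration of the policy described above, not the kernel's code; the field names and the threshold constant's name are made up.

```python
# Sketch of the pageout scanner's shared-page check: any page referenced
# by eight or more address spaces is skipped, even if it is inactive, so
# shared libraries and multiprocess server code stay resident.
SHARED_SKIP_THRESHOLD = 8

def scan_for_victims(pages):
    """Return the pages the scanner would steal for the free list."""
    victims = []
    for page in pages:
        if page["refs"] >= SHARED_SKIP_THRESHOLD:
            continue                  # heavily shared: leave it resident
        if not page["active"]:
            victims.append(page)      # inactive and not widely shared: steal
    return victims
```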
Process private memory
Processes also have private memory to hold their stack space, modified data areas, and heap. Stack and heap memory is always initialized to zero before it is given to the process. The stack grows dynamically as required, and the heap is grown using the
brk system call, usually when the
malloc library routine needs more space. Data and code areas are
initialized by mapping from a file, but as soon as they are written,
a private copy of the page is made. There is currently no way to
see how much resident memory is used by private or shared pages. All
that is reported is the total number of resident mapped pages, as
the RSS field in some forms of ps output (e.g.
uax). The SIZE or SZ field indicates the total size of the
address space, which includes memory mapped devices like
framebuffers as well as pages that are not resident. SIZE really
indicates how much virtual memory the process needs, and is more
closely related to swap space requirements.
When a process first starts up it consumes memory very rapidly until it reaches its working set. If it is a user-driven tool, it may also need to grab a lot more memory to perform operations like opening a new window or processing an input file. In some cases the response time to the user request is affected by how long it takes to obtain a large amount of memory. If the process needs more than is currently in the free list, it goes to sleep until the pageout scanner has obtained more memory for it. In many cases additional memory is requested one page at a time -- on a uniprocessor, the process will eventually be interrupted by the scanner as it wakes up to replenish the free list.
Free list performance problems and deficit
This behavior manifests itself in a common problem with a non-obvious solution. After a reboot, the free list is very large, and memory-intensive programs get good response times. After a while the free list is consumed by the file cache, and the page scanner cuts in. At that point, response time may worsen, as large amounts of memory are no longer immediately available. System administrators may watch this happening with vmstat and see that free memory decreases; when it gets low, paging starts, and performance worsens. An initial reaction is to add more RAM to the system, but this does not usually solve the problem. It may postpone the problem, but it may also make paging more intense, as some paging parameters are scaled up as you add RAM. The kernel tries to counteract this effect by calculating running averages of the memory demand over five-second and 30-second periods.
If the average demand is high, the kernel expects that more memory
will be needed in the future, so it sets up a deficit which makes
the target size of the free list up to twice as big as normal. This
deficit is the de column reported by vmstat. The deficit decays
over a few seconds back to zero, so you often see a large deficit
suddenly appear, then decay away again. With the latest kernel code
the target size of the free list (set by lotsfree)
increases on big memory systems, and since the deficit is limited to
the same value, you should expect to see larger peak values of
de on large memory systems.
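A toy model makes the cap and the decay concrete. The numbers and the fixed decay step are illustrative only; they are not the kernel's actual algorithm or values.

```python
# Toy model of the free-list deficit: recent heavy demand raises the
# scanner's target above lotsfree, capped so the target never exceeds
# twice lotsfree, and the deficit then decays back toward zero.
def free_list_target(lotsfree_pages, deficit_pages):
    deficit = min(deficit_pages, lotsfree_pages)   # deficit capped at lotsfree
    return lotsfree_pages + deficit                # so target <= 2 * lotsfree

def decay_deficit(deficit_pages, step_pages=100):
    return max(0, deficit_pages - step_pages)      # falls back to zero
```

This is why a de spike in vmstat output appears suddenly after a burst of demand and then shrinks away over the next few intervals.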
The real problem is that the free list is too small, and it is being
replenished too aggressively, too late. The simplest fix is to
increase lotsfree, but remember that the free list is
unused memory. If you make it too big you are wasting RAM. If you
think that the scanner is being too aggressive, you may also try
reducing fastscan, which is the maximum page scanner rate
in pages per second. By increasing lotsfree, the maximum
value of the deficit will also increase. You also have to let the
system stabilize for a while after the first time the page scanner
cuts in. It needs to go right round the whole of memory once or
twice before it settles down (vmstat -s tells you the
number of revolutions it has done). It should then run on a "little
and often" basis. As long as you always have enough free RAM to
handle the short term demand and don't have to scan hard all the
time, performance should be good.
System V shared memory
There are a few applications that make trivial use of System V shared memory, but the big, important applications are the database servers. Databases benefit from very large shared caches of data in some cases, and use System V shared memory to allocate as much as 3.5 gigabytes of RAM. By default, applications such as Oracle, Informix, and Sybase use a special flag to specify that they want intimate shared memory (ISM). In this case two special changes are made. First, all the memory is locked and cannot be paged out. Second, the memory management data structures that are normally created on a per-process basis are created once, then shared by every process.
Kernel values, tunables and defaults
I'll end by defining the most important kernel variables that can be used to tune the virtual memory system. They are normally set in the
/etc/system file, but some can be changed online using
adb -kw if you really know what you are doing. If in
any doubt, stick with the defaults.
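As a concrete example, an /etc/system fragment along the lines discussed above might look like this. The values are purely illustrative, not recommendations; the defaults are scaled to the amount of installed RAM, and lotsfree is in pages while fastscan is in pages per second.

```
* Illustrative values only -- do not copy blindly; defaults scale with RAM.
* Raise the free list target (in pages) so replenishment starts earlier.
set lotsfree=1024
* Reduce the maximum page scanner rate (in pages per second).
set fastscan=16384
```

A change to /etc/system takes effect at the next reboot.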
Although I've tried to explain how memory usage works on several occasions, I hope taking this different approach means that the behavior and problems are now understood more clearly than before.
About the author
Adrian Cockcroft joined Sun in 1988, and currently works as a performance specialist for the Server Division of SMCC. He wrote Sun Performance and Tuning: SPARC and Solaris, published by SunSoft Press PTR Prentice Hall. Reach Adrian Cockcroft at email@example.com.