The memory go-round
Confused by "free memory"? We explain its role, how it flows, when to tune it, and what to change
Pages of physical memory circulate through the system via something called "free memory" that seems to confuse a lot of people. This month the basic memory flows are explained, and the latest changes in the algorithm are described. Some guidance on when it may need tuning and what to change is included. (2,600 words)
Q: The free memory reported by vmstat is confusing: it doesn't seem to behave the way I'd expect it to, and it seems to vary depending upon which operating system version I've got installed. What's going on?
A: I've tried to explain this before, in my book and in my very first SunWorld Online column, entitled "Help, I've lost my memory!" (see Resources). This time I'll try a new approach, and I'll also try to indicate the changes that have occurred in recent releases of Solaris. First, let's look at some vmstat output.
% vmstat 5
 procs     memory            page                 disk        faults      cpu
 r b w   swap  free  re  mf  pi po fr de sr s0 s1 s5 --   in   sy  cs us sy id
 0 0 0 480528 68056   0   3   5  2  2  0  0 65 12  9  0  165  968 101  2  2 95
 0 0 0 476936 85768   0  15 107  0  0  0  0  3  4  7  0  465 1709 231  6  3 91
 0 0 0 476832 85160   0  31 144  0  0  0  0  7  0  9  0  597 3558 367  8  6 87
 0 0 0 476568 83840   0   7 168  0  0  0  0  4  1  6  0  320  796 155  6  1 93
 0 0 0 476544 83368   0   0  28  0  0  0  0  1  2  0  0  172 1739 166 10  5 85
I described how vmstat works in a previous article. The first thing to remember is that vmstat prints averaged rates, based on the difference between two snapshots of a number of kernel counters. The first line of output is printed immediately, so it is the difference between the current value of the counters and zero, which works out to be the average since the system was booted; it is best to disregard it. The second thing to remember is that if you run vmstat with a small time interval you tend to catch all the short-term peaks and get results that are much more variable from one line to the next. I typically use five or ten seconds, but may use one second to try to catch peaks, or as much as 30 to 60 seconds just to keep an eye on a system. Vmstat is a useful summary of what is happening on a system.
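To make the snapshot-and-subtract behavior concrete, here is a minimal C sketch using the Solaris libkstat interface (compile with -lkstat). It assumes the unix:0:system_pages kstat exposes a freemem value, as it does on Solaris 2.x; vmstat derives all of its per-interval figures from counter differences in just this way.

    #include <sys/types.h>
    #include <stdio.h>
    #include <unistd.h>
    #include <kstat.h>

    /* Take two snapshots of the freemem counter, five seconds apart,
     * and report the difference: the same snapshot-and-subtract
     * scheme that vmstat uses for its per-interval figures. */
    int
    main(void)
    {
        kstat_ctl_t *kc = kstat_open();
        kstat_t *ksp;
        kstat_named_t *kn;
        ulong_t first, second;

        if (kc == NULL)
            return (1);
        ksp = kstat_lookup(kc, "unix", 0, "system_pages");
        if (ksp == NULL || kstat_read(kc, ksp, NULL) == -1)
            return (1);
        kn = (kstat_named_t *)kstat_data_lookup(ksp, "freemem");
        if (kn == NULL)
            return (1);
        first = kn->value.ul;            /* free list size, in pages */

        (void) sleep(5);

        (void) kstat_read(kc, ksp, NULL);
        kn = (kstat_named_t *)kstat_data_lookup(ksp, "freemem");
        if (kn == NULL)
            return (1);
        second = kn->value.ul;

        (void) printf("freemem: %lu -> %lu pages (change %ld)\n",
            first, second, (long)second - (long)first);
        (void) kstat_close(kc);
        return (0);
    }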
The basic unit of memory is a page, which is typically four or eight kilobytes in size. UltraSPARC-based systems use an eight-kilobyte page, while the previous generation of SPARC systems, Intel x86, and many other architectures use four kilobytes. The first source of confusion when looking at vmstat output is that some measurements are in pages and others are in kilobytes, so the numbers will not look the same on systems with four-kilobyte pages as on systems with eight-kilobyte pages. The swap, free, pi, po, fr, and de columns are reported in kilobytes, while re, mf, and sr are in pages.
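If you need to convert between pages and kilobytes yourself, you can query the page size at run time; the Solaris pagesize command reports the same value. A minimal C sketch:

    #include <stdio.h>
    #include <unistd.h>

    /* Print the hardware page size, so per-page counters such as re,
     * mf, and sr can be converted into kilobytes. */
    int
    main(void)
    {
        /* 8192 on UltraSPARC, 4096 on x86 and older SPARC */
        long pagesize = sysconf(_SC_PAGESIZE);

        (void) printf("page size: %ld bytes (%ld KB per page)\n",
            pagesize, pagesize / 1024);
        return (0);
    }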
The role of the free list
Memory is one of the main resources in the system. There are four main consumers of memory, and when memory is needed, they all obtain it from the free list; each consumer is described in its own section below. When memory is needed it is taken from the head of the free list. When a page is put back on the free list there are two choices: if it still contains valid data, it is put on the tail of the list, so it will not be reused for as long as possible; if it has no useful content, it is put at the head of the list for immediate reuse. The kernel keeps track of the valid pages in the free list so that they can be reclaimed if their contents are requested, thereby saving a disk I/O.
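The head-versus-tail policy is just a double-ended queue discipline. The following C fragment is a hypothetical illustration of the idea, not the kernel's actual code:

    #include <stddef.h>

    /* A hypothetical free list illustrating the policy described
     * above: pages with valid contents go on the tail, junk pages go
     * on the head, and allocation always takes from the head. */
    struct page {
        struct page *next;
        struct page *prev;
        int          has_valid_contents; /* still caches file data? */
    };

    struct freelist {
        struct page *head;
        struct page *tail;
    };

    static void
    free_page(struct freelist *fl, struct page *pg)
    {
        if (pg->has_valid_contents) {
            /* Valid data: add at the tail so the page survives as
             * long as possible and can be reclaimed without a disk
             * read. */
            pg->next = NULL;
            pg->prev = fl->tail;
            if (fl->tail != NULL)
                fl->tail->next = pg;
            else
                fl->head = pg;
            fl->tail = pg;
        } else {
            /* No useful content: add at the head for immediate
             * reuse. */
            pg->prev = NULL;
            pg->next = fl->head;
            if (fl->head != NULL)
                fl->head->prev = pg;
            else
                fl->tail = pg;
            fl->head = pg;
        }
    }

    static struct page *
    allocate_page(struct freelist *fl)
    {
        struct page *pg = fl->head;     /* always take from the head */

        if (pg != NULL) {
            fl->head = pg->next;
            if (fl->head != NULL)
                fl->head->prev = NULL;
            else
                fl->tail = NULL;
            pg->next = pg->prev = NULL;
        }
        return (pg);
    }

Taking from the head while putting valid pages on the tail means a valid page sits on the list for as long as possible before being recycled, which maximizes the chance that a reclaim saves a disk read.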
The vmstat reclaim counter is double-edged. On one hand it is good that a page fault was serviced by a reclaim rather than by a page-in, which would cause a disk read. On the other hand, you don't want active pages to be stolen and end up on the free list in the first place. The vmstat free value is simply the size of the free list in kilobytes. The way the size varies is what tends to confuse people.
The most important value reported by vmstat is the scan rate, sr. If it is zero or close to zero, you can be sure that the system has sufficient memory. If it is always high (hundreds to thousands of pages per second), then adding more memory is likely to help.
Kernel memory
At the start of the boot process, the kernel takes two or three
megabytes of initial memory and puts all the rest on the free list.
As the kernel dynamically sizes itself and loads device drivers and
filesystems, it takes memory from the free list. Kernel memory is
normally locked and cannot be paged out in a memory shortage, but
the kernel does free memory if there is a lot of demand for it.
Unused device drivers will be unloaded, and unused slabs of kernel
memory data structures will be freed. I sometimes notice this on a desktop system: during heavy paging there is a pop from the speaker as the audio device driver is unloaded. If you run a program that keeps the device open, the driver cannot be unloaded.
One problem that can occur is reported as a kernel memory allocation error by sar -k, and by the kernel rule in the SE toolkit. There is a limit on the size of kernel memory, but at 3.75 gigabytes on UltraSPARC systems it is unlikely to be reached. The most common cause of this problem is that the kernel tried to get some memory while the free list was completely empty. Since the kernel cannot always wait, operations may fail rather than be delayed. The streams subsystem cannot wait, and I have seen remote logins fail for this reason when a large number of users try to log in at the same time. In Solaris 2.5.1, and in the current kernel jumbo patch for Solaris 2.3, 2.4, and 2.5, changes were made to make the free list larger on big systems and to try to prevent it from ever being completely empty.
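To see how an empty free list turns into outright failures, consider a kernel subsystem that is not allowed to block. This is a hedged sketch in the style of the Solaris DDI kmem_alloc interface; the grab_buffer function is invented for illustration:

    #include <sys/types.h>
    #include <sys/kmem.h>
    #include <sys/errno.h>

    /* Hypothetical kernel-side allocation. A streams-like service
     * that may run in interrupt context cannot sleep, so it must use
     * KM_NOSLEEP and cope with failure when the free list is empty. */
    static int
    grab_buffer(size_t size, void **bufp)
    {
        void *buf = kmem_alloc(size, KM_NOSLEEP);

        if (buf == NULL) {
            /* Free list exhausted: fail the operation rather than
             * hang. This is the failure users see as dropped
             * logins. */
            return (ENOMEM);
        }
        *bufp = buf;
        return (0);
    }

A caller that can sleep would pass KM_SLEEP instead and simply be delayed until the pageout scanner replenishes the free list.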
Filesystem cache
The second consumer of memory is files. In order to read or write a
file, memory must be obtained to hold the data for the I/O. While
the I/O is happening, the pages are temporarily locked in memory.
After the I/O is complete, the pages are unlocked, but are retained
in case the contents of the file are needed again. The filesystem
cache is often one of the biggest users of memory in a system. Note
that in all versions of Solaris 1 and Solaris 2 the data in the
filesystem cache is separate from kernel memory. It does not appear
in any address space, so there is no limit on its size. If you want to cache many gigabytes of filesystem data in memory, you just need to buy a 30-gigabyte E6000 or a 64-gigabyte E10000. We once cached an entire nine-gigabyte database on a 16-gigabyte E6000, just to show it could be done, and read-intensive queries ran 350 times faster than when reading from disk. As I've mentioned previously, SPARC-based systems use either a 36-bit (SuperSPARC) or 41-bit (UltraSPARC) physical address, unlike many other systems that limit both virtual and physical addresses to 32 bits and so stop at four gigabytes.
If you delete or truncate a file, the cached data becomes invalid, so it is returned to the head of the free list. If the kernel decides to stop caching the inode for a file, any pages of cached data attached to that inode are also freed.
For most files, the only way they become uncached is by having inactive pages stolen by the pageout scanner. The scanner only runs when the free list shrinks below a threshold (lotsfree in pages, usually set at a few megabytes), so eventually most of the memory in the system ends up in the filesystem cache. When there is a new, large demand for memory, the scanner will steal most of the required pages from the inactive files in the filesystem cache.
Files are also mapped directly by processes using mmap. This is used to map in the code and shared libraries for a process. Pages may have multiple references from several processes while also being resident in the filesystem cache. A recent optimization is that pages with eight or more references are skipped by the scanner, even if they are inactive. This helps shared libraries and multiprocess server code stay resident in memory. There is currently no measure provided to tell you how much memory is being used by the filesystem cache or which files are cached.
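From the process's side, mapping a file looks like this; a minimal user-level C sketch (the file name is hypothetical):

    #include <sys/types.h>
    #include <sys/mman.h>
    #include <sys/stat.h>
    #include <fcntl.h>
    #include <stdio.h>
    #include <unistd.h>

    /* Map a file read-only, the same mechanism used for code and
     * shared libraries. The pages backing the mapping live in the
     * filesystem cache and are shared with any other process that
     * maps the same file. */
    int
    main(void)
    {
        int fd = open("/tmp/example.dat", O_RDONLY); /* hypothetical */
        struct stat st;
        char *addr;

        if (fd == -1 || fstat(fd, &st) == -1)
            return (1);
        addr = mmap(NULL, st.st_size, PROT_READ, MAP_SHARED, fd, 0);
        if (addr == MAP_FAILED)
            return (1);

        /* Touching a page faults it in; a later fault on the same
         * page may be a "reclaim" if it was stolen but is still
         * sitting on the free list. */
        (void) printf("first byte: %d\n", addr[0]);

        (void) munmap(addr, st.st_size);
        (void) close(fd);
        return (0);
    }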
Process private memory
Processes also have private memory to hold their stack space, modified data areas, and heap. Stack and heap memory is always initialized to zero before it is given to the process. The stack grows dynamically as required, and the heap is grown using the brk system call, usually when the malloc library routine needs more space. Data and code areas are initialized by mapping from a file, but as soon as they are written, a private copy of the page is made. There is currently no way to see how much resident memory is used by private or shared pages. All that is reported is the total number of resident mapped pages, as the RSS field in some forms of ps output (e.g., /usr/ucb/ps uax). The SIZE or SZ field indicates the total size of the address space, which includes memory-mapped devices such as framebuffers as well as pages that are not resident. SIZE really indicates how much virtual memory the process needs, and is more closely related to swap space requirements.
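You can watch the heap grow via brk by sampling the break address around a large allocation. A small sketch, assuming malloc extends the heap with sbrk as it does on Solaris 2.x (other allocators may behave differently, and the allocation size is arbitrary):

    #include <stdio.h>
    #include <stdlib.h>
    #include <unistd.h>

    /* Show the heap growing: malloc obtains more space from the
     * kernel via brk/sbrk when its existing arena is too small. */
    int
    main(void)
    {
        void *before = sbrk(0);          /* current break address */
        char *p = malloc(1024 * 1024);   /* force the heap to grow */
        void *after = sbrk(0);

        (void) printf("break moved by %ld bytes\n",
            (long)((char *)after - (char *)before));
        free(p);
        return (0);
    }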
When a process first starts up it consumes memory very rapidly until it reaches its working set. If it is a user-driven tool, it may also need to grab a lot more memory to perform operations like opening a new window or processing an input file. In some cases the response time to the user request is affected by how long it takes to obtain a large amount of memory. If the process needs more than is currently in the free list, it goes to sleep until the pageout scanner has obtained more memory for it. In many cases additional memory is requested one page at a time; on a uniprocessor, the process will eventually be interrupted by the scanner as it wakes up to replenish the free list.
Free list performance problems and deficit
This behavior manifests itself in a common problem that has a non-obvious solution. After a reboot, the free list is very large, and memory-intensive programs have good response times. After a while the free list is consumed by the file cache, and the page scanner cuts in. At that point, response times may worsen, as large amounts of memory are no longer immediately available. System administrators may watch this happening with vmstat and see that free memory decreases; when it gets low, paging starts and performance worsens. An initial reaction is to add more RAM to the system, but this does not usually solve the problem. It may postpone the problem, but it may also make paging more intense, as some paging parameters are scaled up as you add RAM. The kernel tries to counteract this effect by calculating running averages of the memory demand over five-second and 30-second periods.
If the average demand is high, the kernel expects that more memory will be needed in the future, so it sets up a deficit, which makes the target size of the free list up to twice as big as normal. This is the de column reported by vmstat. The deficit decays over a few seconds back to zero, so you often see a large deficit suddenly appear, then decay away again. With the latest kernel code the target size of the free list (set by lotsfree) increases on big-memory systems, and since the deficit is limited to the same value, you should expect to see larger peak values of de on large-memory systems.
The real problem is that the free list is too small, and it is being replenished too aggressively, too late. The simplest fix is to increase lotsfree, but remember that the free list is unused memory: if you make it too big you are wasting RAM. If you think the scanner is being too aggressive, you can also try reducing fastscan, which is the maximum page scanner rate in pages per second. By increasing lotsfree, the maximum value of the deficit will also increase. You also have to let the system stabilize for a while after the first time the page scanner cuts in; it needs to go right around the whole of memory once or twice before it settles down (vmstat -s tells you the number of revolutions it has done). It should then run on a "little and often" basis. As long as you always have enough free RAM to handle the short-term demand and don't have to scan hard all the time, performance should be good.
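As an illustration, raising the free list target and capping the scanner might look like this in /etc/system; the values are examples only, must be sized for your particular system, and take effect at the next reboot:

    * Example VM tuning only; do not copy these values blindly.
    * lotsfree is the free list target in pages; fastscan is the
    * maximum page scanner rate in pages per second.
    set lotsfree = 1024
    set fastscan = 16384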
System V shared memory
There are a few applications that make trivial use of System V shared memory, but the big, important applications are the database servers. Databases benefit from very large shared caches of data in some cases, and use System V shared memory to allocate as much as 3.5 gigabytes of RAM. By default, applications such as Oracle, Informix, and Sybase use a special flag to specify that they want intimate shared memory (ISM). In this case two special changes are made. First, all the memory is locked and cannot be paged out. Second, the memory management data structures that are normally created on a per-process basis are created once, then shared by every process.
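From the application side, ISM is requested with a flag at attach time. A minimal C sketch (the segment size is arbitrary; SHM_SHARE_MMU is the Solaris-specific ISM flag):

    #include <sys/types.h>
    #include <sys/ipc.h>
    #include <sys/shm.h>
    #include <stdio.h>

    /* Create a System V shared memory segment and attach it with the
     * Solaris SHM_SHARE_MMU flag to request intimate shared memory:
     * the segment is locked in RAM and its MMU mapping structures
     * are shared by every attaching process. */
    int
    main(void)
    {
        size_t size = 64 * 1024 * 1024;           /* arbitrary 64 MB */
        int id = shmget(IPC_PRIVATE, size, IPC_CREAT | 0600);
        void *addr;

        if (id == -1)
            return (1);
        addr = shmat(id, NULL, SHM_SHARE_MMU);    /* the ISM flag */
        if (addr == (void *)-1)
            return (1);
        (void) printf("ISM segment attached at %p\n", addr);
        (void) shmdt(addr);
        (void) shmctl(id, IPC_RMID, NULL);
        return (0);
    }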
Kernel values, tunables and defaults
I'll end by defining the most important kernel variables that can be
used to tune the virtual memory system. They are normally set in the
/etc/system
file, but some can be changed online using
adb -kw
if you really know what you are doing. If in
any doubt, stick with the defaults.
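For example, here is a hypothetical adb session that inspects lotsfree and then doubles it on a live system; the values shown are purely illustrative, and 0t prefixes a decimal number in adb:

    # adb -kw /dev/ksyms /dev/mem
    physmem 3ac8
    lotsfree/D
    lotsfree:
    lotsfree:       256
    lotsfree/W 0t512
    lotsfree:       0x100           =       0x200
    $q

Remember that a change made this way is lost at the next reboot; put the setting in /etc/system to make it permanent.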
Wrap up
Although I've tried to explain how memory usage works on several occasions, I hope this different approach makes the behavior, and the problems that go with it, clearer than before.
Resources
Other Cockcroft columns at www.sun.com
About the author
Adrian Cockcroft joined Sun in 1988, and currently works as a performance specialist for the Server Division of SMCC. He wrote Sun Performance and Tuning: SPARC and Solaris, published by SunSoft Press/PTR Prentice Hall. Reach Adrian Cockcroft at adrian.cockcroft@sunworld.com.