Click on our Sponsors to help Support SunWorld
Performance Q & A by Adrian Cockcroft

Help! I've lost my memory!

Before you scream "Memory leak!" take a look at how
SunOS and Solaris handle your precious RAM.

SunWorld
October  1995

Abstract
Why doesn't Sun's OS free unused memory? Adrian Cockcroft tackles this question in the first of his monthly performance columns for SunWorld Online. Cockcroft, Sun's performance guru, has heard and answered this and countless other questions during his years as a systems engineer. Once he explains how Solaris 1 and 2 handle your computer's memory, you'll probably be relieved. (2,600 words)

Dear Adrian,
After a reboot I saw that most of my computer's memory was free, but when I launched my application it used up almost all the memory. When I stopped the application the memory didn't come back! Take a look at my vmstat output:

% vmstat 5
 procs     memory            page            disk          faults      cpu
 r b w   swap  free  re  mf pi po fr de sr s0 s1 s2 s3   in   sy   cs us sy id
This is before the program starts:
 0 0 0 330252 80708   0   2  0  0  0  0  0  0  0  0  1   18  107  113  0  1 99
 0 0 0 330252 80708   0   0  0  0  0  0  0  0  0  0  0   14   87   78  0  0 99
I start the program and it runs like this for a while:
 0 0 0 314204  8824   0   0  0  0  0  0  0  0  0  0  0  414  132   79 24  1 74
 0 0 0 314204  8824   0   0  0  0  0  0  0  0  0  0  0  411   99   66 25  1 74
I stop it, then almost all the swap space comes back, but the free memory does not:
 0 0 0 326776 21260   0   3  0  0  0  0  0  0  1  0  0  420  116   82  4  2 95
 0 0 0 329924 24396   0   0  0  0  0  0  0  0  0  0  0  414   82   77  0  0 100
 0 0 0 329924 24396   0   0  0  0  0  0  0  0  2  0  1  430   90   84  0  1 99

I checked that there were no application processes running. It looks like a huge memory leak in the operating system. How can I get my memory back?
--RAMless in Ripon

The short answer
Launch your application again. Notice that it starts up more quickly than it did the first time, and with less disk activity. The application code and its data files are still in memory, even though they are not active. The memory they occupy is not "free." If you restart the same application it finds the pages that are already in memory. The pages are attached to the inode cache entries for the files. If you start a different application, and there is insufficient free memory, the kernel will scan for pages that have not been touched for a long time, and "free" them. Once you quit the first application, the memory it occupies is not being touched, so it will be freed quickly for use by other applications.

In 1988, Sun introduced this feature in SunOS 4.0. It still applies to all versions of Solaris 1 and 2. The kernel is trying to avoid disk reads by caching as many files as possible in memory. Attaching to a page in memory is around 1,000 times faster than reading it in from disk. The kernel figures that you paid good money for all of that RAM, so it will try to make good use of it by retaining files you might need.

By contrast, memory leaks appear as a shortage of swap space after the misbehaving program has run for a while. You will probably find a process that is larger than you expect. Restart the program to free up the swap space, and check it with a debugger that offers a leak-finding feature (SunSoft's DevPro debugger, for example).
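To make the distinction concrete, here is a minimal Python sketch that applies the rule of thumb above to the swap and free columns from the reader's vmstat output. The diagnose function and its 5 percent tolerance are hypothetical illustrations, not part of any Sun tool.

```python
# Swap and free values (in kilobytes) copied from the vmstat output in the letter.
SAMPLES = {
    "before": {"swap": 330252, "free": 80708},
    "running": {"swap": 314204, "free": 8824},
    "after": {"swap": 329924, "free": 24396},
}


def diagnose(before, after, tolerance=0.05):
    """Hypothetical rule of thumb: compare swap and free before and after a run.

    The 5 percent tolerance is an arbitrary assumption for this sketch.
    """
    swap_lost = before["swap"] - after["swap"]
    free_lost = before["free"] - after["free"]
    if swap_lost > tolerance * before["swap"]:
        return "swap not returned: look for a leaking or oversized process"
    if free_lost > 0:
        return "swap returned, free memory lower: pages are cached file data"
    return "no change"


print(diagnose(SAMPLES["before"], SAMPLES["after"]))
```

With the reader's numbers, nearly all the swap space comes back, so the sketch reports cached file data rather than a leak.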

The long (and technical) answer
To understand how Sun's operating systems handle memory, I will explain how the inode cache works, how the buffer cache fits into the picture, and how the life cycle of a typical page evolves as the system uses it for several different purposes.

The inode cache and file data caching
Whenever you access a file, the kernel needs to know the size, the access permissions, the date stamps and the locations of the data blocks on disk. Traditionally, this information is known as the inode for the file. There are many filesystem types. For simplicity I will assume we are only interested in the Unix filesystem (UFS) on a local disk. Each filesystem type has its own inode cache.

The filesystem stores inodes on the disk; the inode must be read into memory whenever an operation is performed on an entity in the filesystem. The number of inodes read per second is reported as iget/s by the sar -a command. The inode read from disk is cached in case it is needed again, and the number of inodes that the system will cache is influenced by a kernel parameter called ufs_ninode. The kernel keeps inodes on a linked list, rather than in a fixed-size table.

As I mention each command I will show you what the output looks like. In my case I'm collecting sar data automatically using cron, so sar defaults to reading the stored data for today. If you have no stored data, specify a time interval and sar will show you current activity.


% sar -a

SunOS hostname 5.4 Generic_101945-32 sun4c    09/18/95

00:00:01  iget/s namei/s dirbk/s
01:00:01       4       6       0
All reads or writes to UFS files occur by paging from the filesystem. All pages that are part of the file and are in memory will be attached to the inode cache entry for that file. When a file is not in use, its data is cached in memory, using an inactive inode cache entry. When the kernel reuses an inactive inode cache entry that has pages attached, it puts the pages on the free list; this case is shown by sar -g as %ufs_ipf. This number is the percentage of UFS inodes that were overwritten in the inode cache by iget and that had reusable pages associated with them. The kernel flushes the pages, and updates on disk any modified pages. Thus, this %ufs_ipf number is the percentage of igets with page flushes. Any non-zero values of %ufs_ipf reported by sar -g indicate that the inode cache is too small for the current workload.
% sar -g

SunOS hostname 5.4 Generic_101945-32 sun4c    09/18/95

00:00:01  pgout/s ppgout/s pgfree/s pgscan/s %ufs_ipf
01:00:01     0.02     0.02     0.08     0.12     0.00
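As a rough illustration of what %ufs_ipf measures, the percentage can be written as a one-line calculation. The function and parameter names below are hypothetical; they are not the names of kernel counters.

```python
def ufs_ipf_percent(inodes_reused, reused_with_pages):
    """Percentage of inodes reused by iget that still had pages attached.

    Both arguments are counts over the same sampling interval.
    """
    if inodes_reused == 0:
        return 0.0
    return 100.0 * reused_with_pages / inodes_reused


# Example: 3 of 200 reused inodes had cached pages that had to be flushed.
print(ufs_ipf_percent(200, 3))
```

Any result above zero means cached file pages were thrown away only because their inode was needed for another file.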

For SunOS 4 and releases up to Solaris 2.3, the number of inodes that the kernel will keep in the inode cache is set by the kernel variable ufs_ninode. To simplify: When a file is opened, an inactive inode will be reused from the cache if the cache is full; when an inode becomes inactive, it is discarded if the cache is over-full. If the cache limit has not been reached, then an inactive inode is placed at the back of the reuse list and invalid inodes (inodes for files that no longer exist) are placed at the front for immediate reuse. It is entirely possible for the number of open files in the system to cause the number of active inodes to exceed ufs_ninode; raising ufs_ninode allows more inactive inodes to be cached in case they are needed again.
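The simplified policy above can be sketched as a toy model in Python. This is an illustration only, not kernel code: the class and method names are hypothetical, and the model ignores locking, hashing, and the pages attached to each inode.

```python
from collections import OrderedDict


class InodeCache:
    """Toy model of the SunOS 4 / Solaris 2.3 inode cache policy."""

    def __init__(self, ufs_ninode):
        self.limit = ufs_ninode
        self.active = {}            # file name -> number of opens
        self.reuse = OrderedDict()  # inactive inodes; the front is reused first

    def size(self):
        return len(self.active) + len(self.reuse)

    def open(self, name):
        """Return 'hit' if the inode was still cached, otherwise 'miss'."""
        if name in self.active:
            self.active[name] += 1
            return "hit"
        if name in self.reuse:
            # Inactive but cached: any pages attached to it are still usable.
            del self.reuse[name]
            self.active[name] = 1
            return "hit"
        if self.size() >= self.limit and self.reuse:
            # Cache full: reuse the inactive inode at the front of the list.
            self.reuse.popitem(last=False)
        # With no inactive inode to reuse, active inodes may exceed the limit.
        self.active[name] = 1
        return "miss"

    def close(self, name, deleted=False):
        self.active[name] -= 1
        if self.active[name] > 0:
            return
        del self.active[name]
        if self.size() >= self.limit:
            return  # cache was over-full: discard the inode
        self.reuse[name] = True  # back of the reuse list
        if deleted:
            # Invalid inode: move to the front for immediate reuse.
            self.reuse.move_to_end(name, last=False)


cache = InodeCache(ufs_ninode=2)
print([cache.open(n) for n in "abc"], cache.size())
```

In this run three files are open at once, so the number of active inodes exceeds a ufs_ninode of 2, as the text describes.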

Solaris 2.4 uses a smarter inode cache algorithm. The kernel maintains a reuse list of blank inodes for instant use. The number of active inodes is no longer constrained, and the number of idle inodes (inactive but cached in case they are needed again) is kept between 75 percent of ufs_ninode and ufs_ninode by a new kernel thread that scavenges the inodes to free them and maintains entries on the reuse list. If you use sar -v to look at the inode cache, you may see a larger number of existing inodes than the reported "size."

% sar -v

SunOS hostname 5.4 Generic_101945-32 sun4c    09/18/95

00:00:01  proc-sz    ov  inod-sz    ov  file-sz    ov   lock-sz
01:00:01   66/506     0 2108/2108    0  353/353     0    0/0   
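The Solaris 2.4 idle-inode limits described above can be sketched in a few lines of Python. This is a toy model, not kernel code. The function name is hypothetical, and two things are assumptions: that the scavenger runs only once the idle count exceeds ufs_ninode, and that the "size" of 2108 reported by sar -v corresponds to ufs_ninode.

```python
def idle_inodes_to_free(idle_inodes, ufs_ninode):
    """Return how many idle inodes one scavenger pass would free (toy model)."""
    low_water = (ufs_ninode * 3) // 4  # 75 percent of ufs_ninode
    if idle_inodes <= ufs_ninode:
        return 0  # already within the allowed range
    return idle_inodes - low_water


# Example using the size shown in the sar -v sample (assumed to be ufs_ninode).
print(idle_inodes_to_free(2200, 2108))
```

Active inodes do not appear in this calculation at all, which is why sar -v can show more existing inodes than the reported size.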

Buffer cache
The buffer cache is used to cache filesystem data in SVR3 and BSD Unix. In SunOS 4, generic SVR4, and Solaris 2, it is used to cache inode, indirect block, and cylinder group blocks only. Although this change was introduced in 1988, many people still incorrectly think the buffer cache is used to hold file data. Inodes are read from disk to the buffer cache in 8-kilobyte blocks, then the individual inodes are read from the buffer cache into the inode cache.

Life cycle of a typical physical memory page
This section provides additional insight into the way memory is used. The sequence described is an example of some common uses of pages; many other possibilities exist.

1. Initialization -- A page is born

When the system boots, it forms all free memory into pages, and allocates a kernel data structure to hold the state of every page in the system.

2. Free -- An untouched virgin page

All the memory is put onto the free list to start with. At this stage the content of the page is undefined.

3. ZFOD -- Joining an uninitialized data segment

When a program accesses data that is preset to zero for the very first time, a minor page fault occurs and a Zero Fill On Demand (ZFOD) operation takes place. The page is taken from the free list, block-cleared to contain all zeroes, and added to the list of anonymous pages for the uninitialized data segment. The program then reads and writes data to the page.

4. Scanned -- The pagedaemon awakes

When the free list gets below a certain size, the pagedaemon starts to look for memory pages to steal from processes. It looks at all pages in physical memory order; when it gets to our page, the page is synchronized with the memory management unit (MMU) and its reference bit is cleared.

5. Waiting -- Is the program really using this page right now?

There is a delay that varies depending upon how quickly the pagedaemon scans through memory. If the program references the page during this period, the MMU reference bit is set.

6. Pageout Time -- Saving the contents

The pagedaemon returns and checks the MMU reference bit, finding that the program has not used the page, so it can be stolen for reuse. The pagedaemon then checks whether anything has been written to the page; because this page does contain modified data, a page-out occurs. The page is moved to the pageout queue and marked as I/O pending. The swapfs code clusters the page together with other pages on the queue and writes the cluster to the swap space. The page is then free and is put on the free list again, but the kernel remembers that it still contains the program data.

7. Reclaim -- Give me back my page!

Belatedly, the program tries to read the page and takes a page fault. If the page had been reused by someone else in the meantime, a major fault would occur and the data would be read from the swap space into a new page taken from the free list. In this case, the page is still waiting to be reused, so a minor fault occurs, and the page is moved back from the free list to the program's data segment.

8. Program Exit -- Free again

The program finishes running and exits. The data segments are private to that particular instance of the program (unlike the shared-code segments), so all the pages in the data segment are marked as undefined and put onto the free list. This is the same state as Step 2.

9. Page-in -- A shared code segment

A page fault occurs in the code segment of a window system shared library. The page is taken off the free list, and a read from the filesystem is scheduled to get the code. The process that caused the page fault sleeps until the data arrives. The page is attached to the inode of the file, and the segments reference the inode.

10. Attach -- A popular page

Another process using the same shared library takes a page fault in the same place. It discovers that the page is already in memory and attaches to the page, increasing its inode reference count by one.

11. COW -- Making a private copy

If one of the processes sharing the page tries to write to it, a copy-on-write (COW) page fault occurs. Another page is grabbed from the free list, and a copy of the original is made. This new page becomes part of a privately mapped segment backed by anonymous storage (swap space) so it can be changed, but the original page is unchanged and can still be shared. Shared libraries contain jump tables in the code that are patched, using COW as part of the dynamic linking process.

12. File Cache -- Not free

The entire window system exits, and both processes go away. This time the page stays in use, attached to the inode of the shared library file. The inode is now inactive but will stay in the inode cache until it is reused, and the pages act as a file cache in case the user is about to restart the window system again.

13. fsflush -- Flushed by the sync

Every 30 seconds all the pages in the system are examined in physical page order to see which ones contain modified data and are attached to a vnode. The details differ between SunOS 4 and Solaris 2, but essentially any modified pages will be written back to the filesystem, and the pages will be marked as clean.

This example sequence can continue from Step 4 or Step 9 with minor variations. The fsflush process occurs every 30 seconds by default for all pages, and whenever the free list size drops below a certain value, the pagedaemon scanner wakes up and reclaims some pages.
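Steps 3 through 7 of the life cycle can be sketched as a small Python simulation. It is a deliberately simplified model, not the real algorithm: the class and method names are hypothetical, a single boolean stands in for the MMU reference bit, and the scanner is reduced to two explicit passes, scan() and steal().

```python
class Page:
    def __init__(self):
        self.owner = None        # identity of the data the page holds, if any
        self.on_free_list = True
        self.referenced = False  # stands in for the MMU reference bit
        self.modified = False


class Memory:
    def __init__(self, npages):
        self.pages = [Page() for _ in range(npages)]
        self.swap = set()        # owners whose data has been paged out

    def _take_free(self):
        for p in self.pages:
            if p.on_free_list:
                p.on_free_list = False
                return p
        raise MemoryError("free list is empty")

    def touch(self, owner, write=False):
        """Access the page holding `owner`; return the kind of fault taken."""
        for p in self.pages:
            if p.owner == owner:
                fault = "minor (reclaim)" if p.on_free_list else "none"
                p.on_free_list = False
                p.referenced = True
                p.modified = p.modified or write
                return fault
        p = self._take_free()    # reusing a page destroys its old identity
        fault = "major (read from swap)" if owner in self.swap else "minor (ZFOD)"
        p.owner, p.referenced, p.modified = owner, True, write
        return fault

    def scan(self):
        """First pass: clear the reference bit on every page in use."""
        for p in self.pages:
            if not p.on_free_list:
                p.referenced = False

    def steal(self):
        """Second pass: free pages that were not referenced since scan()."""
        for p in self.pages:
            if not p.on_free_list and not p.referenced:
                if p.modified:
                    self.swap.add(p.owner)  # page-out saves the contents
                    p.modified = False
                p.on_free_list = True       # owner is kept, so reclaim works


mem = Memory(npages=1)
print(mem.touch("heap", write=True))  # first touch: zero fill on demand
mem.scan(); mem.steal()               # page written to swap, then freed
print(mem.touch("heap"))              # still on the free list: reclaimed
```

The key point the model preserves is that a page on the free list keeps its identity until someone else reuses it, so touching it again costs only a minor fault.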

Now you know
I have seen this missing-memory question asked about once a month since 1988! Perhaps the manual page for vmstat should include a better explanation of what the values are measuring. This answer is based on some passages from my book Sun Performance and Tuning. The book explains in detail how the memory algorithms work and how to tune them.

Please fill out SunWorld Online's performance questionnaire to let us know where you stand on performance issues. Submit your own question by sending mail to adrian.cockcroft@sunworld.com.

About the author
Adrian Cockcroft joined Sun in 1988, and currently works as a performance specialist for the newly formed Server Division of SMCC. Reach Adrian at adrian.cockcroft@sunworld.com.

[(c) Copyright  Web Publishing Inc., and IDG Communication company]

URL: http://www.sunworld.com/swol-10-1995/swol-10-perf.html