
Getting to know the Solaris filesystem, Part 3

How is file data cached?

By Richard McDougall

SunWorld
July 1999

Abstract
One of the most important features of a filesystem is its ability to cache file data. Ironically, however, the filesystem cache isn't implemented in the filesystem. In Solaris, the filesystem cache is implemented in the virtual memory system. In Part 3 of this series on the Solaris filesystem, Richard explains how Solaris file caching works and explores the interactions between the filesystem cache and the virtual memory system. (4,000 words)


Traditional Unix implements filesystem caching in the I/O subsystem by keeping copies of recently read or written blocks in a block cache. This block cache sits just above the disks and caches data corresponding to physical disk sectors. Figure 1 illustrates the path by which a process reads a piece of a file. The process reads a segment of a file by issuing a read system call into the operating system. The filesystem must then find the corresponding disk block for the file (by looking up the block number in the direct/indirect blocks for that file) and request that block from the I/O system. The I/O system retrieves the block from disk the first time; subsequent reads are satisfied from the block buffer cache. Note that even though the disk block is cached in memory, we have to invoke the filesystem and look up the physical block number for every cached read, because this is a physical block cache.


Figure 1. The old buffer cache

The old buffer cache is typically sized statically by a kernel configuration parameter. Changing the size of the buffer cache requires a kernel rebuild and a reboot.

The Solaris page cache
Solaris has a newer method of caching filesystem data, known as the page cache. The page cache was developed at Sun as part of the virtual memory rewrite in SunOS 4 in 1985, and is used by System V Release 4 Unix. Page cache derivatives are now also used in Linux and Windows NT. The page cache differs from the old buffer cache in two major ways. First, it's dynamically sized and can use all memory that isn't being used by applications. Second, it caches file blocks rather than disk blocks. In other words, the page cache is a virtual file cache rather than a physical block cache. This allows the operating system to retrieve file data by simply looking up the file reference and seek offset, rather than having to invoke the filesystem to look up the physical disk block number and retrieve that block from the physical block cache. This is far more efficient.

Figure 2 shows the new page cache. When a Solaris process reads a file the first time, file data is read from disk through the filesystem into memory in page-size chunks and returned to the user. The next time the same segment of file data is read, it can be retrieved directly from the page cache, without a logical-to-physical lookup through the filesystem. The old buffer cache is still used in Solaris, but only for filesystem data that is known solely by physical block numbers: metadata items such as direct/indirect blocks and inodes. All file data is cached through the page cache.
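A quick way to see the page cache at work is to time two successive reads of the same file. The small C program below is a sketch I offer for illustration only (file name and timings are arbitrary): the first pass pages the file in from disk, while the second is typically satisfied from the page cache and runs much faster.

/* pagecache_demo.c -- read the same file twice and compare timings.
 * Illustrative sketch only: the first pass pages the file in from
 * disk; the second is normally satisfied from the page cache. */
#include <stdio.h>
#include <stdlib.h>
#include <fcntl.h>
#include <unistd.h>
#include <sys/time.h>

static double read_pass(const char *path, char *buf, size_t bufsz)
{
    struct timeval start, end;
    ssize_t n;
    int fd = open(path, O_RDONLY);
    if (fd < 0) { perror("open"); exit(1); }
    gettimeofday(&start, NULL);
    while ((n = read(fd, buf, bufsz)) > 0)
        ;                                   /* discard the data */
    gettimeofday(&end, NULL);
    close(fd);
    return (end.tv_sec - start.tv_sec) +
           (end.tv_usec - start.tv_usec) / 1e6;
}

int main(int argc, char **argv)
{
    static char buf[8192];                  /* one 8-KB page per read */
    if (argc != 2) {
        fprintf(stderr, "usage: %s file\n", argv[0]);
        exit(1);
    }
    printf("first pass:  %.2f s\n", read_pass(argv[1], buf, sizeof(buf)));
    printf("second pass: %.2f s\n", read_pass(argv[1], buf, sizeof(buf)));
    return 0;
}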


Figure 2. Filesystem caching via the page cache

The diagram in Figure 2 is somewhat simplified. The filesystem is still involved in the page cache lookup, but the amount of work it needs to do is dramatically reduced. The page cache is implemented in the virtual memory system. In fact, the virtual memory system is architected around the page cache principle: each page of physical memory is identified the same way, by file and offset. Pages associated with files point to regular files, while pages of memory associated with the private memory space of processes point to the swap device. We aren't going to go into too much detail about the Solaris virtual memory system in this article; a more detailed description can be found in my Solaris memory white paper at http://www.sun.com/sun-on-net/performance/vmsizing.pdf. The important thing to remember is that file cache is treated just like process memory -- and, as we will see, file caching shares the same paging dynamics as the rest of the memory system.
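Conceptually, a virtual file cache behaves like a hash table keyed by file and offset. The sketch below is an illustration only -- it is not the Solaris source -- but it shows why a cached read needs no logical-to-physical block translation: the pair (file, offset) is the page's name.

/* Conceptual sketch of a virtual file cache -- an illustration,
 * not the Solaris implementation. Each cached page is named by the
 * pair (file, offset), so a lookup needs no logical-to-physical
 * block translation through the filesystem. */
#include <stddef.h>

struct vnode;                       /* opaque file identity */

#define PAGESHIFT 13                /* 8-KB pages on UltraSPARC */
#define HASHSIZE  4096

struct page {
    struct vnode *vp;               /* the file this page caches     */
    size_t        offset;           /* page-aligned offset into file */
    struct page  *next;             /* hash chain                    */
    /* ... the cached data itself, flags, and so on ... */
};

static struct page *page_hash[HASHSIZE];

static unsigned page_hashfn(struct vnode *vp, size_t off)
{
    return (unsigned)(((size_t)vp >> 4) + (off >> PAGESHIFT)) % HASHSIZE;
}

/* Return the cached page for (vp, off), or NULL if it must be
 * paged in from disk by the filesystem. */
struct page *page_lookup(struct vnode *vp, size_t off)
{
    struct page *pp;

    for (pp = page_hash[page_hashfn(vp, off)]; pp != NULL; pp = pp->next)
        if (pp->vp == vp && pp->offset == off)
            return pp;
    return NULL;
}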

The old buffer cache in Solaris
The old buffer cache is used in Solaris to cache inodes and file metadata, and it is now also dynamically sized. In old versions of Unix, the buffer cache was fixed in size by the nbuf kernel parameter, which specified the number of 512-byte buffers. Solaris now allows the buffer cache to grow, nbuf buffers at a time, as needed, until it reaches a ceiling specified by the bufhwm kernel parameter. By default, the buffer cache is allowed to grow until it uses two percent of physical memory. We can look at the upper limit for the buffer cache using the sysdef command.

# sysdef

*
* Tunable parameters
*
        7757824 maximum memory allowed in buffer cache (bufhwm)
           5930 maximum number of processes (v.v_proc)
             99 maximum global priority in sys class (MAXCLSYSPRI)
           5925 maximum processes per user id (v.v_maxup)
             30 auto update time limit in seconds (NAUTOUP)
             25 page stealing low water mark (GPGSLO)
              5 fsflush run rate (FSFLUSHR)
             25 minimum resident memory for avoiding deadlock (MINARMEM)
             25 minimum swappable memory for avoiding deadlock (MINASMEM)

Now that we only keep inodes and metadata in the buffer cache, we don't need a very large buffer. In fact, we only need 300 bytes per inode, and about 1 MB per 2 GB of files that we expect to be accessed concurrently. (Note that this rule of thumb is for the Unix filesystem (UFS).) For example, if we have a database system with 100 files totaling 100 GB of storage space and we estimate that we'll access only 50 GB of those files at the same time, then at most we would need 30 KB (100 x 300 bytes = 30 KB) for the inodes and about 25 MB (50 / 2 x 1 MB = 25 MB) for the metadata (direct and indirect blocks). On a system with 5 GB of physical memory, the default would give us a bufhwm of 102 MB, which is more than sufficient for the buffer cache. We could limit bufhwm to 30 MB to save memory by setting the bufhwm parameter in the /etc/system file. The units for bufhwm are kilobytes. To set bufhwm smaller for this example, we would put the following line into /etc/system.

*
* Limit size of bufhwm
*
set bufhwm=30000
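The arithmetic behind this setting is simple enough to capture in a small helper. The C sketch below is an illustration only; the constants are just the UFS rules of thumb quoted above (round the result up for headroom).

/* bufsize.c -- estimate a reasonable bufhwm from the UFS rules of
 * thumb above: ~300 bytes per inode, plus ~1 MB of direct/indirect
 * blocks per 2 GB of concurrently accessed file data. A sketch only. */
#include <stdio.h>

int main(void)
{
    long nfiles        = 100;   /* files in the working set          */
    long concurrent_gb = 50;    /* GB accessed at the same time      */

    long inode_bytes = nfiles * 300;
    long meta_kb     = concurrent_gb * 1024 / 2;   /* 1 MB per 2 GB */
    long bufhwm_kb   = inode_bytes / 1024 + meta_kb;

    printf("inodes:   ~%ld KB\n", inode_bytes / 1024);
    printf("metadata: ~%ld MB\n", meta_kb / 1024);
    printf("suggested: set bufhwm=%ld\n", bufhwm_kb);
    return 0;
}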

You can monitor the old buffer cache hit statistics using sar -b. The statistics for the buffer cache show the number of logical reads and writes into the buffer cache, the number of physical reads and writes out of the buffer cache, and the read/write hit ratios.

# sar -b 3 333

SunOS zangief 5.7 Generic sun4u 06/27/99

22:01:51 bread/s  lread/s  %rcache bwrit/s lwrit/s  %wcache  pread/s  pwrit/s
22:01:54       0     7118      100       0       0      100        0        0
22:01:57       0     7863      100       0       0      100        0        0
22:02:00       0     7931      100       0       0      100        0        0
22:02:03       0     7736      100       0       0      100        0        0
22:02:06       0     7643      100       0       0      100        0        0
22:02:09       0     7165      100       0       0      100        0        0
22:02:12       0     6306      100       8      25       68        0        0
22:02:15       0     8152      100       0       0      100        0        0
22:02:18       0     7893      100       0       0      100        0        0

On this system, we can see that the buffer cache is caching 100 percent of the reads, and the number of writes is small. This measurement was taken on a machine with 100 GB of files being read in a random pattern. You should aim for a read cache hit ratio of 100 percent on systems with few but very large files (e.g., database systems), and a hit ratio of 90 percent or better on systems with many files.
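sar derives its hit-ratio columns from the logical and physical counts: a logical read (lread) that never becomes a physical read (bread) was satisfied from the buffer cache. The C sketch below shows the arithmetic, using the figures from the 22:02:12 sample above.

/* hitratio.c -- the arithmetic behind sar -b's %rcache and %wcache
 * columns. Figures are from the 22:02:12 sample in the output above. */
#include <stdio.h>

int main(void)
{
    double bread = 0, lread = 6306;   /* physical and logical reads/s  */
    double bwrit = 8, lwrit = 25;     /* physical and logical writes/s */

    printf("%%rcache = %.0f\n", 100.0 * (lread - bread) / lread); /* 100 */
    printf("%%wcache = %.0f\n", 100.0 * (lwrit - bwrit) / lwrit); /*  68 */
    return 0;
}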



The page cache and the virtual memory system
The virtual memory system is implemented around the page cache, and the filesystem makes use of this facility to cache files. This means that to understand filesystem caching behavior, we need to look at how the virtual memory system implements the page cache. The virtual memory system divides physical memory into chunks known as pages; on UltraSPARC systems, each page is 8 KB. To read data from a file into memory, the virtual memory system reads in one page at a time. The operation of reading in file pages this way is known as paging in a file. The page-in operation is initiated in the virtual memory system, which asks the filesystem that owns the file to page in a page from storage to memory. Every time we read in data from disk to memory we cause paging to occur, which is what we see when we look at the virtual memory statistics. For example, a file read will show up in vmstat as page ins.
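The rreadtest program used below isn't listed in this article; a minimal random-read tester along the same lines might look like the following hypothetical sketch (run it in the background against a large file, and kill it when done).

/* rreadtest.c -- hypothetical reconstruction of a random-read tester;
 * the original program is not listed in the article. Reads 8-KB
 * chunks from random offsets forever, forcing continuous page-ins. */
#include <stdio.h>
#include <stdlib.h>
#include <fcntl.h>
#include <unistd.h>
#include <sys/stat.h>

int main(int argc, char **argv)
{
    char buf[8192];
    struct stat st;
    int fd;

    if (argc != 2 || (fd = open(argv[1], O_RDONLY)) < 0) {
        fprintf(stderr, "usage: %s file\n", argv[0]);
        exit(1);
    }
    if (fstat(fd, &st) < 0 || st.st_size < 8192) {
        fprintf(stderr, "%s: file too small\n", argv[1]);
        exit(1);
    }
    srand(getpid());
    for (;;) {
        /* pick a random 8-KB-aligned offset within the file */
        off_t off = ((off_t)rand() % (st.st_size / 8192)) * 8192;
        pread(fd, buf, sizeof(buf), off);
    }
}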

# ./rreadtest testfile&

# vmstat 
 procs    memory          page             disk        faults     cpu
r b w  swap free re mf  pi  po fr de sr s0 -- -- --  in   sy  cs us sy id
0 0 0 50436 2064  5  0  81   0  0  0  0 15  0  0  0 168  361  69  1 25 74
0 0 0 50508 1336 14  0 222   0  0  0  0 35  0  0  0 210  902 130  2 51 47
0 0 0 50508  648 10  0 177   0  0  0  0 27  0  0  0 168  850 121  1 60 39
0 0 0 50508  584 29 57  88 109  0  0  6 14  0  0  0 108 5284 120  7 72 20
0 0 0 50508  484  0 50 249  96  0  0 18 33  0  0  0 199  542 124  0 50 50
0 0 0 50508  492  0 41 260  70  0  0 56 34  0  0  0 209  649 128  1 49 50
0 0 0 50508  472  0 58 253 116  0  0 45 33  0  0  0 198  566 122  1 46 53

In our example, we can see that starting a program that does random reads of a file causes a number of page ins, as indicated by the numbers in the pi column of vmstat. Note that the free memory column in vmstat has dropped to a very low value -- in fact, the amount of free memory is almost zero. Have you noticed that when you boot your machine there's a lot of free memory, and as the machine is used, free memory falls toward zero and then just hangs there? This is because the filesystem consumes a page of physical memory every time it pages in a page-size chunk of a file. It's normal: the filesystem is using all available memory to cache the reads and writes to each file. Memory is put back on the free list by the page scanner, which looks for memory pages that haven't been used recently. The page scanner runs when free memory falls below a threshold set by the lotsfree system parameter. In this example, we can see that the page scanner is scanning about 50 pages per second to replace the memory used by the filesystem.

There is no equivalent of bufhwm for the page cache; it simply grows to consume all available memory, including any memory that isn't being used by applications. The rate at which the system pages and the rate at which the page scanner runs are proportional to the rate at which the filesystem is reading or writing pages to disk. On large systems, this means you should expect to see large paging values. Consider a system reading 10 MBps through the filesystem; at 8 KB per page, this translates to 1,280 page ins per second. The page scanner must free 1,280 pages per second to keep up, and it must actually scan faster than that because not all the memory it examines is eligible for freeing. (The page scanner only frees memory that hasn't been recently used.) If the page scanner finds only one out of three pages eligible for freeing, it will need to run at 3,840 pages per second. Don't worry about high scan rates: if you're using the filesystem heavily, they're normal. There are many myths about high scan rates indicating a shortage of memory; as you can now see, high scan rates are normal in many circumstances.
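The worked example above can be checked with a few lines of C (a sketch, assuming the 8-KB UltraSPARC page size):

/* scanrate.c -- the arithmetic behind the example above: how fast
 * the page scanner must run to keep up with a filesystem read rate. */
#include <stdio.h>

int main(void)
{
    double read_mbps = 10.0;        /* filesystem read rate          */
    double page_kb   = 8.0;         /* UltraSPARC page size          */
    double eligible  = 1.0 / 3.0;   /* fraction of pages freeable    */

    double pageins = read_mbps * 1024.0 / page_kb;  /* 1,280 per sec */
    double scans   = pageins / eligible;            /* 3,840 per sec */

    printf("page-ins required: %.0f/s\n", pageins);
    printf("scan rate needed:  %.0f/s\n", scans);
    return 0;
}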

Filesystem paging tricks
It should be noted that some filesystems try to reduce this memory pressure in two ways: by enabling free behind for sequential access, and by freeing pages when free memory falls to lotsfree. As mentioned in last month's article, free behind is invoked on UFS when a file is accessed sequentially, so that we don't pollute the cache when doing a sequential scan through a large file. This means that when we create or sequentially read a file, we don't see high scan rates.

There are also checks in some filesystems to limit their use of the page cache when memory falls toward lotsfree. The UFS filesystem uses an additional parameter, pages_before_pager, which sets how far above the page scanner's wakeup point this throttling begins. It defaults to 200 pages, so when free memory falls to lotsfree + pages_before_pager -- 1.6 MB above lotsfree on UltraSPARC -- the filesystems start throttling back their use of the page cache.

Is all that paging bad for my system?
Although I stated earlier that high paging and scan rates may be normal, the page scanner can put too much pressure on your applications' private process memory. Once the scanner is running at several hundred pages per second or more, the interval between its checks of any given page shrinks to a few seconds. This means that any page that hasn't been used in the last few seconds is liable to be taken by the page scanner while you're using the filesystem, which can have a very negative effect on application performance. Priority paging was introduced to address this problem.

If you've ever noticed that your system feels slow while filesystem I/O is going on, it's because your applications are being paged in and out as a direct result of the filesystem activity. For example, consider an OLTP (online transaction processing) application that makes heavy use of the filesystem. The database generates filesystem I/O, causing the page scanner to actively steal pages from the system. The user of the OLTP application pauses for 15 seconds to read the contents of a screen from the last transaction. During this time, the page scanner finds that the pages associated with the user's application haven't been referenced, marks them available for stealing, and steals them. When the user types the next keystroke, he is forced to wait until his application is paged back in -- usually several seconds. So the user waits for the application to page in from the swap device, even though the system has sufficient memory to keep the entire application in physical memory!

The priority paging algorithm effectively places a boundary around the file cache, so that filesystem I/O does not cause unnecessary paging of applications. It does this by prioritizing the different types of pages in the page cache, in order of importance:

- Pages of executables and shared libraries (most important)
- Application memory -- the anonymous heap and stack pages of processes
- Regular file cache pages (least important, and thus the first to be stolen)

When the dynamic page cache grows to the point where free memory falls to almost zero, the page scanner wakes up and begins scanning. Even though there is sufficient memory in the system, the scanner will only steal pages associated with regular files. The filesystem effectively pages against itself, rather than against everything else on the system.

Should there be a real memory shortage, where there is insufficient memory for the applications and kernel, the scanner is again allowed to steal pages from the applications. Priority paging is disabled by default, but it is likely to be enabled by default in a Solaris release subsequent to Solaris 7. To use priority paging, you need either Solaris 7, Solaris 2.6 with kernel patch 105181-13, or Solaris 2.5.1 with kernel patch 103640-25 or higher. To enable priority paging, set the following in /etc/system:

*
* Enable Priority Paging
*
set priority_paging=1

Setting priority_paging=1 in /etc/system causes a new memory tunable, cachefree, to be set to two times the old paging high-water mark, lotsfree. The cachefree memory tunable is a new parameter, which scales with minfree, desfree, and lotsfree. (See the complete discussion of priority paging at http://www.sun.com/sun-on-net/performance/priority_paging.html.)

Priority paging distinguishes executable files from regular files by recording whether they are mapped into an address space with execute permission. This means that a regular file with the execute bit set that is mapped into an address space with the memory map system call, mmap, is treated as an executable, so take care to ensure that data files being memory mapped do not have the execute bit set.
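For example, here is how a data file might be mapped without execute permission (a minimal sketch; most error handling is trimmed). Because the mapping requests only PROT_READ, priority paging treats these pages as low-priority file pages rather than as executables.

/* mapdata.c -- map a data file without PROT_EXEC so that priority
 * paging treats its pages as file cache, not executable pages.
 * A minimal sketch; most error handling trimmed for brevity. */
#include <stdio.h>
#include <fcntl.h>
#include <unistd.h>
#include <sys/mman.h>
#include <sys/stat.h>

int main(int argc, char **argv)
{
    struct stat st;
    char *p;
    int fd;

    if (argc != 2 || (fd = open(argv[1], O_RDONLY)) < 0 ||
        fstat(fd, &st) < 0) {
        fprintf(stderr, "usage: %s file\n", argv[0]);
        return 1;
    }

    /* PROT_READ only: no execute permission, so these pages remain
     * low-priority file pages under priority paging. */
    p = mmap(NULL, st.st_size, PROT_READ, MAP_SHARED, fd, 0);
    if (p == MAP_FAILED) { perror("mmap"); return 1; }

    /* ... use p[0 .. st.st_size - 1] ... */

    munmap(p, st.st_size);
    close(fd);
    return 0;
}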

Under Solaris 7, there is an extended set of paging counters that allow us to see what type of paging is occurring. We can now distinguish paging caused by an application memory shortage from paging through the filesystem. The counters are visible under Solaris 7 with the memstat command, whose output is similar to vmstat's but with extra fields that break down the different types of paging. In addition to the regular paging counters (sr, po, pi, fr), memstat shows paging broken out into three types: executable, application, and file. The memstat fields are listed in the table below.

Column  Description
pi      Total page ins per second
po      Total page outs per second
fr      Total page frees per second
sr      Page scan rate in pages per second
epi     Executable page ins per second
epf     Executable pages freed per second
api     Application (anonymous) page ins per second from the swap device
apo     Application (anonymous) page outs per second to the swap device
apf     Application pages freed per second
fpi     File page ins per second
fpo     File page outs per second
fpf     File page frees per second

The memstat fields

Using memstat, we can now see that as we randomly read our test file, the scanner scans several hundred pages per second through memory and causes executable pages to be freed. Application pages are also paged out to the swap device as they are stolen. On a system with plenty of memory, this isn't the desired mode of operation; enabling priority paging stops it from happening when there is sufficient memory.

# ./readtest testfile&
# memstat 3
memory -------  paging ------- -executable- -anonymous- --filesys - -- cpu ---
free re mf  pi  po  fr de   sr epi epo epf api apo apf fpi fpo fpf us sy wt id
2080  1  0 749 512 821  0  264   0   0 269   0 512 549 749   0   2  1  7 92  0
1912  0  0 762 384 709  0  237   0   0 290   0 384 418 762   0   0  1  4 94  0
1768  0  0 738 426 610  0 1235   0   0 133   0 426 434 738   0  42  4 14 82  0
1920  0  2 781 469 821  0  479   0   0 218   0 469 525 781   0  77 24 54 22  0
2048  0  0 754 514 786  0  195   0   0 152   0 512 597 754   2  37  1  8 91  0
2024  0  0 741 600 850  0  228   0   0 101   0 597 693 741   2  56  1  8 91  0
2064  0  1 757 426 589  0  143   0   0  72   8 426 498 749   0  18  1  7 92  0

If we enable priority paging, we can observe the difference using the memstat command.

# ./readtest testfile&

# memstat 3
memory --------- paging -------- -executable - anonymous - - filesys -- -cpu ---
 free re  mf   pi po  fr de   sr epi epo epf api apo apf fpi fpo fpf us sy wt id
 3616  6   0  760  0 752  0  673   0   0   0   0   0   0 760   0 752  2  3 95  0
 3328  2 198  816  0 925  0 1265   0   0   0   0   0   0 816   0 925  2 10 88  0
 3656  4 195  765  0 792  0  263   0   0   0   2   0   0 762   0 792  7 11 83  0
 3712  4   0  757  0 792  0  186   0   0   0   0   0   0 757   0 792  1  9 91  0
 3704  3   0  770  0 789  0  203   0   0   0   0   0   0 770   0 789  0  5 95  0
 3704  4   0  757  0 805  0  205   0   0   0   0   0   0 757   0 805  2  6 92  0
 3704  4   0  778  0 805  0  266   0   0   0   0   0   0 778   0 805  1  6 93  0

With priority paging enabled, the virtual memory system behaves differently. Using the same test program, random reads on the filesystem again cause the system to page, and the scanner is actively involved in managing the pages, but now the scanner frees only file pages. The rows of zeros in the executable and anonymous memory columns clearly show that the scanner chooses file pages first. The activity is in the fpi and fpf columns: file pages are read in, and an equal number are freed by the page scanner to make room for more reads.

Paging parameters that affect filesystem performance
When priority paging is enabled, you will notice that the filesystem scan rate is higher. This is because the page scanner must skip over process private memory and executables. It needs to scan more pages before it finds file pages that it can steal. High scan rates are always found on systems that make heavy use of the filesystem and shouldn't be used as a factor for determining memory shortage. If you have Solaris 7, the memstat command will reveal whether you're paging to the swap device -- if so, the system is short of memory.

If you have high filesystem activity, you may find that the default scanner parameters limit filesystem performance. To compensate, you will need to set some of the scanner parameters so that the scanner can scan at a high enough rate to keep up with the filesystem. By default, the scanner is limited by the fastscan parameter, the maximum number of pages per second the scanner can scan. It defaults to scanning a quarter of memory every second, with a limit of 64 MBps. The scanner runs at half of fastscan when memory is at lotsfree, which limits it to 32 MBps. If only one in three physical memory pages is a file page, the scanner will only be able to put about 11 MB per second (32 / 3 = 11 MB) on the free list, thus limiting filesystem throughput.

You will need to increase fastscan so that the page scanner works faster. I recommend setting it to a quarter of memory, with an upper limit of 1 GB per second. At an 8-KB page size, that upper limit translates to 131072 for the fastscan parameter. The handspreadpages parameter should be increased along with fastscan, to the same value.

Another limiting factor for the filesystem is maxpgio, the maximum number of pages the page scanner can push. Because it also limits the number of filesystem pages that are pushed, it limits the write performance of the filesystem. If your system has sufficient memory, I recommend setting maxpgio to something large, such as 65536, which is the default on E10000 systems.

For example, on a 4-GB machine, one quarter of memory is 1 GB, so we would set fastscan to 131072. Our parameters for this machine would be:

*
* Parameters to allow better filesystem throughput
*
set fastscan=131072
set handspreadpages=131072
set maxpgio=65536
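For other memory sizes, the same rule of thumb can be computed directly. The C sketch below (an illustration only, assuming 8-KB pages; not an official tuning tool) derives the settings above:

/* scanparams.c -- derive the suggested scanner settings from
 * physical memory size, assuming 8-KB pages. A sketch of the rule
 * of thumb above, not an official tuning tool. */
#include <stdio.h>

int main(void)
{
    long mem_mb  = 4096;                /* physical memory: 4 GB   */
    long page_kb = 8;                   /* UltraSPARC page size    */
    long cap_mb  = 1024;                /* 1-GBps upper limit      */

    long target_mb = mem_mb / 4;        /* quarter of memory/sec   */
    if (target_mb > cap_mb)
        target_mb = cap_mb;

    long fastscan = target_mb * 1024 / page_kb;

    printf("set fastscan=%ld\n", fastscan);        /* 131072 here  */
    printf("set handspreadpages=%ld\n", fastscan);
    printf("set maxpgio=65536\n");
    return 0;
}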

We've now seen how Solaris uses a virtual file cache, the page cache, to integrate filesystem caching with the memory system. In future articles we will explore some of the filesystem-specific caching options that can be used to relieve some of the memory pressure caused by filesystem I/O.





About the author
Richard McDougall is an established engineer in the Enterprise Engineering Group at Sun Microsystems, where he focuses on large system performance and operating system architecture. He has more than 12 years of performance tuning, application/kernel development, and capacity planning experience on many different flavors of Unix. Richard has authored a wide range of papers and tools for measurement, monitoring, tracing, and sizing of Unix systems, including the memory sizing methodology for Sun, the MemTool set of tools for fine-grained instrumentation of memory on Solaris, the recent priority paging memory algorithms in Solaris, and many of the unbundled tools for Solaris. Richard is currently coauthoring the Sun Microsystems book Solaris Architecture with Jim Mauro.

