Caches, thrashes and smashes

How to use CacheFS to cut your NFS server and network load

August 1996

Abstract

A look at CacheFS in action, minimizing network contortion, thrashing, and trashing. We'll start with a review of NFS caching using the virtual memory system, and then see how CacheFS adds another local layer of backing store to the cache hierarchy. Then we'll dig into CacheFS management, configuration options, and cache tuning. The usual assortment of tips, caveats and customer-tested hacks rounds out this month's installment--and the recent months' series--on NFS and friends. (3,100 words)

Even as George Gilder promises the world that bandwidth will be infinite, demand for network capacity seems to increase faster than corporate networks can deliver it. Wire speed may be cheap, but running a fat and wide pipe to every user's desk is typically beyond the financial reach of most IT organizations. Network congestion and server pile-ups haven't slaked our thirst for networked data access (possibly why we've devoted the last quarter's worth of columns to NFS-related topics), so demand reduction is the order of the day. The best way to control network congestion is to simply not use the network, and access data from a local disk. Pushing the data out to the edges of the network reduces latency, and allows for better scaling by reducing the network requirements of each node.

Disk caching of recently used files was introduced in the Unix space with the Andrew filesystem, and later commercialized with the OSF's Distributed Filesystem (DFS) as part of the Distributed Computing Environment (DCE). Sun added similar functionality to Solaris 2.3 with the CacheFS filesystem type, an on-disk cache for recently used NFS data. Caching NFS accesses can improve server scalability (since fewer requests are serviced over the network), relieve network congestion, and improve client response time. On the other hand, poor cache management and configuration can lead to reduced client performance with no appreciable benefit on the server or network.

This month, we'll see CacheFS in action, minimizing network contortion, thrashing, and trashing. We'll start with a review of NFS caching using the virtual memory system, and then see how CacheFS adds another local layer of backing store to the cache hierarchy. Then we'll dig into CacheFS management, configuration options, and cache tuning. The usual assortment of tips, caveats and customer-tested hacks rounds out this month's installment--and the recent months' series--on NFS and friends.

The NFS mosh pit: in-memory caching unmasked
NFS relies heavily on client-side caching to achieve a minimum level of performance. NFS clients keep recently used file pages in memory, updating them when the server's copy of the file has changed or purging them when the cache space is needed for other files. When a file is read from an NFS server, new pages are added to the cache. Upon subsequent access, the client's NFS code goes through a series of if-then clauses that would make a naive BASIC programmer blush:

The file page reference is converted into a file handle and offset into the file, which is then hashed into a table of in-memory file pages. If the VM system doesn't find the required page on the client, an NFS read is issued to fetch the data from the server.
If the page is in memory, the NFS client needs to perform a cache consistency check to be sure that another client has not modified the data since it was loaded. NFS uses the file attributes, namely the modification time from the file's inode, to do a timestamp comparison on the cached file page and the server's copy of the file.
Retrieving file attributes from the server on each NFS file page access would reduce NFS performance to unacceptable levels, so the NFS client also caches file attributes. When a file's attributes are retrieved from the NFS server, they are put into the attribute cache for up to 60 seconds. When the NFS client goes to perform an attribute timestamp check, it first verifies that the attribute cache has not become stale.
If the attribute cache entry is more than 60 seconds old, it is discarded and the client issues an NFS getattr request to reload that cache entry.
Armed with valid attributes for the file, the client compares the timestamps on the file and on the in-memory cache page, and either returns the page to the process requesting it, or performs an NFS read to re-load the page from the server.
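
The chain of checks above can be sketched in a few lines of Python. This is a toy model, not the Solaris client code: FakeServer, Client, and read_page are hypothetical names, the clock is passed in as a plain number so the 60-second timeout can be exercised without waiting, and refreshing the attribute cache on every NFS read is a simplifying assumption.

```python
from dataclasses import dataclass, field

ATTR_TIMEOUT = 60.0  # seconds an attribute cache entry is trusted (from the text)


@dataclass
class FakeServer:
    """Hypothetical stand-in for an NFS server: fh -> (mtime, {offset: data})."""
    files: dict
    getattrs: int = 0
    reads: int = 0

    def getattr(self, fh):
        self.getattrs += 1
        return self.files[fh][0]

    def read(self, fh, offset):
        self.reads += 1
        mtime, pages = self.files[fh]
        return mtime, pages[offset]


@dataclass
class Client:
    server: FakeServer
    pages: dict = field(default_factory=dict)  # (fh, offset) -> (mtime, data)
    attrs: dict = field(default_factory=dict)  # fh -> (mtime, time fetched)

    def _fetch(self, fh, offset, now):
        # NFS read: load the page and (assumption) refresh the attributes too.
        mtime, data = self.server.read(fh, offset)
        self.pages[(fh, offset)] = (mtime, data)
        self.attrs[fh] = (mtime, now)
        return data

    def read_page(self, fh, offset, now):
        key = (fh, offset)
        if key not in self.pages:                   # page is not in memory
            return self._fetch(fh, offset, now)
        entry = self.attrs.get(fh)
        if entry is None or now - entry[1] > ATTR_TIMEOUT:
            entry = (self.server.getattr(fh), now)  # stale: issue a getattr
            self.attrs[fh] = entry
        page_mtime, data = self.pages[key]
        if page_mtime == entry[0]:                  # timestamps agree
            return data
        return self._fetch(fh, offset, now)         # page is stale: re-read
```

If the server's copy changes 20 seconds after the first read, a call to read_page at the 30-second mark still returns the old page; only a call made after the attribute entry has aged past 60 seconds triggers the getattr and the re-read.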

The 60-second window within which file attributes are assumed to remain static gives NFS its weak consistency model--another client could modify the file during the attribute cache entry lifetime, and the NFS client won't see the changes until it discards the dated attributes and reloads them from the NFS server. In the ideal world, an NFS client's working set would be small enough to be held in memory without memory contention from processes or the kernel. In practice, however, the NFS working set is typically larger than the memory available for file page caching, so NFS clients send out a steady stream of NFS read requests.

CacheFS inserts a new if-then check immediately after the VM system search. If the file page isn't already cached in memory, the NFS client looks for it in a CacheFS filesystem on a local disk. Data found via CacheFS is subject to the same consistency checks as file pages located in the VM cache, after which the file pages are copied from disk back into memory. Caching NFS blocks on the local disk doesn't improve NFS's consistency, but it can lead to reduced network traffic and server load since NFS read requests are replaced with an occasional getattr call and local disk traffic.
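
To see why the extra tier matters when the working set outgrows memory, here is a rough Python simulation of the lookup order: memory first, then the local disk cache, then the server. TieredCache and replay are hypothetical names, the in-memory tier is modeled as a simple LRU, and the consistency checks are left out to keep the sketch short.

```python
from collections import OrderedDict


class TieredCache:
    """Toy lookup order: memory, then local disk cache, then the NFS server."""

    def __init__(self, memory_pages, use_disk_cache):
        self.memory = OrderedDict()        # page id -> data, oldest first
        self.memory_pages = memory_pages   # how many pages fit in memory
        self.disk = {} if use_disk_cache else None
        self.server_reads = 0
        self.disk_reads = 0

    def read(self, page):
        if page in self.memory:            # hit in the VM page cache
            self.memory.move_to_end(page)
            return self.memory[page]
        if self.disk is not None and page in self.disk:
            self.disk_reads += 1           # hit in the on-disk cache
            data = self.disk[page]
        else:
            self.server_reads += 1         # NFS read over the network
            data = f"data-{page}"
            if self.disk is not None:
                self.disk[page] = data
        self.memory[page] = data
        if len(self.memory) > self.memory_pages:
            self.memory.popitem(last=False)  # evict the least recently used
        return data


def replay(cache, passes=3, working_set=8):
    """Walk the same working set several times and report server reads."""
    for _ in range(passes):
        for page in range(working_set):
            cache.read(page)
    return cache.server_reads
```

With room for four pages in memory and an eight-page working set walked three times, the memory-only client goes back to the server on every access, while the client with a disk tier pays the network cost only on the first pass.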

Given that you're doing a disk read to pull the bits off of the local disk, where's the win for CacheFS? When the time required to read an NFS buffer from the local disk is less than the time for an 8 kilobyte network transfer, plus the server's disk access and RPC service times, using CacheFS will out-perform a straight NFS client. If the NFS server has the data you need already cached in memory, however, the NFS read is likely to be fulfilled faster than a local read. Similarly, workstations with a single local disk may send the disk arm into jitterbug mode if CacheFS and paging or swapping activity send the disk seeking from one end to the other.
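
The break-even condition is simple arithmetic, sketched below in Python. The function names and every number in the usage note are illustrative assumptions, not measurements; plug in figures from your own network and disks.

```python
def wire_ms(size_bytes, mbit_per_s):
    """Time to push size_bytes across a link, ignoring protocol overhead."""
    return size_bytes * 8 / (mbit_per_s * 1_000_000) * 1000


def nfs_read_ms(size_bytes, mbit_per_s, rpc_ms, server_disk_ms, server_cached):
    """Network transfer plus RPC service time, plus server disk unless cached."""
    disk = 0.0 if server_cached else server_disk_ms
    return wire_ms(size_bytes, mbit_per_s) + rpc_ms + disk


def cachefs_wins(local_disk_ms, **nfs):
    """True when the local disk read beats the full NFS round trip."""
    return local_disk_ms < nfs_read_ms(**nfs)
```

Assume an 8 kilobyte read on 10 megabit Ethernet, a 2 millisecond RPC service time, and 15 millisecond disk accesses on both client and server. When the server has to go to disk, the NFS read costs about 23.6 milliseconds and the local cache wins; when the server already has the page in memory, the NFS read drops to about 8.6 milliseconds and beats the local disk.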

Here are some simple questions to help you decide whether CacheFS fits your environment:

Is your client performance gated by retrieving data from the server or transferring data over the network? Is the network as much of a bottleneck as the server?
Are you out of server bandwidth, with high disk service times and uneven workload distribution?
Do you have sufficient local disk capacity--in bytes and in I/O operations per second--to cache frequently used NFS-mounted files?

Any "yes" answer means you're in a position to reduce demands on your network and server, making headroom for future growth while improving client performance at the same time. Sound like a win-win? You're ready to start crafting a cache.

Down in front, up in back: filesystem overlays with CacheFS
CacheFS operates as a transparent filesystem on top of an NFS mounted volume. A good analogy is that of a clear overhead placed on top of a printed sheet of paper -- you can see what's on the paper, but anything you write on the overhead transparency "comes first" when looking at the overlaid sheets. CacheFS uses its own terminology for the underlying filesystem and the cache. The front filesystem contains the cache directory. It may be entirely dedicated to NFS caching, or it could contain one or more cache directories along with regular user and system files. An NFS-mounted volume is the back filesystem, mounted on the backpath. Maintaining our analogy, the back filesystem is the printed sheet of paper, while the front filesystem is the clear overhead. Whenever you access a file on the back filesystem, it gets cached in the front filesystem local to the client.

Turning on NFS caching is as simple as creating the front filesystem cache, and then mounting it on top of the back filesystem:

luey# cfsadmin -c /vol/cache/cache1
luey# mount -F cachefs -o backfstype=nfs,backpath=/home/stern,\
cachedir=/vol/cache/cache1 bigboy:/export/home/stern /home/stern

The first command creates a new cache directory underneath /vol/cache. The cache1 subdirectory must not exist before you build the cache; cfsadmin creates the subdirectory and initializes the cache management parameters described below.

The CacheFS mount is a bit more complex. If you strip off the CacheFS options (preceded by the -o flag), the command line looks like a regular NFS mount of bigboy:/export/home/stern onto /home/stern. The CacheFS options line up the front and back filesystems:

backfstype tells CacheFS that it's sitting on top of an NFS mount point. CacheFS can also be used to speed up access to CD-ROM devices, using a back filesystem type of hsfs or ufs.
backpath indicates that the back filesystem has already been mounted, and tells CacheFS where to find it.
cachedir is the name of the front filesystem cache directory.

Now for the subtle part: the CacheFS mount takes the front filesystem, that is, the subdirectory of /vol/cache, and mounts it on top of the existing NFS mount of /home/stern. As a user, you continue to use /home/stern as the path to the desired files; when you hit that mount point, however, you'll first touch a CacheFS directory backed by an underlying NFS mount. Look in the /etc/mnttab mount table and you'll see two entries for the overlaid mount points:

bigboy:/export/home/stern    /home/stern          nfs
/home/stern                  /vol/cache/cache1    cachefs     backfstype=nfs

You don't have to perform the CacheFS mount on top of the back filesystem, but users will be noticeably confused if they have to contend with two pathnames for the same data with different behaviors. Think about the DOS DoubleSpace driver, and the compressed C: and raw H: drives it creates. Ever try explaining how C: and H: are the same disk, but not quite? By dropping the front filesystem on top of the back filesystem, you maintain the user's view of the world, possibly giving them a performance boost in the process.

A single cache directory can be used for multiple back filesystems; all NFS mounts will share the available cache space with global least recently used (LRU) policies dictating which files are purged when the cache fills. Similarly, you can create multiple cache directories in the same front filesystem, if you want to differentiate caches based on access patterns or file sizes.

The weak stay weak: CacheFS internals
What happens when you reference a file through the front filesystem? If the page is not in memory, and not in the cache, it gets dragged from the NFS server via an NFS read operation. Once in-core on the client, the page is put into the local disk cache for future reference. Every piece of data retrieved by NFS, including directory entries, symbolic links, and file buffers, is put into the cache. As a result, CacheFS can improve performance of simple commands like ls -l executed against large directories that would have previously required multiple NFS readdir operations.

Directory entries are cached in 8 kilobyte blocks; files are usually managed in 64 kilobyte chunks. CacheFS doesn't bring the entire file over into the cache upon first access, and it only caches pages as they are accessed. If you walk through a file randomly, the CacheFS entry will be sparse. Determining what files are in the cache is quite difficult, because pathnames are not used in the cache directory. Instead, cache entries are given strings of hex digits as identifiers, based on the inode number of the underlying file. Protecting the cache directory with root-only access and hiding the underlying file names reduce the chance that casual browsing of the front filesystem would lead to accidental access to the cached, and possibly hole-ridden, files. If you want to see how large a cache has grown, use df on the front filesystem if it's being used solely for cache, or du -s on the cache directories to monitor their growth.
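
To get a feel for how sparse a cache entry can be, the Python sketch below maps byte-range accesses onto 64 kilobyte chunks. It assumes, for illustration only, that touching any byte of a chunk accounts for the whole chunk in the cache; chunks_touched and cached_fraction are hypothetical helper names.

```python
CHUNK = 64 * 1024  # file chunk size quoted in the text


def chunks_touched(accesses):
    """Return the sorted chunk numbers covered by (offset, length) accesses."""
    touched = set()
    for offset, length in accesses:
        if length <= 0:
            continue
        first = offset // CHUNK
        last = (offset + length - 1) // CHUNK
        touched.update(range(first, last + 1))
    return sorted(touched)


def cached_fraction(file_size, accesses):
    """Fraction of the file's chunks that the accesses would bring in."""
    total_chunks = -(-file_size // CHUNK)  # ceiling division
    return len(chunks_touched(accesses)) / total_chunks
```

Three small reads scattered through a 1 megabyte file, one of them straddling a chunk boundary, touch only four of its sixteen chunks, leaving three quarters of the cache entry as holes.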

CacheFS doesn't change the NFS consistency picture. The timestamp comparison algorithm used to determine when pages become stale is the same as used for in-core NFS filesystem pages. Writes to a cached NFS volume go directly to the back filesystem server, causing the cached blocks to be purged. CacheFS operates as a "write through" or "write around" cache, and never as a write-back cache. You'll never have data in the cache that isn't permanently recorded on the NFS server.
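
A minimal Python sketch of the write-around behavior described above, with a dictionary standing in for the NFS server. WriteAroundCache is a hypothetical name and the model ignores timestamps entirely; it only shows why the cache can never hold data the server lacks.

```python
class WriteAroundCache:
    """Toy write-around cache: writes bypass the cache and purge stale blocks."""

    def __init__(self, backing):
        self.backing = backing   # block number -> data; stands in for the server
        self.cache = {}          # block number -> data held on the local disk

    def read(self, block):
        if block not in self.cache:
            self.cache[block] = self.backing[block]  # fill on first read
        return self.cache[block]

    def write(self, block, data):
        self.backing[block] = data    # the write goes straight to the server
        self.cache.pop(block, None)   # and the cached copy is purged

    def consistent(self):
        """Every cached block matches what the server has recorded."""
        return all(self.backing[b] == d for b, d in self.cache.items())
```

After a write, the next read of that block misses the cache and refills it from the server, which is why write-heavy filesystems see little benefit from caching.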

Threshold of pain: tuning and configuration options
cfsadmin lists the basic CacheFS configuration parameters for a cache directory along with the names, or CacheFS ids, of back filesystems using that cache:

luey# cfsadmin -l /vol/cache/cache1
cfsadmin: list cache FS information
maxblocks     90%
minblocks      0%
threshblocks  85%
maxfiles      90%
minfiles       0%
threshfiles   85%
maxfilesize    3MB
bigboy:_export_home_stern

These parameters are set by default, but of course, you can tune them as needed. Use the maxblocks threshold to prevent contention between multiple caches in the same front filesystem. If you have three independent caches, with roughly equal access to them, you should consider using maxblocks=30 to give each one 30 percent of the available filesystem blocks. You don't want the total maximum block thresholds to exceed 90 percent of the front filesystem, nor do you want to starve out a cache with too small a maximum allocation. There's a lower bound parameter for CacheFS called minblocks that allows CacheFS to grow at least that large before internal buffer management starts tossing data out of the cache.
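
The maxblocks arithmetic for several caches sharing one front filesystem is easy to script. The Python sketch below splits the 90 percent ceiling evenly and prints the cfsadmin commands you might run by hand; the cache directory names are hypothetical, and the helper only builds strings, it does not run anything.

```python
def maxblocks_plan(cache_dirs, ceiling_pct=90):
    """Split the ceiling evenly and return (share, cfsadmin command lines)."""
    if not cache_dirs:
        raise ValueError("need at least one cache directory")
    share = ceiling_pct // len(cache_dirs)
    commands = [f"cfsadmin -c -o maxblocks={share} {d}" for d in cache_dirs]
    return share, commands


if __name__ == "__main__":
    # Hypothetical cache directories in a single front filesystem.
    dirs = ["/vol/cache/cache1", "/vol/cache/cache2", "/vol/cache/cache3"]
    share, commands = maxblocks_plan(dirs)
    for line in commands:
        print(line)
```

An even split is only a starting point; if one cache sees most of the traffic, give it a larger share and shrink the others so the total stays at or under the ceiling.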

When you need to reserve disk space for other, non-CacheFS uses, make sure to set the threshblocks parameter to slow down CacheFS growth when disk space is at a premium. While maxblocks puts a static upper bound on the size of a cache directory, the block threshold implements a dynamic ceiling on growth. Once the front filesystem has reached the capacity specified by threshblocks, CacheFS will stop allocating space from the front filesystem and will resort to internal LRU chunk management to handle future space requests.

Establishing a block threshold prevents a potential denial of service problem, where CacheFS would grow large enough to prevent user-level processes from correctly writing local temporary or data files. The block threshold always takes precedence over the minimum space allocation parameter. If you set threshblocks to 60 percent, and set minblocks to 30 percent of the front filesystem, the CacheFS directory will drop its minimum space parameter down to 0 percent when the front filesystem reaches 60 percent of its capacity. Setting maxfiles limits the number of file entries that may be put into the cache. Large numbers of small files will consume a disproportionate number of inodes, possibly leading to another denial of service problem where the front filesystem no longer has a free inode for file creation or extension.
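
One way to read the interplay of the three block parameters is as a small decision function. The Python sketch below follows the description in this column rather than the CacheFS source, so treat it as an interpretation; cache_policy and its return values are hypothetical.

```python
def cache_policy(cache_pct, fs_pct, maxblocks=90, minblocks=0, threshblocks=85):
    """Return (action, effective_min) for a cache at cache_pct of the front
    filesystem, when the filesystem as a whole is fs_pct full.

    action is "grow" if the cache may claim new blocks from the front
    filesystem, or "recycle" if it must reuse its own chunks via LRU.
    """
    over_threshold = fs_pct >= threshblocks
    # The block threshold takes precedence over the minimum allocation.
    effective_min = 0 if over_threshold else minblocks
    if cache_pct >= maxblocks or over_threshold:
        action = "recycle"
    else:
        action = "grow"
    return action, effective_min
```

With threshblocks at 60 and minblocks at 30, a cache holding 20 percent of a 40-percent-full filesystem may keep growing with its minimum intact; once the filesystem reaches 60 percent, the same cache is told to recycle and its effective minimum falls to 0.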

CacheFS also offers two variations on the consistency model, selected with mount options. Read-only back filesystems, like CD-ROMs, can be cached without any consistency checks. Using the noconst option cuts down on the client's getattr traffic. The other option, non-shared, indicates that no other client will be accessing the data in your CacheFS front filesystem. CacheFS performs a literal "write through" in this mode, updating both the cache and the back filesystem. If you know you'll be re-using the data on subsequent access, the non-shared mode is a good way to improve cache warmth.

What makes a good filesystem for use with CacheFS?

Read-only or read-mostly, since writes invalidate the cache.
Primarily randomly accessed, to prevent sequential file access from thrashing the cache.
Mostly small, frequently used files that will tend to live in the caches.

Unfortunately, determining the best set of options for a particular back filesystem is largely a black art. Performance tuning CacheFS involves a fair bit of data collection as well as understanding your customers' (that is, your users') expectations.

Bring to front or send to back: rules of thumb for CacheFS
Put enough knobs on something, and it gets easier to twist them into unpleasant states. Put NFS, the automounter, and CacheFS together and you have a system administrator's dream (or nightmare). Here are some guidelines for using CacheFS under different kinds of duress:

CacheFS and the automounter work quite well together, with a few "if" qualifiers. If you're using Solaris 2.3, you'll need patch 101329, revision -04 or later. Stick the CacheFS options right into the master automounter map to affect all mounts from a particular map, like auto_home, or add CacheFS configuration data to individual mounts to gain finer-grain control. In addition to the backfstype and cachedir parameters, you'll need to insert an fstype=cachefs to instruct the automounter how to deal with the CacheFS mount. To enable caching for auto_home, for example, modify a wildcard map entry like this:
```
*	-fstype=cachefs,backfstype=nfs,cachedir=/vol/cache/cache1 \
	bigboy:/export/home/&
```
All home directories will share the same cache directory. When you remount a home directory after a period of inactivity, the cache should already be warm for that back filesystem, speeding access to the files you used on the previous automounted effort.
Large files tend to induce cache thrashing. If you read a file sequentially, you'll fill up the cache, possibly displace older pages, only to rarely access the cached data again. Conversely, if you jump randomly through a large file, then having recently used parts of it cached might be a suitable win. When strolling sequentially through large files, set the maxfilesize parameter just below the size of the smallest large file you'll use. Parsing 2 megabyte Island Presents documents? Try creating the cache with cfsadmin -c -o maxfilesize=2 to prevent the large files from polluting the cache. Similarly, if you'll be hopping around a 10 megabyte data file, and want most of it cached, be sure to crank maxfilesize up to at least 11 (megabytes).
When you need to mix and match access patterns, for example, some filesystems with large files read sequentially, some with many small files, and some with purely random access, consider using multiple caches with appropriate thresholds.
Don't use the local-access parameter if you're worried about the security of root access over the network. Permissions are only checked in NFS version 2 when a file open() is completed, not on each access to the data. The local access option retrieves file attributes from the on-disk cached data, performing a local credential check against the user's credentials.
Restoring or reloading an NFS server changes all of the file handles, rendering the caches useless. Be sure to purge client-side front filesystems using a command like cfsadmin -d all /vol/cache/cache1. Delete the cache entries for a specific NFS server using its CacheFS id with the -d option; get the CacheFS ids from cfsadmin -l.
Update the CacheFS thresholds at any time using cfsadmin -u, as long as you increase the values. If you want to rein in a cache that has claimed too much space, or one that is causing contention with other data in the front filesystem, you'll need to delete the cache and reinitialize it using cfsadmin with the right options, the first time.
Try to "pin" NFS clients to NFS servers that are part of a replicated server set in an automounter map. Because the CacheFS files are named by file handle, and not by pathname, CacheFS can't distinguish between identical files on different machines. When your clients access three or four NFS servers in a week, you'll end up with three or four somewhat- to mostly-full CacheFS front filesystems, one for each server back filesystem mount. Binding clients to a preferred server increases the probability of starting with a warm cache and offers a nice performance boost. If the preferred server is down, you'll start with a cold cache, but even that minor reload penalty is small compared to being down waiting for the crashed preferred server to recover.
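
The last rule of thumb, pinning clients to a preferred replica, can be illustrated with a short Python simulation. It assumes cache entries are keyed by a server-specific handle, modeled here as a (server, file) pair; the server and file names are made up, and simulate is a hypothetical helper.

```python
def simulate(servers_by_day, files):
    """Read every file once per day and count cache hits and misses.

    Entries are keyed by (server, file), a stand-in for a server-specific
    file handle, so the same file on another replica is a different entry.
    """
    cache = set()
    hits = misses = 0
    for server in servers_by_day:
        for name in files:
            key = (server, name)
            if key in cache:
                hits += 1
            else:
                misses += 1
                cache.add(key)
    return hits, misses, len(cache)


if __name__ == "__main__":
    files = ["a.doc", "b.doc", "c.doc"]
    print("pinned:  ", simulate(["s1", "s1", "s1", "s1"], files))
    print("rotating:", simulate(["s1", "s2", "s3", "s4"], files))
```

Over four days, the pinned client misses only on the first day and holds three cache entries; the client that lands on a different replica each day never gets a hit and ends up holding four copies of the same data.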

In exchange for its complexity, CacheFS provides noticeable performance benefits on most clients on which it has been enabled. By reducing demand on the network, you gain scalability to an extent not possible by adding more or faster Ethernet segments, or by buying a bigger and better NFS server. Less is most definitely more when you're looking at possible network smashes and server thrashes.

Resources

"A file by any other name"
/sunworldonline/swol-09-1995/swol-09-sysadmin.html
"Automatic for the people"
/sunworldonline/swol-05-1996/swol-05-sysadmin.html
"Twisty little passages all automounted alike"
/sunworldonline/swol-06-1996/swol-06-sysadmin.html
"A third in fourths: An NFS potpourri"
/sunworldonline/swol-07-1996/swol-07-sysadmin.html
A list of Hal Stern's other Sysadmin columns
/sunworldonline/common/swol-backissues-columns.html#sysadmin

URL: http://www.sunworld.com/swol-08-1996/swol-08-sysadmin.html