Caches, thrashes and smashes
How to use CacheFS to cut your NFS server and network load
A look at CacheFS in action, minimizing network contortion, thrashing, and trashing. We'll start with a review of NFS caching using the virtual memory system, and then see how CacheFS adds another local layer of backing store to the cache hierarchy. Then we'll dig into CacheFS management, configuration options, and cache tuning. The usual assortment of tips, caveats and customer-tested hacks rounds out this month's installment--and the recent months' series--on NFS and friends. (3,100 words)
Even as George Gilder promises the world that bandwidth will be infinite, demand for network capacity seems to increase faster than corporate networks can deliver it. Wire speed may be cheap, but running a fat and wide pipe to every user's desk is typically beyond the financial reach of most IT organizations. Network congestion and server pile-ups haven't slaked our thirst for networked data access (possibly why we've devoted the last quarter's worth of columns to NFS-related topics), so demand reduction is the order of the day. The best way to control network congestion is to simply not use the network, and access data from a local disk. Pushing the data out to the edges of the network reduces latency, and allows for better scaling by reducing the network requirements of each node.
Disk caching of recently used files was introduced in the Unix space with the Andrew filesystem, and later commercialized with the OSF's Distributed Filesystem (DFS) as part of the Distributed Computing Environment (DCE). Sun added similar functionality to Solaris 2.3 with the CacheFS filesystem type, an on-disk cache for recently used NFS data. Caching NFS accesses can improve server scalability (since fewer requests are serviced over the network), relieve network congestion, and improve client response time. On the other hand, poor cache management and configuration can lead to reduced client performance with no appreciable benefit on the server or network.
This month, we'll see CacheFS in action, minimizing network contortion, thrashing, and trashing. We'll start with a review of NFS caching using the virtual memory system, and then see how CacheFS adds another local layer of backing store to the cache hierarchy. Then we'll dig into CacheFS management, configuration options, and cache tuning. The usual assortment of tips, caveats and customer-tested hacks rounds out this month's installment--and the recent months' series--on NFS and friends.
The NFS mosh pit: in-memory caching unmasked
NFS relies heavily on client-side caching to achieve a minimum level of performance. NFS clients keep recently used file pages in memory, updating them when the server's copy of the file has changed or purging them when the cache space is needed for other files. When a file is read from an NFS server, new pages are added to the cache. Upon subsequent access, the client's NFS code goes through a series of if-then clauses that would make a naive BASIC programmer blush:
getattr request to reload that cache entry.
The 60-second window within which file attributes are assumed to remain static gives NFS its weak consistency model--another client could modify the file during the attribute cache entry lifetime, and the NFS client won't see the changes until it discards the dated attributes and reloads them from the NFS server. In the ideal world, an NFS client's working set would be small enough to be held in memory without memory contention from processes or the kernel. In practice, however, the NFS working set is typically larger than the memory available for file page caching, so NFS clients send out a steady stream of NFS read requests.
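The attribute window described above can be sketched as a small model. This is an illustration only, not the NFS client's actual code: the names CachedFile, pages_usable and server_getattr are hypothetical, and the lifetime is assumed to be a fixed 60 seconds.

```python
from dataclasses import dataclass

ATTR_LIFETIME = 60.0  # seconds; the window from the text, assumed fixed here


@dataclass
class CachedFile:
    cached_mtime: float     # server modification time when the pages were cached
    attrs_loaded_at: float  # client clock time of the last getattr


def pages_usable(entry, now, server_getattr):
    """Return (usable, getattr_sent) for the cached pages of one file.

    server_getattr is a hypothetical callable that returns the file's
    current modification time on the server.
    """
    if now - entry.attrs_loaded_at < ATTR_LIFETIME:
        # Attributes are assumed to be unchanged: no network traffic, but a
        # change made by another client goes unnoticed for now.
        return True, False
    server_mtime = server_getattr()
    entry.attrs_loaded_at = now
    if server_mtime == entry.cached_mtime:
        return True, True
    # The file changed on the server: the caller must purge and re-read.
    entry.cached_mtime = server_mtime
    return False, True
```

Inside the window the function answers from the cache without asking the server, which is exactly where the weak consistency comes from.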
CacheFS inserts a new if-then check immediately after the VM system
search. If the file page isn't already cached in memory, the NFS
client looks for it in a CacheFS filesystem on a local disk. Data
found via CacheFS is subject to the same consistency checks as file
pages located in the VM cache, after which the file pages are copied
from disk back into memory. Caching NFS blocks on the local disk
doesn't improve NFS's consistency, but it can lead to reduced network
traffic and server load since NFS read requests are replaced with a
getattr call and local disk traffic.
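The lookup order just described (memory first, then the local disk cache, then the server) might be modeled as follows. This is a sketch under assumptions: plain dictionaries stand in for the VM page cache and the CacheFS front filesystem, and nfs_read and is_fresh are hypothetical callables for the network read and the consistency check.

```python
def read_page(key, memory, disk_cache, nfs_read, is_fresh):
    """Return (data, source) for one file page.

    memory and disk_cache are dicts standing in for the in-memory page
    cache and the on-disk CacheFS cache.
    """
    if key in memory and is_fresh(key):
        return memory[key], "memory"
    if key in disk_cache and is_fresh(key):
        # Same consistency check as the memory path, then the page is
        # copied from local disk back into memory.
        memory[key] = disk_cache[key]
        return memory[key], "cachefs"
    # Missing or stale everywhere: go to the server and refill both layers.
    data = nfs_read(key)
    memory[key] = data
    disk_cache[key] = data
    return data, "nfs"
```

Only the last branch generates an NFS read; the middle branch is the new step that CacheFS adds.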
Given that you're doing a disk read to pull the bits off of the local disk, where's the win for CacheFS? When the time required to read an NFS buffer from the local disk is less than the time for an 8 kilobyte network transfer, plus the server's disk access and RPC service times, using CacheFS will out-perform a straight NFS client. If the NFS server has the data you need already cached in memory, however, the NFS read is likely to be fulfilled faster than a local read. Similarly, workstations with a single local disk may send the disk arm into jitterbug mode if CacheFS and paging or swapping activity send the disk seeking from one end to the other.
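The break-even condition above is simple arithmetic, sketched here as a one-line comparison. The function name and the millisecond figures in the usage note are hypothetical placeholders, not measurements.

```python
def cachefs_wins(local_disk_ms, network_ms, server_disk_ms, rpc_ms):
    """True when a local disk read beats the full NFS round trip.

    All arguments are per-buffer times in milliseconds. Pass
    server_disk_ms=0 to model a server that already has the data
    cached in memory.
    """
    return local_disk_ms < network_ms + server_disk_ms + rpc_ms
```

With made-up times of 12 ms for the local read, 7 ms on the wire and 1 ms of RPC service, the cache wins if the server needs a 15 ms disk access and loses if the server answers from memory.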
Here are some simple defining guidelines for CacheFS applications:
Any "yes" answer means you're in a position to reduce demands on your network and server, making headroom for future growth while improving client performance at the same time. Sound like a win-win? You're ready to start crafting a cache.
Down in front, up in back: filesystem overlays with CacheFS
CacheFS operates as a transparent filesystem on top of an NFS mounted volume. A good analogy is that of a clear overhead placed on top of a printed sheet of paper -- you can see what's on the paper, but anything you write on the overhead transparency "comes first" when looking at the overlaid sheets. CacheFS uses its own terminology for the underlying filesystem and the cache. The front filesystem contains the cache directory. It may be entirely dedicated to NFS caching, or it could contain one or more cache directories along with regular user and system files. An NFS-mounted volume is the back filesystem, mounted on the backpath. Maintaining our analogy, the back filesystem is the printed sheet of paper, while the front filesystem is the clear overhead. Whenever you access a file on the back filesystem, it gets cached in the front filesystem local to the client.
Turning on NFS caching is as simple as creating the front filesystem cache, and then mounting it on top of the back filesystem:
luey# cfsadmin -c /vol/cache/cache1
luey# mount -F cachefs -o backfstype=nfs,backpath=/home/stern,\
cachedir=/vol/cache/cache1 bigboy:/export/home/stern /home/stern
The first command creates a new cache directory underneath
/vol/cache. The cache1 subdirectory must not
exist before you build the cache;
cfsadmin creates the
subdirectory and initializes the cache management parameters described below.
The CacheFS mount is a bit more complex. If you strip off the CacheFS
options (preceded by the
-o flag), the command line looks
like a regular NFS mount of
/home/stern. The CacheFS options line up the front
and back filesystems:
backfstype tells CacheFS that it's sitting on top of an NFS mount point. CacheFS can also be used to speed up access to CD-ROM devices by naming the CD-ROM filesystem as the back filesystem type.
backpath indicates that the back filesystem has already been mounted, and tells CacheFS where to find it.
cachedir is the name of the front filesystem cache directory.
Now for the subtle part: the CacheFS mount takes the front filesystem,
that is, the subdirectory of
/vol/cache, and mounts it on
top of the existing NFS mount of
/home/stern. As a user,
you continue to use
/home/stern as the path to the
desired files; when you hit that mount point, however, you'll first
touch a CacheFS directory backed by an underlying NFS mount.
Look in the /etc/mnttab mount table and you'll see two entries for the
overlaid mount points:
bigboy:/home/stern /home/stern nfs
/home/stern /vol/cache/cache1 cachefs backfstype=nfs
You don't have to perform the CacheFS mount on top of the back filesystem, but users will be noticeably confused if they have to contend with two pathnames for the same data with different behaviors. Think about the DOS DoubleSpace driver, and the compressed C: and raw H: drives it creates. Ever try explaining how C: and H: are the same disk, but not quite? By dropping the front filesystem on top of the back filesystem, you maintain the user's view of the world, possibly giving them a performance boost in the process.
A single cache directory can be used for multiple back filesystems; all NFS mounts will share the available cache space with global least recently used (LRU) policies dictating which files are purged when the cache fills. Similarly, you can create multiple cache directories in the same front filesystem, if you want to differentiate caches based on access patterns or file sizes.
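A shared cache with a global LRU policy behaves roughly like the sketch below. It is a toy model, not CacheFS's implementation: the SharedCache class is hypothetical, and capacity is counted in files rather than disk blocks for simplicity.

```python
from collections import OrderedDict


class SharedCache:
    """One cache directory shared by several back filesystems."""

    def __init__(self, capacity):
        self.capacity = capacity       # maximum number of cached files (assumed unit)
        self.entries = OrderedDict()   # (back_fs, name) -> True, oldest first

    def touch(self, back_fs, name):
        """Record an access and return the list of entries purged by it."""
        key = (back_fs, name)
        if key in self.entries:
            self.entries.move_to_end(key)  # now the most recently used
            return []
        self.entries[key] = True
        evicted = []
        while len(self.entries) > self.capacity:
            oldest, _ = self.entries.popitem(last=False)
            evicted.append(oldest)
        return evicted
```

Because the ordering is global, a busy back filesystem can push out entries that belong to a quiet one, which is one reason to consider separate cache directories.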
The weak stay weak: CacheFS internals
What happens when you reference a file through the front filesystem? If the page is not in memory, and not in the cache, it gets dragged from the NFS server via an NFS read operation. Once in-core on the client, the page is put into the local disk cache for future reference. Every piece of data retrieved by NFS, including directory entries, symbolic links, and file buffers, is put into the cache. As a result, CacheFS can improve performance of simple commands like
ls -l executed against large directories that would have
previously required multiple NFS requests.
Directory entries are cached in 8 kilobyte blocks; files are usually
managed in 64 kilobyte chunks. CacheFS doesn't bring the entire file
over into the cache upon first access, and it only caches pages as
they are accessed. If you walk through a file randomly, the CacheFS
entry will be sparse. Determining what files are in the cache is quite
difficult, because pathnames are not used in the cache directory.
Instead, cache entries are given strings of hex digits as identifiers,
based on the inode number of the underlying file. Protecting the cache
directory with root-only access, and hiding the underlying file names
reduces the chance that a casual browsing of the front filesystem
would lead to accidental access to the cached, and possibly
hole-ridden, files. If you want to see how large a cache has grown, use
df on the front filesystem if it's being used solely
for cache, or
du -s on the cache directories to monitor each cache's size.
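To see why random access leaves a sparse cache entry, the sketch below maps byte ranges to the 64 kilobyte chunks they touch. It assumes caching happens at chunk granularity, and the function name is hypothetical.

```python
CHUNK = 64 * 1024  # file chunk size from the text, in bytes


def chunks_touched(reads):
    """Return the sorted chunk indexes covered by (offset, length) reads."""
    touched = set()
    for offset, length in reads:
        if length <= 0:
            continue
        first = offset // CHUNK
        last = (offset + length - 1) // CHUNK
        touched.update(range(first, last + 1))
    return sorted(touched)
```

Three small reads scattered over a 1 megabyte file touch only chunks 0, 1 and 15 in the test case used here; every other chunk of that file stays a hole in the cache.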
CacheFS doesn't change the NFS consistency picture. The timestamp comparison algorithm used to determine when pages become stale is the same as used for in-core NFS filesystem pages. Writes to a cached NFS volume go directly to the back filesystem server, causing the cached blocks to be purged. CacheFS operates as a "write through" or "write around" cache, and never as a write-back cache. You'll never have data in the cache that isn't permanently recorded on the NFS server.
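The write-around behavior can be modeled in a few lines. This is a sketch with dictionaries standing in for the NFS server and the local cache; the function names are hypothetical.

```python
def write_around(key, data, back_fs, cache):
    """Send the write to the server and purge the cached copy."""
    back_fs[key] = data    # the server copy is always updated
    cache.pop(key, None)   # the cached block is discarded, not updated


def read(key, back_fs, cache):
    """Read through the cache, filling it from the server on a miss."""
    if key not in cache:
        cache[key] = back_fs[key]
    return cache[key]
```

After any sequence of these calls, every block in the cache matches the server's copy, which is the guarantee that rules out write-back behavior.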
Threshold of pain: tuning and configuration options
cfsadmin -l lists the basic CacheFS configuration
parameters for a cache directory along with the names, or CacheFS
ids, of back filesystems using that cache:
luey# cfsadmin -l /vol/cache/cache1
cfsadmin: list cache FS information
   maxblocks     90%
   minblocks      0%
   threshblocks  85%
   maxfiles      90%
   minfiles       0%
   threshfiles   85%
   maxfilesize    3MB
bigboy:_export_home_stern
These parameters are set by default, but of course, you can tune them
as needed. Use the
maxblocks threshold to prevent
contention between multiple caches in the same front filesystem. If
you have three independent caches, with roughly equal access to them,
you should consider using
maxblocks=30 to give each one
30% of the available filesystem blocks. You don't want the total
maximum block thresholds to exceed 90 percent of the front filesystem, nor do
you want to starve out a cache with too small a maximum allocation.
There's a lower bound parameter for CacheFS called
minblocks that allows CacheFS to grow at least that large
before internal buffer management starts tossing data out of the cache.
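The maxblocks arithmetic for several caches in one front filesystem can be checked with a short helper. This is a sketch: the function names are hypothetical, and the 90 percent ceiling is taken from the guideline above.

```python
def split_maxblocks(n_caches, ceiling=90):
    """Give each of n_caches an equal maxblocks percentage under the ceiling."""
    share = ceiling // n_caches
    return [share] * n_caches


def within_ceiling(maxblocks_values, ceiling=90):
    """True when the combined maxblocks settings stay within the ceiling."""
    return sum(maxblocks_values) <= ceiling
```

Three caches come out at 30 percent each, matching the maxblocks=30 suggestion, while two caches set to 50 percent each would exceed the ceiling.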
When you need to reserve disk space for other, non-CacheFS uses, make
sure to set the
threshblocks parameter to slow down
CacheFS growth when disk space is at a premium. While
maxblocks puts a static upper bound on the size of a
cache directory, the block threshold implements a dynamic ceiling on
growth. Once the front filesystem has reached the capacity specified by
threshblocks, CacheFS will stop allocating space from
the front filesystem and will resort to internal LRU chunk management
to handle future space requests.
Establishing a block threshold prevents a potential denial of service
problem, where CacheFS would grow large enough to prevent user-level
processes from correctly writing local temporary or data files. The
block threshold always takes precedence over the minimum space
allocation parameter. If you set
threshblocks to 60 percent, and
minblocks to 30 percent of the front filesystem, the CacheFS
directory will drop its minimum space parameter down to 0 percent when the
front filesystem reaches 60 percent of its capacity. Setting
maxfiles limits the number of file entries that may be
put into the cache. Large numbers of small files will consume a
disproportionate number of inodes, possibly leading to another denial
of service problem where the front filesystem no longer has a free
inode for file creation or extension.
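The interplay between the static ceiling and the dynamic threshold, as this column describes it, might be modeled like this. It is a sketch and not the CacheFS allocator: the function name is hypothetical, and minblocks is left out because the threshold is described as taking precedence over it.

```python
def may_allocate(cache_pct, fs_pct, maxblocks=90, threshblocks=85):
    """Decide whether the cache may claim another front filesystem block.

    cache_pct: percentage of the front filesystem held by this cache
    fs_pct:    percentage of the front filesystem in use overall
    Returns False when the cache must recycle its own chunks via LRU.
    """
    if cache_pct >= maxblocks:
        return False   # static upper bound on the cache's size
    if fs_pct >= threshblocks:
        return False   # dynamic ceiling: the disk is filling up
    return True
```

With threshblocks set to 60, a cache holding only 20 percent of the disk still stops growing once the front filesystem as a whole is 60 percent full.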
CacheFS also offers two variations on the consistency
model. Read-only back filesystems, like CD-ROMs, can be cached without
any consistency checks. Using the
noconst option cuts
down on the client's
getattr traffic. The other option,
non-shared, indicates that no other client will be
accessing the data in your CacheFS front filesystem. CacheFS performs
a literal "write through" in this mode, updating both the cache and
the back filesystem. If you know you'll be re-using the data on
subsequent access, the non-shared mode is a good way to improve cache
What makes a good filesystem for use with CacheFS?
Unfortunately, determining the best set of options for a particular back filesystem is largely a black art. Performance tuning CacheFS involves a fair bit of data collection as well as understanding your customers' (that is, your users') expectations.
Bring to front or send to back: rules of thumb for CacheFS
Put enough knobs on something, and it gets easier to twist them into unpleasant states. Put NFS, the automounter, and CacheFS together and you have a system administrator's dream (or nightmare). Here are some guidelines for using CacheFS under different kinds of duress:
cachedir parameters, you'll need to insert an
fstype=cachefs to instruct the automounter how to deal with the CacheFS mount. To enable caching for auto_home, for example, modify a wildcard map entry like this:
luey* -fstype=cachefs,backfstype=nfs,cachedir=/vol/cache/cache1 \
    bigboy:/export/home/&
All home directories will share the same cache directory. When you remount a home directory after a period of inactivity, the cache should already be warm for that back filesystem, speeding access to the files you used on the previous automounted effort.
maxfilesize parameter just below the smallest big file you'll use. Parsing 2 megabyte Island Presents documents? Try doing the CacheFS mount with
-o maxfilesize=2 to prevent the large files from polluting the cache. Similarly, if you'll be hopping around a 10 megabyte data file, and want most of it cached, be sure to crank
maxfilesize up to at least 11 (megabytes).
local-access parameter if you're worried about the security of root access over the network. Permissions are only checked in NFS version 2 when a file
open() is completed, not on each access to the data. The local access option retrieves file attributes from the on-disk cached data, performing a local credential check against the user's credentials.
cfsadmin -d all /vol/cache/cache1. Delete the cache entries for a specific NFS server using its CacheFS id with the
-d option; get the CacheFS ids from
cfsadmin -u, as long as you increase the values. If you want to rein in a cache that has claimed too much space, or one that is causing contention with other data in the front filesystem, you'll need to delete the cache and reinitialize it using
cfsadmin with the right options, the first time.
In exchange for its complexity, CacheFS provides noticeable performance benefits on most clients on which it has been enabled. By reducing demand on the network, you gain scalability to an extent not possible by adding more or faster Ethernet segments, or by buying a bigger and better NFS server. Less is most definitely more when you're looking at possible network smashes and server thrashes.