Click on our Sponsors to help Support SunWorld

Twisty little passages all automounted alike

A second look at automounter secrets

June 1996

Abstract

The real world hollers for scalability, multiple network connections, and lower costs of administration. Your personal life benefits from the time savings of the last item, so this month we'll take a pragmatic view of using the automounter, focusing on environments characterized by multiple network paths between clients and servers, replicated servers sharing access to popular filesystems, and using client-side caching to improve performance. (2,900 words)

Last month, we took a simplistic view of the automounter, the de facto Unix standard for managing a client's NFS configuration. While covering the basics of creating maps, avoiding name space conflicts, and using shorthands like wildcards, the tips and techniques mentioned were best suited for machines with only a few exported filesystems, on a single network, with a single well-known hostname. Simple administrative techniques suffice for small networks, but be wary of extending them into a policy of "one server, one network, one filesystem." As your organization grows, you'll be dealing with several hundred small NFS servers. After you've spent hours trying to figure out where to put the newest user's home directory, and how to even out load imbalances, you'll come to appreciate the quandary faced by managers of large Novell NetWare sites. A thousand pennies are more cumbersome than a $10 bill.

We'll start with the problem of maintaining bandwidth into popular filesystems like the local development tools repository, and seeing how the automounter selects one server from the bunch. We'll refine that view a bit by discussing ways to tweak the selection algorithm when you have a maze of twisty little network passages to your servers. From there, we'll tackle the multi-homed host problem where each NFS server is seen as a little network of twisty names, and see how to optimize the automounter for use with the Solaris CacheFS client-side NFS cache. A bevy of warnings and caveats round out this month's installment on advanced client mount-point management.

The Tessier-Ashpool fileservers
From an NFS client perspective, replicated servers are complete clones of one another. That is, they offer the same filesystem, with the same file layouts, and, ideally, the same file attributes. They may be different kinds of machines, with other NFS duties as well, as long as they provide identical filesystem views for the replicated data set. One host might be a small desktop server offering only /usr/local for example, while another provides /usr/local in addition to home directories and an ftp archive. From the automounter's point of view, the two servers form a replicated server set for /usr/local. Note that the automounter doesn't ensure that the servers actually contain the same filesystem hierarchies; you could replicate across wildly different hosts and end up with wildly confused users. Building and maintaining the replicated hosts is your job; getting the NFS clients connected to one server in the set is up to the automounter.

Why bother with replicated hosts, when you can build a beefy, solar plexus of an NFS server, connected to dozens of networks? There are a variety of performance reasons:

Network bandwidth is finite. If all of your clients compete for NFS access over a single backbone network, performance degrades as you add clients or servers. The network becomes the limiting factor. Replicating fileservers lets you alleviate backbone bottlenecks.
Network latency colors NFS performance. Let's say you have to traverse three routers between a typical client and the mondo server. That can slow your "small" NFS requests, like attribute retrievals or name lookups, by a factor of two or more. It's not always feasible to connect the centralized server closer to the client networks, particularly when the client nets are distributed around the world. Instead, break up the server and replicate it, putting the data close to where it's being consumed.
Large, read-only volumes behave better when cached in the NFS server's memory. If your replicated data set is in the 10 gigabyte or more range, no single server's cache will hold all of it at once. You'll generally improve performance -- through improved caching -- by putting more servers under the data, each with an independent cache.

Note that replicated servers are for read-only data, not writable filesystems. NFS does not (currently) support write multicasting, or transactional writes to multiple servers to keep them all updated synchronously. If you want read and write replication to improve reliability, you're looking at a high-availability solution rather than the performance win offered by server replication. In a high-availability pair, the two servers offer exactly the same filesystem, since they share disks. One can take over for the other in the middle of an NFS traffic stream. With server replication, you're buying some protection from a server that crashes before you attempt to access it, but if the server crashes once you've mounted it, the situation is no better than having done the mount from a single source. A more detailed look at the replicated server mounting mechanics makes the distinction a bit clearer.

Finding the Chosen One
Setting up a replicated server set is as simple as naming the hosts in the map entry:

/usr/local		toolbox:/usr/local \
			workhorse:/export/usr/local \
			distbox:/usr/local/sun4

If all of the servers use the same filesystem naming convention, you can simplify the entry by enumerating the server names:

/usr/local		toolbox,workhorse,distbox:/usr/local
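
To make the equivalence of the two forms concrete, here is a minimal Python sketch that expands either style of location list into (host, path) pairs. The function name `expand_locations` is ours, invented for illustration; it mimics the map syntax shown above and is not the automounter's own parser.

```python
def expand_locations(spec):
    """Expand an automounter location list into (host, path) pairs.

    Accepts the long form ("a:/p b:/q", possibly continued with
    backslash-newline) and the shorthand ("a,b:/p").
    """
    pairs = []
    # Join continuation lines, then split the list on whitespace.
    for chunk in spec.replace("\\\n", " ").split():
        hosts, _, path = chunk.partition(":")
        # A comma-separated host list shares one path.
        for host in hosts.split(","):
            pairs.append((host, path))
    return pairs


print(expand_locations("toolbox,workhorse,distbox:/usr/local"))
```

The shorthand only works when every server exports the data under the same pathname; the long form is still needed for a set like the first example, where each host uses a different export path.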

With such a nice assortment of machines, how does the automounter choose one? There's a three-step process:

1. Each server is pinged by the automounter with a null NFS RPC call. A null RPC is the equivalent of an ICMP ping: the server receives the RPC and immediately returns a response. The null call gives a rough indication of server load, since the null RPC will sit further down in the work queue of a heavily loaded server. The automounter also uses null RPC pings as a simple heartbeat to see which servers are responsive. During the first step, any servers that are down, or are so busy that they dropped the null RPC and appear to have crashed, are removed from the candidate list.
2. Servers on a subnet to which the client is directly attached are preferred. The client's automounter process compares the IP subnet numbers of the servers named in the list to the subnet or subnets on which the client sits, and ranks those on local subnets with higher priority than those one hop or more away.
3. The servers are sorted by response time, using the round-trip request to reply timing for the null RPC. The fastest server is selected for the mount operation.

The subnet address match works in Solaris 2.4 and later releases. In earlier automounter implementations there's no check on subnet addresses, so it's conceivable that a fast server sitting across a few lightly loaded router hops would turn in a faster round trip lap time than a server on the same subnet -- not the best binding from a network perspective, but the best the automounter could choose without further information.
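
The selection steps described above can be sketched in a few lines of Python. This is a simulation under stated assumptions, not the automounter's code: the `Candidate` record, its `local` flag, and the timings are hypothetical stand-ins for what the real null RPC pings and subnet comparison would produce.

```python
from collections import namedtuple

# host: server name; local: True if it shares a subnet with the client;
# rtt_ms: null RPC round-trip time, or None if the server never answered.
Candidate = namedtuple("Candidate", "host local rtt_ms")


def choose_server(candidates):
    """Return the host to mount from, or None if no server answered."""
    # Step 1: drop servers that did not answer the null RPC.
    alive = [c for c in candidates if c.rtt_ms is not None]
    if not alive:
        return None
    # Step 2: prefer responsive servers on a local subnet, if there are any.
    on_local_subnet = [c for c in alive if c.local]
    pool = on_local_subnet or alive
    # Step 3: the fastest null RPC reply wins.
    return min(pool, key=lambda c: c.rtt_ms).host


servers = [
    Candidate("toolbox", False, 2.0),    # fast, but across a router
    Candidate("workhorse", True, 9.0),   # slower, same subnet
    Candidate("distbox", True, None),    # did not answer
]
print(choose_server(servers))
```

Dropping step 2 from the sketch reproduces the older behavior: the fast server across the router would win on response time alone.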

Note that the hard work is done at mount time by the automounter, and once the lucky server is identified, it's mounted and used as if it were the only server named in the map entry. This means that the usual automounted filesystem dependency rules apply: If the server crashes after you've mounted it, your client can get stuck on the non-responsive mount point. Beyond better response time for heavily used volumes, replicated volumes deliver their biggest reliability benefit when you mount, time out, and unmount the filesystem repeatedly. When you're casually bouncing through volumes, and the automounter is able to unmount the quiescent ones, you reduce your risk of being impaired by a crashed server -- if you go to re-mount the volume after your former server crashed, you'll pick up a different server on the next reference to the filesystem.

Again, once a server has been selected from the list, there's no re-adjustment made by the client. If hundreds of clients all pounce on the same server at once, the server has no way to retroactively say "Those null RPC response times were too optimistic, I'm much busier now with real RPC requests." It's not unusual to see 80 percent or more of all NFS clients using replicated servers binding to the same server, victimizing it for being physically close on the network or temporarily lightly loaded. Fortunately, the Solaris 2.x automounter lets you skew the timings to force a more equitable distribution of NFS clients.

A weighty decision
Consider this scenario: Engineers in your office start arriving at 9 am. Before then, all of the automounted filesystems have been unmounted, the desktops having been left alone overnight. As people begin to reference the popular shared volumes, the automounter sends flurries of null RPCs to the servers listed in replicated server sets. Since there's little sustained load at this point, the fastest server or machine closest to a desktop "wins" each time. An hour later, everyone is bound to the same one or two servers, while the other four or five in your set sit, sadly, squandering the work invested in replicating the data sets.

What's the solution? Break the clients up into groups, and skew the selection process so each group has a preferred server. The automounter provides a server weighting mechanism that kicks in after subnet address matching is performed. Here's a sample automounter map entry with weights assigned:

/usr/local		toolbox(30),workhorse(10),distbox(5):/usr/local

Weights are used as penalties: the bigger the weight, the less you want to deal with that server. Leaving off the weight sets it to 0, for the highest preference. If all of the servers are on the same network, distbox will be chosen first if it is responsive, then workhorse, and finally toolbox. Network address matching takes precedence over weights, however, so workhorse may be selected if it's the only server on the same network as the client doing the automount.
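
Here is a rough Python sketch of how the weighted entry above orders its servers. The helper names `parse_weighted` and `preference_order` are ours, and the sort key (local subnet first, then lowest weight) is an assumption drawn from the description in this column rather than from the automounter source. It also assumes every server listed has already answered its null RPC.

```python
import re
from collections import namedtuple

Server = namedtuple("Server", "host weight")


def parse_weighted(spec):
    """Parse 'a(30),b(10),c:/path' into ([Server, ...], path)."""
    hosts, _, path = spec.partition(":")
    servers = []
    for item in hosts.split(","):
        match = re.fullmatch(r"([^()]+)(?:\((\d+)\))?", item)
        if match is None:
            raise ValueError("bad server spec: %r" % item)
        # A missing weight means 0, the highest preference.
        servers.append(Server(match.group(1), int(match.group(2) or 0)))
    return servers, path


def preference_order(servers, local_hosts):
    """Order servers: those on a local subnet first, then by lowest weight."""
    ranked = sorted(servers, key=lambda s: (s.host not in local_hosts, s.weight))
    return [s.host for s in ranked]


servers, path = parse_weighted("toolbox(30),workhorse(10),distbox(5):/usr/local")
print(preference_order(servers, {"toolbox", "workhorse", "distbox"}))
print(preference_order(servers, {"workhorse"}))
```

The first call models all three servers on the client's network; the second models a client that shares a subnet only with workhorse, which then jumps to the front despite its larger weight.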

There are several algorithms you can apply to divide your clients into groups:

Put machines doing similar kinds of work together, assuming they'll use the same files and improve server-side file caching.
Separate heavy-duty NFS users into distinct groups, so that they don't gang up on the same server.
Use the physical network architecture to suggest a logical workgroup partitioning, minimizing the amount of NFS traffic that goes through bridges or routers.

The biggest drawback to server weights is that they're likely to be different for each client, making it hard to use the same automounter map across the board, and close to impossible to put the map under NIS or NIS+ control. To consolidate map management again, use automounter variables for the weights, so a generic map can be used by all clients with the per-client automounter invocation supplying the weight data:

/usr/local	toolbox($W1),workhorse($W2),distbox($W3):/usr/local

Change your automounter invocation on the client to define the weight values:

automount  -DW1=30 -DW2=10 -DW3=5

You can parameterize this command line as well, running a script to deduce the weights from network numbers or a configuration file that maps client names to group numbers and weights.
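
As one possible shape for such a script, the Python sketch below maps a client's IP address to a group and emits the matching command line. The subnets, the weights in `GROUPS`, and the `DEFAULT` fallback are all invented for illustration; only the `-D` variable syntax comes from the invocation shown above.

```python
import ipaddress

# Hypothetical policy table: client subnet -> weights for W1, W2, W3.
GROUPS = {
    "192.168.10.0/24": (30, 10, 5),   # this group prefers distbox
    "192.168.20.0/24": (5, 30, 10),   # this group prefers toolbox
}
DEFAULT = (10, 10, 10)                # no preference for unknown clients


def automount_args(client_ip):
    """Build the automount command line for a client address."""
    addr = ipaddress.ip_address(client_ip)
    weights = DEFAULT
    for net, group_weights in GROUPS.items():
        if addr in ipaddress.ip_network(net):
            weights = group_weights
            break
    flags = ["-DW%d=%d" % (i, w) for i, w in enumerate(weights, start=1)]
    return ["automount"] + flags


print(" ".join(automount_args("192.168.10.42")))
```

Keeping the policy in one table means the shared map never changes; only the table does when you rebalance the groups.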

Executable content another way
What do you do if you have the inverse problem, that is, one server with many interfaces, and you don't want to have a plethora of maps containing the "best" interface name for each client? If you put server fred on ten networks, it's going to present itself with names like fred-net1 and fred-net7. Clients that use the name fred may have their NFS requests routed through the network, twisting and turning through a maze of less-than-ideal passages. Assuming you're not solving the problem with DNS (creating multiple IP address records for the same name), you can tackle this one in two ways with the automounter.

List all of the server names, as if it were a replicated server set, and let the automounter find the "best" path using subnet address matching. When all clients are on the same subnet as one of the server interfaces, this is the way to go. Note that NFS clients on different subnets will mount from the "default" IP address, that is, the one that gets returned by gethostbyname() on the server. When the subnet address match fails, the automounter uses the null RPC ping data to sort the servers and then uses the IP address in the RPC response packet to identify the server. For a multihomed server, these IP addresses will be the same.

Combine the replicated server set listing with server weights to deal with clients on non-local subnets. For example:

/usr/local	fred-net1(1),fred-net7(2),fred(3):/usr/local

Again, you can substitute variables for the weights to simplify map maintenance and distribution.
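
The subnet address match that drives the multihomed case can be illustrated with Python's standard `ipaddress` module. The interface addresses and client networks below are hypothetical, and the function is a sketch of the comparison as described here, not the automounter's implementation.

```python
import ipaddress


def local_interface_names(client_networks, server_interfaces):
    """Return the server names whose address sits on one of the client's subnets."""
    nets = [ipaddress.ip_network(n) for n in client_networks]
    return [
        name
        for name, addr in server_interfaces.items()
        if any(ipaddress.ip_address(addr) in net for net in nets)
    ]


# Hypothetical addresses for three of fred's names.
fred = {
    "fred-net1": "192.168.1.5",
    "fred-net7": "192.168.7.5",
    "fred": "192.168.3.5",
}

print(local_interface_names(["192.168.7.0/24"], fred))   # a client on net7
print(local_interface_names(["10.1.0.0/16"], fred))      # a client on no shared subnet
```

An empty result is the case where the match fails and the weights in the map entry decide which name the client uses.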

Let's add another spin: how do you make this work with CacheFS? A typical automounter master map entry to use CacheFS for an entire indirect map looks like this:

/home	auto.home	-fstype=cachefs,backfstype=nfs

The Solaris automounter isn't particular about the underlying filesystem type of the mount point -- it defaults to NFS, but it can handle CacheFS as well as HSFS mounts using the fstype option. The first reference to something in /home completes the CacheFS mount of the front filesystem and the backing NFS filesystem mount. What do you do if you have replicated servers for the back filesystem? CacheFS uses file handles and other host-specific information in the cache, so you'll end up with distinct cache entries for each server you mount from, even if you are accessing the same named files. CacheFS has no way of knowing that two versions of Netscape on different servers are byte-for-byte identical, so it caches them both. Therefore, make sure you enforce some client preferences in the automounter map entries using server weights, so that clients re-bind to the same server whenever possible and maintain their cache "warmth." You'll still rebind to a different server if your first choice is down, and take the hit of repopulating the client-side disk cache, but the cache-miss penalty is far smaller than that of being dead in the water waiting on a crashed NFS server.
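
A toy model shows why rebinding costs you cache warmth. The `ToyCache` class below is purely illustrative and says nothing about how CacheFS lays out its cache on disk; the only property it borrows from the discussion above is that entries are tied to the server they came from.

```python
class ToyCache:
    """Toy client-side cache keyed on (server, path). Not CacheFS itself."""

    def __init__(self):
        self.entries = {}
        self.misses = 0

    def read(self, server, path, fetch):
        key = (server, path)
        if key not in self.entries:
            # First access through this server: go to the back filesystem.
            self.misses += 1
            self.entries[key] = fetch(server, path)
        return self.entries[key]


def fetch(server, path):
    # Stand-in for an NFS read; every replica returns identical bytes.
    return b"identical contents"


cache = ToyCache()
cache.read("toolbox", "/usr/local/bin/netscape", fetch)
cache.read("toolbox", "/usr/local/bin/netscape", fetch)    # warm: no new miss
cache.read("workhorse", "/usr/local/bin/netscape", fetch)  # rebind: cold again
print(cache.misses, len(cache.entries))
```

The same file is stored twice because the cache cannot tell that the two servers hold identical data, which is the argument for pinning each client to a preferred server with weights.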

Finally, for the truly adventurous, the Solaris automounter supports a new map type called an executable map. The executable map is just a script that is run by the automounter when the map is referenced. The key value is passed to the executable map as an argument, and the automounter expects a valid map entry (key-value pair) in return, or a null return value if the key can't be matched. Make a map executable by setting the execute permission bit, and ensuring it is a valid script, starting with #!/bin/sh or something similar. Again, this precludes the use of NIS or NIS+ for the map, since shell scripts don't fit the key-value pair format too well.

Using an executable map, you can merge the weight determination and insertion steps, changing your values dynamically based on network load or NFS usage statistics. Executable maps are tricky, and you don't want to be too elaborate with them because each mount operation is delayed by the time it takes to execute the script. If you look up trend information in your network management database and calculate weights based on the time of day and traffic flow, you might take 4 to 5 seconds to return a map entry. Meanwhile, the user who invoked /usr/local/bin/mosaic is waiting at least that long for the /usr/local mount to occur before mosaic even starts its initialization. Simpler is better; faster is better; less configuration and maintenance work is better.
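
To show the general shape of an executable map that stays fast, here is a minimal sketch in Python rather than the Bourne shell. Several things are assumptions: the `WEIGHTS` and `EXPORTS` tables are hypothetical and would be precomputed elsewhere, the key `local` belongs to an imagined indirect map, and the script prints only the value portion of the entry (server list and path) on standard output, printing nothing for an unknown key. Check your automounter documentation for the exact output it expects.

```python
#!/usr/bin/env python3
import sys

# Hypothetical tables, refreshed out of band (for example by a nightly job)
# so that each lookup is a dictionary access rather than a database query.
WEIGHTS = {"toolbox": 30, "workhorse": 10, "distbox": 5}
EXPORTS = {"local": "/usr/local"}   # map key -> exported path


def entry_for(key):
    """Return the weighted location list for a key, or '' if unknown."""
    path = EXPORTS.get(key)
    if path is None:
        return ""
    hosts = ",".join("%s(%d)" % (host, weight) for host, weight in WEIGHTS.items())
    return "%s:%s" % (hosts, path)


def main(argv):
    # The automounter passes the key being looked up as the first argument.
    if len(argv) < 2:
        return
    entry = entry_for(argv[1])
    if entry:
        print(entry)


if __name__ == "__main__":
    main(sys.argv)
```

All of the slow work lives outside the script, so the user waiting on the mount pays only for starting the interpreter and one table lookup.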

More songs about building mount tables
We'll close out this month's excursion into the automounter with three basic warnings:

Don't use the subdirectory map entry feature. A subdirectory mount looks like this, in an indirect map for /home:
```
	stern		sunrise:/export:&
	sue		divi:/export:&
	frank		monmouth:/export/home:&
	*		wasteland:/export/home:&
```
With a subdirectory mount structure, the first reference to a filesystem from a server completes an NFS mount, then subsequent references create symbolic links to the common parent directory. The primary problem with subdirectory mounts is that the pathnames used contain the key used for the first mount -- if you reference the filesystems in a different order the next time, you'll get a different set of mount points. It's amazingly confusing and of little value. In the early days of the automounter, NFS mounts consumed precious kernel resources, and reducing the number of mount points made the kernel and users happier. In Solaris, the number of mounts isn't unlimited but it's large enough not to worry about. With NFS over TCP in Solaris 2.5, there's a single TCP connection from the client to each server anyway, so you're not changing much by reducing the number of mount points.
To put automounter maps under NIS+ control, make sure you include the mount options and the server specification as part of the value field in the NIS+ table entry. You'd expect that the mount options would form their own column in the table, but they don't. The automounter parses the entire value string out of the NIS+ map, and it expects to find mount options there.
When you use mount from the command line, you are really calling a helper executable /usr/lib/fs/nfs/mount, documented under mount_nfs. The NFS-specific mount executable understands all of the possible options, parses them, and passes them into the kernel. The automounter doesn't use the helper -- it calls mount(2) directly. Consequently, your version of the automounter may not understand some undocumented or recent additions to the NFS options camp, like the llock option that turns off remote file locking. Generally, these should be reported to your vendor as bugs (the llock problem is currently being fixed by Sun).

To make sure we've thoroughly flogged the NFS horse, next month we'll cover the ins and outs of NFS and CD-ROM drives, some automounter debugging tips, keeping users out of each other's NFS-mounted mail spools, and Sun's proposed WebNFS access paradigm for shuffling files through the Internet.

Resources

"A file by any other name"
/sunworldonline/swol-09-1995/swol-09-sysadmin.html
"Automatic for the people"
/sunworldonline/swol-05-1996/swol-05-sysadmin.html
A list of Hal Stern's other Sysadmin
/sunworldonline/common/swol-backissues-columns.html#sysadmin

If you have technical problems with this magazine, contact webmaster@sunworld.com

URL: http://www.sunworld.com/swol-06-1996/swol-06-sysadmin.html