Sysadmin by Hal Stern

A file by any other name

The Unix filesystem is the primary mechanism
we use to locate our data. Here's how it works.

SunWorld
September 1995

Abstract
The Unix filesystem can be a rat's nest of hostile symbolic links, obstinate device mounts, and corpulent directories. A few simple tips and a handful of utilities can prevent order from turning into chaos. (3,800 words)


Despite all of the work that has been done with object-oriented systems and user interfaces, the Unix filesystem is the primary mechanism we use to locate and interact with our data. The hierarchical namespace is simple and relatively easy to use -- until you begin adding symbolic links, NFS mounts, removable media, and other wormholes that take you from one disk to another with little warning. Large Unix environments stress the filesystem in new and creative ways: Processes go file-crazy and run into system-resource limitations. Users who have yet to discover the wonder of directories complain of terrible system performance. Just when you think your list of headaches is complete, you try to unmount a CD-ROM but continually get "filesystem busy" error messages, or you promise to remove the mailbox of a user who is hanging you up -- as soon as you determine his or her identity.

This month, we'll sort through some filesystem navigation issues. We'll look at open() to see how processes find files, and examine some performance issues. From there, it's off to the links -- hard and symbolic -- to see how they alter paths through the filesystem. Finally, we'll look at the tools available to find the user associated with an open file or directory. While we may not offer any solutions to the deep-versus-wide directory layout argument, we'll try to make sure that no matter what you call your files, you'll get the right bits.

Open duplicity

We see the filesystem as a tree of directories and filenames. These physical names aren't used inside of a process. Logical names known as file descriptors, or file handles, are used to identify a file for reading or writing. Unix file descriptors are integers returned by the open() or dup() system calls. Pass a filesystem pathname to open() along with the read or write permissions you want, and open() returns a file descriptor or an error:

#include <fcntl.h>	/* open() and the O_* flags */
#include <stdio.h>
#include <stdlib.h>

int fd;
fd = open("/home/stern/cols/sep95.doc", O_RDONLY);
if (fd < 0) {
	fprintf(stderr, "cannot open file\n");
	exit(1);
}



You can open special (raw) devices or special files like Unix domain sockets using a similar code fragment. File descriptors are maintained on a per-process basis, in the same kernel data structures that keep track of the stack, address space, and signal handlers. Every process starts out with file descriptors 0, 1, and 2 assigned to the standard input, output, and error streams, respectively. Each call to open() returns the lowest-numbered available file handle, so when you close() a descriptor, it becomes the next one handed out. The integer is really just an index into a table of per-process file descriptors that point to the system open file table. The system file table points to other file-specific information like inodes, NFS server addresses, or protocol-control blocks for file descriptors associated with sockets.
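
To watch the lowest-available rule in action, here's a minimal, self-contained demonstration (it reads /etc/passwd purely because that file always exists): the first open() in a fresh process returns 3, and after close(0) the next open() re-uses descriptor 0.

#include <fcntl.h>
#include <stdio.h>
#include <unistd.h>

int main(void)
{
	int fd1, fd2;

	fd1 = open("/etc/passwd", O_RDONLY);	/* 0, 1, 2 are taken: returns 3 */
	printf("first open: %d\n", fd1);

	close(0);				/* give up standard input */
	fd2 = open("/etc/passwd", O_RDONLY);	/* lowest free slot: returns 0 */
	printf("after close(0): %d\n", fd2);
	return 0;
}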

What's the point of dup() if open() is the primary mechanism for converting pathnames to file handles? When you want to use the same file for two output streams, such as stdout and stderr, use dup() to copy one descriptor into another. The following code segment closes the default stdout and stderr streams and pumps both into a new file:

#include <fcntl.h>
#include <unistd.h>

int fd;
close(1);
close(2);
fd = open("/home/stern/log/errors", O_WRONLY|O_CREAT, 0644);	/* becomes fd 1 */
dup(fd);							/* copied into fd 2 */

Because the file descriptors for stdout and stderr were closed before the calls to open() and dup(), these handles are re-used: open() grabs the lowest free slot and returns descriptor 1, and dup() copies it into descriptor 2.

Open file descriptors are preserved when new processes are created with fork() and vfork(), so any process that gets started via exec() inherits the open file descriptors left by the calling process. Here's a simple example of using dup() and fork() to set up a pipe between two processes:

#include <unistd.h>

int fd[2], pid;
pipe(fd);			/* fd[0] is the read end, fd[1] the write end */
close(0);
dup(fd[0]);			/* connect stdin to the pipe's read end */
if ((pid = fork()) == 0) {
	close(1);
	close(2);
	dup(fd[1]);		/* connect stdout */
	dup(fd[1]);		/* and stderr */
	execlp("writer", "writer", (char *)0);
}
execlp("reader", "reader", (char *)0);

First the call to pipe() creates a pipe and returns file descriptors for the reading and writing ends. Standard input is closed, and the reading end of the pipe is connected via dup(). A new process is created using fork(), and its stdout and stderr are fitted into the writing end of the pipe. The child process execs the writer and the parent process becomes the reader. This is a slimmed-down version of what happens inside the shell when you execute command_a | command_b from the command line.

The per-process file descriptor table has a default maximum size of 20 in SunOS 4.1.x and 64 in Solaris. If you exceed the limit, the next call to open() or dup() fails with the error EMFILE. Most processes don't touch that many files, but system processes or connection managers such as inetd may open many file descriptors. Relax the limit using the C shell's unlimit command. You can even start inetd in a subshell that has the file descriptor limit removed:

luey% ( unlimit descriptors; inetd & )

Setting the file descriptor ceiling to "unlimited" really means 128 in SunOS 4.1.x, and 1,024 in Solaris. The smaller limit in SunOS is due to the standard I/O library (and others) storing the file descriptor in a signed char. If you are porting code from BSD platforms to Solaris, be on the lookout for snippets that squeeze file descriptors into 8-bit signed variables and test for return values less than zero: Solaris can return file descriptors of 128 or higher, which look negative -- and therefore like failures -- even though they are valid file handles. Solaris also dynamically allocates space in the per-process file descriptor table, so removing the file descriptor limit won't make your processes bloat with unused table space.
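
Here's a hedged sketch of that porting bug (assuming the descriptor limit has already been raised well past 128): stuffing a descriptor into an 8-bit signed variable makes a perfectly good file handle look like a failure.

#include <fcntl.h>
#include <stdio.h>

int main(void)
{
	signed char fd;	/* BUG: should be int -- descriptors >= 128 go negative */
	int i;

	/* Burn through descriptors so the next open() returns a large one;
	 * assumes the per-process limit was raised past 128 beforehand. */
	for (i = 0; i < 130; i++)
		open("/etc/passwd", O_RDONLY);

	fd = open("/etc/passwd", O_RDONLY);
	if (fd < 0)	/* fires even though open() succeeded */
		printf("open \"failed\" with descriptor %d stored in a char\n", fd);
	return 0;
}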

Open system call

Peering inside the open() system call lets us learn more about the performance implications of excessively deep or wide directory structures. When open() converts a pathname into a file handle, it starts by walking down the path one component at a time. For example, to look up /home/stern/log/errors, first home is found in the root directory, then stern is found under home, and so on. To locate a filename component, a linear search is performed on the directory -- Unix doesn't keep directory entries sorted on disk, so each one must be examined to find a match or determine that the file or directory doesn't exist.
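
The kernel performs this scan on raw directory blocks, but the same linear search is easy to sketch in user space with readdir(). The component_lookup() helper below is purely illustrative -- it is not the kernel's code:

#include <dirent.h>
#include <string.h>

/* Scan one directory, entry by entry, for one pathname component --
 * roughly the work the kernel repeats for each piece of a pathname. */
int component_lookup(const char *dir, const char *name)
{
	DIR *dp;
	struct dirent *de;
	int found = 0;

	if ((dp = opendir(dir)) == NULL)
		return 0;
	while ((de = readdir(dp)) != NULL) {	/* entries are not sorted */
		if (strcmp(de->d_name, name) == 0) {
			found = 1;
			break;
		}
	}
	closedir(dp);
	return found;
}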

Why is the lookup done one component at a time? Any of those directories could be a mount point or a symbolic link, taking the pathname resolution to another filesystem or another machine in the case of an NFS mount. If /home/stern is NFS-mounted, the first two lookups occur on the local disk, then the request to find log in stern is sent over the network to the NFS server. These pathname resolutions show up as NFS lookup requests and can account for as much as 40% of your NFS traffic.

Pathname-to-file-descriptor parsing can be a disk-, network-, and server-intensive business. If the directory entries must be searched, they have to be read from disk, incurring file-access overhead before the file is opened for normal I/O. When you are using NFS-mounted files, you may be performing directory searches on the server, accruing network round-trip and disk I/O penalties.



So how does the operating system accelerate this process? Recently completed lookups are kept in the Directory Name Lookup Cache (DNLC), an association of directories and component names that is checked before performing a directory search. DNLC statistics are accessible through vmstat -s:

luey% vmstat -s
     ...
  462936 total name lookups (cache hits 91%)
     214 toolong

Ideally, your hit rate should be over 90%, and as close to 99% as possible. However, several things conspire to make the DNLC less efficient: names too long to be cached are skipped entirely (the toolong count above), and churning through more directories than the cache can hold forces entries out before they can be re-used.

Sometimes vmstat -s reports bizarre statistics, usually when its internal counters overflow and the arithmetic becomes meaningless. Kimberley Brown, co-author of Panic! Unix System Crash Dump Analysis, provides a detailed adb script to dump out all of the name cache (nc) statistics. It shows you the number of hits, misses, too-long names, and purges.

Deep and wide

Given that directory searches occur in linear fashion, but each pathname component requires another lookup operation, is it better to have deep or wide directories? Do you do better having all of your files in just a few directories, minimizing the number of lookups, or do you go for more lookups but make each one go faster because the directories have only a few entries?

The answer, as usual, is to avoid the extremes. Excessively deep directory structures generate large numbers of lookup requests and can frustrate users trying to navigate up and down half a dozen levels of directories. Perhaps the worst file naming convention is to take a serial number and use each digit as a directory, that is, document 51269 becomes /docs/5/1/2/6/9. While it's trivial to find any document, it's painful to consume so much of the disk with directories. Use a hash table to find documents quickly, or group them with several dozen documents in a single directory.
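
As a sketch of the grouping approach, derive a bucket directory from the document number so that documents cluster a few dozen to a directory. The /docs prefix and bucket count here are illustrative assumptions, not a standard layout:

#include <stdio.h>

#define BUCKETS 100	/* about a hundred subdirectories, many docs in each */

/* Map document 51269 to /docs/69/51269 instead of /docs/5/1/2/6/9. */
void doc_path(int docno, char *buf)
{
	sprintf(buf, "/docs/%02d/%d", docno % BUCKETS, docno);
}

int main(void)
{
	char path[64];

	doc_path(51269, path);
	printf("%s\n", path);	/* prints /docs/69/51269 */
	return 0;
}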

Large, flat directories are equally bad. When a user complains that ls takes a long time, try ls | wc to see how many files are in the current directory. In addition to the disk hits required to search the directory, you're also paying a CPU time penalty to sort the listing.

Link to La-La-Land

Pathname resolution takes a turn when a mount point is encountered, switching from one local disk to another or to an NFS server. Symbolic links cause a similar detour. First, some background on links and their implementation. Filesystem links come in two flavors: hard and symbolic. Hard links are merely multiple names for the same file. They exist as duplicate directory entries pointing to the same inode -- and therefore the same blocks on disk. Create hard links using the ln command:

luey% ln test1 test2

Because every hard link refers to the same inode, and inode numbers are only unique within a single filesystem, hard links cannot span filesystem boundaries. If you're wondering why hard links are useful, consider the mv command. You could rename a file by copying it and removing the old name. However, this assumes you have disk space for two copies of the file while the operation is in progress, and it may upset any careful tuning you've done of the disks (see SysAdmin, Advanced Systems, May 1994: "Looking for Mr. Good Block"). Instead of copying the file, mv uses a hard link to create the new name, then it removes the old name. The disk blocks never move, unless you're moving the file across filesystems, in which case mv copies the file.
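
For the single-filesystem case, the link-then-unlink idiom takes only a few lines; here's a minimal sketch (the rename_by_link() name is ours, not anything from mv's source):

#include <stdio.h>
#include <unistd.h>

/* Rename oldname to newname without copying any data blocks:
 * create a second directory entry, then remove the first. */
int rename_by_link(const char *oldname, const char *newname)
{
	if (link(oldname, newname) < 0) {	/* fails across filesystems */
		perror("link");
		return -1;
	}
	if (unlink(oldname) < 0) {
		perror("unlink");
		return -1;
	}
	return 0;
}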

Symbolic links are pointers to filesystem pathnames, not filesystem data blocks. They're also created with the ln command, using the -s flag:

luey% ln -s /var/log/stern/errors/sep95 /home/stern/log

The link target is the first argument, and the link is the second argument. When open() hits a symbolic link, it follows the link by substituting its value for the current pathname component. In the above example, /home/stern/log is a link to /var/log/stern/errors/sep95 so the lookup of log in stern turns into a lookup of var in the root directory. Encountering a symbolic link lengthens the pathname lookup if the link has many components, and it may force a disk access to read the link. All symbolic links in SunOS are read from the disk, while only those with more than 32 characters in the target are read from disk in Solaris -- shorter links are kept in memory with the link's inode.
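
You can read a link's value yourself with readlink(), which hands back the raw target string. A short sketch using the link created above (note that readlink() does not null-terminate the buffer):

#include <stdio.h>
#include <unistd.h>

int main(void)
{
	char target[1024];
	int n;

	n = readlink("/home/stern/log", target, sizeof(target) - 1);
	if (n < 0) {
		perror("readlink");
		return 1;
	}
	target[n] = '\0';	/* readlink() leaves the string unterminated */
	printf("/home/stern/log -> %s\n", target);
	return 0;
}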

Absolute links point to a pathname that begins with a /, while relative links assume the current directory as a starting point. For example:

luey% ln -s ../logs/sep95 error

creates a link named error in the current directory, pointing to ../logs/sep95. Using symbolic links successfully means enforcing consistency in filesystem access on all of your clients. If you use absolute links, you must be sure that all of the pathname components in the link are accessible to processes that hit the link. With relative links, you run the risk of ending up somewhere unexpected when you walk up the directory tree.

One of the most common uses of symbolic links is the creation of a consistent name space built from CPU-, OS-, or release-specific components. For example, /usr/local/bin contains binaries specific to a CPU architecture and an OS release. Many administrators use symbolic links to point to the "right" version of /usr/local/bin:

luey% ln -s /usr/local/sun4.solaris2.4/bin /usr/local/bin

Maintaining the links gets more difficult as the number of special cases grows, and it makes performing an upgrade less than pleasant. Another solution is to use the hierarchical mount feature of the automounter, letting it build up the right combination of generic and specific filesystems on the fly. Here's a hierarchical automounter map entry for /usr/local:

/usr/local	\
	/		servera:/usr/local
	/bin		servera:/usr/local/bin.$ARCH.$OS
	/lib		serverb:/usr/local/lib.$ARCH.$OS

First the generic /usr/local template is mounted, giving you machine-independent directories like /usr/local/share and /usr/local/man. Then the bin and lib directories are dropped on top, using the variants named by the automounter variables. Invoke the automounter with the appropriate variable definitions on each machine:

luey% automount -D OS=solaris2.4 -D ARCH=sun4

When you upgrade the OS, change this line to reflect the new OS variable value, and all of your machine dependencies are resolved.

Virtual Boston

Wise use of the automounter and symbolic links can keep your users from falling off the edge of a filesystem into previously uncharted net-surf. But what if you want to explicitly prevent prowling? How can you create a restricted environment from which scripts, users, and not-so-well-intentioned visitors cannot escape? The magic system call is chroot() -- change root. When chroot() is called with a pathname, that directory becomes the new virtual root of the filesystem. The calling process and all of its descendants only see files in the virtual root and below.

Anonymous ftp servers use chroot() to keep you corralled in the public ftp area. Webmasters would sleep better at night if CGI (common gateway interface) scripts ran with chroot() so that file damage or exposure caused by script failures could be contained to a selected subset of the server's filesystems.

Once you call chroot(), anything you'll need has to appear in the truncated filesystem. If you are using anonymous ftp, and you expect users to generate listings, you'll need a version of /bin/ls as well as the dynamic linker ld.so and the essential C runtime libraries from /lib. Let's say you want to use /export/pub/ftp as the root of your subset. You'll need to create the following directories and populate them: /export/pub/ftp/bin for binaries such as ls and ld.so, /export/pub/ftp/lib for dynamically linked libraries, and /export/pub/ftp/dev for special devices needed at runtime, like /dev/zero.

For more information, consult the generic sun-managers Frequently Asked Questions list, which contains many more relevant pointers.

If you use chroot() in other places, for example, creating electronically padded cells for scripts, be careful about using symbolic links. Absolute links' values still refer to the old root of the filesystem. If you perform chroot("/home/cell"), and then try to resolve a link that points to /var/log/errors, the actual pathname created (relative to the true root of the filesystem) is /home/cell/var/log/errors. Relative links are a safer bet for portability when using chroot(), but again be sure that you don't try to walk above the root of the truncated filesystem. An arbitrarily long chain of .. components still only hits the subset root, not the true filesystem root.

You must be root to call chroot(), so it is usually called by daemons started by root just before they exec() another image or change their effective user ID to something non-privileged. Changing your filesystem-centric viewpoint to prune unrelated directories gives you a Virtual Boston -- the "hub of the universe." Should you not believe the hype, check out the view, the shopping, or its bovine Internet service provider.
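
Pulling the pieces together, here's a minimal sketch of that daemon idiom -- jail, change directory, drop privileges, exec -- with the jail path and the "nobody" user ID as illustrative assumptions:

#include <stdio.h>
#include <stdlib.h>
#include <unistd.h>

int main(void)
{
	/* Must be running as root for chroot() to succeed. */
	if (chroot("/export/pub/ftp") < 0) {
		perror("chroot");
		exit(1);
	}
	if (chdir("/") < 0) {	/* don't leave the cwd outside the jail */
		perror("chdir");
		exit(1);
	}
	/* Drop privileges; 60001 is "nobody" on Solaris. */
	if (setgid(60001) < 0 || setuid(60001) < 0) {
		perror("setuid");
		exit(1);
	}
	/* Resolved inside the jail: really /export/pub/ftp/bin/ls. */
	execl("/bin/ls", "ls", "-l", (char *)0);
	perror("execl");
	exit(1);
}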

User descriptors

By now you can help your users get to files, create shortcuts, and maybe even eliminate the odd complaint about the terrible performance of File Manager on that home directory with a thousand files in it. The true test of courage is to find out what files are in use without having to send mail to everyone in the building. Fortunately, there is a tool triumvirate to let you walk from the system open file table back to associated processes.

The first is a built-in tool called fuser, short for file user. Give it a filename, directory name, or a special device, and it dumps out the process IDs that have open file descriptors pointing to the argument. fuser's output shows you whether the process has the file open as its current working directory, as a normal file, or as its root directory:

luey# fuser /var/spool/lp
/var/spool/lp: 150c 

The c indicates that the directory is used as a current working directory; exploring the process ID with ps yields:

luey# ps -p 150
PID TTY        TIME COMD
150 ?          0:00 lpNet

Because fuser goes into the kernel to scan the system file table, it must be run as root. One of the more powerful applications of fuser is finding the process that's got an open file or working directory on the CD device, preventing the CD-ROM caddy from ejecting with "file busy" messages:

luey# fuser /dev/dsk/c0t6d0s0
/dev/dsk/c0t6d0s0:      166o

Process 166 has a file open on the CD-ROM; killing it will let you eject away. Note that fuser writes process IDs to stdout, but the open status letters go to stderr. If you want a list of process IDs to hand to kill or another command, catch the standard output of fuser and ignore the commentary on standard error.

A similar tool for SunOS 4.1.x is ofiles. It takes a filename or device and tells you the processes that are using it and the type of use, including read or write locks on the file. ofiles merges the file descriptor table walk with some of the functionality of ps, making it a simpler tool to use:

luey% ofiles /var
/var/	/var/ (mount point)
USER        PID  TYPE    FD  CMD           
root         59  file/x   7  ypbind        
root        186  cwd         cron          
root        186  file     3  cron          
root         95  file     9  syslogd       
root        103  cwd         sendmail      
root        194  file/x   3  lpd           

ofiles has the added advantage of understanding protocol control blocks (PCB) associated with sockets. Again, this only works under SunOS 4.1.x, but you can answer nasty questions like "Who owns the socket on port 139?" First use netstat -A to dump out the port number and PCB table:

luey% netstat -A 
Active Internet connections
PCB      Proto Recv-Q Send-Q  Local Address      Foreign Address    (state)
f8984400 tcp        0      0  duck.139           pcserver.3411     TIME_WAIT

Then feed the PCB address to ofiles to locate the owner of the socket endpoint:

luey% ofiles -n f8984400
USER        PID  TYPE    FD  CMD
daemon     4559  sock     4  lanserver

ofiles should be run as root or made a setgid executable owned by group kmem so that it too can read the kernel's memory.

The third tool of the trio is lsof, which produces a list of open files. Consider it a version of ls that shows no regard for privacy:

luey% lsof /dev/rdsk/c0t6d0s0
COMMAND     PID     USER   FD   TYPE     DEVICE   SIZE/OFF  INODE/NAME
vold        166     root   10r  VCHR    32,  48        0x0   7869 /dev/rdsk/...

lsof is a superset of ofiles and fuser. It understands how to convert socket descriptors into process IDs, and can handle specifications in terms of filenames, filesystem names, or even TCP/IP addresses and service port numbers. lsof also lets you build in a secure mode, so that users can only get information on their own open files. Besides coming to the rescue of users held captive by CD-ROMs that won't unmount, what else can you do with lsof? Use it to generate "hot lists" of files and directories to feed capacity planning exercises and drive disk space allocation. After all, it doesn't matter by what name your users call their files -- as long as they don't have to call you to access them.

