A file by any other name
The Unix filesystem can be a rat's nest of hostile symbolic links, obstinate device mounts, and corpulent directories. A few simple tips and a handful of utilities can prevent order from turning into chaos. (3,800 words)
This month, we'll sort through some filesystem navigation issues.
We'll look at open()
to see how processes find files, and
examine some performance issues. From there, it's off to the links --
hard and symbolic -- to see how they alter paths through the
filesystem. Finally, we'll look at the tools available to find the
user associated with an open file or directory. While we may not offer
any solutions to the deep-versus-wide directory layout argument, we'll
try to make sure that no matter what you call your files, you'll get the
right bits.
Open duplicity
We see the filesystem as a tree of directories and filenames. These
physical names aren't used inside of a process. Logical
names known as file descriptors, or file handles, are used to
identify a file for reading or writing. Unix file descriptors are
integers returned by the open()
or dup()
system calls. Pass a filesystem pathname to open()
along
with the read or write permissions you want, and open()
returns a file descriptor or an error:
int fd;

fd = open("/home/stern/cols/sep95.doc", O_RDONLY);
if (fd < 0) {
    fprintf(stderr, "cannot open file\n");
    exit(1);
}
You can open special (raw) devices or special files like Unix domain
sockets using a similar code fragment. File descriptors are maintained
on a per-process basis, in the same kernel data structures that keep
track of the stack, address space, and signal handlers. Every process
starts out with file descriptors 0, 1, and 2 assigned to the standard
input, output, and error streams, respectively. Each call to open()
returns the lowest-numbered descriptor not currently in use, so when
you close() a descriptor, that number is the first one handed out by
the next open() or dup(). The integer is really just an
index into a table of per-process file descriptors that point to the
system open file table. The system file table points to other
file-specific information like inodes, NFS server addresses, or
protocol-control blocks for file descriptors associated with sockets.
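You can follow that chain yourself: fstat() hands back the inode and device behind any open descriptor. Here's a minimal sketch (the file name is just an example):

#include <sys/types.h>
#include <sys/stat.h>
#include <fcntl.h>
#include <stdio.h>

int
main()
{
    struct stat sb;
    int fd;

    /* 0, 1, and 2 are already taken, so this open() typically returns 3 */
    fd = open("/etc/passwd", O_RDONLY);
    if (fd < 0 || fstat(fd, &sb) < 0) {
        perror("open/fstat");
        return 1;
    }
    /* the descriptor leads back through the file table to the inode */
    printf("fd %d -> device 0x%lx, inode %lu\n",
        fd, (unsigned long)sb.st_dev, (unsigned long)sb.st_ino);
    return 0;
}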
What's the point of dup()
if open()
is the
primary mechanism for converting pathnames to file handles? When you
want to use the same file for two output streams, such as
stdout
and stderr
, use dup()
to
copy one descriptor into another. The following code segment closes
the default stdout and stderr streams and pumps both into a new file:
int fd;

close(1);
close(2);
fd = open("/home/stern/log/errors", O_WRONLY | O_CREAT, 0644);
dup(fd);
Because the file descriptors for stdout and stderr were closed before
the calls to open()
and dup()
, these handles
are re-used.
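If juggling the order of close() and open() calls feels fragile, dup2() does the same job more directly by naming the target descriptor explicitly. A sketch of the equivalent redirection, using the same example log file:

#include <fcntl.h>
#include <unistd.h>
#include <stdio.h>
#include <stdlib.h>

int fd;

fd = open("/home/stern/log/errors", O_WRONLY | O_CREAT, 0644);
if (fd < 0) {
    perror("open");
    exit(1);
}
dup2(fd, 1);    /* stdout now refers to the log file */
dup2(fd, 2);    /* and so does stderr */
close(fd);      /* the original descriptor is no longer needed */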
Open file descriptors are preserved when new processes are created
with fork()
and vfork()
, so any process that
gets started via exec()
inherits the open file
descriptors left by the calling process. Here's a simple example of
using dup()
and fork()
to set up a pipe
between two processes:
int fd[2], pid;

pipe(fd);
close(0);
dup(fd[0]);                     /* connect stdin to the read end */
if ((pid = fork()) == 0) {
    close(1);
    close(2);
    dup(fd[1]);                 /* connect stdout */
    dup(fd[1]);                 /* and stderr */
    execlp("writer", "writer", (char *)0);
}
execlp("reader", "reader", (char *)0);
First the call to pipe()
creates a pipe and returns file
descriptors for the reading and writing ends. Standard input is
closed, and the reading end of the pipe is connected via
dup()
. A new process is created using
fork()
, and its stdout and stderr are fitted into the
writing end of the pipe. The child process execs the
writer and the parent process becomes the reader. This is a slimmed-down version of what happens inside the shell when you execute
command_a | command_b
from the command line.
The per-process file descriptor table has a default maximum size of 20
in SunOS 4.1.x and 64 in Solaris. If you exceed the default size, the
next call to open()
or dup()
fails, setting errno to EMFILE.
Most processes don't touch that many files, but system processes or
connection managers such as inetd
may open many file
descriptors. Relax the limit using the unlimit
command
from the shell. You can even start inetd
in a subshell
that has the file descriptor limit removed:
luey% ( unlimit descriptors; inetd & )
Setting the file descriptor ceiling to "unlimited" really means 128 in
SunOS 4.1.x, and 1,024 in Solaris. The smaller limit in SunOS is due
to the standard I/O library (and others) using a signed
char
for the file descriptor. If you are porting code
from BSD platforms to Solaris, be on the lookout for snippets that use
8-bit signed file descriptors and test for return values less than
zero: Solaris will return file descriptors greater than 127, which turn
negative when stuffed into a signed char and therefore look like
failures even though they are valid file handles. Solaris also dynamically
allocates space in the per-process file descriptor table, so removing
the file descriptor limit won't make your processes bloat with unused
table space.
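A process can also raise its own descriptor ceiling without help from the shell. Here's a sketch using the BSD-style resource-limit calls, getrlimit() and setrlimit() with RLIMIT_NOFILE, which push the soft descriptor limit up to the hard limit:

#include <sys/time.h>
#include <sys/resource.h>
#include <stdio.h>

struct rlimit rl;

if (getrlimit(RLIMIT_NOFILE, &rl) == 0) {
    rl.rlim_cur = rl.rlim_max;      /* raise the soft limit to the hard limit */
    if (setrlimit(RLIMIT_NOFILE, &rl) < 0)
        perror("setrlimit");
}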
Open system call
Peering inside the open()
system call lets us learn more
about the performance implications of excessively deep or wide
directory structures. When open()
converts a pathname
into a file handle, it starts by walking down the path one component
at a time. For example, to look up /home/stern/log/errors,
first home is found in the root directory, then
stern is found under home, and so on. To locate a
filename component, a linear search is performed on the directory -- Unix
doesn't keep the directory entries sorted on disk, so each one must be
examined to find a match or determine that the file or directory
doesn't exist.
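You can watch the same kind of linear scan from user level with opendir() and readdir(). This sketch (a hypothetical helper, not anything the kernel exports) checks whether a name exists in a directory the hard way, one entry at a time:

#include <sys/types.h>
#include <dirent.h>
#include <string.h>
#include <stdio.h>

/* return 1 if 'name' exists in directory 'dir': a linear scan,
   just like the kernel's per-component directory search */
int
in_directory(const char *dir, const char *name)
{
    DIR *dp;
    struct dirent *de;
    int found = 0;

    if ((dp = opendir(dir)) == NULL)
        return 0;
    while ((de = readdir(dp)) != NULL) {
        if (strcmp(de->d_name, name) == 0) {
            found = 1;
            break;
        }
    }
    closedir(dp);
    return found;
}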
Why is the lookup done one component at a time? Any of those directories could be a mount point or a symbolic link, taking the pathname resolution to another filesystem or another machine in the case of an NFS mount. If /home/stern is NFS-mounted, the first two lookups occur on the local disk, then the request to find log in stern is sent over the network to the NFS server. These pathname resolutions show up as NFS lookup requests and can account for as much as 40% of your NFS traffic.
Pathname-to-file-descriptor parsing can be a disk-, network-, and server-intensive business. If the directory entries must be searched, they have to be read from disk, incurring file-access overhead before the file is opened for normal I/O. When you are using NFS-mounted files, you may be performing directory searches on the server, accruing network round-trip and disk I/O penalties.
So how does the operating system accelerate this process? Recently
completed lookups are kept in the Directory Name Lookup Cache (DNLC),
an association of directories and component names that is checked
before performing a directory search. DNLC statistics are accessible
through vmstat -s
:
luey% vmstat -s
...
 462936 total name lookups (cache hits 91%)
    214 toolong
Ideally, your hit rate should be over 90%, and as close to 99% as possible. However, several things conspire to make the DNLC less efficient: pathname components too long to fit in the cache are never cached at all (they show up in the toolong count above), and the cache itself may simply be too small for the number of files in active use. The cache size is derived from the maxusers kernel parameter, so increase maxusers to
crank up the cache size. For more information, ftp these technical
documents:
Building_and_Debugging_SunOS_Kernels.ps.gz
or
building_and_debugging_kernels.ps.
Sometimes vmstat -s
reports bizarre statistics, usually
when its internal counters overflow and its math precision becomes
non-existent. Kimberley Brown, co-author of
Panic! Unix System Crash Dump Analysis,
provides a detailed adb script to dump out all of the name cache (nc)
statistics. It shows you the number of hits, misses, too-long names,
and purges.
Deep and wide
Given that directory searches occur in linear fashion, but each
pathname component requires another lookup operation, is it better to
have deep or wide directories? Do you do better having all of your
files in just a few directories, minimizing the number of lookups, or
do you go for more lookups but make each one go faster because the
directories have only a few entries?
The answer, as usual, is to avoid the extremes. Excessively deep directory structures generate large numbers of lookup requests and can frustrate users trying to navigate up and down half a dozen levels of directories. Perhaps the worst file naming convention is to take a serial number and use each digit as a directory, that is, document 51269 becomes /docs/5/1/2/6/9. While it's trivial to find any document, it's painful to consume so much of the disk with directories. Use a hash table to find documents quickly, or group them with several dozen documents in a single directory.
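One hedged sketch of the hashing approach: compute a bucket from the document number so each directory holds a bounded number of entries. The /docs prefix, the bucket count, and the helper name are all made up for illustration:

#include <stdio.h>

/* Map a document number into a two-level path, e.g. 51269 -> /docs/69/51269.
 * One hundred buckets keeps each directory down to a manageable size;
 * the caller supplies a buffer large enough to hold the result. */
void
doc_path(long docno, char *buf)
{
    sprintf(buf, "/docs/%02ld/%ld", docno % 100, docno);
}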
Large, flat directories are equally bad. When a user complains that
ls
takes a long time, try ls | wc
and see
how many files are in the current directory. In addition to the disk
hits required to search the directory, you're also paying a CPU
time penalty to sort the list.
Link to La-La-Land
Pathname resolution takes a turn when a mount point is encountered,
switching from one local disk to another or to an NFS server. Symbolic
links cause a similar detour. First, some background on links and
their implementation. Filesystem links come in two flavors: hard and
symbolic. Hard links are merely multiple names for the same file. They
exist as duplicate directory entries, pointing to the same blocks on
disk. Create hard links using the ln
command:
luey% ln test1 test2
Because they refer to a set of disk blocks, hard links cannot span
filesystem boundaries. If you're wondering why hard links are useful,
consider the mv
command. You can rename files by copying
them, and removing the old name. However, this assumes you have disk
space for two copies of the file while the operation is in progress,
and it may upset any careful tuning you've done of the disks
(see SysAdmin, Advanced Systems, May 1994: "Looking for Mr. Good Block").
Instead of copying the file, mv
uses a hard link to
create the new name, then it removes the old name. The disk blocks
never move, unless you're moving the file across filesystems, in which
case mv
copies the file.
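In system-call terms, the within-one-filesystem rename looks roughly like this sketch (the real mv has more error handling plus the cross-filesystem copy path):

#include <stdio.h>
#include <unistd.h>

/* Rename 'from' to 'to' without moving any data blocks: make the new
 * hard link first, then remove the old name. link() fails with EXDEV
 * if the two names live on different filesystems. */
int
rename_by_link(const char *from, const char *to)
{
    if (link(from, to) < 0) {
        perror("link");
        return -1;
    }
    if (unlink(from) < 0) {
        perror("unlink");
        return -1;
    }
    return 0;
}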
Symbolic links are pointers to filesystem pathnames, not filesystem
data blocks. They're also created with the ln
command,
using the -s
flag:
luey% ln -s /var/log/stern/errors/sep95 /home/stern/log
The link target is the first argument, and the link is the second
argument. When open()
hits a symbolic link, it follows
the link by substituting its value for the current pathname component.
In the above example, /home/stern/log is a link to
/var/log/stern/errors/sep95 so the lookup of log in
stern turns into a lookup of var in the root
directory. Encountering a symbolic link lengthens the pathname lookup
if the link has many components, and it may force a disk access to
read the link. All symbolic links in SunOS are read from the disk,
while only those with more than 32 characters in the target are read
from disk in Solaris -- shorter links are kept in memory with the
link's inode.
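You can inspect a link's stored value yourself with readlink(). A small fragment, using the link created above; note that readlink() does not null-terminate the buffer for you:

#include <unistd.h>
#include <stdio.h>

char target[1024];
int len;

len = readlink("/home/stern/log", target, sizeof(target) - 1);
if (len < 0) {
    perror("readlink");
} else {
    target[len] = '\0';         /* readlink() leaves the terminator to you */
    printf("link points to %s\n", target);
}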
Absolute links point to a pathname that begins with a /, while relative links assume the current directory as a starting point. For example:
luey% ln -s ../logs/sep95 error
creates a link named error in the current directory, pointing to ../logs/sep95. Using symbolic links successfully means enforcing consistency in filesystem access on all of your clients. If you use absolute links, you must be sure that all of the pathname components in the link are accessible to processes that hit the link. With relative links, you run the risk of ending up somewhere unexpected when you walk up the directory tree.
One of the most common uses of symbolic links is the creation of a consistent name space built from CPU-, OS-, or release-specific components. For example, /usr/local/bin contains binaries specific to a CPU architecture and an OS release. Many administrators use symbolic links to point to the "right" version of /usr/local/bin:
luey% ln -s /usr/local/sun4.solaris2.4/bin /usr/local/bin
Maintaining the links gets more difficult as the number of special cases grows, and it makes performing an upgrade less than pleasant. Another solution is to use the hierarchical mount feature of the automounter, letting it build up the right combination of generic and specific filesystems on the fly. Here's a hierarchical automounter map entry for /usr/local:
/usr/local  \
    /     servera:/usr/local \
    /bin  servera:/usr/local/bin.$ARCH.$OS \
    /lib  serverb:/usr/local/lib.$ARCH.$OS
First the generic /usr/local template is mounted, giving you machine-independent directories like /usr/local/share and /usr/local/man. Then the bin and lib directories are dropped on top, using the variants named by the automounter variables. Invoke the automounter with the appropriate variable definitions on each machine:
luey% automount -D OS=solaris2.4 -D ARCH=sun4
When you upgrade the OS, change this line to reflect the new OS variable value, and all of your machine dependencies are resolved.
Virtual Boston
Wise use of the automounter and symbolic links can keep your users
from falling off the edge of a filesystem into previously uncharted
net-surf. But what if you want to explicitly prevent prowling? How can
you create a restricted environment from which scripts, users, and
not-so-well-intentioned visitors cannot escape? The magic system call
is chroot()
-- change root. When chroot()
is
called with a pathname, that directory becomes the new virtual root of
the filesystem. The calling process and all of its descendants only
see files in the virtual root and below.
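The usual sequence in a daemon is chroot(), then chdir("/") so the working directory doesn't dangle outside the new root, then a switch to an unprivileged user ID. A sketch, using the ftp area from the example below and a made-up uid:

#include <sys/types.h>
#include <unistd.h>
#include <stdio.h>
#include <stdlib.h>

uid_t newuid = 30000;   /* hypothetical unprivileged uid, such as the ftp account */

/* confine this process (and its children) to the subtree; requires root */
if (chroot("/export/pub/ftp") < 0 || chdir("/") < 0) {
    perror("chroot");
    exit(1);
}
/* with the new root in place, give up superuser privileges */
if (setuid(newuid) < 0) {
    perror("setuid");
    exit(1);
}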
Anonymous ftp servers use chroot()
to keep you corralled
in the public ftp area. Webmasters would sleep better at night if CGI
(common gateway interface) scripts ran with chroot()
so
that file damage or exposure caused by script failures could be
contained to a selected subset of the server's filesystems.
Once you call chroot()
, anything you'll need has to
appear in the truncated filesystem. If you are using anonymous ftp, and you
expect users to generate listings, you'll need a version of
/bin/ls
as well as the dynamic linker ld.so
and the essential C runtime libraries from /lib. Let's say
you want to use /export/pub/ftp as the root of your subset.
You'll need to create the following directories and populate them:
/export/pub/ftp/bin for binaries such as ls
and
ld.so
, /export/pub/ftp/lib for dynamically
linked libraries, and /export/pub/ftp/dev for special devices needed
at runtime, like /dev/zero.
For more information, consult the generic sun-managers Frequently Asked Questions list, which contains many more relevant pointers.
If you use chroot()
in other places, for example,
creating electronically padded cells for scripts, be careful about
using symbolic links. Absolute links' values still refer to the
old root of the filesystem. If you perform
chroot("/home/cell")
, and then try to resolve a link that
points to /var/log/errors, the actual pathname created
(relative to the true root of the filesystem) is
/home/cell/var/log/errors. Relative links are a safer bet for
portability when using chroot()
, but again be sure that
you don't try to walk above the root of the truncated filesystem. An
arbitrarily long chain of .. components still only hits the subset
root, not the true filesystem root.
You must be root to call chroot()
, so it is usually
called by daemons started by root just before they exec()
another image or change their effective user ID to something
non-privileged. Changing your filesystem-centric viewpoint to prune away
unrelated directories gives you a Virtual Boston -- the "hub of the
universe." Should you not believe the hype,
check out the view,
the shopping,
or its
bovine Internet service provider.
User descriptors
By now you can help your users get to files, create shortcuts, and
maybe even eliminate the odd complaint about the terrible performance
of File Manager on that home directory with a thousand files in it.
The true test of courage is to find out what files are in use without
having to send mail to everyone in the building. Fortunately, there is
a tool triumvirate to let you walk from the system open file table
back to associated processes.
The first is a built-in tool called
fuser
, short for file user.
Give it a filename, directory name, or a special device, and it dumps
out the process IDs that have open file descriptors pointing to the
argument. fuser
's output shows you whether the process
has the file open as its current working directory, as a normal file,
or as its root directory:
luey# fuser /var/spool/lp
/var/spool/lp:    150c
The c indicates that the directory is used as a current working
directory; exploring the process ID with ps
yields:
luey# ps -p 150
   PID TTY      TIME COMD
   150 ?        0:00 lpNet
Because fuser
goes into the kernel to scan the system
file table, it must be run as root. One of the more powerful
applications of fuser
is finding the process that's got
an open file or working directory on the CD device, preventing the
CD-ROM caddy from ejecting with "file busy" messages:
luey# fuser /dev/dsk/c0t6d0s0
/dev/rdsk/c0t6d0s0:    166o
Process 166 has a file open on the CD-ROM; killing it will let you
eject away. Note that fuser
writes process IDs to stdout,
but the open status letters go to stderr. If you want a list of
process IDs to hand to kill or another command, catch the standard
output of fuser
and ignore the commentary on standard
error.
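For example, discarding the standard-error commentary leaves a clean list of process IDs that can be handed straight to kill (assuming, as above, that the CD-ROM is the busy device):

luey# kill `fuser /dev/dsk/c0t6d0s0 2>/dev/null`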
A similar tool for SunOS 4.1.x is
ofiles
.
It takes a filename or device and tells you the processes that are
using it and the type of use, including read or write locks on the
file. ofiles
merges the file descriptor table walk with
some of the functionality of ps
, making it a simpler tool
to use:
luey% ofiles /var
/var/
/var/ (mount point)
USER       PID   TYPE     FD   CMD
root        59   file/x    7   ypbind
root       186   cwd           cron
root       186   file      3   cron
root        95   file      9   syslogd
root       103   cwd           sendmail
root       194   file/x    3   lpd
ofiles
has the added advantage of understanding protocol
control blocks (PCB) associated with sockets. Again, this only works
under SunOS 4.1.x, but you can answer nasty questions like "Who owns
the socket on port 139?" First use netstat -A
to dump out
the port number and PCB table:
luey% netstat -A
Active Internet connections
PCB       Proto Recv-Q Send-Q  Local Address   Foreign Address   (state)
f8984400  tcp        0      0  duck.139        pcserver.3411     TIME_WAIT
Then feed the PCB address to ofiles
to locate the owner of the socket
endpoint:
luey% ofiles -n f8984400
USER       PID   TYPE   FD   CMD
daemon    4559   sock    4   lanserver
ofiles
should be run as root or made a setgid executable
owned by group kmem so that it too can read the kernel's memory.
The third tool of the trio is
lsof
,
which produces a list of open files. Consider it a version of
ls
that shows no regard for privacy:
luey% lsof /dev/rdsk/c0t6d0s0
COMMAND  PID  USER   FD   TYPE  DEVICE   SIZE/OFF  INODE/NAME
vold     166  root   10r  VCHR  32, 48   0x0       7869 /dev/rdsk/...
lsof
is a superset of ofiles
and
fuser
. It understands how to convert socket descriptors
into process IDs, and can handle specifications in terms of filenames,
filesystem names, or even TCP/IP addresses and service port numbers.
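So the port 139 question from the previous section collapses to a single command; the -i selector takes a protocol and port, though the exact syntax may differ a bit between lsof releases:

luey# lsof -i TCP:139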
lsof
also lets you build in a secure mode, so that users
can only get information on their own open files. Besides coming to
the rescue of users held captive by CD-ROMs that won't unmount, what
else can you do with lsof
? Use it to generate "hot lists"
of files and directories to feed capacity planning exercises and drive
disk space allocation. After all, it doesn't matter by what name your
users call their files -- as long as they don't have to call you to
access them.
Resources
ofiles
ftp://ftp.std.com/customers3/src/util/ofiles2.gz
lsof
ftp://ftp.std.com/customers3/src/util/lsof/lsof2.tar.gz