Fiddling around with files, part four
A look at file flags in the file descriptor and file structure
We've covered a fair amount of ground in our discussion of files in Solaris. This month Jim continues the series with a look at file flags at the file structure, vnode, and inode layers. (3,000 words)
The Solaris kernel maintains several flags at different layers in the file code. Within the file structure, there are flags that can be set when the file is opened (open(2) system call). Subsequent file operations using the fcntl(2) (file control) system call allow for setting and clearing file flags after the file has been opened. Some of these flags get pushed down to the file's vnode, such that file manipulation at the vnode layer has the information available without referencing the file structure. Finally, there are flags maintained in the inode that are not exported to the process (application) level.
The system begins looking at file flags in the very early stages of opening a file. (Remember, every file we do I/O to in Solaris must first be "opened," which is why we begin by examining what happens when a file is opened.) When the open system call is entered, a check is made to ensure that either the read or write flag is set, and the kernel looks at the O_NDELAY and O_NONBLOCK flags, which are settable by the programmer (see open(2)). O_NDELAY and O_NONBLOCK have the same meaning -- they specify non-blocking I/O. Two flags exist because of evolving standards -- the O_NDELAY flag emerged as a Unix SVR4 standard, while the O_NONBLOCK flag comes from the POSIX.1 specification. Solaris supports both standards, so both flags are available for compatibility reasons. New applications should use the POSIX.1 flag (O_NONBLOCK), which is what the system chooses if both are set.
Non-blocking I/O, as the term implies, instructs the system not to allow a read or write to the file to block if the operation cannot be done right away. The specific conditions under which the kernel will implement this functionality vary depending on the file type. For regular files, non-blocking I/O is possible when used in conjunction with mandatory file locking. If a read or write is attempted on a regular file, and a record lock exists on the section of the file (the "record") to be read or written, the read or write will block until the I/O can be completed -- unless the O_NDELAY or O_NONBLOCK flag is set, in which case the read or write fails, and the kernel sets the errno (error number) to EAGAIN (a hint to the programmer to try again later). Blocking simply means that the process or thread will be put to sleep by the kernel. When an event occurs that the process or thread is waiting for, in this case a file lock being released, the kernel will issue a wake-up, and the process will be placed on a dispatch queue so the scheduler can schedule it for execution on a processor. For other file types, such as sockets, FIFOs and device special files, there may not be any data available for a read (e.g. read from a named pipe or socket when data has not been written yet). Without the O_NONBLOCK or O_NDELAY flag set, the read or write would simply block until data became available again (e.g. as a result of a write done to a FIFO or socket).
As documented in the open(2) man page, if both flags are set in the open(2) call, the O_NONBLOCK flag takes precedence, and the O_NDELAY flag is cleared in the kernel open code. In order to maintain the blocking (or non-blocking) requirements for read/write operations to the file, a flag in the file structure, FNDELAY, is set when the file is opened with O_NONBLOCK or O_NDELAY set, and the file structure flags are referenced in the kernel during I/O operations on the file so the system knows what to do. The NONBLOCK/NDELAY flags do not have corresponding flags at the vnode or inode layers because there is no need for the kernel to push this information down to those structures.
File flags, set 2
The next set of flags checked by the operating system are slightly further down the code path in the file open process. The O_EXCL flag, exclusive open, can be used in conjunction with the O_CREAT flag to request that the system return an error if the file already exists. If the file does not already exist, and both flags are set, the file creation and return of a file descriptor to the calling process are guaranteed to be atomic. (Atomic means that the steps involved are guaranteed to complete entirely or not at all. The system does not allow for an operation to partially complete and have something else come along that could alter the desired state. In this case, that would be an exclusive create operation in progress and another process executing an open to the same file before the first process' exclusive open completed. The atomicity of this implementation will not allow that to happen.) The O_CREAT flag instructs the system to create the file if it does not already exist. The following pseudo code demonstrates what the results of a file open will be under different conditions. (Note that the information below assumes all other permission conditions are such that they would not result in an error, e.g. read/write permission to the working directory is good.)
if (file exists)
    if (O_CREAT is clear and O_EXCL is clear)
        return file descriptor (open succeeds)
    if (O_CREAT is clear and O_EXCL is set)
        return file descriptor (open succeeds)
    if (O_CREAT is set and O_EXCL is clear)
        return file descriptor (open succeeds)
    if (O_CREAT is set and O_EXCL is set)
        return "file exists" error (open fails)
if (file does not exist)
    if (O_CREAT is clear and O_EXCL is clear)
        return "no such file" error (open fails)
    if (O_CREAT is clear and O_EXCL is set)
        return "no such file" error (open fails)
    if (O_CREAT is set and O_EXCL is clear)
        create file
        return file descriptor (open succeeds)
    if (O_CREAT is set and O_EXCL is set)
        create file
        return file descriptor (open succeeds)
The O_CREAT and O_EXCL flags have no meaning beyond the open operation and are not preserved in any of the underlying file support structures.
The file modes (discussed in a previous column) are established as a result of the (optional) mode bits that can be passed in the open system call, along with the process file creation mask. The file creation mask is set with the umask(1) command (or the umask(2) system call programmatically), and allows a user to define default permissions for files they create. Once set, the value is stored in the user area of the process structure, in the variable u_cmask. Readers can reference the umask(1) man page for specifics; but briefly the umask value represents the file mode bits the user wants unset for file creation. Put another way, the umask value determines which file permission bits will be turned off by default. The kernel simply uses standard C language bitwise operators to turn off whatever bits are defined in the u_cmask value for the file's permission modes when a file is created. Note that this applies only to newly created files. An open(2) can of course be issued on an existing file, in which case the permission checks are done to ensure that the calling process has proper permissions. If the open(2) has a mode value defined as an optional third argument, the permission bits of an existing file are not changed. That requires use of chmod(1) or chmod(2).
Moving on, we have the FAPPEND flag available in the file structure. If the file is opened with the O_APPEND flag set in the open(2) call, the kernel will set the corresponding FAPPEND flag in the file structure. This flag instructs the system to position the file pointer (offset) to the end of the file prior to every write. This is implemented pretty simply -- inside the file system-specific write code (e.g. ufs_write() for UFS files), the kernel sets the file offset to the file size value maintained in the file's inode before doing the write.
When it comes to data integrity and file I/O, file flags exist within Solaris that provide for different levels of data synchronization and file integrity, which gives application developers some flexibility when designing applications that read and write files. There are three applicable flags that can be set in the open system call, O_SYNC, O_RSYNC, and O_DSYNC. The file structure that is allocated when the file is opened has three corresponding flags that will be set in the structure's f_flag field based on what is passed in the open(2) call. Table 1 lists the flags and provides a definition of each. Any of these flags can be set on an open file by issuing a fcntl(2) system call on the file descriptor with the appropriate arguments.
|Flag in open(2)||Corresponding flag in file structure||Definition|
|O_SYNC||FSYNC||Data and inode integrity when writing|
|O_DSYNC||FDSYNC||Data integrity when writing|
|O_RSYNC||FRSYNC||Read data synchronization|
From this point forward, we'll refer to the flags as simply SYNC, DSYNC, and RSYNC. The SYNC flag, as the definition in table 1 states, tells the operating system that file data and inode integrity must be maintained. Simply put, when the SYNC flag is set on a file descriptor, and a write is done to the file, the write system call does not return to the calling process or thread until the data has been written to the disk, and the file inode data has been updated. Without the SYNC flag, the write will return when the data has been committed to a page in the buffer cache (physical memory), with the inode information being cached as well. This is the default behavior, which is essentially asynchronous in nature. Better overall file I/O throughput is achieved, as non-SYNC writes take advantage of the caching mechanisms implemented in Solaris. The SYNC flag provides optional functionality for applications that require file integrity (commit file data to disk platters) for every write.
The DSYNC flag also provides synchronous write functionality, meaning that the write system call does not return to the calling process until the write data has been committed to the disk storage. Unlike SYNC, however, DSYNC does not require that the file inode data be committed to the disk. The data-only synchronization implementation comes as part of support for POSIX.4, which is why Solaris defines two levels of integrity for file I/O operations: synchronized I/O file integrity and synchronized I/O data integrity (see fcntl(5)). File integrity has to do with data and inode information; data integrity covers just file data. For each of the SYNC flags available, one or the other level of synchronized I/O integrity is guaranteed. Readers should reference the fcntl(5) man page (the notes section) for the documented definitions of file and data integrity.
The RSYNC flag provides for read synchronization and is used in conjunction with the SYNC or DSYNC flag. When set with the DSYNC flag, data integrity is enforced, meaning that any pending writes to the file are completed before the read is done. If RSYNC and SYNC are both set, file integrity (data and inode) is enforced: pending write I/Os are completed, and the inode information is updated, before the kernel processes the read request. Note that all read operations are guaranteed not to return stale data. A read issued on a file without any of the SYNC flags set simply means that a read from a buffer may return data that has not yet been written to disk. The open(2) and fcntl(2) man pages do not state explicitly that RSYNC must be used along with either SYNC or DSYNC, though it is somewhat implied. In examining the kernel source, it appears that RSYNC only has meaning when used with one of the other flags. By itself, RSYNC does not alter file I/O behavior.
The various SYNC flags have no corresponding flags at the vnode layer. The inode maintains flags to indicate synchronous inode operations. These flags are set in the read/write code path in the kernel file system code (e.g. ufs_read(), ufs_write()) based on the status of the SYNC flags in the file structure for the file. They're used by the kernel in the lower level I/O routines. And as a final note, Solaris provides two library routines that can be called from applications to do file or data integrity synchronization prior to issuing a read or write -- fdatasync(3R) and fsync(3C). These calls allow for per-file syncing of data to disk, and these routines return only when the write to disk has been completed.
There are other flags that are maintained in the per-process file descriptor and are not pushed down to the file structure. Remember from the first column on files, the user area embedded in the process structure has an array of uf_entry structures, one for every file the process has opened. The file descriptor is an integer value that indexes into this array for operations on a particular file. The uf_pofile member of the uf_entry structure maintains the file descriptor flags. Whenever a process opens a file, a new uf_entry structure is allocated, and a file structure is also allocated (even for multiple opens of the same file). An exception to this is when the dup(2) system call is used. Dup(2) duplicates a file descriptor and results in a new uf_entry structure for the new file descriptor. The difference is that its uf_ofile pointer, which points to the file structure, will point to the same file structure as the duped file descriptor.
We illustrate this with a code sample. The test program, called ofd, does two opens of the same file. The second open adds the O_SYNC flag (which I added only to demonstrate how these file structure flags can be examined). After the two opens, a dup(2) is executed to duplicate the second file descriptor. Below is some output and an example of using /usr/proc/bin/pfiles to look at a process's open files.
fawlty> ofd &
24788
fawlty> fd1: 3, fd2: 4,dfd: 5
fawlty> /usr/proc/bin/pfiles 24788
24788: ofd
  Current rlimit: 64 file descriptors
   0: S_IFCHR mode:0620 dev:32,0 ino:324208 uid:20821 gid:7 rdev:24,1
      O_RDWR
   1: S_IFCHR mode:0620 dev:32,0 ino:324208 uid:20821 gid:7 rdev:24,1
      O_RDWR
   2: S_IFCHR mode:0620 dev:32,0 ino:324208 uid:20821 gid:7 rdev:24,1
      O_RDWR
   3: S_IFREG mode:0755 dev:32,8 ino:18 uid:20821 gid:30 size:0
      O_RDWR
   4: S_IFREG mode:0755 dev:32,8 ino:18 uid:20821 gid:30 size:0
      O_RDWR|O_SYNC
   5: S_IFREG mode:0755 dev:32,8 ino:18 uid:20821 gid:30 size:0
      O_RDWR|O_SYNC
fawlty>
On the first line, I execute the test program, which prints the three open file descriptor values after the opens and dup calls have executed. Because every process already has file descriptors 0, 1, and 2 allocated, my two open files and duped file are fd 3, 4, and 5. The pfiles command dumps some information on the process's open files, and we see that fd 4 and fd 5 have the O_SYNC flag set, which is logical as they're both referencing the same file structure.
Now we'll use
crash(1M) to illustrate the file structure sharing.
fawlty> su
Password:
# /etc/crash
dumpfile = /dev/mem, namelist = /dev/ksyms, outfile = stdout
> p 59
PROC TABLE SIZE = 1962
SLOT ST   PID  PPID  PGID   SID   UID PRI NAME FLAGS
  59 r  24774 24716 24774 24716 20821   0 ofd  load
> u 59
PER PROCESS USER AREA FOR PROCESS 59
PROCESS MISC:
        command: ofd, psargs: ofd
        start: Tue Apr 21 21:55:51 1998
        mem: 43449, type: exec
        vnode of current directory: 60025394
OPEN FILES, POFILE FLAGS, AND THREAD REFCNT:
        : F 0x60b6eca8, 0, 0
        : F 0x60b6eca8, 0, 0
        : F 0x60b6eca8, 0, 0
        : F 0x60824c58, 0, 0
        : F 0x60824870, 0, 0
        : F 0x60824870, 0, 0
> f 60824c58
ADDRESS  RCNT TYPE/ADDR     OFFSET FLAGS
60824c58    1 UFS /608ba0e8      0 read write
> f 60824870
ADDRESS  RCNT TYPE/ADDR     OFFSET FLAGS
60824870    2 UFS /608ba0e8      0 read write sync
>
I cut some stuff out of the
crash(1M) output to reduce the size of
the example. The u utility in
crash dumps the uarea of a process.
For each open file, it provides the address of the corresponding
file structure. As you can see, fd 3 and 4 reference different
file structures, even though we opened the same file in the same
process. File descriptors 4 and 5 show the same file structure
address (60824870), because 5 is a dup of 4. Finally, the f (file)
utility dumps file structure information. We first dump the file
structure for fd 3 (f 60824c58) and then for fd 4 (f 60824870). You
can see that the second file structure has a reference count of two,
as there are two file descriptors referencing it, and that it shows
the "sync" flag, which we set in the second open(2) system call. The
TYPE/ADDR column in the f utility output provides the file type and
kernel address of the vnode. Note that the vnode address (608ba0e8)
is the same for both file structures, because any open file in the
system has exactly one vnode, regardless of the number of opens and
file structures allocated to it.
The flags that are used at the file descriptor level are mostly for the operating system to use -- they are not user settable. One such flag is FCLOSEXEC, the close-on-exec flag. It tells the operating system to close the file descriptor if the process executes an exec(2) call. Normally, all open files are inherited by child processes, which is not always the desired behavior. The kernel implements this with a simple close_exec() kernel routine that gets called from exec(2), which walks through the file descriptor array and closes any file with the FCLOSEXEC flag set. The fcntl(2) call is used to set, clear, or examine the file descriptor's flag field, and currently close-on-exec (FD_CLOEXEC at the application level) is the only flag that can be set, cleared, or examined.
That's a wrap for this month. Next month we look at large files and access control lists.
About the author
Jim Mauro is currently an area technology manager for Sun Microsystems in the Northeast area, focusing on server systems, clusters, and high availability. He has a total of 18 years industry experience, working in service, educational services (he developed and delivered courses on Unix internals and administration), and software consulting. Reach Jim at email@example.com.