Fiddling around with files, part two

This month we'll continue our discussion of files in Solaris with a look at the APIs and how they affect the number of files you can have open. We'll also begin explaining file types.

March 1998

Abstract

Last month we began our discussion of the implementation of files in Solaris with an overview of the main kernel structures involved in file maintenance. We also addressed the question of how many files can be opened at a systemwide and per-process level. This month we continue that discussion with a look at which file APIs (application programming interfaces) are used and the types of files Solaris supports. (4,200 words)
Note to readers: It's difficult sometimes to assess the amount of space required to cover a particular topic. I've found that once the research and writing begin the effort takes on a life of its own, and I often end up discovering a whole slew of items that warrant discussion if the topic at hand is to be given proper coverage. Our current topic falls into this category; thus we'll likely require at least one additional column to complete our discussion of files.

In last month's column I discussed the kernel structure used for maintaining open files, the file data structure, and described the various structure members. A reader sent me an e-mail with an interesting question that I believe warrants covering here. The question concerned the file offset that is used to reflect the current character position in the file (a pointer to a location in the file).

The first point to make here is that the same file can have multiple file data structures associated with it. If different processes executing on the same system open the same file, each process will have a unique representation of the file through its file descriptor to a process-specific file data structure. Thus, as each process reads and writes the file, the process will change the f_offset member of its file structure.

This behavior differs with file descriptors that are inherited via the fork(2) system call, or when processes issue the dup(2) or dup2(2) system calls for duplicating file descriptors. In these scenarios, the file structure is shared, and the f_offset field is shared, meaning that a read or write to the file changes the f_offset as it is seen by the child process (in the case of a fork(2)), or by references to the file descriptors returned by dup(2) and dup2(2).
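The difference between a shared and a private offset can be observed from user code. The following is a minimal sketch, not anything taken from the Solaris source: the function name offset_demo and the scratch file path are made up for illustration. It reads four bytes through one descriptor, then asks lseek(2) for the current offset as seen through a dup(2) of that descriptor and through a second, independent open(2) of the same file.

```c
#include <fcntl.h>
#include <stdlib.h>
#include <unistd.h>

/*
 * Writes 10 bytes to a scratch file, rewinds, and reads 4 bytes through fd1.
 * Reports the offset seen afterwards through a dup(2) of fd1 and through a
 * separate open(2) of the same file. Returns 0 on success, -1 on failure.
 * (offset_demo and the scratch path are hypothetical names.)
 */
int offset_demo(off_t *dup_offset, off_t *reopen_offset)
{
    char path[] = "/tmp/offset_demo_XXXXXX";
    char buf[4];
    int fd1, fd2, fd3, rc = -1;

    fd1 = mkstemp(path);            /* creates and opens the file read/write */
    if (fd1 < 0)
        return -1;
    fd2 = dup(fd1);                 /* shares fd1's file structure */
    fd3 = open(path, O_RDONLY);     /* gets a file structure of its own */

    if (fd2 >= 0 && fd3 >= 0 &&
        write(fd1, "0123456789", 10) == 10 &&
        lseek(fd1, 0, SEEK_SET) == 0 &&
        read(fd1, buf, sizeof buf) == (ssize_t)sizeof buf) {
        *dup_offset = lseek(fd2, 0, SEEK_CUR);
        *reopen_offset = lseek(fd3, 0, SEEK_CUR);
        rc = 0;
    }

    if (fd2 >= 0)
        close(fd2);
    if (fd3 >= 0)
        close(fd3);
    close(fd1);
    unlink(path);
    return rc;
}
```

After a successful call, the offset reported through the duplicated descriptor should have moved along with fd1, while the independently opened descriptor should still sit at the start of the file.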

Finally, for processes that are multithreaded, multiple threads within a process doing reads and writes to the same file descriptor will alter the offset field as seen by all threads in the process, since once again there is one file structure involved. Logically, if one thread closes the file, the file will be closed process-wide.

There is an lseek(2) system call available for manipulating the offset value of an opened file. The above described behavior applies when file offsets are altered using lseek(2). For managing synchronized access to the same file from multiple processes or threads, file locking interfaces are provided and will be discussed next month.

On a related note, Solaris (and Unix in general) does not technically support the notion of file records, which are simply groupings of bytes within a file. However, the Solaris file APIs allow for specifying a range of bytes within a file for certain file operations, such as file locking.

File APIs
There are several interfaces available in Solaris for doing file I/O. The traditional interfaces are known as the Standard I/O functions, abbreviated as stdio. (See the stdio(3S) manpage.) These are the buffered I/O interfaces that were added to Unix System V by AT&T in the early days of the Unix evolution (actually, they were probably added back in the days of Unix System III, mid-to-late 1970s, but I'm not 100 percent certain). These stdio library interfaces exist as a layer above the lower level system calls and improved the portability of the system as it was ported to more and more hardware platforms. The stdio routines call the corresponding lower level system calls, e.g., fopen(3S) calls open(2), fwrite(3S) calls write(2), etc. They provide additional buffering in the process address space, before entering the kernel through the appropriate system call.

The implementation of stdio is based on a buffered file referred to as a stream, which is described by a data structure called a FILE. A FILE is described in the header file /usr/include/stdio.h.

typedef struct 
{
int             _cnt;   /* number of available characters in buffer */
unsigned char   *_ptr;  /* next character from/to here in buffer */
unsigned char   *_base; /* the buffer */
unsigned char   _flag;  /* the state of the stream */
unsigned char   _file;  /* Unix System file descriptor */
} FILE;

The members of the FILE structure include a count of the number of characters in the buffer (_cnt), a buffer pointer (_base), and a pointer to the current character position in the buffer (_ptr). Also, there is a flag field (_flag) that stores status bits about the state of the file. Examples of flag bits are indications that the file is currently being read or written, the file pointer is at EOF (end-of-file), or an I/O error occurred on the file.

Finally, the file descriptor itself is stored as an unsigned char data type in the _file member. It is because of this FILE structure member, _file, and the fact that it is an unsigned char data type, that a process cannot have more than 256 open files when using the stdio routines. Simply, the value range for this data type is 0 to 255; only 8 bits (1 byte) are maintained when an unsigned char is used. Thus, only a maximum of 256 files can be uniquely identified within a process using the stdio(3S) interfaces.
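The arithmetic behind that limit is easy to check. The sketch below uses a made-up structure, tiny_file, as a stand-in for the _file member shown above (it is not the real FILE definition) and assumes the usual 8-bit char. Storing a descriptor number in an unsigned char keeps only the low eight bits, so descriptor 256 reads back as 0.

```c
/* Hypothetical stand-in for the _file member of the FILE structure. */
struct tiny_file {
    unsigned char _file;
};

/* Returns the descriptor value as it reads back out of an unsigned char. */
int stored_descriptor(int fd)
{
    struct tiny_file f;

    f._file = (unsigned char)fd;    /* only the low 8 bits survive */
    return f._file;
}
```

Descriptors 0 through 255 come back unchanged; 256 and 257 come back as 0 and 1, which is why they cannot be told apart from the low-numbered descriptors.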

That is the first example of where selection of a programming interface can affect the number of open files. A second interface that needs to be addressed is the select(3C) routine.

The select(3C) routine provides a polling capability, where sets of file descriptors can be "polled" to determine if data is available for reading, if the file descriptor is ready for writing, or if an exception condition occurred on the file descriptor that needs to be handled. (See the manpage for more.)

The use of select(3C) in code requires the inclusion of the select.h header file (/usr/include/sys/select.h), which contains the following entry:

#ifndef FD_SETSIZE
#define FD_SETSIZE      1024
#endif

The above definition, FD_SETSIZE, defines the maximum number of file descriptors that can reside in a file set polled by select(3C). As you can see, a limit of 1,024 open files is imposed when using select(3C). The code in select(3C) makes the following check to determine if the number of file descriptors (nfds) is greater than the FD_SETSIZE value:

	if (nfds < 0 || nfds > FD_SETSIZE) {
		errno = EINVAL; 
		return (-1); 
	}

In such a situation, the select(3C) call will fail, and an "Invalid Argument" error (EINVAL) is set.

For applications that require more open files in earlier versions of Solaris, an alternative interface must be used. The poll(2) system call provides functionality similar to select(3C), but does not have any inherent open file limits beyond what was discussed in part one last month. Note that the implementation of select(3C) in Solaris, which is a library routine, calls the poll(2) system call.
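For readers who haven't used poll(2), here is a minimal sketch of the call. The function name pipe_readable is invented for illustration; the example polls the read end of a pipe rather than a large set of descriptors, simply to keep it short. The point to notice is that poll(2) takes an array of pollfd structures and a count, so the size of the set is chosen by the caller rather than by a compile-time constant such as FD_SETSIZE.

```c
#include <poll.h>
#include <unistd.h>

/*
 * Creates a pipe, optionally writes one byte into it, then polls the read
 * end for up to timeout_ms milliseconds.
 * Returns 1 if data is ready to read, 0 on timeout, -1 on error.
 * (pipe_readable is a hypothetical name.)
 */
int pipe_readable(int write_first, int timeout_ms)
{
    int fds[2];
    struct pollfd pfd;
    int n, result;

    if (pipe(fds) < 0)
        return -1;
    if (write_first && write(fds[1], "x", 1) != 1) {
        close(fds[0]);
        close(fds[1]);
        return -1;
    }

    pfd.fd = fds[0];
    pfd.events = POLLIN;            /* ask about readability only */
    pfd.revents = 0;

    n = poll(&pfd, 1, timeout_ms);  /* one-element descriptor array */
    if (n < 0)
        result = -1;
    else if (n == 0)
        result = 0;                 /* timed out, nothing to read */
    else
        result = (pfd.revents & POLLIN) ? 1 : -1;

    close(fds[0]);
    close(fds[1]);
    return result;
}
```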

These two areas, stdio(3S) and select(3C), are typically where folks hit a wall when opening files. Note that this is not an implied guarantee that use of other interfaces will work perfectly fine with large numbers of opened files. It's up to you to examine the interface when looking to exceed documented, supported limits, and test accordingly.

File types
Last month we briefly mentioned the different types of files that exist in Solaris. More accurately, we mentioned some of the file types that the Solaris kernel acknowledges. The distinction seems subtle, but in reality it's key to understanding the concept of file types.

A file type in Solaris is a file that the operating system has an embedded knowledge of, where the kernel distinguishes a particular file type from others, and knows about the format and other attributes particular to files of a given type.

Let's discuss some high-level background information that will aid those not familiar with file systems in understanding the file specifics we will get into shortly. If you're comfortable with the main components of a Unix file system, you can skip the next four paragraphs.

One of my favorite interview questions to ask is "What's a file system?" You would be surprised at the variety of answers I get back. Simply put, a file system is an operating system-imposed structure that gets loaded on disk subsystems (disk drives or RAID logical volumes) for the purposes of managing the disk resources for the storage and retrieval of files. The view Solaris has of a raw disk drive (or raw RAID logical volume) is a series of disk blocks, starting with block number zero and going to the last block on the disk. There's actually more to the geometry of a disk than that, but we don't need to go into it here. The point is that without a file system the management of disk hardware resources can be very difficult. File systems add a layer on top of raw disk geometry, allowing the system to more easily allocate and de-allocate resources for file management, making it easier to add optimizations for performance, improve fault tolerance, and create management tools.

All the file types supported within Solaris exist as entities in the UFS file system. Actually, this is only partially true, as Solaris also supports several pseudo file systems. However, for the purposes of our discussion, we can keep things in the context of files in the file system for now and expand on the notion of pseudo file systems later. The UFS (Unix File System) in Solaris is derived from the Berkeley Fast File System, which evolved many years ago as a logical improvement on the traditional AT&T Unix System V File System. The original design of the Unix File System included very few components. A file system had a superblock that described it, inodes to describe files in the file system, and disk blocks to store the body of the files (actual file data). This design was relatively simple and easy to implement, and it did the job for a while. But certain performance-related shortcomings became increasingly apparent over time. For example, the distance between a file's inode and disk blocks often resulted in a lot of disk-seek activity when files were referenced. It wasn't very resilient and could easily end up corrupted beyond the system's ability to repair it (fsck(1M)).

The Berkeley Fast File System addressed these issues. Performance was improved with the introduction of cylinder groups, in which file inodes and data blocks were bundled together in groups on the disk to keep inodes and data blocks closer together, thus reducing seek times. A larger file system block was implemented (8k, as opposed to the 1k file system block in the System V File System), so more data could be read with each file system block read. Resilience was improved by replicating the superblock across various platters on the disk, such that if the primary superblock got corrupted, a backup copy could be retrieved and the file system could be repaired. Other improvements came later, such as the implementation of extent block allocation, for improved sequential I/O performance. Detailed internals coverage of the Unix File System implemented in Solaris is planned for a future column.

The basic file system components of UFS in Solaris are the same as those described above for the Berkeley Fast File System. The component we're most interested in for the purposes of this month's column is the inode. Every file in the file system requires an inode, regardless of what type of file it is. The inode stores information about the file, such as the file size, the number of links to the file, the file's owner and group, the access modes for the owner, the group, and everyone else, the file type, etc. Several dates are maintained in the file's inode: last access, last modification, and last time the inode changed. Most of this information is made available via the ls(1) command, with the -l (long listing) flag:

fawlty> ls -l
-rw-r--r--   1 jim      tech        7036 Feb 13 09:50 files.p2
fawlty> ls -li
1446883 -rw-r--r--   1 jim      tech        7036 Feb 13 09:50 files.p2

The first example above is the typical ls -l output, showing the access permissions (read/write for the owner, "jim," read only for the group tech and the rest of the world). There is one link to the file, meaning that the file appears in only one directory in the file system. The owner is jim, jim's group is tech, the file is 7036 bytes in size, it was last written on Feb 13th at 9:50AM, and the file name is files.p2. The second example adds the -i flag, which provides the file's inode number in addition to all the other information. In this case, the inode number is 1446883.
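The same inode information that ls(1) prints is available to programs through the stat(2) system call. The following is a small sketch; the wrapper name inode_summary is invented for illustration, while stat(2) and the st_ino, st_nlink, and st_size members are the standard interface.

```c
#include <sys/stat.h>
#include <sys/types.h>

/*
 * Fetches the inode number, link count, and size in bytes for path.
 * Returns 0 on success, -1 if stat(2) fails.
 * (inode_summary is a hypothetical name.)
 */
int inode_summary(const char *path, ino_t *ino, nlink_t *links, off_t *size)
{
    struct stat sb;

    if (stat(path, &sb) != 0)
        return -1;
    *ino = sb.st_ino;       /* what ls -i prints */
    *links = sb.st_nlink;   /* the link count column of ls -l */
    *size = sb.st_size;     /* the size column of ls -l */
    return 0;
}
```

Called on the files.p2 file from the listing above, a routine like this would be expected to hand back the same inode number, link count, and size that ls -li displays.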

This brings us back around to the topic at hand, file types. The type of file is maintained in the file's inode and is displayed visually in the output of the ls -l command in the left-most column, just before the file permission and mode bits. In the above example, the file type for the file files.p2 is displayed as a single dash, the "-" character. The dash means files.p2 is a regular file, the most common file type in Solaris. A complete list of the various file types that can exist within a Solaris system, along with the characters used to describe each file in the ls -l output, can be found in Table 1 below.

Table 1: File types and the characters used for them in ls -l output
File Type	Character
Named Pipe (FIFO)	p
Character Device Special	c
Directory	d
Block Device Special	b
Regular	-
Symbolic Link	l
Door (2.6 and beyond)	D
Socket (2.6 and beyond)	s

A useful command to be aware of when discussing file types is the file(1) command, which will take a file name as an argument and attempt to determine its file type. Some files have a magic number embedded in their first few bytes. That magic number traditionally was used to distinguish between different types of executable files (SPARC binary, Intel binary, etc.), and has evolved to include a variety of different file types. You can examine the /etc/magic file to see all the magic numbers currently included and the corresponding file types. Depending on the type of file, the magic number may be located at a byte offset other than the beginning of the file. The number of bytes required to hold the magic number also varies. The offset and length are included in the /etc/magic file.
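To give a feel for what a magic number test involves, here is a sketch that checks for one well-known magic number: the four bytes 0x7f 'E' 'L' 'F' that open an ELF object file. This is only an illustration of the idea, not how file(1) is implemented; the function name has_elf_magic is made up, and a real tool would work from the offsets and lengths listed in /etc/magic rather than a single hard-coded value.

```c
#include <stdio.h>
#include <string.h>

/*
 * Returns 1 if the file begins with the ELF magic number, 0 if it does not,
 * and -1 if the file cannot be opened.
 * (has_elf_magic is a hypothetical name.)
 */
int has_elf_magic(const char *path)
{
    static const unsigned char elf_magic[4] = { 0x7f, 'E', 'L', 'F' };
    unsigned char head[4];
    size_t n;
    FILE *fp = fopen(path, "rb");

    if (fp == NULL)
        return -1;
    n = fread(head, 1, sizeof head, fp);    /* magic sits at offset 0 */
    fclose(fp);

    return n == sizeof head && memcmp(head, elf_magic, sizeof head) == 0;
}
```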

What changes in terms of accessing the different file types supported within Solaris is the way the operating system treats the file. Regular files have no operating system-imposed structure to them. While at a user level we may make important distinctions between different types of files, such as ASCII text files, executable programs, binary data files, etc., the operating system makes no such distinctions. These are all simply regular files, and at operating system level they're treated as a number of data blocks being passed up to a higher-level user program. If you've ever accidentally used vi(1) on an executable or tried to cat(1) a directory file, you've seen examples of the operating system treating all regular files the same.

Directory files have a specific structure and purpose. Directories are files that contain the names of other files and directories, and every entry in a directory file matches a directory entry structure:

struct  direct {
        uint32_t        d_ino;          /* inode number of entry */
        u_short d_reclen;               /* length of this record */
        u_short d_namlen;               /* length of string in d_name */
        char    d_name[MAXNAMLEN + 1];  /* name must be no longer than this */
};

The entries in a directory are the file's inode number and the file name, which has a maximum length of 255 characters. The other two fields in the structure, d_reclen and d_namlen, hold the record length and file name length respectively. They are not seen by the user, but rather are used internally by the kernel for directory operations. Blocks are allocated to directory files in units that correspond directly to the underlying storage device, typically 512 bytes. Each 512-byte directory block contains some number of directory entries, defined by the structure above. By maintaining the total size of the directory entry, d_reclen, and the size of the file name character string, d_namlen, the system is able to optimize operations that change directory sizes and search directories for file names.

Interesting to note is that the name of a file is not contained in the file's inode. The file name only exists in the directory entry. Once the file has been located in a directory, the kernel has the file's inode, which is all the kernel needs to do any file operations.

Another bit of data on directories, which pretty much every Unix user is aware of but which I'm compelled to include (particularly for those readers who are relatively new to Unix), concerns the notorious dot (.) and dot dot (..) files that exist in every directory and are put there by the kernel when the directory is created. These represent the current directory, ".", and the parent directory, "..".

1  fawlty> ls -ldi .
2      739716 drwxr-xr-x  40 jim      tech        5120 Feb 16  1998 .
3  fawlty> mkdir mydir
4  fawlty> ls -ldi mydir
5      788462 drwxr-xr-x   2 jim      tech         512 Feb 16  1998 mydir
6  fawlty> cd mydir
7  fawlty> ls -lia
8  total 12
9      788462 drwxr-xr-x   2 jim      tech         512 Feb 16  1998 .
10     739716 drwxr-xr-x  41 jim      tech        5120 Feb 16  1998 ..
11  fawlty>

The above example illustrates an important point about directories. The output of the ls -ldi . command (line 1) shows the current directory, represented as ".", and has an inode number of 739716 (line 2). We make a subdirectory called mydir (line 3) and take a look at it with ls -ldi (line 4). The mydir directory is inode number 788462 (line 5). We then cd down to the mydir directory and look to see if anything is in it (remember, we just created it; use the -a flag for ls(1) to get files that begin with "."). Note that the "." entry has the same inode number as mydir, line 5. Thus, "." in mydir is the mydir directory file. The ".." file has the same inode number as the "." entry from line 2, which is the parent directory for mydir. Simply put, ".." always represents one directory up the tree from the current directory.
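The same comparison can be made programmatically with stat(2). The sketch below uses an invented function name, dot_entries_match, and a fixed-size path buffer chosen only for brevity. It checks that the "." entry inside a directory has the same device and inode number as the directory itself, and that ".." matches the directory we believe to be its parent.

```c
#include <stdio.h>
#include <sys/stat.h>

/*
 * Returns 1 if "dir/." is the same file as dir and "dir/.." is the same
 * file as parent, 0 if either comparison fails, -1 on error.
 * (dot_entries_match is a hypothetical name.)
 */
int dot_entries_match(const char *parent, const char *dir)
{
    char dot[1024], dotdot[1024];
    struct stat sp, sd, sdot, sdotdot;

    if (snprintf(dot, sizeof dot, "%s/.", dir) >= (int)sizeof dot ||
        snprintf(dotdot, sizeof dotdot, "%s/..", dir) >= (int)sizeof dotdot)
        return -1;      /* path too long for this sketch's buffers */

    if (stat(parent, &sp) != 0 || stat(dir, &sd) != 0 ||
        stat(dot, &sdot) != 0 || stat(dotdot, &sdotdot) != 0)
        return -1;

    /* An inode number is only unique within one file system, so the
       device number is compared as well. */
    return sdot.st_dev == sd.st_dev && sdot.st_ino == sd.st_ino &&
           sdotdot.st_dev == sp.st_dev && sdotdot.st_ino == sp.st_ino;
}
```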

Some final points on directories: Solaris implements a cache of file names to speed up directory searches and translation of filenames to inodes. This cache is known as the dnlc, directory-name-lookup cache, and caches directory entries (i.e. file names) up to 31 characters in length. Sun performance expert Adrian Cockcroft discusses tuning the dnlc in his book Sun Performance and Tuning. (See Resources below -- look for the newly revised second edition coming out this month!)

Also, Solaris provides APIs specifically for dealing with directories programmatically. The opendir(3C) routine opens a specified directory and returns a pointer to a directory stream, called DIR. It's analogous to the FILE stream used by the stdio routines, discussed above. It's an extra level of indirection between the user's code and the kernel directory entry structure. Once a directory is successfully opened, it can be traversed using readdir(3C). Other APIs include rewinddir(3C), which sets the pointer back to the beginning of the directory, seekdir(3C) for moving the pointer to another location in the directory, and closedir(3C), for when you're all done.
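A typical traversal with these routines looks something like the following sketch. The function name count_entries is invented for illustration; it walks a directory with readdir(3C) and counts every entry other than "." and "..".

```c
#include <dirent.h>
#include <string.h>

/*
 * Counts the entries in dir, skipping "." and "..".
 * Returns the count, or -1 if the directory cannot be opened.
 * (count_entries is a hypothetical name.)
 */
int count_entries(const char *dir)
{
    DIR *dp = opendir(dir);
    struct dirent *de;
    int count = 0;

    if (dp == NULL)
        return -1;
    while ((de = readdir(dp)) != NULL) {
        /* d_name holds the file name from the directory entry */
        if (strcmp(de->d_name, ".") == 0 || strcmp(de->d_name, "..") == 0)
            continue;
        count++;
    }
    closedir(dp);
    return count;
}
```

A freshly created directory such as mydir in the earlier listing would yield a count of zero, since it holds only the two entries the kernel put there.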

Another type of file that's been with us for as long as I've been using Unix, and still exists in Solaris today, is the device special file. Device special files all live in the /dev directory and are special in that they contain no data. They are used specifically as a linkage to physical devices on the system, such as disks, network interfaces, and tapes. They provide an entry point into the appropriate device drivers, which are the segments of kernel code written to do I/O to a particular device. Special files come in two flavors, block special and character special. As the names imply, the special file type used has to do with the type of I/O the underlying device is capable of. Disks, for example, have corresponding block and character special files because they are capable of either type of I/O. They can be read and written using large chunks of bytes, or blocks, which typically correspond to a block size for a file system installed on a disk (8 kilobytes is the default for a UFS). Character I/O, often referred to as raw, does not follow a specific block size or pattern, but rather treats the underlying storage as a linear array of bytes.

Other types of devices, such as network interfaces and terminal interfaces, implement only character special files, as they do not support true block-mode I/O. Also, we have pseudo devices, with corresponding special files. A pseudo device is something that does not have a corresponding physical interface on the system. The best example of this is the pseudo terminal, the pty special files in /dev. Pseudo terminals are used for network communication (e.g. rlogin, telnet, etc.), where the communications software needs to treat the send and receive sides of the data stream as a terminal character flow, even though it's just a window on someone's workstation. The windowing system software that runs on workstations, be it straight X Windows, OpenWindows, or CDE, uses the pseudo terminal drivers for managing windows on the screen. (See pty(7D) and pts(7D) for more.)

As we said, special files do not actually contain data. Rather, they provide two values known as a major number and minor number. The major number is used to provide an entry point into the appropriate device driver, and the minor number is used to uniquely identify the physical (or pseudo) device. Disks, for example, each have multiple device special files associated with them because a single disk can be formatted into multiple logical partitions, with each partition occupying some number of cylinders on the disk. Each individual partition can have a file system created on it or be used as a raw partition via its character special interface. Thus, we require a block and character device file for each possible disk partition.
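The major and minor numbers of a special file can be read from user code with stat(2) and the major() and minor() macros applied to the st_rdev member. The following is a sketch only: the function name device_numbers is invented, and the choice of header is an assumption that varies by platform (sys/mkdev.h on Solaris, sys/sysmacros.h with the GNU C library).

```c
#include <sys/stat.h>
#include <sys/types.h>
#if defined(__sun)
#include <sys/mkdev.h>          /* assumed Solaris home of major()/minor() */
#elif defined(__linux__)
#include <sys/sysmacros.h>      /* assumed glibc home of the same macros */
#endif

/*
 * Fills in the major and minor numbers of a device special file.
 * Returns 0 on success, -1 if path can't be examined or is not a block
 * or character special file. (device_numbers is a hypothetical name.)
 */
int device_numbers(const char *path, unsigned long *maj, unsigned long *min)
{
    struct stat sb;

    if (stat(path, &sb) != 0)
        return -1;
    if (!S_ISCHR(sb.st_mode) && !S_ISBLK(sb.st_mode))
        return -1;
    *maj = (unsigned long)major(sb.st_rdev);    /* selects the driver */
    *min = (unsigned long)minor(sb.st_rdev);    /* selects the device */
    return 0;
}
```

These are the same two numbers that ls -l prints in place of the file size when it lists a special file.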

Tape devices also typically have multiple special files per tape unit. This allows for specifying, via the actual device name used, things like what density to write or read the tape at, whether or not to rewind the tape when the operation is complete, etc.

In Solaris, the entries in the /dev directory are really symbolic links (which are another file type we haven't talked about yet) to the actual device files that live in the /devices directory. When a Sun system is powered up, the OpenBoot PROM firmware sizes the system, locating all the devices and interfaces. It then builds a hierarchical device tree that gets passed to the Solaris operating system during the boot process. The device name space, that is, the set of device special files, is created starting in the /devices directory. Other Solaris commands get executed (typically at installation or when an -r flag is used for booting) that create the symbolic links under /dev that link to the actual special files in /devices. You can see this for yourself by listing the files under /devices and /dev and tracing the symbolic links on your Solaris system.
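Tracing one of those links from a program is a job for readlink(2), which returns the path stored in a symbolic link without following it. Here is a minimal sketch; the wrapper name link_target is invented for illustration. The one detail worth noting is that readlink(2) does not terminate the string it returns, so the caller has to do that.

```c
#include <unistd.h>

/*
 * Copies the target of the symbolic link at path into buf and terminates
 * it with a NUL. Returns the target's length, or -1 on error (including
 * the case where path is not a symbolic link).
 * (link_target is a hypothetical name.)
 */
ssize_t link_target(const char *path, char *buf, size_t bufsize)
{
    ssize_t n;

    if (bufsize == 0)
        return -1;
    n = readlink(path, buf, bufsize - 1);   /* leave room for the NUL */
    if (n < 0)
        return -1;
    buf[n] = '\0';                          /* readlink() doesn't add one */
    return n;
}
```

Pointed at an entry under /dev on a Solaris system, a routine like this would be expected to return a path leading into /devices.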

Solaris recently added a pseudo file system to the kernel for the management of special files, called specfs. The specfs file system is implemented using the same vnode abstraction that other file systems in Solaris use (vnodes were covered briefly in the December 1997 Inside Solaris column, "Swap space implementation, part one" -- See Resources below). In doing this, a file-system-specific data structure was created, called an snode, that describes a special file. Just as every file in a UFS has a corresponding inode, every special file has a corresponding snode. The fields in the snode structure maintain specific information about the special file, such as access, modification, and change times, a reference count, and various state flags. Detailed information on the use of specfs and snodes is not typically of interest to the general Solaris community, so we won't get into specfs internals here. Send me an e-mail if you'd like to see detailed coverage of specfs in a future column.

That'll do it for this month. Next month, in part three, we'll discuss symbolic links and named pipes and then get into file access modes and flag bits.

Resources

"Fiddling around with files, part one," February 1998 Inside Solaris column in SunWorld
http://www.sun.com/sunworldonline/swol-02-1998/swol-02-insidesolaris.html
"Swap space implementation, part one," December 1997 Inside Solaris column in SunWorld
http://www.sun.com/sunworldonline/swol-12-1997/swol-12-insidesolaris.html
Adrian Cockcroft's Sun Performance and Tuning: SPARC & Solaris
http://www.amazon.com/exec/obidos/ISBN=0131496425/sunworldonlineA/
Full listing of Adrian Cockcroft's Performance Q&A columns in SunWorld
http://www.sun.com/sunworldonline/common/swol-backissues-columns.html#perf
Full listing of past Inside Solaris columns:
http://www.sun.com/sunworldonline/common/swol-backissues-columns.html#insidesolaris
Goodheart, B. & Cox, J. The Magic Garden Explained: The Internals of Unix System V Release 4, Prentice Hall
http://www.amazon.com/exec/obidos/ISBN=0130981389/sunworldonlineA/
Vahalia, Uresh. Unix Internals: The New Frontiers, Prentice-Hall
http://www.amazon.com/exec/obidos/ISBN=0131019082/sunworldonlineA/

About the author
Jim Mauro is currently an area technology manager for Sun Microsystems in the Northeast area, focusing on server systems, clusters, and high availability. He has a total of 18 years industry experience, working in service, educational services (he developed and delivered courses on Unix internals and administration), and software consulting. Reach Jim at jim.mauro@sunworld.com.

URL: http://www.sunworld.com/swol-03-1998/swol-03-insidesolaris.html