Fiddling around with files, part two
This month we'll continue our discussion of files in Solaris with a look at the APIs and how they affect the number of files you can have open. We'll also begin explaining file types.
Last month we began our discussion of the implementation of files in Solaris with an overview of the main kernel structures involved in file maintenance. We also addressed the question of how many files can be opened at a systemwide and per-process level. This month we continue that discussion with a look at which file APIs (application programming interfaces) are used and the types of files Solaris supports. (4,200 words)
Note to readers: It's difficult sometimes to assess the amount of space required to cover a particular topic. I've found that once the research and writing begin the effort takes on a life of its own, and I often end up discovering a whole slew of items that warrant discussion if the topic at hand is to be given proper coverage. Our current topic falls into this category; thus we'll likely require at least one additional column to complete our discussion of files.
file
data structure, and described the various
structure members. A reader sent me an e-mail
with an interesting question that I believe warrants covering here.
The question was around the file offset that is used to reflect the
current character position in the file (a pointer to a location in
the file).
The first point to make here is that the same file can have multiple
file
data structures associated with it. If different
processes executing on the same system open the same file, each
process will have a unique representation of the file through its
file descriptor to a process-specific file
data
structure. Thus, as each process reads and writes the file, the
process will change the f_offset
member of its
file
structure.
This behavior differs for file descriptors that are inherited via the fork(2) system call, or when processes issue the dup(2) or dup2(2) system calls for duplicating file descriptors. In these scenarios, the file structure is shared, and the f_offset field is shared, meaning that a read or write to the file changes the f_offset as it is seen by the child process (in the case of a fork(2)), or by references to the file descriptors returned by dup(2) and dup2(2).
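The difference between a second open and a duplicated descriptor can be observed from user code. The following is a minimal sketch, not taken from the Solaris source: the helper name demo_offsets and the scratch file name are hypothetical, and it assumes a writable /tmp. It reads four bytes through one descriptor and then asks lseek(2) what offset the other descriptors see.

```c
#include <fcntl.h>
#include <stdlib.h>
#include <sys/types.h>
#include <unistd.h>

/*
 * Hypothetical demo helper: returns 1 if a second open(2) keeps its own
 * offset while a dup(2)'d descriptor shares one, 0 if not, -1 on error.
 */
int demo_offsets(void)
{
    char path[] = "/tmp/offsetXXXXXX";   /* scratch file; name is arbitrary */
    char buf[4];
    int fd1, fd2 = -1, fd3 = -1, result = -1;

    fd1 = mkstemp(path);                 /* first open: its own file structure */
    if (fd1 < 0)
        return -1;

    if (write(fd1, "abcdefgh", 8) == 8 && lseek(fd1, 0, SEEK_SET) == 0) {
        fd2 = open(path, O_RDONLY);      /* second open: separate file structure */
        fd3 = dup(fd1);                  /* duplicate: shares fd1's file structure */
    }

    if (fd2 >= 0 && fd3 >= 0 && read(fd1, buf, sizeof(buf)) == 4) {
        /* lseek(fd, 0, SEEK_CUR) reports the offset without moving it */
        off_t independent = lseek(fd2, 0, SEEK_CUR);  /* untouched by the read */
        off_t shared = lseek(fd3, 0, SEEK_CUR);       /* moved by the read */
        result = (independent == 0 && shared == 4);
    }

    if (fd3 >= 0)
        close(fd3);
    if (fd2 >= 0)
        close(fd2);
    close(fd1);
    unlink(path);
    return result;
}
```

Calling demo_offsets() from a small main() and checking for a return value of 1 confirms the behavior described above on the system at hand.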
Finally, for processes that are multithreaded, multiple threads
within a process doing reads and writes to the same file descriptor
will alter the offset field as seen by all threads in the process,
since once again there is one file
structure involved.
Logically, if one thread closes the file, the file will be closed
process wide.
There is an lseek(2)
system call available for
manipulating the offset value of an opened file. The above described
behavior applies when file offsets are altered using
lseek(2)
. For managing synchronized access to the same
file from multiple processes or threads, file locking interfaces are
provided and will be discussed next month.
On a related note, Solaris (and Unix in general) does not technically support the notion of file records, which are simply groupings of bytes within a file. However, the Solaris file APIs allow for specifying a range of bytes within a file for certain file operations, such as file locking.
File APIs
There are several interfaces available in Solaris for doing file
I/O. The traditional interfaces are known as the Standard I/O
functions, abbreviated as stdio. (See the stdio(3S) manpage.) These
are the buffered I/O interfaces that were added to Unix System V by
AT&T in the early days of the Unix evolution (actually, they were
probably added back in the days of Unix System III, mid-to-late
1970s, but I'm not 100 percent certain). These stdio library
interfaces exist as a layer above the lower level system calls and
improved the portability of the system as it was ported to
more and more hardware platforms. The stdio routines call the
corresponding lower level system calls, e.g., fopen(3S)
calls open(2)
, fwrite(3S)
calls
write(2)
, etc. They provide additional buffering in the
process address space, before entering the kernel through the
appropriate system call.
The implementation of stdio is based on a buffered file referred to
as a stream, which is described by a data structure called
a FILE
. A FILE
is described in the header file
/usr/include/stdio.h.
    typedef struct
    {
        int _cnt;             /* number of available characters in buffer */
        unsigned char *_ptr;  /* next character from/to here in buffer */
        unsigned char *_base; /* the buffer */
        unsigned char _flag;  /* the state of the stream */
        unsigned char _file;  /* Unix System file descriptor */
    } FILE;
The members of the FILE
structure include a count of the number of
characters in the buffer (_cnt
), a buffer pointer (_base
), and a
pointer to the current character position in the buffer (_ptr
).
Also, there is a flag field (_flag
) that stores status bits about
the state of the file. Examples of flag bits are indications that
the file is currently being read or written, the file pointer
is at EOF (end-of-file), or an I/O error occurred on the file.
Finally, the file descriptor itself is stored as an unsigned
char data type in the _file
member. It is because of this
FILE
structure member, _file
, and the fact that it
is an unsigned char data type, that a process cannot have more than 256 open
files when using the stdio routines. Simply, the value range for this data type
is 0 to 255; only 8 bits (1 byte) are maintained when an unsigned
char is used. Thus, only a maximum of 256 files can be uniquely
identified within a process using the stdio(3S)
interfaces.
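The arithmetic behind the 256-file ceiling is easy to check. The sketch below only models what happens when a descriptor number is stored in an unsigned char such as the _file member shown above; the function names are hypothetical and it assumes 8-bit chars. It does not call into stdio itself.

```c
#include <limits.h>

/* Model of storing a descriptor in an unsigned char member like _file. */
unsigned char store_in_file_member(int fd)
{
    return (unsigned char)fd;   /* only the low 8 bits survive */
}

/* Returns 1 if fd can be represented in an unsigned char, 0 otherwise. */
int fd_fits_stdio(int fd)
{
    return fd >= 0 && fd <= UCHAR_MAX;
}
```

Descriptor 255 round-trips intact, but descriptor 256 is stored as 0, which is indistinguishable from standard input. That is why descriptors above 255 cannot be uniquely identified through this structure.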
That is the first example of where selection of a programming
interface can affect the number of open files. A second interface
that needs to be addressed is the select(3C)
routine.
The select(3C)
routine provides a polling capability,
where sets of file descriptors can be "polled" to determine if data
is available for reading, if the file descriptor is ready for
writing, or if an exception condition occurred on the file
descriptor that needs to be handled. (See the manpage for more.)
The use of select(3C)
in code requires the inclusion of
the select.h header file (/usr/include/sys/select.h),
which contains the following entry:
    #ifndef FD_SETSIZE
    #define FD_SETSIZE 1024
    #endif
The above definition, FD_SETSIZE, defines the maximum number of file descriptors that can reside in a file descriptor set polled by select(3C). As you can see, a limit of 1k (1,024) open files is imposed when using select(3C). The code in select(3C) makes the following check to determine if the number of file descriptors (nfds) is greater than the FD_SETSIZE value:
    if (nfds < 0 || nfds > FD_SETSIZE) {
        errno = EINVAL;
        return (-1);
    }
In such a situation, the select(3C)
call will fail, and
an "Invalid Argument" error (EINVAL) is set.
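Application code can apply the same kind of guard before it ever builds a descriptor set. The following is a minimal sketch under stated assumptions: wait_readable_select is a hypothetical helper name, and the choice to reject out-of-range descriptors with EINVAL simply mirrors the library check quoted above.

```c
#include <errno.h>
#include <stddef.h>
#include <sys/select.h>
#include <sys/time.h>
#include <sys/types.h>
#include <unistd.h>   /* older systems declare select() here */

/*
 * Hypothetical helper: wait up to timeout_ms for fd to become readable.
 * Returns 1 if readable, 0 on timeout, -1 on error (errno is set).
 */
int wait_readable_select(int fd, int timeout_ms)
{
    fd_set readfds;
    struct timeval tv;
    int n;

    /* A descriptor outside 0..FD_SETSIZE-1 cannot be placed in an fd_set */
    if (fd < 0 || fd >= FD_SETSIZE) {
        errno = EINVAL;
        return -1;
    }

    FD_ZERO(&readfds);
    FD_SET(fd, &readfds);
    tv.tv_sec = timeout_ms / 1000;
    tv.tv_usec = (timeout_ms % 1000) * 1000;

    /* The first argument is the highest descriptor number plus one */
    n = select(fd + 1, &readfds, NULL, NULL, &tv);
    if (n < 0)
        return -1;
    return (n > 0 && FD_ISSET(fd, &readfds)) ? 1 : 0;
}
```

A process that holds more than FD_SETSIZE open files will see this helper fail for the high-numbered descriptors, which is the wall described here.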
For applications that require more open files in earlier versions of Solaris, an alternative interface must be used. The poll(2) system call provides functionality similar to select(3C), but does not have any inherent open file limits beyond what was discussed in part one last month. Note that the implementation of select(3C) in Solaris, which is a library routine, calls the poll(2) system call.
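For comparison, here is a minimal sketch of the same wait expressed with poll(2). The helper names wait_readable_poll and demo_poll are hypothetical. Because poll(2) takes an array of pollfd structures rather than a fixed-size bit set, the descriptor value itself is not compared against FD_SETSIZE.

```c
#include <poll.h>
#include <unistd.h>

/*
 * Hypothetical helper: wait up to timeout_ms for fd to become readable.
 * Returns 1 if readable, 0 on timeout, -1 on error.
 */
int wait_readable_poll(int fd, int timeout_ms)
{
    struct pollfd pfd;
    int n;

    pfd.fd = fd;
    pfd.events = POLLIN;   /* interested in data to read */
    pfd.revents = 0;

    n = poll(&pfd, 1, timeout_ms);
    if (n < 0)
        return -1;
    return (n > 0 && (pfd.revents & POLLIN)) ? 1 : 0;
}

/* Demo: a pipe's read end becomes readable once a byte is written. */
int demo_poll(void)
{
    int p[2], before, after = -1;

    if (pipe(p) != 0)
        return -1;
    before = wait_readable_poll(p[0], 0);       /* nothing written yet */
    if (write(p[1], "x", 1) == 1)
        after = wait_readable_poll(p[0], 100);  /* one byte pending */
    close(p[0]);
    close(p[1]);
    return (before == 0 && after == 1);
}
```

The per-process descriptor limits discussed last month still apply; only the FD_SETSIZE ceiling is avoided.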
These two areas, stdio(3S)
and select(3C)
,
are typically where folks hit a wall when opening files. Note that
this is not an implied guarantee that use of other interfaces will
work perfectly fine with large numbers of opened files. It's up to
you to examine the interface when looking to exceed documented,
supported limits, and test accordingly.
File types
Last month we briefly mentioned the different
types of files that exist in Solaris. More accurately, we mentioned
some of the file types that the Solaris kernel acknowledges.
The distinction seems subtle, but in reality it's key to understanding
the concept of file types.
A file type in Solaris is a file that the operating system has an embedded knowledge of, where the kernel distinguishes a particular file type from others, and knows about the format and other attributes particular to files of a given type.
Let's discuss some high-level background information that will aid those not familiar with file systems in understanding the file specifics we will get into shortly. If you're comfortable with the main components of a Unix file system, you can skip the next four paragraphs.
One of my favorite interview questions to ask is "What's a file system?" You would be surprised at the variety of answers I get back. Simply put, a file system is an operating system-imposed structure that gets loaded on disk subsystems (disk drives or RAID logical volumes) for the purposes of managing the disk resources for the storage and retrieval of files. The view Solaris has of a raw disk drive (or raw RAID logical volume) is a series of disk blocks, starting with block number zero and going to the last block on the disk. There's actually more to the geometry of a disk than that, but we don't need to go into it here. The point is that without a file system the management of disk hardware resources can be very difficult. File systems add a layer on top of raw disk geometry, allowing the system to more easily allocate and de-allocate resources for file management, making it easier to add optimizations for performance, improve fault tolerance, and create management tools.
All the file types supported within Solaris exist as entities in the
UFS file system. Actually, this is only partially true, as Solaris
also supports several pseudo file systems. However, for the purposes of
our discussion, we can keep things in the context of files in the
file system for now and expand on the notion of pseudo file systems
later. The UFS (Unix File System) in Solaris is derived from the
Berkeley Fast File System, which evolved many years ago as a logical
improvement on the traditional AT&T Unix System V File System. The
original design of the Unix File System included very few
components. A file system had a superblock that described
it, inodes to describe files in the file
system, and disk blocks to store the body of the files
(actual file data). This design was relatively simple and easy to
implement, and it did the job for a while. But certain performance-related
shortcomings became increasingly apparent over time. For example, the distance between
a file's inode and disk blocks often resulted in a lot of disk-seek activity
when files were referenced. It wasn't very resilient and could easily end up
corrupted beyond the system's ability to repair it
(fsck(1M)
).
The Berkeley Fast File System addressed these issues. Performance was improved with the introduction of cylinder groups, in which file inodes and data blocks were bundled together in groups on the disk to keep inodes and data blocks closer together, thus reducing seek times. A larger file system block was implemented (8k, as opposed to the 1k file system block in the System V File System), so more data could be read with each file system block read. Resilience was improved by replicating the superblock across various platters on the disk, such that if the primary superblock got corrupted, a backup copy could be retrieved and the file system could be repaired. Other improvements came later, such as the implementation of extent block allocation, for improved sequential I/O performance. Detailed internals coverage of the Unix File System implemented in Solaris is planned for a future column.
The basic file system components of UFS in Solaris are the same as those
described above for the Berkeley Fast File System. The component
we're most interested in for the purposes of this month's column is
the inode. Every file in the file system requires an inode,
regardless of what type of file it is. The inode stores information about the file, such as the file size, the number of links to the file, the file's owner and group, the access modes (for the owner, for the owner's group, and for everyone else), the file type, etc. Several dates are maintained in the file's inode: last access, last modification, and last time the inode changed. Most of this information is made available via the ls(1) command, with the -l (long listing) flag:
    fawlty> ls -l
    -rw-r--r-- 1 jim tech 7036 Feb 13 09:50 files.p2
    fawlty> ls -li
    1446883 -rw-r--r-- 1 jim tech 7036 Feb 13 09:50 files.p2
The first example above is the typical ls -l
output, showing the
access permissions (read/write for the owner, "jim," read only for the
group tech and the rest of the world). There is one link to the
file, meaning that the file appears in only one directory in the
file system. The owner is jim, jim's group is tech, the file is 7036
bytes in size, it was last written on Feb 13th at 9:50AM, and the
file name is files.p2. The second example adds the -i
flag,
which provides the file's inode number in addition to all the other
information. In this case, the inode number is 1446883.
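Programs can get at the same inode information through the stat(2) system call. The following is a small sketch; print_inode_info is a hypothetical helper name, and the casts are there only to keep the printf formats portable across systems where these types differ in width.

```c
#include <stdio.h>
#include <sys/stat.h>
#include <sys/types.h>

/* Print a few inode fields for path. Returns 0 on success, -1 on error. */
int print_inode_info(const char *path)
{
    struct stat sb;

    if (stat(path, &sb) != 0) {
        perror(path);
        return -1;
    }
    printf("inode number:  %lu\n", (unsigned long)sb.st_ino);
    printf("link count:    %lu\n", (unsigned long)sb.st_nlink);
    printf("owner/group:   %ld/%ld\n", (long)sb.st_uid, (long)sb.st_gid);
    printf("size in bytes: %lld\n", (long long)sb.st_size);
    printf("last access:   %ld\n", (long)sb.st_atime);
    printf("last modified: %ld\n", (long)sb.st_mtime);
    printf("inode changed: %ld\n", (long)sb.st_ctime);
    return 0;
}
```

The owner and group come back as numeric IDs and the dates as seconds since the epoch; ls(1) translates them into names and calendar dates for display.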
This brings us back around to the topic at hand, file
types. The type of file is maintained in the file's inode and is
displayed visually in the output of the ls -l
command in the
left-most column, just before the file permission and mode bits.
In the above example, the file type for the file files.p2
is displayed as a single dash, the "-
" character. The dash means
files.p2 is a regular file, the most common
file type in Solaris. A complete list of the various file types that
can exist within a Solaris system, along with the characters used to
describe each file in the ls -l
output, can be found in Table 1
below.
File Type | Character |
---|---|
Named Pipe (FIFO) | p |
Character Device Special | c |
Directory | d |
Block Device Special | b |
Regular | - |
Symbolic Link | l |
Door (2.6 and beyond) | D |
Socket (2.6 and beyond) | s |
A useful command to be aware of when discussing file types is the
file(1)
command, which will take a file name as an
argument and attempt to determine its file type. Some files
have a magic number embedded in their first few bytes.
That magic number traditionally was used to distinguish
between different types of executable files (SPARC binary, Intel
binary, etc.), and has evolved to include a variety of different file
types. You can examine the /etc/magic file to see all the
magic numbers currently included and the corresponding file types.
Depending on the type of file, the magic number may sit at a byte offset other than the beginning of the file. The number of bytes required to hold the magic number also varies. Both the offset and the length are listed in the /etc/magic file.
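As an illustration of a magic number check, ELF executables begin with the four bytes 0x7f, 'E', 'L', 'F' at offset zero. The sketch below tests for just that one signature. The helper name has_elf_magic is hypothetical, and file(1) itself consults the full /etc/magic table rather than a single hard-coded value.

```c
#include <stdio.h>
#include <string.h>

/*
 * Returns 1 if the file starts with the ELF magic number, 0 if it does
 * not, and -1 if the file cannot be opened.
 */
int has_elf_magic(const char *path)
{
    static const unsigned char elf_magic[4] = { 0x7f, 'E', 'L', 'F' };
    unsigned char head[4];
    size_t n;
    FILE *fp = fopen(path, "rb");

    if (fp == NULL)
        return -1;
    n = fread(head, 1, sizeof(head), fp);
    fclose(fp);

    if (n < sizeof(head))
        return 0;   /* too short to hold the magic number */
    return memcmp(head, elf_magic, sizeof(elf_magic)) == 0;
}
```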
What changes in terms of accessing the different file types
supported within Solaris is the way the operating system
treats the file. Regular files have no operating system-imposed
structure to them. While at a user level we may make important
distinctions between different types of files, such as ASCII text
files, executable programs, binary data files, etc., the operating
system makes no such distinctions. These are all simply regular
files, and at operating system level they're treated as a
number of data blocks being passed up to a higher-level user
program. If you've ever accidentally used vi(1)
on an
executable or tried to cat(1)
a directory file, you've
seen examples of the operating system treating all regular files the
same.
Directory files have a specific structure and purpose. Directories
are files that contain the names of other files and directories, and
every entry in a directory file matches a directory entry
structure:
    struct direct {
        uint32_t d_ino;                 /* inode number of entry */
        u_short  d_reclen;              /* length of this record */
        u_short  d_namlen;              /* length of string in d_name */
        char     d_name[MAXNAMLEN + 1]; /* name must be no longer than this */
    };
The entries in a directory are the file's inode number and the file name, which has a maximum length of 255 characters. The other two fields in the structure, d_reclen and d_namlen, hold the record length and file name length respectively. They are not seen by the user, but rather are used internally by the kernel for directory operations. Space for directory files is allocated in units that correspond directly to the underlying storage device, typically 512 bytes. Each 512-byte directory block contains some number of directory entries, defined by the structure above. By maintaining the total size of the directory entry, d_reclen, and the size of the file name character string, d_namlen, the system is able to optimize operations that change directory sizes and search directories for file names.
Interesting to note is that the name of a file is not contained in the file's inode. The file name only exists in the directory entry. Once the file has been located in a directory, the kernel has the file's inode, which is all the kernel needs to do any file operations.
Another bit of data on directories, which pretty much every Unix user is aware of but which I'm compelled to include (particularly for those readers who are relatively new to Unix), concerns the notorious dot (.) and dot dot (..) files that exist in every directory and are put there by the kernel when the directory is created. These represent the current directory, ".", and the parent directory, "..".
     1 fawlty> ls -ldi .
     2 739716 drwxr-xr-x 40 jim tech 5120 Feb 16 1998 .
     3 fawlty> mkdir mydir
     4 fawlty> ls -ldi mydir
     5 788462 drwxr-xr-x 2 jim tech 512 Feb 16 1998 mydir
     6 fawlty> cd mydir
     7 fawlty> ls -lia
     8 total 12
     9 788462 drwxr-xr-x 2 jim tech 512 Feb 16 1998 .
    10 739716 drwxr-xr-x 41 jim tech 5120 Feb 16 1998 ..
    11 fawlty>
The above example illustrates an important point about directories. The output of the ls -ldi . command (line 1) shows the current directory, represented as ".", which has an inode number of 739716 (line 2). We make a subdirectory called mydir (line 3) and take a look at it with ls -ldi (line 4). The mydir directory is inode number 788462 (line 5). We then cd down to the mydir directory and look to see if anything is in it (remember, we just created it; use the -a flag for ls(1) to get files that begin with "."). Note that the "." entry (line 9) has the same inode number as mydir, line 5. Thus, "." in mydir is the mydir directory file. The ".." file (line 10) has the same inode number as the "." entry from line 2, which is the parent directory for mydir. Simply put, ".." always represents one directory up the tree from the current directory.
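The same inode comparison can be made programmatically with stat(2). The following sketch creates a scratch directory pair and checks that "." and ".." resolve to the expected inodes. The helper names and the /tmp location are assumptions made for illustration.

```c
#include <stdio.h>
#include <stdlib.h>
#include <sys/stat.h>
#include <sys/types.h>
#include <unistd.h>

/* Two stat results name the same file if device and inode both match. */
static int same_file(const struct stat *a, const struct stat *b)
{
    return a->st_dev == b->st_dev && a->st_ino == b->st_ino;
}

/*
 * Hypothetical demo: returns 1 if mydir/. is mydir and mydir/.. is the
 * parent directory, 0 if not, -1 on error.
 */
int demo_dot_entries(void)
{
    char parent[] = "/tmp/dotdemoXXXXXX";   /* scratch parent directory */
    char child[64], dot[80], dotdot[80];
    struct stat sp, sc, sdot, sdotdot;
    int result = -1;

    if (mkdtemp(parent) == NULL)
        return -1;
    snprintf(child, sizeof(child), "%s/mydir", parent);
    snprintf(dot, sizeof(dot), "%s/.", child);
    snprintf(dotdot, sizeof(dotdot), "%s/..", child);

    if (mkdir(child, 0755) == 0) {
        if (stat(parent, &sp) == 0 && stat(child, &sc) == 0 &&
            stat(dot, &sdot) == 0 && stat(dotdot, &sdotdot) == 0)
            result = same_file(&sdot, &sc) && same_file(&sdotdot, &sp);
        rmdir(child);
    }
    rmdir(parent);
    return result;
}
```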
Some final points on directories: Solaris implements a cache of file names to speed up directory searches and translation of filenames to inodes. This cache is known as the dnlc, directory-name-lookup cache, and caches directory entries (i.e. file names) up to 31 characters in length. Sun performance expert Adrian Cockcroft discusses tuning the dnlc in his book Sun Performance Tuning. (See Resources below -- look for the newly revised second edition coming out this month!)
Also, Solaris provides APIs specifically for dealing with
directories programmatically. The opendir(3C)
routine
opens a specified directory and returns a pointer to a directory
stream, called DIR
. It's exactly analogous to the
FILE
stream used by the stdio routines, discussed
above. It's an extra level of indirection between the user's code and
the kernel directory entry
structure. Once a directory
is successfully opened, it can be traversed using
readdir(3C)
. Other APIs include
rewinddir(3C)
, which sets the pointer back to the
beginning of the directory, seekdir(3C)
for moving the
pointer to another location in the directory, and
closedir(3C)
, for when you're all done.
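A minimal sketch of these routines in use follows. The helper name list_directory is hypothetical. It walks a directory with readdir(3C) and prints each entry's inode number and name, which is essentially the information ls -i reports.

```c
#include <dirent.h>
#include <stdio.h>

/*
 * Print the inode number and name of every entry in a directory.
 * Returns the number of entries read, or -1 if the directory
 * cannot be opened.
 */
int list_directory(const char *path)
{
    DIR *dirp = opendir(path);
    struct dirent *entry;
    int count = 0;

    if (dirp == NULL)
        return -1;

    /* readdir() returns NULL once the end of the directory is reached */
    while ((entry = readdir(dirp)) != NULL) {
        printf("%lu\t%s\n", (unsigned long)entry->d_ino, entry->d_name);
        count++;
    }
    closedir(dirp);
    return count;
}
```

Calling list_directory(".") lists the current directory; on most file systems the output includes the "." and ".." entries.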
Another type of file that's been with us for as long as I've been using Unix, and still exists in Solaris today, is the device special file. Device special files all live in the /dev directory and are special in that they contain no data. They are used specifically as a linkage to physical devices on the system, such as disks, network interfaces, and tapes. They provide an entry point into the appropriate device drivers, which are the segments of kernel code written to do I/O to a particular device. Special files come in two flavors, block special and character special. As the names imply, the special file type used has to do with the type of I/O the underlying device is capable of. Disks, for example, have corresponding block and character special files because they are capable of either type of I/O. They can be read and written using large chunks of bytes, or blocks, which typically correspond to a block size for a file system installed on a disk (8 kilobytes is the default for a UFS). Character I/O, often referred to as raw, does not follow a specific block size or pattern, but rather treats the underlying storage as a linear array of bytes.
Other types of devices, such as network interfaces and terminal interfaces, implement only character special files, as they do not support true block-mode I/O. Also, we have pseudo devices, with corresponding special files. A pseudo device is something that does not have a corresponding physical interface on the system. The best example of this is the pseudo terminal, the pty special files in /dev. Pseudo terminals are used for network communication (e.g. rlogin, telnet, etc.), where the communications software needs to treat the send and receive sides of the data stream as a terminal character flow, even though it's just a window on someone's workstation. The windowing system software that runs on workstations, be it straight X Windows, OpenWindows, or CDE, uses the pseudo terminal drivers for managing windows on the screen. (See pty(7D) and pts(7D) for more.)
As we said, special files do not actually contain data. Rather, they provide two values known as a major number and minor number. The major number is used to provide an entry point into the appropriate device driver, and the minor number is used to uniquely identify the physical (or pseudo) device. Disks, for example, each have multiple device special files associated with them because a single disk can be formatted into multiple logical partitions, with each partition occupying some number of cylinders on the disk. Each individual partition can have a file system created on it or be used as a raw partition via its character special interface. Thus, we require a block and character device file for each possible disk partition.
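User code can read the major and minor numbers of a special file from the st_rdev field returned by stat(2). The sketch below is hedged in two ways: device_numbers is a hypothetical helper name, and the header that supplies the major() and minor() macros varies by platform, so the conditional includes reflect my understanding of Solaris and Linux rather than a guaranteed portable recipe.

```c
#include <sys/stat.h>
#include <sys/types.h>
#if defined(__sun)
#include <sys/mkdev.h>       /* major(), minor() on Solaris */
#elif defined(__linux__)
#include <sys/sysmacros.h>   /* major(), minor() on Linux with glibc */
#endif

/*
 * Fetch the major and minor numbers of a device special file.
 * Returns 0 on success, -1 if path is not a block or character
 * special file or cannot be examined.
 */
int device_numbers(const char *path, unsigned long *maj, unsigned long *min)
{
    struct stat sb;

    if (stat(path, &sb) != 0)
        return -1;
    if (!S_ISCHR(sb.st_mode) && !S_ISBLK(sb.st_mode))
        return -1;

    /* For special files, st_rdev encodes the device the file refers to */
    *maj = (unsigned long)major(sb.st_rdev);
    *min = (unsigned long)minor(sb.st_rdev);
    return 0;
}
```

Running it against two partitions of the same disk should show the same major number (same driver) with different minor numbers.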
Tape devices also typically have multiple special files per tape unit. This allows for specifying, via the actual device name used, things like what density to write or read the tape at, whether or not to rewind the tape when the operation is complete, etc.
In Solaris, the entries in the /dev directory are
really symbolic links (which are another file type we haven't talked about yet)
to the actual device files that live in the
/devices directory. When a Sun system is powered up, the
OpenBoot PROM firmware sizes the system, locating all the devices
and interfaces. It then builds a hierarchical device tree that
gets passed to the Solaris operating system during the boot process.
The device name space, that is, the set of device special files, is
created starting in the /devices directory. Other
Solaris commands get executed (typically at installation or when an
-r
flag is used for booting) that create the symbolic links under
/dev that link to the actual special files in
/devices. You can see this for yourself by listing the
files under /devices and /dev and tracing the symbolic links on your Solaris system.
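Tracing a link can also be done from a program with readlink(2). Here is a minimal sketch; link_target is a hypothetical helper name. Note that readlink(2) does not terminate the string it returns, so the helper does that itself.

```c
#include <sys/types.h>
#include <unistd.h>

/*
 * Copy the target of a symbolic link into buf as a C string.
 * Returns the target's length, or -1 if path is not a symbolic
 * link, cannot be read, or size is zero.
 */
ssize_t link_target(const char *path, char *buf, size_t size)
{
    ssize_t n;

    if (size == 0)
        return -1;

    /* Leave room for the terminator; readlink() does not add one */
    n = readlink(path, buf, size - 1);
    if (n < 0)
        return -1;
    buf[n] = '\0';
    return n;
}
```

On a Solaris system, passing one of the names under /dev should yield a path that leads into /devices.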
Solaris recently added a pseudo file system to the kernel for the management of special files, called specfs. The specfs file system is implemented using the same vnode abstraction that other file systems in Solaris use (vnodes were covered briefly in the December 1997 Inside Solaris column, "Swap space implementation, part one" -- see Resources below). In doing this, a file-system-specific data structure was created, called an snode, that describes a special file. Just as every file in a UFS has a corresponding inode, every special file has a corresponding snode. The fields in the snode structure maintain specific information about the special file, such as access, modification, and change times, a reference count, and various state flags. Detailed information on the use of specfs and snodes is not typically of interest to the general Solaris community, so we won't get into specfs internals here. Send me an e-mail if you'd like to see detailed coverage of specfs in a future column.
That'll do it for this month. Next month, in part three, we'll discuss symbolic links and named pipes and then get into file access modes and flag bits.
Resources
About the author
Jim Mauro is currently an area technology manager for Sun
Microsystems in the Northeast area, focusing on server systems,
clusters, and high availability. He has a total of 18 years industry
experience, working in service, educational services (he developed
and delivered courses on Unix internals and administration), and
software consulting.
Reach Jim at jim.mauro@sunworld.com.
URL: http://www.sunworld.com/swol-03-1998/swol-03-insidesolaris.html