Inside Solaris by Jim Mauro

Fiddling around with files, part three

Jim Mauro continues fiddling around with file types and moves on to look into file access modes and other file bits

April  1998

I should rename this column "Forging ahead with files"! This, the third column in the series, continues with file types. Following that, we'll get into file access modes, file flags, and file locking.

Note to readers: Last month's column went online with a few bugs. Thanks to our astute readers, we located the problems and applied fixes. It's possible, however, that you read the column before it was corrected, as the final fixes did not get out until March 10. If you read the column prior to that date, you may want to take a minute to reread the first six paragraphs of the column along with the section on APIs.

Some of the information provided in last month's "Readers Speak Out: Letters to the Editor," where Jim addressed a question from a reader, was also incorrect. That information has been corrected now as well. (3,800 words)

In addition to the file types we discussed last month (regular files, directories, and device special files), Solaris supports symbolic links. The other file types (FIFOs, sockets, and doors) each have a supporting underlying file system and require rather lengthy treatment, so we will cover them in separate columns in the near future.

Links actually come in two flavors: hard links and symbolic links. Hard links are not really a file type; they are simply directory entries. A file with multiple hard links is a file that has more than one directory entry, often in different directories. The names do not have to be the same, because the operating system identifies the file by its inode number (remember from last month -- the file entry in a directory includes the file's inode number, which is guaranteed to be unique within the same file system). The number of hard links to a file is maintained in the file's inode, so the link count is incremented as new links are created. The kernel can then track unlinks (to remove a file, the rm(1) command uses the unlink(2) system call; rmdir(1) uses the closely related rmdir(2)). The kernel does not really remove a file until an unlink brings the link count to zero, which means that no other links to the file exist. If other links do exist, an unlink simply removes the directory entry (provided the permissions allow the operation).

Creating hard links to directories is allowed only if the effective uid of the operation is root (superuser). As you can imagine, creating hard links to directories without careful thought could create some interesting directory traversal problems.

The ln(1) command is used to create hard links.

sunsys> ls -li of.c
     11294 -rw-r--r--   1 jim      staff        253 Mar  8 21:08 of.c
sunsys> ln of.c ../oflink.c
sunsys> ls -li of.c
     11294 -rw-r--r--   2 jim      staff        253 Mar  8 21:08 of.c
sunsys> ls -li ../oflink.c
     11294 -rw-r--r--   2 jim      staff        253 Mar  8 21:08 ../oflink.c

In the above example, we do a long listing, including the inode number, of the file of.c. Note that the inode number is 11294 and there is only one link to the file (the third column from the left is the link count). We then create a link to the of.c file in the parent directory and do another long listing of of.c, as well as of the newly created link, oflink.c. Note that the link count goes up to two, and the inode number remains the same. We now have a file with two names, each of which appears in a different directory in the same file system. Hard links cannot span file systems, because an inode number is unique only within a single file system.
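The link-count behavior can also be observed from a program. The following is a minimal Python sketch, assuming a POSIX system and a scratch directory; the helper name hard_link_demo is hypothetical, and the file names simply mirror the listing above. Python's os.link and os.unlink are thin wrappers over link(2) and unlink(2).

```python
import os
import tempfile


def hard_link_demo(workdir):
    """Create a file, add a hard link, remove the first name, report what changed."""
    original = os.path.join(workdir, "of.c")
    link = os.path.join(workdir, "oflink.c")
    with open(original, "w") as f:
        f.write("int main(void) { return 0; }\n")

    before = os.stat(original)      # one directory entry so far
    os.link(original, link)         # second directory entry, same inode
    after = os.stat(original)
    via_link = os.stat(link)

    os.unlink(original)             # removes one name; the data survives
    remaining = os.stat(link)

    return {
        "same_inode": before.st_ino == after.st_ino == via_link.st_ino,
        "nlink_before": before.st_nlink,
        "nlink_after_link": after.st_nlink,
        "nlink_after_unlink": remaining.st_nlink,
    }


if __name__ == "__main__":
    with tempfile.TemporaryDirectory() as scratch:
        print(hard_link_demo(scratch))
```

The file's data is released only when the last name is unlinked, which in this sketch happens when the temporary directory is cleaned up.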

Symbolic links are a very different animal and are actual file types. A symbolic link is logically a "pointer" to a file somewhere else in a file system. Unlike hard links, symbolic links can reference files in other file systems. Symbolic links are created using the same ln(1) command with the -s command flag.

sunsys> ln -s files/files.p3 flink
sunsys> ls -li flink
  5651 lrwxrwxrwx   1 jim      staff         14 Mar  3 22:16 flink -> files/files.p3
sunsys> ls -li files/files.p3
 11284 -rwxr-xr-x   1 jim      staff       3680 Mar  3 22:11 files/files.p3

In the above example, we create a symbolic link called flink that points to the file files/files.p3. Taking a look at flink, we see that the ls -li flink output properly identifies it as a symbolic link, with an "l" in the left-most column of the mode field. Also, the symbolic link has a different inode number (5651) than the file it is linked to, files.p3 (inode number 11284).
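The same distinction between the link and its target is visible through lstat(2) and stat(2). The sketch below assumes a POSIX system; the helper name symlink_demo is hypothetical and the paths mirror the listing above. Note that the size of flink in that listing, 14, is the length of the string files/files.p3, because the body of a symbolic link is the path name it stores.

```python
import os
import stat
import tempfile


def symlink_demo(workdir):
    """Create files/files.p3 and a symbolic link to it, then inspect both."""
    os.mkdir(os.path.join(workdir, "files"))
    target = os.path.join(workdir, "files", "files.p3")
    link = os.path.join(workdir, "flink")
    with open(target, "w") as f:
        f.write("data\n")

    # The link stores this relative path as a plain string.
    os.symlink("files/files.p3", link)

    link_info = os.lstat(link)      # lstat(2): describes the link itself
    target_info = os.stat(link)     # stat(2): follows the link to the target

    return {
        "is_symlink": stat.S_ISLNK(link_info.st_mode),
        "points_to": os.readlink(link),
        "link_size": link_info.st_size,   # typically the length of the stored path
        "different_inode": link_info.st_ino != target_info.st_ino,
        "target_is_regular": stat.S_ISREG(target_info.st_mode),
    }


if __name__ == "__main__":
    with tempfile.TemporaryDirectory() as scratch:
        print(symlink_demo(scratch))
```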

Solaris provides "fast" symbolic links. Remember, every file in the file system has a corresponding inode. The file's inode contains, among other things, two arrays used to point to the data blocks that contain the body of the file (the file's actual data). The first array, called ic_db, has 12 locations for storing disk block addresses. The second array, ic_ib, is for indirect addresses (blocks that contain pointers to blocks) and has three locations. Since a disk block address occupies four bytes, the combined space available in both arrays is (12 x 4) + (3 x 4) = 60 bytes. The two arrays are contiguous in the inode, so all 60 bytes are addressable from the first array's location. The actual size Solaris allows for a fast symbolic link is 56 bytes, which leaves room for a null terminator on the stored path name.

When a symbolic link is created, the system checks to see if the link size, the size of the character string that represents the link name (the pathname), is 56 bytes or less. If it is, the link name can be stored right in the inode in the space allocated for the disk block address arrays. This speeds up symbolic link references considerably because the kernel does not have to traverse a disk address pointer and fetch a disk block to get the link name -- it gets the name from the inode. The kernel maintains a flag field in the inode. One of the possible flag bits, IFASTSYMLNK, is used to indicate that the file described by the inode is a fast symbolic link. The file mode field in the inode, where the type of file is stored, will be IFLNK, which identifies the file type as a symbolic link. Remember, not all symbolic links are fast symbolic links; the pathname must meet the aforementioned size criteria. That's why the flag is needed.
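The size arithmetic and the fast-link test can be written out in a few lines. This is only an illustration of the numbers quoted in this column, not the kernel's code; all constant and function names below are hypothetical.

```python
DIRECT_SLOTS = 12        # entries in the direct block address array
INDIRECT_SLOTS = 3       # entries in the indirect block address array
ADDRESS_BYTES = 4        # size of one disk block address
FAST_SYMLINK_MAX = 56    # fast symbolic link limit quoted in the text


def address_array_bytes():
    """Space occupied by both address arrays in the inode."""
    return (DIRECT_SLOTS * ADDRESS_BYTES) + (INDIRECT_SLOTS * ADDRESS_BYTES)


def could_be_fast_symlink(target):
    """True if the link's path name is within the quoted 56-byte limit."""
    return len(target.encode()) <= FAST_SYMLINK_MAX


if __name__ == "__main__":
    print(address_array_bytes())
    print(could_be_fast_symlink("files/files.p3"))
```

The link from the earlier example, files/files.p3, is 14 bytes long and so falls well within the limit.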

Symbolic links are used extensively by the core Solaris system in the device name space. All the files under the /dev directory are actually symbolic links to the character and block device special files located in the /devices directory. Symbolic links have other practical uses. We use them often in database installations. Creating symbolic links to the actual devices that hold the database objects allows for using descriptive names (e.g. /oracle/dbdevices/index1_dev) to describe what's actually on the device. It also makes it easier to make changes to the underlying storage without having to recreate new device names for the database. You simply tell the database about the link names and change what the link points to if you need to alter the database storage (e.g. move some data from a RAID 5 volume to a RAID 1+0 volume).

That concludes our discussion of file types for now. The remaining file types will be discussed in future columns that focus on each type and its associated file system. Sockets, for example, are now implemented on sockfs, the socket file system. Named pipes, or FIFOs, are implemented on fifofs and use the kernel STREAMS infrastructure.


Bits is bits
Let's now turn our attention to another aspect of files in Solaris that often raises questions: access modes, file flags, and the sticky bit.

File modes are maintained in a field in the file's inode, generically referred to as the mode field, which uses various bit combinations to identify the type of file, the access permissions to the file, and other miscellaneous attributes (see Figure 1). This is why, when we do a long file listing (ls -l command), the file type, permissions, set uid, set gid, and sticky bit are all grouped together in the far left-hand column of the ls -l output. All this information is maintained in the same inode structure member, so ls simply walks through the bits it gets back from the kernel, displaying the appropriate information based on which bits are set and which are not.

The kernel manipulates file attributes at a higher level of abstraction, one just "above" the inode -- the vnode. Recall from our previous column that Solaris implements a vnode layer above the file-system specific interfaces and data structures. The kernel maintains a vnode for every file in the system, and pretty much every file attribute maintained in the inode gets mapped to a corresponding location in the vnode. The vnode code defines a vattr (vnode attributes) data structure, which has members for information such as the file access modes, uid, gid, etc. File attribute operations use this vattr structure for getting and setting attributes, and changes get pushed to the appropriate underlying file-system specific code. In the case of a UFS file, ufs_setattr() (set attributes) ensures inode updates are done.

File Type                  Character
Named Pipe (FIFO)          p
Character Device Special   c
Directory                  d
Block Device Special       b
Regular                    -
Symbolic Link              l
Door (2.6 and beyond)      D
Socket (2.6 and beyond)    s

We already discussed file types -- what the possible file types are and how they are represented in the left-most column of a long listing. The kernel uses four bits of the 16-bit mode field for this: bits 12 through 15 describe the type of file (the table above lists the possible file types and the characters used to describe them). The remaining 12 bits define the access modes to the file, the setuid and setgid execute behavior, and the sticky bit. The access modes are pretty straightforward, so we'll review them briefly here.
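To make the bit layout concrete, here is a small Python sketch that splits a mode value into the file type (bits 12 through 15) and the remaining 12 bits. It uses the constants from Python's standard stat module; the function name split_mode and the lookup table are hypothetical, and doors are left out because their type value is platform-specific.

```python
import stat

# File type values (bits 12-15) mapped to the character ls -l prints.
TYPE_CHARS = {
    stat.S_IFIFO: "p",
    stat.S_IFCHR: "c",
    stat.S_IFDIR: "d",
    stat.S_IFBLK: "b",
    stat.S_IFREG: "-",
    stat.S_IFLNK: "l",
    stat.S_IFSOCK: "s",
}


def split_mode(mode):
    """Return (type character, low 12 bits) for a 16-bit mode value."""
    file_type = mode & 0o170000     # mask for bits 12 through 15
    low_bits = mode & 0o7777        # permissions, setuid, setgid, sticky
    return TYPE_CHARS.get(file_type, "?"), low_bits


if __name__ == "__main__":
    type_char, low_bits = split_mode(0o100644)
    print(type_char, oct(low_bits))
```

Passing the st_mode value returned by os.stat() or os.lstat() to split_mode gives the same breakdown for a real file.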

For a given file, regardless of the type, we can potentially do one of three things: read the file, write the file, or execute the file. Permissions are maintained for three classes of user: the owner of the file, members of the group the file belongs to, and everyone else (typically referred to as "other" or "world," meaning the rest of the world). Every user on a Solaris system has a user ID (uid) and group ID (gid) that are established when the user's account is created. This information is referred to as the user's credentials, and is maintained in a data structure, the credential structure, that is embedded in the process structure. Remember, every operation you attempt on a file is initiated from within a process, either the user's shell or something executed from it.

The credential structure has fields for a user's real uid and gid, and effective uid and gid. An effective ID comes into play if a user executes a file that has either setuid or setgid bits set, which effectively alter the uid or gid of the user during the execution of the file (more on this in a few minutes). When a user attempts to do something on a file, the user's ID credentials are checked against the file permissions. If the operation is allowed, the system performs the requested I/O (read or write) or executes the file.

The behavior for directory files is a little different than it is for regular files. Read and write permissions allow you to read the contents of a directory and create files in a directory, respectively. Write also allows for deleting files in a directory. Execute permissions affect a user's ability to cd(1) to the directory or access the files in the directory for the purposes of reading, writing, or execution. For example, if permissions were such that you had read and write on a directory, but not execute, you could list the files in the directory (ls directory_name), but you could not make the directory your current working directory, which is what happens when you cd directory_name. Simply put, without execute permissions to a directory, all you can really do (potentially) is list the files.

The kernel checks file access in a hierarchical fashion when a user attempts a file operation. First, a check is made to determine if the effective uid of the process is root. If it is, access is granted. Next, if the effective uid of the process matches the uid of the file owner, the owner mode bits are checked to determine whether the attempted operation is allowed. If the uid does not match, the gid of the process and its supplementary gids (other groups that the user belongs to) are matched against the group the file belongs to. If there's a match, the group mode bits are checked to determine if the operation is allowed. Finally, if the process is not root and there is no uid or gid match, the system checks the mode bits for the rest of the world (other) and determines whether or not to allow the operation.
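The order of those checks matters: exactly one class of mode bits is consulted. The following Python sketch models the sequence as described here; it is a simplification for illustration, not the kernel's implementation, and the function and parameter names are hypothetical.

```python
READ, WRITE, EXECUTE = 4, 2, 1


def access_allowed(euid, egid, groups, file_uid, file_gid, mode, want):
    """Model the root / owner / group / other check order.

    want is a bitmask built from READ, WRITE, and EXECUTE.
    groups holds the supplementary gids of the process.
    """
    if euid == 0:                                   # root: always granted
        return True
    if euid == file_uid:                            # owner bits only
        granted = (mode >> 6) & 7
    elif egid == file_gid or file_gid in groups:    # group bits only
        granted = (mode >> 3) & 7
    else:                                           # everyone else
        granted = mode & 7
    return (granted & want) == want


if __name__ == "__main__":
    # Owner of a mode 0640 file asking for read and write access.
    print(access_allowed(100, 10, [], 100, 10, 0o640, READ | WRITE))
```

One consequence of checking only one class: with a mode of 0070, the file's owner is refused even if the owner is also a member of the file's group, because the owner bits are consulted and the group bits are never reached.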

Files in Solaris support an additional security feature, access control lists, or ACLs. ACLs provide a method of defining finer-grained permissions for access and execution of files, above and beyond the traditional mode bits discussed above. We will cover the implementation of ACLs in Solaris in part four next month.

The setuid and setgid bits let a user execute a file that requires permissions the user does not normally have. The most common example is the Solaris passwd(1) command. In order for users to change their passwords, they must be able to write to the /etc/passwd and /etc/shadow files. Obviously, it would be a serious security hole to leave these files writeable by anyone. For this reason, the /usr/bin/passwd command is a setuid and setgid file, which means that when the command is executed, the effective uid and gid of the process become those of the file, which in this case are root (user) and sys (group). Thus, the setuid and setgid mode bits tell the system to alter the effective uid and gid of the user executing the file to those of the file they wish to execute. Once execution has completed, the user's uid and gid are restored to whatever they were prior to executing the file. Note that in order for the file owner to set the setuid or setgid mode bits, the file must be executable first (the execute mode must be true for the corresponding setid bit: user execution for setuid and group execution for setgid).

As I said, there are 16 bits (two bytes) in the mode field, but a long listing via ls -l displays only nine permission characters after the file type. Therefore, depending on the file mode, the system will alter what appears in a particular position in the output. For setuid and setgid, a lowercase s appears where the execute bit is normally displayed. For executable files that are not setuid or setgid, the familiar lowercase x appears.

Here's another example of using setuid (the lefthand column represents line numbers used in the annotation that follows).

1  fawlty> setfdlimit 10000
2  WARNING: setfdlimit open files may not work as expected
3  Setting Open FDs to 10000
4  setfdlimit[17]: ulimit: exceeds allowable limit
5  64 open files set
6  fawlty> su
7  Password: 
8  # chown root setfdlimit
9  # chmod 4755 setfdlimit
10 # ls -l setfdlimit
11 -rwsr-xr-x   1 root     tech         312 Mar 17 16:54 setfdlimit
12 # exit
13 fawlty> setfdlimit 8192
14 Setting Open FDs to 8192
15 8192 open files set
16 fawlty> setfdlimit 10000
17 WARNING: /bin/ksh open files may not work as expected
18 Setting Open FDs to 10000
19 10000 open files set
20 fawlty> 

setfdlimit is a shell script that does some rudimentary checking and is used to increase the number of open files a user can have. In order to raise the hard limit in Solaris, one needs to be root. When I tried executing the program when it was owned by me (user jim), I could not set my file descriptor limit to 10000 (lines 1 to 5). So, I made myself root (lines 6 to 7), changed the owner of the file to root (line 8), and changed the mode bits to read and execute for the world and group, read, write, execute for the owner, and the setuid bit (line 9). Next I checked the mode of the setfdlimit file and verified the modes I set were in fact true for the file (lines 10 to 11). I became user jim again (line 12), and re-ran the program (lines 13 to 15). This time it worked, as I successfully bumped my fds to 8192. I tried it again (lines 16 to 19) and was able to bump fds up to 10000.

These are the types of applications that make setuid and setgid useful. However, it should be obvious that setuid programs that make the user "root" have security implications, and thus should be written and implemented with care. Note also that in this example I used the octal notation for the mode bits I wished to set for the file. The chmod(1) command also accepts symbolic modes (for example, u+s to add the setuid bit) as command line arguments, for those users not familiar with octal representation.
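For readers less comfortable with octal, the following Python sketch breaks a mode such as the 4755 used in the transcript into its parts and renders it the way a long listing would. It relies on the standard stat module's filemode() function; the helper name describe_mode is hypothetical, and the mode is treated as belonging to a regular file.

```python
import stat


def describe_mode(octal_mode):
    """Break a chmod-style octal mode into its special bits and ls-style string."""
    mode = stat.S_IFREG | octal_mode           # treat it as a regular file
    return {
        "listing": stat.filemode(mode),        # same layout as ls -l
        "setuid": bool(mode & stat.S_ISUID),   # octal 4000
        "setgid": bool(mode & stat.S_ISGID),   # octal 2000
        "sticky": bool(mode & stat.S_ISVTX),   # octal 1000
    }


if __name__ == "__main__":
    print(describe_mode(0o4755))
    print(describe_mode(0o755))
```

The leading 4 in 4755 is the setuid bit, and the remaining 755 gives read, write, and execute to the owner and read and execute to group and world.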

The mode bit for setgid, bit 10 (see Figure 1), plays a dual role. If bit 10 is set, and the group execute bit is also set (bit 3), bit 10 provides setgid behavior as described above. However, if bit 10 is set and bit 3 (group execute) is clear, this means that mandatory locking has been set on the file. Solaris supports two methods of file locking for regular files, advisory and mandatory. Advisory locking requires that the developer writing the file I/O code uses the proper interfaces for setting and getting file locks. This is typically done with the fcntl(2) system call. It is up to the program to follow the rules of checking for locks prior to issuing a read or write to the file. It's analogous to using mutex locks in multithreaded code; the kernel does not prevent a process from grabbing a shared resource that is locked -- the code must explicitly check for the locks. The same applies for advisory file locks.
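The dual role of the setgid bit comes down to a two-bit test. Here is a minimal Python sketch for regular files, using the standard stat constants; the function name setgid_meaning is hypothetical, and the interpretation follows the description in this column.

```python
import stat


def setgid_meaning(mode):
    """Interpret the setgid bit (bit 10) of a regular file's mode."""
    if not mode & stat.S_ISGID:
        return "not set"
    if mode & stat.S_IXGRP:            # group execute (bit 3) also set
        return "setgid"
    return "mandatory locking"         # bit 10 set, group execute clear


if __name__ == "__main__":
    for mode in (0o2755, 0o2644, 0o644):
        print(oct(mode), setgid_meaning(mode))
```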

Mandatory locking is enforced by the kernel. When the UFS read or write code is entered, the file mode is checked to determine if mandatory locking has been set for the file. If it has, a chklock() kernel routine is called to determine if the requested file operation is allowable. The kernel will return an EAGAIN error to read(2), write(2), or fcntl(2) if the requested I/O is not allowed due to mandatory locking (this is not a complete list of all the possible system calls that could result in such an error return).

The implementation of file locking in Solaris is basically driven by the fcntl(2) system call. Solaris provides for shared or exclusive locks, and the user defines how much of the file to lock by establishing a starting offset value in the file and some number of bytes from the offset. The programmer must set up an flock data structure to be passed in the fcntl(2) system call:

struct flock {
        short   l_type;     /* lock operation type */
        short   l_whence;   /* lock base indicator */
        off_t   l_start;    /* starting offset from base */
        off_t   l_len;      /* lock length; l_len == 0 means
                               until end of file */
        long    l_sysid;    /* system ID running process holding lock */
        pid_t   l_pid;      /* process ID of process holding lock */
};

It's up to the programmer to set the lock type, F_RDLCK for shared or F_WRLCK for exclusive in the l_type field. The l_whence field provides a method of having the l_start field offset from the beginning, current position, or end of the file. The l_len is the number of bytes from the offset to lock. Reference the fcntl(2) manual page for more information on doing file locking in code. A library routine, lockf(3C), is also available to use for file locking. It is simpler to use, but not as flexible as fcntl(2).
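As a rough illustration of advisory record locking, the Python sketch below takes an exclusive lock on a byte range and then has a second process try to lock the same and a different range. It assumes a POSIX system. Python's fcntl.lockf() is built on the fcntl(2) locking calls, so the start, length, and whence arguments play the roles of l_start, l_len, and l_whence; the helper names are hypothetical.

```python
import fcntl
import os
import tempfile


def try_exclusive_lock(fd, start, length):
    """Try a non-blocking exclusive lock on [start, start + length)."""
    try:
        fcntl.lockf(fd, fcntl.LOCK_EX | fcntl.LOCK_NB, length, start, os.SEEK_SET)
        return True
    except OSError:                     # range is locked by another process
        return False


def other_process_can_lock(path, start, length):
    """Fork a child that attempts the lock; report whether it succeeded."""
    pid = os.fork()
    if pid == 0:                        # child: its locks are its own
        status = 2
        try:
            fd = os.open(path, os.O_RDWR)
            status = 0 if try_exclusive_lock(fd, start, length) else 1
        finally:
            os._exit(status)
    _, wait_status = os.waitpid(pid, 0)
    return os.WEXITSTATUS(wait_status) == 0


if __name__ == "__main__":
    with tempfile.NamedTemporaryFile() as tmp:
        tmp.write(b"x" * 100)
        tmp.flush()
        print(try_exclusive_lock(tmp.fileno(), 0, 10))   # parent locks bytes 0-9
        print(other_process_can_lock(tmp.name, 0, 10))   # overlapping range
        print(other_process_can_lock(tmp.name, 10, 10))  # disjoint range
```

Because these locks are advisory, the second process is refused only because it asks for a lock; a process that simply called write(2) on the locked range would not be stopped.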

What's the sticky bit?
Finally, we get to the sticky bit, which is bit 9 in the inode's file mode field. The sticky bit originated as a means of telling the operating system to keep the text portion of an executable file available on the swap device if it had to be moved out of physical memory due to space constraints. It provided a method of keeping commonly used executable text on the faster swap device so the system could retrieve the pages faster if they were needed again after being paged out.

The sticky bit in Solaris affects different file types in different ways. For simple text files, it basically has no effect. For directories, it provides an additional level of security against non-root users who do not own the directory. If the directory has the sticky bit set, a user with write permission to the directory must also own the directory, own the target file, or have write permission to the target file in order to remove or rename a file in it. Without the sticky bit set on the directory, a user can remove or rename a file in the directory even if they do not own the file or have write permission to it, as long as they have write permission to the directory.

Put another way, removing a file from a directory requires the user to have write permission for the directory, but not necessarily write permission to, or ownership of, the underlying file. With the sticky bit set for the directory, the user would not be allowed to remove the file under the same set of conditions unless the user owned the directory, owned the target file, or had write permissions to the target file.
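The rule can be stated as a short decision function. The Python sketch below models the conditions as described in this column; it is an illustration rather than the kernel's logic, and the function and parameter names are hypothetical.

```python
def may_remove(uid, dir_writable, dir_sticky, dir_owner, file_owner, file_writable):
    """Decide whether uid may remove or rename a file in a directory.

    dir_writable and file_writable say whether uid has write permission
    to the directory and to the target file, respectively.
    """
    if uid == 0:                 # root is not restricted
        return True
    if not dir_writable:         # write permission on the directory is required
        return False
    if not dir_sticky:           # no sticky bit: directory write is enough
        return True
    # Sticky bit set: one more condition must hold.
    return uid == dir_owner or uid == file_owner or file_writable


if __name__ == "__main__":
    # A world-writable sticky directory owned by root, file owned by uid 100.
    print(may_remove(200, True, True, 0, 100, False))   # another user
    print(may_remove(100, True, True, 0, 100, False))   # the file's owner
```

This is the arrangement that makes a shared, world-writable directory safe for temporary files: any user can create files there, but cannot remove files belonging to someone else.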

As far as the sticky bit and executable files go, Solaris no longer implements the traditional "save-text-in-swap" functionality that originated with the sticky bit. With the natural evolution of the virtual memory system design, much faster disks, shared object libraries, and memory page locking capability, this type of functionality is somewhat dated.

Solaris does, however, use the sticky bit for one other case. For swap files that are created using the mkfile(1M) command, which sets the sticky bit on the file, Solaris sets a flag in the vnode that identifies the file as a swap file. This flag instructs the kernel not to cache this file in the page cache (because it's a swap file, it's already in memory).

Readers should reference the chmod(1) and chmod(2) manual pages for information on how to set file modes and descriptions of the various mode fields.

That's it for part three. We'll continue with files and file systems next month. On the short list for coverage are 64-bit files, flags, ACLs, FIFOs, sockets and sockfs, and a detailed discussion of UFS (not necessarily in that order!).

Thanks for staying with us on what I originally thought would be a one- or maybe two-part topic.


About the author
Jim Mauro is currently an area technology manager for Sun Microsystems in the Northeast area, focusing on server systems, clusters, and high availability. He has a total of 18 years industry experience, working in service, educational services (he developed and delivered courses on Unix internals and administration), and software consulting. Reach Jim at

[(c) Copyright  Web Publishing Inc., and IDG Communication company]
