Inside Solaris by Jim Mauro

Fiddling around with files, part one

This month, we address one of the most frequently asked questions about Solaris, "How many files can I have open?" We'll examine the file interfaces in Solaris, how the kernel manages open files, and what the limits are and why they exist.

February  1998

Everyone who touches a Solaris system (or any Unix system) does file I/O, either implicitly as an end user, or explicitly as an application developer, systems programmer, or systems administrator. The ubiquitous nature of file I/O in Solaris creates a constant flurry of questions about dealing with files. How many files can I have open at any point in time? Per user? Systemwide? How do file permissions work? What does the "sticky bit" do? This month, we'll get inside the implementation of files in Solaris with an eye to answering the above questions and more. (2,600 words)


From its inception, Unix has been built around two fundamental entities: processes and files. Everything that is executed on the system is a process, and all process I/O is done to a file. File I/O takes the form of intuitive read and write operations to a file in the file system, and more abstract forms, as in the implementation of device special files (for I/O to hardware devices) and the process file system (ProcFS), which allows for reading and writing process information using file-like interfaces and semantics.

Solaris, as with other modern implementations of Unix, follows this convention, employing traditional file abstractions for the purpose of reading and writing files and doing device I/O. Some things haven't changed much over the years, such as the notion of file modes, where permissions to read, write, and/or execute a file are defined. A subset of the file mode, called the sticky bit, has changed only somewhat over time, while other things have changed a great deal. For example, the types of files known within a file system have increased from the original regular, directory, and device special files to include file types such as named pipes, symbolic links, and sockets.


Kernel file table and open file limits
There are several things that can potentially affect the per-process limit and the number of open files Solaris can maintain at any one time. We'll start with a system view, looking at the kernel file table.

Every opened file in a Solaris system occupies a slot in the system file table, defined by the kernel file structure:

typedef struct file {
        kmutex_t        f_tlock;        /* short term lock */
        ushort_t        f_flag;
        ushort_t        f_pad;          /* Explicit pad to 4 byte boundary */
        struct vnode    *f_vnode;       /* pointer to vnode structure */
        offset_t        f_offset;       /* read/write character pointer */
        struct cred     *f_cred;        /* credentials of user who opened it */
        caddr_t         f_audit_data;   /* file audit data */
        int             f_count;        /* reference count */
} file_t;

The fields maintained in the file structure are, for the most part, self-explanatory. The f_tlock kernel mutex lock protects the various structure members. These include the f_count reference count, which tracks how many references exist to the file structure, and the f_flag file flags, which maintain information such as file locking (read and write locks for application programs, used to manage I/O to the same file by multiple threads or processes). It also protects the close-on-exec flag, which instructs the system to close the file when the process executes an exec(2) system call. (We'll be covering file flags in more detail later in the column. You can reference the fcntl(2), fcntl(5), and open(2) man pages for additional information on file flags.)

Early versions of Unix implemented the system file table as a static array, initialized at boot time to some predetermined size. If the number of open files systemwide exceeded the size of the file table, file opens would fail. When this happened, the administrator had to increase the tunable parameter that defined the file table size, build a new kernel, and reboot the system.

Solaris implements kernel tables as dynamically allocated linked lists. Rather than populate and free slots in a static array, the kernel allocates file structures for opened files as needed, and thus grows the file table dynamically to meet the requirements of the system load. Therefore, from a pure systemwide standpoint, the maximum size of the kernel file table, or the maximum number of files that can be opened systemwide at any point in time, is limited by available kernel address space and nothing more. The actual size the kernel address space can grow to depends on the hardware architecture of the system, as well as the version of Solaris the system is running. Table 1 provides a matrix of the kernel address space limits for a given hardware architecture and Solaris release. By way of a data point, I've never seen Solaris run out of kernel address space due to file table growth.

Table 1: Kernel address space limits

                         sun4m    sun4d    sun4u
Pre-Solaris 2.5/2.5.1    128MB    256MB    N/A
Solaris 2.5/2.5.1        256MB    576MB    4GB
Solaris 2.6              256MB    576MB    4GB

The Solaris kernel actually implements kernel memory allocation through an object caching interface. This was first introduced in Solaris 2.4 in order to provide a faster and more efficient method of allocating and de-allocating kernel memory for objects that are frequently allocated and freed. File table entries are a good example of this because the volume and rate of file opens and closes can be quite high on a large production server. A detailed explanation of the kernel memory allocator and object cache interface is beyond the scope of this month's column. We'll address it in a future installment.

The system initializes the file table during startup, by calling a routine in the kernel memory allocator code that creates a kernel object cache. Basically, the initialization of the file table is really the creation of an object cache for file structures. The file table will have several entries in it by the time the system has completed the boot process and is available for users, as all of the system processes that get started have some opened files. As files get opened/created, the system will either reuse a freed cache object for the file table entry or create a new one if needed. You can use /etc/crash to examine the file table (as root):

# /etc/crash
dumpfile = /dev/mem, namelist = /dev/ksyms, outfile = stdout
> file
ADDRESS   RCNT   TYPE/ADDR          OFFSET   FLAGS
f5951008   1    SPEC/f609246c          0   read write
f5951030   1    FIFO/f5ecb800          0   read write
f5951058   1    SPEC/f60589b4          0   read write

I didn't include the entire listing because it's quite long. ADDRESS is the kernel virtual memory address of the file structure. RCNT is the reference count field (f_count). TYPE is the type of file, and ADDR is the kernel virtual address of the vnode. OFFSET is the current file pointer, and FLAGS are the flags bits currently set for the file.

You can use sar(1M) for a quick look at how large the file table is:

# sar -v 1

SunOS sunsys 5.6 Generic sun4m    01/13/98

22:46:51  proc-sz    ov  inod-sz    ov  file-sz    ov   lock-sz
22:46:52   66/4058    0 2035/17564    0  407/407     0    0/0   

In this example, I have 407 file table entries. The format of the sar output is a holdover from the early days of static tables, which is why it is displayed as 407/407. Originally, the value on the left represented the current number of occupied table slots, and the value on the right represented the maximum number of slots. Since the file table is completely dynamic in nature, both values will always be the same.

That pretty much addresses the maximum number of systemwide open files question. Now we'll get into the per-process limits and what they're all about.

Per-process open file limits
Every process on a Solaris system has a process table entry with an embedded data structure, historically called the user area or uarea. Each process maintains its own list of open files with a pointer rooted in the uarea called u_flist. The u_flist pointer points to uf_entry structures, which look like this:

struct uf_entry {
        struct file *uf_ofile;          /* pointer to system file table entry */
        short  uf_pofile;               /* per-descriptor state flags */
        short  uf_refcnt;               /* per-process reference count */
};
The uf_entry structure contains a pointer to the file structure (uf_ofile) and a uf_pofile flag field used by the kernel to maintain state information that the operating system needs to be aware of. The possible flags are FRESERVED, to indicate that the slot has been allocated; FCLOSING, to indicate that a file close is in progress; and FCLOSEXEC, which is the user-settable close-on-exec flag. It also maintains a reference count in the uf_refcnt member. Remember, Solaris is a multithreaded operating system that supports multithreaded applications that can run on multiprocessor systems. Multiple threads in a process can be referencing the same file, which makes a per-process reference count handy. For example, say you need to prevent one thread from closing a file that another thread is reading; the state flags, used in conjunction with mutex locks (which protect the setting of the state flags), ensure that multiple threads hitting the same kernel code path do not interfere with one another.

The u_flist pointer is implemented as an array pointer, where the open file list is an array of uf_entry structures. The kernel allocates the required structures dynamically, as processes open files. A process uses one uf_entry structure for each open file, and the kernel does allocation in increments of 24. When a process first comes into existence, the number of available uf_entry structures is set to 24. When the available entries all get used, subsequent allocations are done in chunks of 24. Since the uf_entry structure is only 8 bytes in size, virtual address space limits are definitely not a concern here. Consider, for example, that even 124,000 open files would only require 1 megabyte of process VA space to store the open file list. At a system level, each file structure is 40 bytes, so 124,000 open files would require about 5 megabytes of kernel address space for the system file table. Figure 1 shows the big picture.

So what we have is an implementation where structures for open files are allocated dynamically both systemwide and per process. Within the context of processes, the limitation that exists comes in the form of resource limits. Resource limits exist for several system resources that processes use when they execute. They can be displayed with the ulimit(1) command if you're using the Korn shell (/usr/bin/ksh) or Bourne Shell (/usr/bin/sh). Use limit(1) if you're using the C shell (/bin/csh).

sunsys> ulimit -a
time(seconds)        unlimited
file(blocks)         unlimited
data(kbytes)         2097148
stack(kbytes)        8192
coredump(blocks)     unlimited
nofiles(descriptors) 64
vmemory(kbytes)      unlimited

The above example was executed under the ksh. The limit we're interested in is displayed as nofiles(descriptors), which has a default value of 64 open files per process.

The system establishes default values and maximum values for each of the resources controlled by the resource limits. For each of the seven resource limits defined, there is an rlim_cur (current resource limit), which is the default value, and an rlim_max (maximum resource limit), which is the system-imposed maximum value for the resource. In Solaris 2.5.1 and 2.6, the hard limit value set by the system is 1,024. You can reset the limits in the /etc/system file as follows:

set rlim_fd_cur = 1024
set rlim_fd_max = 8192

If you put the above in the /etc/system file, users will have a default of 1k open files and be able to set their own limit up to 8k. If you're superuser (root), you can go beyond the max limit to some arbitrarily high value.

Here is some real output to drive this point home (note the annotations between /* and */):

fawlty> ulimit -n		/* System name is "fawlty." Check open 
				file limit with ulimit(1) */
64				/* Output from ulimit, default value of 64 */
fawlty> of			/* Run a program called "of," which is a `C'   
    				program that opens files in a loop until the
				open(2) call returns an error. It counts the
				number of successful opens, and outputs that
      				value */
61 File Opens			/* Output from "of" program. 61 files opened */
fawlty> ulimit -n 1024		/* Change my open file limit to the max, 1024 */
fawlty> of			/* Run the "of" program again */
1021 File Opens			/* 1021 files opened this time */
fawlty> su			/* Make myself root */
# ulimit -n			/* Check the open file limit at root */
1024				/* It's 1k */
# ulimit -n 10000		/* Set my open file limit to 10k open files */
# ulimit -n			/* Check it, to make sure */
10000				/* Looks good */
# ./of				/* Run "of" as root */
9997 File Opens			/* 9,997 file opens */
# ulimit -n 100000		/* Let's try 100k */
# ulimit -n			/* Took it */
# ./of				/* Run the "of" program */
99997 File Opens		/* Son-of-a-gun. 100k open files. */

As illustrated, we can use the appropriate shell command to tweak our open file limit to either the max limit for non-root users or some arbitrarily high value as root. The "supported" limits are up to 1k open files per user in 2.5.1 and 64k open files in Solaris 2.6. The example above was done on a 2.6 system, but I ran the same test on Solaris 2.5.1, and it was successful. An alternative method of manipulating resource limits is using the C language interfaces. Solaris provides two system calls, setrlimit(2) and getrlimit(2) for setting and getting resource limits programmatically. See the man pages for more.

In case you're curious, the reason the number of open files was always the maximum value minus three is that every process has three open files as soon as it comes into existence: stdin, stdout, and stderr (standard input, output, and error), which represent the input, output, and error output files for the process. Also, for the non-programmers among you, here's what the of program source code looks like:

#include <sys/types.h>
#include <sys/stat.h>
#include <fcntl.h>
#include <stdio.h>
#include <unistd.h>

int
main(void)
{
	int count = 0, err = 0;

	/* Keep opening the same file until open(2) fails, counting
	 * the successful opens. */
	while (err != -1) {
		err = open("/dev/null", O_RDONLY);
		if (err != -1)
			count++;
	}
	printf("%d File Opens\n", count);
	return (0);
}

So the rule of thumb here is that on a per-process basis, a process can only open up to rlim_fd_cur files, whatever that value is. You can determine that by using the ulimit(1) or limit(1) command, depending on which shell you're using. As root, you can bypass the resource limit check and set the per-process open file limit to, in theory, roughly 2 billion (the maximum value of a signed 32-bit integer). Obviously, you would never get anywhere near that amount because you would run out of process virtual address space for the per-process file structures (uf_entry) that you need for every file you open. Fortunately, we never encounter situations where a requirement exists that needs that many files opened. We do, of course, see installations more and more where per-process open files go into the thousands, and even tens of thousands. The support issues can get tricky here, so I always suggest working with your local service organization if you have a production environment that requires more than the documented limits.

We now have an understanding of the main structures involved in open files and what the open file resource limit is all about, so this seems like a good place to break. We'll finish up next month. In March, we'll cover how the API you use can affect your open file limits. We'll also get into file flags and talk a little about the sticky bit.

See you next month.


About the author
Jim Mauro is currently an area technology manager for Sun Microsystems in the Northeast area, focusing on server systems, clusters, and high availability. He has a total of 18 years industry experience, working in service, educational services (he developed and delivered courses on Unix internals and administration), and software consulting. Reach Jim at


[(c) Copyright Web Publishing Inc., an IDG Communications company]
