Inside Solaris by Jim Mauro

Asynchronous I/O and large file support in Solaris

Delve into asynchronous I/O facilities and 64-bit file support in Solaris 2.6 and beyond

July  1998

Asynchronous I/O interfaces have been available in Solaris for some time, providing a means by which applications could issue I/O requests and not have to "block" or cease working until the I/O was completed. 64-bit file support was added to the asynchronous I/O interfaces before the full-blown large file support that came with Solaris 2.6.

With Solaris 2.6, file sizes in Solaris are no longer limited to a maximum size of 2 gigabytes. In compliance with the specifications established by the Large File Summit, a number of changes have been made in the kernel, including extensions to the file APIs and shell commands for the implementation of large files.

This month, Jim examines the asynchronous I/O facilities and 64-bit file implementation in Solaris. (4,200 words)


The first 64-bit file I/O interfaces found their way into Solaris in the 2.5.1 release, with the introduction of just two new read and write APIs: aioread64(3) and aiowrite64(3), which are extended versions of the aioread(3) and aiowrite(3) asynchronous I/O (aio) routines that have been available in Solaris for some time. The goal was to provide relational database vendors a facility for asynchronous (async) I/O on raw devices that weren't limited to 2 gigabytes in size. Since a vast majority of Sun servers run some form of database application, it made sense to get 64-bit file support out the door early for these applications.

Async I/O routines provide the ability to do real asynchronous I/O in an application. This is accomplished by allowing the calling process or thread to continue processing after issuing a read or write and receive notification either upon completion of the I/O operation, or of an error condition that prevented the I/O from being completed. Notification is possible because the routine calling either aioread(3) or aiowrite(3) must pass, as one of its arguments, a pointer to an aio_result structure. The aio_result structure has two members: aio_return and aio_errno. The system uses these to set the return value of the call or, in the case of an error, the errno, or error number. From /usr/include/sys/aio.h:

typedef struct aio_result_t {
        int aio_return;         /* return value of read or write */
        int aio_errno;          /* errno generated by the IO */
} aio_result_t;

Two different sets of interfaces exist to do async I/O in Solaris: The aforementioned aioread(3) and aiowrite(3) routines, and the POSIX-equivalent routines, aio_read(3R) and aio_write(3R), which are based on the POSIX standards for realtime extensions. Realtime applications must, by definition, deal with an unpredictable flow of external interrupt conditions that require predictable, bounded response times. In order to meet that requirement, a complete non-blocking I/O facility is needed. This is where asynchronous I/O comes in, as these interfaces can meet the requirements of most realtime applications. The POSIX and Solaris asynchronous I/O interfaces are functionally identical. The real differences exist in the semantics of using one interface or the other. This month's column will provide information that is applicable to both sets of interfaces.
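To make the POSIX flavor of the interface concrete, here is a minimal sketch of an asynchronous read built on aio_read(), aio_suspend(), aio_error(), and aio_return(). The helper name read_async() is hypothetical, and the sketch waits for the request right away, where a real application would do useful work in between. On some systems the program must be linked with a realtime library (for example, -lrt).

```c
#include <aio.h>
#include <errno.h>
#include <fcntl.h>
#include <string.h>
#include <sys/types.h>
#include <unistd.h>

/*
 * Read up to len bytes from path, starting at offset, using POSIX
 * async I/O. Returns the number of bytes read, or -1 with errno set.
 * read_async() is a hypothetical helper name, not a library routine.
 */
ssize_t
read_async(const char *path, void *buf, size_t len, off_t offset)
{
	struct aiocb cb;
	const struct aiocb *list[1];
	ssize_t n;
	int fd, err;

	if ((fd = open(path, O_RDONLY)) == -1)
		return (-1);

	memset(&cb, 0, sizeof (cb));
	cb.aio_fildes = fd;
	cb.aio_buf = buf;
	cb.aio_nbytes = len;
	cb.aio_offset = offset;

	/* Queue the read; this call does not wait for the data. */
	if (aio_read(&cb) == -1) {
		err = errno;
		(void) close(fd);
		errno = err;
		return (-1);
	}

	/* ... the caller is free to do other work here ... */

	/* Sleep until the request is no longer in progress. */
	list[0] = &cb;
	while ((err = aio_error(&cb)) == EINPROGRESS)
		(void) aio_suspend(list, 1, NULL);

	/* Collect the final status, much like reading aio_return. */
	n = aio_return(&cb);
	(void) close(fd);
	if (err != 0) {
		errno = err;
		return (-1);
	}
	return (n);
}
```

A caller would invoke it as read_async("/some/file", buf, sizeof (buf), 0) and test the result against -1 before using the buffer.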

Asynchronous I/O is implemented using the lwp (light-weight process) system calls, which are the lower level implementation of the user-level threads library. Multithreaded applications can be developed in Solaris using either Solaris threads (e.g., thr_create(3T) to create a new thread within a process) or POSIX threads (e.g., pthread_create(3T) to create new threads). Both the Solaris and POSIX thread interfaces are library routines that do some basic housekeeping functions in user-mode before entering the kernel through the system call interface. The system calls that ultimately get executed for threads are the _lwp_xxxx(2) routines (for example, thr_create(3T) and pthread_create(3T) both enter the kernel via _lwp_create(2), the lower-level interface). It's possible to use the _lwp_xxxx(2) calls directly from your program, but these routines are more difficult to use and they break code portability. (That's why we have library routines.) (We'll be covering the topic of processes, threads, and lwps in Solaris in a future Inside Solaris column.)

Anyway, back to async I/O. As I said, the original implementation of the aioread(3) and aiowrite(3) routines creates a queue of I/O requests and processes them through user-level threads. When the aioread(3) or aiowrite(3) is entered, the system will simply put the I/O in a queue and create an lwp (a thread) to do the I/O. The lwp returns when the I/O is complete (or when an error occurs), and the calling process is notified via a special signal, SIGIO. It's up to you to put a signal handler in place to receive the SIGIO and take appropriate action, which minimally includes checking the return status of the read or write by reading the aio_result structure's aio_return value. As an alternative to the signal-based SIGIO notification, you have the option of calling aiowait(3) after issuing an aioread(3) or aiowrite(3). This will cause the calling thread to block until the pending async I/O has completed. There's a time-value that can be set and passed as an argument to aiowait(3) such that the system only waits for a specified amount of time.

While the threads library implementation of async I/O works well enough for many applications, it didn't necessarily provide optimal performance for applications that made heavy use of the async I/O facilities. Commercial relational database systems, for example, use the async I/O interfaces extensively. Overhead associated with the creation, management, and scheduling of user threads motivated the decision that an implementation that required less overhead and provided better performance and scalability was in order. A review of the existing async I/O architecture and subsequent engineering effort resulted in an implementation called kernel asynchronous I/O, or kaio.

Kaio first appeared in Solaris 2.4 (with a handful of required patches) and has been available, with some restrictions, in every Solaris release since. The restrictions have to do with which devices and software include kaio support and which ones don't. The good news, from an application standpoint, is that the question of whether or not there is kaio support for a given combination of storage devices, volume managers, and file systems is transparent: If kaio support exists, it will be used. If it doesn't, the original library-based async I/O will be used. Applications don't change in order to take advantage of kaio. The system figures out what is available and allocates accordingly.

What kaio does, as the name implies, is implement async I/O inside the kernel rather than in user-land via user threads. The I/O queue is created and managed in the operating system. The basic sequence of events is as follows: When an application calls aioread(3) or aiowrite(3), the corresponding library routine is entered. Once entered, the library first tries to process the request via kaio. A kaio initialization routine is executed, which creates a "cleanup" thread, which is intended to ensure that there are no remaining memory segments that have been allocated but not freed during the async I/O process. Once that's complete, kaio is called, at which point a test is made to determine if kaio is supported for the requested I/O.

Support for kaio requires specific async I/O read and write routines at the device-driver level. Solaris provides this support in the SCSI driver for all currently shipping Sun storage products. This includes the fiber-based storage products, which implement the SCSI protocol over the Fibre Channel connect. The other implementation restriction is that kaio only works when the target of the async I/O is a character device-special file. In other words, kaio works only on raw disk device I/O. Additional support has been added with the host-based volume management software used for creating RAID volumes on SPARC/Solaris servers: Sun Enterprise Volume Manager, based on Veritas, and Solstice DiskSuite. RAID devices created using either Veritas or DiskSuite have raw device entry points via the /dev/vx/rdsk and /dev/md/rdsk device-special files, and the pseudo drivers implemented for these volume managers include async I/O routines for kaio support. Note that this is generally true for current versions -- check with your local Sun office if you're running older versions and are interested in verifying kaio support. You can also use the kaio test described below.

If kaio support is available, the kernel allocates an aio_req structure from the queue (or creates a new one in kernel memory via kmem_alloc) and calls the async I/O routine in the appropriate device driver. Inside the driver, the required kernel data structures are set up to support the I/O, and an async I/O-specific physical I/O routine, aphysio, is entered. Synchronous raw device I/O uses the kernel physio function, where kernel buffers are set up, and the driver strategy routine is called to do the actual device I/O. The physio routine waits for the driver strategy routine to complete through the buffer I/O biowait kernel code. The aphysio routine sets up the async I/O support structures and signal mechanism, then calls the driver strategy routine, without waiting for the I/O to complete.

If kaio support isn't available, the code path taken is very different. If the kaio system call entered when the libaio routine is called returns an ENOTSUP error, the system implements the original user-thread async I/O facility. This basically involves putting the I/O in an async I/O queue and handing the aioread or aiowrite off to a worker thread that was created when the libaio library initialized (more on this after the example below). The thread assigned to do the I/O on behalf of the calling process uses the pread(2)/pwrite(2) system calls to enter the kernel. Pread(2) and pwrite(2) are similar to read(2) and write(2) except they take an additional file offset argument, allowing the caller to specify a byte offset into the file with which to begin the I/O. Essentially, the user-thread implementation of async I/O makes the I/O appear asynchronous to the calling process by creating a thread to do the I/O and allowing the caller to return to do other work without waiting. The implementation actually uses the traditional synchronous Solaris system calls (pread(2) and pwrite(2)) to do the I/O.
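The idea behind the user-thread implementation can be illustrated with a short sketch: a worker thread issues an ordinary blocking pread(2) while the caller carries on, and the outcome lands in a result record that plays the role of aio_result_t. Everything named sketch_ here is hypothetical, and the sketch creates one thread per request for brevity, whereas the library described above hands requests to a pool of workers created at initialization. It is written with POSIX threads and may need -lpthread at link time on some systems.

```c
#include <errno.h>
#include <pthread.h>
#include <sys/types.h>
#include <unistd.h>

/* Hypothetical request record; ret and err mirror aio_result_t. */
typedef struct sketch_req {
	int		fd;
	void		*buf;
	size_t		len;
	off_t		offset;
	ssize_t		ret;	/* plays the role of aio_return */
	int		err;	/* plays the role of aio_errno */
	pthread_t	worker;
} sketch_req_t;

/* Worker thread: performs a plain, synchronous pread(2). */
static void *
sketch_worker(void *arg)
{
	sketch_req_t *req = arg;

	req->ret = pread(req->fd, req->buf, req->len, req->offset);
	req->err = (req->ret == -1) ? errno : 0;
	return (NULL);
}

/* Start the read and return at once; 0 on success, else an errno value. */
int
sketch_aioread(sketch_req_t *req)
{
	return (pthread_create(&req->worker, NULL, sketch_worker, req));
}

/* Block until the worker has finished, then hand back its result. */
ssize_t
sketch_aiowait(sketch_req_t *req)
{
	(void) pthread_join(req->worker, NULL);
	errno = req->err;
	return (req->ret);
}
```

From the caller's point of view the read looks asynchronous, yet the kernel only ever sees a conventional pread(2), which is the point the paragraph above makes.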

Sun's internal benchmarking and testing has shown that the implementation of kaio was truly a worthwhile effort. The reductions in overhead and dedicated device driver-level support yield overall faster and more efficient async I/O operations. On relatively small, lightly loaded systems, the improvement is less dramatic, with typical performance improvements on the order of 5 to 6 percent. As the size of the system (number of processors and amount of RAM) increases, and the number of async I/O requests (load) grows, the kaio approach delivers much more scalable performance, with improvements measuring up to 30 percent under some benchmarks. Your mileage will, of course, vary.

Kaio is implemented as a loadable kernel module, /kernel/sys/kaio, and is loaded the first time an async I/O is called. You can determine if the module is loaded or not with modinfo(1M):

fawlty> modinfo | grep kaio 
105 608c4000   2efd 178   1  kaio (kernel Async I/O) 

Note that if the above command doesn't return anything, it simply means the kaio kernel module hasn't yet been loaded. It can be explicitly loaded via modload(1M), using the forceload directive in the /etc/system file, or, as I said, the system will load it automatically as needed.

There's a relatively simple test you can run to determine whether or not you have kaio support available for a given file. (Thanks to Phil Harman, SE extraordinaire out of the U.K., for devising this test.) It involves compiling and running a small program that calls aioread(3) and using truss(1) on the program to watch the system-call activity. First, here's the C language source code:

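/*
 * Quick kaio test. Read 1k bytes from a file using async I/O.
 * To compile:
 * cc -o aio aio.c -laio
 * To run:
 * aio file_name
 */
#include <sys/types.h>
#include <sys/asynch.h>
#include <fcntl.h>
#include <stdio.h>
#include <stdlib.h>
#include <unistd.h>

#define BSIZE 1024

int
main(int argc, char *argv[])
{
	aio_result_t res;
	char buf[BSIZE];
	int fd;

	if ((fd=open(argv[1], O_RDONLY)) == -1) {
		perror("open");
		exit(1);
	}
	aioread(fd, buf, BSIZE, 0L, SEEK_SET, &res);
	aiowait(NULL);		/* block until the read completes */
	if (res.aio_return == BSIZE) {
		printf("aio succeeded\n");
	}
	return (0);
}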

Once you have the test program compiled, use the truss(1) command to do a system-call trace of the execution path:

# truss -t kaio,lwp_create aio /dev/rdsk/c0t3d0s0
kaio(5, 0xFFFFFFE8, 0xFFFFFFFF, 0xEF68FB50, 0x00000000, 0x00000000, 0x00000000) = 0
lwp_create(0xEFFFEE10, 0, 0xEF68FF44)           = 2
lwp_create(0x00000000, 0, 0x00000000)           = 0
kaio(AIOREAD, 3, 0xEFFFF190, 1024, 0, 0xEFFFF590) = 0
kaio(AIOWAIT, 0x00000000)                       = -268438128
aio succeeded

Note that I made myself root for the test because I'm reading directly from a raw device-special file. On the command line, I use the -t flag with truss to instruct truss to trace only the system calls I specify, which in this example are the kaio and lwp_create calls. This reduces the truss output so it's less cluttered.

In the example above, I used a raw device-special file as the argument to the aio program. (You have to specify a full pathname for a file when you invoke aio.) The truss(1) output shows a kaio system call, followed by two lwp_create calls, and finally two more entries into the kaio routine to do the actual read, followed by the aiowait. The last line is the output from the aio program (aio succeeded). In this example, kaio is in fact a supported facility for this I/O operation, as the kaio call with the AIOREAD flag did not return an error.

In the example below, kaio is not supported for the I/O:

# truss -t kaio,lwp_create aio /space/jim/junkfile
kaio(5, 0xFFFFFFE8, 0xFFFFFFFF, 0xEF68FB50, 0x00000000, 0x00000000, 0x00000000) = 0
lwp_create(0xEFFFEE08, 0, 0xEF68FF44)           = 2
lwp_create(0x00000000, 0, 0x00000000)           = 0
kaio(AIOREAD, 3, 0xEFFFF188, 1024, 0, 0xEFFFF588) Err#48 ENOTSUP
lwp_create(0xEFFFEDA8, 0, 0xEF686F44)           = 3
lwp_create(0x00000000, 0, 0x00000000)           = -278369456
lwp_create(0xEFFFEDA8, 0, 0xEF67DF44)           = 4
lwp_create(0x00000000, 0, 0x00000000)           = -278406320
lwp_create(0xEFFFEDA8, 0, 0xEF674F44)           = 5
lwp_create(0x00000000, 0, 0x00000000)           = -278443184
lwp_create(0xEFFFEDA8, 0, 0xEF66BF44)           = 6
lwp_create(0x00000000, 0, 0x00000000)           = -278480048
lwp_create(0xEFFFEDA8, 0, 0xEF662F44)           = 7
lwp_create(0x00000000, 0, 0x00000000)           = -278516912
lwp_create(0xEFFFEDA8, 0, 0xEF659F44)           = 8
lwp_create(0x00000000, 0, 0x00000000)           = -278553776
lwp_create(0xEFFFEDA8, 0, 0xEF650F44)           = 9
lwp_create(0x00000000, 0, 0x00000000)           = -278590640
lwp_create(0xEFFFEDA8, 0, 0xEF647F44)           = 10
lwp_create(0x00000000, 0, 0x00000000)           = -278627504
lwp_create(0xEFFFEDA8, 0, 0xEF63EF44)           = 11
lwp_create(0x00000000, 0, 0x00000000)           = -278664368
kaio(AIOWAIT, 0x00000000)                       Err#22 EINVAL
kaio(AIOWAIT, 0x00000000)                       Err#22 EINVAL
kaio(AIONOTIFY, 170560)                         = 0
aio succeeded

In this example, I'm reading a file in the file system, /space/jim/junkfile. Since kaio isn't supported for file system I/O, I knew it would fail. As you can see, I entered the kernel with a kaio system call, created a couple of lwps, and attempted to do the async I/O through the kaio facility. The second kaio call failed with the ENOTSUP (error: not supported) error, and the system dropped back to the library implementation, resulting in the creation of a bunch of threads (lwps) up to the completion of the I/O. Note that the last line of the program output, aio succeeded, indicates that the aioread(3) was in fact successful -- we just didn't use the kaio facility to do it.

It's clear that the user-threads implementation results in a lot more thread/lwp creation and management, which adds overhead and reduces efficiency. So why does Solaris create 11 lwps for a single async I/O request? The answer has to do with the initialization of the async I/O library. The first time an aioread(3) or aiowrite(3) is called, an async I/O library initialization routine is called, which creates several worker threads to do async I/Os. The system is simply getting ready to process several async I/Os, and thus creates four reader lwps and four writer lwps, along with one thread to do file syncs. That accounts for the nine successful lwp_create calls you see above. If you were to add a second aioread(3) or aiowrite(3) to the test program and rerun it with the truss(1), you wouldn't see all the lwp_creates for the second async I/O. The library handles subsequent async I/O requests with the worker threads created during initialization (though it will, of course, create more worker threads to keep up with the level of incoming async I/O requests).

While the creation of a pool of worker threads up front helps provide better scalability for the user-threads async I/O facility, it still involves more overhead and a longer code path than kernel async I/O.


The Large File Summit and Unix standards
In order to maintain consistency across the various implementations of Unix that are commercially available today, a number of summits were held over the course of about a year, including the Large File Summit held in January 1995. The idea was to bring together major vendors such that consensus could be reached on the interfaces and implementation of large files. One of the results of the summit was that the interfaces for large file support became part of the Single Unix Specification (SUS), the definitive source of specifications that define what a Unix system is today.

The SUS is maintained by The Open Group, the same organization that owns the Unix trademark. Any operating system that carries the X/Open brand has been certified in compliance with the SUS, currently Version 1.

Version 2 of the SUS, also called Unix 98, provides the specifications for full 64-bit support. Solaris 2.7 (due in November 1998) is a complete 64-bit implementation, and will be Unix 98 compliant.

Sun, of course, participated in the summit, and Solaris 2.6 is X/Open-branded, which, as a result of the Large File Summit efforts, includes the definitions for 64-bit file interfaces. Visit the Web pages in the Resources section below to learn more about The Open Group, X/Open branding, and Unix industry standards.

64-bit files in Solaris
The 2-gigabyte (GB) file size limitation in releases of Solaris prior to 2.6 was the result of the data type used for the file offset value, or file pointer. Every file I/O requires a pointer that points to the current byte in the file and moves along the file as bytes are read or written. The data type used was a signed long, which has a maximum value of about two billion, thus the 2-GB file size limit.

The support of large files required changing the data type used for file offsets in all the library routines and system calls involved in file I/O. It also became necessary to provide for the inclusion of the 64-bit file offset in the various data structures and other related library and kernel variables used in the file I/O facilities. The implementation involved additions to the Solaris header file types.h, which is where all core data types that the system uses are defined. In addition to the fundamental data types, char (character), signed and unsigned short (2 bytes), signed and unsigned integer (4 bytes) and signed and unsigned long (4 bytes), the types.h file defines many derived types via the C language's typedef facility. All of the derived types follow a simple naming rule: The data type always ends with an underscore followed by a lowercase t (_t). A fair number of these derived-type declarations are to provide POSIX and SUS compliance, such that an application source file that defined

uid_t userid;	/* user ID data type */

in code would compile and behave consistently across different Unix releases. (That's why we have standards.)

The data type used in the kernel for file offsets was declared in types.h as

typedef long off_t;

Because a long is 4 bytes (32 bits), we had a 2-GB limit. But support for large files required that the file offset be extended to 64 bits. As you can see, the file size limit has nothing to do with any boundaries imposed by the file system itself. The limits on how large a file system can be and on how large a file in the file system can be (from a file system support standpoint) are determined as follows: For a UFS in Solaris, a file system maintains a data variable in the superblock called fs_bsize, which is the number of 512-byte blocks available to the file system. Since fs_bsize is a signed integer, it has a maximum value of about two billion (2^31 - 1) blocks; multiply that by 512 (bytes per block) and you have the maximum size of a UFS in Solaris -- 1 terabyte (TB).
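Both limits in the preceding paragraph come down to simple arithmetic on a signed 32-bit quantity, which the following sketch works through. The function names are hypothetical and chosen only for illustration; the 512-byte block size is the figure given above.

```c
#include <stdint.h>

/* Largest byte offset a signed 32-bit file pointer can hold. */
int64_t
max_offset_32(void)
{
	return ((int64_t)INT32_MAX);		/* 2^31 - 1, just under 2 GB */
}

/* Largest file system a signed 32-bit count of 512-byte blocks allows. */
int64_t
max_ufs_bytes(void)
{
	return ((int64_t)INT32_MAX * 512);	/* 512 bytes short of 1 TB */
}
```

The first function yields 2,147,483,647 and the second 1,099,511,627,264, which is where the 2-GB file limit and the 1-TB file system limit come from.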

For files, the file data is referenced via the direct and indirect block pointers in the file's inode. There are 12 direct pointers (each pointer directly references a data block) and 3 indirect pointers (one single indirect, one double indirect, and one triple indirect). An indirect pointer points to a block that contains pointers, not data. An 8-KB block can hold 2 K (2,048) pointers (a pointer is 4 bytes), and each level of indirection adds another layer of blocks that contain pointers. So, the maximum file size you can have in a UFS, based on the file system's ability to map the data blocks, can be calculated as follows:

(12 x 8 KB) + (2 K x 8 KB) + ((2 K x 2 K) x 8 KB) + ((2 K x 2 K x 2 K) x 8 KB)

The actual math is left as an exercise for the reader, but the result comes in around 70 TB or so. Since a file can't span multiple file systems, the size of a file is limited now (with Solaris 2.6) by the size of a file system, which is 1 TB.
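For readers who would rather not do the exercise by hand, the calculation can be sketched in a few lines of C. The constants are the ones quoted above (8-KB blocks, 4-byte pointers, 12 direct pointers); the SK_ macro names and the function name are hypothetical.

```c
#include <stdint.h>

#define	SK_BLOCK_SIZE	8192ULL	/* 8-KB file system block */
#define	SK_PTR_SIZE	4ULL	/* 4-byte block pointer */
#define	SK_NDIRECT	12ULL	/* direct pointers in the inode */

/* Bytes addressable through one inode's direct and indirect pointers. */
uint64_t
ufs_max_mappable_bytes(void)
{
	uint64_t nptr = SK_BLOCK_SIZE / SK_PTR_SIZE;	/* 2,048 per block */
	uint64_t blocks = SK_NDIRECT		/* direct blocks */
	    + nptr				/* single indirect */
	    + nptr * nptr			/* double indirect */
	    + nptr * nptr * nptr;		/* triple indirect */

	return (blocks * SK_BLOCK_SIZE);
}
```

The result is 70,403,120,791,552 bytes, or roughly 70 TB in decimal units, dominated almost entirely by the triple indirect term.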

So the real effort in expanding to large file support wasn't around file system code and file system structures; rather, it was around embedded data types in the operating system, which are used to provide an offset into a file. Sun kept an eye on two things as the design and implementation of large files evolved. The first was binary compatibility for existing applications, such that a binary that worked on earlier versions of Solaris would continue to behave properly when run on Solaris 2.6. The second was source compatibility.

Source code that was brought over from older versions required that the compilation environment be consistent. To solve this, the large file implementation in Solaris 2.6 provides an extension to the existing API to protect binary compatibility, and a transitional API to ensure that the applications can be modified to take advantage of large file support without altering existing interfaces.

Solaris defines applications that will run under 2.6 as being either large file safe or large file aware. An application that is large file safe is one that can encounter an error that is the result of attempting to do an I/O on a large file and deal with that error gracefully such that no file data corruption occurs. In other words, the operating system needs to provide a means by which a 32-bit interface call (e.g., open(2)) can be notified when a reference is made to a large file. Hopefully, all code run on Solaris systems tests the return values of system calls and library routines. Every callable interface has a defined return value for success and failure (typically a -1 is returned due to failure), and it's up to you to ensure that return values are tested before proceeding with the program sequence. The operating system cannot prevent an application that has encountered an error during an interface call from continuing to run if the code has been written that way. The extension to the existing API will set the errno to a meaningful value if an I/O operation is attempted on a large file using the wrong API. (For example, open(2) will get an EOVERFLOW error if a file larger than 2 GB is encountered.)
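What "large file safe" means in practice is little more than disciplined error checking. The sketch below shows one way a program might test the return value of open(2) and single out EOVERFLOW; the helper name open_checked() and the wording of the messages are hypothetical.

```c
#include <errno.h>
#include <fcntl.h>
#include <stdio.h>
#include <string.h>

/*
 * Open a file read-only and report any failure rather than ignoring it.
 * Returns the file descriptor, or -1 with errno preserved.
 * open_checked() is a hypothetical helper name.
 */
int
open_checked(const char *path)
{
	int fd = open(path, O_RDONLY);
	int err;

	if (fd == -1) {
		err = errno;
		if (err == EOVERFLOW)
			fprintf(stderr, "%s: file too large for this "
			    "interface\n", path);
		else
			fprintf(stderr, "%s: %s\n", path, strerror(err));
		errno = err;	/* fprintf may have changed errno */
	}
	return (fd);
}
```

A caller that gets -1 back can stop before any read or write is attempted, which is what keeps a 32-bit program from damaging a file it cannot address in full.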

The transitional API provides a 64-bit equivalent interface for each of the existing 32-bit-file-related interfaces that use the corresponding interface name followed by a 64. Thus, this API includes open64(), lseek64(), pread64(), pwrite64(), etc. Along with the interface name, a data type following the same convention exists for each of those interfaces for data types related to file offsets, such as off64_t instead of the 32-bit off_t. These transitional APIs can be called explicitly from the program, or compilation flags passed at build time can force their inclusion.

The compilation environment for building applications that run on Solaris 2.6 has a couple of flags that can be used, depending on the program requirements. For existing code, nothing needs to be changed if you don't require the program to work with large files. Only error return handling needs to be checked (as above) in case a large file is encountered. In this environment, opens of large files will fail, and read(2) and write(2) can't seek beyond 2 GB. For compiling programs that are large file aware, meaning that the program can process large files properly, the _LARGEFILE64_SOURCE flag can be set in the source code, which will make all 64-bit functions and data types available to the programmer. It is up to the programmer to make explicit use of the large file interfaces if desired.

Finally, a programmer can set _FILE_OFFSET_BITS to 64, which will force the inclusion of all the 64-bit interfaces (i.e., all open(2)s will be mapped to open64(2), etc., by the compiler). You need not make any source code changes to take advantage of large files. (User-defined data that is used for file offsets is the exception: this would need to be changed to a 64-bit data type.) The xxx64() interfaces will work just fine with any size file. Here's a quick example. First, a test program:

#include <fcntl.h>
#include <stdio.h>
int main(int argc, char *argv[])
{
        int fd;
        if ((fd=open("/space/jim/junkfile",O_RDWR|O_CREAT, 0777)) == -1) {
                perror("open");
                return 1;
        }
        printf("open succeeded\n");
        return 0;
}

Now, we'll compile and run it, trussing the output for open(2) calls.

sunsys> gcc -o lf lf.c
sunsys> truss -t open lf
open("/space/jim/junkfile", O_RDWR|O_CREAT, 0777) = 3
open succeeded
sunsys> gcc -o lf lf.c -D_FILE_OFFSET_BITS=64
sunsys> truss -t open lf
open64("/space/jim/junkfile", O_RDWR|O_CREAT, 0777) = 3
open succeeded

To reduce clutter in the column, I deleted the opens of the system libraries that take place when program execution begins. The program name is lf.c, and I first compiled it without any flags. As you can see from the truss, we executed the 32-bit open(2) call. The next compilation included setting _FILE_OFFSET_BITS to 64. This time, the compiler linked to the open64(2) routine, which is what was executed when I ran the program the second time.
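The same switch also changes the width of off_t that the program sees, which is easy to check from within the code. The sketch below defines _FILE_OFFSET_BITS in the source rather than on the compiler command line; the function name off_t_bits() is hypothetical.

```c
#define	_FILE_OFFSET_BITS 64	/* must precede every system header */

#include <sys/types.h>
#include <limits.h>

/* Width, in bits, of off_t as seen by this compilation unit. */
int
off_t_bits(void)
{
	return ((int)(sizeof (off_t) * CHAR_BIT));
}
```

Built this way, the function should report 64; in a 32-bit compilation environment without the define, it would be expected to report 32.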

Since the file I/O subsystem is tightly coupled to the virtual memory system, large file implementation required 64-bit offset support in several areas of the kernel in order to facilitate the Solaris dynamic page cache for caching of file system objects (files) in physical memory. The virtual memory system added support for 64-bit offsets in the page structure, vnode layers, and segment drivers that manage address space mappings for VM objects. Support was added in the anonymous memory and swap layers, so swap files can be larger than 2 GB in 2.6. In addition, large file support was added to the NFS (with NFS V3), CacheFS, and tmpfs file systems.

Readers are encouraged to read the interface64(5) and lfcompile(5) manual pages, which provide a detailed breakdown of the data structures and transitional interfaces available in 2.6.

That's it for this month. I hope you've enjoyed this series on files. Next month we'll embark on something new. See you then!


About the author
Jim Mauro is currently an area technology manager for Sun Microsystems in the Northeast area, focusing on server systems, clusters, and high availability. He has a total of 18 years industry experience, working in service, educational services (he developed and delivered courses on Unix internals and administration), and software consulting. Reach Jim at


[(c) Copyright  Web Publishing Inc., and IDG Communication company]
