Inside Solaris by Jim Mauro

Demangling message queues

We've already covered two of the three traditional Unix System V IPC (Interprocess Communication) facilities: shared memory and semaphores. We'll finish up with a look at the message queue facility -- the allocation of kernel resources, kernel implementation, and the actual sending and receiving of messages.

November  1997

Message queues appeared in an early release of Unix System V Release III (pre SVR4) as a means of doing asynchronous message passing between processes. Message queues allowed application developers to pass data around running processes in an ordered fashion.

We will conclude our coverage of the traditional Unix System V IPC facilities with a discussion of message queues that follows a pattern similar to that of the past two columns. We'll drill down on the kernel tunable parameters and take a close look at the kernel implementation of message queues.

Once again, our intent here is not to explain how to write application code by using the message queue facility. That information is readily available in any number of books on programming in Unix, as well as the Solaris Developers Kit (SDK) documentation. (3,800 words)

Understanding the general flow of application code using the message queue facility can make the following sections on the kernel implementation more palatable, especially for non-programmers. As we said above, we're not going into nearly enough detail here to get you comfortable writing applications using message queues; we'll cover just enough so the kernel stuff makes sense.

The APIs used for the System V message queue facility are in the form of system calls, documented in section 2 of the man pages. They should not be confused with the POSIX message queue facility, documented in section 3R of the man pages. The POSIX interfaces do not use the kernel resources discussed in this month's column. They are available only as of Solaris 2.6; prior releases do not support the POSIX messaging facility.

As with the other IPC facilities, the initial call when using message queues is a "get" call, in this case msgget(2). The msgget(2) system call takes a key value and some flags as arguments. The key is a generic IPC convention that allows for the retrieval of commonly shared resources from different processes. Consider an application being developed where multiple processes will be sending and receiving messages to and from the same message queue. In order for these different processes to use the same message queue, we need to grab the correct message queue identifier, which is what the msgget(2) system call returns to the calling application. The way we ensure that the processes are using a common message queue is via a key value. As long as each process passes the same key value in the msgget(2) call, the correct message queue identifier will be returned; the same identifier goes to all processes using the same key. This is assuming, of course, that the correct flags are used and the permissions are correctly established.

Once the message queue has been established, it's simply a matter of sending and receiving messages. Applications use the msgsnd(2) and msgrcv(2) system calls to accomplish this. The sender simply constructs the message, assigns a message type, and calls msgsnd(2). The system will hold the message on the appropriate message queue until a msgrcv(2) is successfully executed. Sent messages are placed at the back of the queue; messages are received from the front of the queue.

The message queue facility implements a message type field, which is user (programmer) defined. This gives programmers some flexibility because the kernel has no embedded or predefined knowledge of different message types. Programmers typically use the type field for priority messaging or directing a message to a particular recipient.

Lastly, applications use the msgctl(2) system call in order to get or set permissions on the message queue, and to remove the message queue from the system when the application is finished with it (e.g., as a clean way to implement an application shutdown procedure; the system will not remove an empty and unused message queue unless it is explicitly removed or the system is rebooted).
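
To make the flow just described concrete, here is a minimal user-level sketch of the four calls in sequence. It is only an illustration under stated assumptions: the names demo_msg, demo_roundtrip, and DEMO_TEXT_MAX are made up for this example, and the queue is created with IPC_PRIVATE so the sketch does not need a shared key.

```c
#include <string.h>
#include <sys/types.h>
#include <sys/ipc.h>
#include <sys/msg.h>

#define DEMO_TEXT_MAX 64          /* hypothetical body limit for this sketch */

/* Layout msgsnd(2)/msgrcv(2) expect: a long type, followed by the body. */
struct demo_msg {
    long mtype;                   /* programmer-defined type, must be > 0 */
    char mtext[DEMO_TEXT_MAX];
};

/*
 * Create a private queue, send one message, read it back, and remove
 * the queue. Returns 0 on success, -1 on any failure. On success the
 * received text is copied into out (at least DEMO_TEXT_MAX bytes).
 */
int demo_roundtrip(const char *text, long type, char *out)
{
    struct demo_msg snd, rcv;
    int qid, rc = -1;

    qid = msgget(IPC_PRIVATE, IPC_CREAT | 0600);
    if (qid == -1)
        return -1;

    memset(&snd, 0, sizeof snd);
    snd.mtype = type;
    strncpy(snd.mtext, text, DEMO_TEXT_MAX - 1);

    /* The size argument counts only the body, not the mtype field. */
    if (msgsnd(qid, &snd, strlen(snd.mtext) + 1, 0) == 0 &&
        msgrcv(qid, &rcv, DEMO_TEXT_MAX, type, 0) != -1) {
        strcpy(out, rcv.mtext);
        rc = 0;
    }

    /* The queue persists until it is removed, so always clean up. */
    if (msgctl(qid, IPC_RMID, NULL) == -1)
        rc = -1;
    return rc;
}
```

A caller would pass a text string, a positive type, and a buffer, for example demo_roundtrip("hello", 1, buf). Cooperating processes would replace IPC_PRIVATE with an agreed-upon key.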

As a general note, the shared memory and semaphore IPC facilities also provide a "control" system call (shmctl(2) and semctl(2)), which is used for the same purpose: getting or setting permissions for the resource, getting general information on use of the resource, and removing the resource programmatically. The system ipcs(1M) and ipcrm(1M) commands can be used to retrieve resource usage information and remove the resource via the command line.


Kernel resources
As with the IPC facilities previously discussed, the message queue facility comes in the form of a dynamically loadable kernel module, /kernel/sys/msgsys, and depends on the IPC support module, /kernel/misc/ipc, to be loaded in RAM. We talked a bit about loadable kernel modules in the shared memory column of September 1997 (see Resources below). The key points to remember are that the system will load the module into system memory when the first message system call is executed, and the IPC facility that is required to provide some low-level support routines will get loaded at the same time.

A system administrator can use the forceload directive in the /etc/system file to force the system to load the message queue module at boot time. The ipcs(1M) command can be used to determine if the module is loaded and examine current utilization of message queue resources (more on this shortly).

The number of resources that the kernel will allocate for message queues is tunable. Values for various message queue tunable parameters can be increased from their default values so more resources are made available for systems running applications that make heavy use of message queues. A summary of the tunable parameters, along with the default and maximum values, can be found in Table 1. We'll take a closer look at each one now.

Table 1 - Message queue tunable parameters

Name     Default   Data Type        Max     Description
msgmap   100       signed int       2 GB    Number of message map entries
msgmax   2048      signed int       2 GB    Maximum message size
msgmnb   4096      signed int       2 GB    Max bytes on a msg queue
msgmni   50        signed int       2 GB    Max msg queue identifiers
msgssz   8         signed int       2 GB    Message segment size
msgtql   40        signed int       2 GB    Max message headers
msgseg   1024      unsigned short   32 K    Max message segments

Note: The maximum value listed in the "Max" column in table 1 is a value based on the data type. It is a theoretical maximum only, and should not be construed as something attainable on production systems.

The msgmap tunable is described as defining the number of entries in the message map and is essentially the same as the semmap parameter used for semaphores described last month. Both IPC facilities use resource allocation maps in the kernel. A kernel resource map is simply an array of map structures used for the allocation and deallocation of segments of an address space. They provide a convenient means of managing small segments of kernel memory where there is frequent allocation and deallocation, as is the case with message queues (and semaphores). The system grabs message map entries when it needs space to store new messages destined for a message queue.
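
To show why the number of map entries matters, here is a small first-fit allocator in the spirit of a resource map. It is a sketch only, not the Solaris kernel code: the names demo_map, demo_map_alloc, demo_map_free, and MAP_ENTRIES are hypothetical, and the entry count is kept outside the array for simplicity. Each entry describes one free extent, so a fragmented pool needs more entries than a contiguous one.

```c
#include <stddef.h>

#define MAP_ENTRIES 8             /* hypothetical; plays the role of msgmap */

/* One free extent: m_size units starting at unit index m_addr. */
struct demo_map {
    unsigned long m_size;
    unsigned long m_addr;
};

/* First fit. Returns the starting unit, or -1 if no extent is large enough. */
long demo_map_alloc(struct demo_map *map, size_t *count, unsigned long size)
{
    for (size_t i = 0; i < *count; i++) {
        if (map[i].m_size >= size) {
            unsigned long addr = map[i].m_addr;
            map[i].m_addr += size;
            map[i].m_size -= size;
            if (map[i].m_size == 0) {          /* extent used up: close the gap */
                for (size_t j = i; j + 1 < *count; j++)
                    map[j] = map[j + 1];
                (*count)--;
            }
            return (long)addr;
        }
    }
    return -1;
}

/*
 * Return an extent to the map, merging it with adjacent free extents.
 * Entries stay sorted by address. Returns 0, or -1 if a new entry is
 * needed and the map is already full.
 */
int demo_map_free(struct demo_map *map, size_t *count,
                  unsigned long addr, unsigned long size)
{
    size_t i = 0;

    while (i < *count && map[i].m_addr < addr)
        i++;                                   /* i = first extent above addr */

    int merges_prev = i > 0 && map[i - 1].m_addr + map[i - 1].m_size == addr;
    int merges_next = i < *count && addr + size == map[i].m_addr;

    if (merges_prev && merges_next) {
        map[i - 1].m_size += size + map[i].m_size;
        for (size_t j = i; j + 1 < *count; j++)
            map[j] = map[j + 1];
        (*count)--;
    } else if (merges_prev) {
        map[i - 1].m_size += size;
    } else if (merges_next) {
        map[i].m_addr = addr;
        map[i].m_size += size;
    } else {
        if (*count == MAP_ENTRIES)
            return -1;                         /* out of map entries */
        for (size_t j = *count; j > i; j--)
            map[j] = map[j - 1];
        map[i].m_addr = addr;
        map[i].m_size = size;
        (*count)++;
    }
    return 0;
}
```

The last branch of demo_map_free is the interesting one: freeing a segment that touches no other free space consumes a new map entry, which is the situation a larger msgmap value is meant to accommodate.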

The msgmni tunable is (hopefully!) a now familiar IPC identifier "mni" parameter. As with shared memory segments and semaphores, message queues have an identifier associated with them, with a corresponding id data structure, the msqid_ds structure. The value of msgmni determines the maximum number of message queues the kernel can maintain. As we'll see below, the system allocates kernel memory based on the value of msgmni, so one should not set this arbitrarily high. Hopefully, the system software engineers will have a sense for how many message queues are needed by the application, and can set msgmni appropriately, adding 10 percent or so for headroom.

The msgmax parameter defines the maximum size a message can be, in bytes. The kernel does not allocate resources up front based on msgmax, but it is something that the application developer needs to be aware of, as the system will not allow messages that have a size larger than msgmax on the message queue. An error will be returned to the calling code indicating the message is too large. Even with a theoretical size limit of 2 GB for the maximum message size (see Table 1), message queues are probably not the most efficient way to move large blocks of data between processes. If the data requirements are relatively large, the software engineers should consider using shared memory instead of message queues for data sharing among processes -- or one of the more recent additions to Unix, such as a FIFO (named pipe). When I say "more recent," I mean recent relative to message queues. FIFOs have actually been around for quite a while, but not nearly as long as message queues.

msgmnb determines the maximum number of bytes on a message queue. Put another way, the sum of the bytes of all the messages on the queue cannot exceed msgmnb. When the message queue is initialized (the first msgget(2) call executed with the IPC_CREAT flag set), the kernel sets a member of the msqid_ds structure, the msg_qbytes field, to the value of msgmnb. This makes the information available to programmers, as msgctl(2) can be used to retrieve the msqid_ds data. More importantly, code executing with an effective UID of root (typically 0) can programmatically increase this value in case the queue needs to hold more message bytes than the limit established at boot time. If an application attempts to put a new message on a message queue that would push the total bytes past the queue's limit (msg_qbytes, initially msgmnb), the msgsnd(2) call will either return an error, or the process will block waiting for one or more messages to be removed (read) from the queue, such that the total number of bytes on the queue, plus the size of the new message, is less than or equal to the limit. Whether or not the process blocks depends on the state of the IPC_NOWAIT flag.
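
As a rough illustration of the per-queue byte limit, the following sketch reads and changes msg_qbytes through msgctl(2) and attempts a non-blocking send. The helper names (demo_get_qbytes, demo_set_qbytes, demo_try_send) are hypothetical, and only the standard msgctl(2) and msgsnd(2) calls are assumed. Lowering the limit is normally allowed for the queue owner; raising it past the system value needs privilege.

```c
#include <errno.h>
#include <stddef.h>
#include <sys/types.h>
#include <sys/ipc.h>
#include <sys/msg.h>

/* Read the queue's byte limit (msg_qbytes). Returns 0, or -1 on error. */
int demo_get_qbytes(int qid, unsigned long *limit)
{
    struct msqid_ds ds;

    if (msgctl(qid, IPC_STAT, &ds) == -1)
        return -1;
    *limit = (unsigned long)ds.msg_qbytes;
    return 0;
}

/* Change the limit. Read first so the permission fields are preserved. */
int demo_set_qbytes(int qid, unsigned long limit)
{
    struct msqid_ds ds;

    if (msgctl(qid, IPC_STAT, &ds) == -1)
        return -1;
    ds.msg_qbytes = limit;
    return msgctl(qid, IPC_SET, &ds);
}

/*
 * Non-blocking send of body_len bytes. Returns 0 on success, 1 if the
 * queue is full (the caller would have blocked without IPC_NOWAIT),
 * and -1 on any other error.
 */
int demo_try_send(int qid, const void *msg, size_t body_len)
{
    if (msgsnd(qid, msg, body_len, IPC_NOWAIT) == 0)
        return 0;
    return (errno == EAGAIN) ? 1 : -1;
}
```

Calling demo_try_send repeatedly on a queue whose limit has been lowered is an easy way to observe the "queue full" case without blocking the test program.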

msgtql defines the maximum number of message headers. Each message on a message queue requires a message header, which is defined in the kernel by the msg structure (more on that in the next section). Basically, this tunable should reflect the maximum number of messages (message queues, times messages per queue) the application will need, plus a little headroom.

msgssz establishes the message segment size, in bytes, and msgseg determines the maximum number of message segments. msgseg is stored as a short in the kernel, and thus cannot be greater than 32,768 (32 K). The kernel creates a pool of memory to hold the message data, and the size of that pool is the product of the msgssz and msgseg parameters. Put another way, the number of units of allocation from the data space is msgseg, and the size of each allocation unit is msgssz.
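
The arithmetic is simple enough to sketch. The function names below are hypothetical, and the rounding-up of each message to whole segments is our reading of "units of allocation" rather than something taken from the kernel source.

```c
#include <stddef.h>

/* Total bytes in the message data pool: msgseg units of msgssz bytes. */
size_t demo_pool_bytes(size_t msgssz, size_t msgseg)
{
    return msgssz * msgseg;
}

/* Allocation units a message body occupies, rounded up to whole units. */
size_t demo_segments_needed(size_t msg_bytes, size_t msgssz)
{
    return (msg_bytes + msgssz - 1) / msgssz;
}
```

With the defaults from Table 1 (msgssz of 8 and msgseg of 1024), the pool is 8,192 bytes, and under the rounding assumption a 100-byte message would occupy 13 segments.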

Kernel resource allocation
We'll take a look at how the kernel allocates resources for message queues based on the tunables, then we'll put it all together in the last section.

When the /kernel/sys/msgsys module is first loaded, an initialization routine executes, which does pretty much the same sort of work that is done for shared memory and semaphore initialization. That is, a check is made on the amount of kernel memory that will be required for resources based on the tunable parameters discussed previously, and providing the required amount is no greater than 25 percent of available kernel memory, the system allocates the resources.

The amount of kernel memory required is calculated as follows:

kernel_memory_required = ((msgseg * msgssz) * sizeof char datatype) +
                          (msgmap * sizeof map structure) +
                          (msgmni * sizeof msqid_ds structure) +
                          (msgmni * sizeof msglock structure) +
                          (msgtql * sizeof msg structure)
The sizes of the structures in bytes will be provided in the next few paragraphs, and once again the arithmetic using either the default or custom values is left as an exercise for the reader. The char datatype on SPARC/Solaris is one byte in size.
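
For readers who would rather check their arithmetic than do it by hand, here is the formula as a small function. The structure sizes are passed in rather than hard-coded: this column quotes 8 bytes for a map structure, 112 for msqid_ds, and 12 for msg, but does not state the size of msglock, so any value used for it is an assumption. The type and function names are hypothetical.

```c
#include <stddef.h>

/* The five tunables that drive the up-front allocation. */
struct demo_tunables {
    size_t msgmap, msgmni, msgssz, msgtql, msgseg;
};

/* Structure sizes in bytes; the msglock size is supplied by the caller. */
struct demo_sizes {
    size_t map, msqid_ds, msglock, msg;
};

/* Direct transcription of the kernel_memory_required formula. */
size_t demo_kernel_memory_required(const struct demo_tunables *t,
                                   const struct demo_sizes *s)
{
    return (t->msgseg * t->msgssz) * sizeof(char)
         + t->msgmap * s->map
         + t->msgmni * s->msqid_ds
         + t->msgmni * s->msglock
         + t->msgtql * s->msg;
}
```

With the Table 1 defaults, the sizes quoted in this column, and a placeholder of 4 bytes per msglock, the function returns 15,272 bytes.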

Assuming everything will fit, the system grabs various chunks of kernel memory as follows, assigning kernel pointers described below:

        msg     = allocate_kernel_memory((msgseg * msgssz) * sizeof (char));
        msgmap  = allocate_kernel_memory(msgmap * sizeof (struct map));
        msgh    = allocate_kernel_memory(msgtql * sizeof (struct msg));
        msgque  = allocate_kernel_memory(msgmni * sizeof (struct msqid_ds));
        msglock = allocate_kernel_memory(msgmni * sizeof (struct msglock));
The msg pointer is set to point to the beginning of the pool of memory used to store message data, described earlier. msgmap points to the beginning of the map structures used for maintaining resource allocation maps, also described earlier. A map structure is eight bytes in size, and looks like:

struct map {
	ulong_t m_size;	/* the size of the map segment */
	ulong_t m_addr; /* the address of the start of the segment */
};

The kernel data structure that describes each message queue is the msqid_ds structure:

struct msqid_ds {
        struct ipc_perm  msg_perm;      /* operation permission structure */
        struct msg      *msg_first;     /* ptr to first message on q */
        struct msg      *msg_last;      /* ptr to last message on q */
        ulong           msg_cbytes;     /* current # bytes on q */
        ulong           msg_qnum;       /* # of messages on q */
        ulong           msg_qbytes;     /* max # of bytes on q */
        pid_t           msg_lspid;      /* pid of last msgsnd */
        pid_t           msg_lrpid;      /* pid of last msgrcv */
        time_t          msg_stime;      /* last msgsnd time */
        long            msg_pad1;       /* reserved for time_t expansion */
        time_t          msg_rtime;      /* last msgrcv time */
        long            msg_pad2;       /* time_t expansion */
        time_t          msg_ctime;      /* last change time */
        long            msg_pad3;       /* time expansion */
        kcondvar_t      msg_cv;
        kcondvar_t      msg_qnum_cv;
        long            msg_pad4[3];    /* reserve area */
};

The structure field descriptions above are basically self-explanatory. The msg_perm field is an IPC permissions structure used in all the IPC facilities to maintain the access permissions to the resource based on the same "owner," "group," and "other" convention used for files in the file system. It is described in the September 1997 column on shared memory segments (see Resources below). The permissions get established by the process that creates the message queue, and they can be changed via the msgctl(2) system call. The kernel pointer msgque points to the beginning of the kernel space allocated to hold all the system msqid_ds structures. It is simply an array of msqid_ds structures, with msgque pointing to the first structure in the array. The total number of structures in the array is equal to the msgmni tunable. Each structure is 112 bytes in size.

The messages in a message queue are maintained in a linked list, with the root of the list in the msqid_ds data structure (the msg_first pointer), which points to the message header for the message. The kernel also maintains a linked list of message headers, rooted in the kernel msgh pointer.

struct msg {
        struct msg      *msg_next;      /* ptr to next message on q */
        long            msg_type;       /* message type */
        ushort_t        msg_ts;         /* message text size */
        short           msg_spot;       /* message text map address */
};
The kernel message structure (actually, message *header* structure is a more accurate name) is 12 bytes in size and, as we said, one exists for every message on every message queue (the msgtql tunable).

The last chunk of kernel memory allocated is for the message queue synchronization locks. The method of synchronization used is a condition variable protected by a mutex (mutual exclusion) lock, defined in the msglock structure:

struct msglock {
	char msglock_lock;
	kcondvar_t msglock_cv;
};
There is a msglock created for every message queue (one per message queue identifier).

Condition variables are a means of allowing a process (thread) to test whether or not a particular condition is true under the protection of a mutex lock. The mutex ensures that the condition is checked atomically, so no other thread can change the condition while the first thread is testing it. If the condition is not yet true, the thread blocks on the condition variable, releasing the mutex while it sleeps, until the condition changes state (becomes true), at which point the thread reacquires the mutex and continues execution. A good example is the existence of a message on a queue. If none exists, the thread blocks (sleeps) on the condition variable. When a message appears on the queue, the system sends a broadcast, and the thread is woken up, ready to pull the message off the queue. We'll cover this a bit more in the next section.
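
The same pattern can be tried at user level with POSIX threads. The sketch below is an analogue, not the kernel code: the kernel uses its own kcondvar_t primitives, and the names demo_queue, demo_post, and demo_take are hypothetical. Note that pthread_cond_wait releases the mutex while the caller sleeps and reacquires it before returning, which is why the condition is rechecked in a loop.

```c
#include <pthread.h>
#include <stddef.h>

struct demo_queue {
    pthread_mutex_t lock;
    pthread_cond_t  nonempty;
    int             count;       /* number of messages "on the queue" */
};

void demo_queue_init(struct demo_queue *q)
{
    pthread_mutex_init(&q->lock, NULL);
    pthread_cond_init(&q->nonempty, NULL);
    q->count = 0;
}

/* Sender side: add a message and wake every waiter. */
void demo_post(struct demo_queue *q)
{
    pthread_mutex_lock(&q->lock);
    q->count++;
    pthread_cond_broadcast(&q->nonempty);
    pthread_mutex_unlock(&q->lock);
}

/*
 * Receiver side: sleep until a message exists, then take one.
 * Returns the number of messages left on the queue.
 */
int demo_take(struct demo_queue *q)
{
    int left;

    pthread_mutex_lock(&q->lock);
    while (q->count == 0)                            /* recheck after each wake-up */
        pthread_cond_wait(&q->nonempty, &q->lock);   /* drops lock while asleep */
    left = --q->count;
    pthread_mutex_unlock(&q->lock);
    return left;
}
```

Because a broadcast wakes every waiter, more than one thread may find the queue empty again by the time it runs; the while loop handles that case.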

A final note on kernel locking. All versions of Solaris, up to and including 2.5.1, do very coarse-grained locking in the kernel message queue module. Specifically, there is one kernel mutex initialized that protects the message queue kernel code and data. The net-net of this is that applications running on multiprocessor platforms using message queues will not scale very well. This is changed in Solaris 2.6, which implements a finer-grained locking mechanism, allowing for greater concurrency. The improved message queue kernel module has been backported and is available as a patch for Solaris 2.5 and 2.5.1.

Figure 1 illustrates the general layout of things after initialization of the message queue module is complete, along with the kernel pointers described above.

Figure 1 - Kernel Message Queue Resources

Kernel implementation
We'll walk through the kernel flow involved in the creation of a message queue and the sending and receiving of messages, as these represent the vast majority of message queue activity.

The creation of a message queue, on behalf of an application calling the msgget system call, starts with a call to the kernel ipcget routine. ipcget is a generic interface implemented in the kernel /kernel/misc/ipc module. It is used by all the IPC "get" routines.

ipcget examines the key value, and if it is IPC_PRIVATE (which is defined as a value of zero in the kernel), the code locates the next available ipc_perm structure, in preparation for creating a new message queue. There will be an ipc_perm structure available for every message queue identifier (msgmni). Once a structure has been allocated, the system initializes the structure members based on the UID and GID of the calling process. The permission mode bits get set based on values passed by the calling code, and finally the IPC_ALLOC bit gets set to indicate that the ipc_perm structure has been allocated.

If the key is not equal to IPC_PRIVATE, the following pseudo-code illustrates the loop implemented:

set pointer to the first ipc_perm structure
while (we're not at the end of ipc_perm structures) {
	if (match on key value) {
		if (IPC_EXCL and IPC_CREAT are true)
			return (already exists error)
		else if (permissions do not allow)
			return (ACCESS ERROR)
		else
			return (success)
	}
	advance pointer to the next ipc_perm structure
}
/*
 * When we get here, we've walked through all of the ipc_perm structures
 * and didn't find a match on the key. So we are probably creating a new
 * one that's not IPC_PRIVATE.
 */
if (IPC_CREAT flag is not TRUE)
	return (not found error)
else {
	allocate and initialize the next available entry
	return (success)
}
Once the ipcget work is done, the remaining msgget code initializes the rest of the msqid_ds structure members, such as the message header pointer (to NULL, because there are no messages yet), the creator PID, the byte fields, and so on. At this point, the application code has a valid message queue identifier and can send and receive messages, as well as do message control (msgctl(2)) operations.
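
The key lookup logic can be sketched in a few lines of C. This is a user-level model under stated assumptions, not the kernel routine: demo_perm, demo_ipcget, and DEMO_NIDS are hypothetical names, the access permission check is reduced to a comment, and errors are returned as negative errno values (EEXIST, ENOENT, and ENOSPC are the values msgget(2) documents for these cases).

```c
#include <errno.h>
#include <sys/types.h>
#include <sys/ipc.h>

#define DEMO_NIDS 4               /* hypothetical; plays the role of msgmni */

struct demo_perm {
    key_t key;
    int   allocated;              /* stands in for the IPC_ALLOC mode bit */
    int   mode;                   /* permission bits from the flags */
};

/* Returns the slot index for the key, or a negative errno value. */
int demo_ipcget(struct demo_perm *tbl, key_t key, int flags)
{
    int i;

    if (key != IPC_PRIVATE) {
        for (i = 0; i < DEMO_NIDS; i++) {
            if (tbl[i].allocated && tbl[i].key == key) {
                if ((flags & (IPC_CREAT | IPC_EXCL)) == (IPC_CREAT | IPC_EXCL))
                    return -EEXIST;
                /* a real implementation checks access permissions here */
                return i;
            }
        }
        if (!(flags & IPC_CREAT))
            return -ENOENT;       /* no match and not asked to create */
    }

    /* IPC_PRIVATE, or a new key with IPC_CREAT: take the next free slot. */
    for (i = 0; i < DEMO_NIDS; i++) {
        if (!tbl[i].allocated) {
            tbl[i].allocated = 1;
            tbl[i].key = key;
            tbl[i].mode = flags & 0777;
            return i;
        }
    }
    return -ENOSPC;               /* every identifier is in use */
}
```

Two callers passing the same key get the same slot back, while every IPC_PRIVATE call allocates a fresh one, which mirrors the behavior described above.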

A message send (msgsnd(2)) call requires the application to construct a message, setting a message type field (discussed earlier) and creating the body of the message (e.g., a text message).

The message send kernel support code does some general housekeeping when the code path is first entered (such as incrementing the processor statistics to indicate a message queue system call is being executed, verifying access permissions to the message queue by the calling process, and ensuring the message size does not exceed the msgmax tunable). The message type field is then copied from the user address space to a designated area in the kernel.

The rest of the message send flow is best represented in pseudo-code:

if (the message queue no longer exists)
	return (queue ID removed error)
if (current bytes on queue + bytes in new message > msgmnb) {
	if (IPC_NOWAIT flag is true)
		return (error -- try again)
	else {
		set MSGWAIT flag in msqid_ds.msg_perm.mode field
		set up a condition variable to wait for available
		space on the message queue.
		when wake-up received on condition variable
		check for space on the queue again
	}
}

/*
 * code will loop rechecking for space unless an
 * error condition occurs. moving beyond this point
 * means we have the space we need on the queue now
 */

grab space for the message from the resource map (msgmap)
if (space currently not available) {
	if (IPC_NOWAIT is true)
		return (try again error)
	else
		set a condition variable and wait for an available map location
}

/*
 * OK. we got this far, so we have the kernel resources we need
 */
copy the message data from user space to the allocated kernel space from the map.
update the following members of the msqid_ds structure:
	increment the msg_qnum field
	add the appropriate byte count to the msg_cbytes field
	set the correct PID in the msg_lspid field
	set the correct time in msg_stime

update the following fields in the message header:
	set the message type value in msg_type
	set the text size in msg_ts
	set the msg_spot pointer to point to the appropriate resource
	map location where the body of the message was stored

adjust the queue pointers (msg_first, msg_last) appropriately
At this point, the message has been placed on the end of the designated message queue, and the return code is sent to the calling program.
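
The bookkeeping at the end of the send path can be modeled at user level as follows. This is a simplified sketch: snd_hdr, snd_q, and snd_enqueue are hypothetical names, the header carries only the fields needed here, and the blocking path is collapsed into an immediate -EAGAIN (the IPC_NOWAIT behavior).

```c
#include <errno.h>
#include <stddef.h>

/* Cut-down message header: link, type, and body size. */
struct snd_hdr {
    struct snd_hdr *next;
    long            type;
    size_t          size;
};

/* Cut-down queue descriptor, mirroring a few msqid_ds fields. */
struct snd_q {
    struct snd_hdr *first, *last;   /* msg_first, msg_last */
    size_t cbytes;                  /* msg_cbytes: bytes now on the queue */
    size_t qnum;                    /* msg_qnum: messages now on the queue */
    size_t qbytes;                  /* msg_qbytes: byte limit */
};

/* Append h to the tail of q. Returns 0, or -EAGAIN if it will not fit. */
int snd_enqueue(struct snd_q *q, struct snd_hdr *h)
{
    if (q->cbytes + h->size > q->qbytes)
        return -EAGAIN;

    h->next = NULL;
    if (q->last != NULL)
        q->last->next = h;          /* link behind the current tail */
    else
        q->first = h;               /* queue was empty */
    q->last = h;

    q->cbytes += h->size;
    q->qnum++;
    return 0;
}
```

Keeping a tail pointer is what makes the append a constant-time operation; without msg_last the kernel would have to walk the list on every send.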

The msgrcv support code is a little less painful, because now we're looking for a message on the queue (as opposed to putting one on the queue). Kernel resources do not need to be allocated for a msgrcv.

The general flow of the kernel code path for receiving messages goes something like:

check permissions for operation
loop through all the messages on the queue {
	if (requested type = message type) {
		copy the message type to the user supplied location
		copy the message data to the user supplied location
		update the msqid_ds structure fields:
			decrement the msg_qnum field
			subtract the message size from msg_cbytes
			set PID in msg_lrpid
			set time in msg_rtime
		free the message resources:
			free the message header (msg structure)
			free the resource map entry
		return (success)
	}
}
if (looped through all messages, no matching type)
	return (no message error)
That's basically what happens with a message receive. When completed, the application code will have the message type and data in a buffer area supplied in the msgrcv(2) system call.
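
The search-and-unlink step can be sketched the same way. Again this is a user-level model with hypothetical names (rcv_hdr, rcv_q, rcv_dequeue). It handles a requested type of zero (take the first message) and a positive type (take the first message of that type); the negative-type variant of msgrcv(2) is left out.

```c
#include <stddef.h>

/* Cut-down message header: link, type, and body size. */
struct rcv_hdr {
    struct rcv_hdr *next;
    long            type;
    size_t          size;
};

/* Cut-down queue descriptor, mirroring a few msqid_ds fields. */
struct rcv_q {
    struct rcv_hdr *first, *last;   /* msg_first, msg_last */
    size_t cbytes;                  /* msg_cbytes */
    size_t qnum;                    /* msg_qnum */
};

/*
 * Unlink and return the first message matching type (0 matches any).
 * Returns NULL when no message matches.
 */
struct rcv_hdr *rcv_dequeue(struct rcv_q *q, long type)
{
    struct rcv_hdr *prev = NULL, *h;

    for (h = q->first; h != NULL; prev = h, h = h->next) {
        if (type != 0 && h->type != type)
            continue;

        if (prev != NULL)
            prev->next = h->next;   /* unlink from the middle or tail */
        else
            q->first = h->next;     /* unlink from the head */
        if (q->last == h)
            q->last = prev;

        q->cbytes -= h->size;
        q->qnum--;
        h->next = NULL;
        return h;
    }
    return NULL;
}
```

Because the list is singly linked, the loop carries a prev pointer so a message can be removed from the middle of the queue, which is what a typed receive needs.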

The only remaining callable routine for applications to use is the msgctl(2) system call, which we discussed briefly earlier in the column. The control functions are pretty straightforward, as they typically involve either retrieving or setting values in a message queue's ipc_perm structure. When msgctl(2) is invoked with the IPC_RMID command, meaning the caller wishes to remove the message queue from the system, the kernel will walk the linked list of messages on the queue, freeing up the kernel resources associated with each message. Processes (threads) sleeping on the message queue will be sent a wake-up signal and ultimately end up with an EIDRM error (ID removed). The system then marks the msqid_ds structure as being available and returns.

That's it for IPC facilities. Next month we'll get into an area on Solaris that generates a fair number of questions: swap space, the swap file system, and swap allocation.


About the author
Jim Mauro is currently an Area Technology Manager for Sun Microsystems in the Northeast area, focusing on server systems, clusters, and high availability. He has a total of 18 years industry experience, working in service, educational services (he developed and delivered courses on Unix internals and administration), and software consulting. Reach Jim at

[(c) Copyright  Web Publishing Inc., and IDG Communication company]
