Inside Solaris by Jim Mauro

Shared memory uncovered

Discover how shared memory is implemented in Solaris, the boundaries of the shared memory tunable parameters, what "intimate shared memory" is, and how these pieces fit into the implementation

SunWorld
September  1997

Abstract
Shared memory is an interprocess communication (IPC) facility that exists in every major version of Unix available today. It is ubiquitous in its use by applications developed for Unix systems and is used extensively by commercial relational database management systems (RDBMS) as a means of implementing a cache. This month we'll look at the implementation of shared memory internally in SunOS.

Note that a non-goal of this month's column is a programmer's "how-to" discussion on writing applications that use shared memory. Information on how to code using the shared memory interfaces is available from numerous sources, including the Solaris Developers Kit (SDK) documentation (see Resources below). (3,800 words)



Shared memory provides an extremely efficient means of sharing data between multiple processes on a Solaris system because the data need not actually be moved from one process's address space to another. As the name implies, shared memory is exactly that: the sharing of the same physical RAM pages by multiple processes, such that each process has mappings to the same physical pages and can access the memory through pointer dereferencing in code. Using shared memory in an application requires only a few interfaces, bundled into the standard C library, /usr/lib/libc. These interfaces are listed in Table 1 below; consult the man pages for more detailed information. In the following sections, we'll examine what these interfaces do from a kernel implementation standpoint.

The kernel implementation of shared memory requires two loadable kernel modules: the shmsys module, which contains the kernel support routines for the shared memory library calls (Table 1), and the ipc module, which contains two kernel routines, ipcget() and ipcaccess(), that apply to all the interprocess communication (IPC) facilities. These dynamically loadable modules live in /kernel/sys (shmsys) and /kernel/misc (ipc). (See the kernel(1M) man page for information on loadable kernel modules.)

Table 1

shmget(2)
    Arguments: key, size, flags
    Returns: shared memory identifier
    Description: Creates a shared segment if one with a matching key does not exist (and the appropriate flags are set), or locates an existing segment based on the key.

shmat(2)
    Arguments: identifier, address, flags
    Returns: pointer to the shared segment
    Description: Attaches the shared segment to the process's address space.

shmdt(2)
    Arguments: address
    Returns: 0 on success, -1 on failure
    Description: Detaches a shared segment from the process's address space.

shmctl(2)
    Arguments: identifier, command, status structure
    Returns: 0 on success, -1 on failure
    Description: Provides basic shared memory control functions, such as getting statistics and setting permissions.
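
To make the interfaces in Table 1 concrete, here is a minimal sketch of their typical lifecycle -- create, attach, use, detach, and remove. The key value and segment size are arbitrary examples (a real application would usually derive its key with ftok(3C)), and error handling is kept to a minimum.

#include <sys/types.h>
#include <sys/ipc.h>
#include <sys/shm.h>
#include <stdio.h>
#include <string.h>
#include <stdlib.h>

int
main(void)
{
        key_t  key  = 0x1234;          /* arbitrary example key */
        size_t size = 1024 * 1024;     /* 1-MB segment */
        int    shmid;
        char   *addr;

        /* Create the segment, or locate it if the key already exists. */
        if ((shmid = shmget(key, size, IPC_CREAT | 0666)) == -1) {
                perror("shmget");
                exit(1);
        }

        /* Attach it; passing 0 as the address lets the system choose. */
        if ((addr = (char *)shmat(shmid, NULL, 0)) == (char *)-1) {
                perror("shmat");
                exit(1);
        }

        /* The pages are now ordinary memory as far as this process is
           concerned -- read and write through the pointer. */
        (void) strcpy(addr, "hello, shared memory");

        /* Detach, then explicitly remove the segment and its identifier. */
        (void) shmdt(addr);
        (void) shmctl(shmid, IPC_RMID, NULL);
        return (0);
}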

These modules are not loaded automatically by SunOS at boot time. The kernel dynamically loads a required module when a call is made that requires it. Thus, if the shmsys and ipc modules are not loaded, the first time an application makes a shared memory system call (e.g., shmget(2)), the kernel loads the modules and executes the system call. The modules remain loaded until they are explicitly unloaded via the modunload(1M) command, or until the system reboots. This explains a shared memory FAQ -- why the ipcs(1) command sometimes comes back with:

# ipcs
IPC status from  as of Tue Jul 22 21:49:34 1997
Message Queue facility not in system.
Shared Memory facility not in system.
Semaphores:
#

The "facility not in system" message means the module is not loaded. You can tell the operating system to load the module during bootup by using the forceload operation in the /etc/system file:

forceload: sys/shmsys

Also, you can use the modload(1M) command, which allows a root user to load any loadable kernel module from the command line. The modinfo(1M) command can be used to see which loadable modules are currently loaded in the kernel. Note that SunOS is smart enough not to allow the unloading (modunload(1M)) of a loadable module that is in use. Note also that the code is written to be aware of dependencies, such that loading the shmsys module will also cause the ipc module to be loaded.



Shared memory tunable parameters
The kernel maintains certain resources for the implementation of shared memory. Specifically, a shared memory identifier (shmid) is initialized and maintained by the operating system whenever a shmget(2) system call is executed successfully (recall from Table 1 that shmget(2) returns a shared memory identifier upon successful completion). The shmid identifies a shared segment, which has two components -- the actual shared RAM pages and a data structure that maintains information about the shared segment, the shmid_ds data structure, detailed in Table 3.

The system allocates kernel memory for some number of shmid_ds structures when the shmsys module is loaded, based on the shared memory tunable parameter called shmmni. Altogether, there are only four tunable parameters associated with shared memory. They are listed in Table 2, with a description, default value, data type, and minimum and maximum values.

Table 2

shmmax
    Default: 1048576
    Minimum: 1
    Maximum: 4294967295 (4 GB)
    Data type: unsigned int
    Description: Maximum size of a shared segment

shmmin
    Default: 1
    Minimum: 1
    Maximum: 4294967295 (4 GB)
    Data type: unsigned int
    Description: Minimum size of a shared segment

shmmni
    Default: 100
    Minimum: 1
    Maximum: 2147483648 (2 GB)
    Data type: signed int
    Description: Maximum number of shared memory identifiers

shmseg
    Default: 6
    Minimum: 1
    Maximum: 32767 (32 K)
    Data type: short
    Description: Maximum number of shared segments per process

Table 3

shm_perm
    Data type: structure
    ipcs(1) column: see ipc_perm (Table 4)
    Description: Embedded ipc_perm structure. Generic structure for IPC facilities that maintains permission information

shm_segsz
    Data type: unsigned int
    ipcs(1) column: SEGSZ
    Description: Size in bytes of the shared segment

shm_amp
    Data type: pointer
    ipcs(1) column: none
    Description: Pointer to the corresponding anon_map structure

shm_lkcnt
    Data type: unsigned short
    ipcs(1) column: none
    Description: Number of locks on the shared segment

shm_lpid
    Data type: long
    ipcs(1) column: LPID
    Description: PID of the last process that did a shared memory operation

shm_cpid
    Data type: long
    ipcs(1) column: CPID
    Description: PID of the process that created the shared segment

shm_nattch
    Data type: unsigned long
    ipcs(1) column: NATTCH
    Description: Number of attaches to the shared segment

shm_cnattch
    Data type: unsigned long
    ipcs(1) column: none
    Description: Creator attaches? Not currently used

shm_atime
    Data type: long
    ipcs(1) column: ATIME
    Description: Time of last attach to the shared segment

shm_dtime
    Data type: long
    ipcs(1) column: DTIME
    Description: Time of last detach from the shared segment

shm_ctime
    Data type: long
    ipcs(1) column: CTIME
    Description: Time of last change to the shmid_ds structure

shm_cv
    Data type: condition variable
    ipcs(1) column: none
    Description: A kernel condition variable. Not currently used

shm_sptas
    Data type: pointer
    ipcs(1) column: none
    Description: Pointer to an address space structure. Used with ISM for managing shared page tables (translation tables)

When the system first loads the shared memory module, it allocates kernel memory to support the shmid_ds structures and other required kernel support structures. The amount of kernel memory required is based on the shmmni tunable, since that defines the requested number of unique shared memory identifiers the system maintains. Each shmid_ds structure is 112 bytes in size and has a corresponding kernel mutex lock, which adds another eight bytes. Thus, the kernel memory required to support shared memory can be calculated as ((shmmni * 112) + (shmmni * 8)). The default value of 100 for shmmni requires the system to allocate about 12 kilobytes of kernel memory for shared memory support. The system protects itself against allocating too much kernel memory for shared memory support by checking the maximum available kernel memory, dividing that value by four, and using the result as the limit for allocating shared memory resources. Simply put, the system will not allow more than 25 percent of available kernel memory to be allocated for shared memory.

Note that the above applies to Solaris 2.5, 2.5.1, and 2.6. Prior releases, up to and including Solaris 2.4, did not impose the 25 percent limit check. Nor did they require the additional eight bytes per shmid_ds for a kernel mutex lock, because shared memory used very coarse-grained locking in the earlier releases and implemented only one kernel mutex in the shared memory code. Beginning in 2.5, finer-grained locking was implemented, allowing for greater potential parallelism of applications using shared memory.
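
As a quick sanity check on sizing, the small sketch below simply applies that formula. The 112-byte structure size and 8-byte mutex are the per-identifier figures quoted above for Solaris 2.5 through 2.6, and the shmmni value is whatever you intend to configure.

#include <stdio.h>

int
main(void)
{
        unsigned long shmmni = 100;     /* prospective shmmni setting */
        unsigned long bytes  = (shmmni * 112) + (shmmni * 8);

        (void) printf("shmmni=%lu consumes %lu bytes (about %lu KB) "
            "of nonpageable kernel memory\n",
            shmmni, bytes, bytes / 1024);
        return (0);
}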

It should be clear that one should not set shmmni to an arbitrarily large value simply to ensure sufficient resources. There are limits as to how much kernel memory the system supports. On sun4m-based platforms, the limit is on the order of 128 megabytes (MB) prior to Solaris 2.5, and 256 MB for 2.5, 2.5.1, and 2.6. On sun4d systems (SS1000 and SC2000), the limit is about 576 MB in 2.5 and later. On UltraSPARC (sun4u)-based systems, the kernel has its own four gigabyte (GB) address space, so it is much less constrained. Still, keep in mind that the kernel is not pageable; whatever kernel memory is needed remains resident in RAM, reducing the memory available for user processes. Given that Sun ships systems today with very large RAM capacities, this may not be an issue, but it should be considered nonetheless.

A note on terminology: sun4m, sun4d, and sun4u are Sun nomenclature for different "kernel architectures." Every operating system has hardware-independent and hardware-dependent components; the hardware-dependent components live at the lower levels, where hardware registers and the like get touched. Different kernel architectures exist for different Sun desktop and server systems and vary due to processor technology (SuperSPARC, UltraSPARC, etc.) and system infrastructure (Mbus, XDbus, Gigaplane, etc.). Use uname(1) with the -m flag to determine your system's kernel architecture:

% uname -m
sun4u

Note that the maximum value for shmmni listed in Table 2 is 2 GB. This is a theoretical limit, based on the data type (a signed integer), and should not be construed as something configurable today. Applying the math from above, you can see that two billion shared memory identifiers would require over 200 GB of kernel memory! Assess, as best you can, the number of shared memory identifiers the application requires, and set shmmni to that value plus 10 percent or so for headroom.
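
Like shmmax in the entries shown below, shmmni is set in /etc/system. The value here is purely illustrative, for an application expected to need roughly 180 identifiers:

set shmsys:shminfo_shmmni=200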

The remaining three shared memory tunables are quite simple in their meaning. shmmax defines the maximum size a shared segment can be. The size of a shared memory segment is determined by the second argument to the shmget(2) system call. When the call is executed, the kernel checks to ensure that the size argument is not greater than shmmax. If it is, an error is returned. Setting shmmax to its maximum value does not affect kernel memory usage -- no kernel resources get allocated based on shmmax -- so it can be tuned to its maximum value of 4 GB (0xffffffff), as in either of these /etc/system entries:

set shmsys:shminfo_shmmax=0xffffffff /* hexadecimal */
set shmsys:shminfo_shmmax=4294967295 /* decimal     */

Actually, the 4 GB size applies only to Solaris 2.5.1 and 2.6. Prior to those releases, the maximum value is 2 GB:

set shmsys:shminfo_shmmax=0x80000000 /* hexadecimal */
set shmsys:shminfo_shmmax=2147483648 /* decimal     */

The maximum size change is due in part to changing the shmmax data type from a signed integer to an unsigned integer in the kernel code.

Keep in mind that SunOS today supports a maximum virtual address space of 4 GB, due to the current 32-bit implementation. Since shared memory mappings apply to a process's virtual address space, one could never address a full 4 GB of shared memory. Every process has some address space used for text (execution code), stack space, and data, which all get charged against the 4-GB total, leaving something less than 4 GB for shared memory.

The shmmin tunable defines the smallest size a shared segment can be, as per the size argument passed in the shmget(2) call. There is no compelling reason to change this from its default value of 1. Lastly, there is shmseg, which defines the number of shared segments a process can attach (map pages) to. Processes may attach to multiple shared memory segments for application purposes, and this tunable determines how many mapped shared segments a process can have attached at any one time. Again, the 32 K (32,767) maximum in Table 2 is based on the data type (short) and does not necessarily reflect a value that will deliver acceptable application performance if processes actually attach to tens of thousands of shared segments. Things like shared segment size and system size (amount of RAM, number and speed of processors, etc.) will all factor into determining how far you can push the boundaries of this facility.

Intimate shared memory
Intimate shared memory (ISM) is an optimization first introduced in Solaris 2.2. It allows for the sharing of the translation tables involved in the virtual-to-physical address translation for shared memory pages, as opposed to just sharing the actual physical memory pages. Typically, non-ISM systems maintain a per-process mapping for the shared memory pages. With many processes attaching to shared memory, this creates a lot of redundant mappings to the same physical pages that the kernel must maintain. Additionally, all modern processors implement some form of translation lookaside buffer (TLB), which is (essentially) a hardware cache of address translation information. SPARC processors are no exception, and, just like an instruction or data cache, the TLB is limited in how many translations it can hold at any one time. As processes get context switched in and out, the effectiveness of the TLB is reduced. If processes that share memory can also share the memory mappings, the hardware TLB is used far more effectively.

The actual mapping structures differ across processors. UltraSPARC (SPARC V9) processors implement translation tables, made up of translation table entries (TTEs). SuperSPARC (SPARC V8) systems implement page tables, which contain page table entries (PTEs). Both do essentially the same thing -- provide a means of mapping virtual to physical addresses. However, the two SPARC architectures differ substantially in MMU (memory management unit) implementation. (The MMU is the part of the processor chip dedicated to the address translation process.) SPARC V8 defines the SPARC Reference MMU (SRMMU) and provides implementation details. SPARC V9 does not define an MMU implementation, but rather provides guidelines and boundaries for chip designers to follow; the actual MMU implementation is left to the chip design folks.

Additionally, there is a significant amount of kernel code dedicated to the address translation process (such as the creation and management of the translation tables). The actual details of translating a virtual address to a physical address and tying the hardware and software pieces together make for YAITOWA (yet another interesting thing to write about). Figure 1 provides a diagram of the virtual-to-physical address translation tables, with and without ISM. The diagram is of course an over-simplification, as a process will have other address mappings for its text, data, etc., in addition to the shared (or unshared) shared memory segment mappings. Also, the diagram is generic, not specific to a particular platform (as above, that would require a different diagram for several processor/server combinations).


Figure 1

Let's consider just one simple example of how ISM can save kernel space. Oracle uses shared memory for its Shared Global Area (SGA), which is how Oracle does its caching of data, indexes, stored procedures, etc. Assume Oracle is configured with a 2-GB SGA, and there are 400 Oracle processes (each attaching to the shared segment holding the SGA) running concurrently on the system at any point in time. 2 GB of RAM equates to 262,144 8-K pages. Assuming that the kernel needs to maintain eight bytes of information for each page mapping (two four-byte pointers), that's about 2 MB of kernel space needed to hold the translation information for one process. Without ISM, those mappings get replicated for each process, so multiply the number times 400, and we now need 800 MB of kernel space just for those mappings. With ISM, the mappings get shared, so we only need the 2 MB of space, regardless of how many processes attach.

In addition to translation table sharing, ISM provides another feature: When ISM is used, the shared pages are locked down in memory, such that they will never be paged out. This feature was added for the RDBMS vendors. As we said earlier, shared memory is used extensively by commercial RDBMS systems to cache data (among other things, such as stored procedures). Non-ISM implementations treat shared memory just like any other chunk of anonymous memory -- it gets backing store allocated from the swap device, and the pages themselves are fair game to be paged out if memory contention becomes an issue.

The effects of paging out shared memory pages that are part of a database cache would be disastrous from a performance standpoint (RAM shortages are never good for performance...). Because a vast majority of customers that purchase Sun servers use them for database applications and because database applications make extensive use of shared memory, addressing this issue with ISM was an easy decision.

Memory page locking is implemented in SunOS by setting some bits in the memory page's page structure. (Every page of memory has a corresponding page structure that contains information about the memory page. Page sizes vary across hardware platforms; UltraSPARC-based systems implement an 8-K memory page size, which means that 8 K is the smallest unit of memory that can be allocated and mapped to a process's address space.) The page structure contains several fields, among them p_cowcnt and p_lckcnt, the page copy-on-write count and page lock count, respectively. Copy-on-write tells the system that a page can be shared as long as it is only being read; once a write to the page is executed, the system makes a copy of the page and maps it to the process doing the write. The lock count maintains a count of how many times page locking was done for the page. Because many processes can share mappings to the same physical page, the page may be locked from several sources; the count ensures that a process that completes and exits will not unlock a page that still has locked mappings from other processes. The system's pageout code, which runs if free memory gets low, checks the page's p_cowcnt and p_lckcnt fields. If either field is non-zero, the page is considered locked in memory and is not marked as a candidate for freeing. Shared memory pages using the ISM facility do not use the copy-on-write lock (that would make for a non-shared page after a write); pages locked via ISM use the p_lckcnt page structure field.

Even though ISM locks pages in memory such that they will never be paged out, Solaris still treats ISM shared segments the same way it treats non-ISM shared segments and other anonymous memory pages -- it makes sure there is sufficient backing store in swap before completing the page mapping on behalf of the requesting process. While this seems superfluous for ISM pages (allocating disk swap space for pages that cannot be swapped out), it makes the implementation cleaner. Solaris 2.6 changes this: in 2.6, swap is no longer allocated for ISM pages. The net effect is that, prior to Solaris 2.6, allocating shared segments with ISM requires sufficient available swap space for the allocation to succeed.

Using ISM requires setting a flag in the shmat(2) system call. Specifically, the SHM_SHARE_MMU flag must be set in the shmflg argument passed in the shmat(2) call to instruct the system to set the shared segment up as intimate shared memory. Otherwise, the system will create the shared segment as a non-ISM shared segment.
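
Only the flags argument distinguishes an ISM attach from an ordinary one. Here is a minimal sketch, reusing the hypothetical key and size from the earlier example:

#include <sys/types.h>
#include <sys/ipc.h>
#include <sys/shm.h>
#include <stdio.h>
#include <stdlib.h>

int
main(void)
{
        int  shmid;
        char *addr;

        /* Example key and size, as before. */
        if ((shmid = shmget(0x1234, 1024 * 1024, IPC_CREAT | 0666)) == -1) {
                perror("shmget");
                exit(1);
        }

        /* Request an intimate shared memory attach. If ISM is disabled
           on the system, the kernel clears the flag and does a normal
           (non-ISM) attach, as shown in the flow later in this column. */
        if ((addr = (char *)shmat(shmid, NULL, SHM_SHARE_MMU)) == (char *)-1) {
                perror("shmat");
                exit(1);
        }
        return (0);
}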

Note that memory pages can be locked through other means. A root user can use the mlock(3) library routine or the memcntl(2) system call (with the MC_LOCK command) to lock pages in memory. One other note: In case you are not familiar with SunOS virtual memory nomenclature, anonymous memory is any memory page that does not have a corresponding named location in the file system. Things like files, executables, and shared libraries originate as files in the file system, and thus can be restored to memory if the memory page they were mapped to gets freed and used for another object. Things like heap space (malloc(3), sbrk(2) calls) and shared memory pages have no corresponding named location, and thus swap disk must be allocated in case the page must be pushed out to make room for something else.
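
For completeness, here is a sketch of the mlock(3) path mentioned above. valloc(3C) is used simply to obtain page-aligned memory, and the lock request typically fails with EPERM for non-root users.

#include <sys/mman.h>
#include <stdio.h>
#include <stdlib.h>
#include <unistd.h>

int
main(void)
{
        size_t pagesize = (size_t)sysconf(_SC_PAGESIZE);
        size_t len = pagesize * 4;
        /* valloc() returns page-aligned memory, which mlock() expects. */
        char *buf = valloc(len);

        if (buf == NULL) {
                perror("valloc");
                exit(1);
        }
        if (mlock(buf, len) != 0) {
                perror("mlock");        /* typically EPERM if not root */
                exit(1);
        }
        /* The four pages stay resident until munlock() or process exit. */
        (void) munlock(buf, len);
        return (0);
}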

Implementation flow
In this section we'll look at the flow of kernel code that executes when the shared memory system calls are made.

Applications first call shmget(2) to get a shared memory identifier. A key value is passed in the call that the kernel uses to locate (or create) a shared segment.

(application) shmget(key, size, flags (PRIVATE or CREATE))

(kernel)	shmget()
		    ipcget()
			if (key equals IPC_PRIVATE)
			 	create shared segment
				return unique shm_id
			else if (key exists)
				return shm_id
			else if (key does not exist AND IPC_CREAT is set)
				create shared segment
				return unique shm_id
		    if (this is a new shared segment)
			check size against min & max tunables
			get resources for anonymous memory mapping
			(swap allocation)
			init shmid_ds structure with appropriate information
		  	(set permissions for read/write based on flags and
			 effective UID & effective GID of process)
		    else (this is an existing segment)
			check size

         	return shmid (or error) back to application

(application) shmat(shmid, segment address, flags)
		
(kernel)	shmat()
		   ipc_access() check access permissions (read/write).
		   if (system has ISM disabled)
			clear SHM_SHARE_MMU flag
		   if (ISM and SHM_SHARE_MMU flag)
			determine number of pages and align/roundup size
			if (address is 0)
				system finds an address in range	
			else (user supplied an address in shmat call)
				check that address is properly aligned and
			 	usable (an address space in the address range
				is available)
				map segment to specified address
			create shared mapping tables
			map segment
		   else (not ISM)
			 if (address is 0)
				system finds address in range
				map segment
			 else (user supplied address)
				check that address is properly aligned and
				usable
				map segment
		
		return address (pointer to shared segment) or error.

At this point, applications have a pointer to the shared segment that they use in their code to read/write data. The shmdt(2) interface allows a process to unmap the shared pages (detach itself). This will not cause the system to remove the shared segment, even if all attached processes have detached themselves. A shared segment must be explicitly removed by using the shmctl(2) call with the IPC_RMID flag set, or via the command line using the ipcrm(1) command. Obviously, permissions must allow for the removal of the shared segment.

It should be pointed out that the kernel makes no attempt at coordinating concurrent access to shared segments. This must be done by the software developer using shared memory in order to prevent multiple processes attached to the same shared pages from writing to the same locations at the same time. There are several ways this can be done, the most common of which is the use of another IPC facility, semaphores.

The shmctl(2) interface can also be used to get information on the shared segment (return a populated shmid_ds structure), set permissions, and lock the segment in memory (processes attempting to lock shared pages must have an effective UID of root).
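
As an illustration of the information-gathering side of shmctl(2), the following sketch dumps a few fields of the shmid_ds structure for a given identifier (the shmid would come from shmget(2) or from ipcs(1) output):

#include <sys/types.h>
#include <sys/ipc.h>
#include <sys/shm.h>
#include <stdio.h>
#include <stdlib.h>

int
main(int argc, char **argv)
{
        struct shmid_ds ds;
        int shmid;

        if (argc != 2) {
                (void) fprintf(stderr, "usage: %s shmid\n", argv[0]);
                exit(1);
        }
        shmid = atoi(argv[1]);

        /* IPC_STAT copies the kernel's shmid_ds into our buffer. */
        if (shmctl(shmid, IPC_STAT, &ds) == -1) {
                perror("shmctl(IPC_STAT)");
                exit(1);
        }
        (void) printf("size: %lu bytes  attaches: %lu  creator pid: %ld\n",
            (unsigned long)ds.shm_segsz, (unsigned long)ds.shm_nattch,
            (long)ds.shm_cpid);
        return (0);
}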

The ipcs(1) command can be used to look at active IPC facilities in the system. When shared segments are created, the system maintains permission flags similar to the permission bits used by the file system. They determine who can read and write the shared segment, based on the user ID (UID) and group ID (GID) of the process attempting the operation. Extended information on the shared segment can be seen by using the -a flag with the ipcs(1) command. The information is fairly intuitive and is documented in the ipcs(1) man page. We also indicated in Table 3 which members of the shmid_ds structure are displayed in ipcs(1) output and what the corresponding column name is. The permission (mode) and key data for the shared segment are maintained in the ipc_perm data structure, which is embedded in (a member of) the shmid_ds structure and described in Table 4.

Table 4

uid
    Data type: long
    ipcs(1) column: OWNER
    Description: UID of the shared segment owner. The ipcs(1) column reports the corresponding owner name, not the numeric UID.

gid
    Data type: long
    ipcs(1) column: GROUP
    Description: GID of the group the owner belongs to. Reported by ipcs(1) as above.

cuid
    Data type: long
    ipcs(1) column: CREATOR
    Description: UID of the process that created the shared segment

cgid
    Data type: long
    ipcs(1) column: CGROUP
    Description: GID of the process that created the shared segment

mode
    Data type: mode
    ipcs(1) column: MODE
    Description: Access mode (read/write permissions) of the shared segment

seq
    Data type: unsigned long
    ipcs(1) column: none
    Description: Shared memory slot usage sequence number

key
    Data type: long
    ipcs(1) column: KEY
    Description: Shared segment key value

Closing notes
Shared memory is a powerful and relatively simple way to share data between processes. The use of shared memory by applications requires setting the shared memory tunable parameters to provide sufficient resources for the application. Hopefully, the use and implementation of these tunables has been made clear.

Intimate shared memory is an important optimization that makes more efficient use of the kernel and hardware resources involved in the implementation of virtual memory and provides a means of keeping heavily used shared pages locked in memory.

Semaphores, another common IPC facility, and one often used with shared memory in order to synchronize access to shared segments, will be the topic of next month's column.


Resources


About the author
Jim Mauro is currently an Area Technology Manager for Sun Microsystems in the Northeast area, focusing on server systems, clusters, and high availability. He has a total of 18 years industry experience, working in service, educational services (he developed and delivered courses on Unix internals and administration), and software consulting. Reach Jim at jim.mauro@sunworld.com.


[(c) Copyright Web Publishing Inc., an IDG Communications company]

URL: http://www.sunworld.com/swol-09-1997/swol-09-insidesolaris.html