Shared memory uncovered
Discover how shared memory is implemented in Solaris, the boundaries of the shared memory tunable parameters, what "intimate shared memory" is, and how these things play into the implementation
Shared memory is an interprocess communication (IPC) facility that exists in every major version of Unix available today. It is ubiquitous in its use by applications developed for Unix systems and is used extensively by commercial relational database management systems (RDBMS) as a means of implementing a cache. This month we'll look at the implementation of shared memory internally in SunOS. Note that a non-goal of this month's column is a programmer's "how-to" discussion on writing applications that use shared memory. Information on how to code using the shared memory interfaces is available from numerous sources, including the Solaris Developers Kit (SDK) documentation (see Resources below). (3,800 words)
The shared memory interfaces are implemented as system calls, with library entry points in /usr/lib/libc. These interfaces are listed in Table 1 below. Consult the man pages for more detailed information. In the following sections, we'll examine what these interfaces do from a kernel implementation standpoint.
The kernel implementation of shared memory requires two loadable kernel modules: the shmsys module, which contains the kernel support routines for the shared memory system calls (Table 1), and the ipc module, which contains two kernel routines, ipcget() and ipcaccess(), that apply to all the interprocess communication (IPC) facilities. These dynamically loadable modules reside in /kernel/sys (the shmsys module) and /kernel/misc (the ipc module). See the kernel(1M) man page for information on loadable kernel modules.
Table 1

| System Call | Arguments Passed | Returns | Description |
|---|---|---|---|
| shmget(2) | key, size, flags | identifier | Creates a shared segment if one with a matching key does not exist (and the appropriate flags are set), or locates an existing segment based on key. Returns a shared memory identifier. |
| shmat(2) | identifier, address, flags | pointer to the shared segment | Attaches the shared segment to the process's address space. |
| shmdt(2) | address | 0 or -1 (success or failure) | Detaches a shared segment from a process's address space. |
| shmctl(2) | identifier, command, status structure | 0 or -1 (success or failure) | Allows for some basic shared memory control functions, such as getting statistics, setting permissions, etc. |
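Although this column is not a programming how-to, a minimal sketch helps tie Table 1 to real code. This is an illustration only -- the key, segment size, and permission values below are arbitrary, and error handling is kept to a bare minimum:

#include <sys/types.h>
#include <sys/ipc.h>
#include <sys/shm.h>
#include <stdio.h>

int
main(void)
{
        key_t key = 0x1234;             /* arbitrary example key */
        int   shmid;
        char  *addr;

        /* shmget(2): create (or locate) a 1-MB segment, get its identifier */
        shmid = shmget(key, 1024 * 1024, IPC_CREAT | 0600);
        if (shmid == -1) {
                perror("shmget");
                return (1);
        }

        /* shmat(2): attach the segment, letting the system pick the address */
        addr = (char *)shmat(shmid, NULL, 0);
        if (addr == (char *)-1) {
                perror("shmat");
                return (1);
        }

        addr[0] = 'x';                  /* read/write through the returned pointer */

        /* shmdt(2): detach; the segment itself remains until explicitly removed */
        (void) shmdt(addr);
        return (0);
}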
These modules are not loaded automatically by SunOS at boot time.
The kernel will dynamically load a required module when a call is
made that requires the module. Thus, if the shmsys
and
ipc
modules are not loaded, the first time an
application makes a shared memory system call (e.g.
shmget(2)
), the kernel will load the module and execute
the system call. The module will remain loaded until it is
explicitly unloaded, via the modunload(1M)
command, or
the system reboots. This explains a frequently asked question about shared memory -- why, when the ipcs(1) command is executed, it sometimes comes back with:
# ipcs
IPC status from <running system> as of Tue Jul 22 21:49:34 1997
Message Queue facility not in system.
Shared Memory facility not in system.
Semaphores:
#
The "facility not in system" message means the module is not loaded.
You can tell the operating system to load the module during bootup
by using the forceload
operation in the
/etc/system
file:
forceload: sys/shmsys
Also, you can use the modload(1M)
command, which allows
a root user to load any loadable kernel module from the command
line. The modinfo(1M)
command can be used to see which
loadable modules are currently loaded in the kernel. Note that SunOS
is smart enough not to allow the unloading
(modunload(1M)
) of a loadable module that is in use.
Note also that the code is written to be aware of dependencies, such
that loading the shmsys
module will also cause the
ipc
module to be loaded.
Shared memory tunable parameters
The kernel maintains certain resources for the implementation of
shared memory. Specifically, a shared memory identifier
(shmid
) is initialized and maintained by the operating
system whenever a shmget(2)
system call is executed
successfully (recall from Table 1 that shmget(2)
returns a shared memory identifier upon successful completion). The
shmid
identifies a shared segment, which has two
components -- the actual shared RAM pages and a data structure that
maintains information about the shared segment, the
shmid_ds
data structure, detailed in Table 3.
The system allocates kernel memory for some number of
shmid_ds
structures at boot time, based on the shared
memory tunable parameter called shmmni
. Altogether,
there are only four tunable parameters associated with shared
memory. They are listed in Table 2, with a description, default,
data type, and minimum and maximum values.
Table 2

| Name | Default Value | Minimum Value | Maximum Value | Data Type | Description |
|---|---|---|---|---|---|
| shmmax | 1048576 | 1 | 4294967295 (4 GB) | unsigned int | Maximum size for a shared segment |
| shmmin | 1 | 1 | 4294967295 (4 GB) | unsigned int | Minimum size for a shared segment |
| shmmni | 100 | 1 | 2147483648 (2 GB) | signed int | Maximum number of shared memory identifiers |
| shmseg | 6 | 1 | 32767 (32 K) | short | Maximum number of shared segments per process |
Table 3

| Member Name | Data Type | Corresponding ipcs(1) column | Description |
|---|---|---|---|
| shm_perm | structure | see ipc_perm, Table 4 | Embedded ipc_perm structure. Generic structure for IPC facilities that maintains permission information |
| shm_segsz | unsigned int | SEGSZ | Size in bytes of the shared segment |
| shm_amp | pointer | none | Pointer to corresponding anon_map structure |
| shm_lkcnt | unsigned short | none | Number of locks on the shared segment |
| shm_lpid | long | LPID | PID of last process that did a shared memory operation |
| shm_cpid | long | CPID | PID of process that created the shared segment |
| shm_nattch | unsigned long | NATTCH | Number of attaches to the shared segment |
| shm_cnattch | unsigned long | none | Creator attaches; not currently used |
| shm_atime | long | ATIME | Time of last attach to shared segment |
| shm_dtime | long | DTIME | Time of last detach from shared segment |
| shm_ctime | long | CTIME | Time of last change to shmid_ds structure |
| shm_cv | condition variable | none | A kernel condition variable. Not currently used |
| shm_sptas | pointer | none | Pointer to address space structure. Used with ISM for managing shared page tables (translation tables) |
When the system first loads the shared memory module, it allocates
kernel memory to support the shmid
structures and other
required kernel support structures. The kernel memory required is
based on the shmmni
tunable, since that defines the
requested number of unique shared memory identifiers the system
maintains. Each shmid_ds
structure is 112 bytes in size
and has a corresponding kernel mutex lock, which is an additional
eight bytes. Thus, the amount of kernel memory required by the
system to support shared memory can be calculated as
((shmmni
* 112) + (shmmni
* 8)). The
default value of 100 for shmmni
requires the system to
allocate about 13 kilobytes of kernel memory for shared memory
support. The system makes some attempt at protecting itself against
allocating too much kernel memory for shared memory support by
checking for the maximum available kernel memory, dividing that
value by four, and using the resulting value as a limit for
allocating resources for shared memory. Simply put, the system will
not allow more than 25 percent of available kernel memory to be
allocated. Note that the above applies to Solaris 2.5, 2.5.1, and 2.6.
Prior releases, up to and including Solaris 2.4, did not impose a 25
percent limit check. Nor did they require the additional eight bytes per
shmid_ds
for a kernel mutex lock because shared memory
used very coarse-grain locking in the earlier releases and only
implemented one kernel mutex in the shared memory code. Beginning in
2.5, finer-grained locking was implemented, allowing for greater
potential parallelism of applications using shared memory.
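To make the arithmetic concrete, here is the calculation expressed as a small C function. This is a sketch only -- the 112-byte and 8-byte figures are the sizes quoted above for the 2.5 through 2.6 releases:

/* Kernel memory consumed for shmid_ds structures and their
 * per-structure mutex locks, given a value for shmmni.
 */
unsigned long
shm_kmem_bytes(unsigned long shmmni)
{
        return ((shmmni * 112) + (shmmni * 8));
}

shm_kmem_bytes(100) works out to 12,000 bytes, in line with the roughly 13 kilobytes cited above once the other required kernel support structures are accounted for.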
It should be clear that one should not set shmmni
to an
arbitrarily large value simply to ensure sufficient resources. There
are limits as to how much kernel memory the system supports. On
sun4m-based platforms, the limits are on the order of 128 megabytes
(MB) prior to Solaris 2.5, and 256 MB for 2.5, 2.5.1, and 2.6. On
sun4d systems (SS1000 and SC2000), the limits are about 576 MB in 2.5
and later. On UltraSPARC[sun4u]-based systems, the kernel has its
own four gigabyte (GB) address space, so it's much less
constrained. Still, keep in mind that the kernel is not pageable,
and thus whatever kernel memory is needed remains resident in RAM,
reducing available memory for user processes. Given the fact that
Sun ships systems today with very large RAM capacities, this may not
be an issue, but it should be considered nonetheless. (sun4m, sun4d, and sun4u are Sun nomenclature for different "kernel architectures.") Every operating system has hardware-independent and hardware-dependent components -- the hardware-dependent components are at the lower levels, where hardware registers and the like get touched. Different kernel architectures exist for different Sun desktop and server systems and vary due to processor technology (SuperSPARC, UltraSPARC, etc.) and system infrastructure (Mbus, XDbus, Gigaplane, etc.). Use uname(1M) with the -m flag to determine what your system's kernel architecture is:
% uname -m
sun4u
Note that the maximum value for shmmni
listed in Table
2 is 2 GB. This is a theoretical limit, based on the data type (a
signed integer) and should not be construed as something
configurable today. Applying the math from above, you see that two
billion shared memory identifiers would require over 200 GB of
kernel memory! One should assess to the best of their ability the
number of shared memory identifiers required by the application and
set shmmni
to that value plus 10 percent or so for
headroom.
The remaining three shared memory tunables are quite simple in their meaning. The shmmax tunable defines the maximum size a shared segment can be. The size of a shared memory segment is determined by the second argument to the shmget(2) system call. When the call is executed, the kernel checks to ensure that the size argument is not greater than shmmax. If it is, an error is returned. Setting shmmax to its maximum value does not affect the kernel size -- no kernel resources get allocated based on shmmax, so this can be tuned to its maximum value of 4 GB (0xffffffff), as in either of these /etc/system entries:
set shmsys:shminfo_shmmax=0xffffffff /* hexadecimal */
set shmsys:shminfo_shmmax=4294967295 /* decimal */
Actually, the 4 GB size applies only to Solaris 2.5.1 and 2.6. Prior to those releases, the maximum value is 2 GB:
set shmsys:shminfo_shmmax=0x80000000 /* hexadecimal */
set shmsys:shminfo_shmmax=2147483648 /* decimal */
The maximum size change is due in part to changing the
shmmax
data type from a signed integer to an unsigned
integer in the kernel code.
Keep in mind that SunOS today supports a maximum virtual address space of 4 GB, due to the current 32-bit implementation. Since shared memory mappings apply to a process's virtual address space, one could never address a full 4 GB of shared memory. Every process has some address space used for text (execution code), stack space, and data, which all get charged against the 4-GB total, leaving something less than 4 GB for shared memory.
The shmmin
tunable defines the smallest possible size
a shared segment can be, as per the size argument passed in the
shmget(2)
call. There's no compelling reason to change this from the default value of 1. Lastly, there's shmseg, which defines the number of shared segments a process can attach (map pages) to. Processes may attach to multiple shared memory segments for application purposes, and this tunable determines how many mapped shared segments a process can have attached at any one time. Again, the 32,767 (32 K) maximum in Table 2 is based on the data type (short), and does not necessarily reflect a value that will provide acceptable application performance if some number of processes actually attach to 32,000 shared memory segments. Things like
shared segment size and system size (amount of RAM, number/speed of
processors, etc.) will all factor into determining the extent to
which you can push the boundaries of this facility.
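Pulling the four tunables together, a complete /etc/system fragment might look like the following. The parameter names follow the shmsys:shminfo_ convention shown earlier; the values are purely illustrative and should be sized against your application's actual requirements, as discussed above:

set shmsys:shminfo_shmmax=0xffffffff
set shmsys:shminfo_shmmin=1
set shmsys:shminfo_shmmni=110
set shmsys:shminfo_shmseg=10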
Intimate shared memory
Intimate shared memory (ISM) is an optimization introduced first in
Solaris 2.2. It allows for the sharing of the translation tables
involved in the virtual to physical address translation for shared
memory pages, as opposed to just sharing the actual physical memory
pages. Typically, non-ISM systems maintain a per-process mapping for
the shared memory pages. With many processes attaching to shared
memory, this creates a lot of redundant mappings to the same
physical pages that the kernel must maintain. Additionally, all
modern processors implement some form of a translation lookaside
buffer (TLB), which is (essentially) a hardware cache of address
translation information. SPARC processors are no exception, and,
just like an instruction and data cache, the TLB has limits as to
how many translations it can maintain at any one time. As processes
get context switched in and out, the effectiveness of the TLB is reduced. If those processes are sharing memory, and the memory mappings can be shared as well, we can make more effective use of the hardware TLB.
The actual mapping structures differ across processors. UltraSPARC (SPARC V9) processors implement translation tables, made up of translation table entries (TTEs). SuperSPARC (SPARC V8) systems implement page tables, which contain page table entries (PTEs). They both do essentially the same thing -- provide a means of mapping virtual to physical addresses. However, the two SPARC architectures differ pretty substantially in MMU (memory management unit) implementation. (The MMU is the part of the processor chip dedicated to the address translation process.) SPARC V8 defines the SPARC Reference MMU (SRMMU) and provides implementation details. SPARC V9 does not define an MMU implementation, but rather provides some guidelines and boundaries for the chip designers to follow. The actual MMU implementation is left to the chip design folks.
Additionally, there is a significant amount of kernel code dedicated to the address translation process (such as the creation and management of the translation tables). The actual details of translating a virtual address to a physical address and tying the hardware and software pieces together make for YAITOWA (yet another interesting thing to write about). Figure 1 provides a diagram of the virtual-to-physical address translation tables, with and without ISM. The diagram is of course an over-simplification, as a process will have other address mappings for its text, data, etc., in addition to the shared (or unshared) shared memory segment mappings. Also, the diagram is generic, not specific to a particular platform (as above, that would require a different diagram for several processor/server combinations).
Figure 1: Virtual-to-physical address translation tables, with and without ISM
Let's consider just one simple example of how ISM can save kernel space. Oracle uses shared memory for its Shared Global Area (SGA), which is how Oracle does its caching of data, indexes, stored procedures, etc. Assume Oracle is configured with a 2-GB SGA, and there are 400 Oracle processes (each attaching to the shared segment holding the SGA) running concurrently on the system at any point in time. 2 GB of RAM equates to 262,144 8-K pages. Assuming that the kernel needs to maintain eight bytes of information for each page mapping (two four-byte pointers), that's about 2 MB of kernel space needed to hold the translation information for one process. Without ISM, those mappings get replicated for each process, so multiply the number times 400, and we now need 800 MB of kernel space just for those mappings. With ISM, the mappings get shared, so we only need the 2 MB of space, regardless of how many processes attach.
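Here is that arithmetic worked out as a small C program, using the same assumptions as the example above (an 8-K page size and roughly eight bytes of mapping information per page, per process):

#include <stdio.h>

/* The SGA example above, worked out numerically. Eight bytes of
 * mapping information per page is the approximation used in the text.
 */
int
main(void)
{
        long long sga_bytes   = 2LL * 1024 * 1024 * 1024; /* 2-GB SGA          */
        long long page_size   = 8 * 1024;                 /* 8-K pages         */
        long long pages       = sga_bytes / page_size;    /* 262,144 pages     */
        long long per_process = pages * 8;                /* ~2 MB per process */

        printf("without ISM: %lld MB\n", (per_process * 400) >> 20); /* ~800 MB */
        printf("with ISM:    %lld MB\n", per_process >> 20);         /* ~2 MB   */
        return (0);
}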
In addition to the translation table sharing, ISM also provides another feature. When ISM is used, the shared pages are locked down in memory, such that they'll never get paged out. This feature was added for the RDBMS vendors. As we said earlier, shared memory is used extensively by commercial RDBMS systems to cache data (among other things, such as stored procedures). Non-ISM implementations treat shared memory just like any other chunk of anonymous memory -- it gets backing store allocated from the swap device, and the pages themselves are fair game to get paged out if memory contention becomes an issue.
The effects of paging out shared memory pages that are part of a database cache would be disastrous from a performance standpoint (RAM shortages are never good for performance...). Because a vast majority of customers that purchase Sun servers use them for database applications and because database applications make extensive use of shared memory, addressing this issue with ISM was an easy decision.
Memory page locking is implemented in SunOS by setting some bits in the memory page's page structure. (Every page of memory has a corresponding page structure that contains information about the memory page. Page sizes vary across different hardware platforms; UltraSPARC-based systems implement an 8-K memory page size, which means that 8 K is the smallest unit of memory that can be allocated and mapped to a process's address space.) The page structure contains several fields, among which are p_cowcnt and p_lckcnt, the page copy-on-write count and page lock count, respectively. Copy-on-write tells the system that the page can be shared as long as it is only being read, but once a write to the page is executed, the system makes a copy of the page and maps the copy to the process doing the write. The lock count maintains a count of how many times page locking was done for this page. Because many processes can share mappings to the same physical page, the page may be locked from several sources. The system maintains a count to ensure that processes that complete and exit will not result in a page being unlocked while it still has locked mappings from other processes. The system's pageout code, which runs if free memory gets low, checks the status of the page's p_cowcnt and p_lckcnt fields. If either of these fields is non-zero, the page is considered locked in memory and thus is not marked as a candidate for freeing. Shared memory pages using the ISM facility do not use the copy-on-write lock (that would make for a non-shared page after a write); pages locked via ISM use the p_lckcnt page structure field.
Even though ISM locks pages in memory such that they'll never get paged out, Solaris still treats ISM shared segments the same way it treats non-ISM shared segments and other anonymous memory pages -- it makes sure there is sufficient backing store in swap before completing the page mapping on behalf of the requesting process. While this seems superfluous for ISM pages (allocating disk swap space for pages that can't be swapped out), it makes the implementation cleaner. Solaris 2.6 changes this somewhat, and in 2.6 swap is not allocated for ISM pages. The net effect of this is that allocation of shared segments using ISM requires sufficient available swap space for the allocation to succeed, at least until Solaris 2.6.
Using ISM requires setting a flag in the shmat(2)
system call. Specifically, the SHM_SHARE_MMU
flag must
be set in the shmflg
argument passed in the
shmat(2)
call to instruct the system to set the shared
segment up as intimate shared memory. Otherwise, the system will
create the shared segment as a non-ISM shared segment.
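In code, the difference is just the flag passed to shmat(2). A minimal sketch, assuming the segment identified by shmid already exists (SHM_SHARE_MMU is defined in <sys/shm.h>):

#include <sys/types.h>
#include <sys/ipc.h>
#include <sys/shm.h>

/* Attach shmid as intimate shared memory; without SHM_SHARE_MMU the
 * same call produces an ordinary (non-ISM) attach.
 */
void *
attach_ism(int shmid)
{
        void *addr = shmat(shmid, NULL, SHM_SHARE_MMU);
        return (addr == (void *)-1 ? NULL : addr);
}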
Note that memory pages can be locked through other means. A root
user can use the mlock(3)
library routine or the
memcntl(2)
system call (with the MLOCK
flag) to lock pages in memory. One other note: In case you're not
familiar with SunOS Virtual Memory nomenclature, anonymous memory is
any memory page that does not have a corresponding named location in
the file system. Things like files, executables, and shared
libraries originate as files in the file system, and thus can be
restored to memory if the memory page they were mapped to gets freed
and used for another object. Things like heap space
(malloc(3)
, sbrk(2)
calls) and shared
memory pages have no corresponding named location, and thus swap
disk must be allocated in case the page must be pushed out to make
room for something else.
Implementation flow
In this section we will look at the flow of kernel code that
executes when the shared memory system calls are called.
Applications first call shmget(2)
to get a shared memory
identifier. A key value is passed in the call that the kernel uses
to locate (or create) a shared segment.
(application)
    shmget(key, size, flags (PRIVATE or CREATE))
(kernel)
    shmget()
        ipcget()
            if (key equals IPC_PRIVATE)
                create shared segment
                return unique shm_id
            else if (key exists)
                return shm_id
            else if (key does not exist AND IPC_CREAT is set)
                create shared segment
                return unique shm_id
        if (this is a new shared segment)
            check size against min & max tunables
            get resources for anonymous memory mapping (swap allocation)
            init shmid_ds structure with appropriate information
              (set permissions for read/write based on flags and
               effective UID & effective GID of process)
        else (this is an existing segment)
            check size
        return shmid (or error) back to application

(application)
    shmat(shmid, segment address, flags)
(kernel)
    shmat()
        ipc_access()
            check access permissions (read/write)
        if (system has ISM disabled)
            clear SHM_SHARE_MMU flag
        if (ISM and SHM_SHARE_MMU flag)
            determine number of pages and align/round up size
            if (address is 0)
                system finds an address in range
            else (user supplied an address in shmat call)
                check that address is properly aligned and usable
                  (an address space in the address range is available)
                map segment to specified address
            create shared mapping tables
            map segment
        else (not ISM)
            if (address is 0)
                system finds address in range
                map segment
            else (user supplied address)
                check that address is properly aligned and usable
                map segment
        return address (pointer to shared segment) or error
At this point, applications have a pointer to the shared segment
that they use in their code to read/write data. The
shmdt(2)
interface allows a process to unmap the shared
pages (detach itself). This will not cause the system to remove the
shared segment, even if all attached processes have detached
themselves. A shared segment must be explicitly removed by using the
shmctl(2)
call with the IPC_RMID
flag set,
or via the command line using the ipcrm(1)
command.
Obviously, permissions must allow for the removal of the shared
segment.
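A short sketch of that cleanup path -- detach the mapping, then explicitly remove the segment. From the command line, ipcrm -m <shmid> accomplishes the same removal:

#include <sys/types.h>
#include <sys/ipc.h>
#include <sys/shm.h>

/* shmdt() only unmaps the pages from this process; the segment and
 * its identifier persist until IPC_RMID is issued (or ipcrm is run).
 */
void
detach_and_remove(int shmid, void *addr)
{
        (void) shmdt(addr);
        (void) shmctl(shmid, IPC_RMID, NULL);
}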
It should be pointed out that the kernel makes no attempt at coordinating concurrent access to shared segments. This must be done by the software developer using shared memory in order to prevent multiple processes attached to the same shared pages from writing to the same locations at the same time. There are several ways this can be done, the most common of which is the use of another IPC facility, semaphores.
The shmctl(2)
interface can also be used to get
information on the shared segment (return a populated
shmid_ds
structure), set permissions, and lock the
segment in memory (processes attempting to lock shared pages must
have an effective UID of root).
The ipcs(1)
command can be used to look at active IPC
facilities in the system. When shared segments are created, the
system maintains permission flags similar to the permission bits
used by the file system. They determine who can read and write the
shared segment based on the user ID (UID) and group ID (GID) of the
process attempting the operation. Extended information on the shared
segment can be seen by using the -a
flag with the
ipcs(1)
command. The information is fairly intuitive
and documented in the ipcs(1)
man page. We also
indicated in Table 3 which members of the shmid_ds
structures are displayed by ipcs(1)
output and what the
corresponding column name is. The permissions (mode) and key data
for the shared structure are maintained in the ipc_perm
data structure, which is embedded in (a member of) the shmid_ds structure and is described in Table 4.
Table 4

| Member Name | Data Type | Corresponding ipcs(1) column | Description |
|---|---|---|---|
| uid | long | OWNER | UID of shared segment owner. The ipcs(1) column reports the corresponding owner name, not the numeric UID. |
| gid | long | GROUP | GID of the group the owner belongs to. Reported by ipcs(1) as above. |
| cuid | long | CREATOR | UID of process that created the shared segment |
| cgid | long | CGROUP | GID of process that created the shared segment |
| mode | mode | MODE | Access mode (r/w permissions) of shared segment |
| seq | unsigned long | none | Shared memory slot usage sequence number |
| key | long | KEY | Shared segment key value |
Closing notes
Shared memory is a powerful and relatively simple way to share data
between processes. The use of shared memory by applications requires
setting the shared memory tunable parameters to provide sufficient
resources for the application. Hopefully, the use and implementation
of these tunables have been made clear.
Intimate shared memory is an important optimization that makes more efficient use of the kernel and hardware resources involved in the implementation of virtual memory and provides a means of keeping heavily used shared pages locked in memory.
Semaphores, another common IPC facility, and one often used with shared memory in order to synchronize access to shared segments, will be the topic of next month's column.
Resources
About the author
Jim Mauro is currently an Area Technology Manager for Sun
Microsystems in the Northeast area, focusing on server systems,
clusters, and high availability. He has a total of 18 years industry
experience, working in service, educational services (he developed
and delivered courses on Unix internals and administration), and
software consulting.
Reach Jim at jim.mauro@sunworld.com.