Performance Q & A by Adrian Cockcroft

Get better results when you design your cache to match your applications and system

Understanding how cache works will help you improve speed

SunWorld
January 1997

Abstract
Increasing performance often means optimizing your cache. But before you can optimize it, you need to understand it. Bigger is not necessarily faster, so first you need to understand your system, your applications, and your cache. (3,800 words)


Q: I'm confused by all the different kinds of cache -- the cache file system, directory name cache, name service cache, CPU cache. How can I tell if the cache is working well and how big it should be?
--Tasha in Cashmere

A: Caches are used to store a small amount of information for repeated fast access. Many types of cache occur throughout computer hardware and software systems. We'll look at the common principles, then examine a few examples.

Caches work on two basic principles that should be quite familiar to you from everyday life. The first is that if you spend a long time going to get something and you think you may need it again soon, you keep it nearby. For example, if you are working on your car and need a 15mm spanner, you find it in your toolbox, then crawl under the car and begin work. If you then need a 10mm spanner, you don't put the 15mm back in the toolbox; you leave it under the car. When you have finished, there is a cache of tools in a pile under the car that makes up your working set. If you only allowed one tool at a time under the car, you would waste a lot of time going back to the toolbox. However, if the pile of tools gets too large, you may decide to put away a few that you think you won't need any more.

The second principle is that when you get something, you can save time by getting a few extra things that you think you might need soon. In the car example, you could grab a handful of spanners each time, thus saving yourself a few trips. You find that the pile under the car gets bigger much more quickly, some tools don't ever get used, and it takes you longer to pick up and put down a handful of tools on each trip to the toolbox.

The first principle is called "temporal locality" and depends on reusing the same things. The second principle is called "spatial locality" and depends on using things at the same time that are located near each other. Caches only work well if there is good locality in what you are doing.
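If you prefer code to car repairs, here is a toy C sketch of the same idea (the array size is arbitrary): summing a two-dimensional array row by row walks through memory in order, so each cache line that is fetched gets fully used, while summing it column by column jumps a whole row between accesses and touches a different line almost every time.

#include <stdio.h>

#define N 1024

static double grid[N][N];

/* Good spatial locality: walks memory in the order it is laid out,
   so each cache line fetched is fully used before moving on. */
double sum_rows(void)
{
    double sum = 0.0;
    for (int row = 0; row < N; row++)
        for (int col = 0; col < N; col++)
            sum += grid[row][col];
    return sum;
}

/* Poor spatial locality: jumps a whole row between accesses,
   touching a different cache line almost every time. */
double sum_columns(void)
{
    double sum = 0.0;
    for (int col = 0; col < N; col++)
        for (int row = 0; row < N; row++)
            sum += grid[row][col];
    return sum;
}

int main(void)
{
    printf("%f %f\n", sum_rows(), sum_columns());
    return 0;
}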

Another example is the "cache" of groceries that you keep in your house that save you a trip to the supermarket every time you want a bite to eat. Let's say you are the kind of person who hates shopping and lives off canned and frozen food that you buy once a month. You have a very efficient cache with few shopping trips, and you can spend all your spare time browsing the Internet and reading articles like this.

Now let's say your mother comes to stay and tells you that you need a diet of fresh fruit and salad instead. You can't buy in bulk as it doesn't freeze or keep, therefore you'll need to visit the shops almost every day. You now waste all your spare time shopping. You notice one day that your local shop is now also on the Internet, and you can use your browser to place orders for delivery. Excellent! Now you can surf the Internet, keep a good diet, and never go grocery shopping again!

This analogy illustrates that some sequences of behavior work very efficiently with a cache, and others make little or no use of it. In some cases cache-busting behavior can be fixed by changing the system to work in a different way that bypasses the normal cache fetch. These alternative methods usually require extra hardware or software support in a computer system. Without getting into a specific kind of cache, there are some generic terms and measurements that parameterize caches. Reading from a cache is very different from writing new information to it, so we'll look at cache reads first, then at writes.


Cache reads that succeed
There are many kinds of information that can be cached: the value of a memory location, the location of a file, or the network address of another system. When the information is needed, the system first looks for it in the cache. This requires that each item in the cache has a unique name or tag that is associated with the information.

When a cache lookup occurs, the first thing that happens is a search to match the name you gave it with one of the tags. If the search succeeds the associated data is returned, and this is called a read hit. This is the best case situation, as it is both quick and successful. The two measures of interest are the read-hit rate and the read-hit cost. The read-hit rate is the number of read hits that occur per second, and the read-hit cost is the time it takes to search the cache and return the data.

For a CPU's primary data cache, an address is used as the tag. Millions of read hits occur each second, and the cost is known as the load-use delay, a constant small number of CPU clock cycles. When the CPU issues a load instruction to move data from memory to a register, it has to wait a few cycles before it can use the register. A good compiler will try to optimize code by moving load instructions forward, and filling in the load-use delay with other work. It is hard to measure (and understand) what is going on in the CPU cache.
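As a rough illustration of that scheduling idea, here is a hand-scheduled C sketch of my own (not compiler output): the load of the next array element is issued before the current element is used, so the load-use delay overlaps with the addition.

#include <stdio.h>

/* Hand-scheduled sketch: the load of the next element is issued before
   the current element is used, so the load-use delay overlaps with real
   work. Optimizing compilers do this automatically; this only shows the idea. */
static double sum_pipelined(const double *a, int n)
{
    if (n == 0)
        return 0.0;
    double sum = 0.0;
    double cur = a[0];              /* first load issued early          */
    for (int i = 1; i < n; i++) {
        double next = a[i];         /* load for the next iteration ...  */
        sum += cur;                 /* ... overlaps with this use       */
        cur = next;
    }
    return sum + cur;
}

int main(void)
{
    double data[4] = { 1.0, 2.0, 3.0, 4.0 };
    printf("%f\n", sum_pipelined(data, 4));   /* prints 10.000000 */
    return 0;
}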

For the directory name lookup cache (DNLC), an open system call causes a pathname component, such as "export", to be looked up to find its inode. The rate at which this occurs is reported by sar -a as namei/s, and it can reach a few thousand per second on a busy file server. The read-hit cost is relatively low and only consumes CPU time: an efficient hashed search is used, so the size of the cache does not affect the lookup time.

The name service cache is implemented as a multithreaded daemon process (nscd). It caches password and group information for users, and hostname and IP address information for systems. For example, a call to gethostbyname() ordinarily would involve a search through the local /etc/hosts file, a call to a local NIS or NIS+ server, and perhaps a DNS lookup, as specified in /etc/nsswitch.conf. The nscd provides a system-wide cache for the results of these searches, and gethostbyname() accesses it through a very fast new inter-process call mechanism known as a door call.
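From the application's point of view the cache is invisible; the code just calls the library. The minimal C sketch below uses an example hostname of my choosing; whether the answer comes from /etc/hosts, NIS, DNS, or the nscd cache is decided by /etc/nsswitch.conf, not by the program (on Solaris 2, link it with -lsocket -lnsl).

#include <stdio.h>
#include <string.h>
#include <netdb.h>
#include <netinet/in.h>
#include <arpa/inet.h>

int main(void)
{
    /* Example hostname only; substitute one from your own network. */
    struct hostent *he = gethostbyname("www.sun.com");
    if (he == NULL) {
        fprintf(stderr, "lookup failed\n");
        return 1;
    }
    struct in_addr addr;
    memcpy(&addr, he->h_addr_list[0], sizeof(addr));
    printf("%s -> %s\n", he->h_name, inet_ntoa(addr));
    return 0;
}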

If you are debugging a name service problem, you may notice the door system call, and many people wonder what it is. The manual page in Solaris 2.5 and 2.5.1 describes it as experimental, unpublished, and subject to change, but if you look on the Sun Labs research site, you can find a description of it as part of the experimental SpringOS object oriented operating system (see Resources below). The command nscd -g will show you the statistics for the name service cache. The read-hit cost is a few tens of microseconds, depending on your CPU speed (yes, doors are very fast....).

Cache reads that miss the cache
If the search fails and the data is not in the cache, it is called a read-miss. The rate at which this occurs is called the read-miss rate, and the time taken is called the read-miss cost. A read miss corresponds to the work that would be needed if the cache did not exist, plus the added overhead of checking the cache. Read misses are slow, and the read miss cost can be unpredictable, although it is usually possible to figure out the best case, the worst case, and the average.

CPU caches come in many configurations; the most common setup is to have two separate primary caches built into the CPU chip -- one to hold instructions, and one to hold data -- often holding about 16 kilobytes each. These caches then refer to an external level two cache holding between 256 kilobytes and 4 megabytes of data. If there is a read miss in the small primary cache, it may result in a hit or a miss in the level two cache.

A level two hit takes a few cycles longer than a level one hit, but is still very efficient. A level two miss involves a main memory reference. In a multiprocessor system, it is necessary to access the backplane and check every CPU cache in the system as well. If the backplane is already very busy, there may be a considerable extra delay. This is why heavily loaded multiprocessor servers run faster and scale better with the reduced number of read misses that come with larger CPU caches.

For example, Sun currently offers a choice of 512 kilobytes or 1 megabyte caches for the Enterprise Server range. The 1 megabyte cache choice is always faster, but it will have a proportionately greater benefit on performance and scalability of a 24 CPU E6000 than on a 4 CPU E3000.

An anomaly can arise in some cases. It takes time to search the cache and to load new data into it, which slows down the fetch operation. If you normally operate in a cache-busting mode, you actually go faster if there is no cache at all. Since caches add to the cost of a system (extra memory used or extra hardware), they may not be present on a simpler or cheaper system. The anomaly is that a cheaper system may be faster than a more expensive one for some kinds of work.

The microSPARC processor used in the SPARCstation 5 is very simple -- it has a high cache-miss rate but a low cache-miss cost. The SuperSPARC processor used in the SPARCstation 20 has much bigger caches, a far lower cache-miss rate, but a much higher cache-miss cost. For a commercial database workload, the large cache works well, and the SPARCstation 20 is much faster. For a large Verilog EDA simulation, the caches are all too small to help, and the low latency to memory makes the SPARCstation 5 an extremely fast machine.
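One rough way to see this tradeoff is to estimate an average access time as the hit time plus the miss rate multiplied by the miss cost. The numbers in the C sketch below are invented for illustration, not measured SPARCstation figures; the point is simply that the big cache wins when it cuts the miss rate, and the cheap miss path wins when nothing fits anyway.

#include <stdio.h>

/* Average access time = hit time + miss rate * miss cost.
   All numbers below are invented for illustration only. */
static double avg_ns(double hit_ns, double miss_rate, double miss_ns)
{
    return hit_ns + miss_rate * miss_ns;
}

int main(void)
{
    /* Workload that fits the big cache: the low miss rate wins. */
    printf("database workload: small %.0f ns, big %.0f ns\n",
           avg_ns(10.0, 0.10, 100.0),    /* 20 ns */
           avg_ns(10.0, 0.01, 400.0));   /* 14 ns */

    /* Cache-busting workload: both miss a lot, so the cheap miss path wins. */
    printf("simulation workload: small %.0f ns, big %.0f ns\n",
           avg_ns(10.0, 0.30, 100.0),    /* 40 ns  */
           avg_ns(10.0, 0.25, 400.0));   /* 110 ns */
    return 0;
}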

The lessons learned from this experience were incorporated in the UltraSPARC design, which also has very low latency to memory. To get the low latency, the latest, fastest cache memory and very fast, wide system buses are needed. This is one reason why UltraSPARC caches, at 1 megabyte, are currently smaller than those of some other systems, which have larger, slower caches and a higher cache-miss cost.

Returning to our other examples, the DNLC caches a filename. If the system cannot find the corresponding inode number in the DNLC, it has to read through the directory structures of the filesystem. This may involve a linear search through a UFS directory structure, or a series of readdir calls over NFS to an NFS server. There is some additional caching of blocks of directory data and NFS attributes that may save time, but often the search has to sleep for many milliseconds waiting for several disk reads or network requests to complete.

You can monitor the number of directory blocks read per second using sar -a as dirbk/s, and you can also look at the number of NFS2 readdir calls and NFS3 readdirplus calls. NFS2 reads a single entry with each readdir call, while NFS3 adds the readdirplus call, which reads a series of entries in one go for greater efficiency (but longer latency -- more next month on NFS3). To summarize the effects of a DNLC miss: a little extra CPU is consumed during the search, but the sleeping -- waiting for the disk and network -- dominates the miss cost. DNLC hit rates (the number of read hits as a proportion of all lookups) are often in the 80 to 100 percent range.

If you run the find command, which walks through the directory structure, you will discover that the hit rate drops to perhaps 30 to 40 percent. My virtual_adrian rule for the DNLC hit rate is a bit too simple. It tells you to increase the DNLC size if the hit rate drops below 90 percent and the reference rate is above 100 per second. In some cases a lot of new files are being created, or a find-like operation is needed, and there is no point in having a big DNLC.
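Written out as code, the rule is nothing more than those two thresholds. The C sketch below is a restatement of it with made-up sample figures.

#include <stdio.h>

/* The simple rule described above: flag the DNLC as possibly undersized
   only when the hit rate is below 90 percent and the reference rate is
   above 100 lookups per second. As noted, a find-style scan or heavy
   file creation can trip it even when a bigger DNLC would not help. */
static int dnlc_looks_too_small(double hit_rate_pct, double lookups_per_sec)
{
    return hit_rate_pct < 90.0 && lookups_per_sec > 100.0;
}

int main(void)
{
    printf("%d\n", dnlc_looks_too_small(75.0, 500.0)); /* 1: worth a look       */
    printf("%d\n", dnlc_looks_too_small(35.0, 50.0));  /* 0: low reference rate */
    return 0;
}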

A read miss on the name service cache causes a sequence of lookups by the nscd that are controlled by the /etc/nsswitch.conf file. If it contains the entry "hosts: files nisplus dns", then nscd first uses the standard "files" backend to look in /etc/hosts. In recent releases of Solaris 2, the files backend caches the file and does a hashed lookup into it, so very long hosts and passwd files are searched efficiently, and the read miss could be quite quick and CPU-bound with no waiting. A NIS or NIS+ lookup involves a call over the network to a name server, so there's at least a few milliseconds of delay. If the lookup is passed up a hierarchy of DNS servers across the Internet, it could take several seconds.

The worst case is a lookup for a hostname that does not exist in a remote DNS domain. Unfortunately, some applications repeat the same lookup several times in quick succession, so a novel feature of the nscd is that it supports negative caching. It remembers that a lookup failed, and tells you immediately that the host does not exist if you ask a second time. There is a five second timeout for negative entries by default. The full output from nscd -g (on a recently booted Solaris 2.5 system) is shown below.

% nscd -g
nscd configuration:

      
   0  server debug level
"/dev/null"  is server log file

passwd cache:

       Yes  cache is enabled
       507  cache hits on positive entries
         0  cache hits on negative entries
        55  cache misses on positive entries
         2  cache misses on negative entries
        89% cache hit rate
         0  queries deferred
        16  total entries
       211  suggested size
       600  seconds time to live for positive entries
         5  seconds time to live for negative entries
        20  most active entries to be kept valid
       Yes  check /etc/{passwd,group,hosts} file for changes
        No  use possibly stale data rather than waiting for refresh

group cache:

       Yes  cache is enabled
        27  cache hits on positive entries
         0  cache hits on negative entries
        11  cache misses on positive entries
         0  cache misses on negative entries
        71% cache hit rate
         0  queries deferred
         5  total entries
       211  suggested size
      3600  seconds time to live for positive entries
         5  seconds time to live for negative entries
        20  most active entries to be kept valid
       Yes  check /etc/{passwd,group,hosts} file for changes
        No  use possibly stale data rather than waiting for refresh

hosts cache:

       Yes  cache is enabled
        22  cache hits on positive entries
         3  cache hits on negative entries
         7  cache misses on positive entries
         3  cache misses on negative entries
        71% cache hit rate
         0  queries deferred
         4  total entries
       211  suggested size
      3600  seconds time to live for positive entries
         5  seconds time to live for negative entries
        20  most active entries to be kept valid
       Yes  check /etc/{passwd,group,hosts} file for changes
        No  use possibly stale data rather than waiting for refresh
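The output above shows separate hit and miss counts for negative entries and a five-second time to live for them. As a toy illustration of negative caching -- not the way nscd is actually implemented -- the C sketch below remembers a single failed lookup for five seconds; the hostname is hypothetical.

#include <stdio.h>
#include <string.h>
#include <time.h>

/* Minimal sketch of negative caching with a five-second time to live.
   One entry only, to keep the illustration short; nscd keeps many. */
struct neg_entry {
    char   name[256];
    time_t expires;
    int    valid;
};

static struct neg_entry neg;

/* Returns 1 if we already know this name does not exist. */
static int known_missing(const char *name)
{
    return neg.valid &&
           strcmp(neg.name, name) == 0 &&
           time(NULL) < neg.expires;
}

static void remember_missing(const char *name)
{
    strncpy(neg.name, name, sizeof(neg.name) - 1);
    neg.name[sizeof(neg.name) - 1] = '\0';
    neg.expires = time(NULL) + 5;   /* 5-second negative TTL */
    neg.valid = 1;
}

int main(void)
{
    const char *host = "no-such-host.example.com";   /* hypothetical name */
    if (!known_missing(host)) {
        /* ... the expensive DNS lookup fails here ... */
        remember_missing(host);
    }
    printf("cached negative: %d\n", known_missing(host));  /* prints 1 */
    return 0;
}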

What happens when the cache is full
So far we have assumed that there is always some spare room in the cache. In practice, the cache will fill up and at some point cache entries will need to be reused. The cache replacement policy varies, but the general principle is that you want to get rid of entries that you will not need again soon. Unfortunately the "won't need soon" cache policy is hard to implement unless you have a very structured and predictable workload, or you can predict the future!

In its place you will often find caches that use "least recently used" (LRU) or "not recently used" (NRU) policies, in the hope that your accesses have good temporal locality. CPU caches tend to use "direct mapping," where there is no choice -- the address in memory fixes the address in the cache. For small on-chip caches, there may be "n-way associative mapping," where there are "n" choices of location for each cache line, usually two or four, with an LRU or random choice implemented in hardware. Random replacement policies seem to offend some of the engineering purists among us because they can never work optimally, but I like them because the converse is also true: they rarely work badly either!

Caches work well if the access pattern is somewhat random with some locality, or if it has been carefully constructed to work with a particular cache policy (an example is SunSoft WorkShop's highly tuned math library). Unfortunately many workloads have structured access patterns that work against the normal policies. Random policies can be fast, don't need any extra storage to remember the usage history, and can avoid some nasty interactions.
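To make the replacement machinery concrete, here is a toy C sketch of a two-way set-associative cache of addresses with random replacement. The geometry (256 sets of 64-byte lines) is invented for the example and does not describe any particular CPU.

#include <stdio.h>
#include <stdlib.h>

/* Toy 2-way set-associative cache with random replacement; the sizes are
   illustrative only. Each set holds two tags; on a miss we evict one of
   the two ways at random. */
#define SETS 256             /* number of sets       */
#define WAYS 2               /* associativity        */
#define LINE 64              /* bytes per cache line */

static unsigned long tag[SETS][WAYS];
static int           used[SETS][WAYS];

/* Returns 1 on a read hit, 0 on a miss (after installing the new tag). */
static int cache_access(unsigned long addr)
{
    unsigned long line = addr / LINE;
    unsigned long set  = line % SETS;
    unsigned long t    = line / SETS;

    for (int w = 0; w < WAYS; w++)
        if (used[set][w] && tag[set][w] == t)
            return 1;                        /* read hit           */

    int victim = rand() % WAYS;              /* random replacement */
    tag[set][victim] = t;
    used[set][victim] = 1;
    return 0;                                /* read miss          */
}

int main(void)
{
    printf("%d\n", cache_access(0x10000));   /* 0: first access misses */
    printf("%d\n", cache_access(0x10000));   /* 1: repeat access hits  */
    return 0;
}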

Cache writes and purges
So far we have only looked at what is involved in reading existing information through a cache. When new information is created or old information is overwritten or deleted, there are extra problems. The first issue to consider is that a memory-based cache contains a copy of some information, and the official master copy is kept somewhere else.

When we want to change its value, we have several choices. We could update the copy in the cache only, which is very fast, but it now differs from the master copy. We could update both copies immediately, which may be slow, or we could throw away the cached copy and just update the master (if the cache-write cost is high). Another issue to consider is that the cache may contain a large block of information, and we only want to update a small part of it. Do we have to copy the whole block back to update the master copy? And what if there are several caches for the same data, as in a CPU with several levels of cache and other CPUs in a multiprocessor?

There are too many possibilities, so I'll just look at the examples we have discussed already. The CPU cache is optimized for speed, using hardware to implement relatively simple functions with many of the operations overlapped so they all happen at once. On an UltraSPARC, a cache block contains 64 bytes of data, and a write changes between 1 and 8 bytes at a time. Each block contains some extra flags that indicate whether the data is shared with another CPU cache, and whether the data has been rewritten.

A write updates both the internal level 1 and the external level 2 caches. It is not written out to memory, but the first time a shared block is written to, a special signal is sent to all the other CPUs in a multiprocessor system that invalidates any copies of the data that they hold. The flags are then changed from shared/clean to private/dirty. If another CPU then tries to reread that data, it is provided directly in a cache to cache transfer and becomes shared/dirty.

The data eventually is written back to main memory only when that cache block is needed to hold different data by a read. In this case, the read operation is held in an input buffer, while the dirty cache block is copied to an output buffer. The read from memory then completes, and finally the dirty cache line write completes.
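Restating those transitions as code may help keep them straight. The C sketch below is just a schematic of the states and events described in the last few paragraphs, not a formal protocol specification, and it assumes the line is already present in this CPU's cache.

#include <stdio.h>

/* Schematic of the cache-line flags and transitions described above. */
enum line_state { SHARED_CLEAN, PRIVATE_DIRTY, SHARED_DIRTY };

/* This CPU writes to the line: on the first write to a shared block an
   invalidate signal goes to the other CPUs; the flags end up private/dirty. */
enum line_state local_write(enum line_state s)
{
    if (s == SHARED_CLEAN || s == SHARED_DIRTY) {
        /* broadcast the invalidate signal to the other CPUs' caches */
    }
    return PRIVATE_DIRTY;
}

/* Another CPU rereads data this cache holds dirty: the data is supplied
   in a cache-to-cache transfer and the line becomes shared/dirty. */
enum line_state remote_read(enum line_state s)
{
    return s == PRIVATE_DIRTY ? SHARED_DIRTY : s;
}

/* The line is reused to hold different data: if it is dirty it is copied
   to the output buffer and written back to main memory after the read. */
void evict(enum line_state s)
{
    if (s == PRIVATE_DIRTY || s == SHARED_DIRTY) {
        /* copy the block to the output buffer, then write back to memory */
    }
}

int main(void)
{
    enum line_state s = SHARED_CLEAN;
    s = local_write(s);      /* -> PRIVATE_DIRTY                        */
    s = remote_read(s);      /* -> SHARED_DIRTY (cache-to-cache supply) */
    evict(s);                /* dirty block written back to memory      */
    printf("state before eviction: %d\n", s);
    return 0;
}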

Another situation to consider is the case of a context switch or a process exiting. If an address space is no longer valid, the cache tags might also be invalid. For systems that store a virtual address in the cache tag, all the invalid addresses must be purged; for systems that only store a physical address in the cache tag, there is no need.

Older SPARC systems, the HyperSPARC range, and some other systems such as HP-PA use virtual caches. SuperSPARC and UltraSPARC use physical caches. To be very general, a virtual cache can be faster for uniprocessor systems and scientific workloads, and a physical cache scales much better on multiprocessor systems and commercial workloads that have numerous context switches. There are a lot more ways to implement caches. If you are interested I recommend you read Computer Architecture: A Quantitative Approach, by John Hennessy and David Patterson. (A link to the book at Amazon.com Books is in the Resources section below.)

Returning to the DNLC, when a file is created or deleted the new directory entry is written to disk or sent over NFS immediately, so it is a slow operation. The extra overhead of updating or purging the DNLC is very small in comparison. The name service cache does not get involved in updates. If someone changes data in the name service, the nscd relies on timeouts and watching the modification dates on files to register the change.

When does a cache work well?
A cache works well if there are a lot more reads than writes, and the reads or writes of the same or nearby data occur close together in time. An efficient cache has a low reference rate (it doesn't make unnecessary lookups), a very short cache-hit time, a high hit ratio, the minimum possible cache-miss time, and an efficient way of handling writes and purges -- some tradeoffs may be made. A small, fast cache might have a lower hit ratio but compensate with a low miss cost. A big, slower cache can compensate for the extra miss cost by missing much less often.

Beware of looking at hit ratios alone. A system might be running more efficiently by referencing the cache less often. If there are multiple levels of cache, and the first level is working well, the second level will tend to see a higher proportion of misses, but a low reference rate. It is actually the miss rate (in misses per second) multiplied by the miss cost (to give a total time delay) that matters more than the ratio of hits or misses as a percentage of the total references.
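Put as arithmetic, with invented figures: a level two cache with a 60 percent hit ratio looks poor on paper, but if it only sees 50,000 references per second and each miss costs a microsecond, the machine loses just 2 percent of its time to those misses. The C sketch below is simply that calculation.

#include <stdio.h>

/* Total stall time = misses/second * miss cost. The figures are invented
   purely to show why the hit ratio alone can be misleading. */
int main(void)
{
    double refs_per_sec = 50000.0;     /* level 2 sees few references */
    double hit_ratio    = 0.60;        /* looks poor in isolation     */
    double miss_cost_s  = 0.000001;    /* 1 microsecond per miss      */

    double misses_per_sec = refs_per_sec * (1.0 - hit_ratio);
    double stall_per_sec  = misses_per_sec * miss_cost_s;

    printf("%.0f misses/s -> %.2f%% of each second spent stalled\n",
           misses_per_sec, stall_per_sec * 100.0);
    return 0;
}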

Wrap up
I've tried to illustrate the generic behavior of caches with examples to show what is similar and what is different. Computer systems are constructed from a large number of hardware and software caches of many types, and there can be huge differences in performance between a workload that works with a cache and one that works against a cache.

Usually, all we can tune is the cache size for software caches like the DNLC. This has some effect, but it is much more effective to change your workload (Don't run the find command more often than you have to!) to work within the cache you have. Your computer is spending more time stalled, waiting for some kind of cache miss to complete, than you think. Next month I'll look at the various kinds of file system caching that occur with cachefs, NFS, ufs, and VxFS.




Resources


About the author
Adrian Cockcroft joined Sun in 1988, and currently works as a performance specialist for the Server Division of SMCC. He wrote Sun Performance and Tuning: SPARC and Solaris, published by SunSoft Press PTR Prentice Hall. Reach Adrian at adrian.cockcroft@sunworld.com.

