Tune that code

When it comes to boosting performance, tuning application code often reaps more dividends than any other investment.
Application tuning offers handsome rewards. The choices a programmer makes have an impact on performance that often outweighs all other factors. For software developers hoping to boost performance, we have the answer. In this article, adapted from a chapter from his book Sun Performance and Tuning, Sun engineer Adrian Cockcroft describes the aspects of a system that programmers specifying and writing the software can control to affect system performance, and suggests effective tuning strategies. (5,200 words)
Editor's Note: Material in this article has been excerpted from Sun Performance and Tuning: SPARC and Solaris by Adrian Cockcroft. © 1995 Sun Microsystems, Inc. A SunSoft Press/Prentice Hall PTR book. ISBN 0-13-149642-5. Available at technical bookstores worldwide.
To properly tune an application, programmers must often change source code. The challenge is to tune without introducing bugs, and minimize the often-costly process of testing changes. By providing guidance on strategies to try and things to avoid, this article will help you build new applications that start out well-tuned and reduce the amount of trial and error involved when tuning existing applications.
Algorithmic classification
Algorithms fall into some general behavioral classes, described below. Each has an initial overhead or setup time; then, as the amount of data increases, the time taken to run the algorithm grows in characteristically different ways. This is a very generalized discussion, but some examples are included to illustrate the point.
Inefficient algorithms
For example, the time taken to sort some data with a bubblesort algorithm roughly quadruples when there is twice as much data to be sorted, so it is an inefficient algorithm. Inefficient algorithms are relatively simple to implement and can have a smaller setup time than more complex algorithms. For very small data sets, the setup time can dominate, so inefficient algorithms can be useful in some restricted areas. An application using inefficient algorithms in critical areas can suffer from a brick wall effect, where it rapidly becomes unusable when given data sets that are not much larger than usual.
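To make this concrete, here is a minimal bubblesort sketch in C (an illustration added here, not taken from the book): the nested loops mean the number of comparisons grows with the square of the number of elements, so doubling the data roughly quadruples the work.

/* A minimal bubblesort sketch: trivial to write, with almost no setup
 * cost, but the nested loops do O(n*n) comparisons. */
void bubblesort(int *a, int n)
{
    int i, j, tmp;

    for (i = 0; i < n - 1; i++) {           /* one pass per element */
        for (j = 0; j < n - 1 - i; j++) {   /* bubble the largest right */
            if (a[j] > a[j + 1]) {
                tmp = a[j];                 /* swap adjacent items */
                a[j] = a[j + 1];
                a[j + 1] = tmp;
            }
        }
    }
}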
Linear algorithms
A linear algorithm takes time directly proportional to the amount of data; simple linear searches are an example. There may be a slightly larger setup overhead for a linear algorithm than for an inefficient one, but if all its algorithms are linear or better, an application will degrade gracefully when it is given an oversized data set to work on.
Efficient algorithms
The time taken to locate a database record from an indexed table is mostly independent of the actual size of the table, so it is a very efficient search algorithm. The setup time is often considerable, as an index must be built and maintained in advance, and there is often a complex hashing function to execute for each lookup.
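As a sketch of the idea (illustrative code added here, not from the book; the table size and hash function are invented), here is a small chained hash table in C. The lookup cost stays roughly constant as the number of entries grows, while the setup cost is the code that builds and maintains the buckets.

#define NBUCKETS 1024                   /* fixed table size (assumed) */

struct entry {
    struct entry *next;                 /* chain within one bucket */
    int key;
    int value;
};

static struct entry *buckets[NBUCKETS];

static unsigned hashfn(int key)
{
    return ((unsigned) key * 2654435761u) % NBUCKETS;   /* cheap hash */
}

/* Insert maintains the index; this is the setup cost paid in advance. */
void ht_insert(struct entry *e)
{
    unsigned h = hashfn(e->key);

    e->next = buckets[h];
    buckets[h] = e;
}

/* Lookup searches only one short chain, independent of total size. */
struct entry *ht_lookup(int key)
{
    struct entry *p;

    for (p = buckets[hashfn(key)]; p != 0; p = p->next)
        if (p->key == key)
            return p;
    return 0;
}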
As long as a program is being used with a small data set, the difference between each class of algorithms is minor. This is often the case during testing or in the early lifetime of a product, but problems often occur when the data-set size increases.
An example problem
One real-life example is a CAD system that kept each object in a drawing on a linked list and performed a linear search through the list whenever it needed to find an object. This worked well enough for small drawings, since the time taken to perform other operations dominated the execution time. When the CAD system was ported from a 3-MIPS Sun-3/60 to a 28-MIPS SPARCstation 2 (both are now obsolete, but this kind of situation recurs every year as performance increases), many parts of the code sped up by as much as a factor of 10. The users now expected to run drawings with many times the complexity at the same speed as they used to run them on a Sun-3/60.
Unfortunately for some operations, the time taken to search through the linked list dominated due to the larger data-set size, and the linked list code didn't see a 10-times speedup due to caching effects. (The Sun-3/60 has no data cache. See the sidebar "A problem with linked lists.") This was identified as the performance problem, and the algorithm was improved. The solution in this case was to move from a linear search to a more complex search based on hash tables or tree structures.
Another approach that could be used in similar situations is to incrementally sort the linked list. You can do this by taking the entry that you have just located on the list and moving it to the front of the list. Eventually, all the uninteresting entries end up at the end of the list, the most popular entries congregate at the front of the list where they are searched first, and the first few entries of the list may stay in the cache.
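Here is a sketch of that move-to-front heuristic in C (illustrative code added here, not from the book):

/* Search a singly linked list; on a hit, move the matching node to the
 * head so popular entries congregate at the front and stay cached. */
struct node {
    struct node *next;
    int data;
};

struct node *find_mtf(struct node **headp, int value)
{
    struct node *prev = 0;
    struct node *p;

    for (p = *headp; p != 0; prev = p, p = p->next) {
        if (p->data == value) {
            if (prev != 0) {            /* not already at the front */
                prev->next = p->next;   /* unlink the node */
                p->next = *headp;       /* relink it at the head */
                *headp = p;
            }
            return p;
        }
    }
    return 0;                           /* no match found */
}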
Space versus time
Another issue to consider is space versus time optimization. Which is more important: small algorithms or fast algorithms? There's no sense in making an inefficient algorithm run more quickly if doing so requires a large increase in the memory space required. The increase in storage may make the application miss the cache more often, or page, and the increase in cache or disk accesses can offset any improvement in CPU runtime. Even an efficient algorithm becomes inefficient if it uses so much space that it doesn't run at full speed.
Optimize for the common code path
Often, many places within your code will test for optional cases or error conditions. Wherever possible, try to make the common case the straight-through path, as this causes fewer branches to be taken. Straight-line code has a better cache hit rate and also helps processors prefetch and execute more useful instructions in a given number of clock cycles. The disadvantage is that the code can sometimes be a little harder to read, because the test and the action for each exception will be separated. Here's a trivial example of branching in the usual case:
Example pseudocode for branching in the usual case
do some work
if (option)
    deal with option
else
    do the usual case
return
And an example for straight-line code:
Example pseudocode for straight line code in the usual case
do some work
if (option not set)
    do the usual case
    return
else
    deal with option
    return
The most benefit comes when the code handles multiple options and error conditions. If the compiler can figure out which path is most common, then at high optimization levels the tests may be inverted automatically.
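As a concrete illustration (a sketch in C added here; the helper functions are hypothetical), the common case is tested once and handled on the fall-through path, while options and errors branch out of line:

/* Hypothetical helpers, for illustration only. */
static int do_usual_case(char *buf, int len)        { return len; }
static int do_option(char *buf, int len, int flags) { return flags; }

int process(char *buf, int len, int flags)
{
    if (flags == 0 && buf != 0 && len > 0)  /* common case straight through */
        return do_usual_case(buf, len);

    if (buf == 0 || len <= 0)               /* rare error conditions */
        return -1;
    return do_option(buf, len, flags);      /* rare optional behavior */
}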
Layering and fast paths
Good software engineering calls for a well-defined interface at each level in the system. This tends to increase the code path, as high-level operations descend through the hierarchy of layers. To counter this problem, profile the code, looking for the most common operations, then implement well-documented and commented fast paths through the code for the most common operations. The danger is that you end up with two ways of performing the same operation, the slow path and the fast path. When functional changes are made, they can cause problems with trying to keep both paths synchronized. The best approach is to make the fast path a conditional compilation option where possible and to delay implementing the fast path until the whole system is tested and functionally stable. The temptation to implement the fast path at the design stage must be resisted, as the measured hot spots on a finished application often subtly differ from those that the designer anticipated.
See the sample code "Layered code example calling sequence" and "Fast-path code example calling sequence" below for an example where output is being batched into a single system call.
There is often a conceptual framework underlying an application that can be thought of in terms of a programming model.
Programmers and system designers sometimes choose the model they prefer regardless of the type of system being written, rather than letting the system dictate the model. An inappropriate match between the programming model and the problem to be solved is a remarkably common cause of performance problems.
Choice of language to express the algorithm or model
Programmers may choose a language that is familiar to them or may have a language imposed on them. Often the language ends up dictating the programming model, irrespective of the needs of the application, but sometimes languages are forced to implement programming models that are inappropriate. Real examples that I have encountered include a database system written in APL, a real-time control system written in Prolog, and a very large message-passing, object-oriented system written in C.
If there is a good match of problem to programming model to language, then there is a good chance that the system overall will respond well to tuning. Poorly matched systems sometimes contain so much unnecessary complexity that they are impossible to profile and tune. Only brute-force increases in CPU and I/O throughput will have a significant effect on performance in poorly matched systems.
The moral is that if you come across an application like this and first attempts at tuning have little effect, you may as well put your money into buying a faster computer and your time and effort into redesigning the system from scratch.
It comes down to using the right tool for the job. Most people know what languages to use for particular jobs, but some less obvious points are detailed below.
Fortran
Fortran is usually the fastest language for anything that is numerically intensive. For an equivalent level of compiler technology, it will always be faster than C. This is because Fortran has a simple structure that compilers can handle more easily than C, and since Fortran lacks pointers there is less chance of side effects during expression evaluation. The key to optimization is code motion, and this can be performed more safely in Fortran than in C.
Fortran also outperforms C because by default it passes floating-point variables to subroutines by reference (that is, passes an address rather than the number itself). This is more efficient, especially on RISC processors such as SPARC that have separate integer and floating-point registers and where the calling convention is to pass variables in integer registers.
Assembly language
In theory, code written in assembly language squeezes the last bit of performance out of the hardware. In practice, programmers get so bogged down in the huge volume of code and the difficulty in debugging it that they tend to implement simple, inefficient algorithms in poorly structured code. It is hard to find out where the hot spots in the system are, so the wrong parts of the system get optimized. When you discover a small routine that dominates the execution time, look at the assembly code generated by the compiler, tweak the source, recompile, and have another look. As a last resort, consider rewriting it in assembly code by hand.
It is often very helpful to understand what sort of code is generated by your compiler. I have found that writing clean, simple, high-level language code can help the compiler to understand the code better and thus improve the optimization of the generated assembler. Just think of assembly as read-only code. Read it and understand it if you can, but don't try to write it.
C and C++
It seems that just about everything is written in C nowadays. Its biggest weakness is that hardly anyone seems to use lint to check code as a standard part of their compilation makefiles. Wading through the heaps of complaints that lint produces for most systems written in C gives a pretty good insight into the sloppy nature of much C coding. Many problems with optimizers breaking C code can be avoided by getting the code to lint cleanly first! ANSI C's function prototypes are an improvement but not a substitute for lint.
C++ should be used whenever an object-oriented design is called for. In most cases C could be used, but sloppy coding and programmers who take shortcuts make the resulting systems very hard to debug, and give optimizers a hard time. Writing C++-like code in C makes the code very hard to understand; it is much easier to use a C++ compiler with a debugger and performance analyzer that understand the language.
A common problem with C++ is that developers create too many trivial objects or complex hierarchies. Objects should not be used when a function call would do the job just as well. Objects work best when they are used to represent something concrete. The more abstract an object, the less useful it is.
lint has just been mentioned; build it into your default makefiles and use it regularly. A little-known utility called tcov performs test coverage analysis and produces an annotated listing showing how many times each block of code has been executed and a percentage coverage figure. This utility is bundled with SunOS 4 and with the SunSoft Developer Products compilers for Solaris 2.
The execution profile of a program can be obtained by means of the standard Unix tools prof or gprof provided with Solaris 2. One problem with these tools is that they measure the total execution from start to finish, which is less useful for window system tools that have a long startup time. The SunSoft SPARCworks collector and analyzer can be used on Solaris 2 to obtain this type of profile information for specific parts of a program.
The SunSoft SPARCworks 3.0 debugger includes a sophisticated runtime checking option. Pure Software Inc. and Sentinel Inc. offer similar products. These tools can be used to debug subtle errors in C and C++ programs, such as used-before-set data, memory allocation errors, and array bounds violations. Purify works by modifying object files to instrument memory references, and it can even find errors in library routines.
Compiler and optimizations
After choosing a language, one must choose a compiler for the language. Typically, several compiler options exist for each major language on SPARC. The SPARCompilers from SunSoft tend to have the largest user base and the best robustness on large programs. The Apogee C and Fortran compilers seem to have a performance advantage of 10 to 20 percent over the commonly used SunSoft SPARCompilers for benchmarks and programs that run reliably with maximum optimization. Results for the SPARCompilers 3.0 and Apogee 2.3 releases show substantial performance improvements over previous releases for both vendors. Competition between SunSoft, Apogee, and others will lead to better compilers. To learn more about using compilers effectively to tune code, consult the SunSoft compiler manuals; Peter van der Linden's book Expert C Programming: Deep C Secrets also contains a lot of good advice.
To summarize, follow these steps to get better performance:
1. Clean the C code with lint.
2. Turn up the optimizer most of the way (-xO3).
3. Profile to find hot spots, and turn up the optimizer to maximum for those parts only (-fast -xO4 -depend).
4. Look at the assembly language code produced for the hot spots to see if it is efficient. If the code doesn't look good, tweaking the source code slightly may help the compiler do a better job. If the code still doesn't look good, then send a small piece of source code and the generated assembly language to your compiler vendor so they know what needs to be improved.
Default optimization options
Many users keep to the default compiler option of -O. You should understand that this implies different optimizations for each language. In particular, SunSoft implements four optimization levels: -xO1, -xO2, -xO3, and -xO4. For Sun Fortran, the default -O implies -xO3, while for Sun C it implies -xO2. Unless you are compiling device driver or kernel code, you should try to use -xO3 as the default option with C. The issue with device drivers is that the compiler can generate faster code if it assumes that variables do not map onto hardware I/O registers. The -xO3 optimizations are safe for normal code, but require that you use the volatile keyword correctly in your device driver code.
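To illustrate the point (a sketch added here, with an invented register address): without volatile, an optimizer at -xO3 is free to read the status once and then spin forever on a stale copy held in a register.

/* Hypothetical memory-mapped status register, for illustration only. */
#define DEVICE_STATUS ((volatile unsigned int *) 0xFF001000)
#define DEVICE_READY  0x1

/* The volatile qualifier forces a fresh load from the device on every
 * iteration; without it, the load could be hoisted out of the loop. */
void wait_for_device(void)
{
    while ((*DEVICE_STATUS & DEVICE_READY) == 0)
        ;                       /* busy-wait for the ready bit */
}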
Automatic parallelization
With the introduction and shipment of a large number of multiprocessor machines, it becomes cost-effective to recompile applications to utilize several processors on a single Unix process. SunSoft Developer Products has an automatic parallelizing optimizer available as an option for SPARCompiler 3.0. It supports Fortran 77, Fortran 90 and C. Apogee ships the Kuck and Associates Preprocessor (KAP) as an optional part of its product. This can also be used to parallelize code, although it uses Unix processes as its unit of concurrency, whereas the SunSoft optimizer uses lightweight processes (Solaris 2 threads bound to LWPs) in a single Unix context to reduce context-switching overhead.
Applications vary, and some parallelize better than others. One source of examples is the individual SPECfp92 benchmarks and the PAR94 parallel benchmark suite.
Effective use of provided system functions
This is a general call to avoid reinventing the wheel. The SunOS libraries contain many high-level functions that have been debugged, tuned, and documented for you to use. If your system hot spot turns out to be in a library routine, then it may be worth looking at recoding it in some streamlined way, but the next release of SunOS or the next generation of hardware may render your homegrown version obsolete. As an example, the common string handling routines provided in the standard SunOS 4.x C library are simple, compiled C code. In Solaris 2, these routines are written in optimized assembler.
Some very powerful library routines seem to be underused, and I provide some pointers to my favorites below.
Mapped files
SunOS 4.x, Solaris 2.x, and all versions of Unix System V Release 4 (SVR4) include a full implementation of the mmap system call. This call allows a process to map a file directly into its address space without the overhead of copying to and from a user buffer. It also allows shared files, thus permitting more efficient use of memory and interprocess communication. The shared library system used for dynamic linking uses mmap as its basis.
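A minimal sketch of the idea (illustrative code added here; error handling is abbreviated): the file's pages are faulted in on demand, with no copy through a user buffer.

#include <sys/types.h>
#include <sys/mman.h>
#include <sys/stat.h>
#include <fcntl.h>
#include <unistd.h>

/* Checksum a file through a read-only mapping instead of read(2). */
long sum_file_bytes(const char *path)
{
    struct stat st;
    char *base;
    long i, sum = 0;
    int fd = open(path, O_RDONLY);

    if (fd < 0 || fstat(fd, &st) < 0)
        return -1;

    /* Map the whole file; no user-space buffer or copying involved. */
    base = (char *) mmap(0, st.st_size, PROT_READ, MAP_SHARED, fd, 0);
    if (base == (char *) MAP_FAILED)
        return -1;

    for (i = 0; i < st.st_size; i++)
        sum += base[i];

    munmap(base, st.st_size);
    close(fd);
    return sum;
}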
Memory advice
The madvise routine tells the system how you intend to access memory. You can say that you will or won't need everything in a certain range, so that the kernel can prefetch or free the pages; you can say whether access will be sequential or random, so that the kernel can free pages that have already been used sequentially or avoid prefetching extra pages that probably will not be needed.
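For instance (a sketch added here, assuming a region obtained from mmap as above):

#include <sys/types.h>
#include <sys/mman.h>

/* Declare our access pattern so the kernel can manage pages for us. */
void advise_mapping(char *base, size_t len)
{
    /* Sequential: prefetch ahead, free pages behind the read point. */
    madvise(base, len, MADV_SEQUENTIAL);
}

void release_mapping(char *base, size_t len)
{
    /* Won't-need: let the kernel free these pages immediately. */
    madvise(base, len, MADV_DONTNEED);
}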
Asynchronous I/O
Asynchronous I/O is an extension to the normal blocking read and write calls that allows a process to issue nonblocking aioread and aiowrite calls. A fast path for these calls is implemented as an optional patch for Solaris 2.4, as they are heavily used by database applications.
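A rough sketch of the usage pattern follows (an illustration added here, assuming the aioread and aiowait interface declared in <sys/asynch.h>; check the manual pages for the exact details on your release):

#include <sys/asynch.h>
#include <unistd.h>

extern void do_other_work(void);    /* hypothetical overlapped work */

/* Issue a read, compute while it is in flight, then collect it. */
int overlapped_read(int fd, char *buf, int len)
{
    aio_result_t result;

    if (aioread(fd, buf, len, 0L, SEEK_SET, &result) < 0)
        return -1;              /* could not queue the request */

    do_other_work();            /* runs while the read proceeds */

    aiowait(0);                 /* block until a request completes */
    return result.aio_return;   /* bytes read, or -1 on error */
}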
Memory allocation tuning
It is very rare for the CPU time used by malloc to be an issue. If you need to improve it, the best approach is probably to preallocate large blocks and implement your own application-specific allocator on top.
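A minimal sketch of that approach (illustrative code added here; it assumes fixed-size objects and a simple free list):

#include <stdlib.h>

#define CHUNK_OBJECTS 1024      /* objects grabbed per malloc call */

struct obj {
    struct obj *next;           /* free-list link */
    int data;
};

static struct obj *free_list = 0;

/* Hand out one object, refilling from a single large malloc when the
 * free list is empty; 1,024 allocations cost one malloc call. */
struct obj *obj_alloc(void)
{
    int i;
    struct obj *chunk, *p;

    if (free_list == 0) {
        chunk = (struct obj *) malloc(CHUNK_OBJECTS * sizeof(struct obj));
        if (chunk == 0)
            return 0;
        for (i = 0; i < CHUNK_OBJECTS - 1; i++)
            chunk[i].next = &chunk[i + 1];  /* thread onto free list */
        chunk[CHUNK_OBJECTS - 1].next = 0;
        free_list = chunk;
    }
    p = free_list;
    free_list = p->next;
    return p;
}

/* Freeing just pushes the object back; nothing returns to the system. */
void obj_free(struct obj *p)
{
    p->next = free_list;
    free_list = p;
}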
There are some useful options in the standard version of malloc. The mallocmap routine prints a map of the heap to the standard output. The mallinfo routine provides statistics on malloc. Allocation parameters can be set via the mallopt routine; mallopt is not actually implemented in the code of the default version of malloc or the BSD malloc. The interface is part of SVID, but the implementation is vendor-dependent.
In Solaris 2 there are three versions of malloc: the System V.4 malloc, the SunOS 4 malloc for backward compatibility, and the BSD malloc. When linking with third-party libraries that use malloc, take care not to mix implementations. Each handles attempts to allocate a zero-sized section of memory differently, and some applications have been observed to attempt zero-sized allocation calls.
There are two basic ways to link to a library in SunOS and SVR4. Static linking is the traditional method used by other systems, and dynamic linking is a runtime link process. With dynamic linking, the library exists as a complete unit that is mapped into the address space of every process that uses it. This saves a lot of RAM, particularly with window system libraries at over a megabyte each. It has several implications for performance tuning, however.
Each process that uses a shared library shares the physical pages of RAM occupied by the library but uses a different virtual address mapping. This implies that the library may end up being cached differently from one run of a program to the next. On machines with physically addressed caches (e.g., SuperSPARC), this difference can cause interactions that increase the variance of benchmark results. The library must also be compiled with position-independent code. Such code is a little less efficient than normal code because every call requires an indirect table jump, which is less efficient than a direct call. Companies typically use static linking when benchmarking systems, since the performance may be better and the results will have less variance. Production code should normally be dynamically linked, particularly to the system interface library libc, to save memory.
A mixture can be used. For example, in the following compilation the Fortran library is statically linked, but libc is dynamically linked:

% f77 -fast -o fred fred.f -Bstatic -lF77 -lV77 -lm -Bdynamic -lc

This compilation dynamically links in the libc library and makes everything else static. The order of the arguments is important. If you are shipping products written in Fortran to customers who do not have Fortran installed on their systems, you will need to use this trick or provide and install the Fortran library with your product.
Tuning shared libraries
When static linking is used, the library is accessed as an archive of separate object files, and only the files needed to resolve references in the code are linked in. This means that the position of each object module in memory is hard to control or predict. For dynamic linking, the entire library is available at runtime, regardless of which routines are used; in fact, the library is demand-paged into memory as it is needed. Since the object modules making up the library are always laid out in memory the same way, a library can be tuned when it is built by reordering it so that modules that call each other often are in the same memory page. In this way, the working set of the library can be dramatically reduced.
The window system libraries for SunView and OpenWindows are tuned in this way, since there is a lot of intercalling between routines in the library. Tools to do this automatically on entire programs or libraries are provided as part of the SPARCworks Analyzer, using some functionality that is only provided in the Solaris 2 debug interface (proc), linker, and object file format. The main difference is that the a.out format used in BSD Unix and SunOS 4 allows only entire object modules to be reordered, whereas the Executable and Linking Format (ELF) used in Unix SVR4 and Solaris 2 allows each function and data item in a single object module to be relocated independently.
A special tool called the Shared Library Interposer can be used to measure and tune the performance of shared libraries. It was developed in particular for tuning graphics and window system libraries. It is not a commercial product, but you may be able to obtain a copy via anonymous ftp.
Fork performance with shared libraries
When a process forks, it has to duplicate the entire address space. This takes longer if the address space is made up of many segments and shared libraries. If fork performance is important, then it is best to combine user-written libraries into fewer, larger entities. In the extreme case, static linking will help, but it is not advised for portability reasons. When executing a fork-exec pair, it is far more efficient to use the vfork variant whenever possible; vfork avoids duplicating and then immediately destroying a copy of the original address space.
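A minimal vfork-exec sketch (illustrative code added here; note that the child must call only exec or _exit, since it borrows the parent's address space until the exec):

#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

/* Run a command without copying the parent's address space. */
int run_command(char *path, char *argv[])
{
    int status;
    pid_t pid = vfork();

    if (pid == 0) {             /* child: borrows parent's memory */
        execv(path, argv);      /* replaces the address space */
        _exit(127);             /* exec failed; must not call exit() */
    }
    if (pid < 0)
        return -1;              /* vfork failed */
    waitpid(pid, &status, 0);   /* parent continues after the exec */
    return status;
}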
Solaris 2 supports multithreaded application programs. Each release of Solaris 2 has extended the limits of this support. Multithreaded versions of some major application software packages are now available.
MP programming with SunOS 4.1.2 and 4.1.3
There is no supported programmer interface to the multiple processors. Multiple Unix processes must be used.
MP programming with Solaris 2.3
Solaris 2.3 includes an implementation of the POSIX threads standard that is based on a draft of the standard and has many MT-safe system libraries.
The initial SPARCworks 3.0 multithreaded-application-development tools became available soon after Solaris 2.3 shipped.
MP programming with Solaris 2.4
In Solaris 2.4 almost all libraries are thread-safe, and the rpcgen program can automatically generate the code for a multithreaded RPC server. An updated set of SPARCworks tools, including a lock usage analysis tool, is now available.
It is worth adding code to an application that can optionally write time-stamped data to a trace file. It lets you break down the components of user response time into time spent waiting for CPU processing, disk accesses, and network accesses. You can then decide how to improve user response time without having to guess. A new facility coming in Solaris 2.5 provides a standard way to create and control probe points in applications and the kernel. This facility uses a command called prex, so read its manual page when you get a copy of Solaris 2.5. Probe points have been added to the kernel in Solaris 2.5, and a standardized trace file format has been defined. Tools that help visualize the trace data are not included in Solaris 2.5 but are being developed.
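A minimal sketch of such application-level trace points (illustrative code added here; the trace file name and event tags are invented):

#include <stdio.h>
#include <sys/time.h>

static FILE *tracefp;

/* Append one time-stamped event record to the trace file. */
void trace_event(char *tag)
{
    struct timeval tv;

    if (tracefp == 0)
        tracefp = fopen("/tmp/app.trace", "a");
    if (tracefp == 0)
        return;                 /* tracing is strictly best-effort */
    gettimeofday(&tv, 0);
    fprintf(tracefp, "%ld.%06ld %s\n",
            (long) tv.tv_sec, (long) tv.tv_usec, tag);
    fflush(tracefp);            /* keep the file usable after a crash */
}

/* Usage: bracket each interesting operation with trace points, e.g.
 *     trace_event("disk read start");
 *     ... perform the read ...
 *     trace_event("disk read end");
 */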
As an application developer, you can make the biggest difference to the performance of an application. The quickest way to produce well-tuned applications is to design them with performance in mind from the start, and to measure the performance of each component as you go. Simply making measurement a standard part of the development process can have a dramatic effect on a team of software developers. As we all know, it is much quicker and cheaper to fix a problem during development than when the application is shipping.
In the "example problem" section above, I mentioned a CAD application that traversed a large linked list. Let's look at this in more detail and assume that the list has 5,000 entries. Each block on the list contains some data and a link to the next block. If we assume that the link is located at the start of the block and that the data is in the middle of a 100-byte block as shown in the linked list example below, then the effect on the memory system of chaining down the list can be deduced.
Linked list example
The code to perform the search is a tight loop, shown in the code sample Linked List Search Code in C below. This code fits in seven words, one or two cache lines at worst, so the cache is working well for code accesses. Data accesses occur when the link and data locations are read. If the code is simply looking for a particular data value, then these data accesses will occur every few cycles. They will never be in the same cache line, so every data access will cause a 25-cycle miss that reads in 32 bytes of data, when only 4 bytes were wanted. Also, only 2,048 cache lines are available, so after 1,024 blocks have been read in, the cache lines must be reused. This means that a second attempt to search the list will find that the start of the list is no longer in the cache.
The only solution to this problem is an algorithmic change. The problem will occur on any of the current generation of high-performance computer systems. In fact, the problem gets worse as the processor gets faster, since the miss cost tends to increase due to the difference in speed between the CPU and cache clock rate and the main memory speed.
Linked list search code in C
/* C code to search a linked list for a value and miss the cache a lot */

struct block {
    struct block *link;         /* link to next block */
    int pad1[11];
    int data;                   /* data item to check for */
    int pad2[12];
} blocks[5000];

struct block *find(pb, value)
    struct block *pb;
    int value;
{
    while (pb) {                    /* check for end of linked list */
        if (pb->data == value)      /* check for value match */
            return pb;              /* return matching block */
        pb = pb->link;              /* follow link to next block */
    }
    return (struct block *) 0;      /* return null if no match */
}
The while loop compiles with optimization to just seven instructions, including two loads, two tests, two branches, and a no-op as shown below. Note that the instruction after a branch is normally executed on a SPARC processor. This executes in nine cycles on a SPARCstation 2 if it hits the cache, and in 59 cycles if both loads miss.
Linked list search loop in assembler
LY1:                            /* loop setup code omitted */
    cmp     %o2,%o1             /* see if data == value */
    be      L77016              /* and exit loop if matched */
    nop                         /* pad branch delay slot */
    ld      [%o0],%o0           /* follow link to next block */
    tst     %o0                 /* check for end of linked list */
    bne,a   LY1                 /* branch back to start of loop */
    ld      [%o0+48],%o2        /* load data in branch delay slot */
Layered code example calling sequence
save_everything()
    write_special_header()
        fwrite()
            write()
    get_thinglist()
    for (everything there is)
        get_next()
        save_thing()
            my_save()
                save_thing_header()
                    fwrite()
                        write()
                save_thing_body()
                    fwrite()
                        write()
Fast-path code example calling sequence
save_everything()
    /* patch up a pre-built header */
    /* build a linked list of all the stuff to be written */
    /* by chasing pointers through data structures directly */
    writev()    /* let the kernel handle the whole list in one go */
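A sketch of such a fast path in C (illustrative code added here, with invented buffer arguments): several buffers are gathered into one writev system call instead of a stream of separate writes.

#include <sys/types.h>
#include <sys/uio.h>
#include <string.h>

/* Write a header and two body buffers with a single system call. */
int save_batched(int fd, char *header, char *body1, char *body2)
{
    struct iovec iov[3];

    iov[0].iov_base = header;   iov[0].iov_len = strlen(header);
    iov[1].iov_base = body1;    iov[1].iov_len = strlen(body1);
    iov[2].iov_base = body2;    iov[2].iov_len = strlen(body2);

    return writev(fd, iov, 3);  /* the kernel handles the whole list */
}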
About the author
Adrian Cockcroft joined Sun in 1988, and currently works as a performance specialist for the newly formed Server Division of SMCC. Send your performance-related questions and comments to him at adrian.cockcroft@sunworld.com.