How can I optimize my programs for UltraSPARC?
Optimizing C and FORTRAN programs for
Compilers have a huge variety of options: some should be used as a matter of course, some give a big speedup if used correctly, and others can get you into trouble. This column outlines which compiler options Sun recommends for either general distribution or the highest performance with UltraSPARC hardware. (3,200 words)
I've got an Ultra 1 and the latest Sun compilers, but I'm confused by
the number of compiler options. What are the implications of using the
Ultra-specific options, and which options make the most difference to
performance? I don't have time to try every possible combination of options.
--Optimizing in Oconomowoc
A: The answer depends upon your situation. If you are a software vendor, your main concern is portability and testing costs. With a careful choice of options you can support most SPARC users with very good performance. If you are an end user with source code and some CPU-intensive applications that take a long time to run, you may be more interested in getting the very best possible performance from your particular computer.
Applications sell computers. Sun designs its systems to be compatible with pre-existing applications. Sun also worries about the costs a software vendor incurs to support Solaris applications. The key is to provide the largest possible volume sales opportunity for a single version of an application.
The Solaris OS is now installed on between one and two million SPARC-based computers. This installed base is not uniform, however: it consists of many different SPARC implementations and operating system releases. It is easy to build applications that work on all these systems, but it is also possible to inadvertently build in a dependency on a particular implementation or release.
This article tells you what you can depend on for maximum coverage of the installed base. It indicates several ways that you can optimize for a particular implementation, without becoming incompatible with all the others. It also describes opportunities for further optimization where performance or functionality may justify production of an implementation-specific version of the product.
Solaris 2.5 supports UltraSPARC computers. UltraSPARC is based on an extended SPARC Version 9 instruction set. This is completely upwards compatible with the installed base of SPARC Version 8 applications. The new Ultra Systems Architecture requires its own specific "sun4u" kernel, as do the previous "sun4m" MBus-based desktop systems and "sun4d" XDBus-based server systems. Correctly written device drivers will work with all kernels.
Although users may be concerned about the implications of the new features provided by UltraSPARC, they will find that it operates as just another SPARC chip. If their applications work on MicroSPARC, SuperSPARC and HyperSPARC based systems, then they will also work on UltraSPARC. Users may find that, over time, some application developers produce versions specifically optimized for UltraSPARC. This article should help clarify which optimizations are available and likely to be performed.
Developers should find that their applications can be supported on Solaris 2.5 and UltraSPARC with minimal effort. Some vendors may wish to balance compatibility, performance, and functionality issues. This article sets out the issues clearly, and recommends a course of action to follow that allows incremental benefits to be investigated over time.
An incremental plan for UltraSPARC optimization
There are several steps to take; each step should usually be completed before the next step is attempted. After each step is complete you have the option to productize the application, taking advantage of whatever benefits have been obtained so far. I'll briefly describe the steps, then cover each in detail.
All user mode applications will work. Correctly written device drivers will also work. You should see a speedup of two or more times the performance of a SPARCstation 20 model 71 for CPU-intensive programs.
Review the interfaces that your application depends on, and the assumptions you made about the implementation. Move to portable standard interfaces where possible, and isolate implementation-specific interfaces and code into a replaceable module if possible.
Solaris 2.5 contains platform-specific tuned versions of some shared library routines. They are used automatically by dynamically linked programs to transparently optimize for an implementation, but will not be used by statically linked programs.
This can be done independently of the OS and hardware testing, but it is a necessary precursor to optimization for UltraSPARC. With no special platform-specific options, a "generic" binary will run reasonably well on any SPARC system.
The optimal sequences of instructions for UltraSPARC can be generated using the same SPARC instructions that current systems use. This is the option that Sun recommends for application developers. Compare the performance of this binary with the one produced at the end of Step 4, both on UltraSPARC machines and older machines that comprise a significant segment of your customers.
Caution: The following optimizations are not backwards compatible and software vendors are strongly discouraged from using them. They are intended primarily for users in the imaging and high performance computing markets.
With the all-out UltraSPARC compile options, you get access to the 32-bit subset of the SPARC V9 instruction set, and you double the available number of double-precision floating point registers. Many programs will see no improvement. A few FORTRAN programs speed up a great deal. Determine if the benefit of any extra performance outweighs the cost of maintaining two binaries (one for UltraSPARC, and one for older machines).
Applications that already implement a device-specific driver module mechanism for access to graphics and imaging accelerators can build a module using the VIS instruction set extensions. Determine if the benefit of any extra performance outweighs the cost of maintaining an UltraSPARC specific module or using the standard VIS-optimized XGL and XIL libraries.
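The advice above about isolating implementation-specific code into a replaceable module can be sketched in C as a small table of function pointers. This is only an illustration: the names image_ops, brighten_generic, and select_image_ops are hypothetical and do not come from any Sun library. The portable routine is always available, and a platform-specific module (for example one built with VIS) could be supplied in its place without touching the callers.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical module interface; every name here is illustrative. */
typedef struct image_ops {
    const char *name;
    /* Add delta to each of n 8-bit pixels, saturating at 255. */
    void (*brighten)(unsigned char *pixels, size_t n, unsigned char delta);
} image_ops;

/* Portable implementation that runs on any SPARC (or any C platform). */
static void brighten_generic(unsigned char *pixels, size_t n,
                             unsigned char delta)
{
    size_t i;

    assert(pixels != NULL || n == 0);
    for (i = 0; i < n; i++) {
        unsigned int v = (unsigned int)pixels[i] + delta;
        pixels[i] = (unsigned char)(v > 255u ? 255u : v);
    }
}

static const image_ops generic_ops = { "generic", brighten_generic };

/* Return the accelerated module if one was supplied, else the generic one. */
const image_ops *select_image_ops(const image_ops *accelerated)
{
    return accelerated != NULL ? accelerated : &generic_ops;
}
```

Callers only ever go through the pointer returned by select_image_ops, so shipping or omitting an UltraSPARC-specific module becomes a packaging decision rather than a source change.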
Run the existing application on Solaris 2.5 on UltraSPARC hardware
You should collect performance data to compare against older hardware.
There is a wide range of speedups for existing, unmodified code running
on UltraSPARC. A rough guideline for integer applications is that an
average speedup is the ratio of the clock rates of the SuperSPARC and
UltraSPARC CPUs tested (use /usr/sbin/psrinfo -v to check the clock
rates). For floating point applications the speedup is
a bit larger. This ratio does not apply for MicroSPARC and HyperSPARC.
If you get less than you would expect, you may be disk, RAM, or
network-limited. There are also a few programs that fit entirely in
the 1 megabyte SuperSPARC cache, and don't fit in the 512K Ultra 1
cache. If you get more speedup than you would expect, then you may have
been memory bandwidth limited on the older MBus-based machines.
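As a worked version of this rule of thumb, the sketch below computes the expected integer speedup from two clock rates and classifies a measured speedup against it. The function names and the tolerance parameter are illustrative assumptions, not part of any Sun tool. The example inputs (75 MHz for a SuperSPARC system and 167 MHz for an UltraSPARC system) are also assumed values; substitute the clock rates that psrinfo reports on your own machines.

```c
#include <assert.h>

/* Rough integer speedup estimate: the ratio of the two clock rates in MHz. */
double expected_integer_speedup(double old_mhz, double new_mhz)
{
    assert(old_mhz > 0.0 && new_mhz > 0.0);
    return new_mhz / old_mhz;
}

/*
 * Compare a measured speedup with the expected one.
 * Returns -1 if it is lower than expected (look for disk, RAM, network,
 *            or cache-size limits),
 *          0 if it is within the given fractional tolerance,
 *         +1 if it is higher than expected (the older machine may have
 *            been memory bandwidth limited).
 */
int compare_speedup(double measured, double expected, double tolerance)
{
    assert(expected > 0.0 && tolerance >= 0.0);
    if (measured < expected * (1.0 - tolerance))
        return -1;
    if (measured > expected * (1.0 + tolerance))
        return 1;
    return 0;
}
```

With the assumed clock rates the expected ratio is 167 / 75, or a little over 2.2, which is in line with the "two or more times" guideline for CPU-intensive programs.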
Solaris 2.5 contains platform-specific versions of some library routines. They are automatically used by dynamically linked programs.
The UltraSPARC versions take advantage of VIS instructions for high-speed block move and graphics operations. If you statically link to libc you will not take advantage of the platform-specific versions. When used with Creator graphics, the X server, XGL and XIL libraries have all been accelerated using the VIS instruction set extensions and the Creator framebuffer device driver.
Solaris 2.5 libraries automatically use the integer multiply and divide
instructions on any platform that has them. This allows generic
binaries to be built for the oldest SPARC Version 7 CPUs (e.g.
SPARCstation 2), but to take advantage of the instructions implemented
in SPARC Version 8 and subsequent CPUs (e.g., SuperSPARC and
UltraSPARC). In Solaris 2.5.1 some additional optimizations use
UltraSPARC specific instructions, for example to multiply the 64-bit
long long type.
The UltraSPARC-specific VIS block move instruction performs a 64-byte transfer that is both cache coherent and nonpolluting. It is used by the platform-specific libc bcopy, bzero, memcpy, memmove, memset, and memcmp operations. The term nonpolluting refers to the fact that data that is moved is not cached: after copying a 1 megabyte block of data, the CPU cache still holds its original contents. Large memory-to-memory data moves occur at over 170 megabytes per second, limited by the 350 megabytes per second throughput of the single Ultra 1/170 memory bank.
Memory-to-Creator framebuffer moves occur at 300 megabytes per second, limited by the processor interface throughput of 600 megabytes per second. A move involves a read and a write of the data, which is why the data is moved at half the throughput. These operations are about five times faster than on a typical SuperSPARC system. The Ultra Enterprise Server systems have more memory banks, and sustain aggregate rates of 2.5 gigabytes per second.
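The arithmetic behind these figures, and the way an application picks up the tuned routines, can be sketched as follows. The helper names copy_rate_limit and copy_block are hypothetical. The only assumption about the system is the one stated in the text: a dynamically linked program that calls memcpy gets whichever implementation the platform's libc provides at run time.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/*
 * A copy reads every byte and then writes it, so the best-case copy rate
 * is half the throughput of the path being used (both in megabytes/second).
 */
double copy_rate_limit(double throughput_mb_per_s)
{
    assert(throughput_mb_per_s >= 0.0);
    return throughput_mb_per_s / 2.0;
}

/*
 * Plain memcpy call. Nothing here is UltraSPARC specific: in a dynamically
 * linked program the call resolves to the libc routine for the machine the
 * program is running on.
 */
void copy_block(void *dst, const void *src, size_t nbytes)
{
    assert((dst != NULL && src != NULL) || nbytes == 0);
    if (nbytes > 0)
        memcpy(dst, src, nbytes);
}
```

Applying copy_rate_limit to the 600 megabytes per second processor interface gives 300, and applying it to the 350 megabytes per second memory bank gives 175, which matches the "over 170" figure quoted for memory-to-memory moves.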
This can be done independently of the OS and hardware testing, but it is a necessary precursor to optimization for UltraSPARC. SPARCompilers 4.0 improves performance on all platforms by perhaps 10 to 30 percent for CPU intensive applications. There may be issues with old code written in C++, as the language is evolving, and it changes from one release of the compiler to the next, as the compiler tracks the language standard. If you are already using SPARCompilers 3.0 you should have few if any problems.
To support the maximum proportion of the installed base, Sun recommends that applications be compiled on the oldest practical release of Solaris. SPARCompilers 4.0 is fully supported on Solaris 2.3 and 2.4, and all code generation options, including UltraSPARC-specific ones, can be used on older releases.
Application vendors who want to ship one binary product for all SPARC Solaris 2 systems, and want the best performance possible on older systems, should use the generic compiler options. The options are:
cc -xO3 -xdepend -xchip=generic -xarch=generic *.c
f77 -xO3 -xdepend -xchip=generic -xarch=generic *.f
The level of optimization set by -xO3 generates small, efficient code
for general purpose use. The -xdepend option tells the compiler to
perform full dependency analysis. It increases compile time (which is
why it is not done by default) but gives up to a 40 percent performance
boost. Try with and without -xdepend to quantify the difference on your
application.
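To make "dependency analysis" concrete, the two loops below show the distinction that loop-level dependence analysis draws. This is a generic illustration written for this purpose; it does not describe what the Sun compiler actually does with either loop, and the function names are hypothetical.

```c
#include <assert.h>
#include <stddef.h>

/*
 * Independent iterations: a[i] depends only on b[i], so (provided a and b
 * do not overlap) the iterations can be reordered, unrolled, or overlapped.
 */
void scale(double *a, const double *b, double s, size_t n)
{
    size_t i;

    assert((a != NULL && b != NULL) || n == 0);
    for (i = 0; i < n; i++)
        a[i] = s * b[i];
}

/*
 * Loop-carried dependence: a[i] needs the value of a[i - 1] computed by the
 * previous iteration, so the iterations must complete in order.
 */
void prefix_sum(double *a, const double *b, size_t n)
{
    size_t i;

    assert((a != NULL && b != NULL) || n == 0);
    if (n == 0)
        return;
    a[0] = b[0];
    for (i = 1; i < n; i++)
        a[i] = a[i - 1] + b[i];
}
```

Code dominated by loops like the first is where you would expect the with and without comparison to show a difference; code dominated by loops like the second leaves much less room for restructuring.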
The compiler defaults to -xchip=generic -xarch=generic, which are
options that tell the compiler you want the code to run reasonably well on
all SPARC processors. Adding the options to your makefile, even though
they are defaults, makes it clear to your coworkers what you are trying
to do.
For the C compiler the commonly used -O option defaults to -xO2. The
extra optimization invoked by -xO3 is only problematic in device driver
code that does not declare memory-mapped device registers as volatile.
The FORTRAN compiler already maps -O to -xO3.
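The volatile requirement mentioned above is easiest to see in a polling loop. The sketch below is a minimal, hypothetical example: wait_until_ready and its parameters are not taken from any real driver, and an ordinary variable stands in for the device register. Without the volatile qualifier, an optimizer is entitled to read the location once and reuse the value, which turns a polling loop into either no wait at all or an endless one.

```c
#include <assert.h>
#include <stddef.h>

/*
 * Poll a status register until any bit in mask is set, or until max_polls
 * reads have been made. Returns 1 if the bit was seen, 0 if we gave up.
 *
 * The volatile qualifier forces a fresh load from *status on every pass
 * through the loop, which is what a memory-mapped register needs.
 */
int wait_until_ready(volatile unsigned int *status, unsigned int mask,
                     unsigned long max_polls)
{
    unsigned long i;

    assert(status != NULL);
    for (i = 0; i < max_polls; i++) {
        if ((*status & mask) != 0)
            return 1;
    }
    return 0;
}
```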
Optimize code scheduling for UltraSPARC
The optimal sequences of instructions for UltraSPARC can be generated using the same SPARC instructions that current systems use.
To explain what I mean by this, let's take an analogy of an Englishman talking to an American. If the Englishman speaks normally, the American will be able to understand what is being said, probably with a little extra effort (and the comment "I do love your accent..."). If the Englishman tries harder and says the same words with an American accent, they may be more easily digested by his American audience. At the same time, other English people listening in will understand them as well. The equivalent of full optimization would be to talk in an American accent, with American vocabulary, phrasing, and colloquialisms ("Let's touch base before we go the whole nine yards, y'all.") The words sound familiar but only make sense to other Englishmen if they are familiar with American culture.
The sequencing level of optimization avoids using anything that cannot be understood by older SPARC chips, but instructions are put in the optimal sequence for fast execution on UltraSPARC.
Compare the performance of this binary with the one produced in the
previous stage, both on UltraSPARC machines and any older machines that
comprise a significant segment of your customer base. Performance on
UltraSPARC platforms can show a marked improvement. The
-xchip=ultra option puts instructions in the most
efficient order for execution by UltraSPARC. The
-xarch=generic option is the default, but it is good to
state your intentions explicitly. It tells the compiler to only use
instructions that are implemented in all SPARC processors.
The recommended compiler options to optimize for UltraSPARC are:
cc -xO3 -xdepend -xchip=ultra -xarch=generic *.c
f77 -xO3 -xdepend -xchip=ultra -xarch=generic *.f
These options are intended to be realistic and safe settings for use on large applications. Higher performance can be obtained from more aggressive optimizations if assumptions can be made about numerical stability and the code lints cleanly.
The implications of nonportable optimizations
Caution: Up to this point the generated code will run on any SPARC system. The subsequent optimizations are not backwards compatible and software vendors are strongly discouraged from using them. The trade-off implicit in following the recommendations so far is that a single copy of your application will be portable across all SPARC-based environments. The performance and capability of the CPU will be maximized by the run-time environment, but some performance benefits unique to specific implementations may not be available.
There may be cases where tuning an application to a specific processor or computer is worth more to you than losing portability. If you use UltraSPARC-specific compiler options you should be aware that you will either need to create a different binary or continue to support an existing one to run on older systems.
Build an UltraSPARC only application
The primary interest in UltraSPARC specific code comes from FORTRAN end users in the high performance computing (HPC) marketplace. It is common for HPC users to have access to the source code of their applications, and to be interested in reducing the very long run times associated with large simulation and modelling computations by any means available. There is also less interest in running the code on older slower SPARC systems. The commonly used SPECfp92 and SPECfp95 benchmarks contain several examples of this kind of application. There are also situations where the UltraSPARC system is embedded in a product manufactured by an OEM. Since there is complete control over the hardware and the software combination, it is possible to optimize the two very closely together without concern for backwards compatibility.
Using the all-out UltraSPARC compile options you get access to the
32-bit subset of the SPARC V9 instruction set, and you increase the
number of double-precision floating point registers from 16 to 32. This
combination of V8 and V9 is known as the V8plus specification and it is
enabled with the -xarch=v8plus compiler option. No source code changes
will be required, but the code will no longer run on older systems. The
binaries can be identified using the file command:
% f77 -o prog -fast -xO4 -xdepend -xchip=ultra -xarch=v8plus *.f
% file prog
prog: ELF 32-bit MSB executable SPARC32PLUS Version 1, V8+ Required, dynamically linked, not stripped
Compare the performance of this binary with an otherwise identical
build that omits -xarch=v8plus. Determine if the benefit of any extra
performance outweighs the cost of maintaining two binaries.
You can expect a speedup from
-xarch=v8plus if your
code is double-precision, vectorizable, and the compiler can unroll
DO loops. A large number of temporary variables need to be
stored in registers in the unrolled loop to hide load-use latencies. A
version of the Linpack DP1000 benchmark went 70 percent faster with
this option, which is the most you can expect. Single precision code
shows no speedup, as there are already 32 single-precision registers.
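The register pressure described above can be seen in a hand-unrolled loop. The sketch below is written in C purely for illustration; the observation in the text concerns FORTRAN DO loops that the compiler unrolls itself, and daxpy_unrolled is a hypothetical name. Unrolling by four already keeps eight double-precision temporaries live at once, plus the scalar, so deeper unrolling quickly exceeds 16 registers.

```c
#include <assert.h>
#include <stddef.h>

/* y[i] = y[i] + a * x[i], unrolled by four with explicit temporaries. */
void daxpy_unrolled(size_t n, double a, const double *x, double *y)
{
    size_t i = 0;

    assert((x != NULL && y != NULL) || n == 0);

    /* Main body: all eight loads are issued before any result is needed,
     * which is how unrolling hides load-use latency. */
    for (; i + 4 <= n; i += 4) {
        double x0 = x[i],     y0 = y[i];
        double x1 = x[i + 1], y1 = y[i + 1];
        double x2 = x[i + 2], y2 = y[i + 2];
        double x3 = x[i + 3], y3 = y[i + 3];

        y[i]     = y0 + a * x0;
        y[i + 1] = y1 + a * x1;
        y[i + 2] = y2 + a * x2;
        y[i + 3] = y3 + a * x3;
    }

    /* Remainder loop for the last n % 4 elements. */
    for (; i < n; i++)
        y[i] = y[i] + a * x[i];
}
```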
The performance improvement obtained using the above options with
-xarch=v8plus on each component
of SPECfp92 varied from 0 percent in several cases to a best case of 29
percent. The geometric mean increased by 11 percent. It is rare for one
loop to dominate an application, so a mixture of accelerated and
unaccelerated loops gives rise to a varying overall speedup. The
potential for speedup increases with the highest optimization levels
and should increase over time as the compiler's optimizations improve.
I have not seen a significant speedup on typical C code. In general,
don't waste time trying
-xarch=v8plus with the C
compiler. The compiler's code generator has many more options. The
ones I described usually make a significant difference. In a few cases
the profile feedback is useful as well. The highest level of
optimization is now
-xO5, and it should only ever be used
in conjunction with a collected profile, so the code generator knows
which loops to optimize aggressively. You simply compile with
-xprofile=collect, run the program, then recompile with
-xO5 -xprofile=use. This is easy to set up on small
benchmarks, but trickier with large apps.
Build a VIS instruction set enabled device specific module
Going back to my analogy, if the Englishman and the American start to talk using dense industry jargon, full of acronyms, no one else will have a clue what they are discussing, but the communication can be very efficient. The UltraSPARC processor implements an extension to the SPARC V9 instruction set that is dedicated to imaging and graphical operations. Using it is a bit like talking in jargon.
Some applications already have device-specific modules that provide access to accelerators for imaging and graphics operations. These modules can code directly to the VIS instruction set. For pixel-based operations, the VIS instructions operate on four or more pixels at a time. This translates into a four times speedup for large filtering, convolution, and table lookup operations. Several developers are reporting this kind of gain over the base-level UltraSPARC performance for applications like photographic image manipulation and medical image processing. For the first time, MPEG2 video and audio stream decode can be performed at full speed and resolution with no add-on hardware. The best way to access VIS is via the standard graphics (XGL) and imaging (XIL) libraries. These are optimized to take advantage of the available CPU and framebuffer hardware automatically.
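The idea of operating on four pixels at a time can be illustrated in portable C by packing four 8-bit pixels into one 32-bit word. To be clear, this is not VIS code and uses no VIS instructions; it is a sketch of the general packed-arithmetic technique, and the function names are hypothetical.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/*
 * Average four 8-bit pixels in a with the four in b, lane by lane,
 * rounding down. For each lane, (a + b) / 2 == (a & b) + ((a ^ b) >> 1).
 * The mask clears the low bit of every lane before the shift so that no
 * bit leaks from one pixel into its neighbor.
 */
uint32_t average_packed4(uint32_t a, uint32_t b)
{
    return (a & b) + (((a ^ b) & 0xFEFEFEFEu) >> 1);
}

/* Blend two rows of packed pixels, four pixels per loop iteration. */
void average_rows(uint32_t *dst, const uint32_t *a, const uint32_t *b,
                  size_t nwords)
{
    size_t i;

    assert((dst != NULL && a != NULL && b != NULL) || nwords == 0);
    for (i = 0; i < nwords; i++)
        dst[i] = average_packed4(a[i], b[i]);
}
```

A byte-at-a-time loop would need four iterations to do the work of each call to average_packed4, which is the same kind of ratio behind the four times speedup quoted above.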
Sun's SPARC Technology Business is creating a VIS developers kit, and is promoting the use of VIS for specialized new-media applications.
About the author
Adrian Cockcroft joined Sun in 1988, and currently works as a performance specialist for the Server Division of SMCC. He wrote Sun Performance and Tuning: SPARC and Solaris, published by SunSoft Press PTR Prentice Hall. Reach Adrian at firstname.lastname@example.org.