The Standard Performance Evaluation Corp. has released
A lot can happen in 3 years, especially in the computer industry. SPEC's new suite of benchmarks has been adapted to measure more accurately the CPU performance of today's machines, be they uniprocessors, symmetric multiprocessors, or clusters. (2,200 words)
SPEC95 is provided in high-level language source-code form, which is compiled and executed on the target system. Source code must be carefully crafted to ensure portability, and to ensure that versions compiled for different systems perform a comparable amount of high-level work (at the source level). Different versions may perform different amounts of work at the machine-instruction level. That is the essence of compiler optimization: to figure out how to do the same job with less work. SPEC allows no modifications to the source code other than for portability reasons, and then only by explicit approval of the technical steering committee after the performance neutrality of the change has been demonstrated. Other examples of source-code benchmarks are GPC (Graphics Performance Characterization) and the commercial AIM benchmarks.
A different approach to benchmarks is the binary code benchmark. Examples include BAPCO, Norton SI, and various commercial application benchmarks. Benchmark binaries are provided for a single instruction set architecture (ISA) and operating system (OS), or multiple binaries are provided for a few specific ISA/OSes. Portability is strictly defined by this set of ISA/OSes: the benchmark can be run on these and no others. On a single ISA/OS, results are directly comparable across all platforms. Comparing across different ISA/OSes can be tricky. The same amount of high-level work is being done, but are the amounts of low-level work comparable? And do you care? An application vendor may have tuned one version of its code more highly than another. The MS-Windows application may be 386, 486, or Pentium code, and the Macintosh application may be 68020, PowerPC 601, or PowerPC 604 code -- with resulting speed differences. For application benchmarks, you probably don't care, so long as the applications benchmarked are the ones you will be running.
A third approach is the specification benchmark. An example is the TPC (Transaction Processing Performance Council) database benchmark. The benchmark is provided in the form of written specifications of which high-level operations are to be performed. The tester programs the benchmark in high-level languages, subject to independent audit and review of the implementation. This code is then compiled (and/or interpreted) and run on the target system. Portability is guaranteed because each version is specifically written for the target system, while comparability of results is ensured by the audit process.
Each of these approaches has strengths and weaknesses. One of the strengths of SPEC's source-code format is that it's relatively inexpensive to measure systems. This allows more intensive measurement during the computer design process, promoting more rapid delivery of performance improvements to users, and providing a large base of comparative results. SPEC measurements also drive advances in compiler optimization technology, benefiting commercial application vendors and users. As noted below, however, this strength carries a corresponding weakness, which is addressed in the design of SPEC95.
The SPEC95 yardstick
SPEC95 produces the familiar speed metrics for integer and floating-point calculations, for both peak (aggressively tuned) and baseline (more conservative) cases.
As in SPEC92, there are also throughput metrics for single- or multiple-CPU performance. SPEC95 is a component-level benchmark, as opposed to a system-level one, and does not require a significant amount of I/O or contention between jobs for system resources. In multi-processor systems the aggregate capacity of the CPUs is measured, not how effectively that capacity is delivered to a particular user's workload. Unlike those of SPEC92, the SPEC95 throughput measures were designed to accommodate clusters as well as uniprocessors and symmetric multiprocessors, and to provide a meaningful comparison of aggregate capacity across these very different architectures.
The SPEC95 metrics are listed below. Users can select the metric that is most relevant to their environment depending on what factors are most important: speed versus throughput, integer versus floating-point, and conservative versus aggressive tuning.
SPEC95 Metrics

              Speed            Throughput
Aggressive    SPECint95        SPECint_rate95
              SPECfp95         SPECfp_rate95
Conservative  SPECint_base95   SPECint_rate_base95
              SPECfp_base95    SPECfp_rate_base95
Better still, follow the advice SPEC has always given: "read all the numbers," and pay particular attention to the ratios of the benchmarks that are most similar to your applications.
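To make the relationship between individual ratios and a composite speed metric concrete, here is a minimal sketch. It assumes each ratio is the reference machine's run time divided by the measured run time, and that the composite is the geometric mean of those ratios. The benchmark names and timings are invented for illustration and are not SPEC95 data.

```python
import math

def spec_ratio(reference_seconds, measured_seconds):
    """Ratio > 1 means the system under test is faster than the reference."""
    return reference_seconds / measured_seconds

def composite(ratios):
    """Geometric mean of the per-benchmark ratios."""
    ratios = list(ratios)
    return math.exp(sum(math.log(r) for r in ratios) / len(ratios))

# Hypothetical benchmark names and run times (seconds), for illustration only.
reference = {"bench_a": 1000.0, "bench_b": 2000.0, "bench_c": 4000.0}
measured = {"bench_a": 250.0, "bench_b": 1000.0, "bench_c": 4000.0}

ratios = {name: spec_ratio(reference[name], measured[name]) for name in reference}
score = composite(ratios.values())

for name, ratio in ratios.items():
    print(f"{name}: ratio {ratio:.2f}")
print(f"composite: {score:.2f}")
```

Note how the composite hides the spread: one benchmark runs four times faster than the reference and another no faster at all, which is why reading the individual ratios matters.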
Hardware and compilers are evolving rapidly and benchmarks must stay current to accurately measure them. For example, SPEC89 put more stress on cache and memory systems than other common benchmarks of its day, back when cache sizes of 4-to-8 kilobytes and main memory sizes of 8-to-16 megabytes were typical. But today SPEC92 puts little stress on a typical 512 kilobyte cache, and many servers today have more cache memory than the old systems had main memory.
CPU speeds have increased to the point that on a number of modern systems some of the SPEC92 benchmarks run in just a few seconds, making timing less accurate and exaggerating cache and I/O start-up effects relative to the main application code. The run times of the SPEC95 benchmarks were chosen to take into account expected performance increases over the lifetime of the suite. As a consequence, running SPEC95 on older systems, as would be desirable to increase the base of comparable historical results, can be time-consuming. A single conforming execution of SPEC95 on the reference machine, a SPARCstation 10 Model 40, requires approximately 48 hours (!) of run time.
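The effect of short run times on measurement can be seen with simple arithmetic. The sketch below assumes a fixed start-up cost (loading the program, warming caches, opening files) that does not shrink as the CPU gets faster. The 0.5-second figure and the run times are hypothetical, chosen only to show the trend.

```python
def startup_share(compute_seconds, startup_seconds):
    """Fraction of the measured time that is start-up rather than real work."""
    return startup_seconds / (compute_seconds + startup_seconds)

STARTUP_SECONDS = 0.5  # hypothetical fixed start-up cost

# Hypothetical compute times: a long run on an older machine, a short run on a newer one.
for compute_seconds in (300.0, 30.0, 3.0):
    share = startup_share(compute_seconds, STARTUP_SECONDS)
    print(f"{compute_seconds:6.1f} s of work: start-up is {share:.1%} of the measurement")
```

The same fixed cost that is negligible in a five-minute run becomes a sizable part of a three-second one, so longer benchmarks keep the measurement focused on the application code.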
There are some fundamental differences between SPEC92 and SPEC95:
SPEC spent considerable effort choosing applications for its SPEC95 benchmarks that complied with ANSI and POSIX standards, and reworking applications to bring them into compliance. Although the first release of SPEC95 is for Unix systems, as a result of the standardization process, the benchmarks will also run on non-Unix systems, such as Windows NT (running the full suite on these systems will be delayed until the forthcoming port of perl5 to NT is complete).
The SPEC95 benchmark tools will be the most visible improvement to anyone who, like Advanced Systems Magazine (predecessor to SunWorld Online), has struggled with the SPEC92 tools and makefile wrappers to verify a result published by a vendor, or to test different hardware/software configurations. Often in SPEC92, manual compilation using different makefiles for each benchmark was too difficult to document and reproduce. After the benchmarks were run, generation of the reporting pages was a time-consuming manual effort. In SPEC95, compiling and running benchmarks by hand is no longer allowed. You put all your portability and optimization flags in a single file, along with a complete description of the system under test. The SPEC tools compile and run the benchmarks, and generate reporting pages in ASCII, PostScript, and HTML formats. Because the tools come already built for all major systems, installation is a snap too. When you run the setup program, it determines your processor architecture and operating system, and gives you a choice of available tool sets. Your selected tool set is then installed and you're ready to run.
Flag me, baby
There is a risk, with all source-code benchmarks, of optimization far in excess of what ordinary users will see. That risk is inherent in their use as a primary metric by which to judge compilers in a highly competitive business environment. Advances in compiler technology over the past several years, stimulated in part by SPEC benchmarks, have greatly increased the delivery of CPU power to applications. So to the extent that these advances are generally applicable to a broad class of code, users benefit.
However, optimizations and tuning can also speed up particular benchmarks while having limited relevance to more general code. In that case the benchmark metric may inflate performance beyond what you may see in your applications. What do we do about this in SPEC95?
The baseline metrics of conservative optimization are carried forward from SPEC92 and tightened. No more than four compiler flags may be used across all benchmarks (e.g., -optimization -processor -linking -fast_library). The run rules are greatly expanded in the definition of "benchmark-specific optimizations" that are expressly forbidden. The benchmarks were selected to avoid concentration of execution time in small code segments, and in library routines, so performance improvements in the benchmarks would require broader performance improvements.
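The four-flag baseline limit lends itself to a mechanical check. The sketch below is one reading of the rule as stated here (count the distinct flags used across all benchmarks) and is not how the SPEC tools enforce it. The benchmark names are hypothetical, the flag names are the placeholder examples from the text plus one invented `-special_case` flag, and the real run rules contain further conditions not modeled here.

```python
MAX_BASELINE_FLAGS = 4  # limit quoted in the text

def distinct_flags(flags_by_benchmark):
    """Collect every flag used by any benchmark in the configuration."""
    used = set()
    for flags in flags_by_benchmark.values():
        used.update(flags)
    return used

def baseline_flags_ok(flags_by_benchmark):
    """True if no more than four distinct flags are used across all benchmarks."""
    return len(distinct_flags(flags_by_benchmark)) <= MAX_BASELINE_FLAGS

# Hypothetical configurations, for illustration only.
conforming = {
    "bench_a": ["-optimization", "-processor", "-linking", "-fast_library"],
    "bench_b": ["-optimization", "-processor", "-linking", "-fast_library"],
}
per_benchmark_tuning = {
    "bench_a": ["-optimization", "-processor", "-linking", "-fast_library"],
    "bench_b": ["-optimization", "-processor", "-linking", "-special_case"],
}

print("conforming:", baseline_flags_ok(conforming))
print("per-benchmark tuning:", baseline_flags_ok(per_benchmark_tuning))
```

Counting flags across the whole suite, rather than per benchmark, is what blocks the second configuration: each benchmark uses only four flags, but tuning one benchmark differently pushes the total to five.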
Machines with large caches will look relatively better on SPEC95 than on SPEC92, as will machines with efficient processor-memory interconnects. The tables below list some representative SPEC95 results along with the SPEC92 results. "Cache" is the size in kilobytes of the largest level of cache memory.
System                  CPU                 Cache  SPECint95  SPECint_base95  SPECint92
AlphaStation 600 5/300  300MHz 21164        4096   7.33       7.33            338
IBM 43P                 133MHz 604          512    4.55       4.45            177
HP J210                 120MHz 7200         512    4.37       4.37            169
AlphaStation 250 4/266  266MHz 21064A       512    4.18       4.18            199
HP 735/125              125MHz 7150         512    4.04       4.04            136
IBM C20                 120MHz 604          1024   -          3.85            155
SNI RM400/630           200MHz R4400        4096   3.95       3.79            140.7
IBM 591                 77MHz Power2        288    -          3.67            144
Intel Xpress            133MHz Pentium      1024   3.68       3.64            191
IBM 3CT/39H             67MHz Power2        2048   -          3.28            130
HP 735/99               99MHz 7100          512    3.27       3.27            109
IBM C10                 80MHz 601           1024   -          2.37            91
(Ref) SS10/40           40MHz SuperSPARC-I  36     -          1               50

System                  CPU                 Cache  SPECfp95   SPECfp_base95   SPECfp92
AlphaStation 600 5/300  300MHz 21164        4096   12.2       11.6            502
IBM 591                 77MHz Power2        288    -          11.2            308
IBM 3CT/39H             67MHz Power2        2048   -          9.44            267
HP J210                 120MHz 7200         512    7.54       7.54            269
AlphaStation 250 4/266  266MHz 21064A       512    5.78       5.78            263
HP 735/125              125MHz 7150         512    4.55       4.55            201
HP 735/99               99MHz 7100          512    3.98       3.98            168
IBM C20                 120MHz 604          1024   -          3.5             150
IBM 43P                 133MHz 604          512    -          3.31            157
IBM C10                 80MHz 601           1024   -          2.97            101
Intel Xpress            133MHz Pentium      1024   3.04       2.37            121
(Ref) SS10/40           40MHz SuperSPARC-I  36     -          1               60
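Because SPEC92 and SPEC95 use different scales, the two columns cannot be compared directly. One way to read the integer table, sketched below, is to express each system relative to the SS10/40 reference row on both suites and divide the two. A value above 1 means the system looks better against the reference machine on SPEC95 than it did on SPEC92. This normalization is only an illustrative reading aid, not a SPEC metric. The rows are a subset copied from the integer table above.

```python
# (system, cache in KB, SPECint_base95, SPECint92), copied from the integer table.
ROWS = [
    ("AlphaStation 600 5/300", 4096, 7.33, 338),
    ("SNI RM400/630", 4096, 3.79, 140.7),
    ("Intel Xpress", 1024, 3.64, 191),
    ("HP 735/125", 512, 4.04, 136),
    ("AlphaStation 250 4/266", 512, 4.18, 199),
    ("(Ref) SS10/40", 36, 1, 50),
]

# Scores of the reference machine (last row of the table) on each suite.
REF_BASE95 = 1
REF_INT92 = 50

def relative_shift(base95, int92):
    """Standing versus the reference on SPEC95 divided by standing on SPEC92."""
    return (base95 / REF_BASE95) / (int92 / REF_INT92)

shifts = {name: relative_shift(base95, int92) for name, _cache, base95, int92 in ROWS}

for name, cache, _base95, _int92 in ROWS:
    print(f"{name:24s} cache {cache:5d} KB  shift {shifts[name]:.2f}")
```

Cache size is only one of the factors at work, so the shifts should be read alongside the CPU and memory-system details of each machine rather than as a ranking by cache alone.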
Other SPEC benchmarks
The system-level file server (SFS) committee is continuing work to update its NFS server benchmark, also known as LADDIS (SPECnfs_A93 operations per second at x ms response time). The new version will include additional workloads and support for the NFS version 3 protocol with its client-caching features. Release is targeted for spring of 1996.
The high-performance group (HPG) will be announcing its first benchmark suite, SPEChpc96, at the Supercomputing '95 conference. It will include two high-end benchmarks: one models seismic data processing used in oil exploration, and the other performs computational chemistry used to design pharmaceuticals and chemicals. The benchmarks will include versions for uniprocessors, shared-memory multiprocessors, and message-passing parallel computers. Future HPG benchmarks will also focus on specific industries that use high-performance computer systems, such as automotive design.
The SFS group is also working on a benchmark of WWW servers, and is in contact with leading Web organizations, such as CommerceNet and NCSA, to discuss the workload and benchmark design. The basic concept is much like the NFS benchmark of SFS, in that client computers drive the server under test by issuing requests and measuring response time. One challenge for this benchmark is that the rapid growth and change of the Internet makes the benchmark's workload definition a moving target. SPEC is actively soliciting organizations with expertise in the Web and an interest in its future to join SPEC to help define, develop, and deliver this important new benchmark.
About the author
Walter Bays is a Senior Staff Engineer at Sun Microsystems. He has been involved in SPEC since 1989 in the capacities of Chairman of the Open Systems Steering Committee, Vice-Chair, Secretary/Recorder, member of the Board of Directors, and most importantly, benchmark hacker.