

The Standard Performance Evaluation Corp. has released
an updated version of its benchmark suite

By Walter Bays

November  1995

A lot can happen in 3 years, especially in the computer industry. SPEC's new suite of benchmarks is adapted to measure more accurately the CPU performance of today's machines, be they uniprocessors, symmetric multiprocessors, or clusters. (2,200 words)

SPEC95 is the new suite of source code benchmarks from the Standard Performance Evaluation Corp. (SPEC). Like the SPEC92 benchmarks, they test the CPU along with cache, memory, and compiler. The impact of I/O, network, graphics, and operating system on benchmark performance is negligible.

SPEC95 is provided in high-level language source-code form, which is compiled and executed on the target system. Source code must be carefully crafted to ensure portability, and to ensure that versions compiled for different systems perform a comparable amount of high-level work (at the source level). Different versions may perform different amounts of work at the machine-instruction level. That is the essence of compiler optimization: to figure out how to do the same job with less work. SPEC allows no modifications to the source code other than for portability reasons, and then only by explicit approval of the technical steering committee after the performance neutrality of the change has been demonstrated. Other examples of source-code benchmarks are GPC (Graphics Performance Characterization) and the commercial AIM benchmarks.

A different approach to benchmarks is the binary code benchmark. Examples include BAPCO, Norton SI, and various commercial application benchmarks. Benchmark binaries are provided for a single instruction set architecture (ISA) and operating system (OS), or multiple binaries are provided for a few specific ISA/OSes. Portability is strictly defined by this set of ISA/OSes: the benchmark can be run on these and no others. On a single ISA/OS, results are directly comparable across all platforms. Comparing across different ISA/OSes can be tricky. The same amount of high-level work is being done, but are the amounts of low-level work comparable? And do you care? An application vendor may have tuned one version of its code more highly than another. The MS-Windows application may be 386, 486, or Pentium code, and the Macintosh application may be 68020, PowerPC 601, or PowerPC 604 code -- with resulting speed differences. For application benchmarks, you probably don't care, so long as the applications benchmarked are the ones you will be running.

A third approach is the specification benchmark. An example is the TPC (Transaction Processing Performance Council) database benchmark. The benchmark is provided in the form of written specifications of which high-level operations are to be performed. The tester programs the benchmark in high-level languages subject to independent audit and review of the implementation. This code is then compiled (and/or interpreted) and run on the target system. Portability is guaranteed because each version is specifically written for the target system, while comparability of results is ensured by the audit process.

Each of these approaches has strengths and weaknesses. One of the strengths of SPEC's source-code format is that it's relatively inexpensive to measure systems. This allows more intensive measurement in the computer design process, promoting more rapid delivery of performance improvements to users, and providing a large base of comparative results. SPEC measurements drive advances in compiler optimization technology, benefiting commercial application vendors and users. As noted below, however, this strength carries a corresponding weakness, which the design of SPEC95 addresses.


The SPEC95 yardstick

SPEC95 produces the familiar speed metrics for integer and floating-point calculations, for both peak (aggressively tuned) and baseline (more conservative) cases.

As in SPEC92, there are also throughput metrics for single- or multiple-CPU performance. SPEC95 is a component-level, as opposed to a system-level, benchmark, and does not require a significant amount of I/O or contention between jobs for system resources. In multi-processor systems the aggregate capacity of the CPU is measured, not how effectively that capacity is delivered to a particular user's workload. Unlike SPEC92, the SPEC95 throughput measures were designed to accommodate clusters as well as uniprocessors and symmetric multiprocessors, and to provide a meaningful comparison of aggregate capacity across these very different architectures.
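The throughput idea can be sketched as follows. This is a simplified model, not SPEC's published formula (the real rate metrics include normalization constants): run several copies of a benchmark at once and express the aggregate work done per unit time relative to the reference machine.

```python
def rate_metric(n_copies, ref_seconds, elapsed_seconds):
    """Simplified throughput rate: aggregate work per unit time,
    normalized to the reference machine. Illustrative only -- SPEC's
    actual rate definitions also include a units constant."""
    return n_copies * ref_seconds / elapsed_seconds

# A 4-CPU SMP running 4 copies of a benchmark whose reference time
# is 100 seconds, finishing all copies in 110 seconds:
print(round(rate_metric(4, 100.0, 110.0), 2))  # -> 3.64
```

Note that a cluster and an SMP can both be scored this way, since only the number of concurrent copies and the total elapsed time matter, which is what lets the metric compare such different architectures.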

The SPEC95 metrics are listed below. Users can select the metric that is most relevant to their environment depending on what factors are most important: speed versus throughput, integer versus floating-point, and conservative versus aggressive tuning.

SPEC95 Metrics

		Speed		Throughput
Aggressive	SPECint95	SPECint_rate95
		SPECfp95	SPECfp_rate95
Conservative	SPECint_base95	SPECint_rate_base95
		SPECfp_base95	SPECfp_rate_base95

Better still, SPEC has always advised "read all the numbers" and pay particular attention to the ratios of benchmarks that are most similar to your applications.
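The composite numbers in the table are geometric means of per-benchmark SPECratios (each benchmark's reference-machine time divided by its measured time), which is why reading the individual ratios can tell you more than the single composite. A minimal sketch, with hypothetical ratios:

```python
from math import prod

def spec_ratio(ref_seconds, run_seconds):
    """Per-benchmark SPECratio: how many times faster than the reference machine."""
    return ref_seconds / run_seconds

def composite(ratios):
    """Composite metric: the geometric mean of the per-benchmark ratios."""
    return prod(ratios) ** (1.0 / len(ratios))

# Hypothetical per-benchmark ratios for an 8-benchmark integer suite:
ratios = [4.1, 4.6, 3.9, 5.2, 4.4, 4.0, 4.8, 4.3]
print(round(composite(ratios), 2))
```

The geometric mean keeps any single benchmark from dominating the composite, but it also means two machines with the same composite can have very different individual ratios, hence SPEC's advice to read all the numbers.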

Why SPEC95?

Hardware and compilers are evolving rapidly and benchmarks must stay current to accurately measure them. For example, SPEC89 put more stress on cache and memory systems than other common benchmarks of its day, back when cache sizes of 4-to-8 kilobytes and main memory sizes of 8-to-16 megabytes were typical. But today SPEC92 puts little stress on a typical 512 kilobyte cache, and many servers today have more cache memory than the old systems had main memory.

CPU speeds have increased to the point that on a number of modern systems some of the SPEC92 benchmarks run in just a few seconds, making timing less accurate and exaggerating cache and I/O start-up effects relative to the main application code. The run times of the SPEC95 benchmarks were chosen with expected performance increases over the lifetime of the suite taken into account. Indeed, running SPEC95 on older systems, as would be desirable to increase the base of comparable historical results, can be time-consuming. A single conforming execution of SPEC95 on the reference machine, a SPARCstation 10 Model 40, requires approximately 48 hours (!) of run time.
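That 48-hour figure gives a crude way to estimate run time on faster machines: to a first approximation, a machine with a SPECint95 of r finishes the suite r times faster than the reference SPARCstation. This ignores fixed overheads and mixed integer/floating-point run lengths, so treat it as a rough sketch:

```python
REF_HOURS = 48.0  # approximate conforming SPEC95 run on the SPARCstation 10 Model 40

def estimated_hours(spec95_ratio):
    """Rough run time on a machine that is spec95_ratio times the
    reference speed. First-order estimate only; real runs include
    fixed start-up and I/O overheads that do not scale with CPU speed."""
    return REF_HOURS / spec95_ratio

print(round(estimated_hours(7.33), 1))  # AlphaStation 600 5/300: roughly 6.5 hours
```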

There are some fundamental differences between SPEC92 and SPEC95:

SPEC spent considerable effort choosing applications for its SPEC95 benchmarks that complied with ANSI and POSIX standards, and reworking applications to bring them into compliance. Although the first release of SPEC95 is for Unix systems, this standardization work means the benchmarks will also run on non-Unix systems, such as Windows NT (running the full suite on these systems will be delayed until the forthcoming port of perl5 to NT is complete).

The SPEC95 benchmark tools will be the most visible improvement to anyone who, like Advanced Systems Magazine (predecessor to SunWorld Online), has struggled with the SPEC92 tools and makefile wrappers to verify a result published by a vendor, or to test different hardware/software configurations. Often in SPEC92, manual compilation using different make files for each benchmark was too difficult to document and reproduce. After the benchmarks were run, generation of the reporting pages was a time-consuming manual effort. In SPEC95 compiling and running benchmarks by hand is no longer allowed. You put all your portability and optimization flags in a single file, along with a complete description of the system under test. The SPEC tools compile and run the benchmark, and generate reporting pages in ASCII, PostScript, and HTML formats. Because the tools come already built for all major systems, installation is a snap too. When you run the setup program, it determines your processor architecture and operating system, and gives you a choice of available tool sets. Your selected tool set is then installed and you're ready to run.
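The single configuration file described above might look something like the sketch below. The field names here are illustrative only, not the actual SPEC95 config-file syntax; the point is simply that compiler choice, optimization flags, and the system description all live in one documented, reproducible place instead of being scattered across per-benchmark makefiles.

```
# Illustrative sketch -- not the actual SPEC95 config-file syntax.
# Compilers and tuning flags for the whole suite:
CC          = cc
COPTIMIZE   = -xO4
FC          = f77
FOPTIMIZE   = -fast

# Description of the system under test, used on the reporting pages:
hw_model    = SPARCstation 10 Model 40
hw_memory   = 64MB
sw_os       = Solaris 2.4
tester      = Example Corp.
```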

Flag me, baby

There is a risk, with all source-code benchmarks, of optimization far in excess of what ordinary users will see. That risk is inherent in their use as a primary metric by which to judge compilers in a highly competitive business environment. Advances in compiler technology over the past several years, stimulated in part by SPEC benchmarks, have greatly increased the delivery of CPU power to applications. So to the extent that these advances are generally applicable to a broad class of code, users benefit.

However, optimizations and tuning can also speed up particular benchmarks while having limited relevance to more general code. In that case the benchmark metric may inflate performance beyond what you may see in your applications. What do we do about this in SPEC95? The baseline metrics of conservative optimization are carried forward from SPEC92 and tightened. No more than four compiler flags may be used across all benchmarks (e.g., -optimization -processor -linking -fast_library). The run rules are greatly expanded in the definition of "benchmark-specific optimizations" that are expressly forbidden. The benchmarks were selected to avoid concentration of execution time in small code segments, and in library routines, so performance improvements in the benchmarks would require broader performance improvements.

Machines with large caches will look relatively better on SPEC95 than on SPEC92, as will machines with efficient processor-memory interconnects. The tables below list some representative SPEC95 results along with the SPEC92 results. "Cache" is the size in kilobytes of the largest level of cache memory.

System			CPU			Cache	SPECint95	SPECint_base95	SPECint92
AlphaStation 600 5/300	300MHz 21164		4096	7.33		7.33		338
IBM 43P			133MHz 604		512	4.55		4.45		177
HP J210			120MHz 7200		512	4.37		4.37		169
AlphaStation 250 4/266	266MHz 21064A		512	4.18		4.18		199
HP 735/125		125MHz 7150		512	4.04		4.04		136
IBM C20			120MHz 604		1024	-		3.85		155
SNI RM400/630		200MHz R4400		4096	3.95		3.79		140.7
IBM 591			77MHz Power2		288	-		3.67		144
Intel Xpress		133MHz Pentium		1024	3.68		3.64		191
IBM 3CT/39H		67MHz Power2		2048	-		3.28		130
HP 735/99		99MHz 7100		512	3.27		3.27		109
IBM C10			80MHz 601		1024	-		2.37		91
(Ref) SS10/40		40MHz SuperSPARC-I	36	-		1		50
System			CPU			Cache	SPECfp95	SPECfp_base95	SPECfp92
AlphaStation 600 5/300	300MHz 21164		4096	12.2		11.6		502
IBM 591			77MHz Power2		288	-		11.2		308
IBM 3CT/39H		67MHz Power2		2048	-		9.44		267
HP J210			120MHz 7200		512	7.54		7.54		269
AlphaStation 250 4/266	266MHz 21064A		512	5.78		5.78		263
HP 735/125		125MHz 7150		512	4.55		4.55		201
HP 735/99		99MHz 7100		512	3.98		3.98		168
IBM C20			120MHz 604		1024	-		3.5		150
IBM 43P			133MHz 604		512	-		3.31		157
IBM C10			80MHz 601		1024	-		2.97		101
Intel Xpress		133MHz Pentium		1024	3.04		2.37		121
(Ref) SS10/40		40MHz SuperSPARC-I	36	-		1		60

Other SPEC benchmarks

The system-level file server (SFS) committee is continuing work to update its NFS server benchmark, aka LADDIS (SPECnfs_A93 operations per second at x ms response time). The new version will include additional workloads and support for the NFS version 3 protocol with its client-caching features. Release is targeted for Spring of 1996.

The high-performance group (HPG) will be announcing its first benchmark suite, SPEChpc96 at the Supercomputing '95 conference. It will include two high-end benchmarks: one models seismic data processing used in oil exploration, and the other performs computational chemistry used to design pharmaceuticals and chemicals. The benchmarks will include versions for uniprocessors, shared-memory multiprocessors, and message-passing parallel computers. Future HPG benchmarks will also focus on specific industries that use high-performance computer systems, such as automotive design.

The SFS group is also working on a benchmark of WWW servers, and is in contact with leading Web organizations, such as Commerce Net and NCSA, discussing the workload and benchmark design. The basic concept is much like the NFS benchmark of SFS, in that client computers drive the server under test by issuing requests and measuring response time. One challenge for this benchmark is that the rapid growth and change of the Internet makes the benchmark's workload definition a moving target. SPEC is actively soliciting organizations with expertise in the Web and an interest in its future to join SPEC to help define, develop, and deliver this important new benchmark.
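The client-drives-server idea behind both the NFS and Web benchmarks can be illustrated with a toy load generator. This is a modern sketch of the concept only, not SPEC's benchmark code: a client issues requests against a server and records response times (here both run locally for a self-contained example).

```python
import threading
import time
import urllib.request
from http.server import HTTPServer, BaseHTTPRequestHandler

class Handler(BaseHTTPRequestHandler):
    """Trivial stand-in for the server under test."""
    def do_GET(self):
        self.send_response(200)
        self.send_header("Content-Type", "text/plain")
        self.end_headers()
        self.wfile.write(b"hello")

    def log_message(self, *args):
        pass  # silence per-request logging

def measure(url, n_requests=20):
    """Issue n_requests GETs and return the mean response time in seconds."""
    times = []
    for _ in range(n_requests):
        start = time.perf_counter()
        with urllib.request.urlopen(url) as resp:
            resp.read()
        times.append(time.perf_counter() - start)
    return sum(times) / len(times)

# Start a throwaway server on an ephemeral port and drive it:
server = HTTPServer(("127.0.0.1", 0), Handler)
threading.Thread(target=server.serve_forever, daemon=True).start()
mean = measure(f"http://127.0.0.1:{server.server_port}/")
print(f"mean response time: {mean * 1000:.2f} ms")
server.shutdown()
```

A real benchmark would run many client machines against a dedicated server and report throughput at a given response-time bound, which is exactly where a stable workload definition matters.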


About the author
Walter Bays is a Senior Staff Engineer at Sun Microsystems. He has been involved in SPEC since 1989 in the capacities of Chairman of the Open Systems Steering Committee, Vice-Chair, Secretary/Recorder, member of the Board of Directors, and most importantly, benchmark hacker.


[(c) Copyright Web Publishing Inc., an IDG Communications company]
