When is it faster to have 64 bits?
A comprehensive look at the term "64-bit operations"
What is all the fuss about 64 bits? Several times in the past, the computer industry has promoted new systems as being 64-bit, and therefore better than 32-bit. In this column I'll explain the technical details and examine some of the claims made for "64-bit" systems over the years. (2,500 words)
I've seen much news about 64-bit systems
lately. Vendors claim 64 bits are faster than 32 bits, and it seems the
fastest CPUs have 64-bit capability. Can I take advantage of 64-bit
operations to make my applications run faster? If I'm not mistaken,
several vendors in the last few years marketed "64-bit" machines. What
(if anything) is the difference this time?
-- Bitless in Beloit
Confusion surrounds this issue. Part of the problem is that marketers are not always clear about what the term "64-bit" refers to. I'll start by making sure everyone understands bits and bit specs, and then move on to comparing 32-bit and 64-bit operations. We can then look at the many places where 64-bit-ness can be applied.
A bit is the smallest unit of memory or data in a computer. Groups of bits make up storage units in the computer -- characters, bytes, words, files. A single bit has two values (0 and 1, or on and off). Two bits have four values; three bits, eight values. Eight bits equal one byte, or 256 values; 32 bits equal 4 bytes, or 4,294,967,296 values, often called 4 Gig; 64 bits equal 8 bytes, or 18,446,744,073,709,551,616 values.
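The counts above are just powers of two, so they are easy to check. Here is a minimal C sketch; the function names values_for_bits and max_for_bits are my own for illustration, not part of any library. Note that the count for 64 bits does not itself fit in a 64-bit integer, so the first function stops at 63 bits.

```c
#include <assert.h>
#include <stdint.h>

/* Number of distinct values an n-bit field can hold, for 1 <= n <= 63.
   (2^64 does not fit in a 64-bit integer, so n = 64 is excluded.) */
uint64_t values_for_bits(unsigned n)
{
    assert(n >= 1 && n <= 63);
    return (uint64_t)1 << n;
}

/* Largest unsigned value an n-bit field can hold, for 1 <= n <= 64. */
uint64_t max_for_bits(unsigned n)
{
    assert(n >= 1 && n <= 64);
    return n == 64 ? UINT64_MAX : (((uint64_t)1 << n) - 1);
}
```

For example, values_for_bits(32) gives the 4,294,967,296 quoted above, and max_for_bits(64) is one less than the 64-bit count, because counting starts at zero.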
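Bit specifications as related to computer performance can concern arithmetic operations (floating-point and integer), internal datapaths, external buses and caches, or memory addressing.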
By definition, 64-bit operations handle 64 bits, twice as much as 32 bits. If you perform operations that need 64 bits, they can be performed by two 32-bit operations or one 64-bit operation. Operations that fit within 32 bits will run on a 64-bit system at the same speed -- or sometimes more slowly.
Slower on a 64-bit system? Here's an analogy to explain why this might be the case. Suppose I have to drive to work during rush hour. I can take a motorbike or a Porsche. The Porsche can carry twice as many people as the motorbike, and goes just as fast on the open road or in the carpool lane. However, if there is only one person in the Porsche, then it's not allowed in the carpool lane and it can't slip through gaps in the traffic. When each vehicle has one passenger, the bike has the advantage. We will see later on why the smaller size of a 32-bit system can make it more nimble than a 64-bit system.
By the way, despite having a Porsche on the front cover of my book, I haven't quite sold enough copies to trade in my 15-year-old Fiat Spider at the local sports car dealer. Rest assured when you buy my book the royalties will go to a good cause :-).
There are two ways to beef up compute performance: raise the clock rate, or do more work in each clock cycle.
Since the invention of the microprocessor, we've seen chip makers hike their clock rates like, well, like clockwork. To accomplish more work per cycle, chip makers have also widened internal and external buses, arithmetic operations, and memory addresses. Not long ago, arithmetic operations moved to 64 bits, with internal and external buses following suit simultaneously or soon after; memory addresses, however, remained at 32 bits. We are now in the final phase, as all major microprocessor architectures are adding 64-bit addressing.
What are the software implications of 64-bit hardware? The software
doesn't know or care about the internal or external bus width. Also,
most computer languages already offer a full range of 32-bit and 64-bit
arithmetic options for 32-bit processors. If some implicit defaults
change (such as the size of an int or long in C), then some work is
required to port code. For changes in addressing,
the implication is a major rewrite of the operating system, and some
programmatic elbow-grease to any applications needing the extra
addressing.
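To see which defaults a particular compiler and system have chosen, you can simply ask. The sketch below is a hedged illustration: the helper names (bits_of, print_type_sizes, long_too_small_for_pointer) are invented for this example, and the numbers it prints depend entirely on the platform and compiler options in use.

```c
#include <limits.h>
#include <stdio.h>

/* Width in bits of an object that occupies the given number of bytes. */
unsigned bits_of(size_t bytes)
{
    return (unsigned)(bytes * CHAR_BIT);
}

/* Print the widths of the types that tend to change between
   32-bit and 64-bit programming models. */
void print_type_sizes(void)
{
    printf("int       : %u bits\n", bits_of(sizeof(int)));
    printf("long      : %u bits\n", bits_of(sizeof(long)));
    printf("long long : %u bits\n", bits_of(sizeof(long long)));
    printf("void *    : %u bits\n", bits_of(sizeof(void *)));
}

/* Nonzero if code that stores a pointer in a long would lose bits here.
   This is one of the classic porting hazards when addresses grow. */
int long_too_small_for_pointer(void)
{
    return sizeof(long) < sizeof(void *);
}
```

Code that never assumes a pointer fits in an int or a long, and that uses sizeof rather than hard-coded sizes, generally needs far less porting work.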
SPARC versions
Today's SPARC CPUs follow the 32-bit, version 8 SPARC architecture
definition. This includes microSPARC I and II, HyperSPARC, and
SuperSPARC I and II. Sun's SPARC Technology Business recently
announced UltraSPARC, and HAL Computer Systems announced SPARC64, the
first two 64-bit SPARC version 9 implementations.
Since SPARC V9 is a full superset of V8 for user-mode applications, V8 code runs perfectly well on these new CPUs. Here the terms "32-bit" and "64-bit" refer to the size of the linear address space that the CPU can manipulate directly. To take advantage of any SPARC V9 features, an application must be recompiled. Apps compiled to use V9 instructions will not run on V8 systems.
64-bit floating-point arithmetic
Floating-point arithmetic is defined by the IEEE 754 standard for
almost all computers. The standard includes 32-bit single-precision and
64-bit double-precision data types. For many years, single-precision
operations were faster than double-precision. This changed when
supercomputers, and then microprocessors, implemented 64-bit
floating-point in a single cycle. Since both 32- and 64-bit operations
take the same time, this was the first opportunity for marketing
departments to talk about 64-bit-ness. It is common for Fortran
programmers to assume that a 64-bit system is one where
double-precision floating-point runs at full speed.
Sun and others have offered full-speed 64-bit arithmetic processors for several years. (Five years ago, Intel marketed its i860 as the first 64-bit microprocessor, based on arithmetic operations.) The SuperSPARC, HyperSPARC and UltraSPARC processors all contain floating-point units that perform double-precision multiply and add operations at the same rate as single-precision operations. They all can be considered full-speed, 64-bit arithmetic processors -- unlike the older SPARC CPUs and microSPARC I and II, which take longer to perform double-precision operations.
SPARC V8 defines 32 single-precision floating-point registers. For 64-bit operations, the registers pair up, so there are only 16 double-precision floating-point registers. SPARC V9 defines an additional 16 double-precision registers, for a total of 32 double-precision registers. (The number of registers directly accessible as single-precision is still 32, however.)
64-bit integer arithmetic
The C language is usually defined to have int and long as 32-bit
values on 32-bit systems. To provide for 64-bit arithmetic, an
extension to ANSI C was made to create the 64-bit long long type. Some
CPU architectures do not support
64-bit integer operations, and use several 32-bit instructions to carry
out each 64-bit operation. The SPARC architecture has always provided
support for loading and storing 64-bit integers, using pairs of 32-bit
registers. Older SPARC chips take several cycles to execute the
instructions, but SuperSPARC, HyperSPARC, and UltraSPARC all implement
64-bit load and store in a single cycle.
When it comes to 64-bit add and subtract, only UltraSPARC can do this in a single V9 instruction. SPARC V8 chips have to do this with two V8 instructions.
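To make the "two instructions" case concrete, here is a rough C model of a 64-bit add built from 32-bit pieces: add the low words and note the carry, then add the high words plus that carry. The pair64 type and the function names are hypothetical, chosen only for this sketch; a real compiler emits the equivalent machine instructions directly.

```c
#include <stdint.h>

/* A 64-bit integer held as two 32-bit halves, like a register pair. */
typedef struct {
    uint32_t hi;
    uint32_t lo;
} pair64;

/* Split a 64-bit value into its high and low 32-bit words. */
pair64 split64(uint64_t v)
{
    pair64 p;
    p.hi = (uint32_t)(v >> 32);
    p.lo = (uint32_t)v;
    return p;
}

/* Rebuild a 64-bit value from its two halves. */
uint64_t join64(pair64 p)
{
    return ((uint64_t)p.hi << 32) | p.lo;
}

/* 64-bit add using only 32-bit arithmetic: two dependent steps. */
pair64 add64_via_32(pair64 a, pair64 b)
{
    pair64 r;
    uint32_t carry;

    r.lo = a.lo + b.lo;           /* low words; wraps modulo 2^32 */
    carry = (r.lo < a.lo);        /* 1 if the low add overflowed  */
    r.hi = a.hi + b.hi + carry;   /* high words plus the carry    */
    return r;
}

/* The same sum as a single native 64-bit operation, for comparison. */
uint64_t add64_native(uint64_t a, uint64_t b)
{
    return a + b;
}
```

Both routes produce the same answer. The difference is that the 32-bit route needs two dependent operations where a 64-bit datapath needs one.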
Full-speed 64-bit floating-point arithmetic and full-speed 64-bit integer load/store have been available since the SuperSPARC first shipped in 1992. SPARC V8 defines the integer registers and arithmetic to be 32-bit. In SPARC V9 the registers work as 32-bit, 64-bit, and V8-compatible pairs. SPARC V9 also adds 64-bit arithmetic to the instruction set.
As touted in a recent flurry of stories (including a September SunWorld Online article, "64-bit Unix initiative launched"), the major players in the industry have (in an unprecedented display of common sense) begun making some decisions about a standard 64-bit version of the Unix programming interface.
Internal buses or datapaths
The internal buses routing data within a CPU are known as
datapaths. If a datapath is only 32 bits wide, then all
64-bit-wide data will take two cycles to transfer. As you might guess,
the same CPUs (namely SuperSPARC, HyperSPARC, and UltraSPARC) that can
perform 64-bit floating-point and integer operations in one clock cycle
also have 64-bit internal datapaths. The wide datapath connects the
registers, the arithmetic units, and the load/store unit.
Since these CPUs are also superscalar, they can execute several instructions in a single clock cycle. The permitted combinations of instructions are complex to explain, but one example adequately illustrates the concept: In SuperSPARC the integer register file can accept one 64-bit result, or two 32-bit results in each clock cycle. Thus, either two 32-bit integer adds or a single 64-bit integer load can be processed in a single cycle.
External buses and caches
Data has to get in and out of the CPU somehow. On the way it gets
stored in caches that speed up repeated access to the same data, and
accesses to adjacent data items in the same cache block. For some
applications, the ability to move a lot of data very quickly is the
most important performance measure. The width of the cache memory and
the buses that connect to caches and to main memory are usually 64 bits
and as much as 128 or 256 bits in some CPUs.
Does this make these CPUs into something that can be called a 64-, 128-, or 256-bit CPU? Apparently Solbourne thought so in 1991 when it launched its own SPARC chip design, marketing it as the first 64-bit SPARC chip. In fact it was the first SPARC chip that had a 64-bit wide cache and memory interface, and was similar in performance to a 33-MHz microSPARC.
The caches and main memory interfaces in microSPARC, SuperSPARC, HyperSPARC and UltraSPARC are at least 64 bits wide. The SuperSPARC on-chip instruction cache uses a 128-bit wide datapath to load four instructions into the CPU in each cycle. Alas, Sun's marketing department missed the opportunity to call SuperSPARC a "128-bit chip." :-)
The SuperSPARC and HyperSPARC access memory over the MBus, a 64-bit-wide system interconnect. The MBus-based systems use 144-bit-wide memory SIMMs to provide 128 bits of data, plus error correction. It takes two data transfer cycles to pass the data from the SIMM over the MBus to the CPU.
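The two-cycle figure is simple division, rounded up. As a small sketch (transfer_cycles is a name made up for this example, and it deliberately ignores latency, arbitration, and burst behavior), the number of data transfer cycles for a given width of data over a given width of bus is:

```c
#include <assert.h>

/* Data transfer cycles needed to move data_bits over a path that is
   bus_bits wide, rounding up. Models width only, not latency. */
unsigned transfer_cycles(unsigned data_bits, unsigned bus_bits)
{
    assert(bus_bits > 0);
    return (data_bits + bus_bits - 1) / bus_bits;
}
```

So transfer_cycles(128, 64) is 2 for the SIMM-to-MBus case, and transfer_cycles(64, 32) is 2 for 64-bit data on a 32-bit datapath. Data narrower than the bus still costs one full cycle.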
As part of the UltraSPARC's memory structure, the 128-bit interface from the UltraSPARC chip connects to its external cache and continues to a main memory system that provides 256 bits of data per cycle.
In summary, external interfaces and caches are already at least 64 bits wide. New designs are wider still.
64-bit addressing
The latest 64-bit development involves changing pointers and addresses
from 32-bit to 64-bit quantities. Naturally, marketers are inventing
new adjectives to describe how this solves the world's problems.
In fact, the performance improvements in UltraSPARC and other 64-bit
processors come despite the 64-bit address support, not
because of it.
Let's revisit my generic statement at the beginning of this column: Performance improves when you go from 32 bits to 64 bits only if you are exceeding the capacity of 32 bits. For addresses, this implies that performance will improve if you are trying to address more than 4 gigabytes of RAM.
The downside is, of course, the increase in size of every pointer and address. Applications embed many addresses and pointers in code and data. When they all double in size the application grows, reducing cache hit rates and increasing memory demands. All other things equal, you can expect a small decrease in performance for everyday applications when moving from 32- to 64-bit addressing.
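A pointer-heavy data structure shows the effect. In the hedged sketch below, the two structs model the same binary-tree node with 32-bit and 64-bit "pointers" (plain integers stand in for real pointers so the comparison works on any machine). The struct names, the helper function, and the 64-byte cache line used in the note that follows are all assumptions for illustration.

```c
#include <stddef.h>
#include <stdint.h>

/* A tree node as laid out with 32-bit addresses. */
struct node32 {
    uint32_t left;    /* stands in for a 32-bit pointer */
    uint32_t right;   /* stands in for a 32-bit pointer */
    int32_t  key;
};

/* The same node as laid out with 64-bit addresses. */
struct node64 {
    uint64_t left;    /* stands in for a 64-bit pointer */
    uint64_t right;   /* stands in for a 64-bit pointer */
    int32_t  key;     /* often followed by alignment padding */
};

/* How many whole nodes fit in one cache line of the given size. */
size_t nodes_per_cache_line(size_t node_size, size_t line_size)
{
    return line_size / node_size;
}
```

On typical systems sizeof(struct node32) is 12 bytes and sizeof(struct node64) is 24 bytes once padding is counted, although the exact figures depend on the compiler's alignment rules. With an assumed 64-byte cache line, that is five nodes per line versus two, which is why the hit rate suffers even though the program does exactly the same work.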
100 times faster? Nothing to scream about!
How does this square up with those Digital Equipment advertisements
that claim applications run 100 times faster due to the use of
64 bits? In Oracle Magazine's July/August 1995 issue (p.
89) there is a description of the tests that DEC performed. I'll try to
briefly summarize what the article says about the "100 times faster"
figure.
The comparison was performed by running two tests on the same uniprocessor DEC Alpha system. Thus, both the slow and the 100 times faster results were measured on a 64-bit machine running a 64-bit Unix. The performance differences were obtained with two configurations of an Oracle "data warehouse style" database running scans, index builds, and queries on each configuration. Of several operations evaluated, the slowest was three times faster, the fastest was a five-way join that was 107 times faster.
The special features of the configuration were that the database contained about 6 gigabytes of data, and the system was configured with two different database buffer memory and database block sizes:
Slow configuration                  Fast configuration
----------------------------------  ----------------------------------
128 megabytes of shared             8.5 gigabytes of shared
buffer memory                       buffer memory
2-kilobyte database blocks          32-kilobyte database blocks
lots of disk I/O in small blocks    entire database was loaded
was performed during tests          into RAM during tests
So what does this have to do with 64 bits versus 32 bits? The performance difference is what you would expect to find when comparing RAM speed with disk speed. DEC greatly magnified the performance difference by using a database small enough to fit in memory on the fast configuration, and by giving the slow configuration only 128 megabytes of shared buffer memory and smaller blocks.
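A back-of-the-envelope calculation shows how lopsided the two configurations are. The sketch below is my own illustration, not part of DEC's test: resident_fraction is a hypothetical helper, sizes are in binary megabytes, and it considers only how much of the database can sit in the buffer, not the access pattern.

```c
#include <assert.h>

/* Fraction of a database that can be resident in a buffer cache,
   capped at 1.0 when the buffer is larger than the database. */
double resident_fraction(double database_mb, double buffer_mb)
{
    double fraction;

    assert(database_mb > 0.0 && buffer_mb >= 0.0);
    fraction = buffer_mb / database_mb;
    return fraction > 1.0 ? 1.0 : fraction;
}
```

With a 6-gigabyte (6144-megabyte) database, resident_fraction(6144.0, 128.0) comes to roughly 0.02, so about 98 percent of the data must come from disk in the slow configuration. With the 8.5-gigabyte (8704-megabyte) buffer the result is 1.0, and nothing needs to come from disk once the database is loaded.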
To redress the balance, a more realistic test should have a database that is much larger than 6 gigabytes. A few hundred gigabytes is more like the kind of data warehouse that would justify buying a big system. Six gigabytes is no more than a data shoebox, and would fit on a single disk drive. The comparison should be performed with the same 32-kilobyte database block size for both tests, and a 2-gigabyte shared buffer, not a 128-megabyte shared buffer, for the "32-bit" configuration. In this updated comparison I would not be surprised if both configurations became CPU- and disk-bound, in which case they would run at the same speed.
How does a 32-bit system handle more than 4 gigabytes of RAM?
It may have occurred to some of you that Sun and Cray currently ship
SPARC/Solaris systems that can be configured with more than 4 gigabytes
of RAM. How can this be? The answer is that the SPARC memory management
unit maps a 32-bit virtual address to a 36-bit physical address. While
any one process can access only 4 gigabytes, the rest of memory is
available to other processes and acts as a filesystem cache.
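The mapping can be sketched in a few lines of C. This is a simplified model, not the actual SPARC MMU: it assumes 4-kilobyte pages and a page table entry that supplies a 24-bit physical page number, and the names physical_address and addressable_bytes are invented for the example. The low 12 bits of the virtual address pass through as the offset within the page, and the page number from the table supplies the upper 24 bits, giving 36 bits in all.

```c
#include <assert.h>
#include <stdint.h>

#define PAGE_SHIFT 12u                          /* assumed 4-KB pages */
#define PAGE_MASK  ((1u << PAGE_SHIFT) - 1u)

/* Combine the physical page number found for a virtual address with
   the offset inside the page. A 24-bit page number plus a 12-bit
   offset yields a 36-bit physical address. */
uint64_t physical_address(uint32_t vaddr, uint32_t phys_page_number)
{
    uint64_t offset = vaddr & PAGE_MASK;
    return ((uint64_t)phys_page_number << PAGE_SHIFT) | offset;
}

/* Bytes reachable with an address of the given width (under 64 bits). */
uint64_t addressable_bytes(unsigned address_bits)
{
    assert(address_bits < 64);
    return (uint64_t)1 << address_bits;
}
```

Here addressable_bytes(32) is the 4 gigabytes that any one process can see, while addressable_bytes(36) is 64 gigabytes of possible physical memory. Each process gets its own set of page table entries, so different processes can map their 4 gigabytes onto different parts of RAM.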
The Sun SPARCcenter 2000 supports 5 gigabytes of RAM; the Cray CS6400 16 gigabytes. It is amusing to note that the 64-bit DEC Alpha system supports "only" 14 gigabytes of RAM. It would be possible to make database tables in filesystems, and cache up to 16 gigabytes of files in memory on a CS6400. There would be some overhead moving data from the filesystem cache to the shared database buffer area, but nowhere near the overhead of disk access.
The bottom line
So where does this leave us on the issue of better performance through
64 bits? In the cases that affect a large number of applications,
there is plenty of support for high-performance 64-bit operations in
SuperSPARC and HyperSPARC. The performance improvements that UltraSPARC
will bring are due to its higher clock rate, wider memory interface,
and other changes.
Today there are a small number of applications that can take advantage of more than 4 gigabytes of RAM to improve performance. These applications can only run on a tiny number of systems, compared to the size of the Unix marketplace and installed base. Some standards are now being worked on that will hopefully provide for good application source code portability across the 64-bit implementations of several Unix variants. In a few years -- with portable standards, a large installed base of 64-bit hardware, and the inexorable growth in RAM capacity -- 64-bit Unix will begin to move into the mainstream.
Most vendors have released details of their 64-bit CPU architecture and have announced or shipped 64-bit implementations. The marketing of 64-bit-ness is now shifting to software considerations, such as operating-system support and application programming interfaces.
About the author
Adrian Cockcroft joined Sun in 1988, and currently works as a performance specialist for the Server Division of SMCC. He is the author of Sun Performance and Tuning: SPARC and Solaris, published by SunSoft Press PTR Prentice Hall.
Reach Adrian at adrian.cockcroft@sunworld.com.
URL: http://www.sunworld.com/swol-11-1995/swol-11-perf.html