Can a `network of workstations' provide cheap, scalable supercomputing power?
Researchers at UC Berkeley are turning idle workstations into supercomputers. Will the integration make your company more efficient?
Scientists have been experimenting with networking clusters of computers together for years. In the past, they often wrote new software and created new hardware components for the systems. A team of researchers at the University of California, Berkeley has now developed a way of networking workstations together for supercomputer-class performance without modifying the hardware or operating system. Once perfected, network of workstations (NOW) technology will allow an IS manager to turn underutilized workstations into components of a supercomputer during off hours. (3,300 words, including a sidebar)
You might have an unused supercomputer sitting in your office, staring you in the face every time you sit down to work. After all, in a typical organization, workstations sit idle for a good part of each day and most of the night, drawing power all the while. A group of researchers at UC Berkeley is working on a way of linking these commodity boxes into what it calls a network of workstations (NOW) that scales up to supercomputer speeds. A survey of computer usage on campus concluded that at any given time, 60 percent of the workstations are unused, with an even higher percentage unused at night.
During the past decade, researchers and corporate IT managers experimented with parallel computers, but their approaches often relied on expensive, custom-designed components that drove the cost beyond the range of affordability. In some cases, they experimented by building systems based on commodity processors. For example, Thinking Machines (Cambridge, MA) used SPARC chips for its final generation of parallel workstations before it got out of the hardware business. But even then it had to do a lot of work at the systems level on such mundane but important components as systems integration.
While vendors have been working on this basic design for some time,
NOW attempts to go beyond some of the earlier limitations in terms of
speed and cost. Cedric Krumbein, a researcher at UC Berkeley,
identifies four main differences between clustering techniques and
Developers of these early attempts at building supercomputers from workstations spent a lot of time reinventing the wheel by rewriting device drivers and doing software development. All of the operating systems had to be rewritten from scratch, and their applications were also rewritten or ported to the new environment. Furthermore, these pseudo-supercomputers did not hold much potential for IT managers with established networks who did not want multiple operating systems in their network, much less on the same workstations.
NOW attempts to take advantage of all of the work done on commodity systems hardware and software. Its current implementation runs on commodity Sun workstations running unmodified, commercial Unix code. In the future, it might be possible to support other platforms, such as Pentium-based PCs running Windows NT.
Researchers at UC Berkeley, who have already built two generations of NOW computers called NOW 0 and NOW 1, are currently working on a 100-node version called NOW 2. It is funded by the Department of Defense's Advanced Research Projects Agency (ARPA), which has a history of investing in such leading-edge technology as the Internet. In addition, the research is sponsored by a number of companies including Sun Microsystems, AT&T, Digital Equipment Corp., Hewlett-Packard, Informix, and Myricom.
As a concept, NOW seems intuitively obvious, but there are some significant technical problems associated with integrating a collection of computers. The workstations need to communicate with each other at extremely high speed with low latency and a minimum of overhead. The workstations must act together in storing and moving data around in redundant array of independent disks (RAID)-like fashion to take advantage of the parallel architecture. The global operating system needs to coordinate the standard operating systems on each machine for sharing and distributing processes and data.
There are two elements associated with optimizing communications within the network. At the hardware level, you need a very fast network. At the software level, you need a way of sending numerous short messages without incurring processor overhead.
Both NOW 1 and NOW 2 use Myrinet as the LAN, which is capable of shunting 1.28 gigabits per second in each direction for a maximum bandwidth of 2.56 gigabits per second. Eight-port Myrinet switches are used for point-to-point connections between any two workstations in NOW. Depending on how large the network needs to scale, four to eight processors can be connected to each Myrinet switch. The remaining ports are connected to other Myrinet switches in a hierarchy, giving each machine a switched path to all of the others.
Unlike shared LAN topologies such as FDDI, Ethernet, or token ring, Myrinet is a point-to-point protocol, so bandwidth is not shared by processors. A fast Ethernet network can provide 100 megabits per second of throughput. This works well for infrequent accesses, but for a communications-intensive application running on a NOW, each workstation would have access to less than 1 megabyte per second because of sharing and network overhead.
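The arithmetic behind that comparison can be sketched in a few lines. The link speeds are the ones quoted above. The 100-node cluster size is borrowed from the NOW 2 plan, and the model assumes every node on the shared network transmits at once and ignores protocol overhead unless an efficiency factor is supplied, so treat the output as a rough bound rather than a measurement.

```python
# Back-of-the-envelope comparison of shared and switched LAN bandwidth.
# Link speeds come from the article; the efficiency factor is an assumption.

def per_node_megabytes_per_sec(link_megabits, nodes, shared, efficiency=1.0):
    """Return the bandwidth available to one node in megabytes per second."""
    usable = link_megabits * efficiency   # megabits/s left after protocol overhead
    if shared:
        usable /= nodes                   # a shared medium is divided among all senders
    return usable / 8                     # 8 bits per byte

NODES = 100  # size of the planned NOW 2 cluster

fast_ethernet = per_node_megabytes_per_sec(100, NODES, shared=True)
myrinet = per_node_megabytes_per_sec(1280, NODES, shared=False)

print(f"Fast Ethernet, shared by {NODES} nodes: {fast_ethernet:.3f} MB/s per node")
print(f"Myrinet, switched: {myrinet:.0f} MB/s per node in each direction")
```

Even before overhead is counted, the shared network leaves each workstation with about an eighth of a megabyte per second, while a switched link keeps its full rate no matter how many nodes are attached.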
NOW's goal is to provide machine-to-machine communication of small messages among 100 processors in under 10 microseconds, which approaches the 2.5 microseconds of a Thinking Machines CM-5 parallel computer, according to Krumbein. Generic Active Messages was developed for the Thinking Machines CM-5 in 1992 by Thorsten von Eicken, then a graduate student at UC Berkeley, as a low-overhead communications architecture based on request-and-reply messages that transfer data and control of processes. This version has been adapted for other platforms and was also deployed in the first version of Inktomi's search engine.
Says Krumbein, "Ten microseconds was chosen rather arbitrarily as a challenging but reachable goal. Other fast networking schemes have achieved this speed, but they lack the parallel program support that Active Messages provides."
All current TCP implementations have relatively high latency compared to massively parallel processors because of the memory management done in the kernel and because the operating system must be involved in each messaging operation. Active Messages allows direct user-level to user-level communication without involving the operating system (which gives much better latency) and without complicated buffer management schemes (which also reduces latency).
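The request-and-reply pattern can be illustrated with a toy model. This is a single-process sketch, not the NOW implementation. The names Node, register, send, and poll are hypothetical, and a Python deque stands in for the network interface. What it shows is the core idea: each message names the handler that should consume it, so the receiver dispatches straight to that handler instead of buffering data for a later receive call.

```python
# Toy model of the Active Messages idea. All class and method names here
# are hypothetical illustrations, not the Active Messages API.
from collections import deque

class Node:
    def __init__(self, name):
        self.name = name
        self.handlers = {}     # handler id -> function
        self.inbox = deque()   # stands in for the network interface queue
        self.memory = {}       # application data owned by this node

    def register(self, handler_id, fn):
        self.handlers[handler_id] = fn

    def send(self, dest, handler_id, *args):
        # A real implementation would write to the network interface from user space.
        dest.inbox.append((self, handler_id, args))

    def poll(self):
        # Drain pending messages, dispatching each directly to its handler.
        while self.inbox:
            sender, handler_id, args = self.inbox.popleft()
            self.handlers[handler_id](self, sender, *args)

READ_REQUEST, READ_REPLY = 0, 1

def read_request(node, sender, key):
    # Request handler: look up the value and reply immediately.
    node.send(sender, READ_REPLY, key, node.memory.get(key))

def read_reply(node, sender, key, value):
    # Reply handler: deposit the value directly into the requester's memory.
    node.memory[key] = value

a, b = Node("a"), Node("b")
for n in (a, b):
    n.register(READ_REQUEST, read_request)
    n.register(READ_REPLY, read_reply)

b.memory["x"] = 42
a.send(b, READ_REQUEST, "x")   # request travels a -> b
b.poll()                       # b runs the request handler and replies
a.poll()                       # a runs the reply handler
print(a.memory["x"])
```

Because handlers are short and run on arrival, no intermediate buffer has to be allocated or copied, which is where the latency savings described above come from.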
Serverless file system
"Today, file systems usually consist of one giant server, and that means there are a lot of bottlenecks in terms of performance," says Randy Wang, a NOW researcher. "We are trying to get the server and client to cooperate so it is fast."
The idea is to stripe data across workstations, much as RAID stripes data across disks, to increase data transfer rates. But disk drives are not the only devices that can be shared. On a fast network, moving data to unused RAM on other workstations can be more than an order of magnitude faster than moving it to local disk. The only speed limitations are network bandwidth and latency. NOW research suggests that the time it takes to service a local file system cache miss for an 8 kilobyte file is approximately 1,050 microseconds from network RAM, versus about 14,800 microseconds from a local disk.
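A short calculation makes the gap concrete. The two service times are the figures quoted above. The fractions of misses served from network RAM are hypothetical values chosen only to show how the average moves.

```python
NETWORK_RAM_US = 1_050    # from the article: 8 KB served from remote RAM
LOCAL_DISK_US = 14_800    # from the article: 8 KB served from local disk

def average_miss_time_us(fraction_from_network_ram):
    """Average time to service a local cache miss, in microseconds."""
    f = fraction_from_network_ram
    if not 0.0 <= f <= 1.0:
        raise ValueError("fraction must be between 0 and 1")
    return f * NETWORK_RAM_US + (1.0 - f) * LOCAL_DISK_US

speedup = LOCAL_DISK_US / NETWORK_RAM_US
print(f"Network RAM is about {speedup:.1f}x faster than local disk")
for f in (0.0, 0.5, 0.9):     # hypothetical fractions, for illustration only
    print(f"{f:.0%} from network RAM -> {average_miss_time_us(f):,.0f} us")
```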
To address the need for sharing these resources, Wang and his associates developed the Serverless File Service called xFS, which distributes the load across multiple processors and storage devices so it is balanced and there is no single point of failure. In place of a single server, clients cooperate in all aspects of the file system, such as managing metadata, storing data, and enforcing protection.
To achieve these goals, xFS has a number of features that are not part of traditional file systems. Metadata can be moved among machines, providing quicker access and protection in case of a workstation malfunction. File data and metadata are stored in a software-defined RAID, which is less expensive than simple file replication and provides faster access. This is because striping data across multiple machines allows the operating system to store and retrieve data faster, assuming the hardware and software support parallel data transfers. Client caches are managed cooperatively as a single cache for disk, and the client disks are managed as a cooperative cache for robotic tape storage, which reduces I/O bottlenecks.
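Striping with redundancy can be sketched as follows. This is a minimal single-parity example in the style of RAID, chosen for illustration. The article does not say which redundancy scheme xFS uses, and the function names are hypothetical. Each data block would live on a different workstation, with one extra workstation holding the XOR of the others so that any single lost block can be rebuilt.

```python
from functools import reduce

def xor_blocks(blocks):
    """XOR equal-length byte strings together, byte by byte."""
    return bytes(reduce(lambda x, y: x ^ y, column) for column in zip(*blocks))

def stripe(data, data_nodes, block_size):
    """Split data into one stripe: one block per data node plus a parity block."""
    if len(data) > data_nodes * block_size:
        raise ValueError("data does not fit in one stripe")
    padded = data.ljust(data_nodes * block_size, b"\0")
    blocks = [padded[i * block_size:(i + 1) * block_size] for i in range(data_nodes)]
    return blocks + [xor_blocks(blocks)]

def rebuild(blocks, lost):
    """Recover the block held by a failed node from the surviving blocks."""
    survivors = [b for i, b in enumerate(blocks) if i != lost]
    return xor_blocks(survivors)

stripe_blocks = stripe(b"network of workstations", data_nodes=3, block_size=8)
recovered = rebuild(stripe_blocks, lost=1)
print(recovered == stripe_blocks[1])
```

Parity costs one extra block per stripe, whereas replication doubles the storage, which is the sense in which a software RAID is cheaper than keeping full copies.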
Gluing Unix together
One of the keys to NOW is managing the global pool of resources as if they were a single machine. Often the first step in building a new kernel is to throw everything out and start from scratch. But that is almost like throwing out the baby with the bath water, since there are millions of dollars already invested in operating system research and billions more in the software applications that run on them.
"A lot of the parallel systems (vendors) opted to rewrite the kernel, but because of that, it was not accessible to users," says Doug Ghormley, a researcher working on the operating system for NOW. "Most people will not run an experimental kernel. They prefer to run Solaris, which has been tested for many years."
In NOW, the approach is to glue together systems into what researchers call Global Layer Unix (GLUnix). In this environment, each workstation is running an individual copy of Unix, as opposed to having multiple systems tied together on a network with a single version of Unix running on the host. The challenge is ensuring adequate performance.
A key requirement for GLUnix to support commercial workstations was software fault isolation. Traditional operating system kernels use hardware virtual memory to enforce firewalls between applications. But NOW researchers are developing Secure Loadable Interposition Code (SLIC), which is a way of inserting third-party extensions into standard operating systems.
"We will use SLIC to insert GLUnix extensions into the operating system to help us overcome some of the limitations that standard user-level cluster tools have," Krumbein explains. "GLUnix will use SLIC to catch (the relevant) system calls in order to weave the illusion of a single system image."
It is not possible to provide all of the global services that might be desired with no changes to the kernel. But if the changes can be kept to a minimum, companies can incorporate them into their operating system so that they can be "NOW ready." For example, they might replace the swap device driver for virtual RAM to enable network RAM or replace the kernel communication software with a low overhead version.
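The interposition idea can be illustrated with a toy model. Everything here is hypothetical: the Workstation class, its dispatch table, and the "ps" call are stand-ins, not SLIC's interface or any Unix kernel's. The sketch shows only the general pattern of wrapping an existing call so that the stock code still runs while an extension adds a cluster-wide view on top.

```python
# Toy model of system call interposition. The dispatch table and call names
# are hypothetical; this is not SLIC's interface.

class Workstation:
    def __init__(self, name):
        self.name = name
        self.processes = {}                       # local pid -> command
        self.syscalls = {"ps": self._local_ps}    # stand-in for the system call table

    def _local_ps(self):
        return sorted(self.processes.values())

    def syscall(self, call):
        return self.syscalls[call]()

def interpose_global_ps(node, cluster):
    """Replace a node's "ps" entry with one that reports cluster-wide processes."""
    original = node.syscalls["ps"]                # keep the unmodified handler

    def global_ps():
        listing = original()                      # local results come from the stock call
        for peer in cluster:
            if peer is not node:
                listing += peer._local_ps()       # a real system would ask the peer over the network
        return sorted(listing)

    node.syscalls["ps"] = global_ps

a, b = Workstation("a"), Workstation("b")
a.processes[101] = "simulate"
b.processes[101] = "render"

before = a.syscall("ps")
interpose_global_ps(a, [a, b])
after = a.syscall("ps")
print(before, after)
```

The application calling "ps" does not change. Only the table entry does, which is why this style of extension can leave the vendor's operating system otherwise intact.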
"We believe that SLIC will allow us to overcome most of the limitations from building at the user-level," Krumbein says. "However, if we are forced to make some operating system changes, we will define the interface that an OS needs to support in order to be NOW-ready (or SLIC-ready, as the case may be)."
In the current implementation of NOW, all of the workstations are locked away in a single room. It is conceivable that a company might install a fiber optic Myrinet-type network to link computers throughout the enterprise into a NOW. A balance must be struck between running NOW processes and appeasing real-time users, however.
Amin Vahdat, a NOW researcher, concedes that they still have a lot of work to do in the area of negotiating with real-time users for NOW since today, process migration performance would not be acceptable to most users. "There are a lot of technical hurdles when you move from low-intensity to high-intensity processes," he says, "but solving such problems is a high priority."
The dawn of cheap power
Computer scientists pursuing the high-performance grail have traditionally followed two paths -- faster processors and scalable computers. Faster processors are at the heart of most successful supercomputer companies, but as they try to put more power into high-end boxes, the costs tend to go up almost exponentially.
The other approach is connecting many processors into a parallel supercomputer. Much of the commercial work in this area focused on integrating custom chips into a single box. However, this approach has not worked in the commercial marketplace because the cost of developing these boxes far exceeds what the market can bear. Companies such as Thinking Machines or Encore either went bankrupt or changed their fundamental direction of business. Thinking Machines, for example, has gone from being a hardware vendor to a parallel software vendor.
One of the most attractive aspects of a NOW is that it can follow the price curve of commodity hardware. As technology is deployed in workstations, it can be deployed in a NOW for far less than it would cost to develop a new chip from scratch. One example of this can be found in the implementation of the NOW project. "We just recently got four eight-way multiprocessors from Sun," Krumbein says. "GLUnix (and probably xFS and AM) can be brought up on the new machines in a matter of hours. The old massively-parallel-processor approach of reinventing everything would have required months to integrate a new class of hardware."
An interesting illustration of the potential savings of NOW can be seen by comparing the hardware costs for Inktomi (Berkeley, CA), a search engine company that uses NOW-based technology, with those of a more traditional cluster approach such as that perfected by Digital Equipment Corp. for its AltaVista search engine. A recent Digital filing with the SEC to spin off AltaVista as a separate entity noted that, as of June 1996, equipment costs totaled $14.372 million, or $5.876 million with depreciation subtracted, for a system that could handle 30 million documents. Inktomi, on the other hand, spent only about $1.5 million on the equipment for a search engine that can index over 50 million documents, according to Kevin Brown, director of marketing at Inktomi. (See sidebar, NOW in the real world.)
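Dividing the quoted equipment costs by the quoted index sizes gives a rough cost per document. This uses the undepreciated AltaVista figure and treats "over 50 million" as exactly 50 million, so it is only an order-of-magnitude comparison and says nothing about traffic capacity or uptime.

```python
def dollars_per_document(equipment_cost, documents):
    """Equipment dollars spent per indexed document."""
    return equipment_cost / documents

altavista = dollars_per_document(14_372_000, 30_000_000)   # figures from the SEC filing
inktomi = dollars_per_document(1_500_000, 50_000_000)      # figures from Inktomi

print(f"AltaVista: ${altavista:.3f} per document")
print(f"Inktomi:   ${inktomi:.3f} per document")
print(f"Ratio: {altavista / inktomi:.0f}x")
```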
"It is hard for one company to compare another based on a Web index size," says Chuck Malkiel, a spokesman for Digital. "There is no defined source for how many pages are out there. Besides, it is free to the users so what it costs to put it together is not important to them. The question is, are users getting what they want, and the response is an overwhelming yes. We have a powerful system with 100 percent uptime and have built in expandability to handle more traffic as it grows."
With NOW, Brown says, "the hardware does not matter. You want commodity hardware, which means that you can ride that cost/performance curve to the optimal point. With a supercomputer, (price/performance) goes up the curve very fast. With commodity computers, you get superior performance. When they build a new chip, first they put it into commodity workstations, and then they do research on four-way or 20-way boxes. When 200 million documents come onto the Internet in a year, the big question is `Can the invention cycle for big machines keep up?'"
NOW promises to change the landscape of supercomputing forever. Instead of expensive boxes that are difficult to upgrade, NOW promises low-cost supercomputing power that scales with demand. Furthermore, it promises to put all of those idle cycles on workstations to good use. Since NOW, as an architecture, is not tied to any one operating system, it is conceivable that companies could do something useful with all of those Unix boxes, Intel-based PCs, and even Macintoshes that sit idle for hours each day.
NOW in the real world
In mid-1995, Eric Brewer, a professor of computer science at the University of California, Berkeley, and one of his top students, Paul Gauthier, built the Inktomi search engine on top of NOW and put it online for public access. The response was so positive that they decided to commercialize the technology and formed Inktomi, the company, in February 1996. They made a novel arrangement with the university to license the technology: instead of royalties, the traditional payment method, the university received stock in the new company.
The company started with just four Sun workstations and now uses 22 UltraSPARC processors in 16 physical boxes -- the Ultra 2 boxes have two processors each. Internet surfing is done by a cluster of 11 Pentium Pro workstations that are networked so they can divide the work among them, but they are not configured as a NOW.
"The Pentiums are more loosely coupled than the Suns," says marketing director Kevin Brown at Inktomi. "They are highly parallel, but within each machine we have highly multithreaded computing, and they do not have the same kind of speed requirements for the interconnect." At the moment, updating the database is done manually. Every couple of weeks, the whole Internet is resurfed so that new documents can be found and documents that are no longer valid can be eliminated from the index. The Pentiums can surf up to 10 million documents a day, so the entire index could be generated from scratch in less than a week, he says.
Once the documents are indexed, they are sent to the NOW for storage. Each processor manages roughly 2.5 million documents in its own RAID storage system. Altogether, the NOW has about a terabyte of hard disk space.
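The sidebar's figures can be cross-checked with simple arithmetic. All inputs are the numbers quoted in the article. The crawl rate is the stated upper bound, so the recrawl time is a best case.

```python
PROCESSORS = 22                   # UltraSPARC processors in the NOW
DOCS_PER_PROCESSOR = 2_500_000    # "roughly 2.5 million" each
CRAWL_RATE_PER_DAY = 10_000_000   # upper bound for the Pentium Pro crawlers
INDEX_SIZE = 50_000_000           # "over 50 million documents"

capacity = PROCESSORS * DOCS_PER_PROCESSOR
days_to_recrawl = INDEX_SIZE / CRAWL_RATE_PER_DAY

print(f"Approximate capacity: {capacity:,} documents")
print(f"Days to recrawl {INDEX_SIZE:,} documents: {days_to_recrawl:.0f}")
```

The results, about 55 million documents and five days, are consistent with the index size and the "less than a week" estimate given in the text.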
Inktomi plans to make money by selling ads on its search engine and by licensing its technology to other companies. For example, OzEmail, the largest service provider in Australia, is licensing the search engine for use in Australia, New Zealand, and India. Inktomi will provide a ready-made index and the technology for crawling and indexing new documents. The driving force for building these mirrors is the need to overcome barriers in culture and bandwidth bottlenecks between the different continents.
But Inktomi does not plan to stay anchored in the search engine business forever as there are a lot of other applications that could take advantage of low cost supercomputers. "A search engine is an interesting showcase of how this technology can be used," Brown says. "We view it as a proof of concept that this technology can scale up. Our mission statement is that we are anything but a search engine company. We are a high-performance, network-computing products and services company."
Early 1994: NOW 0 was built from four Hewlett-Packard workstations, using FDDI as the communications backbone.
Mid 1995: NOW 1 was built from 50 SPARCstation 10s and 20s and tied together by Myrinet. NOW 1 is currently on the Internet and available for testing by researchers outside of UC Berkeley.
Present: NOW 2 is currently being built from 100 UltraSPARC workstations. NOW 1 might be cannibalized to bring NOW 2 to 150 workstations, according to Cedric Krumbein, a researcher working on the project.
About the author
George Lawton (firstname.lastname@example.org) is a computer and telecommunications consultant based in Brisbane, CA. You can visit his home page at http://www.best.com/~glawton/glawton/. Reach George at email@example.com.