Clustering part three: on the node
In this month's addition to his series on clustering, Rawn explores the strengths and weaknesses of three popular node-clustering systems
The ability to combine several independent computer systems in a cluster brings high availability and reliability to your application-server environment. This overview of three popular node-clustering systems shows how you can combine several smaller systems to make one powerful server. (3,200 words)
Node clustering is a method for joining several computers together through a physical and system software-level interconnect. The resulting connection is not from CPU to CPU, nor is it as simple as two computers sitting together on the same network. Rather, at some level of the operating system, two or more independent computer systems are integrated into a single virtual computer, allowing them to share workloads, a user database, drive volumes, a security infrastructure, and a system-management infrastructure.
This form of clustering offers several advantages over networks and multiprocessor systems. Node clustering can do the same work as expensive heavy-duty multiprocessors at a fraction of the cost. It also offers the simplicity of a single computer and forms the basis of a high-availability server. If one node goes down, the others are still available to support its users.
Node clustering has been around since mainframe days, with the first successful clustering system coming from Digital in the form of VAXcluster. VAXcluster incorporated a number of separate VAX nodes physically connected in a star pattern to a central device known as a "cluster interconnect." At the operating system level, each node had its own system identity, drives, and peripherals. The user could log in to any of the nodes and have access to the same user environment, files, and directories.
The clustering capabilities of the VAX made it one of the most successful minicomputers in computer history, and clustering soon became a feature of most of the Digital VAX line.
Subsequently, many other Unix vendors attempted to build their own clustering systems. With the base Unix system, however, there was no automatic way to create users such that the information could be spread across several nodes. Each Unix system contained its own set of user accounts, system monitoring tools, and drives. At one point, not even the filesystem could be shared across a network, making it nearly impossible to create a Unix cluster.
To address this, system software -- such as Sun Microsystems' NIS (Network Information System) and NIS+ -- was developed to create a shared user, group, and network host directory that individual workstations and nodes could search. NFS (Sun's venerable Network File System) also came along around the same time, creating a networkable filesystem that multiple Unix hosts could share.
NFS made the age of storage clustering a reality. With the ability to share drives, servers could communicate and distribute information, making node clustering possible.
Today, clustering is common in Unix shops, primarily because many companies have found it cost-effective to purchase a secondary server to create a cluster, rather than suffering losses because of system downtime.
To understand node-level clustering, we'll take a look at three popular products: the Sun Enterprise Cluster system, the Microsoft Cluster Server system, and the Citrix WinFrame server farm.
Working with Sun's SEC
The Sun Enterprise Cluster (SEC) is based on a common set of system services and interconnects between two or more Sun Enterprise servers. It's an "asymmetric" cluster system, meaning that the member nodes in the cluster don't have to be identical at the hardware level. You can, for example, have an Enterprise 6000 in the same cluster as two Enterprise 2s. (Practically speaking, however, it makes sense to keep your hardware close in processing strength so that users don't suffer a huge drop in performance should the main node go down. Still, an Enterprise 2 backup server is better than nothing at all.)
Although the hard limit on the total number of nodes in an SEC cluster is 256, the practical limit depends on the types of servers and interconnects you have.
Each cluster is formed according to a topology of server and disk connectivity. Depending on the topology of your cluster (the number of interconnections between servers, between servers and disks, and to the public network), the cost can vary significantly. In general, the more interconnections you have, the greater the reliability, availability, and cost. However, while the cluster's processing capacity increases with each additional node, as in the SMP model, eventually even this gain can taper off.
There are several topologies for clustering servers, differentiated primarily by the level of interconnectivity between nodes. In all cases, there is a set of interconnects between all the nodes. In addition, each node may have a set of drives connected just to itself or to several other nodes. The following are the five standard configurations:
Choosing between the various configurations is usually a matter of how many interconnections are needed and how much money you want to spend. The Shared-Nothing cluster is, as mentioned, the cheapest of the lot, simply connecting all the nodes together. Some topologies may be cost-effective with a 2 to 4-node cluster, but far too expensive or complex with more nodes, as is the case with the Shared-Everything cluster.
SEC achieves reliability, workload sharing, and convenience by giving the cluster a unified network identity, or "global networking," as Sun calls it. The network services for each node exist as before, but the user sees the cluster as a single entity with a single IP address and host name. This logical network addressing allows users to connect to a virtual IP host by the cluster name and be redirected to the node in the cluster that has the smallest workload. In addition, should that node go down, another member of the cluster can fall into place without losing the connection. With cluster-management software, you can also direct clients requesting a particular type of application to a predetermined node. This allows you to segment application processing within your cluster as necessary.
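The routing behavior just described can be sketched in a few lines of Python. This is only an illustration of the idea, not Sun's implementation: the Node and VirtualHost classes, the load figure, and the pin method are all hypothetical names invented for the example, and the sketch makes no attempt to preserve an open connection during failover.

```python
from dataclasses import dataclass


@dataclass
class Node:
    name: str
    load: float       # hypothetical workload figure; lower means less busy
    up: bool = True   # False once the node has failed


class VirtualHost:
    """A single cluster name in front of several nodes (illustrative only)."""

    def __init__(self, cluster_name, nodes):
        self.cluster_name = cluster_name
        self.nodes = {n.name: n for n in nodes}
        self.pinned = {}  # application type -> preferred node name

    def pin(self, app, node_name):
        """Send requests for one application type to a predetermined node."""
        self.pinned[app] = node_name

    def route(self, app=None):
        """Pick the node that should serve a new request, or None if all are down."""
        preferred = self.nodes.get(self.pinned.get(app))
        if preferred is not None and preferred.up:
            return preferred
        # No usable pin: fall back to the least-loaded node that is still up.
        live = [n for n in self.nodes.values() if n.up]
        return min(live, key=lambda n: n.load) if live else None
```

A client that asks for the cluster name gets whichever node route() returns. When a node is marked down, the next call simply skips it, which is the failover half of the picture.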
This combined single system image view of the cluster extends down to the process and logical device level. For example, the Unix process identifier (pid), a number signifying a specific running process on the system, is unique across all members of the cluster, as are terminal device identifiers (the Unix tty), and directory structure nodes (the Unix vnode).
This means that applications don't need to be modified in any way to work across a cluster. A single Web server application image may have individual HTTP session processes running across all the nodes, distributing the load as evenly as possible without having to develop a special "cluster-enabled" version of the Web server software. Finally, the administrator can also manually migrate a running process from one node to another for maintenance purposes. This allows the sysadmin to move users who are actively executing an application to another node without any service interruption, then fix the problem with the original node.
At the highest point of the Sun server evolution is the Enterprise 10000. This 64-processor-capable system even allows you to separate groups of processors into separate virtual machines or domains. Within a single hardware server it's possible to run several separate systems (just like a mainframe) which can, incidentally, also be clustered together. The E10000 focuses on redundancy and strict reliability and manageability. And, as you may expect, it comes with a hefty price tag starting just under a million for a standard configuration.
The Microsoft Cluster Server
Although MS has consistently claimed that the Windows NT operating system is scalable, it has only now begun to demonstrate clustering technology -- one of the hallmarks of a scalable enterprise server.
Building a cluster of NT systems is crucial in creating a high-availability operating system for the enterprise. As a result, the Microsoft Cluster Server (MSCS) is in the works.
MSCS, however, is limited in comparison to the Sun Enterprise Cluster, particularly if you consider that the node limit is just two and the largest single-node SMP server is a 14-way AlphaServer system. MSCS is, nevertheless, an important step in creating a reliable platform for NT. By itself, NT already has journaling filesystems (NTFS and DFS) that can be shared across a network, much as NFS is for Unix disk clusters.
Lest I sound like a pro-Solaris bigot, the table below illustrates the differences between the two systems.
|Clustering abilities of SEC and MSCS|
|Feature||SEC||MSCS|
|Theoretical node maximum||256||More than 2|
|Practical node maximum||Untested||2|
|Shared disk systems||Yes, several topologies||With 3rd-party packages|
|Mirrored disk systems||Yes, several topologies||Not until NT 5.0 or with 3rd-party packages|
|Automated failover when a node goes down||Yes||Yes|
|Automated failback after a node is reset||Yes||Yes|
|Manual process migration||Yes||No|
|Unified network identity||Yes||No|
|Centralized administrator console||Yes||Yes|
|Parallel database systems||Yes, Oracle, Informix||Not yet, Tandem SQL/MX, planned MS SQL Server 7.0 (mid-1999)|
|Distributed lock manager||Yes||Not yet|
|High-speed physical node interconnect||1 Gbps Scalable Coherent Interface (SCI)||100 Mbps fast Ethernet|
|Disk interconnects||SCSI-II, Fast/Wide SCSI, UltraSCSI, Fibre Channel arbitrated loop||SCSI-II, Fast/Wide SCSI, UltraSCSI, Fibre Channel interface|
|Application cluster awareness||Unnecessary||Recommended|
The limitations of the NT clustering system are directly related to its operating system and application model. For the most part, Windows applications are designed for single users. Furthermore, the graphical interface for these applications ties them strongly to the operating system. In NT, you can't separate the graphics from the operating system itself without severely limiting its multiuser capabilities. Even the new Windows Terminal Server Edition is essentially a hack to display multiple desktops. Because graphics processing takes place on the server, the graphics load grows in proportion to the number of users, when the server should be doing data processing and leaving the graphical processing to client stations. Thus, the design of Windows applications for the single user can obstruct the system in a multiuser situation.
Although MSCS can actually have more than two server nodes, Microsoft has set a limit of two machines per cluster until it can refine the details of the clustering algorithms. This is almost comical, considering that Microsoft is working hard with Digital and Tandem, two pioneers of cluster-based computing.
Finally, although NT does support connections faster than 100 Mbps, it's well known that NT doesn't handle a demanding network connection well, which may ultimately prevent an efficient cluster regardless of the software developed.
The release of MSCS has been slowed by the work on NT 5.0. The plan is to include MSCS as part of the NT Enterprise server system in versions 4.0 and 5.0. For the time being, however, it remains in the testing phase.
Until the release of MSCS, WinFrame is the clustering software of choice for the NT operating system. WinFrame is a multiuser NT system based on NT 3.51. With over half a million client access seats deployed across the world, it's one of the leaders in multiuser NT systems, even over Microsoft's own NT Terminal Server edition.
WinFrame was the first implementation of a multiuser NT system. With the growing population of NT servers, such server-farming capability became quite popular as a means to reduce server management headaches. Microsoft liked the idea so much that it licensed the technology from Citrix and created what you see in Windows NT Terminal Server Edition.
Despite the limitations of the NT OS, Citrix has managed to create a multiuser system with some load-balancing capabilities.
WinFrame and MetaFrame (a load-balancing option for NT TSE) both allow users to create a cluster of independent processors that can distribute application loads.
Although this isn't true node clustering (because the individual nodes aren't tightly interconnected at the system level), it provides an interesting direction in load balancing based on OS performance parameters. Based on statistics such as CPU load, page swapping, memory usage, and network interface usage, WinFrame allows the nodes to automatically redirect new login sessions across a set of nodes. It can handle failover from the primary login server to any of the other nodes automatically. However, it doesn't provide disk sharing, or process or application migration at any level other than logins, which leaves it short of becoming a true node cluster.
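To make the idea of statistics-driven login balancing concrete, here is a minimal Python sketch. It is not Citrix's algorithm: the weights, the 0-to-1 scaling of each statistic, and the function names load_score and choose_login_node are assumptions made purely for illustration.

```python
# Hypothetical weights for the four statistics named above. Each statistic is
# assumed to be pre-scaled to the range 0.0 (idle) to 1.0 (saturated).
DEFAULT_WEIGHTS = {"cpu": 0.4, "paging": 0.2, "memory": 0.2, "network": 0.2}


def load_score(stats, weights=DEFAULT_WEIGHTS):
    """Collapse one node's statistics into a single number; lower is better."""
    return sum(weights[key] * stats[key] for key in weights)


def choose_login_node(farm, weights=DEFAULT_WEIGHTS):
    """Return the name of the node that should take the next login session.

    farm maps node name -> stats dict, or None for a node that is down.
    Returns None when no node in the farm is available.
    """
    candidates = {name: stats for name, stats in farm.items() if stats is not None}
    if not candidates:
        return None
    return min(candidates, key=lambda name: load_score(candidates[name], weights))
```

Only new logins are steered this way. A session already running on a busy node stays where it is, which is exactly the limitation that keeps this approach short of true node clustering.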
The need for nodes
Node-level clustering is a hallmark of scalable enterprise computing. It requires system services that can be separated from applications without disruption or rewrite and a user environment that isn't dependent on specific hardware or operating system interfaces. Node clustering almost requires that you have high-bandwidth system and network buses and an OS that can push data effectively. Although it can provide high availability and scalability, it isn't entirely foolproof. There's always a chance of multiple failures in your system at the same time, which could bring down the entire cluster.
Additionally, there can still be bottlenecks for node interconnects which can affect the entire system. Even with 1-Gbps SCI connections, a node interconnect can be overwhelmed by data throughput. But the only time I've seen that happen is in a Geographical Information System pushing gigabytes of geological and physical geography data between the servers for real-time visualization.
Perhaps the greatest disadvantage of node clusters is that they are expensive. The amount you spend, however, is proportional to the level of availability and reliability of your system. The limitations of a single-system server are becoming apparent to administrators, despite the wide range of multiprocessor systems coming to market. Most SMP systems are still a single machine and thus fallible. Node clusters can be a cheaper alternative to powerful SMP systems while achieving a better level of reliability.
For the time being, node clustering is common, but by no means ubiquitous. It's not just a price issue, either. There are misperceptions created by the server-vendor market, as well as incorrect assumptions made by sysadmins. Often server-hardware vendors don't want the customer to think a base/nonclustered system is unreliable by itself. There's also a misconception that any clustering solution absolutely guarantees no downtime whatsoever. So, rather than accept a 99.99 percent reliable solution, IT managers sometimes dismiss clusters altogether, especially if they've experienced a cluster failure.
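A little arithmetic shows what a figure like 99.99 percent means in practice. The Python sketch below rests on two simplifying assumptions that real clusters only approximate: node failures are independent of one another, and failover is instantaneous and always succeeds. The function names are made up for this example.

```python
MINUTES_PER_YEAR = 365 * 24 * 60  # 525,600 minutes in a non-leap year


def downtime_minutes_per_year(availability):
    """Expected minutes of downtime per year for an availability such as 0.9999."""
    return (1.0 - availability) * MINUTES_PER_YEAR


def cluster_availability(node_availability, nodes):
    """Chance that at least one of several identical nodes is up.

    Assumes independent failures and perfect failover, so the cluster is
    down only when every node is down at the same moment.
    """
    return 1.0 - (1.0 - node_availability) ** nodes
```

Under these assumptions, 99.99 percent availability still leaves roughly 52.6 minutes of downtime a year, and two nodes that are each available 99 percent of the time combine to about 99.99 percent. Simultaneous failures are rare, but the number never reaches 100.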
Despite these misconceptions, clustering remains a powerful tool for enhancing your machines' combined performance and can allow you to make the most of your resources. Once you have a cluster of machines, you can further expand into network-level clustering and build a more evenly balanced application-server environment. Network clustering, using round-robin network addresses and load-aware routers, brings a new level of load distribution at a much cheaper price than building expensive node clusters. That's our topic next month.
About the author
Rawn Shah is an independent consultant based in Tucson, AZ. He has written for years on the topic of Unix-to-PC connectivity and has watched many of today's existing systems come into being. He has worked as a system and network administrator in heterogeneous computing environments since 1990. Reach Rawn at email@example.com.