
How to construct your cluster configuration, Part 2: Plugging in

Notes on implementation and testing before you go live

August 1999

Abstract

Last month, Peter described the ins and outs of clustering -- in theory. This month that theory meets reality, as he describes the implementation details and the status of his Sun Enterprise 10000 cluster rollout. (2,400 words)


Peter Baer Galvin
Objective: To find a challenging and rewarding job that does not involve clustering computers...

I finished last month's column hours before the rollout of the new Sun cluster on which I was working. Fortunately, the rollout was a success. It was not without its challenges, however. First up this month are some details of the implementation and testing; then come the implementation results. Of course, plenty of those aforementioned challenges are detailed here as well.

Implementing the shared storage
Architecting a supported and well-performing Sun Cluster 2.X facility is still a challenge. For starters, let's look at the issue of storage. SCSI has never been a happy participant in multiple-host scenarios. It was designed to allow one host to talk to a string of devices, or multiple strings of devices. It has now been folded, spindled, and mutilated to the point that it -- grudgingly -- allows two hosts to talk to a given device string. Each host is an initiator of SCSI commands on the SCSI bus. Attaching two hosts to a SCSI chain is called multi-initiator SCSI -- and it is ugly.

Sun only supports multi-initiator SCSI in Sun Cluster configurations. That means that you can't take two Sun servers and share a SCSI disk storage unit between them unless you're running the Sun Cluster software -- if you're still planning on getting support from Sun, anyway. Veritas is a little more forgiving, supporting the use of Volume Manager to manage a SCSI shared storage device. Of course, if there is a problem with your facility, Sun may refuse to provide support and send you crying to Veritas.

Another difficulty comes from the SCSI numbering convention. Every SCSI device must have an ID number unique on its chain, including the SCSI controller itself. By default, every Sun host sets the SCSI ID of its controller cards to 7. 0 through 6 are allowed for the devices on a nondifferential chain. For differential SCSI, 0 through 6 and 8 through 15 are allowed. Now, consider two SCSI hosts with various SCSI devices sharing a Sun UltraSCSI storage device (such as an A1000 or A3500). Both hosts will try to use ID 7 -- leaving two devices (the hosts) on the shared chain with the same SCSI ID. This violates the SCSI protocol and will cause all sorts of nasty problems, SCSI resets and the failure of systems to boot among them.

Figure 1 shows just such a situation. If Host A and Host B were booted in the configuration shown in the figure, they would likely hang. The problem is that, on the shared storage SCSI bus, there are two SCSI devices with ID 7 (the two hosts). To fix the problem, a free SCSI target ID must be found on one of the hosts. For instance, Host B has IDs 4, 5, 6, and 7 in use. Therefore, the ID of the SCSI controllers (the SCSI initiator ID) on that host can be set to one of the free numbers. This change would avoid the collision of SCSI ID 7 on the bus that drives the shared disk chain.

Figure 1. Problem with shared storage
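
Once the SCSI map is written down, the collision in Figure 1 can be checked mechanically. The Python sketch below is only an illustration: the bus contents are made-up stand-ins for the figure, and find_id_collisions is a hypothetical helper, not part of any Sun tool.

```python
from collections import Counter


def find_id_collisions(bus):
    """Return the SCSI IDs claimed by more than one device on a single bus.

    `bus` is a list of (device_name, scsi_id) pairs and must include each
    host's own controller (the initiator).
    """
    counts = Counter(scsi_id for _name, scsi_id in bus)
    return sorted(scsi_id for scsi_id, n in counts.items() if n > 1)


# Hypothetical shared chain: both hosts still use the default initiator ID 7.
shared_bus = [
    ("host-a-controller", 7),
    ("host-b-controller", 7),
    ("shared-disk-0", 0),
    ("shared-disk-1", 1),
]

print(find_id_collisions(shared_bus))  # [7]
```

An empty list means that every device on the chain, hosts included, has a unique ID.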

One problem is that you need to create a SCSI map like this to make the proper choice of SCSI IDs. Unfortunately, it is a manual process. Because Sun machines set all SCSI cards in the host to the same initiator ID, one host has to find an unused SCSI ID on all of its SCSI chains. This is accomplished by careful use of the probe-scsi-all OpenBoot PROM command. Be careful here, because that command can hang the system if it is not executed after a clean reset. The steps to accomplish such a reset are:

At the OK prompt, use setenv auto-boot? false to disable automatic booting after a reset
reset to clear the system
probe-scsi-all to get a list of all target IDs
setenv auto-boot? true to reenable automatic booting after resets

With that accomplished, connect the to-be-shared storage to one of the hosts in question. Then, perform the probe-scsi-all magic and search the resulting SCSI ID lists for an unused ID, on all SCSI buses, under 8. Even though differential SCSI allows IDs greater than 7, only the original nondifferential numbers work for this procedure. At this point, you have a SCSI address under 8 that is unused within the system. The final step is to tell the system to use that number as the SCSI ID for its own SCSI controllers. This is accomplished by setting the EEPROM variable scsi-initiator-id to that value. The value is only read at system reset time, so follow the setting of this value with a reset. When the system starts booting, it will display a message for each SCSI controller indicating that the controller's ID has been set to the value of that variable. Finally, the shared storage unit can safely be attached to both systems, and the requisite boot -r will find the device.
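
The search for an unused ID is easy to get wrong by eye when a host has several buses, so it can help to write the probe results down and compute the answer. This is a minimal Python sketch, assuming you have transcribed the target IDs from probe-scsi-all by hand; the bus names and ID sets are hypothetical, chosen to match the Host B example (IDs 4, 5, 6, and 7 in use), and free_initiator_ids is an illustrative helper rather than a Sun utility.

```python
def free_initiator_ids(buses, max_id=7):
    """Return the IDs in 0..max_id that are unused on every bus of the host.

    `buses` maps a bus name to the set of target IDs that probe-scsi-all
    reported on that bus. ID 7 is always treated as taken, because the
    other host keeps the default initiator ID on the shared chain.
    """
    used = {7}
    for target_ids in buses.values():
        used |= set(target_ids)
    return [scsi_id for scsi_id in range(max_id + 1) if scsi_id not in used]


# Hypothetical probe results for Host B, transcribed by hand.
host_b_buses = {
    "internal": {4, 5, 6},
    "shared-array": {4, 5},
}

print(free_initiator_ids(host_b_buses))  # [0, 1, 2, 3]
```

Any ID in the returned list is a candidate for scsi-initiator-id. If the list is empty, some device has to be renumbered before the storage can be shared.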

This resolution of SCSI target address conflicts is only slightly more pleasant than driving into Boston during rush hour. Fortunately, the future looks bright. For instance, fibre channel and fibre channel arbitrated loop (FCAL) do not suffer the embarrassment of requiring preset unique addresses. They allow more than one host to attach to the fibre to communicate with the storage. They also allow for hot-attach and -detach of devices to the loops.

FCAL from Sun is currently limited to two hosts per loop. Why, then, if you need to create a shared storage solution, would you ever use SCSI storage devices? The simple answer is that Sun's FCAL solutions are appropriate in some, but not all, circumstances. They are very fast for sequential I/O, but not for random I/O. Random writes are especially problematic because there is no nonvolatile cache in which to hold the writes. Instead, they must be transferred all the way to the disk before the write is considered complete. Why doesn't Sun include a cache and controller on its FCAL devices? The unofficial explanation is that there isn't (currently) a controller that can keep up with FCAL's throughput performance. Rather than slow down the whole device, Sun keeps SCSI devices with caches for random write duty, and FCAL without cache for sequential I/O duty.

More information about Sun's storage offerings can be found in my January 1999 Pete's Super Systems column.

Once the proper Sun cluster facility is architected and servers purchased, the installation of the cluster is straightforward: you pay Sun to do it. Sun engineers install the hardware and software, including all of the heartbeat bits and the cluster monitor.

Data transfer from old facility to new
Of course, before this facility can move to production, several other steps are needed. If the system is replacing an existing one, then current data must be moved to the new system. In the instance of the dual E10000 cluster, this phase took the most discussion, debate, planning, testing, and implementation time. The details of how to move production data from one facility to another may be worthy of coverage in a future column. (If you are interested in this topic, drop me a line.)


Testing the finished product
Frequently, testing reveals things for which planning failed to account. The more complex the project, the more important and thorough the test should be. The theory is sometimes easier than the fact. In this project, schedules were very tight (due to Y2K pressures and Sun ship dates). Another week or two of testing would have made everyone more comfortable.

The tests covered the following areas:

Component test: SunVTS tests on each hardware component were run during weekends, nights, and other idle times. These tests were sometimes run concurrently with other tests to increase the stress on the system.
Application test: Is the application working properly, especially when under load?
Y2K test: Roll the clock forward during operation to determine the system and application reaction to next year (and beyond, in some cases).
Failover test: Manual creation of failure conditions tests the cluster's ability to recover from the failures, and provides recovery experience for staff members as well. These tests can include moving application instances between servers, removing heartbeat and public network connections, pulling hot-plug disks, and disconnecting disk controller cables; any of the above can of course be combined as well. (Be sure to coordinate these tests with Sun's support engineers!)
Procedure test: While it is not possible to test all aspects of the operating system or applications (Oracle in my case), all major procedures should be tested, and, preferably, all minor ones as well. For instance, you'll want to test backups and restores of data and representative database operations, and also check full database loads and complex queries against the data.
Combination tests: The most complicated tests are procedure and component tests combined with failover tests. They are complicated because just determining the theoretically correct result of the test can be challenging -- should the cluster fail? Should just one node reboot? Should the system just have sent recoverable error messages? Determining the real result is easier (did the machine fall over or not?), but not trivial. Resetting the machine to its properly functioning production state can also take some effort.
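
Because the hard part of a combination test is deciding the correct result ahead of time, it can help to enumerate the combinations and record an expected outcome for each one before anything is unplugged. The Python sketch below shows one way to lay out such a matrix; the test names are hypothetical examples drawn from the list above, not the actual plan used on this project.

```python
from itertools import product

# Hypothetical test names; substitute the procedures and failures from your own plan.
procedures = ["backup", "restore", "full database load", "complex query"]
failures = ["pull hot-plug disk", "remove heartbeat link", "disconnect controller cable"]


def combination_tests(procedures, failures):
    """Pair every procedure with every injected failure.

    Each entry carries empty 'expected' and 'observed' slots to be filled in
    before and after the test run, respectively.
    """
    return [
        {"procedure": p, "failure": f, "expected": None, "observed": None}
        for p, f in product(procedures, failures)
    ]


plan = combination_tests(procedures, failures)
print(len(plan))  # 12
```

Filling in the expected column first (cluster failover, single node reboot, or recoverable error messages only) turns each run into a comparison rather than a debate.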

Our testing revealed some issues that needed to be resolved before the facility could move to production. For instance, a single disk failure in the A3500 caused the controller (and sometimes the pair) to reset or hang. Obviously, that should not be the case. The problem was traced to a lack of grounding on the A3500s. When electrical devices are attached in sequence (say, when a disk cabinet is attached to two separate systems) grounding grows in importance. Once properly grounded, the disk arrays behaved as expected.

Another problem resulted in Oracle crashing under load. Again, this doesn't have a pleasant result. In this case the problem was a bug in some versions of the kernel jumbo patch that only affected E10000s. A couple of core dumps dispatched to Sun and Oracle resulted in Sun generating a patch to solve the problem.

When planning and scheduling testing, it's important to leave yourself time to repeat those tests. In our case, once the two problems described above were found and fixed, another full round of testing was required to be sure that they were really fixed, and that there weren't any other problems lurking behind them. For instance, there could have been another Oracle bug that was never seen because Oracle crashed before the second bug was triggered.

Final cluster thoughts
System availability is measured along a continuum. At the "less available" end, there are single systems with no redundant components and no recoverability features (PCs running lesser operating systems, for instance). At the other end are machines like the Tandems with true fault tolerance (every instruction is executed on multiple CPUs, and the results compared to assure proper execution) and full, remote disaster recovery sites with live data replication. Most facilities fall in between those extremes. The cheapest insurance for Suns is the use of RAID storage and enterprise-class servers with N+1 power and cooling modules and hot-plug/hot-swap system and I/O boards. Beyond that, clusters add an extra level of insurance. Current cluster technology works well but requires care to architect, implement, test, and roll out to production.

Once a cluster is running, the work continues, in the form of monitoring and maintenance. Even a cluster can fail if single component failures go unnoticed and uncorrected.

One final thought: it's cool to call into Sun service with an E10000 serial number. It gets the staff's attention. It's even more fun to place a service call on an E10000 cluster. They are automatically priority one, and seemingly everyone within the area code of your cluster gets paged!

Credit where credit is due
The dual E10000 project was successful; it met its schedule and has so far matched facility reliability goals. The great team at the client site was responsible for the vast majority of the testing and implementation of the facility. High-pressure, short-deadline, big-dollar projects with major company impact only succeed with talented people playing major roles. In this case Rick, Doug, Mark, Tareq, Debbie, Nelsy, Feng, and Javier -- along with the Sun team -- were the keys to its success. Oh yes, and Liz!

Next month
Many systems administrators arrive to work in the morning (or early afternoon), leave later in the day, and realize that they "got nothing done." Fire fighting is a hazard of the profession, but there are general and specific ways to move out of fire-fighting mode and regain control of your schedule and to-do list. Next month Pete's Super Systems discusses such potential conflicts and tries to provide some solutions.


Resources

"The return of the cluster: How to construct your cluster configuration, Part 1," Peter Baer Galvin (SunWorld, July 1999):
http://www.sunworld.com/swol-07-1999/swol-07-supersys.html
"Data depots: Managing data storage," Peter Baer Galvin (SunWorld, January 1999):
http://www.sunworld.com/swol-01-1999/swol-01-supersys.html
Sun Cluster home page:
http://www.sun.com/ha/
Veritas Volume Manager:
http://www.veritas.com/products/volman/
Oracle Parallel Server:
http://www.oracle.com/database/options/parallel.html
Rawn Shah's series of Connectivity columns on clustering (SunWorld, August-November 1998):
http://www.sunworld.com/sunworldonline/common/swol-backissues-columns.html#connectivity
Other clustering-related articles listed in the SunWorld Topical Index:
http://www.sunworld.com/sunworldonline/common/swol-siteindex.html#clustering
"Architecting high availability solutions," Sam Wong (SunWorld, November 1998):
http://www.sunworld.com/swol-11-1998/swol-11-itarchitect.html
"Create a highly available environment for your mission-critical applications," Evan Marks (SunWorld, August 1997):
http://www.sunworld.com/swol-08-1997/swol-08-ha.html
Last month's Pete's Super Systems column -- send in your best sysadmin tips:
http://www.sunworld.com/swol-06-1999/swol-06-supersys.html
Full listing of previous Pete's Super Systems columns:
http://www.sunworld.com/sunworldonline/common/swol-backissues-columns.html#supersys
Hal Stern's archived SysAdmin columns:
http://www.sunworld.com/sunworldonline/common/swol-backissues-columns.html#sysadmin
Peter's Solaris Security FAQ (recently updated!):
http://www.sunworld.com/sunworldonline/common/security-faq.html
Peter's Unix Secure Programming FAQ:
http://www.sunworld.com/swol-08-1998/swol-08-security.html


About the author
Peter Baer Galvin is the chief technologist for Corporate Technologies, a systems integrator and VAR. Before that, Peter was the systems manager for Brown University's Computer Science Department. He has written articles for Byte and other magazines, and previously wrote Pete's Wicked World, the security column for SunWorld. Peter is coauthor of the Operating System Concepts textbook. As a consultant and trainer, Peter has taught tutorials on security and system administration and given talks at many conferences and institutions.


URL: http://www.sunworld.com/swol-08-1999/swol-08-supersys.html