Create a highly available environment for your mission-critical applications

We explain what highly available means and how to plan for, choose, and implement HA solutions

By Evan Marks

August 1997

Abstract

Tired of being paged in the middle of the night? Can't take a vacation because your mission-critical applications might not run in your absence? If so, it is time to stop worrying and to start looking into creating a highly available environment. This article will look at when and how to implement HA solutions. (2,500 words)

You've bought your shiny new Enterprise servers with their extremely high levels of RAS (reliability, availability, and serviceability). You have been given impressive MTBF numbers and an assurance from your SunService sales rep that your systems will never fail. Now, you have Internet applications that are required to be up twenty-four hours a day, seven days a week. What would happen if there were an outage? How would you keep your applications running while maintenance is performed? Have you considered that planned outages are downtime as well? It is time to start looking at a high availability (HA) solution.

Many people confuse high availability with fault tolerance. High availability will get you to 99.5 percent uptime, while fault tolerance will give you that extra half percent. The cost climbs exponentially when you try to gain that extra half percent. With proper architecture and planning, an HA solution can bring peace of mind to you and your organization without the considerably higher cost of a fault-tolerant operation.
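To put those percentages in concrete terms, the short calculation below converts an uptime figure into hours of downtime per year. It is only an arithmetic sketch: the function name and the 365-day year are assumptions made for illustration, not figures from any vendor.

```python
HOURS_PER_YEAR = 365 * 24  # assumption: a non-leap year


def downtime_hours_per_year(availability_percent: float) -> float:
    """Return the hours of downtime per year implied by an uptime percentage."""
    if not 0.0 <= availability_percent <= 100.0:
        raise ValueError("availability must be between 0 and 100 percent")
    return HOURS_PER_YEAR * (100.0 - availability_percent) / 100.0


if __name__ == "__main__":
    for pct in (99.0, 99.5, 99.9):
        hours = downtime_hours_per_year(pct)
        print(f"{pct}% uptime allows about {hours:.1f} hours of downtime per year")
```

At 99.5 percent, the budget works out to a little under two days per year, and planned maintenance counts against it.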

The mission -- plan for HA
Planning for high availability is not rocket science. Simply take a look at what you are trying to accomplish and eliminate single points of failure. Start with the power source, move to the server hardware and system software, and then finish with the application.

It still amazes me that people will implement elaborate HA clusters and yet not use appropriate power protection. A good UPS will mean the difference between a normal business day and an extended outage. Do not put all of your HA servers on the same UPS. Do not consider a system safe unless every power cord is plugged into a UPS. Test the battery on your UPS to make sure you will have enough time to shut systems down gracefully, as an underpowered UPS is almost as bad as no UPS at all.

To avoid single points of failure within your server hardware, use as much redundancy as possible. If you're using the Ultra Enterprise line of servers, you need enough internal power supplies to absorb the loss of one. Mirror your OS and swap disks. If multiple CPU boards are installed and you only have two CPUs, place one on each board. Mirror across multiple controllers. When using SPARCstorage Arrays for your storage, mirror across arrays, or if you have only one array, keep a spare optical module in each system.

For a specific piece of external hardware that is not redundant, such as a tape jukebox or modem, you should have interfaces that can accommodate those devices on the other servers. Install additional network interfaces to be employed in case of adapter failure. One can argue that if HA software is used, these redundancies are not needed. While HA software will provide proper failover within a specified time frame, the best HA solution is one where failover never happens.

You must determine if you need any of your own scripts, programs, or third-party utilities during a failover. These should either be located on shared disk, copied to all HA servers, or mirrored to the other servers. For products that require license keys, ask your vendors for a key that would be used only in a case of a failure. For a tape jukebox and backup software, be sure you can use any of the servers in the HA configuration as the backup server.

With regard to your applications, decide which ones are mission critical and should be part of the HA environment. Determine their system requirements. If the worst possible calamity happened and everything failed over to a single server, could each server handle the full load on its own? Nothing should be hard coded in the application that would preclude its operation on another server. Now that the single points of failure have been eliminated, it is time to get the hardware and software in shape for your HA solution.
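The worst-case question above (could any one server carry everything?) is easy to check on paper. Here is a minimal sketch, assuming you can express each application's needs as a CPU count and a memory figure; the class names, fields, and sizing numbers are all hypothetical and stand in for whatever capacity data you actually collect.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Service:
    name: str
    cpus_needed: int  # hypothetical sizing figure
    memory_mb: int    # hypothetical sizing figure


@dataclass(frozen=True)
class Server:
    name: str
    cpus: int
    memory_mb: int


def servers_unable_to_host_all(servers, services):
    """Return the names of servers that could not carry every service at once."""
    total_cpus = sum(s.cpus_needed for s in services)
    total_mem = sum(s.memory_mb for s in services)
    return [
        srv.name
        for srv in servers
        if srv.cpus < total_cpus or srv.memory_mb < total_mem
    ]


if __name__ == "__main__":
    services = [Service("sybase1", 2, 1024), Service("nfs", 1, 512)]
    servers = [Server("machine1", 4, 2048), Server("machine2", 2, 1024)]
    print(servers_unable_to_host_all(servers, services))
```

Any server the function names is one where a complete failover would leave you short, which is worth knowing before the failure rather than during it.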

HA configuration
Originally, HA solutions were all one-into-one configurations, requiring a server as an idle hot standby, and only performing complete system failovers. Today, there are many types of HA solutions to choose from that can provide M-into-N, symmetrical, and asymmetrical failovers.

An asymmetrical HA configuration consists of two servers, with failover occurring in one direction only. Symmetrical HA involves possible failovers in both directions. M-into-N involves M services running on N servers, with any of the services able to fail over to any of the servers. These packages all perform some sort of clustering. We will discuss system configuration for the clustering form of HA here.

Most clustering HA packages have two major facets to their operation: heartbeats and service groups. The heartbeat is the connection between all of the systems in your HA cluster. The heartbeat requires at least one private network -- although two networks are recommended. Some HA packages utilize multicasts or broadcasts to determine the members of the cluster, so it is not recommended that you place a heartbeat network on your public net, lest your corporate network evangelist club you over the head with a sniffer. Also, if you cannot use a separate hub for each private network, crossover cables are recommended for one-into-one configurations, as a single hub would constitute an unacceptable single point of failure.
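The reason for recommending two heartbeat networks can be shown in a few lines. The sketch below is not how any particular HA package works; it simply assumes that a peer is declared down only when it has been silent on every private network for longer than a timeout. The class name, network labels, and timeout are hypothetical, and timestamps are passed in explicitly so the logic can be exercised without real network traffic.

```python
class HeartbeatMonitor:
    """Track the last heartbeat seen from each peer on each private network."""

    def __init__(self, networks, timeout_seconds):
        if not networks:
            raise ValueError("at least one heartbeat network is required")
        self.networks = tuple(networks)
        self.timeout_seconds = timeout_seconds
        self._last_seen = {}  # (peer, network) -> timestamp of last heartbeat

    def record(self, peer, network, now):
        """Note that a heartbeat from peer arrived on network at time now."""
        if network not in self.networks:
            raise ValueError(f"unknown heartbeat network: {network}")
        self._last_seen[(peer, network)] = now

    def peer_is_alive(self, peer, now):
        """A peer is alive if any network has heard from it within the timeout."""
        for network in self.networks:
            seen = self._last_seen.get((peer, network))
            if seen is not None and now - seen <= self.timeout_seconds:
                return True
        return False


if __name__ == "__main__":
    monitor = HeartbeatMonitor(["private_a", "private_b"], timeout_seconds=5.0)
    monitor.record("machine2", "private_a", now=100.0)
    monitor.record("machine2", "private_b", now=108.0)
    # private_a has gone quiet, but private_b is still current.
    print(monitor.peer_is_alive("machine2", now=110.0))
```

With a single network, a failed hub or cable looks exactly like a dead server; with two, one path can fail without triggering a needless failover.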

For partial failovers, the applications are split into service groups. Each service group is a logical unit that can be failed over on its own. For example, a server that is running two instances of Sybase SQL server, one instance of Oracle, and providing NFS home directory services would have four logical service groups. Each of these groups would be assigned its own IP address. Port numbers for the database instances should be kept distinct among all instances within the cluster.
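One way to keep the addressing rules honest is to write the service groups down as data and check them. The sketch below models the four-group example from the paragraph above; the addresses and port numbers are placeholders invented for illustration, and the class and function names are hypothetical.

```python
from collections import Counter
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class ServiceGroup:
    name: str
    ip_address: str
    port: Optional[int] = None  # None when the service has no instance-specific port


def find_conflicts(groups):
    """Return (duplicate IP addresses, duplicate ports) across all service groups."""
    ip_counts = Counter(g.ip_address for g in groups)
    port_counts = Counter(g.port for g in groups if g.port is not None)
    dup_ips = sorted(ip for ip, n in ip_counts.items() if n > 1)
    dup_ports = sorted(p for p, n in port_counts.items() if n > 1)
    return dup_ips, dup_ports


if __name__ == "__main__":
    # Placeholder addresses and ports, not values from a real site.
    groups = [
        ServiceGroup("sybase1", "192.0.2.11", 4100),
        ServiceGroup("sybase2", "192.0.2.12", 4200),
        ServiceGroup("oracle1", "192.0.2.13", 1521),
        ServiceGroup("nfs_home", "192.0.2.14"),
    ]
    print(find_conflicts(groups))
```

An empty result on both sides means any group can land on any server without colliding with a group that is already there.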

All drives and volumes for each service group should be placed in their own disk groups if using Veritas Volume Manager. If you're using Solstice DiskSuite, make sure your metadevices for each service group don't share drives, and that you have enough copies of the state database (metadb) defined. If this is a small HA configuration with dual-ported SCSI disks between two systems, the SCSI IDs on the dual-ported controllers must not conflict (each host adapter on a shared bus needs its own initiator ID). The key to service groups is remembering that IP addresses need to be distinct for each service. No multiple adapters? Don't worry, Solaris supports up to 255 virtual IP addresses for each adapter. (For those who don't know, the syntax is ifconfig interface:virtual-number IP-address; e.g., ifconfig hme0:1 191.29.71.48.)
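If you have more than a couple of service addresses, it is less error-prone to generate the virtual interface commands than to type them. The sketch below builds command strings in the form shown above and numbers the virtual interfaces from 1; it only produces text and runs nothing. The function name is hypothetical, and real use would probably also need a netmask and an explicit "up", which are left out here.

```python
MAX_VIRTUAL_ADDRESSES = 255  # per-adapter limit quoted in the article


def virtual_interface_commands(adapter, addresses):
    """Build one ifconfig command string per address, numbering from :1."""
    if len(addresses) > MAX_VIRTUAL_ADDRESSES:
        raise ValueError(
            f"{adapter} cannot carry more than {MAX_VIRTUAL_ADDRESSES} virtual addresses"
        )
    if len(set(addresses)) != len(addresses):
        raise ValueError("each service needs its own distinct IP address")
    return [
        f"ifconfig {adapter}:{number} {address}"
        for number, address in enumerate(addresses, start=1)
    ]


if __name__ == "__main__":
    for command in virtual_interface_commands("hme0", ["191.29.71.48", "191.29.71.49"]):
        print(command)
```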

Why would we want to fail over a single service? There are several good examples. A controller failure on the primary system (machine one) for an application could cause a failover of that one service to the secondary system (machine two). Performance would be much better for all applications than if all services failed over. Now, hardware maintenance can be scheduled for machine one during a non-critical time and the rest of the services failed over at that time. Another example would be the testing of a new operating system. If you have an asymmetrical configuration with two machines, one production and one test, with dual-ported storage between them, the operating system could be upgraded on the test box then each service failed over individually to test how the application runs on that operating system.

Rolling your own -- Can it be done?
Now your environment is ready for HA. With the purchase of OpenVision by Veritas, there are now three major vendors of HA software for the Sun Solaris environment: Qualix, Sun, and Veritas. Each of these packages has its strengths and weaknesses. But what about rolling your own? Is it possible? Is it worth doing?

The first thing you need to determine is what type of availability is really needed. If you have mission-critical applications, buy the software. If you just want to improve uptime of non-mission-critical projects, then it is possible to create your own "HA on a shoestring."

While it is possible, it's not recommended. Buy the software if you have the budget. If you cannot buy the software, then decide whether you have the time to write and support your solution. Management should then weigh the pros and cons of in-house development of an HA solution.

Disclaimer: If you are planning to develop your own HA solution, a lab environment is highly recommended to test and verify your work. A misconfigured or misbehaving HA product can be much worse than not having an HA product at all. One major plus for purchasing an HA package is the ability to point a finger and know who to call when there is a problem.

Now that you know the risks, here are the major parts needed for your own HA solution:

The ability to start, stop, and test each service from all servers
The ability to bring physical and virtual network interfaces up and down and to add and delete virtual IP addresses
The ability to import, export, and start disk groups and volumes
The ability to monitor status of service groups and servers
The ability to send notification
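The monitoring and notification items in this list fit together naturally. Here is a minimal sketch, assuming each service already has a test you can call and that notification is just a function you supply (a pager gateway, mail, or a log file). The function name and message format are hypothetical.

```python
def check_services(tests, notify):
    """Run each service's test; notify on every failure; return the failed names."""
    failed = []
    for name, test in tests.items():
        try:
            healthy = bool(test())
            detail = "test reported failure"
        except Exception as exc:  # a test that crashes counts as a failed service
            healthy = False
            detail = f"test raised {exc!r}"
        if not healthy:
            failed.append(name)
            notify(f"{name}: {detail}")
    return failed


if __name__ == "__main__":
    # Stand-in tests; real ones would probe the application itself.
    tests = {"nfs_home": lambda: True, "oracle1": lambda: False}
    print(check_services(tests, notify=print))
```

Treating a crashed test as a failed service is a deliberate choice in this sketch: a monitor that dies quietly is worse than one that raises a false alarm.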

How much downtime and manual intervention you are willing to accept will determine how much functionality you really need for implementation. If you can accept manual failover, you can simply write a script to import the drives/disk groups, configure the interface, and start the application. Procedures can then be written for operators to run these scripts if required.

The above items can be accomplished via scripts. Most applications should already have start scripts. Developing methods to determine whether an application is up and how to bring it down (preferably by some means other than kill -9) should be straightforward. Configuring interfaces with ifconfig is also simple, but you don't want to bring up an address on one machine while it is still in use on another, so confirm that the address does not answer a ping before you configure it. The same caution applies to importing disk groups: Veritas Volume Manager makes it quite easy to import a disk group (vxdg import diskgroup) and start volumes (vxrecover -g diskgroup), and it even has a force option (-f) to ensure the group gets imported, so be sure the original server has really let go of the group first. For your HA solution to work cleanly, the group should be exported (vxdg deport diskgroup) when the application is brought down on the original server, if possible, and imported file systems should be fsck'd prior to being mounted.
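The order of operations matters more than any single command, so it helps to see a takeover written out as a plan. The sketch below only builds the list of commands and refuses to do so if the service address is still answering; it executes nothing. The function name, the device paths, and the address_in_use callable (which you would back with a ping test) are all assumptions for illustration.

```python
def build_takeover_plan(disk_group, filesystems, interface, address,
                        start_command, address_in_use):
    """Return the ordered commands for taking over one service group.

    filesystems is a list of (raw_device, block_device, mount_point) tuples.
    address_in_use is a caller-supplied callable, for example a ping wrapper.
    """
    if address_in_use(address):
        raise RuntimeError(f"{address} still answers; refusing to take over")
    plan = [
        ["vxdg", "import", disk_group],    # import the disk group
        ["vxrecover", "-g", disk_group],   # start its volumes
    ]
    for raw_device, block_device, mount_point in filesystems:
        plan.append(["fsck", raw_device])  # check before mounting
        plan.append(["mount", block_device, mount_point])
    plan.append(["ifconfig", interface, address])  # bring up the service address
    plan.append(list(start_command))               # finally, start the application
    return plan


if __name__ == "__main__":
    plan = build_takeover_plan(
        disk_group="appdg",  # hypothetical disk group name
        filesystems=[("/dev/vx/rdsk/appdg/vol01", "/dev/vx/dsk/appdg/vol01", "/app")],
        interface="hme0:1",
        address="191.29.71.48",
        start_command=["/app/bin/start"],   # hypothetical start script
        address_in_use=lambda addr: False,  # stand-in for a real ping check
    )
    for step in plan:
        print(" ".join(step))
```

Printing the plan first and running it by hand is a reasonable starting point for the manual-failover procedures described earlier; the force option is deliberately absent so that a human has to decide when it is justified.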

Probably the toughest task in rolling your own is the communication between the servers. Determining what is up, what is down, and who holds the ball for each service is not as simple as it sounds. For reliability, this should be accomplished across the private network. For each service, you need to keep track of where the service is currently running and the machine to which it should roll over in case of a failure. Now that you have seen the complexity of creating your own, it is time to look at implementing one of the third-party high availability solutions.
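Before moving on, the bookkeeping described above (where each service runs now, and where it goes next) can be sketched as a small table. This is an in-memory illustration only: it says nothing about how the servers agree on the table's contents, which is the genuinely hard part. The class and method names are hypothetical, and swapping the failed machine in as the new standby is an assumption that suits a two-node setup.

```python
class ServiceTable:
    """Record the current owner and the failover target for each service."""

    def __init__(self):
        self._entries = {}  # service -> {"current": host, "failover": host}

    def register(self, service, current, failover):
        if current == failover:
            raise ValueError("failover target must differ from the current host")
        self._entries[service] = {"current": current, "failover": failover}

    def owner(self, service):
        return self._entries[service]["current"]

    def services_on(self, host):
        return sorted(s for s, e in self._entries.items() if e["current"] == host)

    def fail_over_service(self, service):
        """Move one service to its target; the old owner becomes the new target."""
        entry = self._entries[service]
        entry["current"], entry["failover"] = entry["failover"], entry["current"]
        return entry["current"]

    def fail_over_host(self, failed_host):
        """Move every service owned by failed_host; return {service: new owner}."""
        return {
            service: self.fail_over_service(service)
            for service in self.services_on(failed_host)
        }


if __name__ == "__main__":
    table = ServiceTable()
    table.register("sybase1", current="machine1", failover="machine2")
    table.register("nfs_home", current="machine2", failover="machine1")
    print(table.fail_over_host("machine1"))
    print(table.services_on("machine2"))
```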

Choosing third-party HA
When choosing a third-party HA vendor, you should consider all the options prior to making a decision. Each vendor's product has pluses and minuses. Here is a list of questions that should be asked of each HA vendor.

Is the installation simple? Can a cluster be installed in hours, days, or weeks?
Are the test intervals customizable?
Does the solution provide for symmetric, asymmetric, or N node cluster capabilities?
Are service groups supported, or is it system-level failover only?
Does the solution support the servers and OS versions you are currently using or plan to use in the future?
Are kernel modifications required to implement the solution?
What types of disks and RAID software are supported?
Can the solution be customized for in-house applications?
What type of modules or agents are available (Sybase/Oracle/NFS/HTTP)?
Is root access required for all access to the applications?
How is planned application downtime handled?

After these questions are asked, an intelligent, informed decision can be made on which vendor to choose.

The implementation
You have chosen a vendor and can now proceed to implement the solution. Remember the old adage K.I.S.S. (keep it simple, stupid). Try to keep your configuration as understandable as possible. The more complex the solution, the greater the chance for error. If you hire a consultant to implement your solution, be sure internal employees also understand exactly how and why things are set up. Establish procedures for anyone needing to interact with the HA system, so that mistakes can be kept to a minimum. Look for complementary products to the HA package, like paging software. Lastly, test all permutations of failovers on a regular basis. An untested HA solution is probably a non-working HA solution.
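"All permutations" grows quickly, so it is worth generating the test list rather than trusting memory. The sketch below enumerates every (service, from, to) move for a set of services and servers; the function name and the example names are hypothetical, and it assumes every service may run on every server, which is the M-into-N case.

```python
from itertools import permutations


def failover_test_cases(services, servers):
    """List every (service, source server, destination server) move to test."""
    return [
        (service, source, destination)
        for service in services
        for source, destination in permutations(servers, 2)
    ]


if __name__ == "__main__":
    cases = failover_test_cases(["sybase1", "nfs_home"], ["machine1", "machine2"])
    for service, source, destination in cases:
        print(f"fail {service} from {source} to {destination}")
```

Two services on two servers give only four moves, but four services on three servers already give twenty-four, which is a good argument for a checklist.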

You now know what HA is and isn't, how to plan for an HA configuration, and how to choose and implement an HA solution. Following the simple guidelines outlined above, anyone can create a highly available environment.

Evan Marks is currently employed at Aetna Retirement Services in Hartford, CT where he is in charge of the Unix environment. He is also vice president of the Sun User Group and has given several talks on high availability at SUG conferences.

Reach Evan at evan.marks@sunworld.com.

If you have technical problems with this magazine, contact webmaster@sunworld.com

URL: http://www.sunworld.com/swol-08-1997/swol-08-ha.html

Resources

Qualix HA+
http://www.qualix.com
Veritas Firstwatch
http://www.veritas.com
Solstice HA
http://www.sun.com
"Veritas says Sun will dump DiskSuite," April 1997 news story in SunWorld.
http://www.sun.com/sunworldonline/swol-04-1997/swol-04-disksuite.html
"DiskSuite-Volume Manager migration service now available as Sun expands its sales of Veritas product line," June 1997 news story in SunWorld.
http://www.sun.com/sunworldonline/swol-06-1997/swol-06-sunspots.html#5
