IT Architect

Architecting high availability solutions

What are the components of high availability and how do you go about implementing an effective high availability system?

By Sam Wong

SunWorld
November 1998

Abstract
This month's IT Architect columnist, Sam Wong, explains the challenges of building a high availability (HA) system. He breaks down HA requirements, points out a few potential problem areas, describes a couple of HA system designs, and offers recommendations for planning and putting together your HA solution. (3,300 words)



Foreword
A common request for a team implementing a technology solution is to "make it fault tolerant." That is, there is a high availability requirement -- or maybe there isn't. How much availability is necessary to satisfy a request for high availability, and what are the additional costs and trade-offs in development and maintenance of making a solution slightly more fault tolerant?

In this month's column, Sam Wong discusses the concept of HA. When we build distributed applications, there are many pieces and parts to the overall design, and that makes ensuring availability more difficult. Not only is it hard to pinpoint a failure, it's also a complex problem to ensure the design can recover from any failure, whether that means a network connection going down, a hardware failure, a database failure, or an application failure. Let us know your thoughts on HA, and how you've been handling this complex problem in highly distributed environments.

--Kara Kapczynski

As the pace of business accelerates and more and more is expected from a company's information systems, high availability (HA) solutions are becoming increasingly common. Just 10 to 15 years ago, HA solutions were found in only the most critical infrastructures or applications like backbone financial systems and clearinghouses that transferred hundreds of billions of dollars a day, or life-support systems in the health care industry. HA systems in the typical corporate data center existed, but they were much less prevalent than they are today. In building these systems, it wasn't unusual to spend tens of millions of dollars on specialty fault tolerant computers such as those from Tandem, or on double- or triple-redundant computer systems such as those found on the space shuttle.

In today's environment, the drive to consolidate operations, initiatives to provide round-the-clock customer support, and the scalability of modern servers have forced HA solutions into the limelight. Consolidation of multiple data centers into a single site reduces operational costs, but increases the impact of system failures at that site. In an effort to provide the best customer service possible, call centers are getting larger and larger and operating 24 hours a day. In addition, the ability of high-end servers to support over 5,000 concurrent users and process over 100 million transactions per day increases the impact of system outages.

As a result of all of this, IT architects will turn to HA more frequently as a solution. End users and process owners will emphatically declare that their system has to be available 24 hours a day, 7 days a week, 365 days a year. But what does this really mean to the IT architect? Why should IT architects pay attention to HA?

One reason to do so is that HA is truly an architectural issue, transcending multiple IT disciplines, such as local and wide area networking, computing hardware, databases, applications and IT facilities planning. Another major reason is that building and operating HA solutions can cost two to five times more than systems that are not highly available. Why is the cost so high?

Imagine, if you will, the difficulty of performing maintenance on a Boeing 747 while the airplane is in mid-flight and without any passengers noticing any service interruptions -- operating, upgrading, and evolving an HA application is somewhat similar. And with the advent of distributed computing technologies, the number of "moving parts" that make up a modern application and the computing infrastructure that supports it has easily tripled in the last 15 years. This extra complexity has exacerbated the challenge of building and operating HA solutions.

In the rest of this article, I will define what HA really means, discuss common points of failure in an application, summarize some important issues that are often overlooked in the initial planning of HA systems, describe a few common HA designs, and offer several suggestions for implementing a successful and appropriate HA solution.

What is high availability?
High availability is much more than just having a redundant database server on the backend of the application. It is important to break down a general HA requirement into specific, detailed requirements. I typically characterize HA using three factors: mean time to recovery, scheduled downtime, and fault transparency.

Mean time to recovery (MTTR) is defined as the amount of time required to return an application to a functioning status after an unexpected component failure. It is measured as the elapsed time from the occurrence of a fault that results in a system outage to the resumption of business processing. Different applications typically have different MTTR requirements. I categorize MTTR into the three levels used in the classification table later in this article: Level 1 (don't care), Level 2 (limited), and Level 3 (minimal).



Scheduled downtime
Scheduled downtime describes the amount of time throughout a calendar year that an application will be unavailable to the end user. Scheduled downtime is always well planned, communicated, and managed, and it occurs during brief periods at night or on weekends when normal business transactions are stopped or at a reduced level. Unscheduled downtime, by contrast, is governed by the MTTR requirement and occurs in the middle of normal business operations.

Typical activities performed during periods of scheduled downtime include hardware maintenance, hardware upgrades, backup and recovery, and software upgrades. Scheduled downtime is categorized into the same three levels used in the classification table later in this article: Level 1 (acceptable), Level 2 (limited), and Level 3 (minimal).

It isn't unusual for users to initially state that 99.9 percent uptime is required without having actually thought through the exact requirement. It is the responsibility of the IT architect and systems designer to determine if this requirement is really 90 percent, 99 percent or 99.9 percent. This is important because the final few tenths of a percentage point below 100 percent can result in driving up the final cost by a factor of two or three. Few systems really require 99.9 percent uptime.

To provide some context, the following table summarizes the amount of scheduled downtime allowed in a full calendar year under several different requirements for percentage uptime:

Percentage uptime    Number of hours available    Number of hours unavailable
99%                  8,672                        88
99.5%                8,716                        44
99.8%                8,742                        18
99.9%                8,751                        9
99.95%               8,755                        5
100%                 8,760                        0
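
For readers who want to check these figures or work out the budget for a different target, the short Python sketch below performs the same arithmetic, assuming the 8,760 hours of a non-leap year; the table values above are simply these results rounded to whole hours.

HOURS_PER_YEAR = 365 * 24   # 8,760 hours in a non-leap year

def downtime_budget(uptime_percent):
    """Hours per year an application may be unavailable at this uptime."""
    return HOURS_PER_YEAR * (1.0 - uptime_percent / 100.0)

if __name__ == "__main__":
    for target in (99.0, 99.5, 99.8, 99.9, 99.95, 100.0):
        print("%6.2f%% uptime allows %5.1f hours of downtime per year"
              % (target, downtime_budget(target)))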

Fault transparency
Fault transparency is the final consideration that should be factored into defining an HA requirement. The level of fault transparency is determined by how visible a system fault is to the user. Systems that require users to execute several fault recovery steps, or to follow a different set of operational procedures, have a low degree of fault transparency, whereas fully fault transparent systems shield end users from even knowing that a fault has occurred.

Fault transparency can likewise be graded into several degrees, ranging from systems that require manual recovery steps to systems that hide faults from users completely.

Taken together, these three factors determine several different classes of HA. Users typically pay the most attention to MTTR and scheduled downtime in defining their HA requirement, so I've used these two factors to define the classes of HA service listed below:

Classification                       Mean time to recovery    Scheduled downtime
Not highly available                 Level 1: Don't care      Level 1: Acceptable
Usually available                    Level 1: Don't care      Level 2: Limited
(Doesn't make sense)                 Level 1: Don't care      Level 3: Minimal
Fault resilient                      Level 2: Limited         Level 1: Acceptable
Highly available                     Level 2: Limited         Level 2: Limited
Continuously available               Level 2: Limited         Level 3: Minimal
Fault tolerant                       Level 3: Minimal         Level 1: Acceptable
Fault tolerant & highly available    Level 3: Minimal         Level 2: Limited
Continuously fault tolerant          Level 3: Minimal         Level 3: Minimal
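
To make the classification easy to apply mechanically, here is a small, illustrative Python sketch that encodes the table above as a lookup. The level numbers and class names come directly from the table; the code is simply one possible encoding.

# The HA classification table, keyed by (MTTR level, scheduled downtime level).
# Level 1 = don't care / acceptable, Level 2 = limited, Level 3 = minimal.
HA_CLASSES = {
    (1, 1): "Not highly available",
    (1, 2): "Usually available",
    (1, 3): "(Doesn't make sense)",
    (2, 1): "Fault resilient",
    (2, 2): "Highly available",
    (2, 3): "Continuously available",
    (3, 1): "Fault tolerant",
    (3, 2): "Fault tolerant & highly available",
    (3, 3): "Continuously fault tolerant",
}

def classify(mttr_level, downtime_level):
    """Map an MTTR level and a scheduled-downtime level to an HA class."""
    return HA_CLASSES[(mttr_level, downtime_level)]

if __name__ == "__main__":
    print(classify(2, 2))   # "Highly available"
    print(classify(3, 3))   # "Continuously fault tolerant"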

What could go wrong?
The different levels of HA listed above directly impact how faults are managed. It is conceivable that some HA systems -- specifically, those that fall in Level 2 (usually available) through Level 6 (continuously available) -- could sustain some downtime in order to recover from a fault. Other HA systems may require true fault tolerance and must recover from faults without interrupting the application. The following diagram describes the fault management cycle:

[Diagram: the fault management cycle]

As expected, the more stringent the requirements of the high availability solution in terms of the MTTR, scheduled downtime, and fault transparency, the greater the cost of the solution. It is not unusual to see costs increase exponentially as the high availability requirements increase linearly.

As stated earlier, one reason for the high cost of HA is the large number of parts in a typical application using current technology. Multiple application architectures exist, but for simplicity, we'll consider only the components of a two-tiered client/server system, as depicted in the diagram below:

[Diagram: components of a two-tiered client/server application]

A failure in any of these components may render the application unavailable. To provide HA, every one of these components must be eliminated as a potential single point of failure (SPOF). In most systems, more attention is paid to eliminating SPOFs from the back-end server, since it may support several hundred or several thousand clients. The following table suggests several common techniques for eliminating SPOFs:

• Client application / server application: Integrate application into fail over software environment; develop applications with restart / recovery in mind; thoroughly test and debug code.
• CPU: Implement multi-CPU machines; implement clustered machines.
• Database engine: Implement multiple database engines on separate servers; implement a transaction processing monitor on top of the database engines; utilize two-phase commits.
• Disk controller: Use redundant disk controllers.
• Disk drive: Implement mirrored disks; use RAID disks (redundant array of inexpensive disks); implement hot-swappable disk drives (drives that can be replaced without powering down the machine).
• Human operator: Automate as many tasks as possible; document all procedures at a detailed level; provide adequate training; regularly execute procedures to build familiarity; automate operational alerts to facilitate fast response to system faults.
• Network and network services: Implement separate, redundant networks and network services (such as firewalls and infrastructure services such as DNS, LDAP, and NIS).
• Network interface card: Use redundant network interface cards (or cards with multiple ports), each connected to separate sub-networks.
• Operating system: Utilize fail over software; implement a journaled file system.
• Power source: Use uninterruptible power supplies; implement power conditioning equipment.
• Power supply: Use redundant power supplies.

Why high availability is hard
Each SPOF can be eliminated through the techniques listed. However, what is not apparent is the difficulty of building and maintaining applications and systems that implement the techniques presented. Numerous hidden considerations exist that make HA a complex problem.

One such hidden consideration is managing fault transparency. What does the user see if a fault occurs? The table above describes how to deal with problems at a microscopic level, but what about the big picture? We know what to do to minimize the chance of a tree falling, but what happens to the forest when one does fall? This is where HA transcends multiple computing disciplines. It is the responsibility of the IT architect and system designer to accommodate such scenarios. Doing so takes time, effort and coordination -- all of which ultimately increase the final cost of implementing HA.

In implementing HA, one key supporting requirement is a strong systems management infrastructure. Distributed systems management focuses on proactively preventing faults and on reacting efficiently to the faults that do occur. (Systems management architectures will be covered in a future column.) Without a robust systems management infrastructure, fault prevention is less assured, and keeping MTTR low is more difficult.

In addition, the operation of HA systems requires that very detailed operational procedures be developed. It is important to capture detailed, step-by-step procedures in order to eliminate confusion and the potential for human error. Maintaining these procedures over time requires an extra measure of organizational discipline. Also, periodic testing of normal operating procedures, as well as failure recovery procedures, requires time, resources, and commitment from the IT organization.

It's important to note that these procedures should cover not only how to fail over (how to react to a fault), but also how to fail back (how to bring the application back from operating in a fail over state). Planning for this fail back step is easily overlooked when estimating the effort required to implement HA.

Another complicating consideration that is easily overlooked is the extra effort needed to develop and maintain an HA application. Development costs may increase because special functionality must be built into the application to make it "HA aware" -- able to fail over and fail back without shutting down the application and without users taking any extraordinary corrective actions. Making applications "HA aware" or "HA friendly" is probably the least understood aspect of HA, with little detailed literature or vendor support. The simplest solution here is to seek specific guidance from technical and management staff who have built and are now operating HA applications.
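
The following is an illustrative Python sketch only -- not a technique prescribed by any particular product -- of one common ingredient of an "HA aware" client: retrying an idempotent operation through a fresh connection when the current one fails, so the end user never has to take corrective action. The connect and operation callables are hypothetical placeholders.

import time

class TransientFailure(Exception):
    """Raised by the lower layer when the server or its connection fails."""

def run_with_failover(connect, operation, retries=3, delay_seconds=5.0):
    """Run operation(conn); on failure, reconnect and retry transparently.

    Only safe for idempotent operations -- a truly HA-aware application
    must also reconcile in-flight transactions after fail over and fail back.
    """
    conn = connect()
    for attempt in range(retries + 1):
        try:
            return operation(conn)
        except TransientFailure:
            if attempt == retries:
                raise                    # out of retries; surface the fault
            time.sleep(delay_seconds)    # give the fail over time to complete
            conn = connect()             # reconnect, possibly to the backup server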

The administrative costs of operating an HA system also increase. Maintaining redundant systems often doubles the administrative burden of creating new users, managing security and permissions, and synchronizing software and configurations, because every task must be done on at least two separate systems. In addition, a shared disk between the primary and secondary servers is often required. With some HA solutions, this shared disk can only be in the form of a raw partition, which complicates day-to-day maintenance.

How to implement high availability
Two common approaches exist for implementing an HA solution with redundant servers. The first is an "active/active" solution, in which both back-end servers share the burden of processing live transactions. The other is an "active/standby" solution, in which only one of the two servers processes live transactions; the other server is used only when the primary server fails. These diagrams describe each approach:

[Diagrams: active/active and active/standby server configurations]

The root disks depicted here are typically mirrored. The shared disk can be a mirrored disk, a RAID system, or a combination of the two. The heartbeat connection is a dedicated connection between the two servers that is used to ensure that both servers are still functioning; it can be carried over a dedicated serial line or over a network connection. The client PCs can be connected to just one of the network segments or to both. If the client PCs are connected to only one network segment, then each server must be connected to (or must be able to access) all network segments.
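
To make the heartbeat mechanism concrete, here is a deliberately simplified, illustrative Python sketch of a heartbeat carried over a network connection. The port number, interval, and miss limit are arbitrary, and the take_over callable is a placeholder; commercial fail over software handles many hard cases this sketch ignores, such as split-brain conditions and disk fencing.

import socket
import time

HEARTBEAT_PORT = 9999      # arbitrary port chosen for this illustration
BEAT_INTERVAL = 1.0        # the primary sends one heartbeat per second
MISSED_BEAT_LIMIT = 5      # declare the primary failed after five misses

def send_heartbeats(standby_address):
    """Run on the primary: send a small datagram to the standby forever."""
    sock = socket.socket(socket.AF_INET, socket.SOCK_DGRAM)
    while True:
        sock.sendto(b"alive", (standby_address, HEARTBEAT_PORT))
        time.sleep(BEAT_INTERVAL)

def monitor_heartbeats(take_over):
    """Run on the standby: count missed beats and fail over past the limit."""
    sock = socket.socket(socket.AF_INET, socket.SOCK_DGRAM)
    sock.bind(("", HEARTBEAT_PORT))
    sock.settimeout(BEAT_INTERVAL)
    missed = 0
    while missed < MISSED_BEAT_LIMIT:
        try:
            sock.recvfrom(64)
            missed = 0          # heartbeat received; reset the counter
        except socket.timeout:
            missed += 1         # no heartbeat within the interval
    take_over()                 # e.g., assume the failed server's IP address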

Choosing between the active/active and the active/standby approach depends on the exact requirements of the application. The configuration of the hardware and systems software will be determined by the MTTR, scheduled downtime, and fault transparency requirements. These are the trade-offs of each approach:

Active/Active

  Advantages:
  • If desired, allows both servers to operate at or near full capacity in normal usage, maximizing resource utilization and reducing hardware costs.

  Disadvantages:
  • If both servers are operating at or near capacity, then performance in the fail over state is likely to be 50% or less of normal performance.
  • In order to avoid a performance penalty in a fail over state, both servers must operate at no more than 50% of capacity, wasting resources until a fault occurs (a short worked example follows this table).
  • If continuous operation is required, then upgrading the hardware or software requires a third server if full fault recoverability is desired even during the upgrade process.

Active/Standby

  Advantages:
  • No performance penalty is incurred when operating in a fail over state.

  Disadvantages:
  • Requires the purchase of a standby server that does no work until a fault condition occurs, which increases up-front hardware costs.
  • If continuous operation is required, then upgrading the hardware or software requires a third server if full fault recoverability is desired even during the upgrade process.
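
The capacity point in the active/active column is easy to verify with a little arithmetic: after a fail over, the surviving server must carry both workloads. The utilization figures in this sketch are illustrative only.

def survivor_utilization(primary_load, secondary_load):
    """Fraction of one server's capacity needed after its peer fails."""
    return primary_load + secondary_load

if __name__ == "__main__":
    print(survivor_utilization(0.45, 0.45))   # 0.9 -- survivable
    print(survivor_utilization(0.70, 0.70))   # 1.4 -- overloaded after fail over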

When selecting which components to use and how to configure them, it's important to evaluate how each one affects fault transparency. Most HA fail over software packages, such as Hewlett-Packard's MC/ServiceGuard, IBM's HACMP, and Sun's Solstice HA, work by having the operational server assume the IP address of the failed server, eliminating the need for the user to log off and log back on to a different fail over server. (Note that other components, such as the application itself or the database, may still require the user to log off and log back on.) Most database applications will incur a pause in transaction processing upon a fail over as the database performs roll-forward and roll-back recovery on the backup server. The length of this pause can be reduced by increasing the frequency of database checkpoints. However, frequent checkpoints reduce database transaction throughput.
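
The checkpoint trade-off can be sketched with back-of-the-envelope numbers. The rates below are assumptions chosen purely for illustration; the point is that the pause is driven by how much work has accumulated since the last checkpoint, so halving the checkpoint interval roughly halves the worst-case pause, at the price of extra checkpoint overhead during normal processing.

REDO_MB_PER_MINUTE = 50.0      # assumed rate at which the database generates redo
REPLAY_MB_PER_MINUTE = 200.0   # assumed rate at which redo can be replayed

def worst_case_pause_minutes(checkpoint_interval_minutes):
    """Rough fail over pause if the fault strikes just before a checkpoint."""
    redo_to_replay = REDO_MB_PER_MINUTE * checkpoint_interval_minutes
    return redo_to_replay / REPLAY_MB_PER_MINUTE

if __name__ == "__main__":
    for interval in (5, 15, 30, 60):
        print("checkpoint every %2d min -> pause of up to %4.1f min"
              % (interval, worst_case_pause_minutes(interval)))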

Regardless of which approach is chosen, cobbling together the various components can be a complex task. In the past, the only approach was to order each individual component -- two servers, dual network cards in each server, disk drives, disk controllers, HA fail over software, and so on -- and integrate them yourself. Server vendors have recognized this and now offer pre-configured solutions as a single orderable product that includes the servers, interface cards, cables, and other goodies all in one package. Vendors are even beginning to bundle common software configurations together to simplify installation of the technical infrastructure for the application. For example, Hewlett-Packard markets an HA server solution in conjunction with Oracle Parallel Server.

Recommendations
With all of the issues noted above, one might be tempted to avoid implementing HA altogether. Yet the pace and direction of business today will call for HA systems more and more frequently. As HA requirements surface, I recommend that IT architects and system designers do the following:

Though the recommendations above won't ensure flawless HA implementations, they will help you to reduce risk, avoid many common surprises, and set expectations appropriately. Good luck in your HA implementation!




About the author
Sam Wong is a technology director and principal architect with Cambridge Technology Partners. Reach Sam at sam.wong@sunworld.com.


