Change control provides an orderly way to make changes to your system, to notify anyone affected by a change, and to hear protests should the change adversely affect someone. It means documenting changes. And it means devising reasonable contingency plans for restoring the system if a change doesn't work.
The concept of change control is simple, and the benefits are obvious. People need to know when change is going to happen so they can be prepared for its effects. More importantly for Information Technologies (IT), change often affects systems availability, particularly in distributed-computing environments where change can produce sometimes radically unpredictable results. Changing the operating system or patching an application on one server can affect the operation of other systems in many unforeseen ways. As production systems manager, your numero uno responsibility is to make your systems highly available, up and running at least 99 percent of the time. Control change or you'll lose control of your systems -- and your job.
Given the potentially adverse effect on their job security, why is change control so often ignored or abused by IT professionals? In fact, change control is one area where the New Unix Professional can take the high road. Seize the opportunity to sneer at your mainframer counterparts.
Fatal flaw
Change-management procedures in the traditional mainframe-based data center dictate three things: only systems programmers can make changes to the systems, anyone who might be affected by a change is notified at least two weeks in advance, and every change is rigorously documented. But there is a fatal flaw in that system: systems programmers have the authority to make changes to their systems in the data center whenever they want. At many sites it has become common for minor changes to go unannounced and undocumented.
Arrogance, perhaps, is the reason for such behavior -- even we were guilty in our early days of the "techno-god" pretentiousness that was common among mainframe systems programmers. But the effects of change to a homogeneous, tightly integrated system like a mainframe are usually very predictable, very controllable, and nearly imperceptible. Except for the rare occasion of a major system update or upgrade, change in the long-established, proprietary mainframe data center, like our recent U.S. Congress, comes in infinitesimally small increments.
Now let's consider a production system with more than 200 servers and countless desktops running dozens of mission-critical applications and databases in a distributed-computing enterprise. What level of availability do you think you could maintain if you let systems programmers make changes whenever they want to make changes?
Remember, at the time we began rightsizing Sun it was common practice in the Unix community that everyone (it seemed) had root access and could modify their systems. And so-authorized, change the systems they did. Sure, they were supposed to notify the people affected, but the effects often were unpredictable. One time, for instance, unbeknownst to us in IT, someone fixed what on the surface appeared to be a minor networking glitch in one of our production servers. It took us three hours and the expertise of an entire team of system administrators, urged on in their task by incessant phone calls from several corporate VPs, to discover that the undocumented networking change was the reason one of our mission-critical applications had failed. Do you know what it costs the company for that kind of downtime? We had a similar incident when a system administrator applied an apparently innocuous operating system patch that subsequently crashed one of our production databases.
It should be apparent by now that the fatal flaw, even in mainframe-based data centers, but particularly in distributed systems, is not that you don't have change controls and practices, but that there aren't good checks and balances to detect unauthorized changes or unforeseen consequences. Today, we treat Sun's 200-plus distributed-production servers even more delicately than if they were mainframes, particularly when managing change. We have to!
We saw that
We don't stop change; we make sure it's orderly, documented, and secure. We've implemented, as with many other IT functions, a lights-out, automated, checks-and-balances process for change control to help eliminate human error, improve the efficiency and effectiveness of the process, and ensure high systems availability. At the top of our change-control system is the Unix Production Acceptance (UPA) process, which provides -- among many other things -- for standard systems configurations (see "Our #1 priority," Unix Enterprise, June 1994).
When a server is put through the UPA process, which is necessary if it is to be installed and maintained by IT, we know its exact configuration, including all hardware and software, even its disk and data partitions.
Next, we use in-house-developed shell scripts to automatically monitor those servers and detect changes. For example, we use auto sysconfig, which was developed by our former systems programming staff to track inventory. It surfs Sun's WAN at 3:00 a.m. every day, takes a snapshot of each production server's configuration, and compares it with previous configurations. The results are stored in a Sybase database for review by the data center and management. We know exactly what changes have been made to any server in the past 24 hours and who made the changes.
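The snapshot-and-compare idea can be sketched in a few lines. This is only an illustration, not the auto sysconfig script itself, which the column describes as a shell script feeding a Sybase database. The function names, the use of SHA-256 checksums, and the choice to track individual files are all assumptions made for the example.

```python
import hashlib
from pathlib import Path


def take_snapshot(paths):
    """Return {path: SHA-256 hex digest} for every listed file that exists."""
    snapshot = {}
    for path in paths:
        p = Path(path)
        if p.is_file():
            snapshot[str(p)] = hashlib.sha256(p.read_bytes()).hexdigest()
    return snapshot


def compare_snapshots(previous, current):
    """Classify files as added, removed, or changed between two snapshots."""
    common = set(previous) & set(current)
    return {
        "added": sorted(set(current) - set(previous)),
        "removed": sorted(set(previous) - set(current)),
        "changed": sorted(n for n in common if previous[n] != current[n]),
    }
```

A nightly job would call take_snapshot on the tracked files, load the previous night's result from wherever it is stored, and hand both to compare_snapshots. An empty result in all three lists means nothing changed.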
Now, don't get us wrong -- we're not trying to play cops and robbers. Whether disciplinary action is called for or not, our scripts automatically document change; they don't stop it. That way we know exactly what's happening with the servers if we need to restore the systems. And we don't have to rely on manual documentation, which can be fraught with inconsistencies and omissions.
Here's what a report from that early morning "big brother" looks like:
    *** System file changes on system "golfball"
    {passwd}
    1c1
    < root:dqdxnAC.1JTRc:0:1:Operator:/:/bin/csh
    ---
    > root:wolmNqwX0FnhQ:0:1:Operator:/:/bin/csh
    9a10
    > alexbkup:aNibdAyTAFSQM:9001:101:Alexandra Backupr:/home/alexbkup:/bin/csh
The script extracts and reports only changes to a system. In this case, it is a change to the passwd file (important for systems security) on the server named "golfball." The report shows root's original passwd entry followed by the changed entry, which carries a new encrypted password, and then a newly added account, alexbkup. We know which systems person has root authority and responsibility for each server. So, those who feel brave enough to make a systems change without going through change control are bound to get caught, and the incident will get documented in their personnel file. Yes, we are dead serious about managing change.
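For readers who want to try something similar, here is a minimal sketch of a passwd comparison that reports only what changed. It is not the script behind the report above: it compares entries by login name rather than by line position, and the function names are hypothetical. The sample data reuses the two root entries and the alexbkup entry from the report.

```python
def parse_passwd(text):
    """Map each login name to its full passwd line."""
    entries = {}
    for line in text.splitlines():
        line = line.strip()
        if line and not line.startswith("#"):
            entries[line.split(":", 1)[0]] = line
    return entries


def passwd_changes(old_text, new_text):
    """Return report lines: '<' marks an old entry, '>' marks a new one."""
    old, new = parse_passwd(old_text), parse_passwd(new_text)
    report = []
    for name in sorted(set(old) | set(new)):
        if old.get(name) == new.get(name):
            continue  # unchanged entries are not reported
        if name in old:
            report.append("< " + old[name])
        if name in new:
            report.append("> " + new[name])
    return report


yesterday = "root:dqdxnAC.1JTRc:0:1:Operator:/:/bin/csh\n"
today = (
    "root:wolmNqwX0FnhQ:0:1:Operator:/:/bin/csh\n"
    "alexbkup:aNibdAyTAFSQM:9001:101:Alexandra Backupr:/home/alexbkup:/bin/csh\n"
)
for line in passwd_changes(yesterday, today):
    print(line)
```

Keying on the login name means a reordered file produces no noise, at the cost of not noticing reordering itself. A line-based diff makes the opposite trade.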
Making change happen
To make changes to Sun's mission-critical systems, IT personnel must go through fairly rigorous channels. First, they must use the UPA process that dictates system availability and, hence, how and when a change can be made to a server. Then, their change request must be approved by affected group managers, and finally, by an IT change-control committee that meets once a week. That may sound laborious, but like many other IT functions at Sun, change happens automatically, starting with an on-line distributed software tool developed by one of our former systems programmers, Karen Chau. IT users access the chngcntl tool from their SunOS desktop and use the text-based interactive script to request a change or view the status of their change request. The entire approval process happens by e-mail, including authorizing signatures.
The person requesting a systems change fills out five main areas in the chngcntl script: a description of the change, the impact it will have on the system, who owns the server, any special change instructions, and the procedure for backing out of the change. The View option lets users see which requests are pending approval, which have been approved and are awaiting scheduled implementation, and which system changes have occurred in the past month. In addition, a similar change-control summary is automatically compiled from the chngcntl database and e-mailed to IT personnel each week on the day prior to the change-control committee meeting.
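The five areas and the three views map naturally onto a small record type. The sketch below is one hypothetical way to model them and says nothing about how chngcntl is actually built. The class, field, and function names are invented for the example, and "the past month" is taken to mean the last 30 days.

```python
from dataclasses import dataclass
from datetime import date, timedelta
from enum import Enum
from typing import Optional


class Status(Enum):
    PENDING = "pending approval"
    APPROVED = "approved, awaiting implementation"
    IMPLEMENTED = "implemented"


@dataclass
class ChangeRequest:
    # The five areas the requester fills out.
    description: str
    impact: str
    server_owner: str
    special_instructions: str
    backout_procedure: str
    # Bookkeeping fields (assumed, not described in the column).
    status: Status = Status.PENDING
    implemented_on: Optional[date] = None


def view(requests, today):
    """Group requests into the three lists the View option is said to show."""
    month_ago = today - timedelta(days=30)  # assumption: "month" = 30 days
    return {
        "pending": [r for r in requests if r.status is Status.PENDING],
        "approved": [r for r in requests if r.status is Status.APPROVED],
        "recent": [
            r for r in requests
            if r.status is Status.IMPLEMENTED
            and r.implemented_on is not None
            and month_ago <= r.implemented_on <= today
        ],
    }
```

Making the back-out procedure a required field, with no default, mirrors the column's point that a contingency plan is part of every request rather than an afterthought.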
We also have a special process to handle emergency changes. Senior operations management can approve changes that must be done in less than seven days after a request is submitted.
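The routing rule described in this section can be written down as a short function. This is a sketch under stated assumptions: the column says only that senior operations management can approve changes needed in less than seven days, so treating that as a complete replacement for the normal chain is a guess, and the function and constant names are hypothetical.

```python
from datetime import date

EMERGENCY_WINDOW_DAYS = 7  # from the column: "less than seven days"


def approval_route(submitted, scheduled):
    """Return the ordered list of approvers for a change request."""
    if scheduled < submitted:
        raise ValueError("a change cannot be scheduled before it is requested")
    lead_days = (scheduled - submitted).days
    if lead_days < EMERGENCY_WINDOW_DAYS:
        # Assumption: the emergency path bypasses the weekly committee.
        return ["senior operations management"]
    return ["affected group managers", "IT change-control committee"]


print(approval_route(date(1995, 3, 1), date(1995, 3, 3)))
print(approval_route(date(1995, 3, 1), date(1995, 3, 15)))
```

The first call falls inside the emergency window and the second takes the normal route through the group managers and the committee.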
Configuration management and source control for mission-critical production applications are handled in a separate process that is just as rigorous as systems change management. We'll go into more detail in a later column. Fortunately for us, we didn't have to build software for applications version control. On Unix systems, we use Sun's TeamWare, and we highly recommend Programming Research Corp.'s (Dallas) QA Administer.
Better than mainframers
Okay, now what was that claim that the New Unix Enterprise isn't as secure as the mainframe? We beg to differ! Our change controls are a good example of how Sun's distributed-computing environments are even more secure than many mainframe data centers. Remember: It isn't the tools; it isn't the hardware; it's all in the infrastructure.
Harris Kern (harris.kern@sunworld.com) is Sun's Open Systems Migration Consultant for NAAFO Market Development. Randy Johnson (randy.johnson@sunworld.com) owns R&H Associates, a full-time rightsizing consultancy in Boulder Creek, CA. R&H Associates helps people worldwide in implementing and supporting client/server infrastructures based on their proven methodologies. © 1995 Harris Kern and Randy Johnson. All rights reserved.
Pick up a copy of their book Rightsizing The New Enterprise: The Proof Not the Hype, SunSoft Press/PTR Prentice Hall, ISBN 0-13-132184-6, or their new book Managing The New Enterprise: The Proof Not the Hype by Kern, Johnson, Hawkins, Law, and Kennedy, SunSoft Press/PTR Prentice Hall, ISBN 0-13-231184-4. Browse SunSoft Press offerings at: http://www.sun.com/smi/ssoftpress
URL: http://www.sunworld.com/asm-03-1995/asm-03-unix.html.
Last updated: 1 March 1995.