How to manage and implement change in your Unix environment
When are changes necessary -- and when should they be avoided?
Change in your IT shop, whether it's hardware or software related, can be either a blessing or a big blunder. Proper change management is crucial. Chuck takes you through the best steps for preparing for change, tracking it, and finally, implementing it. (2,500 words)
Change also brings constant confusion and chaos. Every time something changes, something else breaks and stops working. Systems that ran fine for years suddenly go belly-up when a seemingly unrelated system is altered in some way. Change confuses users and frustrates administrators.
We can't live with change, but we can't survive without it. How do you harness the power of change to use its forces for good instead of evil? In the realm of systems management, you apply the principles of change management.
Origins of change
As in all confrontations, you must know your enemy. To anticipate and control change, you must know what causes it, which of its manifestations are important, and which can be safely ignored.
One of the principal sources of change is hardware and software vendors. In the industry we've created, success is only possible when constant change forces customers to purchase new products to replace and enhance old ones. Every vendor is required to constantly change its products to differentiate them from the competition and garner more business. Much of this change is for the better, bringing new and improved products to your door. The trick is to know when to buy and when to smile politely and say no.
Hardware changes are often easiest to resist, because they're rarely free. You may dream of upgrading all your processors from 300 MHz to 400 MHz, but justifying that hefty price tag can be difficult. You can and do defer hardware changes until you absolutely can't live without them, and both your pocketbook and system health are generally better off for it.
Vendor-driven software changes are a different matter. Software upgrades, especially system patches and updates, are almost always free. Like horribly addictive drugs, you can surf the Net and pull down all sorts of add-ons, upgrades, and enhancements for your systems, all of which promise some sort of new feature or improvement.
In reality, many upgrades, patches, and software fixes aren't necessary. Patches and upgrades should be applied only when they fix a specific problem you're actually experiencing on your system. Many administrators feel the need to apply every patch to their box, "just in case we might need it." Nothing could be worse for your systems. All those patches can mix badly, inducing bugs where none existed before.
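The policy boils down to a simple filter: a patch earns its place only when it maps to a problem you have actually observed. A minimal sketch in Python, where the patch IDs and problem labels are hypothetical examples, not real Sun patch data:

```python
def patches_to_apply(available_patches, observed_problems):
    """Keep only the patches that fix a problem actually seen on this host.

    available_patches: list of dicts like {"id": "...", "fixes": "..."}
    observed_problems: set of problem labels logged on this system
    (all identifiers below are hypothetical examples)
    """
    return [p for p in available_patches if p["fixes"] in observed_problems]

# Example: of three candidate patches, only one addresses a logged problem.
candidates = [
    {"id": "105181-05", "fixes": "printer-hang"},
    {"id": "106327-01", "fixes": "nfs-stale-handle"},
    {"id": "107078-02", "fixes": "x11-font-leak"},
]
seen = {"nfs-stale-handle"}
print([p["id"] for p in patches_to_apply(candidates, seen)])  # → ['106327-01']
```

Everything else stays on the shelf until a matching problem actually appears in your logs.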
Once you've resisted the vast majority of changes your vendors would love to inflict upon you, it's time to turn your efforts to reining in your staff. The desire to tinker and hack flows through the blood of every Unix admin, and you must harness that desire, channeling it toward changes that genuinely improve your environment. If they're worth their salt, your administrators will constantly fiddle with and improve your systems. The goal (as we'll see) is to track those changes, keep the good ones, and avoid the frivolous or costly ones.
This is a fine line to walk. If you reject all the advances put forth by your staff, you're doing both yourself and your team a disservice. Computing is fundamentally a creative outlet, and you must satisfy that urge by allowing your people to create a better environment with your systems. If you go overboard and allow any and all changes, however, you're introducing too much chaos into your environment, creating problems where none existed.
With your staff well managed, it's time to face the crowd you can't control: your users. Their desire to change will directly control your ability to manage that change. Users generally don't care about the details; they view the systems as tools that must be shaped to get their jobs done as quickly and cheaply as possible. It's hard for Unix administrators to come to grips with the reality that mainframe systems programmers came to accept years ago: no change -- even the slightest -- ever occurs without users approving it. You might think your systems belong to you, but in fact you run them for other people. Those people call the shots, because usually they're paying the bills.
Change, then, originates from three groups. Your vendors and suppliers will constantly suggest changes, almost all of which can be ignored. Your staff will think of changes, which must be carefully considered and accepted judiciously. Your users will come up with changes that, no matter how outlandish, will usually have to be accommodated in some form or fashion. More importantly, your users can stop any change you or your vendors may want to cause, declaring it to be not in their best business interest.
This last observation has far-reaching implications. Users may be perfectly happy running software that is several years old on aging hardware. Even though vendors have dropped support and nothing else works in the environment anymore, you may be expected to keep those old components running to support a user that has no good reason to upgrade or enhance his or her environment. Convincing a happy, stable user to change to a newer environment to make your life easier is a difficult task, and not one to undertake lightly.
With a constant stream of proposed changes being presented to you and your staff, you must implement a formal process to evaluate, approve, test, implement, and document them. While reasoned, systematic change is unavoidable, its disorderly cousin isn't. By creating a process that you and your customers can live with, you allow change to occur without disrupting your systems too much.
The first step is to formalize change requests. You may choose to use a formal request management tool, such as Remedy's Action Request System, or you might create your own internal system to accept and track change requests. Rolling your own solution might work for small shops with informal change management procedures; formal tools, with various reporting and tracking bells and whistles, might suit larger shops with more diverse customer bases.
No matter which tool you select, it should allow your users and your staff to submit proposed changes for eventual review and implementation. Proposed changes should include not only the desired change, but the reasons for the change, the potential implementation difficulty, the impact on other systems, and the effect on the system if the change isn't implemented.
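However it's stored, a change request reduces to a small, fixed record carrying exactly those fields. A minimal sketch in Python; the field names and the sample request are illustrative, not drawn from any particular tracking tool:

```python
from dataclasses import dataclass

@dataclass
class ChangeRequest:
    """One proposed change, as submitted for review (illustrative fields)."""
    summary: str            # the desired change
    rationale: str          # why the change should be made
    difficulty: str         # potential implementation difficulty
    impact: str             # impact on other systems
    risk_of_inaction: str   # effect on the system if the change isn't made
    status: str = "submitted"   # later set to approved, denied, or deferred

# A hypothetical request, ready for the next review meeting.
req = ChangeRequest(
    summary="Move network printer lp2 to a new IP address",
    rationale="Free the old address for a subnet renumbering",
    difficulty="Low: one printer, one DNS entry",
    impact="Any system with the old address hard-coded",
    risk_of_inaction="Renumbering blocked for the whole subnet",
)
print(req.status)  # → submitted
```

Whether this lives in a commercial tool or a homegrown database, the point is that every request arrives with the same information, so the review meeting can compare like with like.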
Periodically, you must convene a meeting of your users and your administration staff to review all proposed changes. For each change, the submitter explains why the change should be made to the system. Each operational group, along with each user group, can then debate the merits of the change. Reaching a consensus can be difficult, and it may be that you only implement changes that receive unanimous support. Remember that your users must be part of that consensus: leaving them out will only set you up for disaster later.
Change review meetings are tedious, but can't be avoided if you expect to implement change in a controlled manner. More importantly, no one is allowed to skip a meeting. It's amazing to see which groups are affected by seemingly innocent changes in other systems. It may seem like overkill when your PC server administrator proposes moving a network printer to a different IP address, until someone from your mainframe group pipes up to announce that the printer is also bound to a specific channel on the mainframe. Changing that IP address will cause print jobs to fail unless the change is coordinated between the two systems. Reviewing changes like this before they're implemented allows you to catch more of these hidden dependencies before they disrupt your customers.
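One way to catch such dependencies before the meeting even starts is to keep a hand-maintained map from shared resources to the groups that depend on them, and consult it for every proposed change. A sketch, with hypothetical resource and group names:

```python
# Hand-maintained map: shared resource -> groups that depend on it.
# (Resource and group names here are hypothetical examples.)
DEPENDENCIES = {
    "printer-lp2": {"pc-server-group", "mainframe-group"},
    "dns-primary": {"network-group", "unix-group"},
}

def groups_to_consult(resource):
    """Every group that must sign off before the resource is changed."""
    return sorted(DEPENDENCIES.get(resource, set()))

# The printer from the example above touches two groups, not one:
print(groups_to_consult("printer-lp2"))  # → ['mainframe-group', 'pc-server-group']
```

The map is only as good as the discipline used to maintain it, which is exactly why attendance at the review meeting remains mandatory: people remember dependencies that no database records.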
At the end of a change review meeting, some of the changes will be approved, some will be denied, and some may be deferred to a future date. Make sure your system can track these decisions, allowing the deferred changes to come back for future consideration and documenting why changes were allowed or denied.
Preparing for change
After approval, you need to implement proposed changes in a controlled manner. This boils down to three important steps: testing, testing, and testing. You cannot swagger up to a production system, slap a few changes in place, adjust your protective goggles, and reboot. You must do everything you can to ensure that the changes will be made smoothly, that the system will run correctly, and that production computing will not be disrupted.
Although expensive, good test environments are critical to the change management procedure. You must be able to reproduce the production environment to the point where you can determine that changes will work correctly.
Mainframe systems make this easy with a feature called regions in MVS or VSE. A region is a separate virtual machine that runs on the same hardware as the production system, which is itself a region. Each region can be booted, configured, and modified without impacting the others. On my current mainframe system, we have five regions running: several production regions, a test region, and a region running with the date set forward to 2000 for Y2K testing.
Unix systems are only now offering these features on high-end servers. Solaris can be partitioned into different domains when running on an Enterprise 10000 server, and the domain feature will migrate down onto the E6000 and smaller systems. While domains will eventually offer everything regions provide, it will be a while before they're as commonplace and readily usable as their mainframe counterparts.
Fortunately, the Unix world can approximate regions using a different feature: cheap hardware. You can come close to emulating your production computing environment using a much cheaper, smaller server. Given that Solaris runs the same binary image (modulo the appropriate device drivers loaded at boot time) on a wide range of hardware, you can use a small server to test many of the changes you hope to implement on your bigger systems. If you have the funding to deploy an E6000 server for production, you should be able to pony up the relatively smaller bucks for an E3500 for testing and systems management. Similarly, a smaller workstation can serve as a test platform for a midrange E250 or E450 server.
For all the changes you hope to implement, you need to try to run them in a mix as close to your production environment as possible. Given that this isn't always possible, you should still do your best to wring every last problem out in testing before going live with a change in production. I'm fond of repeating an adage coined by Admiral Hyman Rickover, who created the US Nuclear Submarine Fleet: The more you sweat in peace, the less you bleed in war. Every minute spent testing a system change is worth an hour explaining to a user that his or her systems are down due to a change you made and failed to adequately test.
Once everything is approved and tested, it's time to actually implement your changes. Changes are never installed as they become available. Instead, designate a specific time to be used for systems maintenance. During that outage window, you can install all changes deemed ready to go by your users and your staff. In my current shop, for example, we preschedule maintenance for the third weekend of each month. At my previous job, we were only allowed to take the system out for maintenance from 6:00 a.m. to 10:00 a.m. on a Sunday morning once every six weeks. Your users' requirements will dictate how much time you'll have to maintain and change your systems.
For large shops, you may have multiple teams implementing changes on many systems in parallel. There may be interactions during the maintenance period that you will have to plan for. I can recall one outage where we needed to FTP some additional configuration files from a vendor's site during the outage. We hadn't planned on the network group upgrading our backbone switches, cutting off our access to the Internet. Much wailing and gnashing of teeth ensued as the clock ticked inexorably toward the end of the outage window.
There are two morals to that story: First, coordinate all your activities across all maintenance groups, minimizing conflicts. Second, have everything you need on hand before you start updating your systems. Much time and anguish can be saved by having all upgrade and patch files staged on servers before the outage begins, with backup copies on tape or diskette in case of some unforeseen disaster. A little planning goes a long way, especially if you're working within a tight outage window.
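The second moral lends itself to a mechanical check: before the outage window opens, verify that every staged file is present and intact. A sketch, assuming a simple manifest mapping file paths to checksums (the file name and contents below are hypothetical):

```python
import hashlib
import os

def staging_complete(manifest):
    """True only if every staged file exists and matches its recorded checksum.

    manifest: dict mapping file path -> expected MD5 hex digest.
    Run this before the outage begins, not after it starts.
    """
    for path, expected in manifest.items():
        if not os.path.exists(path):
            return False
        with open(path, "rb") as f:
            if hashlib.md5(f.read()).hexdigest() != expected:
                return False
    return True

# Example: stage one (hypothetical) patch file and confirm the check passes.
with open("patch-105181-05.tar", "wb") as f:
    f.write(b"patch payload")
digest = hashlib.md5(b"patch payload").hexdigest()
print(staging_complete({"patch-105181-05.tar": digest}))  # → True
```

A failed check the day before the outage is a minor annoyance; the same discovery an hour into a four-hour window is a disaster.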
All this planning may seem like overkill, but you'll love the feeling of running a perfect maintenance outage, with each and every change occurring like clockwork, and without problems or unforeseen obstacles. In many shops, smooth outages are a point of pride among the different operational groups, with a lot of internal competition going on to see who can run the smoothest, quickest maintenance outage.
When all is said and done, you simply start over. As I noted at the beginning of this article, change never ends. In the time it takes you to install new hardware, patch the software, and upgrade the applications, your users will have developed a new need, your vendor will have shipped a new controller with a patched driver, and your admins will have discovered some new tool guaranteed to make your life easier overnight. At that point, you simply record all the changes in your change management system and start the process over, reviewing, approving, testing, and implementing.
It's easy to shrug your shoulders and view much of the change management procedure as so much overhead and busy work. To be honest, no one has time to handle all the paperwork and attend all the meetings required to implement a good change management system. More importantly, no one has the time to fix all the problems and calm all the users disrupted by uncontrolled change. If you make the time to create an effective change management process in your shop, you'll find yourself solving problems before they occur. Your users will feel more involved in the operations of your data center, and your staff will take pride in the discipline needed to run your systems perfectly. Change management takes a lot of effort that may seem misdirected, but it pays off in stable, reliable systems that are easy to maintain and provide good service to your users. And that is something that should never change.
About the author
Chuck Musciano started out as a compiler writer on a mainframe system before shifting to an R&D job in a Unix environment. He combined both environments when he helped build Harris Corporation's Corporate Unix Data Center, a mainframe-class computing environment running all-Unix systems. Along the way, he spent a lot of time tinkering on the Internet and authored the best-selling HTML: The Definitive Guide. He is now the chief information officer for the American Kennel Club in Raleigh, NC, actively engaged in moving the AKC from its existing mainframe systems to an all-Unix computing environment.