Network or nightmare?
Adding computers adds complexity. How do you keep up?
The rise of networked computing environments expands your system management role to include all the systems, their applications, and the wires that connect them. With the right tools, you can stay ahead of your network. (2,600 words)
Over time, computers have come way down in price. With the rise of the minicomputer, the workstation, and the personal computer, it's become possible to deploy hundreds -- if not thousands -- of systems to get the job done. Hardware is now so cheap it often makes sense to simply trash an underpowered system and roll in a replacement. Forklift upgrades have given way to buying systems by the six-pack.
All this power is a "good thing," but it comes at a heavy price. Managing one computer is hard enough; managing a network of systems can be overwhelming. The problem is one of compounding complexity. If all your systems were standalone entities, your workload would increase linearly as you added systems. But all those computers talk to each other -- sharing data, making network connections, cooperating to run large jobs. Adding the nth system increases your workload by n-1, one for each of the other systems it may potentially communicate with. Add those increments up and the total number of potential interactions among n systems is n(n-1)/2, which grows roughly as the square of the number of systems.
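To put numbers on that growth, here is a minimal Python sketch. The function names are made up for illustration, and it assumes the simplest possible model: every system may talk to every other system, and each pairing counts once.

```python
def potential_links(n: int) -> int:
    """Number of distinct pairings among n systems: n(n-1)/2."""
    if n < 0:
        raise ValueError("n must be non-negative")
    return n * (n - 1) // 2


def added_by_nth_system(n: int) -> int:
    """Pairings added when the nth system joins the n-1 already there (n >= 1)."""
    return potential_links(n) - potential_links(n - 1)


def growth_table(sizes=(10, 100, 1000)):
    """Return (systems, pairings) rows for a few network sizes."""
    return [(n, potential_links(n)) for n in sizes]
```

Under this model, going from 10 systems to 100 multiplies the machine count by ten but the potential pairings by more than a hundred. Real networks are rarely fully meshed, so treat the figures as an upper bound rather than a measurement.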
In these kinds of environments, system management is just one part of the larger problem: network management. Suddenly, your overall availability isn't dependent solely on keeping the processors running. In today's world, losing a router is just as disastrous as a failed disk drive; maybe more so. Often, a single network component failure disrupts the work of hundreds of users connected to dozens of systems.
For this reason, you need to pay close attention to network management. Network management views your computing environment as a collection of cooperating systems connected by various communication mechanisms. As the folks at Sun Microsystems like to say, the network is the computer. If your skills are focused on effective system management, you need to step back and see the forest instead of the trees. You now need to think of the network as a single, multilayered entity, one that requires its own care and feeding.
Your users will force you to take this approach to managing your entire environment. In the event of an outage, it will do you little good to point out that the systems were up, and that their downtime was caused by a network failure or an application problem. As far as users are concerned, the system is everything behind the network jack in the wall. When the report doesn't print on time, it doesn't matter whether the problem is a disk drive failure, a blown fuse on a hub, or a misconfigured printer driver. The system failed, and it's your fault.
How do you keep ahead of these problems and meet user expectations in such a complicated world? Like any other big problem, you must divide and conquer. In this case, we're going to attack the environment one layer at a time, examining the tools you'll need to keep each layer running. Finally, we'll look at several tools that integrate all the others and try to decide if the cost justifies the capabilities.
Managing the network
The bottom layer of your network computing environment is the network. At this layer, you need to ensure that data flows reliably between systems, that connections get made when needed, and that bandwidth is provided at the right place at the right time.
Network management at this level is a world away from traditional system management. Unlike computers, network components often have no console, log files, diagnostic tools, or easily changed configuration parameters. Even worse, many network problems come and go at a moment's notice, triggered by strange combinations of activities and network load. Reproducing problems is often impossible, and you are left waiting for something to happen again, hoping to see the cause before the problem goes away on its own.
To help you out, almost all network devices are now built with support for SNMP, the Simple Network Management Protocol. This protocol allows network devices to signal when they have a problem, report their status when queried, and communicate with other SNMP-capable devices. Not bad, given it wasn't too long ago that determining the status of your network meant walking to your communications closet and checking the lights on the switch's front panel. Now, you can employ any one of a number of SNMP management consoles to ask the switch how things are going.
Where do you get an SNMP console? Luckily, most major vendors of networking hardware will be happy to provide (or sell) an SNMP management tool. This tool can identify SNMP devices on your network, query their status, and build a view of your network that makes it easy to detect where problems might lie. The better tools will create logs of this data, letting you review past history to spot trends and isolate intermittent problems.
Many SNMP consoles are trying to outgrow their humble beginnings as simple network management consoles. With the pervasive support of SNMP, all sorts of devices can now be managed by these consoles, and having one tool to manage everything is an attractive solution. Unfortunately, many system management issues are beyond the scope of SNMP, and these tools, while useful, are not yet up to the task of running your entire environment.
Still, you must have some sort of SNMP management capability in your environment. If you've chosen a strategic vendor for your networking components, you should consider its available tools -- if only because they'll be tuned to best support its products. If you have no specific allegiance to one networking vendor, a tool like Hewlett-Packard's OpenView or Novell's ManageWise might fit the bill.
Managing the systems
We've been talking about system management all year. All those management tools don't exist in a vacuum, however. You must make sure that you integrate them into your overall network management system so that your administrators can track your systems while they manage your network.
Each system management tool you install has some sort of console or control tool. Your operators spend hours looking at your scheduler console, your backup system console, and various other system status tools. You need to bring those tools into the realm of network management, extracting important data and getting it in front of your operators when the need arises.
At first blush, having one person watch the network and the backups may seem absurd. In reality, being able to track problems in one area and proactively avoid them in another is critical. Suppose, for example, that your network backup jobs run every night at 1 a.m. If your night operator discovered that a failed hub had knocked out connectivity to your network tape library, the job scheduler could be used to suspend the backup jobs before they began, eliminating hundreds of "can't connect" errors from all your backups. Later, when the device was repaired, you could put the backups back in the schedule, letting them run safely.
Taking this example one step further, suppose your job scheduler used network availability as one criterion for starting your backup jobs. If the management tools indicated a failure, the jobs would automatically wait for things to be fixed before they started. When you integrate all your management tools at the system level with your tools from the network layer, this kind of coordination becomes possible.
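If your scheduler can't do this on its own, a small wrapper can approximate it. The Python sketch below is one way it might look, not a description of any particular scheduler. It assumes that "available" can be reduced to "a TCP connection to the tape library succeeds," and every name in it (tcp_reachable, wait_until_ready, run_backup_if_reachable) is invented for illustration.

```python
import socket
import time
from typing import Callable


def tcp_reachable(host: str, port: int, timeout: float = 3.0) -> bool:
    """Return True if a TCP connection to host:port succeeds within timeout."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        return False


def wait_until_ready(
    probe: Callable[[], bool],
    attempts: int = 12,
    delay_seconds: float = 300.0,
    sleep: Callable[[float], None] = time.sleep,
) -> bool:
    """Call probe up to `attempts` times, sleeping between failed tries.

    Returns True as soon as the probe succeeds, False if it never does.
    """
    for attempt in range(attempts):
        if probe():
            return True
        if attempt < attempts - 1:
            sleep(delay_seconds)
    return False


def run_backup_if_reachable(
    probe: Callable[[], bool],
    start_backup: Callable[[], None],
    **wait_options,
) -> bool:
    """Start the backup only once the probe says the path is up."""
    if wait_until_ready(probe, **wait_options):
        start_backup()
        return True
    return False  # give up quietly instead of logging hundreds of errors
```

In use, the probe might be something like lambda: tcp_reachable("tapelib.example.com", 10000), where both the host name and the port are placeholders. Passing the probe in as a function means you can later swap the TCP check for a query to your management console without touching the retry logic.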
This integration can be easy or hard, depending on your tools, your level of expertise, and the size of your wallet. Even if you can't afford sophisticated automation, you can still get better information in front of your administrators, who can then make better decisions about keeping your systems up and running. In addition to keeping your systems running more smoothly, your admins might start finding the sources of all those intermittent problems. Maybe those late-night network glitches happen right when you've got the daily downloads to your data warehouse scheduled. If you can't get all this information in front of the right people, you'll never get the problems solved.
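Spotting that kind of coincidence doesn't require a framework. As a rough Python sketch, assume you can export two things: the timestamps of your network glitches and the start and end times of your scheduled jobs. The JobRun record and both function names are hypothetical, as is the two-minute slack.

```python
from datetime import datetime, timedelta
from typing import Dict, Iterable, List, NamedTuple


class JobRun(NamedTuple):
    """One execution of a scheduled job (hypothetical record layout)."""
    name: str
    start: datetime
    end: datetime


def jobs_active_at(
    moment: datetime,
    runs: Iterable[JobRun],
    slack: timedelta = timedelta(minutes=2),
) -> List[str]:
    """Names of jobs running at `moment`, padded by `slack` on both sides."""
    return [r.name for r in runs if r.start - slack <= moment <= r.end + slack]


def correlate(glitches: Iterable[datetime], runs: Iterable[JobRun]) -> Dict[str, int]:
    """Count how many glitches coincided with each job."""
    runs = list(runs)  # allow a generator to be reused for every glitch
    counts: Dict[str, int] = {}
    for moment in glitches:
        for name in jobs_active_at(moment, runs):
            counts[name] = counts.get(name, 0) + 1
    return counts
```

A job that turns up next to most of your glitches is a suspect, not a culprit; coincidence in time only tells your admins where to look first.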
The bottom line: Build or buy system management tools that can integrate into larger management systems. Every time you consider a new tool, make sure you find out if it supports SNMP, and check its compatibility with the network management tools you're using.
Managing the applications
While many system administrators wash their hands of application support, users demand otherwise. If the database isn't up, they don't care if the root cause is a fat-fingered database administrator. They'll be pounding on your door, wanting to know why the system is down.
An entire layer of tools exists just to track and manage applications. This is decidedly more difficult than managing networks and systems, since applications can't be easily categorized and probed. Still, you can buy tools, such as those from BMC, that audit and manage your applications. These tools report status, detect problems, and provide a level of integration with your other management tools.
BMC offers a wide range of modules built to interact with specific applications. These modules know how to look inside an application and gauge its health. If problems arise, the module may be able to fix them, or it may simply raise an alert. In either case, you're given visibility inside an application, something you normally never have.
It is possible to build this kind of support into your homegrown tools. A little forethought in the design stage might make it possible for the application to create status files every so often, or to honor Remote Procedure Calls (RPC) that let you query its status. You might even consider building SNMP support into your applications, making them instantly accessible to your SNMP management console.
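The status-file idea is the easiest of these to sketch. In the Python below, the application calls write_status every so often, and a separate monitor calls check_status to decide whether the application looks healthy. The function names, the JSON layout, and the two-minute staleness limit are all assumptions made for illustration, not a standard.

```python
import json
import os
import tempfile
import time
from typing import Optional


def write_status(path: str, state: str, detail: str = "",
                 now: Optional[float] = None) -> None:
    """Atomically write a small JSON status file for an outside monitor."""
    record = {
        "state": state,
        "detail": detail,
        "timestamp": time.time() if now is None else now,
    }
    directory = os.path.dirname(os.path.abspath(path))
    fd, tmp_path = tempfile.mkstemp(dir=directory, prefix=".status-")
    with os.fdopen(fd, "w") as handle:
        json.dump(record, handle)
    os.replace(tmp_path, path)  # readers never see a half-written file


def check_status(path: str, max_age_seconds: float = 120.0,
                 now: Optional[float] = None) -> str:
    """Return 'ok', 'degraded', 'stale', or 'missing' for a status file.

    'missing' also covers a file that exists but cannot be parsed.
    """
    try:
        with open(path) as handle:
            record = json.load(handle)
    except (OSError, ValueError):
        return "missing"
    current = time.time() if now is None else now
    if current - record.get("timestamp", 0) > max_age_seconds:
        return "stale"  # the application has stopped reporting
    return "ok" if record.get("state") == "ok" else "degraded"
```

The "stale" result matters as much as the reported state: an application that has hung can't tell you so, but its silence can.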
Whether you build or buy, obtaining application management capabilities is critical to improving your overall uptime. As with integrating your system and network tools, adding application data into your management console makes it that much easier to find and fix problems. If you could see, for example, that one of your Oracle instances had a low cache hit rate, or that the number of locks on a particular file was too large, you might be able to prevent a problem by taking appropriate action. Even if a problem did arise, you might have more data to help debug the situation. Again, correlating application-level data and network and system data makes life much easier for your sysadmins.
At this point, even if you agree that you need tools at all three layers, you may be overwhelmed at the thought of installing and integrating an entire network management system. Luckily for you, several vendors would be more than happy to do this for you. All it takes is an enormous truckload of cash.
Companies like Computer Associates, Tivoli, and Platinum offer soup-to-nuts suites of tools that will integrate a wide variety of management products. In some cases, you can incorporate your existing tools into these management frameworks. In other cases, you'll have to use the tools these vendors provide.
Integrated management suites offer a number of benefits, but have some serious drawbacks as well. On the plus side, these tools do incorporate a wide range of management services under one umbrella framework, guaranteeing that all these different tools will work together. They support all the standard management protocols, work within networked environments, support a wide variety of platforms, and provide fairly sophisticated consoles that integrate a lot of information.
Unfortunately, buying into a suite means that you are buying a collection of products -- some good, some bad. No one vendor can ever produce the best of everything, and you often sacrifice one tool to get another. To truly get the best of breed for every management tool, you will have to forgo the safety of an integrated package, which can be risky. It also means that some of these tools will be off limits to you, because they are sold only as part of a suite. Computer Associates' Unicenter product was only recently unbundled so that you can, supposedly, buy individual components. Even so, I question how well these tightly integrated products can function by themselves. Products from Platinum, in contrast, are designed to be unbundled, making it easier to buy point solutions or product subsets from Platinum.
No matter which tool you buy, installing and configuring these tools is hard. It will take tremendous effort to make a large suite of tools work exactly right for your environment, and you will be faced with long-term support issues. Still, it may be that the improved uptime and reduced management costs justify the initial effort needed to install these tools. If nothing else, every vendor is more than happy to provide a consultant to do the installation for you -- for an additional fee.
Rolling your own
It may be that management suites are simply too expensive, too complicated, or too broad for your shop. You may only need a few small tools, or may not have the time and money to invest in a lot of automation. In many shops, simply getting more data in front of an operator is sufficient and cost-effective. In these cases, you might want to just build your own system.
This doesn't mean you write every tool from the ground up. The idea is to acquire a few useful tools that meet most of your needs, and to bolt them together on your own. Every good systems administrator knows how to write scripts that tweak and manage their systems, and these scripts are often the foundation of a management environment. For example, you can certainly buy a tool that, among other things, monitors swap space utilization on your system. You can also write a quick script that checks the swap space and sends someone e-mail if things look bad. Run the script every 10 minutes (or whenever) with cron, and shazam! Instant swap-management tool. Add a few scripts to track disk usage and the load average, and suddenly you're running real-time systems management.
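Here is roughly what such a script might look like, written in Python rather than shell. Treat it as a sketch under several assumptions: it reads swap figures from /proc/meminfo, which exists on Linux but not on every Unix; it mails through an SMTP server on localhost; and the thresholds, the addresses, and every function name are placeholders.

```python
import os
import shutil
import smtplib
import socket
from email.message import EmailMessage


def swap_used_fraction(meminfo_text: str) -> float:
    """Fraction of swap in use, parsed from /proc/meminfo-style text (Linux)."""
    fields = {}
    for line in meminfo_text.splitlines():
        key, _, rest = line.partition(":")
        parts = rest.split()
        if parts and parts[0].isdigit():
            fields[key.strip()] = int(parts[0])  # values are reported in kB
    total = fields.get("SwapTotal", 0)
    if total == 0:
        return 0.0  # no swap configured, so nothing to run out of
    return (total - fields.get("SwapFree", 0)) / total


def find_problems(swap_fraction, disk_fraction, load_1min,
                  swap_limit=0.80, disk_limit=0.90, load_limit=8.0):
    """Return a list of readable warnings; an empty list means all is well."""
    problems = []
    if swap_fraction > swap_limit:
        problems.append(f"swap {swap_fraction:.0%} used (limit {swap_limit:.0%})")
    if disk_fraction > disk_limit:
        problems.append(f"disk {disk_fraction:.0%} used (limit {disk_limit:.0%})")
    if load_1min > load_limit:
        problems.append(f"load average {load_1min:.1f} (limit {load_limit:.1f})")
    return problems


def send_alert(problems, to_addr, from_addr, smtp_host="localhost"):
    """Mail the warnings to an administrator (assumes a local SMTP server)."""
    message = EmailMessage()
    message["Subject"] = f"{len(problems)} problem(s) on {socket.gethostname()}"
    message["From"] = from_addr
    message["To"] = to_addr
    message.set_content("\n".join(problems))
    with smtplib.SMTP(smtp_host) as server:
        server.send_message(message)


def main():
    """One monitoring pass: gather the numbers, mail only if something is wrong."""
    with open("/proc/meminfo") as handle:
        swap = swap_used_fraction(handle.read())
    usage = shutil.disk_usage("/")
    problems = find_problems(swap, usage.used / usage.total, os.getloadavg()[0])
    if problems:
        send_alert(problems, "admin@example.com", "monitor@example.com")
```

To put it to work, add a call to main() at the bottom of the file and schedule it with a crontab entry along the lines of */10 * * * * /usr/bin/python3 /usr/local/bin/check_health.py, where the path is whatever you choose. Keeping the threshold logic in find_problems, separate from the code that gathers the numbers, makes it easy to test and to extend with further checks.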
Don't think that these quick hacks are somehow inferior to the expensive tools you might buy. Many of these tools, at their core, are nothing more than fancy scripts that look at the same system parameters that you do. What these tools have, that you don't, are fancy graphical front ends and broad support for a variety of different systems. If you don't need fancy front ends, don't have many different systems, and have a little bit of scripting expertise, you can put together a few effective tools without a lot of effort.
The point of all this is not how you get the tools, but that you have them at all. Simply put, you cannot run a networked computing environment without good tools at every layer -- from the network, to the systems, to the applications. Whether you shell out a lot of money and install a huge tool suite, or hand someone the task of writing a few important shell scripts, you need to begin building a management infrastructure to keep you ahead of your systems. Otherwise, your users will have your head.
About the author
Chuck Musciano has been running various Web sites, including the HTML Guru Home Page, since early 1994. He serves up HTML tips and tricks to hundreds of thousands of visitors each month. He's been a beta tester and contributor to the NCSA httpd project and speaks regularly on the Internet, World Wide Web, and related topics. Chuck writes SunWorld's Webmaster column and is currently CIO at the American Kennel Club. Reach Chuck at firstname.lastname@example.org.