OpenVision offers a set of systems management tools to help tend large sites. These tools can be used separately, though they are most powerful when combined.
We reviewed the most widely applicable tools, including a networkwide system monitor sporting a friendly user interface (OpenV*Event Manager), a real-time performance and status monitor with a long-term performance analysis tool (OpenV*Perform/Trend), and a batch processor (OpenV*Scheduler).
We installed the software in a large network with mixed networking protocols. There are approximately 1,800 machines on the net with gateways to Novell and Token Rings. The test bed is a corporate setting and contains chip design, testing, and management systems. Most systems on this network are Sun Microsystems desktop computers.
If you are managing a medium to large site, you should strongly consider systems management tools such as OpenVision's OpenV series. (Tivoli Systems and Computer Associates make similar software. We asked both to participate in this review; however, both declined, stating their tools are too complex to review.)
The devil's details
OpenV*Event Manager -- The Event Manager, the most useful of the tools we tested, is a two-part tool. The Management Interface (MI) provides a real-time view of the monitored network. The Management Agent (MA) is a software module that resides on each computer under surveillance. There can be more than one MI collecting information from the Management Agents, which is good, since each Management Interface is limited to 1,024 Management Agents. This made it impossible to monitor all the nodes in our test using one Management Interface. While this sounds like a nasty limitation, in practice it shouldn't pose a problem, since most networks are separated into groups anyway.
The MAs communicate with the MI via Remote Procedure Calls (RPCs). The view of the network from the MI is controlled via a Motif-based GUI. It's a typical tool of the client/server genre.
OpenV*Perform/Trend -- (During the writing of this review, OpenVision announced it is combining Perform and Trend into a single package.) OpenV*Perform is a real-time performance analysis tool. It relies on Performance Agents, one installed on each monitored node, which pass data back to the monitoring tools. Perform allows monitoring of multiple nodes simultaneously. With Perform, system administrators can display graphs of important remote performance data. It supports a graphical mapping feature, so administrators can set performance threshold levels and see in seconds whether those thresholds have been exceeded.
OpenV*Trend is a long-term performance analysis tool that presents a graphical analysis of the stored performance data.
OpenV*Scheduler -- The OpenV*Scheduler provides enhanced batch processing facilities for Unix systems. Besides standard cron- and at-style scheduling, OpenV*Scheduler provides queue management tools and job control across networked machines. It also provides enhanced security features. Some of these features can be quite useful; for example, otherwise idle CPU cycles can be used efficiently.
Installing & removing
Because management tools can introduce security problems and administration headaches of their own, we spent a great deal of time checking which files were affected during installation. The installation guides are separate documents included with each OpenV module. All are to the point. The installation guide for the OpenV*Event Manager, for example, is only 10 pages long.
The installation of the Event Manager has four steps. First is completing the Management Station installation worksheet. While we love to see these and wish more vendors included worksheets, we object to OpenVision's suggestion that a system administrator record the superuser password on the worksheet. We offer a thumbs-up to OpenVision for listing the commands users will need to generate the information required during installation, though for the Event Manager it is already clear what is wanted: simply the hostname, IP address, architecture, and the amount of free disk space.
The second step in the Event Manager installation process is the Pre-Installation Setup. This is simply a checklist to be certain that all of the software and hardware requirements are met. The Pre-Installation Setup checklist has a section titled Sun Dependencies, where a line listing Sybase 4.9.2 implies Sybase is required to run the Event Manager. It is not: OpenVision uses its own database format for its tools and does not require an external database.
The third step is the Managed Node Pre-Install Setup. As before, this is merely a checklist for each managed host.
The fourth and final step is the installation. We received the software on an 8mm tape. Installation is easy. There are three files on the tape: two are compressed tar files of the actual binaries and scripts; the third is the actual installation script. We love this feature. It allows the administrator to untar the tape into an NFS exported filesystem, then install the Management Agents from that filesystem on all managed hosts. This is much easier than either dragging the tape around to different hosts (which we've had to do with other packages) or running the installation over the network (which is another common tactic).
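In outline, the procedure looks something like this; the tape device, directory, and script name below are our placeholders, not OpenVision's:

    # On the host with the tape drive: extract into an NFS-exported directory
    cd /export/openv                  # assume this directory is exported
    tar xvf /dev/rst0                 # typical 8mm tape device on a Sun

    # On each managed host: mount that filesystem and run the installer
    mount tapehost:/export/openv /mnt
    cd /mnt
    ./install                         # placeholder installation script name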
After the software is untarred from the tape, the administrator runs the installation script. We did not do this as root, as is recommended, because we distrust installation scripts that run as root. After a polite warning, the script went ahead and started. When it can't perform a necessary task because it needs root privileges, it prints a warning and asks that you perform the task separately as root. Here, it asked for two things: a named pipe created in /dev, and some start-up commands added to /etc/rc.local. Neither task is dangerous, so we endorse running the script as root.
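The two tasks amount to something like the following; the pipe and daemon names are placeholders (the script prints the exact commands it needs):

    # Create the named pipe the software uses under /dev
    mknod /dev/oemi_pipe p            # placeholder pipe name

    # Append start-up commands to /etc/rc.local, along these lines:
    if [ -f /opt/openv/bin/oema ]; then
            /opt/openv/bin/oema &     # start the Management Agent at boot
    fi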
The installation script for the Event Manager allows the administrator to specify whether they want the Management Interface, the Management Agent, or both installed. This allows one script for all three installations, making life easier for the administrator.
Installing Perform and Trend is also easy. The only thing to keep in mind is that they must be installed in the same directory as previously installed OpenV tools, which is not a problem. The system administrator must also hand-edit the services file and rebuild the services NIS map.
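This is a standard NIS chore; a sketch, with the service name and port number as placeholders:

    # On the NIS master: add the OpenV entries to /etc/services
    echo "ov_perform 7777/tcp" >> /etc/services    # placeholder name/port

    # Rebuild and push the services map to the NIS clients
    cd /var/yp
    make services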
Installing the Batch Scheduler, however, is more complex, though not unreasonable. It requires a new directory (/var/spool/ovbatch), a new userid (batch), and a change to users' search paths. The software does not assume you want to replace all cron jobs; it gives you the option of converting existing cron entries to the new system.
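The manual steps are routine administration. A sketch for SunOS, with the uid, gid, and install directory as placeholders (use vipw or your usual method to edit the password file):

    # Create the spool directory and the batch user
    mkdir -p /var/spool/ovbatch
    echo "batch:*:999:999:OpenV Scheduler:/var/spool/ovbatch:/bin/sh" \
            >> /etc/passwd            # placeholder uid/gid; prefer vipw
    chown batch /var/spool/ovbatch

    # Each user adds the Scheduler binaries to the search path (csh example)
    set path = ($path /opt/openv/bin) # placeholder install directory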
Removing the Event Manager is a simple task. No binaries or kernel modifications were performed at installation, so removing it is simply a matter of shutting down the daemons, removing the start-up commands from /etc/rc.local, and deleting the installation directory. You can also run the install script and select the remove option.
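By hand, the removal is only a few commands; the daemon name and installation directory are placeholders:

    # Stop the daemons (names are placeholders; check with ps first)
    kill `ps ax | grep oema | grep -v grep | awk '{print $1}'`

    # Delete the OpenV start-up lines from /etc/rc.local, then the tree
    vi /etc/rc.local
    rm -rf /opt/openv                 # placeholder installation directory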
Removing Perform and Trend is done the same way. Even removing the OpenV*Scheduler is no problem, because all the binaries are kept in one directory.
How the tools work
Starting the Event Manager is a matter of making sure the application has permission to write to your display, then typing oemi. Initially, three windows appear: Alarm Admin, Network Admin, and Node Admin. We used a virtual window manager (tvtwm), and the default window placement cannot deal with a virtual desktop: all the windows appear on the home display page, rather than abiding by the RandomPlacement variable setting. We would prefer it to honor existing X11 resource settings.
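Starting it from a remote login is the usual X11 dance; a sketch, assuming the Management Interface host is named mihost and the local workstation is named ws1:

    # On the local workstation: let mihost open windows on this display
    xhost +mihost

    # On mihost (csh syntax): aim at the local display and start the MI
    setenv DISPLAY ws1:0.0
    oemi &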
The only problem we had installing or running the software was that the ownerships of the executables must be set properly; in our case, we had to chown the directory to root. When we installed the Management Agent on a second node as root, we expected the ownerships to be correct, but occasionally they were not. It is a good idea to chown the directory tree to root, whatever the installation method.
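The fix is a single command; the installation directory is a placeholder:

    # Give the whole installed tree to root, whatever the install method
    chown -R root /opt/openv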
The user guide is laid out well. The earlier chapters cover the color conventions and the meaning of the elements in the start-up windows; functionally, the guide is a tutorial on configuring the Event Manager for a particular site. We were monitoring nodes two hours after starting the installation, which is amazing.
The Event Manager is built around the idea of alarm conditions. When an alarm condition occurs, the Management Interface is notified, and other actions can be taken (such as paging or e-mail). The alarms are configurable for each managed node.
The first step in setting up a network for monitoring is adding the monitored nodes. This is done via the Network Admin window. The administrator simply selects add node from the Edit pull-down menu, and a dialog box appears. The administrator types in the name of the machine and selects what kind of machine it is (server, workstation, and so forth). This must be done for each monitored node. Note that a monitored node is different from a managed node: at a very low level (ICMP ping), OpenV*Event Manager can monitor hosts that do not have the Management Agent installed. Such hosts appear as gray icons in the Network Admin window if they are reachable via ping, black otherwise. Nodes that have the Management Agent installed are normally white icons.
After the nodes are added, they must be enabled. This is simply a matter of selecting enable/disable nodes from the Edit pull-down in the Network Admin window. At first, this seemed redundant, but it is not. In the lower-right corner of the Network Admin window is a box called the bullpen. It contains small squares representing each monitored node (in the appropriate color). If you want to watch only selected hosts, it is convenient to be able to disable the others so they don't appear in the bullpen.
The color scheme used in the Event Manager is excellent and intuitive. Nodes that are down or unreachable are designated with black icons. Nodes that are up, but do not have the Management Agent installed (or running) are gray icons. Nodes that are up, with no alarms, are white icons.
The Event Manager has three alarm levels. The mildest, Information, is blue; a Warning is yellow; Critical is red. As the number of alarms increases, the shade deepens. If one Critical alarm is triggered, for example, the icon for that host becomes a pleasant rose pink. As more alarms are triggered, the color moves toward "arrest-me" red.
The Event Manager is quite useful for a quick overview of critical machines, allowing us to watch machines shut down and reboot. We found it amusing to have a user call and complain that a server had gone down, and we replied, "Yes, we know." We recommend this product for sites that lack host monitoring software of any kind.
The Event Manager also supports the idea of groups, so you can cluster your monitored hosts into categories. This grouping can be any logical entity you like. For example, you could group by department, by administrative domain, by NIS domain, or by subnet. If you group by subnet, you can easily detect routing failures, since entire groups of machines will suddenly go black.
The Alarm Admin window allows control of outstanding alarms. At the top, there are four blocks: red, yellow, blue, and white. The red box shows the number of outstanding critical events. Likewise, the yellow box shows the number of outstanding warnings, the blue box shows the number of outstanding informational alarms, and the white box has a total.
Below the totals is a section of radio buttons that restricts which alarms appear in the alarm summary at the bottom of the window. The alarm summary lists each selected outstanding alarm on a single line. If you click on one of the outstanding alarms, the Event Manager opens a window containing the full details.
The Alarm Admin window, combined with the information in the Network Admin window, gives an instantaneous look at the status of the monitored network. Setting the threshold conditions for each alarm can be a tedious task, but once they're set it takes just a quick glance to see what's happening on the network. Resetting the thresholds for the alarm conditions is easy.
After the initial task of setting up alarms, the system notified us of problems as they developed. For example, we had it configured to watch disk space on critical servers. Twice during the first week, we handled filling disks without the users ever noticing a reduction in service, which wouldn't have been possible before. This is a key feature of this tool: it allows you to address system problems proactively rather than reactively.
We were also experiencing intermittent problems with one of our servers. We believed it to be memory related. It took seconds to start memory monitors to watch pages in and out, swapping, and disk activity. We narrowed the problem down to specific conditions, which led to a quick resolution. Who knows how long it would have taken to solve this problem without the Event Manager.
The Node Admin window contains a graphical representation of the node hardware and a log of all the alarm and element-monitoring commands. Errors are also reported in this window.
There are also some more advanced features in the Event Manager. It is possible to record data from managed nodes for later analysis. This was helpful with a machine that crashed occasionally, which made interactive analysis difficult. Using the playback feature, we could go back through the event sequence leading up to the failure.
Another advanced feature we found useful was the ability to bring up information about remote machines. If we want to clone the disk partitions from a remote machine to install a disk mirror, for example, we can pull up the existing partition map using the hardware graphical display.
After using the Event Manager for several weeks, we found a few things we'd like improved. For example, if you create a group, you cannot simply grab a machine icon and drag it into the new group. You must add it to the group and then disable it in the previous location.
Also, there is no way to arrange the bullpen graphically. It would be wonderful to arrange the bullpen according to geographical location. Another flaw is that it is not obvious how to remove nodes once they are added. These problems are almost insignificant compared with the software's utility.
Using OpenV*Perform is very similar to using parts of the Event Manager, and the operating concepts carry over, which makes it very easy to start using Perform once you have configured the Event Manager.
Perform relies on a Performance Agent to pass data back to the server. This Performance Agent must be installed on all monitored nodes. It is a simple installation, as noted above. Once these Performance Agents have been installed (and they can be added later, too), the system administrator can go ahead and configure the Perform tool.
Configuration involves editing the network map and then defining performance thresholds.
We found the network map interesting. Perform provides many map backgrounds on which to define your network map. You can zoom in on a single state, a region, or the entire United States, and it also allows you to draw maps for other countries. If you don't want a geographic background, Perform has a blank map background.
Once the administrator has selected a background, they add elements to the map: logical or physical network entities such as networks, backbones, domains, or resources. Each of these elements has attributes for which threshold levels can be set; the resource element, for example, can monitor load average.
Once the network has been defined, the system administrator sets thresholds for various items within each network element. These include input packets, output packets, disk reads, load average, and others, just like the Event Manager. The icons on the network map will change color to represent the status of the monitored nodes, as they do in the Event Manager.
OpenV*Trend provides analysis of longer-term events. For example, it is possible to monitor disk activity or load over a period of weeks or months. This can detect (oddly enough) trends in usage that may require management intervention to correct. Used in conjunction with Perform and the Event Manager, it can detect and analyze under- or overutilized machines or networks, allowing system administrators to engineer a solution proactively, before a problem arises. For example, heavy use of departmental compute servers with an increasing load trend may tell the administrators it is time to order additional computing power and networking elements. These tools allow information technology departments to shift from reactive to proactive troubleshooting, which is preferable from a stress and customer-satisfaction viewpoint. In other words, these tools can help your administrators do their jobs better.
Administrators can use Trend to capture performance data at intervals (ranging from seconds to months) for specific resources or activities. For example, it is possible to capture data for disk reads over a period of weeks for each disk on a server. This could be used to distribute software among exported filesystems more evenly.
Once the Scheduler has been installed, it is a simple matter of letting the users submit jobs. The Scheduler has some very nice features. One of these is the time specification for running jobs, which is much more granular and configurable than anything possible with cron. For example, OpenV*Scheduler allows the user to specify days to skip, so you can have a job repeat every three days, skipping Saturdays and Sundays (see the sketch below). Another interesting feature of the OpenV*Scheduler is the job editor, which allows the user to edit the time constraints for all the jobs in the queue.
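For comparison, plain cron cannot express "every three days, skipping weekends" directly; the closest equivalent is a daily cron entry driving a wrapper that does its own date arithmetic. A sketch, assuming a date command that understands the %j and %w formats, with all names as placeholders:

    # crontab entry: try every day at 2 a.m.; the wrapper decides
    0 2 * * * /usr/local/bin/every3days

    # /usr/local/bin/every3days (placeholder name):
    #!/bin/sh
    day=`date +%j`                                # day of year
    dow=`date +%w`                                # day of week, 0 = Sunday
    [ "$dow" -eq 0 -o "$dow" -eq 6 ] && exit 0    # skip weekends
    [ `expr $day % 3` -ne 0 ] && exit 0           # not our day
    exec /usr/local/bin/real-job                  # placeholder job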
By far, the most powerful feature of the job scheduler, and, in our experience, the most useful, is the conditional execution feature. Scheduler allows the user to set variables before and after the execution of a job. This is extremely useful in situations where several long-running jobs must be chained together, such as engineering applications where one job produces datafiles for use in the next job. OpenV*Scheduler allows the user to specify exit conditions so the jobs will terminate gracefully if an earlier job has not executed properly. This frees the engineer from the tedious task of watching each job and executing the next manually or, worse, from having a series of long-running jobs run with bad data.
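We cannot reproduce the Scheduler's own syntax here, but the effect resembles a carefully written shell chain, with the Scheduler doing the bookkeeping across queues and machines for you. A minimal sketch of the manual equivalent (job names and the recipient are hypothetical):

    #!/bin/sh
    # Chain three long-running jobs; stop cleanly if any step fails
    mesh_model input.dat mesh.out ||
            { echo "mesh failed" | mailx -s "chain stopped" engineer; exit 1; }
    solve_fem mesh.out solution.out ||
            { echo "solve failed" | mailx -s "chain stopped" engineer; exit 1; }
    postproc solution.out report.out ||
            { echo "postproc failed" | mailx -s "chain stopped" engineer; exit 1; }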
Documentation & support
Our first impression was that these packages are a little light on documentation. Each product includes release notes, installation notes, and a user manual. Several components also include a reference manual. The installation notes are quite clear, with well-defined sections for existing bugs, fixed bugs, and system requirements. Many packages leave you hunting around for the system requirements, and many do not include known problems. OpenVision is up front with this important information, which was good to see. For example, the Batch Scheduler installation scripts do not properly update the services NIS map; the additional services information must be added manually on systems using NIS. It would be very confusing if this information were hidden at the back of a 30-page release note. In hindsight, we believe the sparseness of the documentation is acceptable.
One drawback of the documentation is cosmetic. There are quite a few typographical errors, which, at times, can be distracting. Novice users, for example, may not know that the correct variable name is LD_LIBRARY_PATH, rather than LD-LIBRARY_PATH. Novice users, on the other hand, probably will not be using this software.
Overall, the printed documentation is well done. The instructions are clear and concise. The examples are consistent with the graphics provided and are general enough that the reader can quickly extrapolate from them to other circumstances.
OpenVision's support policies for these tools are average.
Are they worth the bother?
The OpenV*Event Manager is an excellent tool. It has a few mildly irritating quirks, but no showstoppers. We recommend it. The Perform and Trend tools provide many features we wished for while the Event Manager was the only tool we had running; whenever we thought, "We wish it could do this," we would later find OpenVision had already thought of it.
The test site does not use extensive batch processing, so the OpenV*Scheduler is not needed here. However, we have consulted at materials-engineering sites where we would have loved to have this product to manage long-running finite-element analysis jobs.
OpenVision has produced an excellent set of software tools that are very useful for system administration at medium to large installations.
About the author
Philip R. Moyer (philip.moyer@advanced.com) is a senior Unix administrator at Cirrus Logic and has administered large networks of Unix machines for seven years.
Why you need systems management tools
An excellent example of the need for such tools is disk space availability. With 30 or 40 servers, each with at least 4 gigabytes of disk space, and only four human administrators to watch them, disks can fill quickly. The admins in this situation usually learn about a filled disk when a user complains. A sysadmin tool could warn of the filling disk, allowing the administrator to log into the server in question and free up some disk space, or purchase, format, and install a new disk, so users would, ideally, never see a "write failed; not enough disk space" message.
In a previous network administration job, we used a console switch to monitor our systems. This was a custom-built hardware and software tool that displayed all console messages from all servers on a monitoring xterm, allowing each administrator to watch all consoles simultaneously. This is better than the situation described above, where the admins don't even see the console messages unless they walk into the console room. It does, however, have some weaknesses. It does not, for example, allow administrators to detect disks filling before they print a message on the console. This kind of tool merely helps admins respond sooner; it doesn't allow them to take proactive measures.
A better tool is one that allows the admins to set warning thresholds for various resources, so they receive notice before the users see a performance degradation. This prevents a loss of productivity. It also makes the admins look good! With a tool like this, the admins will get a notification (via e-mail, GUI, page, or some other mechanism) when disks begin to fill.
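Even without buying a tool, the idea can be approximated with a cron job, though without the GUI, the history, or the central view. A minimal sketch; the 90 percent threshold and the admins mail alias are placeholders:

    #!/bin/sh
    # Warn the admins about any filesystem more than 90 percent full
    full=`df | awk '$5 ~ /%/ && $5 + 0 > 90 { print $NF, $5 }'`
    if [ -n "$full" ]; then
            echo "$full" | mailx -s "disk space warning" admins
    fi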
System administration tools differ from network administration tools in that they generally look at different resources. There is some overlap, though. A network tool would look at things like percent utilization of a network segment, or, as another example, IPX packets as a percent of total packets. An administration tool would look at host-specific things, such as number of packets received on a particular IP interface. An example of overlap would be packet collisions, which both the system and network administration tools would monitor.
Keeping in mind that system administration tools monitor host-specific resources, we can list resources we might want to monitor. We've already talked about disk space availability, but we might want to look at other disk resources, such as reads and writes. For example, say we have a server with three 2-gigabyte SCSI disks, each holding a filesystem that we export via NFS. With a system administration tool, we might find one disk has a disproportionately high percentage of the system's disk reads. Software on that disk could then be redistributed among the other disks to improve I/O performance.
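Without such a tool, the manual approach is to sit and watch iostat; a sketch, assuming SunOS-style iostat and disks named sd0 through sd2:

    # Sample per-disk activity every 5 seconds and compare the
    # read/write columns across the three disks
    iostat -D sd0 sd1 sd2 5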
System administration tools would also monitor memory resources, such as percent free memory, percent used memory, pages in, processes swapped out, page faults, and so on. CPU resources could be monitored as well: load average, context switches, percent utilization, and so forth.
Essentially, a system administration tool allows system administrators to monitor remote resources graphically, and more efficiently than if they were logged into the machine in question, running command-line programs (such as vmstat) to watch performance.
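For reference, the command-line equivalent it replaces looks like this; the field names are SunOS vmstat conventions:

    # Report virtual memory and CPU statistics every 5 seconds;
    # pi/po are pages in/out, cs is context switches per second
    vmstat 5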
Documentation -- The documentation is good. There are typographical errors that distract the reader, but these are minor compared with the excellent layout and content. The design of the documentation takes the user from installation through configuration and use. If you have a question later, the layout allows the documentation to be used as a reference manual; it is quick and easy to get answers. Electronic docs consist of man pages; we award higher scores to documentation that includes hypertext help akin to Sun's AnswerBook.
Support -- The company offers a 180-day warranty and two support contracts: the basic plan costs 18 percent of the purchase price and entitles the user to support 12 hours a day, 5 days a week; a 24-hour, 7-day-a-week plan costs 23 percent of the purchase price.
Performance -- The Management Agents, which run on the monitored hosts, place a negligible load on the machine. The Management Interface, which displays the GUI, has a slightly higher load. At no point, though, did we hesitate to install the software on production servers for fear of performance hits, and this confidence was warranted: we saw no noticeable degradation. It was necessary to run the monitoring software to detect the impact of the monitoring software! Response and start-up times were not instantaneous, but they were very good for a networked monitoring tool.
Interface -- The interface is Motif-based. We ran it remotely and displayed it on a local workstation. It is easy to start and almost intuitive to use. We noted a few nits about layout, and the tool does not place windows correctly when a virtual window manager is in use.