Today's networks are complicated. Here's how to avoid LAN meltdowns that grind your users to a halt.
This article examines the issues underlying network reliability, including network performance and traffic control, physical access and security problems, bandwidth allocation, configuration files, service dependencies, application exposures, suggested metrics, and measurement techniques designed to help demonstrate you have your network under control.
Most of us think of the network as a single entity that can be managed, coached, and coaxed as needed. That's certainly the goal, and even the theme of Sun's latest advertising campaign. In reality, any network is a loosely managed team of components, spanning wiring closets, configuration files, and applications that use or abuse resources. It's rare that a single point of failure can be identified easily, or that the same element fails repeatedly. Failures in one part of the infrastructure affect applications or services several logical layers away, masking the true source of the problem. Physical plant problems ripple upstream, disrupting name or file services. When one player fails, the whole team goes south.
As businesses become more dependent on the correct operation of networks for groupware, Internet, intranet, and e-mail capabilities, network reliability will become another buzzword adored by analysts and touted by vendors. We'll look at the issues underlying network reliability, including but not limited to network performance and traffic control. We'll examine physical access and security problems, configuration files, and service dependencies. From there it's on to application exposures, such as unduly long network latency or resending lost requests. We'll conclude with some suggested metrics and measurement techniques, designed to help you demonstrate that you have your network team under control.
Go team! Yeah team! An end-to-end approach
How do you take a team approach to network reliability? Instead of focusing on the individual components, look at how they interact and form an end-to-end system, and examine the relationships of network devices and subsystems to higher level services. Taking the team analogy a step further, define some ideal attributes for a reliable, well-built network:
A good "team approach" example is the United States telephone system. No matter where you go to plug in your handset, the RJ-11 jacks appear the same. The dialtone is the same, as are the touchtones and dialing sequences from any access point. There is only minor variation in the time required to connect a call, and the system has enough fault tolerance built in to survive a number of failures on the natural disaster scale. The telephone service model is appropriately strong because it describes a distributed data center -- predictable, regular, reliable performance right to the end user access point. Perhaps the most impressive feature of the telephone system is that its reliability is transparent -- you don't "feel" the system. Rarely does a caller notice the changes in topology that occur in response to congestion or failures, because reliability is designed into the system in layers.
We'll cover the layers of network reliability from the ground wire up:
Unfortunately, most people associate "network reliability" with simple congestion control and end-to-end connectivity. When packets are flowing, they consider the network reliable. This narrow definition skirts the issues listed above, and it's also the source of problems that make distributed computing seem riskier and more costly than centralized, host-based architectures. Reliability has to be built into every component in the system: it's a function of design, not simply location. You're building a distributed data center, and the network connecting the distributed components has to be as reliable as the centrally managed ones. When the pieces are assembled in a well-designed, well-matched system, the end result is both reliable and predictable.
Plant it: Cables and access point management
The logical starting point for reliable network design is the physical cable plant -- the wires, fibers, hubs, transceivers, closets, cable raceways, and wallplates that form the foundation of your network. You want the physical layer to be as reliable as possible, but you're limited in power. Mechanical components from hammers to network cables invariably fail, and always at the least opportune moment. Network wiring is especially susceptible to failure because it is exposed. Often, Ethernet taps run under desks, and thinwire (10base2) daisy chains are unintentional targets of the cleaning service's vacuum cleaners. The trend toward switching hubs, with point to point twisted pair wiring, bodes well for improved reliability. A single failure affects only one machine, not an entire leg of the network or cluster of machines on the same run. Minimizing the number of components affected by a failure is your first design principle, and one that we'll revisit later.
Your best defense against cable faults is a good offense, complete with diagnostic tools and equipment. A simple ping or script that performs a "reachability" test is not sufficient to diagnose all cable problems. A host may answer the ping after a very long delay caused by a fault that injects noise or runt packets onto the network. It's best to know the warning signs of a noisy, broken, or failing cable:
Collision rate, computed from netstat -i by dividing the number of collisions by the number of output packets. Note that collisions are only counted when that machine transmits, so lightly used machines will tend to underreport the actual rate.
Output errors in the netstat -i output. Output errors occur when there's no sending buffer space for the outbound packet, or when the packet cannot be sent because of non-stop collisions, carrier detects (network jam), or other transmit errors that cause a back-off and retry. Under normal conditions, you shouldn't see any output errors. Disconnecting a network cable while the machine is in mid-network-sentence, however, will generate a flurry of errors, so be sure to correlate any errors to machine migration or other desktop work.
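The collision arithmetic is easy to automate. What follows is a minimal sketch in Python, assuming netstat -i prints a header row containing the Opkts, Oerrs, and Collis column names; the function name and the sample numbers are made up for illustration and are not taken from a real machine.

```python
def interface_health(netstat_output):
    """Map interface name -> (collision percentage, output error count)."""
    lines = netstat_output.strip().splitlines()
    header = lines[0].split()
    opkts_col = header.index("Opkts")
    oerrs_col = header.index("Oerrs")
    collis_col = header.index("Collis")
    health = {}
    for line in lines[1:]:
        fields = line.split()
        if len(fields) != len(header):
            continue  # skip rows that do not line up with the header
        opkts = int(fields[opkts_col])
        if opkts == 0:
            continue  # no transmissions, so no meaningful collision rate
        collisions = int(fields[collis_col])
        health[fields[0]] = (100.0 * collisions / opkts, int(fields[oerrs_col]))
    return health


# Illustrative sample only; the counters are invented.
SAMPLE = """
Name  Mtu  Net/Dest  Address    Ipkts   Ierrs Opkts   Oerrs Collis Queue
lo0   8232 loopback  localhost  1000    0     1000    0     0      0
le0   1500 mynet     myhost     250000  3     200000  1     4000   0
"""

if __name__ == "__main__":
    for name, (rate, oerrs) in interface_health(SAMPLE).items():
        print(f"{name}: {rate:.1f}% collisions, {oerrs} output errors")
```

Run it against several machines on the same wire, since a lightly used host will understate the rate.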
In general, it's a good idea to have the appropriate test equipment on hand if you're going to be responsible for the spaghetti that often resembles a cable nest. Wiring and network contractors that install and maintain the physical plant should be properly equipped, but if you're going to do those jobs yourself, make sure you have access to the right gear for your cabling. When half the machines on the network are spewing messages about network jams, it's embarrassing to ask for purchase order approval to buy a network analyzer. At the very least, locate a shop that will lease equipment on a short-term basis.
What kind of cabling should you use? From a reliability perspective, you need to worry about the quality of the connections between cable endpoints and other devices, and the impact of radio frequency interference (RFI) on the signal quality. High levels of RFI, such as the noise picked up by cable raceways tucked behind the elevator motor room or running near poorly shielded high-voltage power lines, call for shielded twisted-pair wiring, or fiber in extreme cases. If you're going to use fiber, it pays to have a professional installation done because the fiber requires careful treatment of the ends and connection jackets.
The final physical issue to address is security of the network access points. If you worry about unattended PCs being used to tap into your network, you should also worry about unattended network connections becoming a conduit as well. Any connectors left out in the open are ripe for attack, a risk that affects both desktop devices and the wide-area and leased lines that leave your building. Are you sure that your private lines come into a secure area? Could an intruder put a pair of alligator clips on your incoming phone line, and watch the 1s and 0s go by on your leased line? When you make your living on the net, and losing your connection is fatal, it's worth at least ensuring that the line comes into a locked room. The best reliability and security engineering don't help if your network access points are left dangling, unprotected, where untrusted persons can have a field day with them.
If this seems to be a bit of hyperbolic Usenet ranting, think again about your consultants and contractors. Have you security screened them? Are they included in the web of trust you extend to employees and others who can reach under a desk and unplug a live Ethernet tap? Extend the scope of your reliability and security analysis out to the value of your data, and the extent to which someone might go to gain access to it. Plugging a rogue PC into an available twisted pair connector is far easier than breaking into a workstation. The new network addition may be added with the best of intentions, but its choice of IP address or protocol stacks may interfere with other network traffic. The larger these risks, the more you don't want to leave the access points standing naked.
Guaranteed traffic: Dealing with network rush hour
Making sure the bits get from one end of the network to the other is the first problem, but it's also the easiest one to solve. As you move up a layer from the physical wiring and connectivity to the network traffic level, you get into issues of bandwidth consumption, performance, and other black arts. Ensuring network reliability means that you have an unencumbered path between any two points, and that you'll be able to guarantee some minimum throughput over that path.
Bandwidth and throughput are numbers that vendors love to toss around. Bandwidth is what you can theoretically achieve out of the wire -- it's an upper limit, and one that is rarely reached. Throughput tells you what you can obtain in practice. Many factors reduce effective throughput over a network: congestion resulting in contention for the media, inefficient protocols, overloaded routers, and hubs that drop packets. Monitor your network utilization, as well as the typical throughput between pairs of machines, looking for periods of peak usage that contribute to network problems. Users may think that the network is "down" but in fact it's excruciatingly slow due to heavy demand and traffic volumes. Be sure to tabulate the types of traffic you see at peak times -- are you suffering from an ftp party at the end of the day, or is there a regularly scheduled video conference that saps every available network pipe on Tuesday afternoons?
Watch for traffic anomalies, such as broadcast storms, caused by machines that are configured with improper broadcast or IP addresses. A simple snoop filter, for example, snoop -d le0 broadcast, will show you the volume of ARP and other broadcast packets. Malformed sequences, such as multiple replies to a single ARP request, or broadcasts that result in a flurry of ARP requests for IP addresses ending in .0 or .255, are the first isobars of a pending broadcast storm. While normal traffic peaks may die down quickly, broadcast storms often take several minutes to subside. For a quick and dirty indicator of storm-like activity, try playing with the audio option of snoop or etherman:

    snoop -a -d le0 broadcast

When your speaker box sounds less like a Geiger counter and more like a cheap electronic keyboard, you need to take a look at the traffic statistics.
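If you'd rather not rely on your ears, the same idea can be reduced to counting. Here is a rough Python sketch that assumes you have already collected arrival times (in seconds) for broadcast packets from whatever capture tool you use; the function name and the threshold are hypothetical, and a sensible threshold depends entirely on your network's normal background level.

```python
from collections import Counter


def storm_seconds(timestamps, threshold):
    """Return the one-second buckets holding more than `threshold` broadcasts."""
    per_second = Counter(int(t) for t in timestamps)
    return sorted(second for second, count in per_second.items() if count > threshold)


if __name__ == "__main__":
    # Invented arrival times: a quiet second, a burst, then quiet again.
    arrivals = [0.1, 0.5, 1.0, 1.1, 1.2, 1.3, 2.5]
    print(storm_seconds(arrivals, threshold=3))
```

A run of consecutive flagged seconds lasting minutes, rather than a single spike, is the pattern that separates a storm from an ordinary peak.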
Sometimes local traffic patterns, such as those caused by unduly large ftp or http transfers, disrupt network availability to the point where you want to regulate access to services. In other cases, you may have a bona fide denial of service attack underway, in which someone is flooding your network with noise or generating non-stop connection requests to one of your servers. The solution to both problems is to install firewall or access controls at the entrance to the network, and on the hosts from which the services are provided. Having trouble with the local ftp maniac? Stick a firewall and proxy agent in between that turns off outgoing ftp connections before 5 pm. Techniques for filtering connections and enforcing access controls are discussed in this month's System Administration column.
In addition to providing selective access to services, firewalls and proxies let you build up minimal throughput guarantees by eliminating or re-directing non-essential traffic. Classes of service are nearly completely absent from the TCP/IP world, although class of service (COS) is popular in SNA networks. It's amazingly difficult to relegate less important traffic to the back of the network output queue, and to minimize latency for critical messages. The next generation of the IP protocol, IPv6, promises to support some mechanisms for defining class of service. For now, however, you have to rely on brute-force techniques of separating traffic with logically separate networks, or enforcing access controls based on host name, time of day, and traffic type. (See the list of resources at the end of this story for more on IPv6.)
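To make the time-of-day idea concrete, here is a small Python sketch of the decision a proxy might make before passing a connection. It is not the configuration syntax of any real firewall product; the rule format, field names, and function name are all invented for illustration, and the single rule mirrors the "no outgoing ftp before 5 pm" policy mentioned earlier.

```python
from datetime import time

# Hypothetical rule table: block outgoing ftp from midnight until 5 pm.
RULES = [
    {
        "service": "ftp",
        "direction": "outgoing",
        "blocked_from": time(0, 0),
        "blocked_until": time(17, 0),
    },
]


def allow_connection(service, direction, now, rules=RULES):
    """Return False if any rule blocks this service and direction at time `now`."""
    for rule in rules:
        if rule["service"] == service and rule["direction"] == direction:
            if rule["blocked_from"] <= now < rule["blocked_until"]:
                return False
    return True


if __name__ == "__main__":
    print(allow_connection("ftp", "outgoing", time(14, 30)))
    print(allow_connection("ftp", "outgoing", time(17, 30)))
```

Adding a host name field to each rule would extend the same check to per-host controls.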
Don't underestimate the importance of bandwidth guarantees. From the perspective of a desktop user, a network that's running red-hot with traffic is as useful as one that's not connected. Calls for traffic segregation start to fall along business lines: a decision support query that returns a megabyte to an eager marketeer is going to impact the OLTP transactions run by the front office. It's laudable that marketing is using the network to improve customer choices, but it helps if the customers can make choices in the first place. Fighting network congestion through careful design and monitoring is the first logical step you take toward providing reliability that is seen and felt by the end users and customers.
Don't decreasing networking costs and increasing bandwidth make this concern over guarantees a bit misplaced? Bandwidth isn't infinite, and it's certainly going to be a critical factor where heavy payloads converge. Let's say you've left your 10 megabits per second Ethernet dangling in the ceiling of your old building, and wired a new campus with 100 megabits per second Fast Ethernet. If you bring four of those networks together in the data center, each utilized at 50 percent or more, you're going to need at least 200 megabits per second just to handle the traffic streams without introducing latency or dropping bits.
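The arithmetic behind that 200 megabit figure is worth keeping handy for your own topology. A minimal Python sketch, with a hypothetical function name, that simply sums the offered load from each converging network:

```python
def aggregate_demand_mbps(link_speeds_mbps, utilization):
    """Total load, in megabits per second, where several networks converge."""
    return sum(speed * utilization for speed in link_speeds_mbps)


if __name__ == "__main__":
    # Four Fast Ethernet segments, each 50 percent utilized.
    print(aggregate_demand_mbps([100, 100, 100, 100], 0.5))
```

This is a floor, not a sizing rule: it assumes the utilization figure is an average and says nothing about bursts that arrive on all four segments at once.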
When everybody wants a nice 200 kilobyte per second stream from your video server you need to chop up the available bandwidth to make everyone happy. You can stagger the retrievals, and give each user full use of the network in round-robin fashion. Betting on politeness over politics is not a sure thing, so you could opt to have the users duke it out with simultaneous requests, then listen to them complain about the terrible network reliability. The ideal solution is to institute a bidding system for available bandwidth -- those users willing to pay a premium get a better quality feed, and those who are cost constrained get a low-resolution stream. These kinds of interactive, on-the-fly auctions will be a fundamental part of any kind of electronic commerce done over internets. Building the bidding tools, auctioneers, and service exchanges is a wide-open arena for Java applets and SafeTcl scripts.
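Before you hold an auction, it helps to know how many full-rate streams the wire can carry at all. The Python sketch below is a back-of-the-envelope calculation under two stated assumptions: a kilobyte is taken as 1,000 bytes, and the `usable_fraction` parameter (a made-up knob) stands in for the gap between theoretical bandwidth and achievable throughput.

```python
def max_streams(link_mbps, stream_kbytes_per_s, usable_fraction=1.0):
    """Whole number of concurrent streams that fit in the usable link capacity."""
    usable_bits_per_s = int(link_mbps * 1_000_000 * usable_fraction)
    stream_bits_per_s = stream_kbytes_per_s * 1000 * 8
    return usable_bits_per_s // stream_bits_per_s


if __name__ == "__main__":
    print(max_streams(10, 200))                        # 10 megabit Ethernet
    print(max_streams(100, 200, usable_fraction=0.5))  # half of Fast Ethernet
```

Once the number of eager viewers exceeds that count, you are into staggering, degrading, or bidding.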
After tackling the first two layers of reliability, you should have a solid plan in place for moving bit streams and providing a constant performance level even under peak loads. There's still a chunk of glue needed between applications and the wire -- configuration files and naming services -- that has to be made just as robust as the underlying pieces.
Soft consistency: Configuration information and service dependencies
Basic network configuration information includes host to IP address mappings (/etc/hosts), network routes, names of network services (/etc/services) and distributed filesystem usage (/etc/vfstab, /etc/fstab and the automounter). Network configuration data should promote network isotropism -- the net appears the same from any vantage point. Naming services like DNS, NIS, and NIS+ provide some measure of consistency. Policies for making changes, propagating the deltas, and coordinating updates among multiple management groups complement the name service. We've covered techniques for managing the change control process and the integration of network service configuration files in previous System Administration columns. (See the list of resources at the end of this story.)
The key questions to answer are:
Once we reach the configuration information layer, we can modify a machine's view of the network and services offered on the wire. By editing configuration files and network interface setups on the fly, it's possible to change a host's network usage, moving it from one network to another or from one target server to another. The configuration level is the first stage in which a fault in a lower level can be hidden from applications and users. Here's a simple example: let's say you connect two Ethernets to all of your servers, so that a failure in the hub, cabling or other machines on the one network doesn't render the server unreachable. In the event of a network failure, the server will switch from its primary network interface to the secondary one. How do you get the clients to start looking for the server on its new network?
The best answer is that the change should be transparent to the clients. That is, the server should not change names or IP addresses after it switches network interfaces. The clients may see a long timeout or delay while the server reconfigures itself, but the clients should not have to look up new IP addresses or walk down a list of host names to handle a server network failure. The easiest approach is to use a "virtual" IP address for the server, switching it from the primary to the secondary interface after a failure is detected on the primary side. Virtual IP addresses are created by adding new hostname and IP address pairs to an existing physical interface. For example, consider the following fragment of /etc/hosts:
    126.96.36.199 db1
    188.8.131.52 db1-primary # on le0
    184.108.40.206 db1-secondary # on le1

The host calls itself db1-primary, using the .201 network on the primary interface and the .202 address on the secondary interface. Management tools can talk to the machine over either interface, and the interfaces can ping each other and test for connectivity using these "private" names. To make the clients' life easier, however, the virtual address 220.127.116.11 is assigned first to the primary interface:

    # ifconfig le0:1 db1 up broadcast + netmask +

Packets sent to host "db1" show up on the primary interface, and the server responds over the same network. Now assume that the cable connecting the primary interface to the network fails, leaving the host temporarily disconnected. At this point, you'd fail the interface over to the redundant network:

    # ifconfig le0:1 down
    # ifconfig le1:1 db1 up broadcast + netmask +

These actions can be initiated by a script that monitors the network connectivity. To the clients, nothing has changed -- the same host is responding to queries, with the same IP address, albeit over a different cable. It's easiest to use different network numbers to avoid confusing the routing tables, but with some care and deliberate sequencing of the monitoring and reconfiguration scripts you can use a single network number and build up/tear down IP addresses as needed to migrate the public, virtual IP address to live interfaces.

What if the server that fails is providing a key service, such as NIS, NIS+, or DNS? With replicated servers for each on a network, the clients will eventually find a new source of network information, but what happens when the name service is corrupted to the point where it requires system administrator intervention? If you can't get the name service back on its feet, you may have trouble righting other network ships that start following it down. Most name and file services are tightly coupled to other network functions, so a failure in the network or in the service takes down the application as well. Once you've provided reliability at the network infrastructure level, including redundant network paths, it's time to put insurance plans into place for system software failures.

Here's a simple example of service dependency that leads to failure ripples: The RPC port mapper (rpcbind or portmap) exits without explanation on your sole NIS server on one network. An application on a client machine goes to read a file from a library server, residing on a central NFS server that is two router hops away. The application's NFS request hits the automounter on the client machine, and the automounter tries to resolve the host name into an IP address -- using NIS. With the portmapper out of service, the client can't bind to the NIS server, so the automounter hangs waiting on NIS.
To the application, it appears that the local machine has hung, the shared library NFS server has hung, or that the network is "out" somewhere along the double router hop. After a few minutes, the "NIS server not responding" messages begin to appear, and the problem can be resolved with some additional prying on the local network. But what happens when NIS is taken out of the equation, and DNS is used to resolve host names? The timeouts may get longer, and the longer period of silence makes it harder to track down the root cause of the problem. Here are failure modes and recovery tactics to consider when layering critical applications on top of network services like NIS, DNS and network license managers:
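One tactic in this spirit is to put your own deadline on name lookups, so the application reports a name service problem promptly instead of sitting in silence. The Python sketch below is only an illustration of that idea: the function name and the deadline value are invented, and the lookup simply runs in a worker thread that is abandoned if it doesn't answer in time.

```python
import socket
from concurrent.futures import ThreadPoolExecutor
from concurrent.futures import TimeoutError as LookupTimeout


def resolve_with_deadline(hostname, deadline_s=2.0, resolver=socket.gethostbyname):
    """Return the address for hostname, or None if no answer arrives in time.

    Lookup failures other than slowness (for example, an unknown host)
    still raise the resolver's own exception.
    """
    pool = ThreadPoolExecutor(max_workers=1)
    future = pool.submit(resolver, hostname)
    try:
        return future.result(timeout=deadline_s)
    except LookupTimeout:
        return None  # caller can log "name service not responding" right away
    finally:
        pool.shutdown(wait=False)  # do not block on a lookup that is still hung
```

The abandoned lookup keeps running in the background until the resolver gives up on its own, so this shortens the silence the user sees without curing the underlying hang.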
Lost Worlds: Dealing with unreliable transports
Applications built on top of TCP enjoy sequenced, reliable data transmission from one socket endpoint to another, provided the intervening routers, hosts, and configuration data are in order. The downside of using TCP is that it requires end-to-end handshaking to ensure data delivery; this means that the communicating hosts must remain in contact until the data is received and acknowledged. When UDP is used, the sequencing and delivery guarantees go away, and the application must make arrangements for requests arriving out of order, multiple times, or not at all.

There are a large number of application exposures with which to contend. All of the name and file service dependencies mentioned above, for example, apply to applications. If your transaction processing engine reads an NFS-mounted file for access control information (a bad idea we'll explore shortly), you've made it dependent on the remote server, the network between it and the client, and possibly an NIS server as well. Dealing with other network failures is more subtle, because it requires analysis of the handshaking between client and server. For example, an application that considers a TCP message "delivered" once the write() on the socket completes is going to fail when the TCP segments aren't delivered because the receiver was disconnected from the network. The application will eventually detect an error while writing on the socket, but without a clear acknowledgement protocol it's close to impossible to know the last request that was safely received by the other end.
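What a "clear acknowledgement protocol" looks like depends on the application, but the skeleton is small. Here is a Python sketch under invented conventions: each request is a line of the form id:payload, the receiver answers "ACK id", and the sender treats a request as delivered only when that answer comes back. The message format, function names, and timeout are all assumptions made for illustration.

```python
import socket
import threading


def read_line(sock):
    """Read bytes until a newline arrives or the peer closes the socket."""
    buf = b""
    while not buf.endswith(b"\n"):
        chunk = sock.recv(64)
        if not chunk:
            break
        buf += chunk
    return buf.decode().strip()


def send_with_ack(sock, request_id, payload, timeout_s=2.0):
    """Return True only if the peer explicitly acknowledges this request id."""
    sock.settimeout(timeout_s)
    try:
        sock.sendall(f"{request_id}:{payload}\n".encode())
        return read_line(sock) == f"ACK {request_id}"
    except OSError:
        # Covers timeouts, resets, and broken pipes: delivery is unconfirmed.
        return False


def ack_server(sock, received):
    """Hypothetical receiver: record each request, then acknowledge it by id."""
    buf = b""
    while True:
        chunk = sock.recv(1024)
        if not chunk:
            return
        buf += chunk
        while b"\n" in buf:
            line, buf = buf.split(b"\n", 1)
            request_id, _, payload = line.decode().partition(":")
            received.append((request_id, payload))
            sock.sendall(f"ACK {request_id}\n".encode())


if __name__ == "__main__":
    client, server = socket.socketpair()
    log = []
    threading.Thread(target=ack_server, args=(server, log), daemon=True).start()
    print(send_with_ack(client, "42", "debit 100"))
    client.close()
```

Because the sender knows the id of the last acknowledged request, it knows exactly where to resume after a failure; the receiver still has to tolerate a retransmitted request whose ACK was the thing that got lost.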
Other network failures that impact applications include:
ranlib disregard minor variations in the current time. When ls displays file modification times, it shows anything older than six months with a month, day, and year rather than a time of day. ls subtracts the modification time from the current time to gauge the file's age, which is fine until you have clock drift that makes the file appear to be modified a few minutes in the future -- making the time delta negative. The NFS-aware ls compensates for the clock skew, rather than displaying the file as having been modified in 1969.

With your applications safely ensconced in network reliable frameworks, it's time to look at the last risk posed by the network -- exposure of data or relationships that creates a security hole.
Safe and secure: Data exposure and network reliability
One of the primary drawbacks of distributed computing is that the safety of the centralized, access controlled data center is gone. When you run a transaction over a network, you also run the risk of exposing the contents of the transaction, its participants, and their relationships. If the source and destinations are not properly protected, you run the risk of having (maliciously) incorrect data inserted in the transaction.

Consider the previous example of a transaction system that reads an access control list from an NFS-mounted file. If a network intruder "spoofed" the NFS server identity, creating a machine that appeared to be the valid NFS server, he or she could easily insert a hand-crafted access control file. Anyone with unrestricted network access -- say from a PC on an unprotected network tap -- could listen for the NFS request and return a bogus NFS buffer.

Protecting the identities of parties to a transaction is just as important as securing its contents. Wall Street's "Chinese Wall" separating investment banking and trading operations is supposed to prevent the flow of inside information. But what happens when the traders can watch packets going to and from the corporate gateway, and surmise the existence of an investment banking relationship with ABC, Inc. because traffic is exchanged between their bank and abc.com? Protecting the players, their roles, and their data requires careful analysis of the paths over which the transactions travel, and the degree of protection that each requires.
So what do you measure to prove that you're making progress on the reliability front? Strive for consistency in all areas -- flat response times even under the heaviest load and in the face of growing network usage, fast and regular responses to calls for help, and aggressive resolution of network failures. Instrument applications as well, so you can determine where the clock time is expended -- on the client, the server, or the network. If 90 percent of the request response time is consistently taken up by host or client processing, then the network should shoulder minimal blame for performance or reliability problems. Here are additional metrics to track your reliability efforts:
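The instrumentation advice above boils down to a subtraction once you have the timings. A minimal Python sketch, assuming your application can report end-to-end response time along with the time spent in client and server processing; the function name is hypothetical and the sample figures are invented.

```python
def network_share(total_s, client_s, server_s):
    """Fraction of the response time not accounted for by client or server work."""
    if total_s <= 0:
        raise ValueError("total response time must be positive")
    return max(0.0, total_s - client_s - server_s) / total_s


if __name__ == "__main__":
    # A 10-second request with 9 seconds of host processing.
    share = network_share(total_s=10, client_s=4, server_s=5)
    print(f"network share of response time: {share:.0%}")
```

Track the share over time rather than as a single reading; what matters for reliability is whether it stays flat as load grows.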