Disk error detection

What tools can warn you of disk failure automatically -- and more efficiently -- before your disks fail?

July 1999

Abstract

This month, Adrian looks at the data Solaris collects about different kinds of disk errors with iostat and SyMON, and explains how to read it into the SE toolkit so that it can be included in an upgraded disk monitoring rule. (2,000 words)

Q: I sometimes see messages on the console when a disk has an error. How can I automatically look for a disk that is giving soft errors before it fails completely? Are there any tools that will do this for me?

A: Traditionally, the disk device driver reacts to a problem by logging an error via syslog, which is printed on the console and stored in the /var/adm/messages file. While this is useful, it's an inefficient way to report a problem, and it's difficult to ensure that messages aren't lost or ignored. It's also hard to parse the error messages, because you might not know all the possible error conditions that could be reported.

The SyMON 1.x and 2.x products include a log file monitor that watches the /var/adm/messages file and warns the user if it sees certain kinds of errors. A new way to monitor disks was added in Solaris 2.6, and the iostat command was extended with -e and -E options that report error counts. I've now also extended the SE toolkit to look at the same information as part of the disk monitoring rule.

This isn't normally considered a performance question, but disks that are doing multiple retries due to transient errors can cause hard-to-find performance problems -- and the performance of a dead disk is zero!

Monitoring errors with iostat
Since Solaris 2.6, a new kstat data structure has been maintained in the kernel for every disk, and iostat -E reports its contents. The example output below comes from an Ultra 2 with two disks and a CD-ROM. The disk name is given in the normal device form of sd1 unless the -n option is also used to translate the name to c0t1d0s0 form.

The first line gives the total number of soft errors, hard errors, and transport errors; then the device identification string is broken down into vendor, product, firmware revision, and serial number. This string does not always work out properly, as you can see for the CD-ROM, where the serial number appears to be a date. The format used is vendor specific and difficult to parse in a general way. Next the size is given, and a detailed error count breakdown showing the total number of errors since boot is provided. In this case the system has been powered down using the cpr(7) capability. Each time this occurs, the disks are left spun down until they are accessed, giving rise to a No Device error for each power up resume.

% iostat -E

sd0     Soft Errors: 0 Hard Errors: 7 Transport Errors: 0 
Vendor: SEAGATE  Product: ST32550W SUN2.1G Revision: 0418 Serial No: 04406244 
Size: 2.13GB <2127708160 bytes>
Media Error: 0 Device Not Ready: 0 No Device: 7 Recoverable: 0 
Illegal Request: 0 Predictive Failure Analysis: 0 

sd1     Soft Errors: 0 Hard Errors: 7 Transport Errors: 0 
Vendor: SEAGATE  Product: ST32550W SUN2.1G Revision: 0418 Serial No: 04620031 
Size: 2.13GB <2127708160 bytes>
Media Error: 0 Device Not Ready: 0 No Device: 7 Recoverable: 0 
Illegal Request: 0 Predictive Failure Analysis: 0 

sd6     Soft Errors: 0 Hard Errors: 1 Transport Errors: 0 
Vendor: TOSHIBA  Product: XM-5401TASUN4XCD Revision: 1036 Serial No: 04/12/95 
Size: 18446744073.71GB <-1 bytes>
Media Error: 0 Device Not Ready: 0 No Device: 1 Recoverable: 0 
Illegal Request: 0 Predictive Failure Analysis: 0
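
Because the counters follow a regular "Label: number" pattern, they are easier to parse than console messages. The following is a minimal Python sketch, not part of iostat or the SE toolkit, that turns output laid out like the listing above into per-device dictionaries. The function name parse_iostat_errors and the regular expressions are assumptions made for illustration. The Vendor and Size lines are skipped, since the identification string is vendor specific.

```python
import re

# Assumed layout: each device block starts with a line such as
# "sd0     Soft Errors: 0 Hard Errors: 7 Transport Errors: 0".
HEADER = re.compile(r"^(\S+)\s+Soft Errors:")
# Integer counters appear as "Label: <digits>"; labels may contain spaces.
COUNTER = re.compile(r"([A-Z][A-Za-z ]*?): (\d+)(?= |$)")


def parse_iostat_errors(text):
    """Return {device: {counter label: int}} from iostat -E style text."""
    devices = {}
    current = None
    for line in text.splitlines():
        line = line.strip()
        header = HEADER.match(line)
        if header:
            # Start a new device block and drop the device name from the line.
            current = devices.setdefault(header.group(1), {})
            line = line[header.end(1):].strip()
        if current is None or line.startswith(("Vendor:", "Size:")):
            continue  # identification and size lines are not error counters
        for label, value in COUNTER.findall(line):
            current[label] = int(value)
    return devices
```

Feeding it the captured text of iostat -E should give one dictionary per device, keyed by labels such as "Hard Errors" and "No Device".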

Underlying disk error data
If we look at the underlying raw data with netstat -k (an undocumented option) we can see that iostat is reporting the data directly from the kstat with just a little formatting.

% netstat -k sd1,err
sd1,err:
Soft Errors 0 Hard Errors 7 Transport Errors 0 Vendor SEAGATE  
Product ST32550W SUN2.1GRevision Revision 0418 Serial No 04620031 Size 2127708160 Media Error 0 Device Not Ready 0 
No Device 7 Recoverable 0 Illegal Request 0 Predictive Failure Analysis 0

Using SE to report disk error data
This kstat is already defined in the SE toolkit file include/kstat.se, as shown below. The syntax is a bit strange because this definition has to cope with more than normal C syntax allows. The SE interpreter uses a special kstat definition to read the data out of the kernel by name. The definition includes a special number$ entry that indexes the instances of this data and a name$ entry that holds the name, such as sd1,err. Because the names of these kstat entries contain spaces, the names are quoted in the specification. When this structure is referenced by C-like code in SE, the interpreter substitutes underscores for any characters that are not legal in C identifiers, such as spaces, so Soft Errors is accessed as Soft_Errors. Any new or changed kstat devices can be dynamically defined in SE using this syntax.

kstat struct "sderr:" ks_sderr {
    int    number$;
    string name$;

    uint32_t "Soft Errors";
    uint32_t "Hard Errors";
    uint32_t "Transport Errors";
    string    Vendor;
    string    Product;
    string    Revision;
    string   "Serial No";
    uint64_t  Size;
    uint32_t "Media Error";
    uint32_t "Device Not Ready";
    uint32_t "No Device";
    uint32_t  Recoverable;
    uint32_t "Illegal Request";
    uint32_t "Predictive Failure Analysis";
};
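
The name substitution described above can be illustrated with a short Python sketch. This is not SE's implementation; the helper name se_member_name is hypothetical, and the sketch assumes that every character outside letters, digits, and underscore is replaced.

```python
import re


def se_member_name(kstat_name):
    """Replace each character that is not legal in a C identifier with '_'."""
    return re.sub(r"[^A-Za-z0-9_]", "_", kstat_name)


# Map the quoted kstat names from the definition above to member names.
members = [se_member_name(n) for n in ("Soft Errors", "Serial No", "Recoverable")]
```

Under that assumption, "Soft Errors" becomes Soft_Errors and a name without spaces, such as Recoverable, is left unchanged.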

The disks.se script was modified to print out the disk error status structures before it lists all the disk devices and partitions. This data matches what is returned by iostat -E.

% se disks.se
nnn      name  soft hard tran [Vendor Product Revision]
  0    sd6,err    0    1    0 [TOSHIBA  XM-5401TASUN4XCD 1036]
  1    sd0,err    0    7    0 [SEAGATE  ST32550W SUN2.1G 0418]
  2    sd1,err    0    7    0 [SEAGATE  ST32550W SUN2.1G 0418]
se thinks MAX_DISK is 10
kernel   -> path_to_inst -> /dev/dsk part_count [fstype mount]
sd6      -> sd6          -> c0t6d0       0
sd0      -> sd0          -> c0t0d0       2
                          c0t0d0s0             ufs  /
                          c0t0d0s1            swap  swap
sd0,a    -> sd0,a        -> c0t0d0s0     0
sd0,b    -> sd0,b        -> c0t0d0s1     0
sd0,c    -> sd0,c        -> c0t0d0s2     0
sd1      -> sd1          -> c0t1d0       1
                          c0t1d0s0             ufs  /export/home/adrianc
sd1,a    -> sd1,a        -> c0t1d0s0     0
sd1,c    -> sd1,c        -> c0t1d0s2     0
fd0      -> fd0          -> fd0          0

The code to obtain and print out this data is quite simple as shown below. Setting the number$ member causes SE to find the corresponding data; and if there is no data it sets number$ to -1 to signal this.

  ks_sderr kstat$sderr;
  ks_sderr tmp_sderr;
  int i;

  printf("nnn      name  soft hard tran [Vendor Product Revision]\n");
  for(kstat$sderr.number$=0; ; kstat$sderr.number$++) {
	tmp_sderr = kstat$sderr;
	if (tmp_sderr.number$ == -1) {
		break;
	}
	printf("%3d %10s %4d %4d %4d [%s %s %s]\n", tmp_sderr.number$, tmp_sderr.name$,
		tmp_sderr.Soft_Errors, tmp_sderr.Hard_Errors,
		tmp_sderr.Transport_Errors,
		tmp_sderr.Vendor, tmp_sderr.Product, tmp_sderr.Revision);
  }

Looking for errors in disks
To check disks for errors, I added code to the disk rule that is used by tools such as virtual_adrian.se. This rule is implemented as an abstract, "pure" rule wrapped up in a "live" rule that looks up the current values of the disk performance metrics and passes them to the pure rule. Because the disk error data is very detailed and is not usually available except on a live system, I decided to just add it to the live rule.

The code reads all the disk error data into a global array, so the details can be examined or printed out easily. The rule looks at the total error count on each disk and has two options. One, the default option, is to report any error since boot time. The other option is to only report new errors in the current interval. The error analysis looks at the data in increasing order of severity, so that higher priority values of state (e.g. red or black) overwrite lower priorities (green or amber).
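
The difference between the two reporting options can be sketched in Python. This is an illustration only, not the SE rule code; the function errors_to_report and its parameters are hypothetical. It assumes the counters are cumulative totals since boot, that the first sample in interval mode has no baseline and so reports nothing, and that a counter that goes backwards (for example after a reboot) is treated as zero new errors.

```python
def errors_to_report(current, previous=None, since_boot=True):
    """Return the per-counter error counts that a rule should act on.

    current, previous: dicts of cumulative counters (totals since boot).
    since_boot=True  -> report the cumulative totals (the default option).
    since_boot=False -> report only the increase since the previous sample.
    """
    if since_boot:
        return dict(current)
    if previous is None:
        # No baseline yet, so there is nothing new to report (assumption).
        return {name: 0 for name in current}
    return {name: max(0, count - previous.get(name, 0))
            for name, count in current.items()}
```

With since_boot=False, a disk that logged seven No Device errors long ago stays quiet until the count rises again.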

The first test is the case in which no device was present, and the system had to stop to load it. In the case of a desktop system that uses the EnergyStar cpr(7) capability to save its state and power off, it may restart with the disks spun down. When an application tries to access a disk, the application stalls for a few seconds as the disk spins up. This is not a cause for concern, so a green state is reported, and a warning is posted as part of the explanation string. This can be seen in the example output from live_test.se shown below.

Overall current state is green at: Mon May 24 20:49:33 1999
Subsystem  State Action
Disk       green No activity
	c0t6d0		XM-5401TASUN4XCD	  1 green Warning - No Device - cprboot spinup?
	c0t0d0		ST32550W SUN2.1G	 12 green Warning - No Device - cprboot spinup?
	c0t1d0		ST32550W SUN2.1G	 12 green Warning - No Device - cprboot spinup?
    c0t6d0 green    c0t0d0 green    c0t1d0 green       fd0 white
Networks   white No activity
      hme0 white      hme1 white
NFS client white No client NFS/RPC activity
Swap space white There is a lot of unused swap space
RAM demand white RAM available
Kernel mem white No worries, mate
CPU power  white CPU idling
Mutex      green No worries, mate
No significant kernel contention
[cpu0      0 green] [cpu1      0 green] 
DNLC       white No activity
Inode      white No activity
TCP        white No activity

It's harder to generate tests for the other errors, and my interpretation of them should be regarded as experimental rather than definitive!

Recoverable errors
Recoverable errors do not cause data loss, so I report an amber warning state to indicate that a retry was required to get the data. Retries will increase the disk service time.
Predictive failure analysis
This state is reported back from the disk drive itself; I think it means that the disk has not yet had a problem, but it will have one soon. Therefore, I report a red problem state and tell you to get a new disk before it fails on you.
Device not ready
The device has gone offline, or perhaps it has died completely. Possible causes include a disconnected SCSI bus or a dead Fibre Channel laser. I report a black state, as the disk is down. Check power and cables, then try replacing the disk.
Media error
The media is the surface of the disk, so a media error is often caused by a head crash or misalignment due to overheating. This is a black state, as a new disk is required.
Illegal request
An illegal request could be caused by corruption of the command, or by a software bug. I report a black state, but I'm not sure what to do about it.

The code that implements these rules is shown below. The explanation string will already have been set by the normal disk rule, so I keep adding more lines to it for each error encountered. The full name of the disk in the c0t0d0 format is reported. This name must be looked up specifically, because it has to be translated from the sd0,err format. The benefit is that you can see whether all the disks on a controller have the same error at the same time, which would indicate a cable or controller failure.

        /* process rules in order of increasing error level */
        if (GLOBAL_sderr[i].No_Device > 0) {
          explanation = sprintf("%s\n\t%s\t%s\t%3d %s", explanation,
                dname, GLOBAL_sderr[i].Product,
                GLOBAL_sderr[i].No_Device,
                "green Warning - No Device - cprboot spinup?");
          GLOBAL_sderr_state[i] = ST_GREEN;
        }
        if (GLOBAL_sderr[i].Recoverable > 0) {
          explanation = sprintf("%s\n\t%s\t%s\t%3d %s", explanation,
                dname, GLOBAL_sderr[i].Product,
                GLOBAL_sderr[i].Recoverable,
                "amber - Recoverable or Retry");
          GLOBAL_sderr_state[i] = ST_AMBER;
        }
        if (GLOBAL_sderr[i].Predictive_Failure_Analysis > 0) {
          explanation = sprintf("%s\n\t%s\t%s\t%3d %s", explanation,
                dname, GLOBAL_sderr[i].Product,
                GLOBAL_sderr[i].Predictive_Failure_Analysis,
                "red  Predictive_Failure_Analysis - Replace Disk");
          GLOBAL_sderr_state[i] = ST_RED;
        }
        if (GLOBAL_sderr[i].Device_Not_Ready > 0) {
          explanation = sprintf("%s\n\t%s\t%s\t%3d %s", explanation,
                dname, GLOBAL_sderr[i].Product,
                GLOBAL_sderr[i].Device_Not_Ready,
                "black- Device Not Ready - disk offline?");
          GLOBAL_sderr_state[i] = ST_BLACK;
        }
        if (GLOBAL_sderr[i].Media_Error > 0) {
          explanation = sprintf("%s\n\t%s\t%s\t%3d %s", explanation,
                dname, GLOBAL_sderr[i].Product,
                GLOBAL_sderr[i].Media_Error,
                "black - Media Error - replace disk");
          GLOBAL_sderr_state[i] = ST_BLACK;
        }
        if (GLOBAL_sderr[i].Illegal_Request > 0) {
          explanation = sprintf("%s\n\t%s\t%s\t%3d %s", explanation,
                dname, GLOBAL_sderr[i].Product,
                GLOBAL_sderr[i].Illegal_Request,
                "black - Illegal Request Error");
          GLOBAL_sderr_state[i] = ST_BLACK;
        }
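
For readers who want to experiment with the same escalation outside SE, here is a table-driven Python sketch of the idea. It is not the code shipped in live_rules.se. The names RULES and classify_disk are hypothetical, the counter labels are taken from the iostat -E output, and the starting state of "white" for a disk with no errors is an assumption.

```python
# (counter label, state, message) in increasing order of severity,
# following the order of the tests in the SE rule above.
RULES = [
    ("No Device", "green", "Warning - No Device - cprboot spinup?"),
    ("Recoverable", "amber", "Recoverable or Retry"),
    ("Predictive Failure Analysis", "red", "Replace Disk"),
    ("Device Not Ready", "black", "Device Not Ready - disk offline?"),
    ("Media Error", "black", "Media Error - replace disk"),
    ("Illegal Request", "black", "Illegal Request Error"),
]


def classify_disk(counters, state="white"):
    """Return (state, explanations) for one disk's error counters.

    Because the rules are ordered by severity, a later match simply
    overwrites the state set by an earlier one.
    """
    explanations = []
    for label, rule_state, message in RULES:
        count = counters.get(label, 0)
        if count > 0:
            explanations.append(f"{count:3d} {rule_state} {message}")
            state = rule_state
    return state, explanations
```

Keeping the rules in a table makes it easy to disagree with my choices: changing a severity or adding a counter is a one-line edit.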

Wrap up
So, another source of data is now automatically watched for you. This rule is used by the virtual_adrian, zoom, and percollator scripts, and more functionality has been added without changing any of the scripts. Please let me know if it finds a problem for you, or if you disagree with the rule itself. For now, the code is an add-on to SE3.1prefcs, and you can download a tar file that contains the fixes from the regular site. It replaces live_rules.se, live_test.se, and disks.se only. As usual, if you discover a problem or think of a better way to process this data than I did, it's easy to fix it yourself. Just send me the fix! Remember that the SE toolkit is unsupported and experimental, so don't depend on it in production environments. Use the supported iostat -nE command or SyMON 2.0 instead.

Finally, a small, personal plug. For the third year straight, Dr. Neil Gunther and I are running a week-long "Practical Performance Methods" class at Stanford University from August 16 through August 20. More info is available at http://wics.stanford.edu/courses/performance.html. We already have more people signed up than last year.

Resources

Online performance information home page
http://www.sun.com/sun-on-net/performance.html
SE download site:
http://www.sun.com/sun-on-net/performance/se3
virtual_adrian.se rule:
http://www.sun.com/951001/columns/adrian/column2.html
Detailed two-part whitepaper that describes how to optimize for performance on Sun systems:
http://www.sun.com/software/white-papers/wp-optimize/optimize-part1.pdf
http://www.sun.com/software/white-papers/wp-optimize/optimize-part2.pdf

Related articles and FAQs in SunWorld

"SE Toolkit FAQ," (January 1998):
http://www.sunworld.com/swol-01-1998/swol-01-perf.html
"SyMON and SE get upgraded," (February 1999):
http://www.sunworld.com/swol-02-1999/swol-02-perf.html
"How do disks really work?" (June 1996):
http://www.sunworld.com/swol-06-1996/swol-06-perf.html
"Solving the iostat disk mystery," (October 1996):
http://www.sunworld.com/swol-10-1996/swol-10-perf.html
"Choosing the right disk configurations for your servers," (November 1996):
http://www.sunworld.com/swol-11-1996/swol-11-perf.html
"Clarifying disk measurement and terminology," (September 1997):
http://www.sunworld.com/swol-09-1997/swol-09-perf.html
All of Adrian Cockcroft's Performance Q&A columns:
http://www.sunworld.com/common/swol-backissues-columns.html#perf
All of Jim Mauro's Inside Solaris columns:
http://www.sunworld.com/common/swol-backissues-columns.html#insidesolaris
See Adrian Cockcroft's FAQ for answers to three dozen performance-related questions:
http://www.sunworld.com/common/cockcroft.letters.html
Web server performance features on SunWorld's Site Index:
http://www.sunworld.com/common/swol-siteindex.html#webperf

Other SunWorld resources

The SunWorld Topical Index -- a comprehensive listing of all SunWorld articles by subject:
http://www.sunworld.com/common/swol-siteindex.html
Visit sunWHERE -- launchpad to hundreds of online resources for Sun users:
http://www.sunworld.com/sunworldonline/sunwhere.html
Explore back issues of SunWorld:
http://www.sunworld.com/common/swol-backissues.html
IDG.net, your one-stop IT resource:
http://www.idg.net

About the author
Adrian Cockcroft joined Sun Microsystems in 1988, and currently works as a performance specialist for the Computer Systems Division of Sun. He wrote Sun Performance and Tuning: SPARC and Solaris and Sun Performance and Tuning: Java and the Internet, both published by Sun Microsystems Press Books.

URL: http://www.sunworld.com/swol-07-1999/swol-07-perf.html