Disk error detection
What tools can warn you of disk failure automatically -- and more efficiently -- before your disks fail?
This month, Adrian looks at the data Solaris collects about different kinds of disk errors with iostat
and SyMON, and explains how to read it into the SE toolkit so that it can be included in an upgraded disk monitoring rule.
(2,000 words)
Q: I sometimes see messages on the console when a disk has an error. How can I automatically look for a disk that is giving soft errors before it fails completely? Are there any tools that will do this for me?
A: Traditionally, the disk device driver reacts to a problem by logging an error via syslog; the message is printed on the console and stored in the /var/adm/messages file. While this is useful, it's an inefficient way to report a problem, and it's difficult to ensure that messages aren't lost or ignored. It's also hard to parse the error messages, because you might not know all the possible error conditions that could be reported.
The SyMON 1.x and 2.x products include a log file monitor that watches the /var/adm/messages file and warns the user if it sees certain kinds of errors. A new way to monitor disks was added in Solaris 2.6, and the iostat command was extended with -e and -E options that report error counts. I've now also extended the SE toolkit to look at the same information as part of the disk monitoring rule.
This isn't normally considered a performance question, but disks that are doing multiple retries due to transient errors can cause hard-to-find performance problems -- and the performance of a dead disk is zero!
Monitoring errors with iostat
Since Solaris 2.6, a new kstat data structure has been maintained in the kernel for every disk. The example output below is from an Ultra 2 with two disks and a CD-ROM. The disk name is given in the normal device form of sd1 unless the -n option is also used to translate the name to c0t1d0s0 form.
The first line gives the total number of soft errors, hard errors, and transport errors; then the device identification string is broken down into vendor, product, firmware revision, and serial number. This breakdown does not always work out properly, as you can see for the CD-ROM, where the serial number appears to be a date. The format used is vendor specific and difficult to parse in a general way. Next the size is given, followed by a detailed breakdown of the error counts since boot. In this case the system has been powered down using the cpr(7) capability. Each time this occurs, the disks are left spun down until they are accessed, giving rise to a No Device error for each power-up resume.
% iostat -E
sd0       Soft Errors: 0 Hard Errors: 7 Transport Errors: 0
Vendor: SEAGATE  Product: ST32550W SUN2.1G Revision: 0418 Serial No: 04406244
Size: 2.13GB <2127708160 bytes>
Media Error: 0 Device Not Ready: 0 No Device: 7 Recoverable: 0
Illegal Request: 0 Predictive Failure Analysis: 0
sd1       Soft Errors: 0 Hard Errors: 7 Transport Errors: 0
Vendor: SEAGATE  Product: ST32550W SUN2.1G Revision: 0418 Serial No: 04620031
Size: 2.13GB <2127708160 bytes>
Media Error: 0 Device Not Ready: 0 No Device: 7 Recoverable: 0
Illegal Request: 0 Predictive Failure Analysis: 0
sd6       Soft Errors: 0 Hard Errors: 1 Transport Errors: 0
Vendor: TOSHIBA  Product: XM-5401TASUN4XCD Revision: 1036 Serial No: 04/12/95
Size: 18446744073.71GB <-1 bytes>
Media Error: 0 Device Not Ready: 0 No Device: 1 Recoverable: 0
Illegal Request: 0 Predictive Failure Analysis: 0
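If you want to feed these counts into your own scripts, the numeric fields are much easier to pick out than the vendor string. The following is a minimal Python sketch, not part of the SE toolkit, that pulls only the error counters out of text shaped like the iostat -E output above. The function name parse_iostat_errors and the embedded SAMPLE text are illustrative assumptions; on a live system you would capture the command output yourself.

```python
import re

# Sample text in the shape of the `iostat -E` output shown above
# (two of the three devices, copied from the article's example).
SAMPLE = """\
sd0       Soft Errors: 0 Hard Errors: 7 Transport Errors: 0
Vendor: SEAGATE  Product: ST32550W SUN2.1G Revision: 0418 Serial No: 04406244
Size: 2.13GB <2127708160 bytes>
Media Error: 0 Device Not Ready: 0 No Device: 7 Recoverable: 0
Illegal Request: 0 Predictive Failure Analysis: 0
sd6       Soft Errors: 0 Hard Errors: 1 Transport Errors: 0
Vendor: TOSHIBA  Product: XM-5401TASUN4XCD Revision: 1036 Serial No: 04/12/95
Size: 18446744073.71GB <-1 bytes>
Media Error: 0 Device Not Ready: 0 No Device: 1 Recoverable: 0
Illegal Request: 0 Predictive Failure Analysis: 0
"""

# Only the numeric counters are parsed; the vendor/product/serial string
# is vendor specific and deliberately left alone.
COUNTERS = (
    "Soft Errors", "Hard Errors", "Transport Errors",
    "Media Error", "Device Not Ready", "No Device",
    "Recoverable", "Illegal Request", "Predictive Failure Analysis",
)

# A device block starts with the device name followed by "Soft Errors:".
DEVICE_RE = re.compile(r"^(\S+)[ \t]+Soft Errors:", re.MULTILINE)


def parse_iostat_errors(text):
    """Return {device: {counter_name: count}} for each device block."""
    starts = [(m.start(), m.group(1)) for m in DEVICE_RE.finditer(text)]
    result = {}
    for idx, (pos, name) in enumerate(starts):
        end = starts[idx + 1][0] if idx + 1 < len(starts) else len(text)
        block = text[pos:end]
        counts = {}
        for label in COUNTERS:
            m = re.search(re.escape(label) + r":\s*(\d+)", block)
            if m:
                counts[label] = int(m.group(1))
        result[name] = counts
    return result


if __name__ == "__main__":
    for dev, counts in parse_iostat_errors(SAMPLE).items():
        print(dev, counts["Hard Errors"], counts["No Device"])
```

The sketch assumes every device block begins with the device name and the Soft Errors field on one line, as in the example; other Solaris releases may lay the output out differently.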
Underlying disk error data
If we look at the underlying raw data with netstat -k (an undocumented option) we can see that iostat is reporting the data directly from the kstat with just a little formatting.
% netstat -k sd1,err
sd1,err:
Soft Errors 0 Hard Errors 7 Transport Errors 0 Vendor SEAGATE
Product ST32550W SUN2.1GRevision Revision 0418 Serial No 04620031
Size 2127708160 Media Error 0 Device Not Ready 0 No Device 7 Recoverable 0
Illegal Request 0 Predictive Failure Analysis 0
Using SE to report disk error data
This kstat is already defined in the SE toolkit's include/kstat.se file as shown below. The syntax is a bit strange as this definition has to cope with more than normal C syntax allows. The SE interpreter uses a special kstat definition to read the data out of the kernel by name. The definition includes a special number$ entry that indexes the instances of this data and the name$ entry that includes the name, such as sd1,err.
Because the names of these kstat entries contain spaces, the names are quoted in the specification. When this structure is referenced by C-like code in SE, the interpreter substitutes underscores for any characters that are not legal in C identifiers, such as spaces; so Soft Errors is accessed as Soft_Errors. Any new or changed kstat devices can be dynamically defined in SE using this syntax.
kstat struct "sderr:" ks_sderr {
  int number$;
  string name$;
  uint32_t "Soft Errors";
  uint32_t "Hard Errors";
  uint32_t "Transport Errors";
  string Vendor;
  string Product;
  string Revision;
  string "Serial No";
  uint64_t Size;
  uint32_t "Media Error";
  uint32_t "Device Not Ready";
  uint32_t "No Device";
  uint32_t Recoverable;
  uint32_t "Illegal Request";
  uint32_t "Predictive Failure Analysis";
};
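The name substitution described above can be illustrated outside SE. This is a rough Python sketch of the idea only; the function name se_member_name is hypothetical, and the exact rule the SE interpreter applies (for example, to names that start with a digit) may differ.

```python
import re


def se_member_name(kstat_name):
    """Replace each character that is not legal in a C identifier with '_'.

    Illustrates the mapping from a quoted kstat entry name to the member
    name used in C-like SE code; not taken from the SE interpreter itself.
    """
    return re.sub(r"[^A-Za-z0-9_]", "_", kstat_name)


if __name__ == "__main__":
    for name in ("Soft Errors", "Serial No", "Predictive Failure Analysis"):
        print(name, "->", se_member_name(name))
```

Names that are already legal identifiers, such as Vendor or Recoverable, pass through unchanged, which is why they need no quotes in the definition.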
The disks.se script was modified to print out the disk error status structures before it lists all the disk devices and partitions. This data matches what is returned by iostat -E.
% se disks.se
nnn name soft hard tran [Vendor Product Revision]
0 sd6,err 0 1 0 [TOSHIBA XM-5401TASUN4XCD 1036]
1 sd0,err 0 7 0 [SEAGATE ST32550W SUN2.1G 0418]
2 sd1,err 0 7 0 [SEAGATE ST32550W SUN2.1G 0418]
se thinks MAX_DISK is 10
kernel -> path_to_inst -> /dev/dsk part_count [fstype mount]
sd6 -> sd6 -> c0t6d0 0
sd0 -> sd0 -> c0t0d0 2
    c0t0d0s0 ufs /
    c0t0d0s1 swap swap
sd0,a -> sd0,a -> c0t0d0s0 0
sd0,b -> sd0,b -> c0t0d0s1 0
sd0,c -> sd0,c -> c0t0d0s2 0
sd1 -> sd1 -> c0t1d0 1
    c0t1d0s0 ufs /export/home/adrianc
sd1,a -> sd1,a -> c0t1d0s0 0
sd1,c -> sd1,c -> c0t1d0s2 0
fd0 -> fd0 -> fd0 0
The code to obtain and print out this data is quite simple as shown below. Setting the number$ member causes SE to find the corresponding data; and if there is no data it sets number$ to -1 to signal this.
ks_sderr kstat$sderr;
ks_sderr tmp_sderr;
int i;

printf("nnn name soft hard tran [Vendor Product Revision]\n");
/* setting number$ selects an instance; SE sets it to -1 when there is no data */
for(kstat$sderr.number$=0; ; kstat$sderr.number$++) {
  tmp_sderr = kstat$sderr;
  if (tmp_sderr.number$ == -1) {
    break;
  }
  printf("%3d %10s %4d %4d %4d [%s %s %s]\n",
    tmp_sderr.number$, tmp_sderr.name$,
    tmp_sderr.Soft_Errors, tmp_sderr.Hard_Errors,
    tmp_sderr.Transport_Errors,
    tmp_sderr.Vendor, tmp_sderr.Product, tmp_sderr.Revision);
}
Looking for errors in disks
To check disks for errors, code can be added to the disk rule that is used by tools such as virtual_adrian.se. This rule is implemented as an abstract, "pure" rule wrapped up in a "live" rule that looks up the current values of the disk performance metrics and passes them to the pure rule. Because the disk error data is very detailed and is not usually available except on a live system, I decided to just add it to the live rule.
The code reads all the disk error data into a global array, so the details can be examined or printed out easily. The rule looks at the total error count on each disk and has two options. One, the default option, is to report any error since boot time. The other option is to only report new errors in the current interval. The error analysis looks at the data in increasing order of severity, so that higher priority values of state (e.g. red or black) overwrite lower priorities (green or amber).
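The difference between the two reporting options can be sketched in a few lines of Python. This is an illustration of the idea rather than the code in live_rules.se; the function name errors_to_report, its since_boot flag, and the dictionary layout are assumptions made for the example.

```python
def errors_to_report(current, previous=None, since_boot=True):
    """Return {disk: {counter: count}} holding only the non-zero counts.

    current and previous map disk names to {counter name: total since boot}.
    With since_boot=True (the default) the totals themselves are reported.
    With since_boot=False only the increase since the previous sample is
    reported; if no previous sample exists yet, totals are used instead.
    """
    report = {}
    for disk, counters in current.items():
        if since_boot or previous is None:
            base = {}
        else:
            base = previous.get(disk, {})
        nonzero = {}
        for name, value in counters.items():
            delta = value - base.get(name, 0)
            if delta > 0:
                nonzero[name] = delta
        if nonzero:
            report[disk] = nonzero
    return report


if __name__ == "__main__":
    prev = {"sd0": {"No Device": 7, "Recoverable": 0}}
    cur = {"sd0": {"No Device": 7, "Recoverable": 2}}
    print(errors_to_report(cur))                          # any error since boot
    print(errors_to_report(cur, prev, since_boot=False))  # new errors only
```

With the default option the old No Device count keeps being reported after every sample, while the interval option stays quiet until a counter actually increases.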
The first test is the case in which no device was present, and the system had to stop to load it. In the case of a desktop system that uses the EnergyStar cpr(7) capability to save its state and power off, it may restart with the disks spun down. When an application tries to access a disk, the application stalls for a few seconds as the disk spins up. This is not a cause for concern, so a green state is reported, and a warning is posted as part of the explanation string. This can be seen in the example output from live_test.se shown below.
Overall current state is green at: Mon May 24 20:49:33 1999
Subsystem    State   Action
Disk         green   No activity
        c0t6d0  XM-5401TASUN4XCD    1 green Warning - No Device - cprboot spinup?
        c0t0d0  ST32550W SUN2.1G   12 green Warning - No Device - cprboot spinup?
        c0t1d0  ST32550W SUN2.1G   12 green Warning - No Device - cprboot spinup?
        c0t6d0 green
        c0t0d0 green
        c0t1d0 green
        fd0 white
Networks     white   No activity
        hme0 white
        hme1 white
NFS client   white   No client NFS/RPC activity
Swap space   white   There is a lot of unused swap space
RAM demand   white   RAM available
Kernel mem   white   No worries, mate
CPU power    white   CPU idling
Mutex        green   No worries, mate
        No significant kernel contention [cpu0 0 green] [cpu1 0 green]
DNLC         white   No activity
Inode        white   No activity
TCP          white   No activity
It's harder to generate tests for the other errors, and my interpretation of them should be regarded as experimental rather than definitive!
The code that implements these rules is shown below. The explanation string will already have been set by the normal disk rule, so I keep adding more lines to it for each error encountered. The full name of the disk in the c0t0d0 format is reported. It must be looked up specifically because it has to be translated from the sd0,err format, but it lets you see whether all the disks on a controller have the same error at the same time, which would indicate a cable or controller failure.
/* process rules in order of increasing error level */
if (GLOBAL_sderr[i].No_Device > 0) {
  explanation = sprintf("%s\n\t%s\t%s\t%3d %s", explanation, dname,
    GLOBAL_sderr[i].Product, GLOBAL_sderr[i].No_Device,
    "green Warning - No Device - cprboot spinup?");
  GLOBAL_sderr_state[i] = ST_GREEN;
}
if (GLOBAL_sderr[i].Recoverable > 0) {
  explanation = sprintf("%s\n\t%s\t%s\t%3d %s", explanation, dname,
    GLOBAL_sderr[i].Product, GLOBAL_sderr[i].Recoverable,
    "amber - Recoverable or Retry");
  GLOBAL_sderr_state[i] = ST_AMBER;
}
if (GLOBAL_sderr[i].Predictive_Failure_Analysis > 0) {
  explanation = sprintf("%s\n\t%s\t%s\t%3d %s", explanation, dname,
    GLOBAL_sderr[i].Product, GLOBAL_sderr[i].Predictive_Failure_Analysis,
    "red Predictive_Failure_Analysis - Replace Disk");
  GLOBAL_sderr_state[i] = ST_RED;
}
if (GLOBAL_sderr[i].Device_Not_Ready > 0) {
  explanation = sprintf("%s\n\t%s\t%s\t%3d %s", explanation, dname,
    GLOBAL_sderr[i].Product, GLOBAL_sderr[i].Device_Not_Ready,
    "black - Device Not Ready - disk offline?");
  GLOBAL_sderr_state[i] = ST_BLACK;
}
if (GLOBAL_sderr[i].Media_Error > 0) {
  explanation = sprintf("%s\n\t%s\t%s\t%3d %s", explanation, dname,
    GLOBAL_sderr[i].Product, GLOBAL_sderr[i].Media_Error,
    "black - Media Error - replace disk");
  GLOBAL_sderr_state[i] = ST_BLACK;
}
if (GLOBAL_sderr[i].Illegal_Request > 0) {
  explanation = sprintf("%s\n\t%s\t%s\t%3d %s", explanation, dname,
    GLOBAL_sderr[i].Product, GLOBAL_sderr[i].Illegal_Request,
    "black - Illegal Request Error");
  GLOBAL_sderr_state[i] = ST_BLACK;
}
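The rule itself leaves the controller comparison to the reader. As a rough illustration of how that check could be automated, here is a Python sketch that groups disks by the controller part of their c0t0d0 name and lists the error counters that are non-zero on every disk of a controller. The function name controllers_with_common_error and the input layout are assumptions for this example; nothing like it exists in the SE rule.

```python
import re
from collections import defaultdict


def controllers_with_common_error(disk_errors):
    """Find error counters that are non-zero on every disk of a controller.

    disk_errors maps names such as "c0t0d0" to {counter name: count}.
    Controllers with fewer than two disks are skipped, because one disk
    cannot tell a disk fault from a cable or controller fault.
    """
    by_controller = defaultdict(list)
    for dname, counters in disk_errors.items():
        m = re.match(r"(c\d+)t\d+d\d+", dname)
        if m:
            by_controller[m.group(1)].append(counters)

    suspects = {}
    for ctrl, disks in by_controller.items():
        if len(disks) < 2:
            continue
        # Start with the first disk's non-zero counters, then intersect.
        common = {name for name, v in disks[0].items() if v > 0}
        for counters in disks[1:]:
            common &= {name for name, v in counters.items() if v > 0}
        if common:
            suspects[ctrl] = sorted(common)
    return suspects


if __name__ == "__main__":
    sample = {
        "c0t0d0": {"Transport Errors": 3, "Media Error": 0},
        "c0t1d0": {"Transport Errors": 1, "Media Error": 2},
        "c1t0d0": {"Transport Errors": 5},
    }
    print(controllers_with_common_error(sample))
```

The sketch only compares counts, not timing. Passing it the increases seen in the current interval instead of the totals since boot would come closer to "the same error at the same time".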
Wrap up
So, another source of data is now automatically watched for you. This rule is used by the virtual_adrian, zoom, and percollator scripts, and more functionality has been added without changing any of the scripts. Please let me know if it finds a problem for you, or if you disagree with the rule itself. For now, the code is an add-on to SE3.1prefcs, and you can download a tar file that contains the fixes from the regular site. It replaces live_rules.se, live_test.se, and disks.se only. As usual, if you discover a problem or think of a better way to process this data than I did, it's easy to fix it yourself. Just send me the fix! Remember that the SE toolkit is unsupported and experimental, so don't depend on it in production environments. Use the supported iostat -nE command or SyMON 2.0 instead.
Finally, a small, personal plug. For the third year straight, Dr. Neil Gunther and I are running a week-long "Practical Performance Methods" class at Stanford University from August 16 through August 20. More info is available at http://wics.stanford.edu/courses/performance.html. We already have more people signed up than last year.
About the author
Adrian Cockcroft joined Sun Microsystems in 1988, and currently works as a performance specialist for the Computer Systems Division of Sun. He wrote Sun Performance and Tuning: SPARC and Solaris and Sun Performance and Tuning: Java and the Internet, both published by Sun Microsystems Press Books.
URL: http://www.sunworld.com/swol-07-1999/swol-07-perf.html