Check both sides - or: Errors on the SAN

It's just a little tip from the field: a customer suddenly saw really bad performance on one of his servers, and found a lot of error messages like this in the output of dmesg:

Apr 21 05:22:11 server1 scsi: [ID 243001 kern.info] /scsi_vhci (scsi_vhci0):
Apr 21 05:22:11 server1 /scsi_vhci/ssd@g6001438005de946defa2000000020010 (ssd38): Command failed to complete (3) on path fp9/ssd@w50001fe15023ef59,a
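By the way: the message already names the multipath device and the failing path. A quick first check is whether MPxIO has already degraded that path; mpathadm shows the path states. Just a sketch: the logical unit name below is derived from the GUID in the message and purely illustrative, mpathadm list lu prints the real names.

# mpathadm list lu
# mpathadm show lu /dev/rdsk/c0t6001438005DE946DEFA2000000020010d0s2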

The customer concluded: "I have problems with my SAN, let's look on the switch for errors." But none were seen. So he thought: "It's not the SAN". At this moment the customer called me. Important, as in "tattoo it on your arm if you can't remember it otherwise": Check both sides! Checking just the error counters on the switch (or just on the server) is a necessary, but not a sufficient condition for "It's not the SAN". At first I checked the error counters of the disks. You could use iostat -e for this task, however I think kstat -p is easier to parse, and it gives you the same kind of information.
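For comparison, iostat -e prints the accumulated soft, hard and transport errors per device in one table (the values below are just the ones from the kstat example for ssd28, not real output from this system):

# iostat -e
          ---- errors ---
device  s/w h/w trn tot
ssd28     0 400 403 803
[…]

The same counters, one per line, via kstat -p: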
# kstat -p | grep "ssd28,err"
[…]
ssderr:28:ssd28,err:Device Not Ready	 0
ssderr:28:ssd28,err:Hard Errors	400
ssderr:28:ssd28,err:Illegal Request	1
ssderr:28:ssd28,err:Media Error	0
ssderr:28:ssd28,err:No Device	0
ssderr:28:ssd28,err:Predictive Failure Analysis	0
ssderr:28:ssd28,err:Product	(some storage)          Revision
ssderr:28:ssd28,err:Recoverable	0
ssderr:28:ssd28,err:Revision 	1100
ssderr:28:ssd28,err:Serial No	
ssderr:28:ssd28,err:Size	 16106127360
ssderr:28:ssd28,err:Soft Errors	0
ssderr:28:ssd28,err:Transport Errors	 403
ssderr:28:ssd28,err:Vendor	 (some storage)      
ssderr:28:ssd28,err:class	device_error
ssderr:28:ssd28,err:crtime	 19077969.9568033
ssderr:28:ssd28,err:snaptime	 20623873.2833807
So I added up the errors for all disks in the kstat output:
# kstat -p | grep -i ",err" | grep "sd" | grep "Hard" | cut -f 2 | awk '{sum+=$1} END {print sum}'
3285
# kstat -p | grep -i ",err" | grep "sd" | grep "Transport" | cut -f 2 | awk '{sum+=$1} END {print sum}'
3405
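If the sums are that high, a per-device view shows which disks (and thus which paths) carry the errors. Just a sketch, assuming the tab-separated kstat -p output shown above:

# kstat -p | grep ",err:Transport Errors" | awk -F'\t' '$2 > 0 {print $1, $2}'
ssderr:28:ssd28,err:Transport Errors 403
[…]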
Dang … doesn't look good … let's take a closer look at the HBA:
# fcinfo hba-port -l
HBA Port WWN: (a wwn)
        OS Device Name: /dev/cfg/c10
        Manufacturer: Emulex
        Model: LPem12002E-S
        Firmware Version: 2.00a4 (U3D2.00A4)
        FCode/BIOS Version: Boot:5.03a4 Fcode:3.10a3
        Serial Number: ABCDEFG-HIJKLMNOPQ
        Driver Name: emlxs
        Driver Version: 2.60k (2011.03.24.16.45)
        Type: N-port
        State: online
        Supported Speeds: 2Gb 4Gb 8Gb
        Current Speed: 4Gb
        Node WWN: (a wwn)
        Link Error Statistics:
                Link Failure Count: 0
                Loss of Sync Count: 145
                Loss of Signal Count: 0
                Primitive Seq Protocol Error Count: 0
                Invalid Tx Word Count: 2500000
                Invalid CRC Count: 2100
The interesting part is the pair of counters at the end: a massive increase in Invalid Tx Word Count. When you see a massively increased counter here, your HBA just received rubbish from the storage. I have never seen a reason for this other than a problem between the interface to the optics on the HBA and the interface to the optics on the switch, ranging from not properly seated transceivers to blatant cases of ignoring the minimum bending radius of the fibre optic cables. Suggestions to the customer, based on the rule "check the cheapest solution first" (after each step, check whether the counters keep climbing; see the sketch after the list):
  • reseat cables
  • reseat transceivers
  • use a new cable
  • use new transceivers
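A minimal watch loop for that check, assuming the fcinfo output format shown above; if Invalid Tx Word Count keeps climbing after a change, the physical layer is still bad:

#!/bin/sh
# Print the relevant link error counters every 60 seconds,
# so an increase after reseating/replacing a part is easy to spot.
while true; do
    date
    fcinfo hba-port -l | egrep 'Port WWN|Invalid Tx Word|Invalid CRC'
    sleep 60
done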
In this case a new cable solved it: the problem disappeared.