Check both sides - or: Errors on the SAN
It’s just a little tip from the field: a customer suddenly has really bad performance on one of his servers. He finds a lot of error messages like this in the output of dmesg:
Apr 21 05:22:11 server1 scsi: [ID 243001 kern.info] /scsi_vhci (scsi_vhci0):
Apr 21 05:22:11 server1 /scsi_vhci/ssd@g6001438005de946defa2000000020010 (ssd38): Command failed to complete (3) on path fp9/ssd@w50001fe15023ef59,a
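When there are a lot of these messages, it can help to see whether they pile up on a single path. The following is just a minimal sketch in Python, assuming the messages look exactly like the ones above; the function name failures_per_path is made up for this example and is not part of any tool.

```python
import re
from collections import Counter

# Matches the tail of messages such as
# "... Command failed to complete (3) on path fp9/ssd@w50001fe15023ef59,a"
# The message layout is assumed from the sample above.
PATH_RE = re.compile(r"Command failed to complete \(\d+\) on path (\S+)")


def failures_per_path(lines):
    """Count 'Command failed to complete' messages per path."""
    counts = Counter()
    for line in lines:
        match = PATH_RE.search(line)
        if match:
            counts[match.group(1)] += 1
    return counts


if __name__ == "__main__":
    sample = [
        "Apr 21 05:22:11 server1 scsi: [ID 243001 kern.info] "
        "/scsi_vhci (scsi_vhci0):",
        "Apr 21 05:22:11 server1 "
        "/scsi_vhci/ssd@g6001438005de946defa2000000020010 (ssd38): "
        "Command failed to complete (3) on path fp9/ssd@w50001fe15023ef59,a",
    ]
    # Print the paths with the most failures first.
    for path, count in failures_per_path(sample).most_common():
        print(path, count)
```

If nearly all failures sit on one path, that points at the link behind this one HBA port rather than at the storage as a whole.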
The customer concludes: I have problems with my SAN, let us look at the switch for errors. But none are seen. So he thinks "It's not the SAN". At this moment the customer called me.
Important as in "make a tattoo on your arm when you can't remember it": Check both sides! Checking just the error counters on the switch (or just on the server) is a necessary, but not a sufficient condition for "It's not the SAN".
At first I checked the error counters for the disks. You could use iostat -e for this task, however I think kstat -p is easier to parse and you have the same kind of information in it.
Dang … doesn't look good … let's look closer to the …
The interesting part is the highlighted one: a massive increase in the invalid tx word count. When you see a massively increased counter here, your HBA just received rubbish from the storage. I never saw a different reason for this than a problem somewhere between the interface to the optics on the HBA and the interface to the optics on the switch, ranging from not properly seated transceivers to blatant cases of ignoring the minimum bending radius of the fibre optic cables.
Suggestions to the customer based on the rule "check cheapest solution first":
- reseat cables
- reseat transceivers
- use a new cable
- use new transceivers
In this case the problem could be solved by a new cable. Problem disappeared.
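The "easier to parse" argument for kstat -p can be made concrete: it prints one statistic per line, with the key and the value separated by a tab. The following is a minimal sketch in Python, assuming you saved that output at two points in time. The helper names parse_kstat and increased are invented for this example, and the sample values are made up, not taken from the customer system.

```python
def parse_kstat(text):
    """Parse 'kstat -p' style lines (key<TAB>value) into {key: int}."""
    counters = {}
    for line in text.splitlines():
        key, sep, value = line.partition("\t")
        if not sep:
            continue  # not a key<TAB>value line
        try:
            counters[key.strip()] = int(value.strip())
        except ValueError:
            continue  # skip non-numeric statistics such as vendor strings
    return counters


def increased(before, after, suffix="Errors"):
    """Return {key: delta} for error counters that grew between snapshots."""
    deltas = {}
    for key, new in after.items():
        if not key.endswith(suffix):
            continue
        delta = new - before.get(key, 0)
        if delta > 0:
            deltas[key] = delta
    return deltas


if __name__ == "__main__":
    # Made-up sample data in the key<TAB>value layout of 'kstat -p'.
    first = "ssderr:38:ssd38,err:Transport Errors\t3\n"
    second = (
        "ssderr:38:ssd38,err:Transport Errors\t15\n"
        "ssderr:38:ssd38,err:Hard Errors\t0\n"
    )
    for key, delta in increased(parse_kstat(first), parse_kstat(second)).items():
        print(key, "+%d" % delta)
```

Comparing two snapshots instead of looking at absolute values matters here: counters are cumulative, so an old, long-fixed problem can leave a big number behind. Only a counter that is still growing tells you the problem is happening now.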