Re: [Veritas-bu] scsi_pkt.us_reason = 3

2007-09-16 13:20:43
Subject: Re: [Veritas-bu] scsi_pkt.us_reason = 3
From: Vladimir Taleiko <taleiko AT jet.msk DOT su>
To: veritas-bu AT mailman.eng.auburn DOT edu
Date: Sun, 16 Sep 2007 20:53:42 +0400
Thank you for the assistance. I'll try to involve Sun support again.

BTW, if bptm is nice enough to post this message, why does the scsi layer 
ignore such errors? IMHO, bptm should report this timeout right after a 
scsi error. Maybe the scsi and bptm timings are not coordinated 
(it's ok for scsi, but bptm "thinks" there is a timeout)?
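
On the "may be timeout" wording: going by the pkt_reason definitions Tim quotes below, a value of 3 is CMD_TRAN_ERR, while an actual command timeout would be CMD_TIMEOUT (6). Here is a minimal sketch in C of decoding the value. It uses only the constants from that excerpt and assumes us_reason carries the same codes as pkt_reason, as Tim's reply does. The function name pkt_reason_name is made up for illustration and is not part of Solaris or NetBackup.

```c
/* Values copied from the scsi_pkt.h excerpt quoted in this thread. */
#define CMD_CMPLT       0   /* no transport errors - normal completion */
#define CMD_INCOMPLETE  1   /* transport stopped with not normal state */
#define CMD_DMA_DERR    2   /* dma direction error occurred */
#define CMD_TRAN_ERR    3   /* unspecified transport error */
#define CMD_RESET       4   /* Target completed hard reset sequence */
#define CMD_ABORTED     5   /* Command transport aborted on request */
#define CMD_TIMEOUT     6   /* Command timed out */
#define CMD_DATA_OVR    7   /* Data Overrun */
#define CMD_CMD_OVR     8   /* Command Overrun */
#define CMD_STS_OVR     9   /* Status Overrun */
#define CMD_TERMINATED  22  /* Command transport terminated on request */

/*
 * Hypothetical helper (not a Solaris or NetBackup API): returns the
 * symbolic name for a pkt_reason value, or "UNKNOWN" for any value
 * that is not in the excerpt above.
 */
const char *pkt_reason_name(int reason)
{
    switch (reason) {
    case CMD_CMPLT:      return "CMD_CMPLT";
    case CMD_INCOMPLETE: return "CMD_INCOMPLETE";
    case CMD_DMA_DERR:   return "CMD_DMA_DERR";
    case CMD_TRAN_ERR:   return "CMD_TRAN_ERR";
    case CMD_RESET:      return "CMD_RESET";
    case CMD_ABORTED:    return "CMD_ABORTED";
    case CMD_TIMEOUT:    return "CMD_TIMEOUT";
    case CMD_DATA_OVR:   return "CMD_DATA_OVR";
    case CMD_CMD_OVR:    return "CMD_CMD_OVR";
    case CMD_STS_OVR:    return "CMD_STS_OVR";
    case CMD_TERMINATED: return "CMD_TERMINATED";
    default:             return "UNKNOWN";
    }
}
```

With this, pkt_reason_name(3) gives "CMD_TRAN_ERR" and pkt_reason_name(6) gives "CMD_TIMEOUT", so under that assumption the code in the log is a transport error rather than a timeout code.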

Tim Hoke wrote:
> Vladimir,
> 
> I think both of the support folks you've engaged are likely to be both
> correct and incorrect.  You need to look specifically at the hardware
> layer.  If Sun is your provider, then start there.  Otherwise, engage
> your hardware support vendor.
> 
> As a start to help you out, here's some information to examine...
> 
> Take a look at where the scsi_pkt.us_reason messages come from:
> /usr/include/sys/scsi/scsi_pkt.h
> /*
> * Definitions for the pkt_reason field.
> */
> 
> /*
> * Following defines are generic.
> */
> #define CMD_CMPLT 0 /* no transport errors- normal completion */
> #define CMD_INCOMPLETE 1 /* transport stopped with not normal state */
> #define CMD_DMA_DERR 2 /* dma direction error occurred */
> #define CMD_TRAN_ERR 3 /* unspecified transport error */
> #define CMD_RESET 4 /* Target completed hard reset sequence */
> #define CMD_ABORTED 5 /* Command transport aborted on request */
> #define CMD_TERMINATED 22 /* Command transport terminated on request */
> #define CMD_TIMEOUT 6 /* Command timed out */
> #define CMD_DATA_OVR 7 /* Data Overrun */
> #define CMD_CMD_OVR 8 /* Command Overrun */
> #define CMD_STS_OVR 9 /* Status Overrun */
> 
> As you can see, 3 is in fact CMD_TRAN_ERR or "unspecified transport
> error".  This is definitely something below the application
> (NetBackup) and bptm is just nice enough to post this message for you.
> 
> So, you need to look at the lower layers... HBA, Fabric, Switches,
> Bridges, etc.  Check out all the drivers and configs to make sure
> they're correct (not necessarily the latest) and make sure there
> aren't any errors happening elsewhere.
> 
> It might be useful to examine the bptm log for each of those PIDs to
> see what operation was happening.  That may help narrow it down to a
> particular command or device (drive) that is having a problem.  Then
> you can focus your efforts there.
> 
> HTH
> -Tim
> 
> 
> 
> 
> On 9/14/07, Vladimir Taleiko <taleiko AT jet.msk DOT su> wrote:
>> Hi, all
>>
>> I have a Solaris 9 server with the latest microcode/patches installed
>> NetBackup 6.0 MP4
>> L180 (IBM LTO2 - 5AT0)
>> Brocade 3800 (v3.2.1b, dual fabric)
>>
>> c4                             fc-fabric    connected    configured   unknown
>> c4::500104f0006e4931,0         med-changer  connected    configured   unknown
>> c4::5005076300615978,0         tape         connected    configured   unknown
>> c4::500507630061605e,0         tape         connected    configured   unknown
>> c5                             fc-fabric    connected    configured   unknown
>> c5::5005076300615b41,0         tape         connected    configured   unknown
>> c5::5005076300615d9b,0         tape         connected    configured   unknown
>> c8                             fc-fabric    connected    configured   unknown
>> c8::500507630061556a,0         tape         connected    configured   unknown
>> c8::5005076300615aaa,0         tape         connected    configured   unknown
>> c9                             fc-fabric    connected    configured   unknown
>> c9::5005076300615937,0         tape         connected    configured   unknown
>> c9::5005076300615ba3,0         tape         connected    configured   unknown
>>
>> and the following errors:
>>
>> Jul 28 03:13:04 ni5nrp2 bptm[26639]: [ID 832037 daemon.error] scsi
>> command failed, may be timeout, scsi_pkt.us_reason = 3
>> Jul 28 07:11:25 ni5nrp2 bptm[27999]: [ID 832037 daemon.error] scsi
>> command failed, may be timeout, scsi_pkt.us_reason = 3
>> Jul 28 07:13:32 ni5nrp2 bptm[27983]: [ID 832037 daemon.error] scsi
>> command failed, may be timeout, scsi_pkt.us_reason = 3
>> Jul 28 08:42:36 ni5nrp2 bptm[15693]: [ID 832037 daemon.error] scsi
>> command failed, may be timeout, scsi_pkt.us_reason = 3
>> Jul 29 09:21:34 ni5nrp2 bptm[18406]: [ID 832037 daemon.error] scsi
>> command failed, may be timeout, scsi_pkt.us_reason = 3
>> Jul 29 10:51:12 ni5nrp2 bptm[5955]: [ID 832037 daemon.error] scsi
>> command failed, may be timeout, scsi_pkt.us_reason = 3
>> Jul 30 00:42:07 ni5nrp2 bptm[27613]: [ID 832037 daemon.error] scsi
>> command failed, may be timeout, scsi_pkt.us_reason = 3
>> Jul 30 00:52:03 ni5nrp2 bptm[27614]: [ID 832037 daemon.error] scsi
>> command failed, may be timeout, scsi_pkt.us_reason = 3
>> Jul 30 01:11:55 ni5nrp2 bptm[27620]: [ID 832037 daemon.error] scsi
>> command failed, may be timeout, scsi_pkt.us_reason = 3
>>
>> The bptm logs don't show any issues.
>>
>> I've opened two cases, one with Sun and one with Symantec.
>> Sun guys said "There are no errors from the scsi layer, call Symantec"
>> Symantec support said "these errors are being received by the bptm
>> process, and not being generated by it, please involve SUN"
>>
>> I'm slightly confused. What is the cause of these errors?
>>
>> --
>>   Vladimir
>> _______________________________________________
>> Veritas-bu maillist  -  Veritas-bu AT mailman.eng.auburn DOT edu
>> http://mailman.eng.auburn.edu/mailman/listinfo/veritas-bu
>>
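
As a starting point for Tim's suggestion to check the bptm log for each PID, the PIDs and reason codes can be pulled out of the syslog entries quoted above. The following is a minimal sketch in C. It assumes each entry is a single unwrapped line in the format shown in this thread, and parse_bptm_line is a hypothetical helper name, not part of NetBackup or Solaris.

```c
#include <stdio.h>
#include <string.h>

/*
 * Hypothetical helper: extracts the bptm PID and the us_reason value
 * from one unwrapped syslog line such as
 *   "... bptm[26639]: [ID 832037 daemon.error] scsi command failed,
 *    may be timeout, scsi_pkt.us_reason = 3"
 * Returns 1 if both fields were found, 0 otherwise.
 */
int parse_bptm_line(const char *line, int *pid, int *reason)
{
    /* Locate the process tag and the reason field within the line. */
    const char *p = strstr(line, "bptm[");
    const char *r = strstr(line, "scsi_pkt.us_reason = ");

    if (p == NULL || r == NULL)
        return 0;
    /* Read the decimal PID between the square brackets. */
    if (sscanf(p, "bptm[%d]", pid) != 1)
        return 0;
    /* Read the decimal reason code after the equals sign. */
    if (sscanf(r, "scsi_pkt.us_reason = %d", reason) != 1)
        return 0;
    return 1;
}
```

Running this over each line of the messages file would give a list of PID and reason pairs, which can then be matched against the per-PID entries in the bptm log.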