Re: [Networker] SDLT full prematurely

2002-08-26 03:30:17
Subject: Re: [Networker] SDLT full prematurely
From: Lee Hwan Meng <hwanlee AT SOFTHOME DOT NET>
To: NETWORKER AT LISTMAIL.TEMPLE DOT EDU
Date: Mon, 26 Aug 2002 03:33:33 -0400
We use a SAN switch to perform DDS. We also have a mix of NT and Unix systems.

Our error messages are quite similar to yours, except that we don't have
SCSI LUN reset messages anywhere. This may make our debugging much more
difficult (sigh...)

I have hit 4 such errors in the past 10 days. Each time I lost more than
200GB of space on the tape, because the error comes after less than 20GB of
data has been backed up. Every time, it goes like this:


1. On the backup server, /nsr/logs/daemon.log shows:

08/19/02 21:26:41 nsrd: media notice: Volume "WT402" on device "/dev/rmt/6cbn": Block size is 32768 bytes not 131072 bytes. Verify the device configuration. Tape positioning by record is disabled.
08/19/02 21:33:57 nsrd: media warning: /dev/rmt/6cbn reading: fsr 8105 read: I/O error
08/19/02 21:33:57 nsrd: media emergency: could not position WT402 to file 4, record 8107
08/19/02 21:33:57 nsrd: media warning: /dev/rmt/6cbn reading: I/O error
(above line repeated 6 times)


2. On the storage node (which can be the backup server itself), /var/adm/messages shows:

Aug 19 21:25:48 <storage node> scsi: [ID 107833 kern.warning] WARNING: /pci@4,2000/pci@1/JNI,FCE@4/st@2,1 (st647):
Aug 19 21:25:48 <storage node>  Error for Command: write    Error Level: Fatal
Aug 19 21:25:48 <storage node> scsi: [ID 107833 kern.notice]    Requested Block: 8109                      Error Block: 8109
Aug 19 21:25:48 <storage node> scsi: [ID 107833 kern.notice]    Vendor: QUANTUM                            Serial Number:  #  <
Aug 19 21:25:48 <storage node> scsi: [ID 107833 kern.notice]    Sense Key: Media Error
Aug 19 21:25:48 <storage node> scsi: [ID 107833 kern.notice]    ASC: 0xc (write error), ASCQ: 0x0, FRU: 0x0
Aug 19 21:33:57 <storage node> scsi: [ID 107833 kern.warning] WARNING: /pci@4,2000/pci@1/JNI,FCE@4/st@2,1 (st647):
Aug 19 21:33:57 <storage node>  Error for Command: read    Error Level: Fatal
Aug 19 21:33:57 <storage node> scsi: [ID 107833 kern.notice]    Requested Block: 7874                      Error Block: 7874
Aug 19 21:33:57 <storage node> scsi: [ID 107833 kern.notice]    Vendor: QUANTUM                            Serial Number:  #  <
Aug 19 21:33:57 <storage node> scsi: [ID 107833 kern.notice]    Sense Key: Media Error
Aug 19 21:33:57 <storage node> scsi: [ID 107833 kern.notice]    ASC: 0x11 (unrecovered read error), ASCQ: 0x0, FRU: 0x0
(more similar messages follow...)
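
Since the question is whether the errors follow one drive or move around, a small script that pulls the drive instance, command, sense key and ASC out of each block can help. This is only a rough sketch in Python, not something we run in production: the host name "node1" and the function names are made up for illustration, and the regular expression assumes the messages keep the field order shown above.

```python
import re
from collections import Counter

# Sample input shaped like the /var/adm/messages excerpt above.
# "node1" is a placeholder for the storage node's host name.
SAMPLE = """\
Aug 19 21:25:48 node1 scsi: [ID 107833 kern.warning] WARNING: /pci@4,2000/pci@1/JNI,FCE@4/st@2,1 (st647):
Aug 19 21:25:48 node1  Error for Command: write    Error Level: Fatal
Aug 19 21:25:48 node1 scsi: [ID 107833 kern.notice]    Requested Block: 8109    Error Block: 8109
Aug 19 21:25:48 node1 scsi: [ID 107833 kern.notice]    Sense Key: Media Error
Aug 19 21:25:48 node1 scsi: [ID 107833 kern.notice]    ASC: 0xc (write error), ASCQ: 0x0, FRU: 0x0
Aug 19 21:33:57 node1 scsi: [ID 107833 kern.warning] WARNING: /pci@4,2000/pci@1/JNI,FCE@4/st@2,1 (st647):
Aug 19 21:33:57 node1  Error for Command: read    Error Level: Fatal
Aug 19 21:33:57 node1 scsi: [ID 107833 kern.notice]    Requested Block: 7874    Error Block: 7874
Aug 19 21:33:57 node1 scsi: [ID 107833 kern.notice]    Sense Key: Media Error
Aug 19 21:33:57 node1 scsi: [ID 107833 kern.notice]    ASC: 0x11 (unrecovered read error), ASCQ: 0x0, FRU: 0x0
"""

# One match per error block. \s+ and DOTALL let the pattern survive
# fields that a mail client has wrapped onto the next line.
EVENT_RE = re.compile(
    r"WARNING:\s+(?P<path>\S+)\s+\((?P<drive>\w+)\):"
    r".*?Error for Command:\s+(?P<command>\w+)"
    r".*?Sense Key:\s+(?P<sense_key>[A-Za-z]+(?: [A-Za-z]+)*)"
    r".*?ASC:\s+(?P<asc>0x[0-9a-fA-F]+)",
    re.DOTALL,
)


def scsi_events(text):
    """Return (drive, command, sense_key, asc) for each error block."""
    return [
        (m["drive"], m["command"], m["sense_key"], m["asc"])
        for m in EVENT_RE.finditer(text)
    ]


def errors_per_drive(text):
    """Count error blocks per st instance, to see whether one drive stands out."""
    return Counter(drive for drive, _, _, _ in scsi_events(text))


if __name__ == "__main__":
    for event in scsi_events(SAMPLE):
        print(event)
    print(errors_per_drive(SAMPLE))
```

Run against the real messages file from each storage node, the per-drive counts should show whether the media errors cluster on one st instance or are spread across drives.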



Each time this occurred I checked all the other storage nodes, and none of
them logged anything unusual at that time.
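
To make that check across nodes repeatable, something like the following could be run over each node's messages file. It is a sketch only: the patterns are taken from the reset messages quoted further down in this thread, other HBA drivers may word a reset differently, and the function name is invented for illustration.

```python
import re

# Reset indicators copied from the quoted messages later in this thread.
# These are assumptions about what a reset looks like on other nodes;
# a different HBA driver may log something else entirely.
RESET_PATTERNS = [
    re.compile(r"Resetting\.\.\."),
    re.compile(r"ASC:\s+0x2f\b"),
    re.compile(r"cleared\s+by\s+another\s+initiator"),
]


def find_reset_lines(lines):
    """Return (line_number, text) for every line matching a reset pattern."""
    hits = []
    for number, line in enumerate(lines, start=1):
        if any(pattern.search(line) for pattern in RESET_PATTERNS):
            hits.append((number, line.rstrip("\n")))
    return hits


# Small sample: two reset-related lines and two unrelated ones.
SAMPLE_LINES = [
    "Jun 2 04:37:29 Clnt2 fcaw: fcaw0: Target 0 Lun 3: Resetting...",
    "Jun 2 04:37:29 Clnt1 scsi: Sense Key: Aborted Command",
    "Jun 2 04:37:29 Clnt1 scsi: ASC: 0x2f (commands cleared by another initiator), ASCQ: 0x0, FRU: 0x16",
    "Aug 19 21:25:48 node1 scsi: [ID 107833 kern.notice]    ASC: 0xc (write error), ASCQ: 0x0, FRU: 0x0",
]

if __name__ == "__main__":
    for number, text in find_reset_lines(SAMPLE_LINES):
        print(number, text)
```

A file object can be passed in directly, for example find_reset_lines(open("/var/adm/messages")), since the function only iterates over lines.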

I am completely clueless as to what caused this!

Damaged tapes? I tend not to think so. I reformatted one of the tapes last
week and subsequent backups went past the problematic block with no error.

Any ideas will be appreciated.

Regards
Hwan Meng






On Sun, 25 Aug 2002 20:45:56 -0700, han <k0s5 AT YAHOO DOT COM> wrote:

>We use SmartMedia and Networker to perform DDS, and
>the setup is a mix of NT and Unix systems.
>The information below comes as one package (at around
>the same time).
>Networker errors:
>- bad file number error
>- unable to read file xx record yy
>- then Networker marks the media full
>
>For system messages (/var/adm/messages):
>- there is a SCSI error (with ASC/ASCQ information),
>  like: SCSI bus is cleared by another initiator etc.
>- there is a reset reported on the drive Networker used.
>
>Example:
>++++++nsr-log-messages++++++
>Jun 2 04:37:29 BckSvr rd=Clnt1:/opt/SmartMedia/handles/STK9840_3/P9i9xCUSCZGY writing: Bad file number, at file 5 record 10
>Jun 2 04:37:29 BckSvr 9840 tape PD0685 used 1055 MB of 20 GB capacity
>Jun 2 04:37:29 BckSvr 9840 tape PD0685 on rd=Clnt1:/opt/SmartMedia/handles/STK9840_3/P9i9xCUSCZGY is full
>
>
>BckSvr /var/adm/messages
>Jun 2 04:37:29 BckSvr rd=Clnt1:/opt/SmartMedia/handles/STK9840_3/P9i9xCUSCZGY writing: Bad file number, at file 5 record 10
>Jun 2 04:37:29 BckSvr 9840 tape PD0685 used 1055 MB of 20 GB capacity
>Jun 2 04:37:29 BckSvr 9840 tape PD0685 on rd=Clnt1:/opt/SmartMedia/handles/STK9840_3/P9i9xCUSCZGY is full
>
>Clnt2 /var/adm/messages
>Jun 2 04:37:29 Clnt2 fcaw: fcaw0: Target 0 Lun 3: Resetting...
>Jun 2 04:37:29 Clnt2 fcaw: fcaw0: Target 0 Lun 3: Resetting...
>
>Clnt1 /var/adm/messages
>Jun 2 04:37:29 Clnt1 scsi: WARNING: /sbus@2,0/fcaw@2,0/st@2,3 (st606):
>Jun 2 04:37:29 Clnt1 Error for Command: write Error Level: Fatal
>Jun 2 04:37:29 Clnt1 scsi: Requested Block: 10 Error Block: 10
>Jun 2 04:37:29 Clnt1 scsi: Vendor: STK Serial Number: .127
>Jun 2 04:37:29 Clnt1 scsi: Sense Key: Aborted Command
>Jun 2 04:37:29 Clnt1 scsi: ASC: 0x2f (commands cleared by another initiator), ASCQ: 0x0, FRU: 0x16
>
>The reset can happen on a random drive as well.
>After replacing the tape drive and the cables, we no
>longer have this problem.
>
>What's your setup like?
>
>cheers - han
>
>
>--- Lee Hwan Meng <hwanlee AT SOFTHOME DOT NET> wrote:
>> Yes, we are on SAN.
>>
>> I am not certain that our tape drive(s) cause this, because every time
>> this occurs it happens on a different drive. Unless of course all the
>> drives are faulty, which is a possibility :-(
>>
>> How do you detect a SCSI bus reset?
>>
>> Thanks for your comments.
>>
>>
>>
>> On Sun, 25 Aug 2002 20:12:41 +0800, k0s5 <k0s5 AT YAHOO DOT COM> wrote:
>>
>> >At 8/25/2002 01:07 PM, Lee Hwan Meng wrote:
>> >>Hi
>> >>
>> >>Did a search through the archive and realised quite a few have
>> >>experienced the above.
>> >>
>> >>Well, I have now too, and have opened a case with Legato.
>> >>
>> >>Anyone willing to share his/her resolution?
>> >>
>> >>I will post mine when there is one.
>> >
>> >Are you in a SAN environment?
>> >Recently we had the same problem on an STK9840.
>> >It turned out that one of the tape drives was faulty and caused a
>> >SCSI bus reset every now and then.
>> >It caused the Legato session to be terminated, and the media was
>> >marked full.
>> >Take a look at the hardware, cabling etc.
>> >
>> >
>> >cheers - han
>> >
>> >--
>> >Note: To sign off this list, send a "signoff" command via email
>> >to listserv AT listmail.temple DOT edu or visit the list's Web site at
>> >http://listmail.temple.edu/archives/networker.html where you can
>> >also view and post messages to the list.
>>
>
>
