Networker

Re: [Networker] tape drives getting "stuck" on storage node

2008-06-09 21:33:20
Subject: Re: [Networker] tape drives getting "stuck" on storage node
From: Fazil Saiyed <Fazil.Saiyed AT ANIXTER DOT COM>
To: NETWORKER AT LISTSERV.TEMPLE DOT EDU
Date: Mon, 9 Jun 2008 13:23:05 -0500
Hello,
If you are ejecting tapes directly from the tape library and still having 
problems, that would certainly point to a problem with the drives or other 
hardware on the EML library.
Have you checked with HP support?
A couple of questions:
Is the block size the same on both servers? How often do you clean your 
drives?
Are you using the same media type and interchanging tapes between the two 
libraries? I believe ADIC uses IBM drives and HP uses its own tape drives. 
That by itself is not a problem, but if you keep mixing media, it could act 
funny.
Lastly, have you checked the error logs on the EML and the driver/firmware?
HTH




Alex Alexiou <AAlexiou AT TARGETSITE DOT COM> 
Sent by: EMC NetWorker discussion <NETWORKER AT LISTSERV.TEMPLE DOT EDU>
06/09/2008 01:06 PM
Please respond to
EMC NetWorker discussion <NETWORKER AT LISTSERV.TEMPLE DOT EDU>;
Please respond to
Alex Alexiou <AAlexiou AT TARGETSITE DOT COM>


To
NETWORKER AT LISTSERV.TEMPLE DOT EDU
cc

Subject
[Networker] tape drives getting "stuck" on storage node






Here's the background: We have Networker 7.3.3 running on a backup
server and a storage node server, both Red Hat AS 4. The backup server
is fibre-attached to an ADIC Library, the storage node to an HP EML
Library, via QLogic HBAs.

 

Over the past month or so, during weekend backup jobs, random tape
drives in the EML will get "stuck". They will load a tape and keep
trying forever to inventory the tape and never finish. This never
happens on the ADIC. Only power-cycling the EML seems to fix this. It's
not the same drives every time and it's only during weekends. I tried
disabling CDI on the drives, which only seemed to help slightly. Just
this past weekend, a drive was stuck while trying to eject a tape and
kept timing out, with the following in /nsr/logs/daemon.log on the
storage node: 

 

06/09/08 00:50:07 nsrd: media warning:
rd=tgtbackupnode01.targetsite.com:/dev/nst4 moving: eject: Input/output
error
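To check whether these errors really cluster on weekends and move between drives, the log can be tallied by weekday and device. The following is a minimal sketch, assuming each daemon.log entry sits on a single line in the format quoted above (the mail client wrapped it here). The function name tally_io_errors and the regular expression are illustrative only and are not part of NetWorker.

```python
import re
from collections import Counter
from datetime import datetime

# Assumed entry format, based on the single excerpt quoted in this message:
# 06/09/08 00:50:07 nsrd: media warning: rd=host:/dev/nst4 moving: eject: Input/output error
LINE_RE = re.compile(
    r"^(?P<ts>\d{2}/\d{2}/\d{2} \d{2}:\d{2}:\d{2}) nsrd: media warning: "
    r"(?P<device>rd=\S+) .*Input/output error"
)

# Index matches datetime.weekday(), where Monday is 0.
DAYS = ["Mon", "Tue", "Wed", "Thu", "Fri", "Sat", "Sun"]


def tally_io_errors(lines):
    """Count I/O media warnings per (weekday, device) pair."""
    counts = Counter()
    for line in lines:
        match = LINE_RE.match(line.strip())
        if not match:
            continue  # not an I/O media warning in the assumed format
        when = datetime.strptime(match.group("ts"), "%m/%d/%y %H:%M:%S")
        counts[(DAYS[when.weekday()], match.group("device"))] += 1
    return counts


if __name__ == "__main__":
    # Path taken from the message above; adjust if the log lives elsewhere.
    with open("/nsr/logs/daemon.log", errors="replace") as log:
        for (day, device), n in sorted(tally_io_errors(log).items()):
            print(day, device, n)
```

If the counts turn out to be spread across weekdays as well, the weekend pattern may simply reflect when the heaviest jobs run.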

 

I even tried opening the jukebox up and manually ejecting the tape, and
nothing happened. Again, I had to reboot the EML and everything was
fine.

 

It's possible it's the EML at fault and not Legato, but it's impossible
to tell right now. Has anyone seen anything like this where a change in
Legato fixed things? One thing I noticed is that the kernel versions of
the backup server and storage node are slightly different; I was never
told to keep them the same, but I saw a posting that referred to this as
a possible problem.

 

Another thing I noticed was several entries in the storage node's log
like this. Server name was removed by me:

 

06/07/08 01:00:29 nsrexecd: GSS Legato authentication user session entry
(warning): "User authentication session timed out and is now invalid.".
Session number = 4a8:1008, domain = NT AUTHORITY, user name = SYSTEM,
NetWorker Instance Name = server

06/07/08 01:00:29 nsrexecd: GSS Legato authentication user session entry
(warning): "User authentication session timed out and is now invalid.".
Session number = 4a9:1009, domain = NT AUTHORITY, user name = SYSTEM,
NetWorker Instance Name = server

06/07/08 01:01:05 nsrexecd: SYSTEM error: An error occured when a client
attempted to acquire credentials: error: "A daemon requested the
information for a user session, but the user session was not found in the
list of valid sessions" session number: 465:224c, client ip address:
127.0.0.1, port number: 0, user id: (NONE).

06/07/08 01:01:05 nsrmmd #11: GSS Legato authentication from server
failed...

06/07/08 01:01:05 nsrmmd #11: RPC error: Authentication error

06/07/08 01:01:05 nsrexecd: SYSTEM error: An error occured when a client
attempted to acquire credentials: error: "A daemon requested the
information for a user session, but the user session was not found in the
list of valid sessions" session number: 466:2250, client ip address:
127.0.0.1, port number: 0, user id: (NONE).

 

 

We also often have errors like this in dmesg on the storage node:

 

st6: Failed to read 131072 byte block with 32768 byte transfer.

st4: Error 20000 (sugg. bt 0x0, driver bt 0x0, host bt 0x2).

st2: Failed to read 131072 byte block with 32768 byte transfer.
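The "Failed to read" lines come from the Linux st driver and, as far as I understand them, mean the block on tape (131072 bytes, 128 KiB) was larger than the read request (32768 bytes, 32 KiB). A minimal sketch for collecting the distinct mismatches per drive from dmesg output follows, assuming the message format shown above. The function name block_mismatches is made up for illustration.

```python
import re
import subprocess

# Assumed format, copied from the dmesg excerpt above:
# st6: Failed to read 131072 byte block with 32768 byte transfer.
ST_RE = re.compile(
    r"^(?P<dev>st\d+): Failed to read (?P<block>\d+) byte block "
    r"with (?P<xfer>\d+) byte transfer"
)


def block_mismatches(lines):
    """Return sorted, de-duplicated (device, block_bytes, transfer_bytes) tuples."""
    found = set()
    for line in lines:
        match = ST_RE.match(line.strip())
        if match:
            found.add(
                (match.group("dev"), int(match.group("block")), int(match.group("xfer")))
            )
    return sorted(found)


if __name__ == "__main__":
    # Reads the kernel ring buffer; may need root on some systems.
    output = subprocess.run(
        ["dmesg"], capture_output=True, text=True, check=True
    ).stdout
    for dev, block, xfer in block_mismatches(output.splitlines()):
        print(f"{dev}: tape block {block // 1024} KiB, read request {xfer // 1024} KiB")
```

Lines that do not match the pattern, such as the "Error 20000" entry for st4, are skipped rather than interpreted.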

 

Let me know if any more information would help.


To sign off this list, send email to listserv AT listserv.temple DOT edu and 
type 
"signoff networker" in the body of the email. Please write to 
networker-request AT listserv.temple DOT edu if you have any problems with this 
list. You can access the archives at 
http://listserv.temple.edu/archives/networker.html or
via RSS at http://listserv.temple.edu/cgi-bin/wa?RSS&L=NETWORKER


