Networker

Re: [Networker] tape drives getting "stuck" on storage node

2008-06-09 20:52:44
Subject: Re: [Networker] tape drives getting "stuck" on storage node
From: "Clark, Patti" <clarkp AT OSTI DOT GOV>
To: NETWORKER AT LISTSERV.TEMPLE DOT EDU
Date: Mon, 9 Jun 2008 16:17:33 -0400
> -----Original Message-----
> From: EMC NetWorker discussion 
> [mailto:NETWORKER AT LISTSERV.TEMPLE DOT EDU] On Behalf Of Alex Alexiou
> Sent: Monday, June 09, 2008 2:07 PM
> To: NETWORKER AT LISTSERV.TEMPLE DOT EDU
> Subject: [Networker] tape drives getting "stuck" on storage node
> 
> Here's the background: We have Networker 7.3.3 running on a backup
> server and a storage node server, both Red Hat AS 4. The backup server
> is fibre-attached to an ADIC Library, the storage node to an HP EML
> Library, via QLogic HBAs.
>
> Over the past month or so, during weekend backup jobs, random tape
> drives in the EML will get "stuck". They will load a tape and keep
> trying forever to inventory the tape and never finish. This never
> happens on the ADIC. Only power-cycling the EML seems to fix this. It's
> not the same drives every time and it's only during weekends. I tried
> disabling CDI on the drives, which only seemed to help slightly. Just
> this past weekend, a drive was stuck while trying to eject a tape and
> kept timing out, with the following in /nsr/logs/daemon.log on the
> storage node:
>
> 06/09/08 00:50:07 nsrd: media warning: rd=tgtbackupnode01.targetsite.com:/dev/nst4 moving: eject: Input/output error
>
> I even tried opening the jukebox up and manually ejecting the tape, and
> nothing happened. Again, I had to reboot the EML and everything was
> fine.
>
> It's possible it's the EML at fault and not Legato, but it's impossible
> to tell right now. Has anyone seen anything like this where a change in
> Legato fixed things? One thing I noticed is that the kernel version of
> the backup server and storage node is slightly different; I was never
> told to keep it the same, but I saw a posting referring to this as
> a possible problem.
>
> Another thing I noticed was several entries in the storage node's log
> like this. Server name was removed by me:
>
> 06/07/08 01:00:29 nsrexecd: GSS Legato authentication user session entry (warning): "User authentication session timed out and is now invalid.". Session number = 4a8:1008, domain = NT AUTHORITY, user name = SYSTEM, NetWorker Instance Name = server
>
> 06/07/08 01:00:29 nsrexecd: GSS Legato authentication user session entry (warning): "User authentication session timed out and is now invalid.". Session number = 4a9:1009, domain = NT AUTHORITY, user name = SYSTEM, NetWorker Instance Name = server
>
> 06/07/08 01:01:05 nsrexecd: SYSTEM error: An error occured when a client attempted to acquire credentials: error: "A daemon requested the information for a user session, but the user session was not found in the list of valid sessions" session number: 465:224c, client ip address: 127.0.0.1, port number: 0, user id: (NONE).
>
> 06/07/08 01:01:05 nsrmmd #11: GSS Legato authentication from server failed...
>
> 06/07/08 01:01:05 nsrmmd #11: RPC error: Authentication error
>
> 06/07/08 01:01:05 nsrexecd: SYSTEM error: An error occured when a client attempted to acquire credentials: error: "A daemon requested the information for a user session, but the user session was not found in the list of valid sessions" session number: 466:2250, client ip address: 127.0.0.1, port number: 0, user id: (NONE).
>
> We also often have errors like this in dmesg on the storage node:
>
> st6: Failed to read 131072 byte block with 32768 byte transfer.
>
> st4: Error 20000 (sugg. bt 0x0, driver bt 0x0, host bt 0x2).
>
> st2: Failed to read 131072 byte block with 32768 byte transfer.
>
> Let me know if any more information would help.

Alex,

I cannot address your stuck tape drive issue; however, I can help with a
couple of the other items that you mentioned.  I, too, use a RHEL4 host and
NetWorker v7.3.3.  I see the "session timed out" messages rather
frequently for all of my Windows clients.  I have not had any
difficulties doing backups or restores from them since I was given a
patched nsrexecd.  I did have to switch from the 64-bit NetWorker to the
32-bit version.
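
If you want to see how often the "session timed out" warnings actually
occur, a small script can count them per day.  This is only a rough
sketch: the function name count_session_timeouts is made up for
illustration, and it assumes each daemon.log entry sits on a single line
in the "MM/DD/YY HH:MM:SS nsrexecd: ..." form shown in the quoted
excerpt.

```python
import re
from collections import Counter

# Assumed entry format, based on the quoted daemon.log excerpt:
#   MM/DD/YY HH:MM:SS nsrexecd: ... session timed out ...
TIMEOUT_RE = re.compile(
    r"^(\d{2}/\d{2}/\d{2}) \d{2}:\d{2}:\d{2} nsrexecd: .*session timed out"
)


def count_session_timeouts(lines):
    """Return a Counter mapping log date (MM/DD/YY) to timeout warnings."""
    counts = Counter()
    for line in lines:
        match = TIMEOUT_RE.match(line)
        if match:
            counts[match.group(1)] += 1
    return counts
```

To use it, pass an open file such as /nsr/logs/daemon.log (the path from
the original post) and print the resulting counts, sorted by date.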

The "Failed to read ... byte block" messages have been around for as long
as my server has been on Linux.  This has come up a few times on this
list, and it has been suggested that it's not a concern.  While I don't
like noise in my log files, I've not found this to cause any issues with
either backups or restores.

I hope some of this helps.

Patti Clark
Sr. Unix System Administrator - RHCT, GSEC
Office of Scientific and Technical Information

 
