Networker

Re: [Networker] Hung savesets sometimes never timeout

2006-06-26 05:17:06
Subject: Re: [Networker] Hung savesets sometimes never timeout
From: Stuart Whitby <swhitby AT DATAPROTECTORS.CO DOT UK>
To: NETWORKER AT LISTSERV.TEMPLE DOT EDU
Date: Mon, 26 Jun 2006 10:15:37 +0100
Well, the refusal to hit the inactivity timeout isn't a problem with the 
inactivity timeout mechanism itself.  That is measured from when the save last 
sent data to an mmd, so if no data has ever been sent the timeout will never 
kick in.  As such, changing that would be an RFE rather than a bug - good luck 
getting that sorted out....
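
To make that concrete, here is a toy model of the behaviour as described. It is 
not NetWorker code; the function name and parameters are made up purely for 
illustration, and the real mechanism may differ in detail.

```python
from typing import Optional


def inactivity_timeout_fires(last_data_sent: Optional[float],
                             now: float,
                             timeout_secs: float) -> bool:
    """Toy model (hypothetical names): the timer is measured from the last
    time the save sent data to an mmd."""
    if last_data_sent is None:
        # No data has ever reached an mmd, so there is nothing to measure
        # inactivity from and the timeout can never trigger.
        return False
    return (now - last_data_sent) >= timeout_secs
```

A save that hangs before its first write to the mmd corresponds to the 
`last_data_sent is None` case, which is why the group can wait forever.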
 
<support mode>
 
I'd be more interested to find out what the problem is with the save sending 
data.  We know that the save is started, so we have a savegrp -> nsrexec -> 
nsrexecd -> save communication path.  The way that a save session actually 
starts saving data is something along the lines of:
> connect to nsrd
>  ask where to send data
<  point to an mmd
> connect to mmd (which has a connection to the media DB to give saveset info)
> connect to main nsrindexd process
>  ask where to send data
  (nsrindexd creates a child to service those saves - the -ADD process)
<  tell client which indexd to connect to
> connect to child indexd process (which creates the client file index)
> send data to mmd, send metadata to indexd, (inform nsrexecd? &) exit when done
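
As a rough aid for the diagnosis that follows, the sequence above can be written 
down as an ordered checklist, and the first stage with no observed connection is 
the likely place the save stalled. This is only a sketch: the stage labels are 
invented for illustration, and the ordering is the approximate one given above, 
not a documented protocol.

```python
# Connection stages in the approximate order described above.
# The labels are hypothetical names, not NetWorker identifiers.
SAVE_STARTUP_STAGES = [
    "nsrd",             # connect to nsrd and ask where to send data
    "mmd",              # connect to the mmd that nsrd pointed to
    "nsrindexd",        # connect to the main nsrindexd process
    "nsrindexd-child",  # connect to the child indexd servicing this save
]


def first_missing_stage(observed):
    """Return the first stage with no observed connection, or None if
    every stage has been seen."""
    for stage in SAVE_STARTUP_STAGES:
        if stage not in observed:
            return stage
    return None
```

For example, if tcpview or lsof shows a connection to nsrd but nothing to an 
mmd, `first_missing_stage({"nsrd"})` returns `"mmd"`.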
 
So if the save's created, what has it done after that point?  You'll need to 
check this out with tcpview (www.sysinternals.com) or lsof (www.sunfreeware.com 
 or ftp.vic.cc.purdue.edu/pub/tools/unix/lsof) to identify what connections 
have been made by save.  Ideally, if it's on a Unix client, truss the save to 
see what it's trying to do.  This should help identify where the problem 
exists.  Basic questions would be whether the correct connection state exists 
on both sides of the pipe (from save to the NW server/storage node) or whether 
that connection had ever been attempted (tougher to find out).  If you can 
reproduce the problem consistently on a client then you may be able to wrap the 
save process (if nsrexecd is happy to have a script as a child) so that it runs 
truss on the save, or runs netstatp (sysinternals) at very regular intervals 
(once per second) for the first part of the save.
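
A minimal sketch of what such a wrapper might look like on a Unix client, 
assuming the real save binary has been moved aside to a path of your choosing 
and that truss and lsof are installed. The function names and any paths passed 
in are hypothetical; whether nsrexecd tolerates a wrapper in place of save is 
untested here.

```python
import subprocess
import time


def truss_command(real_save, save_args, trace_file):
    """Build a truss invocation around the real save binary.
    -f follows child processes, -o writes the trace to trace_file."""
    return ["truss", "-f", "-o", trace_file, real_save] + list(save_args)


def lsof_command(pid):
    """Build an lsof invocation listing only network files of one process.
    -a ANDs the selections, -p selects the PID, -i selects network files,
    -n and -P skip host and port name lookups."""
    return ["lsof", "-a", "-p", str(pid), "-i", "-n", "-P"]


def sample_connections(pid, samples, interval=1.0,
                       run=subprocess.run, sleep=time.sleep):
    """Capture lsof output for pid once per interval, samples times.
    run and sleep are injectable so the loop can be exercised without lsof."""
    snapshots = []
    for _ in range(samples):
        result = run(lsof_command(pid), capture_output=True, text=True)
        snapshots.append(result.stdout)
        sleep(interval)
    return snapshots
```

The wrapper script would exec the list returned by `truss_command` with the 
arguments it was given, or start the real save and call `sample_connections` 
on its PID for the first minute or so. Comparing successive snapshots shows 
which connections were made and in what state they stalled.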
 
</support mode>
 
Cheers,
 
Stuart.

________________________________

From: Legato NetWorker discussion on behalf of Oscar Olsson
Sent: Mon 26-Jun-06 08:22
To: NETWORKER AT LISTSERV.TEMPLE DOT EDU
Subject: Re: [Networker] Hung savesets sometimes never timeout



On Mon, 26 Jun 2006, Stan Horwitz wrote:

SH> Check this list's archives; this is a long-standing issue with NetWorker
SH> going way back. I have seen it several times. In fact, just the other day,
SH> we resolved that problem on a Windows 2003 client after at least two weeks
SH> of trial and error. The solution in that case was to update the hardware
SH> firmware for that client (network card, disk drives, RAID controller, etc.).
SH> After the firmware was updated, all the backups ran to fruition.

I know, I know.. It just seems like it happens more frequently now than
before. And I still think that a client shouldn't be able to hang the
group completely forever. The inactivity timeout should be honored even
though the save hasn't started a save session yet. Oh well.

//Oscar

To sign off this list, send email to listserv AT listserv.temple DOT edu and 
type "signoff networker" in the
body of the email. Please write to networker-request AT listserv.temple DOT edu 
if you have any problems
with this list. You can access the archives at 
http://listserv.temple.edu/archives/networker.html or
via RSS at http://listserv.temple.edu/cgi-bin/wa?RSS&L=NETWORKER


