Well, the failure to hit the inactivity timeout isn't a problem with the
inactivity timeout mechanism itself. The timeout is measured from when the
save last sent data to an mmd, so if no data has ever been sent the timeout
will never kick in. As such, that would be an RFE rather than a bug - good
luck getting that sorted out....
<support mode>
I'd be more interested to find out what the problem is with the save sending
data. We know that the save is started, so we have a savegrp -> nsrexec ->
nsrexecd -> save communication path. The way that a save session actually
starts saving data is something along the lines of:
> connect to nsrd
> ask where to send data
< point to an mmd
> connect to mmd (which has a connection to the media DB to give saveset info)
> connect to main nsrindexd process
> ask where to send data
(nsrindexd creates a child to service those saves - the -ADD process)
< tell client which indexd to connect to
> connect to child indexd process (which creates the client file index)
> send data to mmd, send metadata to indexd, (inform nsrexecd? & ) exit when done
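As a rough aid for reading the lsof/tcpview output against that sequence, here is a minimal Python sketch that reports the first step with no matching connection. The stage labels and the function name are my own illustrative choices, not NetWorker identifiers, and you would have to map each observed peer to a label by hand. Bear in mind that a missing connection may also mean it was opened and already closed, so treat the answer as a hint only.

```python
# Order of connections taken from the sequence above. The labels are
# illustrative names chosen for this sketch, not NetWorker identifiers.
STAGES = [
    ("nsrd", "connect to nsrd and ask where to send data"),
    ("mmd", "connect to the mmd that nsrd pointed to"),
    ("nsrindexd", "connect to the main nsrindexd and ask where to send data"),
    ("indexd-child", "connect to the child indexd process"),
]


def first_missing_stage(observed):
    """Return (label, description) for the first stage that has no
    observed connection, or None if every stage has one."""
    seen = set(observed)
    for label, description in STAGES:
        if label not in seen:
            return label, description
    return None


# Example: the save has talked to nsrd and an mmd but nothing else.
print(first_missing_stage({"nsrd", "mmd"}))
```

If this returns None, all four connections exist and the hang is more likely in the data transfer itself.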
So if the save's created, what has it done after that point? You'll need to
check this out with tcpview (www.sysinternals.com) or lsof (www.sunfreeware.com
or ftp.vic.cc.purdue.edu/pub/tools/unix/lsof) to identify what connections
have been made by save. Ideally, if it's on a Unix client, truss the save to
see what it's trying to do. This should help identify where the problem
exists. Basic questions would be whether the correct connection state exists
on both sides of the pipe (from save to the NW server/storage node) or whether
that connection had ever been attempted (tougher to find out). If you can get
the same problem consistently from a client then you may be able to wrap the
save process in a script (if nsrexecd is happy to have a script as a child)
that runs truss on the save, or runs netstatp (sysinternals) on a very regular
basis (once per second) for the first part of the save.
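The once-per-second polling could be sketched like this in Python. It is only a sketch under assumptions: poll_command and lsof_command are hypothetical helper names, the 30-sample default is arbitrary, and it assumes lsof is on the PATH of a Unix client. Any other command that prints the connection table (netstat, for example) can be passed in as the command list instead.

```python
import subprocess
import sys
import time
from datetime import datetime


def poll_command(cmd, samples=30, interval=1.0, out=sys.stdout):
    """Run cmd repeatedly and write each timestamped snapshot to out.

    Returns the captured stdout strings, one per sample.
    """
    snapshots = []
    for i in range(samples):
        result = subprocess.run(cmd, capture_output=True, text=True)
        stamp = datetime.now().isoformat(timespec="seconds")
        out.write(f"--- {stamp} (exit {result.returncode}) ---\n")
        out.write(result.stdout)
        snapshots.append(result.stdout)
        if i < samples - 1:
            time.sleep(interval)  # no sleep needed after the last sample
    return snapshots


def lsof_command(pid):
    # -a ANDs the selectors, -p picks the process, -i limits output to
    # network files, -n and -P skip host and port name lookups.
    return ["lsof", "-a", "-n", "-P", "-p", str(pid), "-i"]


# Example (assumes 12345 is the pid of the hung save):
#     poll_command(lsof_command(12345))
```

Comparing consecutive snapshots shows which connections the save opened and where it stopped making progress.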
</support mode>
Cheers,
Stuart.
________________________________
From: Legato NetWorker discussion on behalf of Oscar Olsson
Sent: Mon 26-Jun-06 08:22
To: NETWORKER AT LISTSERV.TEMPLE DOT EDU
Subject: Re: [Networker] Hung savesets sometimes never timeout
On Mon, 26 Jun 2006, Stan Horwitz wrote:
SH> Check this list's archives; this is a long-standing issue with NetWorker
SH> going way back. I have seen it several times. In fact, just the other day,
SH> we resolved that problem on a Windows 2003 client after at least two weeks
SH> of trial and error. The solution in that case was to update the hardware
SH> firmware for that client (network card, disk drives, RAID controller, etc.).
SH> After the firmware was updated, all the backups ran to fruition.
I know, I know.. It just seems like it happens more frequently now than
before. And I still think that a client shouldn't be able to hang the
group completely forever. The inactivity timeout should be honored even
though the save hasn't started a save session yet. Oh well.
//Oscar
To sign off this list, send email to listserv AT listserv.temple DOT edu and
type "signoff networker" in the
body of the email. Please write to networker-request AT listserv.temple DOT edu
if you have any problems
with this list. You can access the archives at
http://listserv.temple.edu/archives/networker.html or
via RSS at http://listserv.temple.edu/cgi-bin/wa?RSS&L=NETWORKER