Re: [Networker] The problem of the hung savegroups

2009-07-22 11:56:12
Subject: Re: [Networker] The problem of the hung savegroups
From: "Clark, Patti" <clarkp AT OSTI DOT GOV>
To: NETWORKER AT LISTSERV.TEMPLE DOT EDU
Date: Wed, 22 Jul 2009 11:51:25 -0400
> -----Original Message-----
> From: EMC NetWorker discussion [mailto:NETWORKER AT LISTSERV.TEMPLE DOT EDU] On Behalf Of Stan Horwitz
> Sent: Tuesday, July 21, 2009 3:53 PM
> To: NETWORKER AT LISTSERV.TEMPLE DOT EDU
> Subject: [Networker] The problem of the hung savegroups
> 
> Greetings everyone;
> 
> I have a case open with EMC regarding a problem where a hung client
> will hang the savegroup that contains that client. I have this case
> open for my NetWorker 7.4.4 server on Red Hat Linux AS 4.5, but I also
> have the same problem with a NetWorker 7.4.1 server on Solaris 10.
> 
> One of the things EMC had me do on my Linux server was to run
> 
>  save -v -D9 -g grp_name -c client_name
> 
> We did that for one client in a group that was hanging, after I
> manually stopped the group from the NMC GUI.
> 
> The problem just happened on my Solaris NetWorker server, so I tried a
> similar thing: I logged onto that NetWorker server and, for every
> client in the failed group, I issued a command along the lines of
> 
>  save -v -D9 -g grp_name -c client_name1 -c client_name2 -c client_nameN
> 
> I included all nine clients in the group. The scheduled backup today is
> an incremental for that group. The schedule is controlled by the group.
> The backup worked. Then I went into the NMC GUI and started the group
> that way. Again, it worked.
> 
> What I am wondering is why it worked. Why did running the backup
> manually from the NetWorker server fix the problem?
> 
> Actually, in this case, the group in question didn't hang; it just died
> and all nine clients registered a failed backup in the resulting
> savegroup report.
> 
> EMC and I have been trying to troubleshoot this issue on my Linux
> NetWorker server (which backs up much more important data than our
> Solaris NetWorker server). Since we tried that stunt of manually
> backing up one of the stuck groups, though, the problem hasn't occurred
> again, so we are waiting for it to happen again so we can collect some
> debugging info.
> 
>>>>>>>>>>>>>
What does -D9 provide?  Debug?

I've been seeing hung save groups every so often on both of my NetWorker
systems, one of which is very small.  This is NOT a DNS problem: my small
system is very static, with no network or client changes in months.  My
scheduled backups will run flawlessly for weeks, then one night one group
will hang.

This is NOT a resource issue either.  The group consists of 7 clients, and a
typical incremental takes all of 5 minutes.  Usually, the clients are Linux.
One of the clients (a different one each time) will show aborted save sets in
the completion report because of the job termination; however, the report
also says
"Cannot determine status of backup process.  Use mminfo to determine job status."
for those same save sets.

I stop the group using NMC, wait for it to complete, then restart it.  If
it's within the interval, it'll only grab the incomplete save sets (sometimes
this is only an index) and wrap things up normally.  The next run of the
group will be normal and life will continue.

I wanted to clarify what is not the cause.  I don't think there's any magic
in running the backup from the server, other than collecting debug
information, and that probably won't contain anything of value because the
condition that caused the hang was already cleared by stopping the hung
group.  I think if you had simply restarted your backup, it would have
completed just as normally.
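
For anyone who wants to follow the "Use mminfo" hint from the completion
report, here is a minimal Python sketch (not part of the original exchange)
that only builds the mminfo command lines for the clients in a group and
prints them; it does not run anything.  The option letters (-v, -q, -r), the
query syntax and the report attribute names are written from memory of the
mminfo man page and should be treated as assumptions to check against your
NetWorker version.  The function name and client names are hypothetical.

```python
import shlex

# Report columns to request. ASSUMPTION: these are valid mminfo report
# attributes on your NetWorker version; verify against the mminfo man page.
REPORT_FIELDS = ["client", "name", "level", "savetime", "sumflags", "ssflags"]


def build_mminfo_commands(clients, since="yesterday"):
    """Return one mminfo argument list per client. Nothing is executed."""
    commands = []
    for client in clients:
        # ASSUMPTION: the query accepts "client=" and "savetime>=" constraints.
        query = f"client={client},savetime>={since}"
        commands.append(
            ["mminfo", "-v", "-q", query, "-r", ",".join(REPORT_FIELDS)]
        )
    return commands


if __name__ == "__main__":
    # Hypothetical client names; substitute the members of the hung group.
    for argv in build_mminfo_commands(["client_name1", "client_name2"]):
        print(shlex.join(argv))
```

Printing the commands rather than running them keeps the sketch safe to try
on any machine; paste the output into a shell on the NetWorker server once
the options have been checked.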

My servers are RHEL4; one is x86_64, the other is x86.  Both are running
NetWorker 7.4.3 (32-bit).  The hanging save groups have certainly been
happening since 7.3.3 or earlier.  It was one of the reasons we moved to
7.4, hoping the problem would go away. :-(

I hope this info helps.

Patti Clark
DOE/OSTI
