> -----Original Message-----
> From: EMC NetWorker discussion
> [mailto:NETWORKER AT LISTSERV.TEMPLE DOT EDU] On Behalf Of Stan Horwitz
> Sent: Tuesday, July 21, 2009 3:53 PM
> To: NETWORKER AT LISTSERV.TEMPLE DOT EDU
> Subject: [Networker] The problem of the hung savegroups
>
> Greetings everyone;
>
> I have a case open with EMC with regard to this problem where a hung
> client will hang the savegroup that contains that client. I have this
> case open for my NetWorker 7.4.4 server on Red Hat Linux AS 4.5, but I
> also have the same problem with a NetWorker 7.4.1 server on Solaris 10.
>
> One of the things EMC had me do on my Linux server was
>
> save -v -D9 -g grp_name -c client_name
>
> We did that for one client in a group that was hanging after I
> manually stopped the group from the NMC GUI.
>
> The problem just happened on my Solaris NetWorker server, so I tried a
> similar thing where I logged onto that NetWorker server and for every
> client in the failed group, I issued a command along the lines of
>
> save -v -D9 -g grp_name -c client_name1 -c client_name2 -c client_nameN
>
> I included all nine clients in the group. The scheduled backup today is
> an incremental for that group. The schedule is controlled by the group.
> The backup worked. Then I went into the NMC GUI and I started the group
> that way. Again, it worked.
>
> What I am wondering is why it worked. Why did running the backup from
> the NetWorker server manually fix the problem?
>
> Actually, in this case, the group in question didn't hang; it just
> died, and all nine clients registered a failed backup in the resulting
> savegroup report.
>
> EMC and I have been trying to troubleshoot this issue on my Linux
> NetWorker server (which backs up much more important data than our
> Solaris NetWorker server), but since we tried that stunt of manually
> backing up one of the stuck groups, the problem hasn't occurred again,
> so we are waiting for it to happen again so we can collect some
> debugging info.
>
>>>>>>>>>>>>>
What does -D9 provide? Debug?

I've been seeing hung save groups every so often on both of my NetWorker
systems, one of which is very small. This is NOT a DNS problem: my small
system is very static, with no network changes or client changes in months.
My scheduled backups will run flawlessly for weeks, then one night one group
will hang. This is NOT a resource issue either: the group consists of 7
clients, and a typical incremental takes all of 5 minutes. Usually, the
clients are Linux.

One of the clients (a different one each time) will show aborted save sets in
the completion report because of the job termination; however, the report
also says "Cannot determine status of backup process. Use mminfo to determine
job status." for those same save sets. I stop the group using NMC, wait for
it to complete, then restart it. If it's within the interval, it'll only grab
the incomplete save sets (sometimes this is only an index) and wrap things up
normally. The next run of the group will be normal and life will continue.

I wanted to clarify what is not the cause. I don't think there's any magic in
running the backup from the server other than collecting debug information,
which probably won't contain anything of value, since the condition that
caused the hang had already been cleared by stopping the hung group. I think
if you had simply restarted your backup, it would have completed just as
normally.
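
As a rough sketch of the "use mminfo to determine job status" step, the
snippet below only builds and prints an mminfo command line per client; it
does not run anything. The helper name build_mminfo_cmd, the "yesterday"
cutoff, and the chosen report columns are illustrative assumptions, not
something EMC prescribed, so check them against the mminfo man page for your
NetWorker release before relying on them.

```python
import shlex


def build_mminfo_cmd(client, since="yesterday"):
    """Return an argv list asking mminfo for a client's recent save sets.

    Hypothetical helper: the query and report fields are assumptions
    chosen for illustration; adjust them to your own environment.
    """
    # -q takes a query spec, -r takes the list of columns to report.
    query = f"client={client},savetime>={since}"
    report = "client,name,level,savetime,sumflags"
    return ["mminfo", "-q", query, "-r", report]


if __name__ == "__main__":
    # Print the commands instead of executing them, so the sketch can be
    # reviewed on a machine that does not have NetWorker installed.
    for client in ["client_name1", "client_name2"]:
        print(shlex.join(build_mminfo_cmd(client)))
```

Run the printed commands on the NetWorker server after the hung group has
been stopped, and compare the summary flags of the affected save sets with
what the completion report claimed.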

My servers are RHEL4; one is x86_64, the other is x86. Both are running
32-bit NetWorker 7.4.3. The hanging save groups have certainly been happening
since 7.3.3 or earlier - it was one of the reasons to move to 7.4, hoping the
problem would go away. :-(

I hope this info helps.
Patti Clark
DOE/OSTI