Networker

Re: [Networker] The problem of the hung savegroups

2009-07-21 16:21:43
From: Stephanie Finnegan <sfinnega AT AIP DOT ORG>
To: NETWORKER AT LISTSERV.TEMPLE DOT EDU
Date: Tue, 21 Jul 2009 16:15:22 -0400
Unfortunately, I can't help you, as I'm in a similar boat myself.  Are any of
these clients Oracle clients?  We've had a case open with EMC for months for
RMAN backups that hang.  We've run every debug and diagnostic under the sun,
and so far nothing.  We're running 7.4.4.4 on Solaris 10, although just this
morning we were offered 7.4.6, which I didn't even know was out.  I've run the
backup(s) from the command line dozens of times with the same hanging result,
although not using the exact syntax you posted.  Can you tell me, is the -D9
your choice, or was that the EMC recommendation?  Any info would be appreciated.
Thanks.
 

>>> On Tuesday, July 21, 2009 at 3:53 PM, Stan Horwitz <stan AT TEMPLE DOT EDU>
wrote:
Greetings everyone;

I have a case open with EMC with regard to this problem where a hung client
will hang the savegroup that contains that client. I have this case open for
my NetWorker 7.4.4 server on Red Hat Linux AS 4.5, but I also have the same
problem with NetWorker 7.4.1 server on Solaris 10.

One of the things EMC had me do on my Linux server was to run

save -v -D9 -g grp_name -c client_name

We did that for one client in a group that was hanging, after I manually
stopped the group from the NMC GUI.

The problem just happened on my Solaris NetWorker server, so I tried
something similar: I logged on to that NetWorker server and issued a single
command covering every client in the failed group, along the lines of

save -v -D9 -g grp_name -c client_name1 -c client_name2 -c client_nameN

I included all nine clients in the group. The scheduled backup today is an
incremental for that group. The schedule is controlled by the group. The
backup worked. Then I went into the NMC GUI and I started the group that
way. Again, it worked.
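
For anyone who wants to repeat this with a longer client list, here is a
minimal shell sketch that assembles a command line of the form shown above.
It only prints the command for review and does not run it. The flags are
copied as posted in this message and have not been checked against the
NetWorker documentation; the function name build_save_cmd and the group and
client names are placeholders, not anything NetWorker provides.

```shell
#!/bin/sh
# Sketch only: prints a save command line of the form quoted in this thread,
# with one -c argument per client. The flags (-v -D9 -g -c) are taken from
# the message as posted, not verified against the NetWorker man pages.
# build_save_cmd is a hypothetical helper; all names are placeholders.
build_save_cmd() {
    group=$1
    shift
    cmd="save -v -D9 -g $group"
    for client in "$@"; do
        cmd="$cmd -c $client"
    done
    printf '%s\n' "$cmd"
}

# Print the command for review instead of executing it.
build_save_cmd grp_name client_name1 client_name2 client_nameN
```

Printing rather than executing keeps the sketch harmless on a production
backup server; the output can be pasted into a shell once it looks right.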

What I am wondering is why it worked. Why did running the backup from the
NetWorker server manually fix the problem?

Actually, in this case, the group in question didn't hang, it just died and
all nine clients registered a failed backup in the resulting savegroup
report.

EMC and I have been trying to troubleshoot this issue on my Linux NetWorker
server (which backs up much more important data than our Solaris NetWorker
server). Since we tried that stunt of manually backing up one of the stuck
groups, though, the problem hasn't recurred, so we are waiting for it to
happen again so we can collect some debugging info.

To sign off this list, send email to listserv AT listserv.temple DOT edu and
type "signoff networker" in the body of the email. Please write to
networker-request AT listserv.temple DOT edu if you have any problems with this
list. You can access the archives at
http://listserv.temple.edu/archives/networker.html or
via RSS at http://listserv.temple.edu/cgi-bin/wa?RSS&L=NETWORKER
