Networker

Re: [Networker] Are you sure your savepnpc backups are good?

2004-04-16 04:08:35
Subject: Re: [Networker] Are you sure your savepnpc backups are good?
From: Davina Treiber <Treiber AT HOTPOP DOT COM>
To: NETWORKER AT LISTMAIL.TEMPLE DOT EDU
Date: Fri, 16 Apr 2004 07:41:42 +0100
David E. Nelson wrote:
Hi All,

A word of caution for those folks using 'timeout:' within /nsr/res/<group>.res
savepnpc scripts.

There doesn't appear to be an easy and reliable method to determine whether
your post-command scripts ran because of a timeout or because the backups
completed.

So far, I've opened a case w/ our NetWorker tech-support, emailed this list,
and another highly technical list.  Bottom line, there isn't a simple, timely,
and accurate method to determine if a savepnpc timeout condition occurred.

A couple of problems that I've uncovered:

- The timeout is reported in /nsr/logs/savepnpc.log as:

    01/22/02 04:00:35 pstclntsave: Time out condition occurred.
    01/22/02 05:12:30 pstclntsave: All command(s) ran successfully.
    01/22/02 05:13:30 pstclntsave: All savesets on the worklist are done.
    01/22/02 05:13:30 pstclntsave: Exited.

Notice that even though a savepnpc time out condition occurred, the data was
still being backed up - it didn't finish until 1 hour 13 minutes later.  For
Oracle backups, this is now trash since the DB has either been started or come
out of hot backup mode.  Are you aware that this occurred?  I'd be willing to
bet not.

This part is not a bug; it's the way it was designed. When the timeout
occurs, the post-processing runs. It's left completely to the user what
to do at this point. It's not desperately difficult to script something
that stops the overrunning sessions (although there are some gotchas),
but that's your decision, since not everyone would want to do this. Some
users might want the backup to continue but be notified about the
timeout.

I agree that doing nothing can cause problems: an Oracle backup will be
useless if the database comes back up part way through. On NT it can be
worse, since the DB may fail to start if its files are still held open by
the backup.


I was quite surprised by the number of 'Time out' entries in our savepnpc.log
files.  I'd suggest you check yours as well if you're using savepnpc w/
timeout.
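
For anyone who wants to audit their own logs along these lines, here is a
minimal Python sketch. It assumes the log lines look exactly like the excerpt
quoted above (an MM/DD/YY HH:MM:SS timestamp followed by "pstclntsave:"); the
function name find_overruns is made up for illustration and is not part of
NetWorker. It is meant for checking after the fact, not for use inside the
post-command script itself, since the log may not have been written yet at
that point.

```python
import re
from datetime import datetime

# Line layout assumed from the savepnpc.log excerpt quoted in this thread:
#   01/22/02 04:00:35 pstclntsave: Time out condition occurred.
LINE_RE = re.compile(
    r"^\s*(\d{2}/\d{2}/\d{2} \d{2}:\d{2}:\d{2}) pstclntsave: (.*)$"
)
TIMEOUT_MSG = "Time out condition occurred."
DONE_MSG = "All savesets on the worklist are done."


def find_overruns(lines):
    """Return one (timeout_at, done_at, overrun) tuple per timeout entry.

    done_at and overrun are None when no "savesets ... done" entry
    follows the timeout in the given lines.
    """
    results = []
    pending = None  # timestamp of a timeout not yet matched to a completion
    for line in lines:
        m = LINE_RE.match(line)
        if not m:
            continue  # not a pstclntsave line in the assumed layout
        # Assumption: dates are MM/DD/YY, as the excerpt suggests.
        stamp = datetime.strptime(m.group(1), "%m/%d/%y %H:%M:%S")
        msg = m.group(2).strip()
        if msg == TIMEOUT_MSG:
            if pending is not None:
                results.append((pending, None, None))
            pending = stamp
        elif msg == DONE_MSG and pending is not None:
            results.append((pending, stamp, stamp - pending))
            pending = None
    if pending is not None:
        results.append((pending, None, None))
    return results
```

Feeding it the open file, e.g. find_overruns(open("/nsr/logs/savepnpc.log")),
gives one tuple per timeout. For the four lines quoted above, the overrun
comes out as 1:12:55.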

- No environment variable or flag is passed into savepnpc scripts to indicate
whether a timeout occurred.  My research and testing have shown that the env
for post-savepnpc is identical for successful and timed out backups.

- Yes, you can script a 'grep' to look in /nsr/logs/savepnpc.log,

I wouldn't count on that. The log entries often seem to be cached in
memory and not written until the post-processing is complete.

you can
construct a query for 'mminfo' and report 'sscomp(17)' and look for
'undefined', check the status of a saveset via mminfo for 'in-progress', etc,
etc, etc.  The problem with all these approaches is that things may have
changed by the time you perform the query.

I think these approaches are likely to be quite unreliable. The best
plan is for the post-processing script to check for savepnpc processes
still running with the relevant "-g" parameter for your group, or to
query the server (nsradmin) for outstanding save sets for this client in
the work list. The latter is exactly how pstclntsave does it.
