[Networker] Are you sure your savepnpc backups are good?

2004-04-15 11:41:17
Subject: [Networker] Are you sure your savepnpc backups are good?
From: "David E. Nelson" <david.nelson AT NI DOT COM>
To: NETWORKER AT LISTMAIL.TEMPLE DOT EDU
Date: Thu, 15 Apr 2004 10:40:44 -0500
Hi All,

A word of caution for those folks using 'timeout:' within /nsr/res/<group>.res
savepnpc scripts.

There doesn't appear to be an easy and reliable method to determine whether
your post-command scripts ran because of a timeout or because the backups
completed.

So far, I've opened a case w/ our NetWorker tech-support and emailed this list
and another highly technical list.  Bottom line: there isn't a simple, timely,
and accurate method to determine if a savepnpc timeout condition occurred.

A couple of problems that I've uncovered:

- The timeout is reported in /nsr/logs/savepnpc.log as:

    01/22/02 04:00:35 pstclntsave: Time out condition occurred.
    01/22/02 05:12:30 pstclntsave: All command(s) ran successfully.
    01/22/02 05:13:30 pstclntsave: All savesets on the worklist are done.
    01/22/02 05:13:30 pstclntsave: Exited.

Notice that even though a savepnpc timeout condition occurred, the data was
still being backed up; the backup didn't finish until about 1 hour 13 minutes
later.  For Oracle backups, the result is trash, since the DB was either
started or taken out of hot backup mode while the backup was still running.
Are you aware when this occurs?  I'd be willing to bet not.

I was quite surprised at the number of 'Time out' entries in our savepnpc.log
files.  I'd suggest you check yours as well if you're using savepnpc w/
timeout.
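
For anyone who wants to run that check, here is a minimal sketch in sh/ksh.
It assumes the log lives at /nsr/logs/savepnpc.log and that the message text
matches the excerpt above; the function names and the SAVEPNPC_LOG variable are
made up for illustration and are not part of NetWorker.

```shell
#!/bin/sh
# Sketch only: look for savepnpc timeout entries in a log file.
# SAVEPNPC_LOG, list_timeouts and count_timeouts are hypothetical names;
# the path and message text are taken from the log excerpt in this post.
SAVEPNPC_LOG=${SAVEPNPC_LOG:-/nsr/logs/savepnpc.log}
TIMEOUT_MSG='pstclntsave: Time out condition occurred'

list_timeouts() {
    # $1: log file to scan (defaults to $SAVEPNPC_LOG)
    # Prints every line that records a timeout, timestamp included.
    grep "$TIMEOUT_MSG" "${1:-$SAVEPNPC_LOG}"
}

count_timeouts() {
    # $1: log file to scan (defaults to $SAVEPNPC_LOG)
    # grep -c prints the number of matching lines (0 when there are none).
    grep -c "$TIMEOUT_MSG" "${1:-$SAVEPNPC_LOG}"
}
```

Calling list_timeouts with no argument on a client shows when each timeout was
logged; count_timeouts gives a quick number to compare across clients.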

- No environment variable flag is passed into savepnpc scripts to indicate
whether a timeout occurred.  My research and testing have shown that the env
for post-savepnpc is identical for successful and timed-out backups.

- Yes, you can script a 'grep' to look in /nsr/logs/savepnpc.log, you can
construct a query for 'mminfo' that reports 'sscomp(17)' and look for
'undefined', you can check the status of a saveset via mminfo for
'in-progress', etc., etc.  The problem with all these approaches is that things
may have changed by the time you perform the query.  So while the chance that
your backup is bad is slim, you're still gambling - not a good thing when a
production host needs to be restored, only to discover that the data is trash.

- Nothing in the backup reports that NW produces indicates that a timeout
occurred.
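
To make the 'grep' workaround concrete, here is a rough sh/ksh sketch of what a
post-command script could do.  It assumes, based only on the ordering in the
log excerpt above, that the 'Time out' line is written before the post-command
starts, so the newest pstclntsave line at that moment tells you why the script
is running.  The function name is hypothetical, and the check has exactly the
weakness described above: the log can change (or interleave with another
group) between the write and the read.

```shell
#!/bin/sh
# Sketch only: a race-prone heuristic, not a reliable timeout detector.
# last_run_timed_out is a hypothetical name; the log path and message text
# come from this post.

last_run_timed_out() {
    # $1: log file to inspect (defaults to /nsr/logs/savepnpc.log)
    # Returns 0 if the newest pstclntsave line is the timeout message,
    # 1 otherwise (including when the log has no pstclntsave lines).
    last=$(grep 'pstclntsave:' "${1:-/nsr/logs/savepnpc.log}" | sed -n '$p')
    case $last in
        *'Time out condition occurred'*) return 0 ;;
        *) return 1 ;;
    esac
}
```

A post-command script might call it first thing and, on a zero return, raise
an alert instead of quietly carrying on.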

Frankly, I'd like to see an env variable passed from savepnpc to scripts
stating that a timeout was encountered.  That would be easy to detect,
reliable, and timely.  Also, the backup report e-mailed from NW should include
this info at the very top of the report.
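
To show how little the script side would need, here is a sketch of a
post-command check against such a variable.  SAVEPNPC_TIMEOUT is purely
hypothetical: NetWorker does not set it (that is the whole complaint), and the
function name is made up as well.

```shell
#!/bin/sh
# Sketch only: SAVEPNPC_TIMEOUT is a HYPOTHETICAL variable that savepnpc does
# not provide today; this shows what a post-command could do if it did.

post_command_guard() {
    # Returns 1 (and warns on stderr) if the hypothetical flag says the
    # post-command was triggered by a timeout; returns 0 otherwise.
    if [ "${SAVEPNPC_TIMEOUT:-0}" = "1" ]; then
        echo "savepnpc timeout reached; backup may still be running" >&2
        return 1
    fi
    return 0
}
```

The caller would decide what a non-zero return means, for example paging
someone or marking the night's backup as suspect.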

Any thoughts on this?

Regards,
        /\/elson


On Thu, 8 Apr 2004, David E. Nelson wrote:

> Hi All,
>
> Is there a reliable method in a savepnpc script to detect if the group's
> timeout (as specified in the /nsr/res/<Group>.res file) was encountered?
>
> This is on Solaris 2.6/2.8 using 6.1.3 server and 6.x/7.x clients and the
> savepnpc scripts are written in Korn shell.
>
> Thanks,
>         /\/elson
>
> --
> ~~ ** ~~  If you didn't learn anything when you broke it the 1st ~~ ** ~~
>                         time, then break it again.
>
> --
> Note: To sign off this list, send a "signoff networker" command via email
> to listserv AT listmail.temple DOT edu or visit the list's Web site at
> http://listmail.temple.edu/archives/networker.html where you can
> also view and post messages to the list.
> =*=*=*=*=*=*=*=*=*=*=*=*=*=*=*=*=*=*=*=*=*=*=*=*=*=*=*=*=*=*=*=
>
