
Re: [Networker] Limit on number of savesets in an nsrclone?

2008-03-11 16:01:22
From: Preston de Guise <enterprise.backup AT GMAIL DOT COM>
To: NETWORKER AT LISTSERV.TEMPLE DOT EDU
Date: Wed, 12 Mar 2008 06:58:00 +1100
Hi Ian,


>> Check to see whether the saveset reported as missing actually can be found under a /_AF_readonly component of the volume. If so, then you have a known bug whereby nsrd/nsrmm can accidentally generate data to the "wrong side" of the device. Because it shouldn't go there, then NetWorker can't find it when the saveset access is attempted.
>
> No, I can't see them there: all the savesets that are missing are from around the time we had an array crash and restart so I'm suspecting it's a storage problem. But I'll watch out for the scenario you describe: it rings bells for other problems.
>
> Failing ~300 savesets because one of them is unreadable is a bit naughty, too.

Unfortunately, that's the nature of nsrmmd. nsrclone tells it to read all those savesets as a single operation, so when one read fails, the whole atomic activity is considered a failure. You might also want to check whether nsrmmd is core dumping: look in /nsr/cores/nsrmmd for core dumps from around the time the failed reads occurred.
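As a rough illustration of the core dump check, the sketch below lists files in the core directory that were modified within the last few days. The function name recent_cores and the default of 7 days are made up for this example; only the /nsr/cores/nsrmmd path comes from the message above.

```shell
# recent_cores [DIR] [DAYS]
# Hypothetical helper (not a NetWorker tool): list regular files under DIR
# that were modified less than DAYS days ago.
# DIR defaults to the core directory named in the message; DAYS defaults to 7.
recent_cores() {
    core_dir=${1:-/nsr/cores/nsrmmd}
    days=${2:-7}
    if [ ! -d "$core_dir" ]; then
        echo "no such directory: $core_dir" >&2
        return 1
    fi
    # -mtime -N matches files whose age in whole days is less than N.
    find "$core_dir" -type f -mtime "-$days" -print
}
```

Running recent_cores with no arguments checks the default directory; compare the timestamps of anything it lists with the times of the failed reads.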

I've seen a similar problem caused by array crashes/connectivity losses. One way to check in advance is to do the following:

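# List every file on the disk backup unit once, so the loop only greps a text file.
find /path/to/dbu -type f -print > /tmp/results.txt

# For each long ssid NetWorker believes is on the read-only volume,
# print the ssid and the number of matching files found on disk.
for lssid in $(mminfo -q "volume=dbu.RO" -r "ssid(60)")
do
    echo "$lssid $(grep -c "$lssid" /tmp/results.txt)"
done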

That'll check every ssid that NetWorker _thinks_ is on the disk backup unit. You should see output along the lines of:

e65b3ac2-00000006-1bc0154e-47c0154e-00e60000-c0a86404 2

That's the long ssid followed by the number of times it appears on the DBU. There should be 2 instances: the actual saveset and the note for the saveset. If an ssid reports a count of 0, then NetWorker thinks the saveset is on the disk, but it isn't. You can then use nsrmm to delete the saveset. To identify the short ssid,cloneid combo you could then run:

mminfo -q "ssid=e65b3ac2-00000006-1bc0154e-47c0154e-00e60000-c0a86404" -r volume,ssid,cloneid

Then delete only the instances that are on the DBU (nsrmm -d -S ssid/cloneid).
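To reduce typing mistakes in that last step, the sketch below turns mminfo rows into nsrmm commands and only prints them, so they can be reviewed before anything is deleted. It assumes the report rows are whitespace-separated as "volume ssid cloneid"; the function name print_delete_cmds is hypothetical, and the volume names you pass are whatever your DBU volumes are actually called.

```shell
# print_delete_cmds VOLUME...
# Hypothetical helper: read "volume ssid cloneid" rows on stdin and print
# (never run) one nsrmm command for each row whose volume is listed
# on the command line. Rows for other volumes, such as tape clones, are skipped.
print_delete_cmds() {
    vols=" $* "
    while read -r vol ssid cloneid rest
    do
        # Skip blank or short lines that do not have all three fields.
        [ -n "$cloneid" ] || continue
        case "$vols" in
            *" $vol "*) echo "nsrmm -d -S $ssid/$cloneid" ;;
        esac
    done
}

# Example (assumes the DBU volumes are named dbu and dbu.RO):
#   mminfo -q "ssid=<long ssid>" -r volume,ssid,cloneid | print_delete_cmds dbu dbu.RO
```

Because the commands are only echoed, a header row or an unexpected column layout shows up as missing or odd output rather than as a deletion.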

You could make the script smarter, etc., but since I'm on a train with variable service, I'll leave that as an exercise for the reader :-)
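One possible take on that exercise is sketched here: instead of printing every ssid, it reports only those whose file count differs from the expected 2. The function name report_missing and the MISSING/UNEXPECTED labels are invented for this sketch, and it assumes the ssid list has been saved to a file with one long ssid per line.

```shell
# report_missing SSID_LIST FILE_LIST
# Hypothetical helper. SSID_LIST holds one long ssid per line (for example,
# saved mminfo output); FILE_LIST holds the output of find on the DBU.
# Prints only ssids whose file count is not the expected 2.
report_missing() {
    ssid_list=$1
    file_list=$2
    if [ ! -r "$ssid_list" ] || [ ! -r "$file_list" ]; then
        echo "usage: report_missing SSID_LIST FILE_LIST" >&2
        return 1
    fi
    while read -r lssid
    do
        [ -n "$lssid" ] || continue
        # grep -c exits non-zero when the count is 0, so ignore its exit status.
        count=$(grep -c -F -- "$lssid" "$file_list" || true)
        if [ "$count" -eq 0 ]; then
            echo "MISSING $lssid $count"
        elif [ "$count" -ne 2 ]; then
            echo "UNEXPECTED $lssid $count"
        fi
    done < "$ssid_list"
}

# Example, reusing the commands from the message:
#   find /path/to/dbu -type f -print > /tmp/results.txt
#   mminfo -q "volume=dbu.RO" -r "ssid(60)" > /tmp/ssids.txt
#   report_missing /tmp/ssids.txt /tmp/results.txt
```

If mminfo prints a column header as the first line of its report, that line will be treated as an ssid, so it is worth dropping it from the list before trusting the output.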

Cheers,

Preston.

--
Preston de Guise


"Enterprise Systems Backup and Recovery: A Corporate Insurance Policy", due out August 15 2008:

http://www.crcpress.com/shopping_cart/products/product_detail.asp?sku=AU6396&isbn=9781420076394&parent_id=&pc=

