[Networker] automated staging failing to delete on-disk saveset clones

2008-02-27 22:18:21
Subject: [Networker] automated staging failing to delete on-disk saveset clones
From: Steve Groom <sgroom AT CALTECH DOT EDU>
To: NETWORKER AT LISTSERV.TEMPLE DOT EDU
Date: Wed, 27 Feb 2008 19:13:11 -0800
Last week we started having problems with our
NetWorker setup (7.4.1.Build.335 according to the log file;
we are using the Sun "Enterprise Backup" flavor
of NetWorker). The server is a Sun V440 running Solaris 10.
We use a disk-to-disk-to-tape setup with automated staging
from disk to LTO3 tape, and it was working fine for a few months
as we gradually added more traffic to this system.

Last week we reconfigured this system in preparation for
adding a storage node and enabling drive sharing, though the
storage node itself isn't configured yet. But somehow we must have
messed something up. Since then staging has been unreliable:
it sometimes gets stuck in a state
where it can't delete the on-disk savesets after cloning them to
tape, so it clones the same savesets to tape again and again.

(To be clear, the other storage node and drive sharing are not
actively in use; they are configured, but the storage node isn't
doing anything yet.)

After successfully cloning savesets to tape, nsrstage reports the error:

nsrstage Deletion of clone 1204089207 of ssid 4190433751 from media database failed with the error 'Purge save set operation already in progress'.

Because this deletion failed, the next time automatic
staging ran, another tape clone of the same saveset was made,
followed by the same deletion error. The cycle kept
repeating until we figured out what was going on and
eventually stopped it.

The first time this happened we ended up with 13 tape copies of
a 2TB saveset plus a bunch of smaller ones before we figured out what was
going on and how to clear it up. We figured we hadn't reset something
properly after configuring the library for drive sharing. To clear
it up that time we restarted NetWorker (even rebooted the machine!),
manually deleted all the extra clones,
and recycled the wasted tapes.
After that it ran fine for several days.

We thought we were past it. But not so fast!
This morning, after working fine for several days, it
suddenly started happening again. Again, an error message about
deleting the disk copy, and again the never-ending cycle of
copying the same savesets to tape over and over again
until we noticed and intervened.

Any idea what's going on? Any idea what we could have done
to cause this, and what we need to check to fix it?

When I first tried deleting the saveset manually
before we had restarted the server
(nsrmm -d -S ssid/cloneid ...)
I got the same error message, but it was preceded by "RAP error:".
Could that be a clue? (What is RAP?)

It seems to me like some kind of locking problem, and locking
is exactly the kind of thing that drive sharing would touch.
So I'm sure that's where we went wrong. But I have no idea
how to go about finding the source of the problem or how to fix it.

Here is an excerpt from the daemon.log file when this happened this morning:

32313 02/27/08 5:55:00 AM nsrmmd#3 Device /kelley_bu01/pool1/ DiskFull/a0/_AF_readonly: Automated Staging has determined the need to migrate 0 KB
38718 02/27/08 5:57:12 AM nsrd kelley:cloning session saving to pool 'Master Full' (MstrFull.0053)
38730 02/27/08 5:57:13 AM nsrd cloning session: 1 save set(s) reading from Disk_Full.01.RO 896 KB of 54 GB
38714 02/27/08 6:08:55 AM nsrd kelley:cloning session done saving to pool 'Master Full' (MstrFull.0053)
53358 02/27/08 6:18:56 AM nsrstage Deletion of clone 1204089207 of ssid 4190433751 from media database failed with the error 'Purge save set operation already in progress'.
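
In case it helps anyone reproduce the diagnosis, here is a rough Python sketch of how one might scan daemon.log for the nsrstage deletion failure quoted above and count how often each ssid/cloneid pair hits it, so a repeating cycle shows up early. The regular expression is built only from the one message in this post; the function name failed_deletions is made up for illustration, and other NetWorker versions may word the message differently.

```python
import re
from collections import Counter

# Pattern derived from the single nsrstage message quoted in this post.
# Assumption: other versions or locales may format the message differently.
FAILED_DELETE = re.compile(
    r"nsrstage Deletion of clone (?P<cloneid>\d+) of ssid (?P<ssid>\d+) "
    r"from media database failed with the error '(?P<error>[^']*)'"
)


def failed_deletions(lines):
    """Count failed staging deletions per (ssid, cloneid) pair.

    `lines` is any iterable of log lines, e.g. an open daemon.log file.
    finditer is used so that several messages run together on one
    physical line are still counted individually.
    """
    counts = Counter()
    for line in lines:
        for match in FAILED_DELETE.finditer(line):
            counts[(match["ssid"], match["cloneid"])] += 1
    return counts


def report(counts):
    """Format the counts, most frequently failing pair first."""
    return [
        f"{ssid}/{cloneid}: {n} failed deletion(s)"
        for (ssid, cloneid), n in counts.most_common()
    ]
```

Typical use would be something like report(failed_deletions(open(path_to_daemon_log))). A pair that shows up more than once is probably a saveset that is being re-cloned on every staging run.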

I've looked for references to that message in the docs, in the list
archives, in EMC Powerlink, and in Sun's support system.
I can't find anything that helps in figuring out where to look.

Any hints would be very much welcomed!
Thanks in advance...

Steve Groom
