Networker

Re: [Networker] automated staging failing to delete on-disk saveset clones

2008-02-28 17:56:16
Subject: Re: [Networker] automated staging failing to delete on-disk saveset clones
From: Steve Groom <sgroom AT CALTECH DOT EDU>
To: NETWORKER AT LISTSERV.TEMPLE DOT EDU
Date: Thu, 28 Feb 2008 14:44:55 -0800
The intention here is just to share the tape devices,
not the AFTD's. I'll double check the configuration, but
I don't think that's what's happening here.
We certainly don't have a distributed
filesystem in place that would allow both servers to
mount the same AFTD at once anyway, though we were
planning on having the two servers use separate
LUN's (via separate controllers) on the same large
storage array.

-steve

On Feb 27, 2008, at 9:54 PM, Mathew Harvest wrote:

Steve,

Are you using Dynamic Drive Sharing on your Advanced File Type Devices,
or just the Tape Drives?

I've not had experience with using DDS on AFTD's but I've heard that
it's a bad thing to do ... and I can imagine that automated staging
policies could fall over themselves if this were the case

Mat.

-----Original Message-----
From: EMC NetWorker discussion [mailto:NETWORKER AT LISTSERV.TEMPLE DOT EDU] On
Behalf Of Steve Groom
Sent: Thursday, 28 February 2008 1:13 PM
To: NETWORKER AT LISTSERV.TEMPLE DOT EDU
Subject: [Networker] automated staging failing to delete on-disk saveset
clones

Last week we started having problems with our
NetWorker setup (7.4.1.Build.335 according to the log file,
we are using the Sun "Enterprise Backup" flavor
of NetWorker). This is a Sun V440 running Solaris 10.
We use a disk-to-disk-to-tape setup with automated staging
from disk to LTO3 tape, and it was working fine for a few months
as we gradually added more traffic to this system.

Last week we reconfigured this system in preparation for
adding a storage node and enabling drive sharing, though the
storage node itself isn't configured yet. But somehow we must have
messed something up. Since then the staging has been unreliable:
it sometimes gets stuck in a situation
where it can't delete the on-disk savesets after cloning them to
tape, so it clones the same savesets to tape again and again.

(To be clear, the other storage node and drive sharing are not
actively in use, they are configured but the storage node isn't
doing anything yet.)

After successfully cloning savesets to tape, nsrstage reports the error:

nsrstage Deletion of clone 1204089207 of ssid 4190433751 from media
database failed with the error 'Purge save set operation already in
progress'.

Because this deletion failed, the next time automatic
staging ran, another tape clone of the same saveset was made,
followed by the same deletion error. The cycle kept
repeating until we figured out what was going on and
eventually stopped it.

The first time this happened we ended up with 13 tape copies of
a 2TB saveset plus a bunch of smaller ones before we figured out what
was going on and how to clear it up. We figured we hadn't reset something
properly after configuring the library for drive sharing. To clear
it up that time we restarted networker (even rebooted the machine!),
and manually deleted all the extra clones,
recycled the wasted tapes,
and since then it's been running fine for several days.

We thought we were past it. But not so fast!
This morning, after working fine for several days, it
suddenly started happening again. Again, an error message about
deleting the disk copy, and again the never-ending cycle of
copying the same savesets to tape over and over again
until we noticed and intervened.

Any idea what's going on? Any idea what we could have done
to cause this, and what we need to check to fix it?

When I first tried deleting the saveset manually
before we had restarted the server
(nsrmm -d -S ssid/cloneid ...)
I got the same error message, but it was preceded by "RAP error:".
Could that be a clue? (What is RAP?)

It seems to me like some kind of locking problem, and locking
is exactly the kind of thing that drive sharing would touch.
So I'm sure that's where we went wrong. But I have no idea
how to go about finding the source of the problem or how to fix it.

Here is an excerpt from the daemon.log file when this happened this
morning:

32313 02/27/08  5:55:00 AM  nsrmmd#3 Device /kelley_bu01/pool1/DiskFull/a0/_AF_readonly: Automated Staging has determined the need to migrate 0 KB
38718 02/27/08  5:57:12 AM  nsrd kelley:cloning session saving to pool 'Master Full' (MstrFull.0053)
38730 02/27/08  5:57:13 AM  nsrd cloning session: 1 save set(s) reading from Disk_Full.01.RO 896 KB of 54 GB
38714 02/27/08  6:08:55 AM  nsrd kelley:cloning session done saving to pool 'Master Full' (MstrFull.0053)
53358 02/27/08  6:18:56 AM  nsrstage Deletion of clone 1204089207 of ssid 4190433751 from media database failed with the error 'Purge save set operation already in progress'.
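
Since the loop only stops when someone notices it, one way to catch it
early might be to scan daemon.log for the nsrstage failure shown above.
What follows is a minimal sketch, not a tested tool: the regular
expression is an assumption based only on the single message quoted in
this post, and the names FAILURE_RE and find_failed_deletions are
hypothetical. It counts failed deletions per ssid/cloneid pair, so a
count above one suggests the same saveset is being staged repeatedly.

```python
import re
from collections import Counter

# Assumed pattern: derived only from the one nsrstage message quoted in
# this post; other NetWorker versions may word it differently.
FAILURE_RE = re.compile(
    r"nsrstage Deletion of clone (\d+) of ssid (\d+) from media database "
    r"failed with the error '([^']*)'"
)


def find_failed_deletions(log_text):
    """Count failed staging deletions, keyed by (ssid, cloneid)."""
    # Entries may be wrapped across lines (as in mailed excerpts),
    # so collapse all whitespace runs before matching.
    flat = " ".join(log_text.split())
    return Counter((m.group(2), m.group(1)) for m in FAILURE_RE.finditer(flat))


# Sample input copied from the excerpt in this post.
sample = """53358 02/27/08  6:18:56 AM  nsrstage Deletion of clone 1204089207 of
ssid 4190433751 from media database failed with the error 'Purge save
set operation already in progress'."""

failures = find_failed_deletions(sample)
for (ssid, cloneid), count in failures.items():
    # The ssid/cloneid pair is the form nsrmm -d -S takes, per the post.
    print(f"{ssid}/{cloneid}: {count} failed deletion(s)")
```

Run against the real log, the text would come from reading daemon.log
rather than from the sample string; where that file lives depends on the
installation.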

I've looked for references to that message in the docs, the list
archives, EMC Powerlink, and Sun's support system, but
I can't find anything that helps in figuring out where to look.

Any hints would be very much welcomed!
Thanks in advance...

Steve Groom

To sign off this list, send email to listserv AT listserv.temple DOT edu and
type "signoff networker" in the body of the email. Please write to
networker-request AT listserv.temple DOT edu if you have any problems with this
list. You can access the archives at
http://listserv.temple.edu/archives/networker.html or
via RSS at http://listserv.temple.edu/cgi-bin/wa?RSS&L=NETWORKER




