Networker

Re: [Networker] Intermittent problem with clone operation Networker 7.1.2/Solaris

2005-04-19 13:23:39
Subject: Re: [Networker] Intermittent problem with clone operation Networker 7.1.2/Solaris
From: Will Parsons <w.parsons AT LEEDS.AC DOT UK>
To: NETWORKER AT LISTSERV.TEMPLE DOT EDU
Date: Tue, 19 Apr 2005 18:12:33 +0100
Hi Dimitris,
We've been seeing exactly this problem for the last 18 months. We're running Networker 7.1.2 on Solaris 9, although the problem has been present since the server was built on 7.1.0.

The issue seems more apparent on a Storage node than on the Master server, but the symptom is the same as yours. The clone operation starts, runs for a while, and then just stops. The drive doing the read operation is reported by the Solaris ST driver as 100% busy, but with no IO (try "iostat -xnz 2 100"). I've ended up killing off nsrmmd processes to free up the drives once they've got into this state. Nothing is ever reported in any log file; even running nsrmmd in debug mode doesn't give anything useful. The problem occurs with Manual clones AND with clones started automatically from a save group.
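
To spot the "100% busy but no IO" state without watching the terminal, something along these lines might help. It is only a rough sketch: it assumes the usual Solaris "iostat -xn" column order (r/s, w/s, kr/s, kw/s, wait, actv, wsvc_t, asvc_t, %w, %b, device), and the sample text and device names below are made up for illustration, not taken from a real system.

```python
def stalled_devices(iostat_text, busy_threshold=100.0):
    """Return names of devices reported busy while moving no data.

    Assumes each data row has ten numeric columns followed by the
    device name, as in Solaris "iostat -xn" output (an assumption).
    """
    stalled = []
    for line in iostat_text.splitlines():
        fields = line.split()
        if len(fields) != 11:
            continue  # banner or blank line
        try:
            values = [float(f) for f in fields[:10]]
        except ValueError:
            continue  # column header row
        reads, writes, kb_read, kb_written = values[0:4]
        busy = values[9]  # the %b column
        no_io = reads == 0 and writes == 0 and kb_read == 0 and kb_written == 0
        if busy >= busy_threshold and no_io:
            stalled.append(fields[10])
    return stalled


# Hypothetical sample, for illustration only.
SAMPLE = """\
                    extended device statistics
    r/s    w/s   kr/s   kw/s wait actv wsvc_t asvc_t  %w  %b device
    0.0    0.0    0.0    0.0  0.0  1.0    0.0    0.0   0 100 rmt/0
  120.5    0.0 30848.0    0.0  0.0  0.9    0.0    7.5   0  90 rmt/1
"""

print(stalled_devices(SAMPLE))
```

Feeding it the captured output of "iostat -xnz 2 100" would list any drive stuck in that state in at least one sample.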

We have one Windows Storage node which has never exhibited these symptoms, even though it manages groups with Auto-cloning switched on.

The settings that I've been working with on this are:
1) Clone Storage Node attribute. This is very fluffily defined, and its expected behaviour when cloning Manually is not apparent.
2) "no index save" on the savegroup. The thinking here was that the clone job (running on the storage node) might be waiting for a tape on the Master Server in order to clone the index data associated with the clone job. This has not been successful in resolving the problem, but I've only used it on one savegroup so far. I plan to try setting ALL the savegroups on that library to "No Index Save" and see if that allows the clones to run through.

This is a major issue for us, as we can currently only use the Master server to carry out cloning operations, so it's running flat out 16 hours a day while the storage nodes sit idle. The capacity of our system is seriously limited by this one BUG.

I've had a case open with Legato (3109892) since September 2004 without a resolution.

I'd be very interested to hear about your configuration and the exact details of your problem to see if we can compare notes and figure out common themes in the configuration. Feel free to e-mail direct.


From
Will





John Reate wrote:

Hi all,

I have a very interesting cloning problem in our site here and I wonder if
anyone has seen something similar.

With Networker 7.1.2 on Solaris 9 (Sun 280R) and 2 IBM LTO/2 tape drives on an
IBM 3584 library through a 2 Gbps SAN fabric, I try to manually clone a number
of savesets that constitute a full backup.

The content is about 181 savesets totalling about 640GB.

The command is something like:

nsrclone -S `mminfo -r ssid -q '!incomplete,savetime>=last sunday,savetime<last monday'`

The operation starts fine, mounting a new tape from the default clone pool as a
destination tape, and goes on at really high speeds (average 40-50MB/sec) for
some time -- the longest that I have seen is 35 minutes.

After that the clone command just does not do anything. It does not exit, it 
just seems to be doing nothing. The nsrwatch shows no activity and the 
performance statistics from the switches show nothing either.

The point at which it happens is not consistent; I have seen it
successfully clone 40 or 50 savesets, or 70GB out of the total, but at some
point it just stops doing anything.

If I let the thing just run it never exits and since the drives are occupied by 
it, the scheduled backups will not start either. As soon as I press CTRL-C on 
the nsrclone command it releases the drives and everything continues normally.

If I try cloning a few savesets manually the operation succeeds perfectly.
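
Since small clones go through, one thing to try might be splitting the ssid list into small batches and running one nsrclone per batch. The following is only a sketch of that idea: the helper name and the batch size of 10 are arbitrary choices of mine, and whether smaller batches actually avoid the hang is untested. It only builds the command lines; it does not run them.

```python
def batch_clone_commands(ssids, batch_size=10):
    """Split a list of ssids into nsrclone command lines, one per batch.

    The batch size is an arbitrary illustrative value, not a
    recommendation from Legato.
    """
    if batch_size < 1:
        raise ValueError("batch_size must be at least 1")
    commands = []
    for start in range(0, len(ssids), batch_size):
        batch = ssids[start:start + batch_size]
        # Same form as the manual command: nsrclone -S ssid [ssid ...]
        commands.append(["nsrclone", "-S"] + batch)
    return commands


# Hypothetical ssids standing in for the output of the mminfo query.
example_ssids = [str(4000000000 + n) for n in range(25)]
for command in batch_clone_commands(example_ssids):
    print(" ".join(command))
```

Each list could then be handed to subprocess.run(command, check=True) in turn, so that a stuck batch at least identifies which savesets were being cloned when it stopped.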

I should note here that even with the failed big cloning operation, if I check 
certain savesets (the ones that seem to have succeeded) with mminfo they seem 
to show number of copies = 2 therefore I assume they have been cloned properly.
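
To work out which savesets still need cloning after an interrupted run, the copies count could be checked in bulk. This sketch assumes a two-column report of ssid and copies, such as one produced by asking mminfo for those two attributes (that the report looks exactly like this is my assumption), and the sample report below is invented for illustration.

```python
def uncloned_ssids(report_text, wanted_copies=2):
    """Return ssids whose reported copy count is below wanted_copies.

    Expects whitespace-separated "ssid copies" rows; any other line
    (such as a header) is ignored. Each ssid is returned once.
    """
    pending = []
    for line in report_text.splitlines():
        fields = line.split()
        if len(fields) != 2 or not fields[1].isdigit():
            continue  # header or unexpected line
        ssid, copies = fields[0], int(fields[1])
        if copies < wanted_copies and ssid not in pending:
            pending.append(ssid)
    return pending


# Hypothetical report, for illustration only.
REPORT = """\
ssid        copies
4000000001  2
4000000002  1
4000000003  1
4000000004  2
"""

print(uncloned_ssids(REPORT))
```

The resulting list is what would be passed to a later nsrclone -S run instead of repeating the whole query.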

Any ideas as to the reason behind that? The fiber switches do not show any 
extensive errors or anything out of the ordinary.

Regards,

Dimitris


--
Note: To sign off this list, send a "signoff networker" command via email
to listserv AT listserv.temple DOT edu or visit the list's Web site at
http://listserv.temple.edu/archives/networker.html where you can
also view and post messages to the list. Questions regarding this list
should be sent to stan AT temple DOT edu
=*=*=*=*=*=*=*=*=*=*=*=*=*=*=*=*=*=*=*=*=*=*=*=*=*=*=*=*=*=*=*=


--


w.parsons AT leeds.ac DOT uk
UNIX Support
Information Systems Services
The University of Leeds
+44 113 343 5670
