Re: [Networker] Intermittent problem with clone operation Networker 7.1.2

Hi Dimitris,

We've been seeing exactly this problem for the last 18 months. We're running Networker 7.1.2 on Solaris 9, although the problem has been present since the server was built on 7.1.0.

The issue seems more apparent on a Storage node than on the Master server, but the symptom is the same as yours. The clone operation starts, runs for a while, and then just stops. The drive doing the read operation will be reported by the Solaris ST driver as 100% busy, but with no IO (try "iostat -xnz 2 100"). I've ended up killing off nsrmmd processes to free up the drives once they've got into this state. Nothing is ever reported in any log files; even running nsrmmd in debug mode doesn't give anything useful. The problem occurs with Manual clones, AND with clones started automatically from a save group.
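To spot the "100% busy but no IO" state without watching the screen, the iostat output can be filtered. This is only a rough sketch: the function name is made up, and it assumes the usual Solaris "iostat -xn" column order (r/s w/s kr/s kw/s wait actv wsvc_t asvc_t %w %b device), so check that against your own output first.

```shell
# Reads "iostat -xn" style output on stdin and prints every device that is
# reported busy (%b at or above a threshold, default 99) while moving no data.
# Assumed column order (Solaris iostat -xn):
#   r/s w/s kr/s kw/s wait actv wsvc_t asvc_t %w %b device
find_stuck_devices() {
    threshold="${1:-99}"
    awk -v t="$threshold" '
        # Only look at data rows: 11 fields, first field numeric.
        NF == 11 && $1 ~ /^[0-9.]+$/ {
            if ($10 + 0 >= t + 0 && $3 + 0 == 0 && $4 + 0 == 0) print $11
        }'
}

# Possible usage on the storage node:
#   iostat -xnz 2 100 | find_stuck_devices
```

A device is printed once per sampling interval in which it looks stuck, so a drive that shows up repeatedly is the one to look at.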

We have one Windows Storage node which has never exhibited these symptoms, even though it manages groups with Auto-cloning switched on.


The settings that I've been working with on this are:

1) Clone Storage Node attribute. This is very loosely defined, and its expected behaviour when cloning Manually is not apparent.

2) "no index save" on the savegroup. The thinking here was that the clone job (running on the storage node) might be waiting for a tape on the Master Server in order to clone the index data associated with the clone job. This has not been successful in resolving the problem, but I've only used it on one savegroup so far. I plan to try setting ALL the savegroups on that library to "No Index Save" and see if that allows the clones to run through.

This is a major issue to us as we can currently only use the Master server to carry out cloning operations, so it's running flat out 16 hours a day while the storage nodes sit idle. The capacity of our system is seriously limited by this one BUG.

I've had a case open with Legato (3109892) since September 2004 without a resolution.

I'd be very interested to hear about your configuration and the exact details of your problem to see if we can compare notes and figure out common themes in the configuration. Feel free to e-mail direct.



From
Will





John Reate wrote:

Hi all,

I have a very interesting cloning problem at our site here and I wonder if
anyone has seen something similar.

With Networker 7.1.2 on Solaris 9 (Sun 280R) and 2 IBM LTO-2 tape drives in an
IBM 3584 library attached through a 2Gbps SAN fabric, I try to manually clone a
number of savesets that constitute a full backup.

The content is about 181 savesets totalling ~640GB.

The command is something like:

nsrclone -S `mminfo -r ssid -q '!incomplete,savetime>=last sunday,savetime<last monday'`

The operation starts fine, mounting a new tape from the default clone pool as a 
destination tape and goes on at really high speeds (average 40-50MB/sec) for 
some time -- maximum that I have seen is 35 minutes.

After that the clone command just does not do anything. It does not exit, it 
just seems to be doing nothing. The nsrwatch shows no activity and the 
performance statistics from the switches show nothing either.

This is not a consistent behaviour as to when it will happen; I have seen it
successfully cloning 40 or 50 savesets or 70GB out of the total but at some
point it just stops doing anything.

If I let the thing just run it never exits and since the drives are occupied by 
it, the scheduled backups will not start either. As soon as I press CTRL-C on 
the nsrclone command it releases the drives and everything continues normally.

If I try cloning a few savesets manually the operation succeeds perfectly.
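Since small clones succeed and the large one stalls, one way to narrow things down might be to feed the same save set list to nsrclone in small batches. The sketch below is a diagnostic aid under assumptions, not a fix: run_in_batches is a made-up helper name, the batch size of 10 is arbitrary, and it simply runs whatever command it is given once per batch.

```shell
# Reads save set IDs on stdin (one per line) and runs the given command
# once per batch of at most $1 IDs. Lines whose first field is not purely
# numeric (for example a report header) are skipped.
run_in_batches() {
    batch_size="$1"
    shift
    awk '$1 ~ /^[0-9]+$/ { print $1 }' | xargs -n "$batch_size" "$@"
}

# Hypothetical usage, reusing the query from the message above:
#   mminfo -r ssid -q '!incomplete,savetime>=last sunday,savetime<last monday' \
#       | run_in_batches 10 nsrclone -S
```

The batches run one after another, so if one of them stalls, the save sets involved are limited to that batch.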

I should note here that even with the failed big cloning operation, if I check
certain savesets (the ones that seem to have succeeded) with mminfo they show
number of copies = 2, so I assume they have been cloned properly.
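The copy count check can be turned into a list of save sets that still need cloning after an interrupted run. This is a sketch only: list_missing_clones is a made-up name, and it assumes the input is lines of "ssid copies", for instance from an mminfo report requesting the ssid and copies attributes (verify the attribute names and output layout against your mminfo man page).

```shell
# Reads lines of the form "<ssid> <copies>" on stdin and prints, once each,
# every ssid whose copy count is below the wanted number (default 2).
# Non-numeric lines such as a header are ignored.
list_missing_clones() {
    wanted="${1:-2}"
    awk -v w="$wanted" '
        $1 ~ /^[0-9]+$/ && $2 ~ /^[0-9]+$/ && ($2 + 0) < (w + 0) && !seen[$1]++ {
            print $1
        }'
}

# Possible usage (report attributes are an assumption, see above):
#   mminfo -r "ssid,copies" -q '!incomplete,savetime>=last sunday,savetime<last monday' \
#       | list_missing_clones
```

The resulting IDs could then be passed to a second nsrclone -S run instead of restarting the whole clone.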

Any ideas as to the reason behind that? The fiber switches do not show any 
extensive errors or anything out of the ordinary.

Regards,

Dimitris

--
Note: To sign off this list, send a "signoff networker" command via email
to listserv AT listserv.temple DOT edu or visit the list's Web site at
http://listserv.temple.edu/archives/networker.html where you can
also view and post messages to the list. Questions regarding this list
should be sent to stan AT temple DOT edu
=*=*=*=*=*=*=*=*=*=*=*=*=*=*=*=*=*=*=*=*=*=*=*=*=*=*=*=*=*=*=*=



--


w.parsons AT leeds.ac DOT uk
UNIX Support
Information Systems Services
The University of Leeds
+44 113 343 5670
