Re: [Networker] Intermittent problem with clone operation Networker 7.1.2/Solaris
2005-04-19 13:23:39
Hi Dimitris,
We've been seeing exactly this problem for the last 18 months. We're
running Networker 7.1.2 on Solaris 9, although the problem has been
present since the server was built on 7.1.0.
The issue seems more apparent on a Storage node than on the Master
server, but the symptom is the same as yours. The clone operation
starts, runs for a while, and then just stops. The drive doing
the read operation will be reported by the Solaris ST driver as 100%
busy, but with no IO (try "iostat -xnz 2 100"). I've ended up killing
off nsrmmd processes to free up the drives once they've got into this state.
Nothing is ever reported in any log file; even running nsrmmd in debug
mode doesn't give anything useful. The problem occurs with manual
clones AND with clones started automatically from a savegroup.
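The iostat check can be scripted so a stalled drive is spotted without eyeballing the output. This is a minimal sketch, assuming the standard 11-column Solaris `iostat -xn` layout (r/s, w/s, kr/s, kw/s, wait, actv, wsvc_t, asvc_t, %w, %b, device); the function name flag_stalled_devices and the 99% threshold are hypothetical choices, not part of NetWorker or Solaris. It prints any device that reports near 100% busy while doing no reads or writes in the same sample.

```shell
#!/bin/sh
# flag_stalled_devices (hypothetical helper): reads Solaris "iostat -xn" output
# on stdin and prints the name of every device reporting at least 99% busy (%b)
# with zero reads and zero writes per second in the same sample.
flag_stalled_devices() {
    # Data rows have 11 fields and a numeric first column; this skips the
    # "extended device statistics" banner and the column-header row.
    awk 'NF == 11 && $1 ~ /^[0-9.]+$/ && $10 + 0 >= 99 && $1 + 0 == 0 && $2 + 0 == 0 { print $11 }'
}
```

Usage: iostat -xnz 2 100 | flag_stalled_devices. The -z flag only hides all-zero rows, so a drive stuck at 100% busy is still reported.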
We have one Windows Storage node which has never exhibited these
symptoms, even though it manages groups with Auto-cloning switched on.
The settings that I've been working with on this are:
1) The Clone Storage Node attribute. This is only vaguely defined, and its
expected behaviour when cloning manually is not clear.
2) "no index save" on the savegroup. The thinking here was that the
clone job (running on the storage node) might be waiting for a tape on
the Master Server in order to clone the index data associated with the
clone job. This has not resolved the problem, but
I've only used it on one savegroup so far. I plan to try setting ALL the
savegroups on that library to "No Index Save" and see if that allows the
clones to run through.
This is a major issue to us as we can currently only use the Master
server to carry out cloning operations, so it's running flat out 16
hours a day while the storage nodes sit idle. The capacity of our system
is seriously limited by this one BUG.
I've had a case open with Legato (3109892) since September 2004 without
a resolution.
I'd be very interested to hear about your configuration and the exact
details of your problem to see if we can compare notes and figure out
common themes in the configuration. Feel free to e-mail direct.
From
Will
John Reate wrote:
Hi all,
I have a very interesting cloning problem at our site and I wonder if
anyone has seen something similar.
With Networker 7.1.2 on Solaris 9 (Sun 280R) and 2 IBM LTO-2 tape drives in an
IBM 3584 library, attached through a 2 Gbps SAN fabric, I am trying to manually
clone a number of savesets that make up a full backup.
The set consists of about 181 savesets totalling about 640 GB.
The command is something like:
nsrclone -S `mminfo -r ssid -q '!incomplete,savetime>=last sunday,savetime<last monday'`
The operation starts fine, mounting a new tape from the default clone pool as
the destination, and runs at high speed (40-50 MB/s on average) for a while;
the longest I have seen it run is 35 minutes.
After that the clone command just stops doing anything. It does not exit; it
just sits there idle. nsrwatch shows no activity, and the
performance statistics from the switches show nothing either.
The point at which this happens is not consistent: I have seen it
successfully clone 40 or 50 savesets, or 70 GB of the total, before it
just stops doing anything.
If I just let it run, it never exits, and since it holds the drives, the
scheduled backups will not start either. As soon as I press CTRL-C on
the nsrclone command, it releases the drives and everything continues normally.
If I clone just a few savesets manually, the operation succeeds perfectly.
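Since small runs succeed, one possible workaround (untested against this bug) is to pass the same mminfo output to nsrclone in small batches rather than as a single large job. This is a minimal sketch, assuming mminfo -r ssid prints one decimal ssid per line, possibly after a header line; the function name clone_in_batches, the default batch size of 10 and the NSRCLONE override are hypothetical.

```shell
#!/bin/sh
# clone_in_batches (hypothetical helper): reads save set IDs on stdin, e.g.
# "mminfo -r ssid" output, and runs "nsrclone -S" on at most $1 of them at a
# time (default 10). Setting NSRCLONE=echo gives a dry run that only prints
# the arguments each nsrclone call would receive.
clone_in_batches() {
    batch_size=${1:-10}
    # Keep numeric ssids only (drops a header line such as "ssid") and print
    # each ssid once, in the order mminfo listed them.
    awk '$1 ~ /^[0-9]+$/ && !seen[$1]++ { print $1 }' |
        xargs -n "$batch_size" ${NSRCLONE:-nsrclone} -S
}
```

Usage: mminfo -r ssid -q '!incomplete,savetime>=last sunday,savetime<last monday' | clone_in_batches 10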
I should note that even when the big clone operation fails, the savesets
that seem to have completed show number of copies = 2 in mminfo, so I
assume those have been cloned properly.
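To see which savesets from an interrupted run still need cloning, that copy count can be filtered. This is a minimal sketch, assuming mminfo accepts copies as a report attribute (as in -r 'ssid,copies') and prints the ssid in the first column and the copy count in the second; the function name list_single_copy is hypothetical.

```shell
#!/bin/sh
# list_single_copy (hypothetical helper): reads "mminfo -r 'ssid,copies'"
# output on stdin and prints each ssid whose copies column is below 2,
# i.e. save sets that have not yet been cloned.
list_single_copy() {
    # Both columns must be numeric, which skips the header line; seen[] makes
    # sure an ssid listed on several lines is printed only once.
    awk '$1 ~ /^[0-9]+$/ && $2 ~ /^[0-9]+$/ && $2 + 0 < 2 && !seen[$1]++ { print $1 }'
}
```

Usage: mminfo -r 'ssid,copies' -q '!incomplete,savetime>=last sunday,savetime<last monday' | list_single_copy. The resulting ssids can then be passed to nsrclone -S in the same way as the original command.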
Any ideas as to the cause? The fibre switches do not show any
significant errors or anything else out of the ordinary.
Regards,
Dimitris
--
Note: To sign off this list, send a "signoff networker" command via email
to listserv AT listserv.temple DOT edu or visit the list's Web site at
http://listserv.temple.edu/archives/networker.html where you can
also view and post messages to the list. Questions regarding this list
should be sent to stan AT temple DOT edu
=*=*=*=*=*=*=*=*=*=*=*=*=*=*=*=*=*=*=*=*=*=*=*=*=*=*=*=*=*=*=*=
--
w.parsons AT leeds.ac DOT uk
UNIX Support
Information Systems Services
The University of Leeds
+44 113 343 5670