[Networker] Savegrp Aborting

I'm experiencing a really weird technical issue
in Boston and I'm wondering if you've ever seen
something like this or have any ideas on what else to try?

The customer has been using Solstice Backup 6.0.3
(Sun's version of Networker) in their environment for
several years. They back up 7 clients, all Sun boxes running
Solaris 2.6, 7, 8, and 9. All of the clients have SBU 6.0.3
client sw on them and they back up fine to the original SBU server.

They want to move their config over from Solstice Backup to Legato
Networker. So the plan is to have them keep backing up to the original
server while we configure a new Networker 7.1.1 server. Once
backups are successful on the new server, the old server will no
longer be a Networker server and all of the clients will then have
their client sw updated from SBU 6.0.3 to Networker 7.1.1.

The new server is a Sun Fire V880 running Solaris 9 with the latest
recommended patch cluster downloaded from SunSolve. We've attached a
Qualstar 58132 library containing 3 SAIT-1 tape drives via SCSI.
We installed Networker 7.1.1 on it.

Networker is configured and initial backups of a Networker client run
fine. Great performance is sustained...seeing 30 MB/sec to 40 MB/sec per
drive.

Add in the client systems and configure them. Kick off backups and most seem
fine. The SBU server takes a long time because we are backing up
its nsr partition to tape on the new server. nsr there is pretty large.
But backups generally succeed.

They also have several Procom NAS boxes which contain very large
filesystems. These filesystems are NFS mounted onto vision. I configured
several client resources for the NFS mounts on the new Networker
server...each one containing several of the NFS mount points. The
total data of all the NFS mounts is almost 4 TB. The largest individual
NFS mount is 800 GB.

Anyhow, when we kick off everything, backups start and again we see
really good performance....all 3 drives writing at speeds as high as
60 MB/sec but averaging 30 MB/sec. These backups can run for several
hours and look fine. If any of the savesets complete, they are updated
in the group details window.

However, after a while....with 'while' being not clearly defined, the
savegrps start to abort. When I watch the daemon.log I'm seeing a
message:

savegrp: SYSTEM error: No such file or directory
savegrp: Failed to update server, aborting Savegroup

And then the savegrp process gets killed off.
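
To line up the aborts with whatever else is happening on the box, it
may help to count them straight out of daemon.log. The sketch below is
only that: the function names are made up for illustration, and the log
path is an assumption (usually /nsr/logs/daemon.log, but pass whatever
applies on your server).

```shell
# Hypothetical helper: count savegrp aborts recorded in a daemon log.
# $1 is the path to the log file (assumed, e.g. /nsr/logs/daemon.log).
count_savegrp_aborts() {
    log=$1
    # grep -c prints 0 but exits non-zero when nothing matches,
    # so its exit status is deliberately ignored.
    grep -c 'savegrp: Failed to update server, aborting' "$log" || true
}

# Hypothetical helper: print the abort lines and the SYSTEM error lines
# that go with them, prefixed with their line numbers in the log.
show_savegrp_aborts() {
    log=$1
    grep -n -e 'savegrp: SYSTEM error: No such file or directory' \
            -e 'savegrp: Failed to update server, aborting' "$log" || true
}
```

For example, `count_savegrp_aborts /nsr/logs/daemon.log` prints a single
number, and `show_savegrp_aborts` on the same file lists the matching
lines so their timestamps can be compared against the ls/touch symptoms.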

I ran the networker nsrd and nsrexecd daemons in debug mode and it
pointed me to the /nsr/tmp directory.

If I cd to /nsr/tmp/sec/sg/<groupname> and type ls, it lists the files
like pr000001 and pr00000a. However if I type ls -la it shows:

# ls -la
pr000001: No such file or directory
total 6
2 drwxr-xr-x   2 root   other   512 Apr 12 18:08 .
2 drwxr-xr-x  10 root   other   512 Apr 12 17:10 ..
2 -rw-r--r--   1 root   other   348 Apr 12 12:17 pr00000a

If I then run 'touch test', I get the message "test: cannot stat".
However, if I type 'ls', it shows pr000001, pr00000a, test.
But if I type 'ls -la' it shows:

# ls -la
pr000001: No such file or directory
test: No such file or directory
total 6
2 drwxr-xr-x   2 root   other   512 Apr 12 18:08 .
2 drwxr-xr-x  10 root   other   512 Apr 12 17:10 ..

If I wait about 10 or 15 minutes and then type ls -la it shows all
the files including 'test' just fine, showing their permissions and
everything else.  This would appear to be an OS issue to me but it is
really weird.
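
The ls versus ls -la difference above boils down to names that the
directory returns but that cannot be stat'ed. A small sh function can
check a whole directory for such entries in one go. This is a rough
sketch, not a Networker or Solaris tool: the function name is
hypothetical, and it assumes file names without embedded newlines.

```shell
# Hypothetical helper: print names that appear in a directory listing
# but cannot be stat'ed (the "ls shows it, ls -la does not" symptom).
# $1 is the directory to check; defaults to the current directory.
find_phantom_entries() {
    dir=${1:-.}
    # Plain ls -1a only needs the directory contents, one name per line.
    ls -1a "$dir" | while IFS= read -r name; do
        case $name in
            .|..) continue ;;
        esac
        # -e needs a successful stat of the target; -L covers dangling
        # symlinks, which exist but whose target does not.
        if [ ! -e "$dir/$name" ] && [ ! -L "$dir/$name" ]; then
            printf '%s\n' "$name"
        fi
    done
}
```

On a healthy filesystem this prints nothing. Run against
/nsr/tmp/sec/sg/<groupname> while the problem is happening, it would be
expected to print names like pr000001.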

I am able to get around the problem by turning on the "No Monitor"
option within the group resource. The problem with this is that it
then does not update the status of the savegroup and no savegroup
notifications are logged or created. So it's not really a fix....just
a temp workaround.

For kicks we totally reloaded Solaris 9 on a different disk yesterday
and reinstalled the patches and Networker 7.1.1. Same problem still
occurs.

Also, thinking that this might be a problem with Networker 7.1.1,
I pkgrm it and installed Networker 7.1....and got the same problem.
Removed 7.1 and installed Networker 6.1.4 and got the same problem.

So, wondering if you have any ideas on this?  I'm thinking the main culprit
is that 'touch test' gives the cannot stat message and ls -la then shows
no such file or directory. But I'm not sure what the heck is causing it.
BTW...nothing helpful in /var/adm/messages.
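
To put a number on the "10 or 15 minutes" delay, the touch test can be
scripted so it reports how long a new file takes to become stat-able.
Again only a sketch under assumptions: the function name, the default
timeout of 1200 seconds, and the 5 second polling interval are all
arbitrary choices for illustration.

```shell
# Hypothetical helper: create a file, then poll until it can be stat'ed.
# $1 = path to create, $2 = timeout in seconds, $3 = poll interval.
# Prints the seconds waited, or "timeout" if the file never shows up.
time_until_visible() {
    path=$1
    timeout=${2:-1200}
    interval=${3:-5}
    elapsed=0
    # touch may complain ("cannot stat") even though the file gets
    # created, so its error output and exit status are ignored here.
    touch "$path" 2>/dev/null || true
    while [ ! -e "$path" ]; do
        if [ "$elapsed" -ge "$timeout" ]; then
            echo "timeout"
            return 1
        fi
        sleep "$interval"
        elapsed=$((elapsed + interval))
    done
    echo "$elapsed"
}
```

On a healthy filesystem this prints 0 right away. Running it in the
affected directory at different times of day would show whether the
delay tracks the backup load.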

One small clue...I have seen some issues on this box where the same
message is occurring for files in /var/spool/mqueue. Not sure if it's
because Networker is trying to email messages or what, but there are
at times some files in /var/spool/mqueue that have the same symptoms
as those in /nsr/tmp/sec/sg...

Lastly, Legato's eknowledgebase does have a couple of instances
where the SYSTEM error: No such file or device message is seen but
it suggests checking the group names for illegal characters. With
this in mind I've kept the group names very simple, for instance: test.

Any thoughts or ideas are greatly appreciated!!!!
