A couple of things to try:
1/ DNS. Make sure that the FQDN returned by a ping and by a reverse lookup are
identical. I have seen discrepancies here cause strange behaviour on
my systems.
2/ Remove the indices for the two affected clients and start fresh. If the
problem is with a corrupt index, this may solve the problem.
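The DNS comparison in point 1 can be scripted. What follows is a minimal Python sketch, not a NetWorker tool: check_name_consistency is a hypothetical helper, and its two resolver arguments simply default to the standard socket calls so that they can be swapped out.

```python
import socket


def check_name_consistency(hostname,
                           forward=socket.gethostbyname,
                           reverse=socket.gethostbyaddr):
    """Compare a host name with the name its address resolves back to.

    Returns (address, reverse_name, matches). The comparison is
    case-insensitive and ignores a trailing dot.
    """
    address = forward(hostname)          # name -> IPv4 address
    reverse_name = reverse(address)[0]   # address -> primary host name

    def normalise(name):
        return name.rstrip(".").lower()

    matches = normalise(hostname) == normalise(reverse_name)
    return address, reverse_name, matches
```

Calling check_name_consistency() with the fully qualified name of each client, both on the server and on the client itself, should show whether the two lookups agree. A short name that reverses to a fully qualified one is reported as a mismatch, which is the kind of discrepancy described above.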
Good luck
Andrew McGeorge
Senior Systems Specialist
Group Technology Operations
ASB Bank Limited
-----Original Message-----
From: Gary Goldberg [mailto:og AT DIGIMARK DOT NET]
Sent: 19 February 2003 7:53:AM
To: NETWORKER AT LISTMAIL.TEMPLE DOT EDU
Subject: [Networker] Clients failing, retrying, failing - Ugh!
Hello. I'm at my wit's end, and Legato tech support has been of no help
at all. I appreciate the help of anyone who wants to help me tackle
this. Some of you have seen parts of this before -- I've found some
additional info which may help understand the problem.
I'm running a Networker server (6.11 Build 238) on Windows NT 4 SP6a,
with 13 clients, all installed into the Default group and pool. The
group is set to Verbose, to run images on Sunday at 1:30AM and
incrementals the other six days of the week. The backups are run to
an AIT-1 30 tape Treefrog jukebox with two drives.
There has been an ongoing problem with this system for going on two
years now, where the nsrserverhost, its indices, and the indices of
all the clients fail with an Unknown error 0x93. I don't think it
is relevant to this problem but I add it to this list in case I am
wrong.
----
Everything was fine until I started to deploy some SonicWALL SOHO2
and SOHO3 firewalls in front of several servers. Each of the
machines I operate is run on behalf of a different client, so
the company LAN is more like a loose affiliation. Tweaking the
SonicWALL rules in each firewall and setting verbose mode in the
Default group in Networker allows the machines behind two of the
firewalls to back up, although considerably more slowly.
Two of the clients, each behind their own firewall, continue to
fail. One is a RedHat Linux 8 Dell PE1650 called "mystic", the
other is an UltraSPARC IIi at 440 MHz running Solaris 8 called
"clarinet". Both have plenty of memory and disk space. Each machine
has a / and a /home partition. I've set Client Retries to 1 and
Inactivity Timeout to 720 in the Default group profile.
---
Sunday/Monday I ran the image backups. All the other clients ran
fine, but these two failed. I restarted the group and had the same
behavior. Amazingly, mystic:/ saved successfully (first time in
weeks). The other three file systems failed miserably. Here is the
monitor log output: (paraphrased)
11:11PM Default running on clarinet, mystic (this message appears
every thirty seconds or so, because of the Verbose setting
on the group. So in-between all the other messages is this
one.)
11:15 clarinet:/home done 967MB
11:15 clarinet:/home saving to pool Default
11:17 mystic:/home done 779MB
11:48 clarinet:/ done 1031MB
11:48 clarinet:/ saving to pool Default
12:34AM clarinet:/home done 963MB (You'd expect this to be
slightly smaller, after
midnight some logs have
rolled and compressed.)
01:01 clarinet:/ done 1031MB
01:02 media info: verification of volume, volid, succeeded.
01:02 Write completion notice: writing to volume completed.
(still issuing "Default running on clarinet, mystic" msgs.)
01:16 mystic:/home saving to pool Default
01:57 mystic:/home done 772MB
01:58 media info: verification of volume, volid, succeeded.
01:58 Write completion notice: writing to volume completed.
02:44 Default running on mystic (clarinet stopped being in this
list.)
04:00 Default running on (nsrserverhost)
04:01 Default completed.
Important points:
1. The "Default running on clarinet, mystic" message repeats in between all
those lines above.
2. For each failed backup, it backs up the entire filesystem, but
then "something" fails to register that the filesystem has
completed, the backup times out, retries (because of Client
Retries), does the same exact thing, then finally aborts
after some timeout. The tapes are filled with "aborted" savesets
from these clients each night.
3. Despite the full run of each filesystem twice, the three filesystems
are marked "aborted/failed" and are unavailable.
mystic's daemon log has these entries (tail):
02/14/03 05:27:22 nsrexecd: failed to write NUL handshake on 5: errno 32, Broken pipe
02/14/03 05:52:56 nsrexecd: failed to write NUL handshake on 5: errno 32, Broken pipe
02/15/03 03:03:26 nsrexecd: failed to write NUL handshake on 5: errno 32, Broken pipe
02/15/03 03:25:33 nsrexecd: failed to write NUL handshake on 5: errno 32, Broken pipe
02/15/03 05:45:58 nsrexecd: failed to write NUL handshake on 5: errno 32, Broken pipe
02/15/03 05:56:16 nsrexecd: failed to write NUL handshake on 5: errno 32, Broken pipe
02/17/03 07:15:37 nsrexecd: failed to write NUL handshake on 5: errno 32, Broken pipe
02/17/03 13:48:45 nsrexecd: failed to write NUL handshake on 5: errno 32, Broken pipe
02/17/03 23:08:20 nsrexecd: failed to write NUL handshake on 5: errno 32, Broken pipe
02/18/03 01:51:32 nsrexecd: failed to write NUL handshake on 5: errno 32, Broken pipe
clarinet's daemon.log has these entries (tail):
02/09/03 15:44:58 nsrexecd: failed to write NUL handshake on 6: errno 32, Broken pipe
02/10/03 01:40:56 nsrexecd: failed to write NUL handshake on 6: errno 32, Broken pipe
02/11/03 01:44:06 nsrexecd: failed to write NUL handshake on 6: errno 32, Broken pipe
02/16/03 02:17:22 nsrexecd: failed to write NUL handshake on 6: errno 32, Broken pipe
02/17/03 11:16:49 nsrexecd: failed to write NUL handshake on 6: errno 32, Broken pipe
02/18/03 00:29:00 nsrexecd: failed to write NUL handshake on 6: errno 32, Broken pipe
02/18/03 13:41:59 nsrexecd: failed to write NUL handshake on 6: errno 32, Broken pipe
3. Is there a way to mark aborted backups as good?
4. Is there something about the timing between retries that might tell us
something?
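On the timing question, the gaps between the handshake failures can be pulled out of the daemon logs mechanically. This is only a sketch in Python: handshake_failures and gaps are hypothetical helper names, the timestamp format is assumed to be MM/DD/YY HH:MM:SS as it appears in the excerpts above, and SAMPLE holds just the first four mystic entries.

```python
import re
from datetime import datetime

# Matches the start of a handshake-failure line and captures its timestamp.
STAMP = re.compile(
    r"^(\d{2}/\d{2}/\d{2} \d{2}:\d{2}:\d{2}) nsrexecd: "
    r"failed to write NUL handshake"
)


def handshake_failures(log_text):
    """Return the datetimes of all handshake-failure lines in log_text."""
    times = []
    for line in log_text.splitlines():
        match = STAMP.match(line.strip())
        if match:
            times.append(datetime.strptime(match.group(1), "%m/%d/%y %H:%M:%S"))
    return times


def gaps(times):
    """Return the timedelta between each failure and the one before it."""
    return [later - earlier for earlier, later in zip(times, times[1:])]


# First four entries from mystic's daemon log, as quoted above.
SAMPLE = """\
02/14/03 05:27:22 nsrexecd: failed to write NUL handshake on 5: errno 32, Broken pipe
02/14/03 05:52:56 nsrexecd: failed to write NUL handshake on 5: errno 32, Broken pipe
02/15/03 03:03:26 nsrexecd: failed to write NUL handshake on 5: errno 32, Broken pipe
02/15/03 03:25:33 nsrexecd: failed to write NUL handshake on 5: errno 32, Broken pipe
"""

for gap in gaps(handshake_failures(SAMPLE)):
    print(gap)
```

Feeding the full daemon log from each client through this and setting the gaps beside the Inactivity Timeout and the monitor log times might show whether the failures follow a fixed interval.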
The firewall is set as loosely as possible, with known good rules allowing
Networker between client and nsrserverhost.
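One way to check that the rules pass traffic in both directions is to probe the Networker ports from each side. The sketch below is in Python and rests on an assumption: that nsrexecd and its portmapper listen on the default ports 7937 and 7938. If the service port range has been changed, substitute the configured ports. port_open and probe are hypothetical helper names.

```python
import socket

# Assumed defaults for nsrexecd and its portmapper; adjust to the
# service port range actually configured on your hosts.
ASSUMED_NETWORKER_PORTS = (7937, 7938)


def port_open(host, port, timeout=5.0):
    """Return True if a TCP connection to host:port can be opened."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:  # refused, timed out, unreachable, or name not found
        return False


def probe(host, ports=ASSUMED_NETWORKER_PORTS):
    """Map each port to True/False for the given host."""
    return {port: port_open(host, port) for port in ports}
```

Running probe("mystic") and probe("clarinet") from the server, and probe() against the server's name from each client, gives a quick picture of what the firewalls let through. A successful connect only shows that the port is reachable at that moment; it does not prove that a long backup session will survive.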
---
I have no idea what to do. Thank you very much. -Gary
-- "We don't see things as they are, we see them as we are." - Anais Nin
Gary Goldberg KA3ZYW <og AT digimark DOT net> V:301/249-6501 F:301/390-1955
AIM:OgGreeb
Digital Marketing/Bowie MD/Systems & Networks Consult
<http://www.digimark.net/>
--
Note: To sign off this list, send a "signoff networker" command via email
to listserv AT listmail.temple DOT edu or visit the list's Web site at
http://listmail.temple.edu/archives/networker.html where you can
also view and post messages to the list.
=*=*=*=*=*=*=*=*=*=*=*=*=*=*=*=*=*=*=*=*=*=*=*=*=*=*=*=*=*=*=*=