Networker

Re: [Networker] Clients failing, retrying, failing - Ugh!

2003-02-18 14:50:22
Subject: Re: [Networker] Clients failing, retrying, failing - Ugh!
From: Andrew McGeorge <Andrew.McGeorge AT ASBBANK.CO DOT NZ>
To: NETWORKER AT LISTMAIL.TEMPLE DOT EDU
Date: Wed, 19 Feb 2003 08:49:45 +1300
A couple of things to try:

1/ DNS. Make sure that the FQDN returned by a ping and the one returned by a
reverse lookup are identical. I have seen discrepancies here cause strange
behaviour on my systems.

2/ Remove the indices for the two affected clients and start fresh. If the
problem is with a corrupt index, this may solve the problem.
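
The DNS check in point 1 can be scripted. What follows is a minimal Python sketch, not part of the original advice: it resolves a name to an address, resolves that address back to a name, and reports whether the two agree. The function names and the example host name in the usage note are hypothetical; the resolver arguments exist only so the logic can be exercised without a live DNS server.

```python
import socket


def normalize(name):
    """Lower-case a host name and drop a trailing dot so comparisons are fair."""
    return name.strip().rstrip(".").lower()


def check_forward_reverse(hostname,
                          forward=socket.gethostbyname,
                          reverse=socket.gethostbyaddr):
    """Resolve hostname to an address, resolve the address back to a name,
    and report whether the name that comes back matches the one we started with."""
    address = forward(hostname)
    reverse_name, aliases, _addresses = reverse(address)
    known_names = {normalize(reverse_name)} | {normalize(a) for a in aliases}
    return {
        "hostname": hostname,
        "address": address,
        "reverse_name": reverse_name,
        # Strict check: the primary reverse name equals the name we asked for.
        "match": normalize(hostname) == normalize(reverse_name),
        # Looser check: the name we asked for is at least listed as an alias.
        "match_via_alias": normalize(hostname) in known_names,
    }
```

Calling `check_forward_reverse("mystic.example.com")` (a made-up name) on the server and on each client, and comparing the results, would show whether both sides agree on each other's names. A result where only `match_via_alias` is true corresponds to the kind of discrepancy described above.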

Good luck
Andrew McGeorge
Senior Systems Specialist
Group Technology Operations
ASB Bank Limited


-----Original Message-----
From: Gary Goldberg [mailto:og AT DIGIMARK DOT NET]
Sent: 19 February 2003 7:53:AM
To: NETWORKER AT LISTMAIL.TEMPLE DOT EDU
Subject: [Networker] Clients failing, retrying, failing - Ugh!


Hello. I'm at my wit's end, and Legato tech support has been of no help
at all. I appreciate the help of anyone who wants to help me tackle
this. Some of you have seen parts of this before -- I've found some
additional info which may help understand the problem.

I'm running a Networker server (6.11 Build 238) on Windows NT 4 SP6a,
with 13 clients, all installed into the Default group and pool. The
group is set to Verbose, to run images on Sunday at 1:30AM and
incrementals the other six days of the week. The backups are run to
an AIT-1 30 tape Treefrog jukebox with two drives.

There has been an ongoing problem with this system for going on two
years now, where the nsrserverhost, its indices, and the indices of
all the clients fail with an Unknown error 0x93. I don't think it
is relevant to this problem, but I add it to this list in case I am
wrong.

----

Everything was fine until I started to deploy some SonicWALL SOHO2
and SOHO3 firewalls in front of several servers. Each of the
machines I operate is run on behalf of a different client, so
the company LAN is more like a loose affiliation. Tweaking the
SonicWALL rules in each firewall and setting verbose mode in the
Default group in Networker allows the machines behind two of the
firewalls to back up, although they perform considerably slower.

Two of the clients, each behind its own firewall, continue to
fail. One is a RedHat Linux 8 Dell PE1650 called "mystic"; the
other is an UltraSPARC IIi at 440MHz running Solaris 8 called
"clarinet". Both have plenty of memory and disk space. Each machine
has a / and a /home partition. I've set Client Retries to 1 and
Inactivity Timeout to 720 in the Default group profile.

---

Sunday/Monday I ran the image backups. All the other clients ran
fine, but these two failed. I restarted the group and had the same
behavior. Amazingly, mystic:/ saved successfully (first time in
weeks). The other three file systems failed miserably. Here is the
monitor log output: (paraphrased)

11:11PM Default running on clarinet, mystic   (this message appears
        every thirty seconds or so, because of the Verbose setting
        on the group. So in between all the other messages is this
        one.)

11:15   clarinet:/home done     967MB
11:15   clarinet:/home saving to pool Default
11:17   mystic:/home done       779MB
11:48   clarinet:/ done         1031MB
11:48   clarinet:/ saving to pool Default
12:34AM clarinet:/home done     963MB  (You'd expect this to be
                                       slightly smaller, after
                                       midnight some logs have
                                       rolled and compressed.)
01:01   clarinet:/ done         1031MB
01:02   media info: verification of volume, volid, succeeded.
01:02   Write completion notice: writing to volume completed.
 (still issuing "Default running on clarinet, mystic" msgs.)
01:16   mystic:/home saving to pool Default
01:57   mystic:/home done       772MB
01:58   media info: verification of volume, volid, succeeded.
01:58   Write completion notice: writing to volume completed.
02:44   Default running on mystic  (clarinet stopped being in this
        list.)
04:00   Default running on (nsrserverhost)
04:01   Default completed.

Important points:

1. The "Default running on clarinet, mystic" message repeats in between
   all those lines above.
2. For each failed backup, it backs up the entire filesystem, but
   then "something" fails to register that the filesystem has
   completed, the backup times out, retries (because of Client
   Retries), does the same exact thing, then finally aborts
   after some timeout. The tapes are filled with "aborted" savesets
   from these clients each night.
3. Despite the full run of each filesystem twice, the three filesystems
   are marked "aborted/failed" and are unavailable.
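
The "runs twice" pattern in point 2 can be read straight off the monitor output. The sketch below is an illustration only: it copies the saveset lines from the paraphrased log above into a string and groups the "done" events by saveset. The helper name `passes_per_saveset` is hypothetical, and the line format is assumed from the paraphrase, not from real Networker output.

```python
import re
from collections import defaultdict

# Saveset lines copied from the paraphrased monitor log above.
MONITOR_LOG = """\
11:15   clarinet:/home done     967MB
11:15   clarinet:/home saving to pool Default
11:17   mystic:/home done       779MB
11:48   clarinet:/ done         1031MB
11:48   clarinet:/ saving to pool Default
12:34AM clarinet:/home done     963MB
01:01   clarinet:/ done         1031MB
01:16   mystic:/home saving to pool Default
01:57   mystic:/home done       772MB
"""

# time stamp, client:filesystem, the word "done", size in MB
DONE = re.compile(r"^(\S+)\s+(\S+:\S*)\s+done\s+(\d+)MB")


def passes_per_saveset(log_text):
    """Map each saveset to the list of (time stamp, size in MB) 'done' events."""
    passes = defaultdict(list)
    for line in log_text.splitlines():
        match = DONE.match(line.strip())
        if match:
            stamp, saveset, size_mb = match.groups()
            passes[saveset].append((stamp, int(size_mb)))
    return dict(passes)


for saveset, events in passes_per_saveset(MONITOR_LOG).items():
    print(saveset, len(events), events)
```

Each of the three failing savesets shows two "done" events of nearly the same size, which is consistent with one full pass plus one full retry, with the failure happening after the data transfer rather than during it.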

mystic's daemon log has these entries (tail):

02/14/03 05:27:22 nsrexecd: failed to write NUL handshake on 5: errno 32, Broken pipe
02/14/03 05:52:56 nsrexecd: failed to write NUL handshake on 5: errno 32, Broken pipe
02/15/03 03:03:26 nsrexecd: failed to write NUL handshake on 5: errno 32, Broken pipe
02/15/03 03:25:33 nsrexecd: failed to write NUL handshake on 5: errno 32, Broken pipe
02/15/03 05:45:58 nsrexecd: failed to write NUL handshake on 5: errno 32, Broken pipe
02/15/03 05:56:16 nsrexecd: failed to write NUL handshake on 5: errno 32, Broken pipe
02/17/03 07:15:37 nsrexecd: failed to write NUL handshake on 5: errno 32, Broken pipe
02/17/03 13:48:45 nsrexecd: failed to write NUL handshake on 5: errno 32, Broken pipe
02/17/03 23:08:20 nsrexecd: failed to write NUL handshake on 5: errno 32, Broken pipe
02/18/03 01:51:32 nsrexecd: failed to write NUL handshake on 5: errno 32, Broken pipe

clarinet's daemon.log has these entries (tail):

02/09/03 15:44:58 nsrexecd: failed to write NUL handshake on 6: errno 32, Broken pipe
02/10/03 01:40:56 nsrexecd: failed to write NUL handshake on 6: errno 32, Broken pipe
02/11/03 01:44:06 nsrexecd: failed to write NUL handshake on 6: errno 32, Broken pipe
02/16/03 02:17:22 nsrexecd: failed to write NUL handshake on 6: errno 32, Broken pipe
02/17/03 11:16:49 nsrexecd: failed to write NUL handshake on 6: errno 32, Broken pipe
02/18/03 00:29:00 nsrexecd: failed to write NUL handshake on 6: errno 32, Broken pipe
02/18/03 13:41:59 nsrexecd: failed to write NUL handshake on 6: errno 32, Broken pipe
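
For readers unfamiliar with "errno 32, Broken pipe": on Linux and Solaris, errno 32 is EPIPE, which a process gets when it writes to a connection whose other end has already gone away. The sketch below reproduces that condition with a local socket pair. It only illustrates the generic error; it makes no claim about what nsrexecd does internally or why the peer was gone in this case. The function name is hypothetical.

```python
import errno
import os
import socket


def write_to_closed_peer():
    """Send one NUL byte to a socket whose other end is already closed
    and return the errno of the resulting error (None if the send worked)."""
    ours, theirs = socket.socketpair()
    theirs.close()  # the peer goes away before we write
    try:
        ours.send(b"\0")
    except OSError as exc:
        return exc.errno
    finally:
        ours.close()
    return None


code = write_to_closed_peer()
print(code, os.strerror(code) if code is not None else "no error")
print("EPIPE on this platform is", errno.EPIPE)
```

On a Linux host this is expected to report the same error number and text as the daemon log entries. One reading, offered as an assumption and not a diagnosis, is that the connection the client tried to write its handshake on had already been closed or dropped somewhere between client and server.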

4. Is there a way to mark aborted backups as good?
5. Is there something about the timing between retries that might tell us
something?
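
On the question about timing between retries, one way to look for a pattern is to compute the gaps between consecutive handshake failures. The sketch below does this for the mystic entries quoted above (with the wrapped lines joined). The helper names are hypothetical, and the log format is assumed to be exactly as shown in the excerpt.

```python
import re
from datetime import datetime

# mystic's daemon log entries from above, one entry per line.
MYSTIC_LOG = """\
02/14/03 05:27:22 nsrexecd: failed to write NUL handshake on 5: errno 32, Broken pipe
02/14/03 05:52:56 nsrexecd: failed to write NUL handshake on 5: errno 32, Broken pipe
02/15/03 03:03:26 nsrexecd: failed to write NUL handshake on 5: errno 32, Broken pipe
02/15/03 03:25:33 nsrexecd: failed to write NUL handshake on 5: errno 32, Broken pipe
02/15/03 05:45:58 nsrexecd: failed to write NUL handshake on 5: errno 32, Broken pipe
02/15/03 05:56:16 nsrexecd: failed to write NUL handshake on 5: errno 32, Broken pipe
02/17/03 07:15:37 nsrexecd: failed to write NUL handshake on 5: errno 32, Broken pipe
02/17/03 13:48:45 nsrexecd: failed to write NUL handshake on 5: errno 32, Broken pipe
02/17/03 23:08:20 nsrexecd: failed to write NUL handshake on 5: errno 32, Broken pipe
02/18/03 01:51:32 nsrexecd: failed to write NUL handshake on 5: errno 32, Broken pipe
"""

STAMP = re.compile(r"^(\d{2}/\d{2}/\d{2} \d{2}:\d{2}:\d{2}) nsrexecd:")


def failure_times(log_text):
    """Extract the time stamp of every nsrexecd entry as a datetime."""
    times = []
    for line in log_text.splitlines():
        match = STAMP.match(line)
        if match:
            times.append(datetime.strptime(match.group(1), "%m/%d/%y %H:%M:%S"))
    return times


def gaps(times):
    """Time elapsed between each failure and the next one."""
    return [later - earlier for earlier, later in zip(times, times[1:])]


times = failure_times(MYSTIC_LOG)
for earlier, gap in zip(times, gaps(times)):
    print(earlier, "-> next failure after", gap)
```

Lining these gaps up against the group start times, the Inactivity Timeout of 720, and any idle-connection timeout configured on the firewalls might show whether the failures track one of those timers. That comparison is a suggestion, not something the logs above prove.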

The firewall is set as loosely as possible, with known good rules allowing
Networker between client and nsrserverhost.
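
Whether the rules really pass traffic can be tested independently of Networker with a plain TCP connect test. The sketch below is a generic probe, not a Networker tool; the function names are hypothetical, and the ports to test are whatever the firewall rules are meant to allow (none are given in this post).

```python
import socket


def tcp_port_open(host, port, timeout=5.0):
    """Return True if a TCP connection to host:port can be established."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Refused, timed out, unreachable, or name resolution failed.
        return False


def probe(host, ports, timeout=5.0):
    """Check several ports on one host and return {port: reachable}."""
    return {port: tcp_port_open(host, port, timeout) for port in ports}
```

Running `probe("nsrserverhost", [...])` from each client, and the equivalent from the server towards each client, with the port list filled in from the firewall rules, would show whether connections can be opened in both directions. A successful connect only proves that a new connection is allowed; it says nothing about long-lived or idle connections being dropped later.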

---

I have no idea what to do. Thank you very much. -Gary

-- "We don't see things as they are, we see them as we are." - Anais Nin
Gary Goldberg KA3ZYW <og AT digimark DOT net> V:301/249-6501 F:301/390-1955
AIM:OgGreeb
Digital Marketing/Bowie MD/Systems & Networks Consult
<http://www.digimark.net/>

--
Note: To sign off this list, send a "signoff networker" command via email
to listserv AT listmail.temple DOT edu or visit the list's Web site at
http://listmail.temple.edu/archives/networker.html where you can
also view and post messages to the list.
=*=*=*=*=*=*=*=*=*=*=*=*=*=*=*=*=*=*=*=*=*=*=*=*=*=*=*=*=*=*=*=

