Re: [Networker] Clients failing, retring, failing - Ugh!

2003-02-18 15:05:24
Subject: Re: [Networker] Clients failing, retring, failing - Ugh!
From: "Thomas, Calvin" <calvin.thomas AT NACALOGISTICS DOT COM>
To: NETWORKER AT LISTMAIL.TEMPLE DOT EDU
Date: Tue, 18 Feb 2003 12:05:12 -0800
I don't know much about running behind a firewall, but I do understand that
the backup sessions use different ports. Obviously the firewalls are
blocking some of the critical ports on these two machines. I would examine
the firewall port logs with a fine-tooth comb, find the entries where
Networker is being blocked, and add those port/client combinations to be
allowed through.  After all, you said yourself that everything was fine
until you installed the firewalls.
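
As a rough cross-check on what the firewall logs show, something like the
following Python sketch could be run from one side of a firewall against the
other to see which TCP ports actually accept a connection. The function name
check_ports is made up for illustration, and the port list is whatever you
believe Networker needs at your site (7937 and 7938 are the commonly cited
nsrexecd ports, but treat that as an assumption and verify it against your
release's documentation). It only tests whether a connection can be opened,
not whether a long-running session survives.

```python
import socket


def check_ports(host, ports, timeout=3.0):
    """Try a TCP connect to each port; return {port: True if it connected}."""
    results = {}
    for port in ports:
        try:
            # create_connection raises OSError (refused, timed out,
            # unreachable, name lookup failure) when the connect fails.
            with socket.create_connection((host, port), timeout=timeout):
                results[port] = True
        except OSError:
            results[port] = False
    return results
```

For example, check_ports("clarinet", [7937, 7938]) run on the server would
return a dictionary showing which of those two ports answered. A False only
says the connect failed; the firewall log still has to confirm who dropped it.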

Calvin Thomas
UNIX System Administrator
NACA Logistics


-----Original Message-----
From: Gary Goldberg [mailto:og AT DIGIMARK DOT NET]
Sent: Tuesday, February 18, 2003 10:53 AM
To: NETWORKER AT LISTMAIL.TEMPLE DOT EDU
Subject: [Networker] Clients failing, retring, failing - Ugh!


Hello. I'm at my wit's end, and Legato tech support has been of no help
at all. I appreciate the help of anyone who wants to help me tackle
this. Some of you have seen parts of this before -- I've found some
additional info which may help understand the problem.

I'm running a Networker server (6.11 Build 238) on Windows NT 4 SP6a,
with 13 clients, all installed into the Default group and pool. The
group is set to Verbose, to run images on Sunday at 1:30AM and
incrementals the other six days of the week. The backups are run to
an AIT-1 30 tape Treefrog jukebox with two drives.

There has been an ongoing problem with this system for going on two
years now, where the nsrserverhost, its indices, and the indices of
all the clients fail with an Unknown error 0x93. I don't think it
is relevant to this problem, but I add it to this list in case I am
wrong.

----

Everything was fine until I started to deploy some SonicWALL SOHO2
and SOHO3 firewalls in front of several servers. Each of the
machines I operate is run on behalf of a different client, so
the company LAN is more like a loose affiliation. Tweaking the
SonicWALL rules in each firewall and setting verbose mode in the
default group in Networker allows the machines behind two of the
firewalls to back up, although they perform considerably slower.

Two of the clients, each behind their own firewall, continue to
fail. One is a RedHat Linux 8 Dell PE1650 called "mystic"; the
other is an UltraSPARC IIi at 440 MHz running Solaris 8 called
"clarinet". Both have plenty of memory and disk space. Each machine
has a / and a /home partition. I've set Client Retries to 1 and
Inactivity Timeout to 720 in the default group profile.

---

Sunday/Monday I ran the image backups. All the other clients ran
fine, but these two failed. I restarted the group and had the same
behavior. Amazingly, mystic:/ saved successfully (first time in
weeks). The other three file systems failed miserably. Here is the
monitor log output: (paraphrased)

11:11PM Default running on clarinet, mystic   (this message appears
        every thirty seconds or so, because of the Verbose setting
        on the group. So in between all the other messages is this
        one.)

11:15   clarinet:/home done     967MB
11:15   clarinet:/home saving to pool Default
11:17   mystic:/home done       779MB
11:48   clarinet:/ done         1031MB
11:48   clarinet:/ saving to pool Default
12:34AM clarinet:/home done     963MB  (You'd expect this to be
                                       slightly smaller, after
                                       midnight some logs have
                                       rolled and compressed.)
01:01   clarinet:/ done         1031MB
01:02   media info: verification of volume, volid, succeeded.
01:02   Write completion notice: writing to volume completed.
 (still issuing "Default running on clarinet, mystic" msgs.)
01:16   mystic:/home saving to pool Default
01:57   mystic:/home done       772MB
01:58   media info: verification of volume, volid, succeeded.
01:58   Write completion notice: writing to volume completed.
02:44   Default running on mystic  (clarinet stopped being in this
        list.)
04:00   Default running on (nsrserverhost)
04:01   Default completed.

Important points:

1. The "Default running on clarinet, mystic" message repeats in between
   all those lines above.
2. For each failed backup, it backs up the entire filesystem, but
   then "something" fails to register that the filesystem has
   completed, the backup times out, retries (because of Client
   Retries), does the same exact thing, then finally aborts
   after some timeout. The tapes are filled with "aborted" savesets
   from these clients each night.
3. Despite the full run of each filesystem twice, the three filesystems
   are marked "aborted/failed" and are unavailable.

mystic's daemon log has these entries (tail):

02/14/03 05:27:22 nsrexecd: failed to write NUL handshake on 5: errno 32, Broken pipe
02/14/03 05:52:56 nsrexecd: failed to write NUL handshake on 5: errno 32, Broken pipe
02/15/03 03:03:26 nsrexecd: failed to write NUL handshake on 5: errno 32, Broken pipe
02/15/03 03:25:33 nsrexecd: failed to write NUL handshake on 5: errno 32, Broken pipe
02/15/03 05:45:58 nsrexecd: failed to write NUL handshake on 5: errno 32, Broken pipe
02/15/03 05:56:16 nsrexecd: failed to write NUL handshake on 5: errno 32, Broken pipe
02/17/03 07:15:37 nsrexecd: failed to write NUL handshake on 5: errno 32, Broken pipe
02/17/03 13:48:45 nsrexecd: failed to write NUL handshake on 5: errno 32, Broken pipe
02/17/03 23:08:20 nsrexecd: failed to write NUL handshake on 5: errno 32, Broken pipe
02/18/03 01:51:32 nsrexecd: failed to write NUL handshake on 5: errno 32, Broken pipe

clarinet's daemon.log has these entries (tail):

02/09/03 15:44:58 nsrexecd: failed to write NUL handshake on 6: errno 32, Broken pipe
02/10/03 01:40:56 nsrexecd: failed to write NUL handshake on 6: errno 32, Broken pipe
02/11/03 01:44:06 nsrexecd: failed to write NUL handshake on 6: errno 32, Broken pipe
02/16/03 02:17:22 nsrexecd: failed to write NUL handshake on 6: errno 32, Broken pipe
02/17/03 11:16:49 nsrexecd: failed to write NUL handshake on 6: errno 32, Broken pipe
02/18/03 00:29:00 nsrexecd: failed to write NUL handshake on 6: errno 32, Broken pipe
02/18/03 13:41:59 nsrexecd: failed to write NUL handshake on 6: errno 32, Broken pipe

4. Is there a way to mark aborted backups as good?
5. Is there something about the timing between retries that might tell us
something?
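
On the timing question, one way to look for a pattern is to pull the
timestamps of the "failed to write NUL handshake" entries out of the daemon
log and print the gap between consecutive failures. The Python below is only
a sketch: the helper names failure_times and gaps are invented here, the
timestamp format is assumed to be month/day/two-digit-year as in the excerpts
above, and the sample is the last two mystic entries. It skips wrapped
continuation lines such as "Broken pipe", so it works whether or not the log
lines have been folded.

```python
import re
from datetime import datetime

# Matches the timestamp at the start of a handshake-failure line.
STAMP = re.compile(
    r"^(\d{2}/\d{2}/\d{2} \d{2}:\d{2}:\d{2}) nsrexecd: "
    r"failed to write NUL handshake"
)


def failure_times(log_text):
    """Return the datetime of every handshake-failure entry, in file order."""
    times = []
    for line in log_text.splitlines():
        match = STAMP.match(line.strip())
        if match:
            times.append(datetime.strptime(match.group(1), "%m/%d/%y %H:%M:%S"))
    return times


def gaps(times):
    """Return the timedelta between each pair of consecutive failures."""
    return [later - earlier for earlier, later in zip(times, times[1:])]


SAMPLE = """\
02/17/03 23:08:20 nsrexecd: failed to write NUL handshake on 5: errno 32,
Broken pipe
02/18/03 01:51:32 nsrexecd: failed to write NUL handshake on 5: errno 32,
Broken pipe
"""

for gap in gaps(failure_times(SAMPLE)):
    print(gap)  # prints 2:43:12 for the two sample entries
```

Comparing those gaps with the Inactivity Timeout and with the start times in
the monitor log might show whether the failures track a fixed interval.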

The firewall is set as loosely as possible, with known-good rules allowing
Networker traffic between client and nsrserverhost.

---

I have no idea what to do. Thank you very much. -Gary

-- "We don't see things as they are, we see them as we are." - Anais Nin
Gary Goldberg KA3ZYW <og AT digimark DOT net> V:301/249-6501 F:301/390-1955
AIM:OgGreeb
Digital Marketing/Bowie MD/Systems & Networks Consult
<http://www.digimark.net/>

--
Note: To sign off this list, send a "signoff networker" command via email
to listserv AT listmail.temple DOT edu or visit the list's Web site at
http://listmail.temple.edu/archives/networker.html where you can
also view and post messages to the list.
=*=*=*=*=*=*=*=*=*=*=*=*=*=*=*=*=*=*=*=*=*=*=*=*=*=*=*=*=*=*=*=