Re: [Networker] Clients failing, retring, failing - Ugh!

2003-02-18 16:07:58
Subject: Re: [Networker] Clients failing, retring, failing - Ugh!
From: Gary Goldberg <og AT DIGIMARK DOT NET>
To: NETWORKER AT LISTMAIL.TEMPLE DOT EDU
Date: Tue, 18 Feb 2003 16:03:15 -0500
Thank you for the advice. I'm currently allowing ports 7937-8764 and
10001-30000 on UDP and TCP between the nsrserverhost and the clients.
These ranges are from the Legato firewalls white paper. I've examined the
firewall logs (and re-examined them before replying), and nothing between
the nsrserverhost and the clients was recorded. Is it possible that a
port not listed is being used? I know each client generally has two
nsrexecd processes running -- one for Networker and one for its own
version of portmapper. Could something there be hanging that would
account for the failures? -Gary
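
As a rough way to test the "port not listed" question, here is a minimal Python sketch that checks a list of observed ports against the two ranges quoted above. Only the ranges come from this message; the function names and the sample port list are hypothetical, and the ports would have to be pulled from the firewall log by hand or by some other script.

```python
# Port ranges allowed between nsrserverhost and the clients,
# as stated in the message above (inclusive on both ends).
ALLOWED_RANGES = [(7937, 8764), (10001, 30000)]


def port_allowed(port, ranges=ALLOWED_RANGES):
    """Return True if port falls inside any (low, high) inclusive range."""
    return any(low <= port <= high for low, high in ranges)


def ports_outside_rules(ports, ranges=ALLOWED_RANGES):
    """Return the sorted, de-duplicated ports that no range covers."""
    return sorted({p for p in ports if not port_allowed(p, ranges)})


if __name__ == "__main__":
    # Hypothetical ports for illustration only; not taken from a real log.
    sample = [7937, 7938, 8764, 8765, 10001, 111, 30001]
    print(ports_outside_rules(sample))
```

Any port this reports is one the current rules would not pass, so it is a candidate to look for in the firewall's dropped-packet log.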

-- "We don't see things as they are, we see them as we are." - Anais Nin
Gary Goldberg KA3ZYW <og AT digimark DOT net> V:301/249-6501 F:301/390-1955 
AIM:OgGreeb
Digital Marketing/Bowie MD/Systems & Networks Consult <http://www.digimark.net/>

On Tue, 18 Feb 2003, Thomas, Calvin wrote:

> I don't know much about running behind a firewall, but I do understand that
> the backup sessions use different ports. Obviously the firewalls are
> blocking some of the critical ports on these two machines. I would examine
> the firewall port logs with a fine-tooth comb, find the entries where
> networker is being blocked, and add those port/client combinations to be
> allowed through.  After all, you said yourself that everything was fine
> until you installed the firewalls.
>
> Calvin Thomas
> UNIX System Administrator
> NACA Logistics
>
>
> -----Original Message-----
> From: Gary Goldberg [mailto:og AT DIGIMARK DOT NET]
> Sent: Tuesday, February 18, 2003 10:53 AM
> To: NETWORKER AT LISTMAIL.TEMPLE DOT EDU
> Subject: [Networker] Clients failing, retring, failing - Ugh!
>
>
> Hello. I'm at my wit's end, and Legato tech support has been of no help
> at all. I appreciate the help of anyone who wants to help me tackle
> this. Some of you have seen parts of this before -- I've found some
> additional info which may help understand the problem.
>
> I'm running a Networker server (6.11 Build 238) on Windows NT 4 SP6a,
> with 13 clients, all installed into the Default group and pool. The
> group is set to Verbose, to run images on Sunday at 1:30AM and
> incrementals the other six days of the week. The backups are run to
> an AIT-1 30-tape Treefrog jukebox with two drives.
>
> There has been an ongoing problem with this system for going on two
> years now, where the nsrserverhost, its indices, and the indices of
> all the clients fail with an Unknown error 0x93. I don't think it
> is relevant to this problem but I add it to this list in case I am
> wrong.
>
> ----
>
> Everything was fine until I started to deploy some SonicWALL SOHO2
> and SOHO3 firewalls in front of several servers. The machines I
> operate are each run on behalf of different clients, so the company
> LAN is more like a loose affiliation. Tweaking the SonicWALL rules
> in each firewall and setting verbose mode in the Default group in
> Networker allows the machines behind two of the firewalls to back
> up, although considerably more slowly.
>
> Two of the clients, each behind their own firewall, continue to
> fail. One is a Red Hat Linux 8 Dell PE1650 called "mystic", the
> other is an UltraSPARC IIi at 440MHz running Solaris 8 called
> "clarinet". Both have plenty of memory and disk space. Each machine
> has a / and a /home partition. I've set Client Retries to 1,
> Inactivity Timeout to 720 in the default group profile.
>
> ---
>
> Sunday/Monday I ran the image backups. All the other clients ran
> fine, but these two failed. I restarted the group and had the same
> behavior. Amazingly, mystic:/ saved successfully (first time in
> weeks). The other three file systems failed miserably. Here is the
> monitor log output: (paraphrased)
>
> 11:11PM Default running on clarinet, mystic   (this message appears
>         every thirty seconds or so, because of the Verbose setting
>         on the group. So in-between all the other messages is this
>         one.)
>
> 11:15   clarinet:/home done     967MB
> 11:15   clarinet:/home saving to pool Default
> 11:17   mystic:/home done       779MB
> 11:48   clarinet:/ done         1031MB
> 11:48   clarinet:/ saving to pool Default
> 12:34AM clarinet:/home done     963MB  (You'd expect this to be
>                                        slightly smaller, after
>                                        midnight some logs have
>                                        rolled and compressed.)
> 01:01   clarinet:/ done         1031MB
> 01:02   media info: verification of volume, volid, succeeded.
> 01:02   Write completion notice: writing to volume completed.
>  (still issuing "Default running on clarinet, mystic" msgs.)
> 01:16   mystic:/home saving to pool Default
> 01:57   mystic:/home done       772MB
> 01:58   media info: verification of volume, volid, succeeded.
> 01:58   Write completion notice: writing to volume completed.
> 02:44   Default running on mystic  (clarinet stopped being in this
>         list.)
> 04:00   Default running on (nsrserverhost)
> 04:01   Default completed.
>
> Important points:
>
> 1. The "Default running on clarinet, mystic" message repeats in
>    between all those lines above.
> 2. For each failed backup, it backs up the entire filesystem, but
>    then "something" fails to register that the filesystem has
>    completed, the backup times out, retries (because of Client
>    Retries), does the same exact thing, then finally aborts
>    after some timeout. The tapes are filled with "aborted" savesets
>    from these clients each night.
> 3. Despite the full run of each filesystem twice, the three filesystems
>    are marked "aborted/failed" and are unavailable.
>
> mystic's daemon log has these entries (tail):
>
> 02/14/03 05:27:22 nsrexecd: failed to write NUL handshake on 5: errno 32, Broken pipe
> 02/14/03 05:52:56 nsrexecd: failed to write NUL handshake on 5: errno 32, Broken pipe
> 02/15/03 03:03:26 nsrexecd: failed to write NUL handshake on 5: errno 32, Broken pipe
> 02/15/03 03:25:33 nsrexecd: failed to write NUL handshake on 5: errno 32, Broken pipe
> 02/15/03 05:45:58 nsrexecd: failed to write NUL handshake on 5: errno 32, Broken pipe
> 02/15/03 05:56:16 nsrexecd: failed to write NUL handshake on 5: errno 32, Broken pipe
> 02/17/03 07:15:37 nsrexecd: failed to write NUL handshake on 5: errno 32, Broken pipe
> 02/17/03 13:48:45 nsrexecd: failed to write NUL handshake on 5: errno 32, Broken pipe
> 02/17/03 23:08:20 nsrexecd: failed to write NUL handshake on 5: errno 32, Broken pipe
> 02/18/03 01:51:32 nsrexecd: failed to write NUL handshake on 5: errno 32, Broken pipe
>
> clarinet's daemon.log has these entries (tail):
>
> 02/09/03 15:44:58 nsrexecd: failed to write NUL handshake on 6: errno 32, Broken pipe
> 02/10/03 01:40:56 nsrexecd: failed to write NUL handshake on 6: errno 32, Broken pipe
> 02/11/03 01:44:06 nsrexecd: failed to write NUL handshake on 6: errno 32, Broken pipe
> 02/16/03 02:17:22 nsrexecd: failed to write NUL handshake on 6: errno 32, Broken pipe
> 02/17/03 11:16:49 nsrexecd: failed to write NUL handshake on 6: errno 32, Broken pipe
> 02/18/03 00:29:00 nsrexecd: failed to write NUL handshake on 6: errno 32, Broken pipe
> 02/18/03 13:41:59 nsrexecd: failed to write NUL handshake on 6: errno 32, Broken pipe
>
> 3. Is there a way to mark aborted backups as good?
> 4. Is there something about the timing between retries that might tell us
> something?
>
> The firewall is set as loosely as possible, with known good rules
> allowing Networker between client and nsrserverhost.
>
> ---
>
> I have no idea what to do. Thank you very much. -Gary
>
> -- "We don't see things as they are, we see them as we are." - Anais Nin
> Gary Goldberg KA3ZYW <og AT digimark DOT net> V:301/249-6501 F:301/390-1955
> AIM:OgGreeb
> Digital Marketing/Bowie MD/Systems & Networks Consult
> <http://www.digimark.net/>
>
> --
> Note: To sign off this list, send a "signoff networker" command via email
> to listserv AT listmail.temple DOT edu or visit the list's Web site at
> http://listmail.temple.edu/archives/networker.html where you can
> also view and post messages to the list.
> =*=*=*=*=*=*=*=*=*=*=*=*=*=*=*=*=*=*=*=*=*=*=*=*=*=*=*=*=*=*=*=
>
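
On question 4 in the quoted message (whether the timing between retries tells us anything), one way to look at it is to pull the timestamps out of the nsrexecd daemon log and list the gaps between consecutive "failed to write NUL handshake" entries. The following is only a sketch under the assumption that every entry starts with an MM/DD/YY HH:MM:SS stamp as in the excerpts above; the function names are hypothetical, and the sample uses three lines copied from mystic's log.

```python
import re
from datetime import datetime

# Matches the timestamp at the start of each handshake-failure entry.
# The pattern is not anchored to the start of the line, so it also
# works on text that still carries "> " quote prefixes.
STAMP = re.compile(
    r"(\d{2}/\d{2}/\d{2} \d{2}:\d{2}:\d{2}) nsrexecd: "
    r"failed to write NUL handshake"
)


def handshake_failures(log_text):
    """Return the timestamp of every handshake-failure entry, in log order."""
    return [
        datetime.strptime(m.group(1), "%m/%d/%y %H:%M:%S")
        for m in STAMP.finditer(log_text)
    ]


def gaps_in_seconds(stamps):
    """Return the gap in seconds between each consecutive pair of timestamps."""
    return [(b - a).total_seconds() for a, b in zip(stamps, stamps[1:])]


if __name__ == "__main__":
    # Three entries copied from mystic's daemon log in the quoted message.
    sample = (
        "02/14/03 05:27:22 nsrexecd: failed to write NUL handshake on 5: errno 32, Broken pipe\n"
        "02/14/03 05:52:56 nsrexecd: failed to write NUL handshake on 5: errno 32, Broken pipe\n"
        "02/15/03 03:03:26 nsrexecd: failed to write NUL handshake on 5: errno 32, Broken pipe\n"
    )
    for gap in gaps_in_seconds(handshake_failures(sample)):
        print(f"{gap / 60:.1f} min")
```

If the short gaps cluster around one value, that value is worth comparing against the timeouts configured in Networker and on the firewall; the sketch only lists the gaps and does not say which timeout, if any, is responsible.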
