Networker

Re: [Networker] Hanging backup job through a firewall ... ?

2010-04-19 19:28:17
Subject: Re: [Networker] Hanging backup job through a firewall ... ?
From: "Small, Joshua" <joshua.small AT CITI DOT COM>
To: NETWORKER AT LISTSERV.TEMPLE DOT EDU
Date: Tue, 20 Apr 2010 07:08:54 +0800
Have you set NSR_KEEPALIVE, or a general TCP keepalive, on the server and client?
We had problems with Solaris machines backing up through firewalls, where NW's 
control/status connection between the client and server timed out during long 
backups. 

Our firewalls have a 30-minute TCP idle connection timeout. If a single saveset 
ran for longer than this, the firewall closed the control connection and the 
server lost track of the job. There doesn't appear to be any activity on the 
control connection between the start and end of the saveset job, so it looks 
idle to the firewall.
 
For Solaris, we set a 20-minute keepalive with:
ndd -set /dev/tcp tcp_keepalive_interval 1200000
For XP, I think it's done with 
HKEY_LOCAL_MACHINE\System\CurrentControlSet\Services\Tcpip\Parameters\KeepAliveTime
(DWORD 1200000) and KeepAliveInterval (DWORD 60000), both in milliseconds: 
after 20 minutes idle, send a keepalive, and retry every 60 seconds if it isn't 
answered.
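
As a rough sanity check on those numbers, here is a minimal Python sketch (the 
function names are made up for illustration, nothing here is part of 
NetWorker). The only point is that the keepalive idle time, which both 
operating systems take in milliseconds, has to be shorter than the firewall's 
idle timeout:

```python
def ms_to_minutes(ms):
    """Convert a millisecond setting to minutes."""
    return ms / 60000


def keepalive_beats_firewall(keepalive_ms, firewall_idle_minutes):
    """Return True if the first keepalive probe would be sent before the
    firewall's idle timer expires."""
    firewall_idle_ms = firewall_idle_minutes * 60 * 1000
    return keepalive_ms < firewall_idle_ms


if __name__ == "__main__":
    # 1200000 ms is the value we set; 30 minutes is our firewall's timeout.
    print(ms_to_minutes(1200000))                  # 20.0
    print(keepalive_beats_firewall(1200000, 30))   # True
    # 7200000 ms (2 hours) is the usual OS default, which is far too long here.
    print(keepalive_beats_firewall(7200000, 30))   # False
```

If your firewall's timeout is different from ours, pick a keepalive value 
comfortably below it.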

Also, NW 7.4 and 7.5 (and 7.3 too, I think) use the sunrpc port (111) by default 
for RPC communications.
If the firewall silently drops that port, there will be a number of delays per 
saveset while NW waits for the sunrpc connection to time out before falling back 
to NW's own RPC broker. 
The Solaris default timeout is around 4 or 5 minutes.
If your firewall rejects (as opposed to drops) port 111, the timeout shouldn't 
be a problem, as the connection is actively refused. 
Likewise, if your firewall allows port 111/tcp, there are no issues either.
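
If you want to see which of those three cases you are in, something like the 
following Python sketch can be run from the client towards the server (and the 
other way round). It is only an illustration: the function name and the 
5-second timeout are my own choices, and it tests plain TCP connects, not what 
NetWorker itself does.

```python
import socket


def probe_tcp_port(host, port, timeout=5.0):
    """Classify a TCP connect attempt as 'open', 'refused' or 'timeout'.

    'refused' means something answered with a reset (port closed or
    actively rejected); 'timeout' means nothing came back at all, which
    is what a firewall that silently drops packets looks like.
    """
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return "open"
    except ConnectionRefusedError:
        return "refused"
    except socket.timeout:
        return "timeout"
    except OSError as exc:
        # Anything else (no route, name lookup failure, ...).
        return "error: %s" % exc


if __name__ == "__main__":
    # "backupserver" is a placeholder host name.
    for port in (111, 7938):
        print(port, probe_tcp_port("backupserver", port))
```

A result of "timeout" on port 111 is the case that causes the per-saveset 
delays described above.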

With NW 7.5, the RPC behavior can be changed to use only TCP port 7938. (This 
can't be done for 7.4, though.)
On Solaris servers and clients, you can get NW to use 7938/tcp by either 
removing the sunrpc entry from /etc/services or adding an entry for nsrrpc, 
i.e. add the following to /etc/services and restart nsrexecd:
nsrrpc      7938/tcp
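
To check what a given machine currently has for those two names, a small 
Python sketch along these lines will do. It only reports what is in an 
/etc/services-style file; the function names are hypothetical, and it makes no 
claim about how NetWorker itself looks the entries up.

```python
def parse_services(text):
    """Parse /etc/services-style text into {name: [(port, proto), ...]}.

    Aliases listed after the port/proto field are indexed as well.
    """
    services = {}
    for line in text.splitlines():
        line = line.split("#", 1)[0].strip()   # drop comments
        if not line:
            continue
        fields = line.split()
        if len(fields) < 2 or "/" not in fields[1]:
            continue                           # not a "name port/proto" line
        port, proto = fields[1].split("/", 1)
        if not port.isdigit():
            continue
        for name in [fields[0]] + fields[2:]:
            services.setdefault(name, []).append((int(port), proto))
    return services


def rpc_port_summary(text):
    """Return the entries found for sunrpc and nsrrpc (empty list if none)."""
    services = parse_services(text)
    return {
        "sunrpc": services.get("sunrpc", []),
        "nsrrpc": services.get("nsrrpc", []),
    }


if __name__ == "__main__":
    with open("/etc/services") as fh:
        print(rpc_port_summary(fh.read()))
```

After adding the nsrrpc line, the summary should show (7938, 'tcp') for nsrrpc.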

We stumbled across this one after upgrading a small datazone in a firewalled 
environment from 7.2 to 7.4, and couldn't work out why each client was taking 
so much longer to start each backup job. Switching to 7.5 and the nsrrpc port 
fixed it.  

HTH,
Josh
 
  

-----Original Message-----
From: EMC NetWorker discussion [mailto:NETWORKER AT LISTSERV.TEMPLE DOT EDU] On Behalf Of kel AT STERIA DOT DK
Sent: Tuesday, 20 April 2010 6:25 AM
To: NETWORKER AT LISTSERV.TEMPLE DOT EDU
Subject: Re: [Networker] Hanging backup job through a firewall ... ?

A Cisco firewall?

Try increasing the timeout in the firewall to, say, 12 hours.

We had issues like that too, until we increased the timeout in the firewall. 
It's not an inactivity timeout but a connection timeout. Nothing we did on the 
client or server helped: the firewall cut the connection as soon as the timeout 
period was up, no questions asked, even if a backup was in full swing.

**************************************************
Kind regards
Kenneth Larsen
Systems Consultant

Steria A/S
Tonsbakken 16-18
2740 Skovlunde
Mobile: +45 2630 6261
email: kel AT steria DOT dk
www.steria.dk
**************************************************



From:   Stephanie Finnegan <sfinnega AT AIP DOT ORG>
To:     NETWORKER AT LISTSERV.TEMPLE DOT EDU
Date:   14-04-2010 22:26
Subject:        Re: [Networker] Hanging backup job through a firewall ... ?
Sent by:        EMC NetWorker discussion <NETWORKER AT LISTSERV.TEMPLE DOT EDU>



Try running the full with the retries set to zero to get more information about 
the "failure".

We have randomly had similar issues across all platforms: the backup appears to 
complete, then suddenly it restarts, and only on the fulls. Same "cannot 
determine status" error(s). Once we set the retries to zero, we got a 
different, more descriptive (but unfortunately no more helpful) error message. 
That might work for you, and at least you can get an idea of what's causing the 
restart.

In our case, we're on 7.4.4.4 on Solaris 10. We've had this happen on Windows 
server clients, Solaris 10 clients, and a NetWare client. We've had a case 
open with EMC for months on one client (the error is "nsr_end: bad file 
number"), with no resolution as yet.

 

-----Original Message-----
From: EMC NetWorker discussion [mailto:NETWORKER AT LISTSERV.TEMPLE DOT EDU] On Behalf Of Len Philpot
Sent: Wednesday, April 14, 2010 3:02 PM
To: NETWORKER AT LISTSERV.TEMPLE DOT EDU
Subject: [Networker] Hanging backup job through a firewall ... ?

    ------------------------------------------------------
    Standard disclaimer: 
    Yes, I know Networker 7.2 is old (and so is Solaris 8)
    and believe me we're trying to move on, but a highly-
    interdependent environment is slow to move. Next week, 
    we move to Solaris 10 and will get to 7.5.x ASAP, but 
    we have dependent Solaris 8 clients out there that 
    are holding us back. Plus, we're updating the entire 
    infrastructure later this year...  'nuff said.  :-)
    ------------------------------------------------------


So, as you probably guessed, we're running Sun-badged "EBS" 7.2 on a Solaris 8 
SPARC server, writing to SDLT 320. Since we moved a specific Windows XP 
client behind a Cisco firewall, we've seen strange behavior with one 
saveset (D:\) that actually finishes backing up, but the group/job never 
completes for full (only) backups. The firewall rules allow two-way 
communication between the server and client, via TCP and UDP ports 
7937-9936. 


Here's what I've seen :

Mon-Thu, scheduled  Level 5   C:\ 10 MB    Completes normally
                    Level 5   D:\ 15 GB    Completes normally

Friday, scheduled   Full      C:\ 6 GB     Completes normally
                    Full      D:\ 48 GB    Saveset finishes *

Test run, manual    Level 5   D:\ 68 MB    Completed normally
Test run, manual    Full      D:\ 48 GB    Saveset finished *

* but not the job!


As I watch the group, D:\ finishes :

    04/14/10 11:29:37 nsrd: client1:D:\ done saving to pool 'pool1' (001828) 48 GB

...but the index never saves and it just sits there :

Looking on the server, the only two related processes I see are :

    root  3346  3338  0 10:13:08 ?        0:00 /usr/sbin/nsr/nsrexec -c client1 -a  -- client1:D:\
    root  3338   557  0 10:13:05 ?        0:00 /usr/sbin/nsr/savegrp missed

During this time, our firewall guy didn't see anything hitting the 
firewall from the client, though.

Finally this appears in the daemon.log and it tries again :

    04/14/10 12:21:20 savegrp: client1:D:\ unexpectedly exited.
    * client1:D:\ Cannot determine status of the backup process.  Use mminfo to determine job status.
    04/14/10 12:21:20 savegrp: client1:D:\ will retry 5 more time(s)
    04/14/10 12:21:21 nsrd: client1:D:\ saving to pool 'pool1' (001826)
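
For what it's worth, the timings in the excerpts above work out as follows. 
This is just a small Python sketch of the arithmetic; it assumes the ps start 
time for savegrp (10:13:05) is on the same day as the log entries, and the 
names in it are my own.

```python
from datetime import datetime

FMT = "%m/%d/%y %H:%M:%S"


def elapsed(start, end):
    """Return end - start as a timedelta for two 'MM/DD/YY HH:MM:SS' stamps."""
    return datetime.strptime(end, FMT) - datetime.strptime(start, FMT)


# Times copied from the ps output and the daemon.log lines above.
SAVEGRP_START = "04/14/10 10:13:05"   # assumed date, time from ps STIME
SAVE_DONE = "04/14/10 11:29:37"       # "done saving to pool"
GAVE_UP = "04/14/10 12:21:20"         # "unexpectedly exited"

if __name__ == "__main__":
    print("saving took:      ", elapsed(SAVEGRP_START, SAVE_DONE))  # 1:16:32
    print("then sat idle for:", elapsed(SAVE_DONE, GAVE_UP))        # 0:51:43
```

So the save itself ran for about 76 minutes, and the group then sat for 
another 52 minutes or so before savegrp gave up. That would at least fit a 
timeout of some kind, though it proves nothing by itself.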

When I finally kill the group, I get the same kind of message :

    * client1:D:\ 5 retries attempted
    * client1:D:\ Cannot determine status of the backup process.  Use mminfo to determine job status.

mminfo seems to indicate the backup is OK; I can indeed browse and recover 
from it. But somehow, Networker never seems to know it's actually 
finished, so that it can back up the index and close the group/job. In fact, 
given a scheduled full backup last Friday night, two manual attempted fulls 
and other tests since then, we now have 17+ copies of this saveset! It 
*appears* to be related to the backup level/size, but it may be a timeout 
or other secondary issue. At this point, I have no idea.

Is there something we missed? Another port (range), maybe?

Given this just started happening when the client was physically moved and 
firewalled off, it's pretty hard to ignore the coincidence of it. Help! :-)

Thanks!

To sign off this list, send email to listserv AT listserv.temple DOT edu and 
type "signoff networker" in the body of the email. Please write to 
networker-request AT listserv.temple DOT edu if you have any problems with this 
list. You can access the archives at 
http://listserv.temple.edu/archives/networker.html or
via RSS at http://listserv.temple.edu/cgi-bin/wa?RSS&L=NETWORKER
