[Networker] Hanging backup job through a firewall ... ?

    ------------------------------------------------------
    Standard disclaimer: 
    Yes, I know Networker 7.2 is old (and so is Solaris 8)
    and believe me we're trying to move on, but a highly-
    interdependent environment is slow to move. Next week, 
    we move to Solaris 10 and will get to 7.5.x ASAP, but 
    we have dependent Solaris 8 clients out there that 
    are holding us back. Plus, we're updating the entire 
    infrastructure later this year...  'nuff said.  :-)
    ------------------------------------------------------


So, as you probably guessed, we're running Sun-badged "EBS" 7.2 on a 
Solaris 8 SPARC server, writing to SDLT 320. Since we moved a specific 
Windows XP client behind a Cisco firewall, we've seen strange behavior 
with one saveset (D:\): the saveset itself finishes backing up, but the 
group/job never completes. This happens only for full backups. The 
firewall rules allow two-way communication between the server and client 
on TCP and UDP ports 7937-9936. 
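
One rough way to sanity-check that rule from the server side is a plain TCP 
connect test across the range. The sketch below is only an illustration: 
the function names are made up here, "client1" is the placeholder name from 
the logs, and it says nothing about UDP. A "refused" result just means 
nothing is listening on that port right now (the packet still got through), 
while "filtered" (a timeout) is the result that would suggest something is 
silently dropping traffic.

```python
import socket

# Port range taken from the firewall rule described above (7937-9936).
NSR_PORT_RANGE = range(7937, 9936 + 1)


def classify_port(host, port, timeout=1.0):
    """Try one TCP connect and report what happened.

    "open"     - something accepted the connection
    "refused"  - the host answered, but nothing is listening on the port
    "filtered" - no answer before the timeout (possibly dropped en route)
    "error"    - any other failure, e.g. the name did not resolve
    """
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return "open"
    except ConnectionRefusedError:
        return "refused"
    except socket.timeout:
        return "filtered"
    except OSError:
        return "error"


def scan_ports(host, ports, timeout=1.0):
    """Return a {port: status} dict for every port in `ports`."""
    return {port: classify_port(host, port, timeout) for port in ports}
```

Something like scan_ports("client1", NSR_PORT_RANGE, timeout=0.5) run on the 
server would then show whether any port in the range comes back "filtered". 
The range is scanned sequentially, so 2000 ports at a 0.5 s timeout can 
take a while if many of them are dropped.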


Here's what I've seen :

Mon-Thu, scheduled  Level 5   C:\ 10 MB    Completes normally
                    Level 5   D:\ 15 GB    Completes normally

Friday, scheduled   Full      C:\ 6 GB     Completes normally
                    Full      D:\ 48 GB    Saveset finishes *

Test run, manual    Level 5   D:\ 68 MB    Completed normally
Test run, manual    Full      D:\ 48 GB    Saveset finished *

* but not the job!


As I watch the group, D:\ finishes :

    04/14/10 11:29:37 nsrd: client1:D:\ done saving to pool 'pool1' (001828) 48 GB

...but the index never gets saved and the group just sits there.

Looking on the server, the only two related processes I see are :

    root  3346  3338  0 10:13:08 ?        0:00 /usr/sbin/nsr/nsrexec -c client1 -a  -- client1:D:\
    root  3338   557  0 10:13:05 ?        0:00 /usr/sbin/nsr/savegrp missed

During this time, our firewall guy didn't see anything hitting the 
firewall from the client, though.

Finally, this appears in daemon.log and savegrp tries again :

    04/14/10 12:21:20 savegrp: client1:D:\ unexpectedly exited.
    * client1:D:\ Cannot determine status of the backup process.  Use mminfo to determine job status.
    04/14/10 12:21:20 savegrp: client1:D:\ will retry 5 more time(s)
    04/14/10 12:21:21 nsrd: client1:D:\ saving to pool 'pool1' (001826)

When I finally kill the group, I get the same kind of message :

    * client1:D:\ 5 retries attempted
    * client1:D:\ Cannot determine status of the backup process.  Use mminfo to determine job status.

mminfo seems to indicate the backup is OK; I can indeed browse and recover 
from it. But somehow, Networker never seems to learn that the saveset has 
actually finished, so it never backs up the index and closes the group/job. 
In fact, given a scheduled full backup last Friday night, two manually 
attempted fulls and other tests since then, we now have 17+ copies of this 
saveset! It *appears* to be related to the backup level/size, but it may be 
a timeout or other secondary issue. At this point, I have no idea.
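
To put numbers on the timeout idea, here is a small sketch that works out 
the intervals from the timestamps above. It assumes the savegrp start time 
shown by ps (10:13:05) falls on the same day as the daemon.log entries; the 
variable names are just for illustration. The resulting durations could 
then be compared against whatever idle-connection timeout the firewall 
enforces, which is a value I don't have.

```python
from datetime import datetime

# Timestamp layout used in the daemon.log excerpts, e.g. "04/14/10 11:29:37".
LOG_FORMAT = "%m/%d/%y %H:%M:%S"


def elapsed(start, end):
    """Return the timedelta between two daemon.log-style timestamps."""
    return datetime.strptime(end, LOG_FORMAT) - datetime.strptime(start, LOG_FORMAT)


# Assumption: the ps start time for savegrp is on the same date as the log.
group_started = "04/14/10 10:13:05"   # savegrp STIME from the ps listing
save_done = "04/14/10 11:29:37"       # "done saving to pool 'pool1'"
savegrp_gave_up = "04/14/10 12:21:20" # "unexpectedly exited"

print("time spent saving D:\\      :", elapsed(group_started, save_done))
print("wait until savegrp gave up :", elapsed(save_done, savegrp_gave_up))
print("total for this attempt     :", elapsed(group_started, savegrp_gave_up))
```

The full save runs for well over an hour, whereas the Level 5 runs that 
complete normally are far smaller, so a timeout that only shows up on long 
saves would at least be consistent with the pattern in the table.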

Is there something we missed? Another port (range), maybe?

Given this just started happening when the client was physically moved and 
firewalled-off, it's pretty hard to ignore the coincidence of it. Help! 
:-)

Thanks!
