[Bacula-users] Bacula 5.0.2, backup works but errors out on the 'finishing touch' after a long time, network issue?

2010-05-18 12:41:13
From: Foo <bfoo33 AT yahoo.co DOT uk>
To: "bacula-users AT lists.sourceforge DOT net" <bacula-users AT lists.sourceforge DOT net>
Date: Tue, 18 May 2010 18:38:29 +0200
Hi,

I have a couple of W2K8 servers on a different subnet. The network config
is correct as far as I can see: routes/gateways are added on both subnets,
I can ping both ways, I can telnet into 9102 on the clients from the
director/sd machine and into 9103 on the sd/dir machine from the clients,
and status client works from bconsole.
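
For anyone who wants to repeat the telnet checks above without telnet, here
is a minimal sketch in Python. The function name port_open and the timeout
value are my own choices for illustration, not anything Bacula provides. It
only shows that a TCP connection can be opened, not that a long-lived idle
connection survives.

```python
import socket


def port_open(host, port, timeout=5.0):
    """Return True if a TCP connection to host:port can be established.

    This mirrors a manual 'telnet host port' check: it only proves that
    the port accepts new connections at this moment.
    """
    try:
        # create_connection resolves the host and performs the TCP handshake.
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers refused connections, timeouts, and name resolution failures.
        return False
```

Run from the director/sd machine, port_open("HOSTNAME1", 9102) corresponds
to the telnet check against the client, and from a client,
port_open("DIRHOSTNAME", 9103) corresponds to the check against the sd
(the host names are the placeholders used in this post).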

The backup commences and the volume files start getting written, but
bconsole only reports up to the following lines:

18-May 16:38 DIRHOSTNAME-sd JobId 32487: Job write elapsed time = 00:37:39, Transfer rate = 5.191 M Bytes/second
18-May 16:40 DIRHOSTNAME-sd JobId 32486: Job write elapsed time = 00:39:38, Transfer rate = 5.241 M Bytes/second

Normally you get a bunch of VSS lines after that and the summary with an
OK. The /var/bacula/working/log file does not contain the above two lines,
only a bunch of the intermediate failures on junction points. In fact it
stops in mid-line at some point (the first other line continues right
there without a newline in between; other director output continues fine
afterwards).

status dir reports:

Running Jobs:
Console connected at 18-May-10 17:41
  JobId Level   Name                       Status
======================================================================
  32486 Full    HOSTNAME1.2010-05-18_16.00.56_18 is running
  32487 Full    HOSTNAME2.2010-05-18_16.01.03_19 is running

The resource monitor on the hosts does not report any network activity
(i.e. an open connection) to the sd/dir, except when I do a status client
on them (which works). It also seems like the (5.0.2) client thinks it has
successfully finished the job:

*st client=HOSTNAME1-fd
Connecting to Client HOSTNAME1-fd at 1.2.3.4:9102

HOSTNAME1-fd Version: 5.0.2 (28 April 2010)  VSS Linux Cross-compile Win64
Daemon started 18-May-10 15:54, 1 Job run since started.
  Heap: heap=0 smbytes=131,202 max_bytes=292,179 bufs=89 max_bufs=274
  Sizeof: boffset_t=8 size_t=8 debug=0 trace=1

Running Jobs:
Director connected at: 18-May-10 17:45
No Jobs running.
====

Terminated Jobs:
  JobId  Level    Files      Bytes   Status   Finished        Name
======================================================================
  32486  Full     86,470    12.44 G  OK       18-May-10 16:40 HOSTNAME1
====
*

HOSTNAME2 produces similar output.

Somewhat later they error out:

18-May 18:04 DIRHOSTNAME-dir JobId 32487: Fatal error: Network error with FD during Backup: ERR=Connection reset by peer
18-May 18:04 DIRHOSTNAME-dir JobId 32487: Fatal error: No Job status returned from FD.
18-May 18:04 DIRHOSTNAME-dir JobId 32487: Error: Bacula DIRHOSTNAME-dir 5.0.2 (28Apr10): 18-May-2010 18:04:13
   Build OS:               i686-pc-linux-gnu debian 5.0.4
   JobId:                  32487
   Job:                    HOSTNAME2.2010-05-18_16.01.03_19
   Backup Level:           Full (upgraded from Incremental)
   Client:                 "HOSTNAME2-fd" 5.0.2 (28Apr10) Linux,Cross-compile,Win64
   FileSet:                "Windows HOSTNAME2 set" 2010-05-18 16:01:03
   Pool:                   "Pool_HOSTNAME2" (From Job resource)
   Catalog:                "MyCatalog" (From Client resource)
   Storage:                "HOSTNAME2_storage" (From Job resource)
   Scheduled time:         18-May-2010 16:01:01
   Start time:             18-May-2010 16:01:05
   End time:               18-May-2010 18:04:13
   Elapsed time:           2 hours 3 mins 8 secs
   Priority:               10
   FD Files Written:       0
   SD Files Written:       85,253
   FD Bytes Written:       0 (0 B)
   SD Bytes Written:       11,726,994,434 (11.72 GB)
   Rate:                   0.0 KB/s
   Software Compression:   None
   VSS:                    no
   Encryption:             no
   Accurate:               no
   Volume name(s):         Vol_HOSTNAME2_0001
   Volume Session Id:      9
   Volume Session Time:    1274189949
   Last Volume Bytes:      11,738,369,181 (11.73 GB)
   Non-fatal FD errors:    0
   SD Errors:              0
   FD termination status:  Error
   SD termination status:  OK
   Termination:            *** Backup Error ***

The same happens for HOSTNAME1. Interestingly, its failure came right after
HOSTNAME2's (the order apparently reversed only due to timing), so both
jobs fail at practically the same moment (18:04):

18-May 18:04 DIRHOSTNAME-dir JobId 32486: Fatal error: Network error with FD during Backup: ERR=Connection reset by peer
18-May 18:04 DIRHOSTNAME-dir JobId 32486: Fatal error: No Job status returned from FD.
18-May 18:04 DIRHOSTNAME-dir JobId 32486: Error: Bacula DIRHOSTNAME-dir 5.0.2 (28Apr10): 18-May-2010 18:04:31
   Build OS:               i686-pc-linux-gnu debian 5.0.4
   JobId:                  32486
   Job:                    HOSTNAME1.2010-05-18_16.00.56_18
   Backup Level:           Full (upgraded from Incremental)
   Client:                 "HOSTNAME1-fd" 5.0.2 (28Apr10) Linux,Cross-compile,Win64
   FileSet:                "Windows HOSTNAME1 set" 2010-05-18 16:00:56
   Pool:                   "Pool_HOSTNAME1" (From Job resource)
   Catalog:                "MyCatalog" (From Client resource)
   Storage:                "HOSTNAME1_storage" (From Job resource)
   Scheduled time:         18-May-2010 16:00:55
   Start time:             18-May-2010 16:00:58
   End time:               18-May-2010 18:04:31
   Elapsed time:           2 hours 3 mins 33 secs
   Priority:               10
   FD Files Written:       0
   SD Files Written:       86,470
   FD Bytes Written:       0 (0 B)
   SD Bytes Written:       12,464,036,220 (12.46 GB)
   Rate:                   0.0 KB/s
   Software Compression:   None
   VSS:                    no
   Encryption:             no
   Accurate:               no
   Volume name(s):         Vol_HOSTNAME1_0001
   Volume Session Id:      8
   Volume Session Time:    1274189949
   Last Volume Bytes:      12,475,995,727 (12.47 GB)
   Non-fatal FD errors:    0
   SD Errors:              0
   FD termination status:  Error
   SD termination status:  OK
   Termination:            *** Backup Error ***

After that the director finally sees them as errored out instead of still
running (but the clients still report OK as their termination status).
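
To put numbers on the delay, here is a small Python sketch using the
HOSTNAME1 timestamps from the report above. The helper name elapsed is
made up for illustration, and since the sd line only shows 16:40 without
seconds, the ":00" is an assumption.

```python
from datetime import datetime

# Timestamp layout used in the job report, e.g. "18-May-2010 16:00:58".
FMT = "%d-%b-%Y %H:%M:%S"


def elapsed(start, end):
    """Return the timedelta between two report-style timestamps."""
    return datetime.strptime(end, FMT) - datetime.strptime(start, FMT)


# Values copied from the JobId 32486 report.
job_start = "18-May-2010 16:00:58"
job_end = "18-May-2010 18:04:31"
# The sd log line only says 16:40; the seconds are assumed to be :00.
sd_write_done = "18-May-2010 16:40:00"

total = elapsed(job_start, job_end)
wait_after_write = elapsed(sd_write_done, job_end)

print("job start to error:        ", total)
print("sd write done to error:    ", wait_after_write)
```

The first value matches the "Elapsed time" line in the report (2 hours 3
mins 33 secs). The second is only approximate because of the assumed
seconds, but it shows the director waited well over an hour after the sd
had finished writing.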

The /var/bacula/working/log file now contains the failure lines as well.
Again interestingly, it continues in mid-line where it left off before.

Is this a networking issue where some "I'm done" packet was lost or held
up? If so, does that message go to another port (I don't think so), or
does it use a special protocol or form, so that a specific network issue
could block it but not everything else?
