Hi,
I have a couple of Windows Server 2008 (W2K8) servers on a different subnet
than the director/sd. The network config is correct as far as I can see:
routes/gateways are added on both subnets, ping works both ways, I can
telnet to 9102 on the clients from the director/sd machine and to 9103 on
the sd/dir machine from the clients, and status client works from bconsole.
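For anyone who wants to repeat the port checks above without telnet, here is a minimal Python sketch of the same test. It only confirms that a TCP connection can be opened, nothing more. The function names are my own and the host names in the usage note are placeholders, not real addresses.

```python
import socket


def port_open(host: str, port: int, timeout: float = 3.0) -> bool:
    """Return True if a TCP connection to host:port can be established."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers refused connections, timeouts and name resolution failures.
        return False


def report(checks):
    """Return one 'host:port open|unreachable' line per (host, port) pair."""
    return [
        f"{host}:{port} {'open' if port_open(host, port) else 'unreachable'}"
        for host, port in checks
    ]
```

From the director/sd machine one would call something like `report([("HOSTNAME1", 9102), ("HOSTNAME2", 9102)])`, and from a client `report([("DIRHOSTNAME", 9103)])`. This mirrors the telnet tests only; it says nothing about what happens to a connection that has been idle for a long time.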
The backup commences and the volume files start getting written; bconsole,
however, only reports up to the following lines:
18-May 16:38 DIRHOSTNAME-sd JobId 32487: Job write elapsed time =
00:37:39, Transfer rate = 5.191 M Bytes/second
18-May 16:40 DIRHOSTNAME-sd JobId 32486: Job write elapsed time =
00:39:38, Transfer rate = 5.241 M Bytes/second
Normally you get a bunch of VSS lines after that and the summary with an
OK. The /var/bacula/working/log file does not contain the above two lines,
only a bunch of the intermediate failures on junction points; in fact, it
freezes mid-line at some point (the next unrelated line continues there
without a newline in between; other director output continues fine
afterwards).
status dir reports:
Running Jobs:
Console connected at 18-May-10 17:41
JobId Level Name Status
======================================================================
32486 Full HOSTNAME1.2010-05-18_16.00.56_18 is running
32487 Full HOSTNAME2.2010-05-18_16.01.03_19 is running
The resource monitor on the hosts does not report network activity (i.e.
an open connection) to the sd/dir, except when I do a status client on it
(which works), and it seems like the (5.0.2) client thinks it has
successfully finished the job:
*st client=HOSTNAME1-fd
Connecting to Client HOSTNAME1-fd at 1.2.3.4:9102
HOSTNAME1-fd Version: 5.0.2 (28 April 2010) VSS Linux Cross-compile Win64
Daemon started 18-May-10 15:54, 1 Job run since started.
Heap: heap=0 smbytes=131,202 max_bytes=292,179 bufs=89 max_bufs=274
Sizeof: boffset_t=8 size_t=8 debug=0 trace=1
Running Jobs:
Director connected at: 18-May-10 17:45
No Jobs running.
====
Terminated Jobs:
JobId Level Files Bytes Status Finished Name
======================================================================
32486 Full 86,470 12.44 G OK 18-May-10 16:40 HOSTNAME1
====
*
HOSTNAME2 produces similar output.
Somewhat later they error out:
18-May 18:04 DIRHOSTNAME-dir JobId 32487: Fatal error: Network error with
FD during Backup: ERR=Connection reset by peer
18-May 18:04 DIRHOSTNAME-dir JobId 32487: Fatal error: No Job status
returned from FD.
18-May 18:04 DIRHOSTNAME-dir JobId 32487: Error: Bacula DIRHOSTNAME-dir
5.0.2 (28Apr10): 18-May-2010 18:04:13
Build OS: i686-pc-linux-gnu debian 5.0.4
JobId: 32487
Job: HOSTNAME2.2010-05-18_16.01.03_19
Backup Level: Full (upgraded from Incremental)
Client: "HOSTNAME2-fd" 5.0.2 (28Apr10)
Linux,Cross-compile,Win64
FileSet: "Windows HOSTNAME2 set" 2010-05-18 16:01:03
Pool: "Pool_HOSTNAME2" (From Job resource)
Catalog: "MyCatalog" (From Client resource)
Storage: "HOSTNAME2_storage" (From Job resource)
Scheduled time: 18-May-2010 16:01:01
Start time: 18-May-2010 16:01:05
End time: 18-May-2010 18:04:13
Elapsed time: 2 hours 3 mins 8 secs
Priority: 10
FD Files Written: 0
SD Files Written: 85,253
FD Bytes Written: 0 (0 B)
SD Bytes Written: 11,726,994,434 (11.72 GB)
Rate: 0.0 KB/s
Software Compression: None
VSS: no
Encryption: no
Accurate: no
Volume name(s): Vol_HOSTNAME2_0001
Volume Session Id: 9
Volume Session Time: 1274189949
Last Volume Bytes: 11,738,369,181 (11.73 GB)
Non-fatal FD errors: 0
SD Errors: 0
FD termination status: Error
SD termination status: OK
Termination: *** Backup Error ***
Same for HOSTNAME1. Interestingly, its error came right after HOSTNAME2's
(the reversed order is apparently just timing), and both jobs fail within
the same minute (18:04):
18-May 18:04 DIRHOSTNAME-dir JobId 32486: Fatal error: Network error with
FD during Backup: ERR=Connection reset by peer
18-May 18:04 DIRHOSTNAME-dir JobId 32486: Fatal error: No Job status
returned from FD.
18-May 18:04 DIRHOSTNAME-dir JobId 32486: Error: Bacula DIRHOSTNAME-dir
5.0.2 (28Apr10): 18-May-2010 18:04:31
Build OS: i686-pc-linux-gnu debian 5.0.4
JobId: 32486
Job: HOSTNAME1.2010-05-18_16.00.56_18
Backup Level: Full (upgraded from Incremental)
Client: "HOSTNAME1-fd" 5.0.2 (28Apr10)
Linux,Cross-compile,Win64
FileSet: "Windows HOSTNAME1 set" 2010-05-18 16:00:56
Pool: "Pool_HOSTNAME1" (From Job resource)
Catalog: "MyCatalog" (From Client resource)
Storage: "HOSTNAME1_storage" (From Job resource)
Scheduled time: 18-May-2010 16:00:55
Start time: 18-May-2010 16:00:58
End time: 18-May-2010 18:04:31
Elapsed time: 2 hours 3 mins 33 secs
Priority: 10
FD Files Written: 0
SD Files Written: 86,470
FD Bytes Written: 0 (0 B)
SD Bytes Written: 12,464,036,220 (12.46 GB)
Rate: 0.0 KB/s
Software Compression: None
VSS: no
Encryption: no
Accurate: no
Volume name(s): Vol_HOSTNAME1_0001
Volume Session Id: 8
Volume Session Time: 1274189949
Last Volume Bytes: 12,475,995,727 (12.47 GB)
Non-fatal FD errors: 0
SD Errors: 0
FD termination status: Error
SD termination status: OK
Termination: *** Backup Error ***
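As a sanity check on the SD side, the figures in the two reports are consistent with the "Job write elapsed time" lines logged at 16:38 and 16:40: dividing SD Bytes Written by the elapsed time reproduces the logged transfer rates. A minimal sketch of that arithmetic, assuming "M Bytes" means 10^6 bytes (the helper names are my own):

```python
def hms_to_seconds(hms: str) -> int:
    """Convert an 'HH:MM:SS' string to a number of seconds."""
    h, m, s = (int(part) for part in hms.split(":"))
    return h * 3600 + m * 60 + s


def rate_mb_per_s(total_bytes: int, elapsed_hms: str) -> float:
    """Average rate in MB/s, taking 1 MB as 1,000,000 bytes (assumption)."""
    return total_bytes / hms_to_seconds(elapsed_hms) / 1_000_000


# JobId -> (SD Bytes Written from the job report, SD "write elapsed time")
jobs = {
    32486: (12_464_036_220, "00:39:38"),
    32487: (11_726_994_434, "00:37:39"),
}

for jobid, (nbytes, elapsed) in jobs.items():
    print(f"JobId {jobid}: {rate_mb_per_s(nbytes, elapsed):.3f} MB/s")
# JobId 32486: 5.241 MB/s
# JobId 32487: 5.191 MB/s
```

So the SD appears to have received and written the complete data stream; the 0 in "FD Files Written" and "FD Bytes Written" reflects only that the director never got the final job status from the FD.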
After that the director finally sees them as errored out instead of still
running (but the clients report OK in the termination status).
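To put numbers on how long the director kept showing the jobs as running, here is a small sketch that computes the gaps from the timestamps quoted above. The SD log lines only show hours and minutes, so zero seconds and the year 2010 are assumed for them; the names used are my own.

```python
from datetime import datetime

FMT = "%d-%b-%Y %H:%M:%S"


def parse(ts: str) -> datetime:
    """Parse a timestamp such as '18-May-2010 16:00:58'."""
    return datetime.strptime(ts, FMT)


# JobId -> (job start, SD "write elapsed" log line, director error time).
# The SD times are only known to the minute; ":00" is an assumption.
JOBS = {
    32486: ("18-May-2010 16:00:58", "18-May-2010 16:40:00", "18-May-2010 18:04:31"),
    32487: ("18-May-2010 16:01:05", "18-May-2010 16:38:00", "18-May-2010 18:04:13"),
}


def gaps(start: str, sd_done: str, failed: str):
    """Return (start to error, SD done to error) as timedeltas."""
    t0, t1, t2 = parse(start), parse(sd_done), parse(failed)
    return t2 - t0, t2 - t1


for jobid, stamps in JOBS.items():
    total, stalled = gaps(*stamps)
    print(f"JobId {jobid}: start to error {total}, SD done to error {stalled}")
# JobId 32486: start to error 2:03:33, SD done to error 1:24:31
# JobId 32487: start to error 2:03:08, SD done to error 1:26:13
```

Both jobs end a little over two hours after they started and roughly 85 minutes after the SD reported it had finished writing, which matches the elapsed times in the reports.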
The /var/bacula/working/log file now contains the failure lines as well;
interestingly, it again continues mid-sentence where it left off before.
Is this a networking issue where some "I'm done" packet was lost or held
up? If so, does that message go to a different port (I don't think so), or
does it use a special protocol or packet form, so that a specific network
issue could block it but nothing else?