Re: [Bacula-users] jobs fail with various "broken pipe" errors
2012-02-23 04:13:59
On Wednesday 22 February 2012 15:20:10 Silver Salonen wrote:
> On Wed, 22 Feb 2012 12:33:49 +0100, Hugo Letemplier wrote:
> > I think you can try to configure the Heartbeat Interval directive on
> > your various daemons.
>
> Hi.
>
> My SD already had Heartbeat Interval set to 60. I now tried it on one
> FD too, but the job still failed with the same error.
>
> Other FD's on both FreeBSD and Linux are able to run jobs for hours and
> complete them successfully.
>
> --
> Silver
>
> > 2012/2/22 Silver Salonen <silver AT serverock DOT ee>:
> >> Hi.
> >>
> >>
> >>
> >> Recently we changed the network connection for our backup server
> >> which is
> >> Bacula 5.2.3 on FreeBSD 9.0.
> >>
> >>
> >>
> >> After that many jobs running across WAN started failing with various
> >> "broken
> >> pipe" errors. Some examples:
> >>
> >>
> >>
> >> 21-Feb 22:42 fbsd1-fd JobId 57779: Error: bsock.c:398 Wrote 32151
> >> bytes to
> >> Storage daemon:backupsrv.url:9103, but only 16384 accepted.
> >>
> >> 21-Feb 22:42 fbsd1-fd JobId 57779: Fatal error: backup.c:1024
> >> Network send
> >> error to SD. ERR=Broken pipe
> >>
> >> 21-Feb 22:42 fbsd1-fd JobId 57779: Error: bsock.c:339 Socket has
> >> errors=1 on
> >> call to Storage daemon:backupsrv.url:9103
> >>
> >>
> >>
> >> (this one runs from the same backup-network to another SD)
> >>
> >> 22-Feb 00:14 backupsrv-dir JobId 57852: Fatal error: Network error
> >> with FD
> >> during Backup: ERR=Broken pipe
> >>
> >> 22-Feb 00:14 backupsrv-sd2 JobId 57852: JobId=57852
> >> Job="linux1-userdata.2012-02-21_23.05.02_02" marked to be canceled.
> >>
> >> 22-Feb 00:14 backupsrv-sd2 JobId 57852: Job write elapsed time =
> >> 00:33:20,
> >> Transfer rate = 591.5 K Bytes/second
> >>
> >> 22-Feb 00:14 backupsrv-sd2 JobId 57852: Error: bsock.c:529 Read
> >> expected
> >> 65568 got 1448 from client:123.45.67.81:36643
> >>
> >> 22-Feb 00:14 backupsrv-dir JobId 57852: Fatal error: No Job status
> >> returned
> >> from FD.
> >>
> >>
> >>
> >> 22-Feb 00:16 backupsrv-dir JobId 57821: Fatal error: Network error
> >> with FD
> >> during Backup: ERR=Broken pipe
> >>
> >> 22-Feb 00:16 backupsrv-sd JobId 57821: Job write elapsed time =
> >> 00:57:00,
> >> Transfer rate = 26.69 K Bytes/second
> >>
> >> 22-Feb 00:16 backupsrv-dir JobId 57821: Fatal error: No Job status
> >> returned
> >> from FD.
> >>
> >>
> >>
> >> 22-Feb 00:24 winsrv1-fd JobId 57784: Error:
> >> /home/kern/bacula/k/bacula/src/lib/bsock.c:393 Write error sending
> >> 9363
> >> bytes to Storage daemon:backupsrv.url:9103: ERR=Input/output error
> >>
> >> 22-Feb 00:24 winsrv1-fd JobId 57784: Fatal error:
> >> /home/kern/bacula/k/bacula/src/filed/backup.c:1024 Network send
> >> error to SD.
> >> ERR=Input/output error
> >>
> >> 22-Feb 00:26 winsrv1-fd JobId 57784: Error:
> >> /home/kern/bacula/k/bacula/src/lib/bsock.c:339 Socket has errors=1
> >> on call
> >> to Storage daemon:backupsrv.url:9103
> >>
> >>
> >>
> >> (this one runs from the same backup-network to another SD)
> >>
> >> 22-Feb 01:33 backupsrv-dir JobId 57872: Fatal error: Socket error on
> >> ClientRunBeforeJob command: ERR=Broken pipe
> >>
> >> 22-Feb 01:33 backupsrv-dir JobId 57872: Fatal error: Client
> >> "winsrv2-fd"
> >> RunScript failed.
> >>
> >> 22-Feb 01:33 backupsrv-dir JobId 57872: Fatal error: Network error
> >> with FD
> >> during Backup: ERR=Broken pipe
> >>
> >> 22-Feb 01:33 backupsrv-sd2 JobId 57872: JobId=57872
> >> Job="winsrv2.2012-02-22_01.00.00_27" marked to be canceled.
> >>
> >> 22-Feb 01:33 backupsrv-dir JobId 57872: Fatal error: No Job status
> >> returned
> >> from FD.
> >>
> >>
> >>
> >> 22-Feb 01:51 fbsd2-fd JobId 57806: Error: bsock.c:398 Wrote 61750
> >> bytes to
> >> Storage daemon:backupsrv.url:9103, but only 16384 accepted.
> >>
> >> 22-Feb 01:51 fbsd2-fd JobId 57806: Fatal error: backup.c:1024
> >> Network send
> >> error to SD. ERR=Broken pipe
> >>
> >> 22-Feb 01:51 fbsd2-fd JobId 57806: Error: bsock.c:339 Socket has
> >> errors=1 on
> >> call to Storage daemon:backupsrv.url:9103
> >>
> >>
> >>
> >> 22-Feb 02:15 backupsrv-dir JobId 57819: Fatal error: Network error
> >> with FD
> >> during Backup: ERR=Connection reset by peer
> >>
> >> 22-Feb 02:15 backupsrv-dir JobId 57819: Fatal error: No Job status
> >> returned
> >> from FD.
> >>
> >>
> >>
> >>
> >>
> >> These jobs have been failing every day for a week now. Meanwhile
> >> other jobs
> >> complete just fine, and it seems not to about jobs' size or scripts
> >> to be
> >> run before jobs on clients etc.
> >>
> >>
> >>
> >> Any idea what could be wrong?
What's also interesting about these failures are these lines (similar in all
these failing jobs):
FD Files Written: 381
SD Files Written: 0
FD Bytes Written: 391,430,239 (391.4 MB)
SD Bytes Written: 0 (0 B)
Last Volume Bytes: 260 (260 B)
And the actual volume file seems to contain all the data (its size is 373MB).
What can we conclude from that?
Does the failure/timeout/whatever occur after the FD--SD connection, eg. when
SD tries to communicate with DIR about the end of the job or smth?
--
Silver
------------------------------------------------------------------------------
Virtualization & Cloud Management Using Capacity Planning
Cloud computing makes use of virtualization - but cloud computing
also focuses on allowing computing to be delivered as a service.
http://www.accelacomm.com/jaw/sfnl/114/51521223/
_______________________________________________
Bacula-users mailing list
Bacula-users AT lists.sourceforge DOT net
https://lists.sourceforge.net/lists/listinfo/bacula-users
|
|
|