Bacula-users

Re: [Bacula-users] jobs fail with various "broken pipe" errors

2012-02-28 05:03:35
Subject: Re: [Bacula-users] jobs fail with various "broken pipe" errors
From: Silver Salonen <silver AT serverock DOT ee>
To: bacula-users AT lists.sourceforge DOT net
Date: Tue, 28 Feb 2012 12:00:57 +0200
On Thursday 23 February 2012 11:11:54 Silver Salonen wrote:
> On Wednesday 22 February 2012 15:20:10 Silver Salonen wrote:
> > On Wed, 22 Feb 2012 12:33:49 +0100, Hugo Letemplier wrote:
> > > I think you can try to configure the Heartbeat Interval directive on
> > > your various daemons.
> > 
> > Hi.
> > 
> > My SD already had Heartbeat Interval set to 60. I now tried it on one 
> > FD too, but the job still failed with the same error.
> > 
> > Other FD's on both FreeBSD and Linux are able to run jobs for hours and 
> > complete them successfully.
> > 
> > --
> > Silver
> > 
> > > 2012/2/22 Silver Salonen <silver AT serverock DOT ee>:
> > >> Hi.
> > >>
> > >>
> > >>
> > >> Recently we changed the network connection for our backup server 
> > >> which is
> > >> Bacula 5.2.3 on FreeBSD 9.0.
> > >>
> > >>
> > >>
> > >> After that many jobs running across WAN started failing with various 
> > >> "broken
> > >> pipe" errors. Some examples:
> > >>
> > >>
> > >>
> > >> 21-Feb 22:42 fbsd1-fd JobId 57779: Error: bsock.c:398 Wrote 32151 
> > >> bytes to
> > >> Storage daemon:backupsrv.url:9103, but only 16384 accepted.
> > >>
> > >> 21-Feb 22:42 fbsd1-fd JobId 57779: Fatal error: backup.c:1024 
> > >> Network send
> > >> error to SD. ERR=Broken pipe
> > >>
> > >> 21-Feb 22:42 fbsd1-fd JobId 57779: Error: bsock.c:339 Socket has 
> > >> errors=1 on
> > >> call to Storage daemon:backupsrv.url:9103
> > >>
> > >>
> > >>
> > >> (this one runs from the same backup-network to another SD)
> > >>
> > >> 22-Feb 00:14 backupsrv-dir JobId 57852: Fatal error: Network error 
> > >> with FD
> > >> during Backup: ERR=Broken pipe
> > >>
> > >> 22-Feb 00:14 backupsrv-sd2 JobId 57852: JobId=57852
> > >> Job="linux1-userdata.2012-02-21_23.05.02_02" marked to be canceled.
> > >>
> > >> 22-Feb 00:14 backupsrv-sd2 JobId 57852: Job write elapsed time = 
> > >> 00:33:20,
> > >> Transfer rate = 591.5 K Bytes/second
> > >>
> > >> 22-Feb 00:14 backupsrv-sd2 JobId 57852: Error: bsock.c:529 Read 
> > >> expected
> > >> 65568 got 1448 from client:123.45.67.81:36643
> > >>
> > >> 22-Feb 00:14 backupsrv-dir JobId 57852: Fatal error: No Job status 
> > >> returned
> > >> from FD.
> > >>
> > >>
> > >>
> > >> 22-Feb 00:16 backupsrv-dir JobId 57821: Fatal error: Network error 
> > >> with FD
> > >> during Backup: ERR=Broken pipe
> > >>
> > >> 22-Feb 00:16 backupsrv-sd JobId 57821: Job write elapsed time = 
> > >> 00:57:00,
> > >> Transfer rate = 26.69 K Bytes/second
> > >>
> > >> 22-Feb 00:16 backupsrv-dir JobId 57821: Fatal error: No Job status 
> > >> returned
> > >> from FD.
> > >>
> > >>
> > >>
> > >> 22-Feb 00:24 winsrv1-fd JobId 57784: Error:
> > >> /home/kern/bacula/k/bacula/src/lib/bsock.c:393 Write error sending 
> > >> 9363
> > >> bytes to Storage daemon:backupsrv.url:9103: ERR=Input/output error
> > >>
> > >> 22-Feb 00:24 winsrv1-fd JobId 57784: Fatal error:
> > >> /home/kern/bacula/k/bacula/src/filed/backup.c:1024 Network send 
> > >> error to SD.
> > >> ERR=Input/output error
> > >>
> > >> 22-Feb 00:26 winsrv1-fd JobId 57784: Error:
> > >> /home/kern/bacula/k/bacula/src/lib/bsock.c:339 Socket has errors=1 
> > >> on call
> > >> to Storage daemon:backupsrv.url:9103
> > >>
> > >>
> > >>
> > >> (this one runs from the same backup-network to another SD)
> > >>
> > >> 22-Feb 01:33 backupsrv-dir JobId 57872: Fatal error: Socket error on
> > >> ClientRunBeforeJob command: ERR=Broken pipe
> > >>
> > >> 22-Feb 01:33 backupsrv-dir JobId 57872: Fatal error: Client 
> > >> "winsrv2-fd"
> > >> RunScript failed.
> > >>
> > >> 22-Feb 01:33 backupsrv-dir JobId 57872: Fatal error: Network error 
> > >> with FD
> > >> during Backup: ERR=Broken pipe
> > >>
> > >> 22-Feb 01:33 backupsrv-sd2 JobId 57872: JobId=57872
> > >> Job="winsrv2.2012-02-22_01.00.00_27" marked to be canceled.
> > >>
> > >> 22-Feb 01:33 backupsrv-dir JobId 57872: Fatal error: No Job status 
> > >> returned
> > >> from FD.
> > >>
> > >>
> > >>
> > >> 22-Feb 01:51 fbsd2-fd JobId 57806: Error: bsock.c:398 Wrote 61750 
> > >> bytes to
> > >> Storage daemon:backupsrv.url:9103, but only 16384 accepted.
> > >>
> > >> 22-Feb 01:51 fbsd2-fd JobId 57806: Fatal error: backup.c:1024 
> > >> Network send
> > >> error to SD. ERR=Broken pipe
> > >>
> > >> 22-Feb 01:51 fbsd2-fd JobId 57806: Error: bsock.c:339 Socket has 
> > >> errors=1 on
> > >> call to Storage daemon:backupsrv.url:9103
> > >>
> > >>
> > >>
> > >> 22-Feb 02:15 backupsrv-dir JobId 57819: Fatal error: Network error 
> > >> with FD
> > >> during Backup: ERR=Connection reset by peer
> > >>
> > >> 22-Feb 02:15 backupsrv-dir JobId 57819: Fatal error: No Job status 
> > >> returned
> > >> from FD.
> > >>
> > >>
> > >>
> > >>
> > >>
> > >> These jobs have been failing every day for a week now. Meanwhile 
> > >> other jobs
> > >> complete just fine, and it seems not to about jobs' size or scripts 
> > >> to be
> > >> run before jobs on clients etc.
> > >>
> > >>
> > >>
> > >> Any idea what could be wrong?
> 
> What's also interesting about these failures are these lines (similar in all 
> these failing jobs):
>   FD Files Written:       381
>   SD Files Written:       0
>   FD Bytes Written:       391,430,239 (391.4 MB)
>   SD Bytes Written:       0 (0 B)
>   Last Volume Bytes:      260 (260 B)
> 
> And the actual volume file seems to contain all the data (its size is 373MB).
> 
> What can we conclude from that?
> Does the failure/timeout/whatever occur after the FD--SD connection, eg. when 
> SD tries to communicate with DIR about the end of the job or smth?

I look at these lines:

28-Feb 00:04 backupsrv-dir JobId 58414: Sending Accurate information.
28-Feb 00:07 backupsrv-sd2 JobId 58414: Recycled volume 
"fbsd1-userdata-incr-6106" on device "device-fbsd1" (/mnt/backup/bacula/fbsd1), 
all previous data lost.
28-Feb 00:07 backupsrv-dir JobId 58414: Max Volume jobs=1 exceeded. Marking 
Volume "fbsd1-userdata-incr-6106" as Used.
28-Feb 00:40 backupsrv-dir JobId 58414: Fatal error: Network error with FD 
during Backup: ERR=Broken pipe
28-Feb 00:40 backupsrv-sd2 JobId 58414: JobId=58414 
Job="fbsd1-userdata.2012-02-27_23.05.02_42" marked to be canceled.
28-Feb 00:40 backupsrv-dir JobId 58414: Fatal error: No Job status returned 
from FD.
28-Feb 00:40 backupsrv-dir JobId 58414: Error: Bacula backupsrv-dir 5.2.3 
(16Dec11):

...and wonder why does DIR get a "Network error" after 33 minutes.

What does DIR talk about with FD after backing up?
Does DIR take up a new network session for that? It wouldn't be possible 
otherwise, would it?

If I understand it correctly, communications go smth like that:

1) DIR --> FD (auth, SD info etc)
2) DIR --> SD (auth, ...)
3) FD --> SD (data, attributes)
4) SD --> DIR (attributes)
5) DIR --> FD (??) - this is the point where "Broken pipe" occurs. What step is 
that?

--
Silver

------------------------------------------------------------------------------
Keep Your Developer Skills Current with LearnDevNow!
The most comprehensive online learning library for Microsoft developers
is just $99.99! Visual Studio, SharePoint, SQL - plus HTML5, CSS3, MVC3,
Metro Style Apps, more. Free future releases when you subscribe now!
http://p.sf.net/sfu/learndevnow-d2d
_______________________________________________
Bacula-users mailing list
Bacula-users AT lists.sourceforge DOT net
https://lists.sourceforge.net/lists/listinfo/bacula-users