Re: [Bacula-users] Bacula suddenly choking on Full backups with Unknown term code

2010-02-22 21:39:19
Subject: Re: [Bacula-users] Bacula suddenly choking on Full backups with Unknown term code
From: Glen Barber <glen.j.barber AT gmail DOT com>
To: bacula-users AT lists.sourceforge DOT net
Date: Mon, 22 Feb 2010 20:36:24 -0500
Hi Martin,

Martin Simmons wrote: 
> >>>>> On Sun, 21 Feb 2010 12:15:01 -0500, Glen Barber said:
> > 
> > fd JobId 13934: Fatal error: backup.c:892 Network send error to SD. ERR=Broken pipe
> > sd JobId 13934: Job client.2010-02-20_17.43.07 marked to be canceled.
> > sd JobId 13934: Fatal error: append.c:259 Network error on data channel. ERR=Connection reset by peer
> > sd JobId 13934: Job write elapsed time = 02:58:46, Transfer rate = 1.451 M bytes/second
> > sd JobId 13934: Error: bsock.c:444 Read error from client:xxx.xxx.xxx.xxx:36643: ERR=Connection reset by peer
> > dir JobId 13934: Error: Bacula dir 2.4.3 (10Oct08): 20-Feb-2010 21:12:25
> > 
> 
> The fd got "Network send error to SD. ERR=Broken pipe" so the fd's OS thinks
> that the socket was closed by the peer (i.e. the sd).
> 
> Conversely, the sd got "Network error on data channel. ERR=Connection reset by
> peer" so the sd's OS thinks the socket was forcibly closed by the peer
> (i.e. the fd).
> 
> They can't both be right, unless something in between is messing up.  That
> looks very much like a network problem to me, maybe not in the colo switch but
> somewhere in between.
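
For reference, the two ERR= strings in the quoted job log can be set side by side with a small script. This is only an illustrative sketch and not part of Bacula: the regular expression, the helper names `parse_errors` and `blames_each_other`, and the wording in `MEANING` are assumptions made here, and `SAMPLE` is the quoted log with its wrapped lines rejoined.

```python
import re

# Quoted job log lines that carry an ERR= value (wrapped lines rejoined).
SAMPLE = [
    "fd JobId 13934: Fatal error: backup.c:892 Network send error to SD. ERR=Broken pipe",
    "sd JobId 13934: Fatal error: append.c:259 Network error on data channel. ERR=Connection reset by peer",
    "sd JobId 13934: Error: bsock.c:444 Read error from client:xxx.xxx.xxx.xxx:36643: ERR=Connection reset by peer",
]

# Hypothetical pattern: daemon, JobId, source file:line, and the ERR= text.
LINE_RE = re.compile(
    r"^(?P<daemon>fd|sd|dir) JobId (?P<jobid>\d+): .*?"
    r"(?P<src>\w+\.c):(?P<line>\d+) .*ERR=(?P<err>.+)$"
)

# What each error text means to the host that reports it
# (general socket semantics, not Bacula-specific).
MEANING = {
    "Broken pipe": "wrote to a socket the local OS believes the peer closed",
    "Connection reset by peer": "local OS saw the connection aborted from the peer side",
}

def parse_errors(lines):
    """Return (daemon, jobid, 'file.c:line', err_text) for each matching line."""
    found = []
    for line in lines:
        m = LINE_RE.match(line.strip())
        if m:
            found.append((m["daemon"], int(m["jobid"]),
                          f'{m["src"]}:{m["line"]}', m["err"].strip()))
    return found

def blames_each_other(errors):
    """True if both the fd and the sd report an error that blames the peer."""
    blamers = {daemon for daemon, _, _, err in errors if err in MEANING}
    return {"fd", "sd"} <= blamers

if __name__ == "__main__":
    errors = parse_errors(SAMPLE)
    for daemon, jobid, where, err in errors:
        print(f"{daemon} (job {jobid}, {where}): {err} -> {MEANING.get(err, 'unknown')}")
    print("each side blames the other:", blames_each_other(errors))
```

When both daemons report a peer-side close for the same job, neither log alone says who dropped the connection, which is why something in the path between them is a reasonable suspect.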

I haven't completely dismissed this possibility yet, but the reproducible
mbox failures are too strange for me to rule out a problem with the file
as well.  I copied this file to the backup server itself and ran an
individual backup of it there without any issue.

Another run earlier today shows this in the debug log:

fd: backup.c:895-0 Send data to SD len=65552
fd: heartbeat.c:95-0 wait_intr=0 stop=0
fd: heartbeat.c:95-0 wait_intr=0 stop=0
fd: heartbeat.c:95-0 wait_intr=0 stop=0
fd: heartbeat.c:95-0 wait_intr=0 stop=0
[ ... a few more times ... ]
fd: heartbeat.c:139-0 Send kill to heartbeat id
fd: backup.c:197-0 end blast_data ok=0
fd: job.c:1447-0 Error in blast_data.
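
To gauge how long the fd sat between its last data send and the failure, the heartbeat lines after the final "Send data to SD" can be counted. The following is a rough sketch, not a Bacula tool: the function name `heartbeats_after_last_send` is made up here, and it assumes each `wait_intr=` line is one pass of the heartbeat wait loop. The elided part of the trace is left elided, so the count is only a lower bound.

```python
# Debug trace as posted above; the elided line is kept as-is and is not counted.
TRACE = [
    "fd: backup.c:895-0 Send data to SD len=65552",
    "fd: heartbeat.c:95-0 wait_intr=0 stop=0",
    "fd: heartbeat.c:95-0 wait_intr=0 stop=0",
    "fd: heartbeat.c:95-0 wait_intr=0 stop=0",
    "fd: heartbeat.c:95-0 wait_intr=0 stop=0",
    "[ ... a few more times ... ]",
    "fd: heartbeat.c:139-0 Send kill to heartbeat id",
    "fd: backup.c:197-0 end blast_data ok=0",
    "fd: job.c:1447-0 Error in blast_data.",
]

def heartbeats_after_last_send(trace_lines):
    """Return (length of the last data send, heartbeat lines logged after it)."""
    last_len = None
    beats = 0
    for line in trace_lines:
        if "Send data to SD len=" in line:
            # A new send restarts the count.
            last_len = int(line.rsplit("len=", 1)[1])
            beats = 0
        elif "heartbeat.c" in line and "wait_intr=" in line:
            beats += 1
    return last_len, beats

if __name__ == "__main__":
    length, beats = heartbeats_after_last_send(TRACE)
    print(f"last send len={length}, at least {beats} heartbeat passes before the failure")
```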

I'd like to be able to view the data stream as this failure occurs, but I
can't find how to do that in the documentation.  I can use truss or
ktrace if needed, but if Bacula has a built-in facility for this, that
would be even better.

Best,

-- 
Glen Barber
