Re: [Bacula-users] Bacula suddenly choking on Full backups with Unknown term code

2010-02-22 06:44:00
Subject: Re: [Bacula-users] Bacula suddenly choking on Full backups with Unknown term code
From: Martin Simmons <martin AT lispworks DOT com>
To: bacula-users AT lists.sourceforge DOT net
Date: Mon, 22 Feb 2010 11:41:19 GMT
>>>>> On Sun, 21 Feb 2010 12:15:01 -0500, Glen Barber said:
> 
> Howdy,
> 
> I'm running bacula 2.4.3 on FreeBSD which up until recently hasn't been
> giving me issues.
> 
> I run daily incrementals, weekly differentials, and monthly fulls on
> colo-stored clients.  One of these client machines began failing to complete
> differential and full backups, with an "Unknown term code" in the email
> notification, with the following in the log:
> 
> fd JobId 13934: Fatal error: backup.c:892 Network send error to SD. ERR=Broken pipe
> sd JobId 13934: Job client.2010-02-20_17.43.07 marked to be canceled.
> sd JobId 13934: Fatal error: append.c:259 Network error on data channel. ERR=Connection reset by peer
> sd JobId 13934: Job write elapsed time = 02:58:46, Transfer rate = 1.451 M bytes/second
> sd JobId 13934: Error: bsock.c:444 Read error from client:xxx.xxx.xxx.xxx:36643: ERR=Connection reset by peer
> dir JobId 13934: Error: Bacula dir 2.4.3 (10Oct08): 20-Feb-2010 21:12:25
> 
> In November, I changed the fileset for this client, where a full backup
> was scheduled and terminated successfully.  Since the initial full backup
> due to the fileset change, there have been two successful full and seven
> differentials which terminated successfully.  Incremental backups are not
> affected.
> 
> I initially began to suspect the network, but the colo switch does not show
> errors.  I've already enabled the heartbeat on the client with settings as
> low as 15 seconds, with no luck.  I ran the client fd with -d200 to track
> the failures, and found the backup was choking on an mbox file.

The fd got "Network send error to SD. ERR=Broken pipe" so the fd's OS thinks
that the socket was closed by the peer (i.e. the sd).

Conversely, the sd got "Network error on data channel. ERR=Connection reset by
peer" so the sd's OS thinks the socket was forcibly closed by the peer
(i.e. the fd).

They can't both be right, unless something in between is messing up.  That
looks very much like a network problem to me, maybe not in the colo switch but
somewhere in between.
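
To make that check repeatable, here is a minimal Python sketch that scans job
log lines and reports whether the fd and the sd each blame the other end for
closing the socket. It is not part of Bacula: the names PEER_CLOSED_ERRORS,
LINE_RE, daemons_blaming_peer and looks_like_network_problem are made up for
illustration, and it assumes each log message has been joined onto a single
line ending in "ERR=<text>".

```python
import re

# ERR= strings that mean "my OS believes the other end closed the socket".
# (Assumption: only these two strings are treated as peer-closed errors.)
PEER_CLOSED_ERRORS = {"Broken pipe", "Connection reset by peer"}

# Matches e.g. "fd JobId 13934: Fatal error: ... ERR=Broken pipe"
LINE_RE = re.compile(
    r"^(?P<daemon>fd|sd|dir) JobId (?P<jobid>\d+): .*ERR=(?P<err>.+?)\s*$"
)


def daemons_blaming_peer(log_lines):
    """Return the set of daemons whose errors say the peer closed the socket."""
    blaming = set()
    for line in log_lines:
        match = LINE_RE.match(line)
        if match and match.group("err") in PEER_CLOSED_ERRORS:
            blaming.add(match.group("daemon"))
    return blaming


def looks_like_network_problem(log_lines):
    """True when both the fd and the sd report that the other side closed first."""
    return {"fd", "sd"} <= daemons_blaming_peer(log_lines)


log = [
    "fd JobId 13934: Fatal error: backup.c:892 Network send error to SD. "
    "ERR=Broken pipe",
    "sd JobId 13934: Fatal error: append.c:259 Network error on data channel. "
    "ERR=Connection reset by peer",
]
print(looks_like_network_problem(log))
```

For the two errors quoted above this prints True, i.e. each daemon says the
other one went away first. That is only a hint to look at the path between the
two machines, not proof of where the fault is.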

__Martin
