Bacula-users

Re: [Bacula-users] bacula-fd crashes on FreeBSD 9.2

2013-11-04 15:03:59
From: Dan Langille <dan AT langille DOT org>
To: dweimer AT dweimer DOT net
Date: Mon, 4 Nov 2013 15:01:29 -0500
On Oct 30, 2013, at 4:48 PM, dweimer wrote:

> On 10/16/2013 5:43 pm, David Newman wrote:
>> On 10/16/13 12:44 PM, dweimer wrote:
>>> On 10/16/2013 2:13 pm, David Newman wrote:
>>>> On 10/14/13 2:44 AM, Martin Simmons wrote:
>>>>>>>>>> On Sun, 13 Oct 2013 18:25:07 -0700, David Newman said:
>>>>>> 
>>>>>> On 10/9/13 4:41 PM, David Newman wrote:
>>>>>>> FreeBSD 9.2-RELEASE, bacula-client-5.2.12_3 installed from ports
>>>>>>> 
>>>>>>> Ever since upgrading this host to FreeBSD 9.2, bacula-fd crashes as
>>>>>>> soon as bacula-dir starts a backup job. The entry in
>>>>>>> /var/log/messages is:
>>>>>>> 
>>>>>>> Oct  9 16:25:50 o bacula-fd: Bacula interrupted by signal 0: UNKNOWN SIGNAL
>>>>>>> 
>>>>>>> Backups worked fine on this host running FreeBSD 9.1 and other hosts
>>>>>>> upgraded to FreeBSD 9.2 run backups OK.
>>>>>>> 
>>>>>>> I've done the uninstall/reinstall thing with the bacula-client port,
>>>>>>> but that made no difference.
>>>>>>> 
>>>>>>> Thanks in advance for troubleshooting clues.
>>>>>>> 
>>>>>>> dn
>>>>>> 
>>>>>> Is there a Wireshark decode for Bacula?
>>>>>> 
>>>>>> I'm still stuck on this problem, and need more info on what's causing
>>>>>> that UNKNOWN SIGNAL error. Wireshark 1.8.6 just shows strings of bytes
>>>>>> for the Bacula stuff.
>>>>>> 
>>>>>> Thanks.
>>>>>> 
>>>>>> dn
>>>>> 
>>>>> A wireshark decode won't help much here because problems like this
>>>>> must be in the fd itself.
>>>>> 
>>>>> Try attaching gdb to the bacula-fd process and see if it catches the
>>>>> mysterious signal (see
>>>>> http://www.bacula.org/5.2.x-manuals/en/problems/problems/What_Do_When_Bacula.html#SECTION00640000000000000000).
>>>> 
>>>> No luck with this. Per that URL, I've put the btraceback.gdb file in
>>>> the same directory as the bacula-fd executable on the client (in this
>>>> case, /usr/local/sbin) and made the .gdb file executable.
>>>> 
>>>> At run time it produces this error:
>>>> 
>>>> /usr/local/sbin/btraceback.gdb:1: Error in sourced command file:
>>>> No symbol table is loaded.  Use the "file" command.
>>>> 
>>>> That's problem 1. Problem 2 is that the syntax given for capturing
>>>> STDERR and STDOUT -- 2>\&1 -- doesn't work on either csh (root's
>>>> default on FreeBSD) or bash.
>>>> 
>>>> Any ideas on remedying either issue?
>>>> 
>>>> Thanks.
>>>> 
>>>> dn
>>>> 
>>> 
>>> I use 2>&1, with no backslash before the ampersand, with /bin/sh in
>>> several cron scripts on FreeBSD, and it seems to do the job.
>> 
>> Thanks, that works for capturing STDERR and STDOUT.
>> 
>> But that .gdb file still produces the same error:
>> 
>> /usr/local/sbin/btraceback.gdb:1: Error in sourced command file:
>> No symbol table is loaded.  Use the "file" command.
>> 
>> So, I'm still blocked on debugging this issue.
>> 
>> dn
>> 
>> 
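On the redirection syntax discussed above: in Bourne-style shells (sh, bash) the form is 2>&1 with no backslash, placed after the file redirect, while csh uses >& instead. A small sketch of the sh form follows; the noisy function and the temporary log path are made up for illustration and stand in for whatever command is being traced.

```sh
#!/bin/sh
# Bourne-shell (sh, bash) form: send stdout to a file, then point stderr at
# the same place. Order matters: "2>&1" must come after the file redirect.
log=$(mktemp /tmp/redirect-demo.XXXXXX)

# Hypothetical stand-in for the traced command; it writes to both streams.
noisy() {
    echo "to stdout"
    echo "to stderr" >&2
}

noisy > "$log" 2>&1
cat "$log"

# csh/tcsh has no "2>&1"; the equivalent there is:  command >& logfile
```

Both lines should end up in the log file, since stderr follows stdout into it.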
> 
> Well, one of my FreeBSD 9.2 systems decided to take a new route to this
> problem.  My backups started failing this morning without the
> bacula-fd process stopping: it starts the client run before job script,
> then after two hours fails with no response from the client.
> 
> 2013-10-30 07:52:34   bacula-dir JobId 291: Start Backup JobId 291, Job=Webmail-Backup.2013-10-30_07.52.32_46
> 2013-10-30 07:52:34   bacula-dir JobId 291: Using Device "FileStorage"
> 2013-10-30 07:52:35   webmail-fd JobId 291: shell command: run ClientRunBeforeJob "/root/bacula/before.sh"
> 2013-10-30 07:52:35   webmail-fd JobId 291: ClientRunBeforeJob:
> 2013-10-30 07:52:35   webmail-fd JobId 291: ClientRunBeforeJob: Create PostgreSQL Backup...
> 2013-10-30 07:52:35   webmail-fd JobId 291: ClientRunBeforeJob:
> 2013-10-30 07:52:35   webmail-fd JobId 291: ClientRunBeforeJob: Getting Database List
> 2013-10-30 07:52:35   webmail-fd JobId 291: ClientRunBeforeJob:
> 2013-10-30 09:58:46   bacula-dir JobId 291: Fatal error: Socket error on ClientRunBeforeJob command: ERR=Connection reset by peer

I have no idea, but I do have one suggestion, just for kicks.

I've long been skeptical of multiple run before/after scripts, and I've
always preferred to have just one.  Is it worth combining them into one?
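To illustrate what a single script could look like, here is a rough sketch. It is not the actual /root/bacula/before.sh: the step functions and the run_steps helper are placeholders made up for this example, and the real dump and snapshot commands would go in the function bodies.

```sh
#!/bin/sh
# Hypothetical single before-job wrapper: runs each step in order and
# stops at the first failure, so the caller sees one exit status.

# Placeholder steps; replace the bodies with the real commands.
dump_databases() {
    echo "Create PostgreSQL Backup..."
    return 0
}

snapshot_filesystems() {
    echo "Create ZFS snapshots..."
    return 0
}

# Run the named steps in order; return the status of the first one that fails.
run_steps() {
    for step in "$@"; do
        "$step" || {
            rc=$?
            echo "step '$step' failed with status $rc" >&2
            return "$rc"
        }
    done
    return 0
}

run_steps dump_databases snapshot_filesystems
# In a real before-job script this line would be "exit $?" instead.
echo "before-job status: $?"
```

The job would then need only one ClientRunBeforeJob entry pointing at the wrapper, and a failure in any step would show up as a single non-zero exit status.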

> 
> 2013-10-30 09:58:46   bacula-dir JobId 291: Fatal error: Client "webmail-fd" RunScript failed.
> 2013-10-30 09:58:46   bacula-dir JobId 291: Fatal error: Network error with FD during Backup: ERR=Connection reset by peer

That definitely sounds like a networking issue, or at least some kind of
communication problem between the Director and the FD.

> 
> 2013-10-30 09:58:47   bacula-dir JobId 291: Fatal error: No Job status returned from FD.
> 2013-10-30 09:58:47   bacula-dir JobId 291: Error: Bacula bacula-dir 5.2.12 (12Sep12):
>   Build OS:               amd64-portbld-freebsd9.2 freebsd 9.2-RELEASE
>   JobId:                  291
>   Job:                    Webmail-Backup.2013-10-30_07.52.32_46
>   Backup Level:           Incremental, since=2013-10-29 00:07:02
>   Client:                 "webmail-fd" 5.2.12 (12Sep12) amd64-portbld-freebsd9.2,freebsd,9.2-RELEASE
>   FileSet:                "WebmailZFS-FileSet" 2013-09-27 13:12:07
>   Pool:                   "File" (From Job resource)
>   Catalog:                "MyCatalog" (From Client resource)
>   Storage:                "File" (From Pool resource)
>   Scheduled time:         30-Oct-2013 07:52:30
>   Start time:             30-Oct-2013 07:52:34
>   End time:               30-Oct-2013 09:58:47
>   Elapsed time:           2 hours 6 mins 13 secs
>   Priority:               10
>   FD Files Written:       0
>   SD Files Written:       0
>   FD Bytes Written:       0 (0 B)
>   SD Bytes Written:       0 (0 B)
>   Rate:                   0.0 KB/s
>   Software Compression:   None
>   VSS:                    no
>   Encryption:             no
>   Accurate:               no
>   Volume name(s):
>   Volume Session Id:      6
>   Volume Session Time:    1383098903
>   Last Volume Bytes:      27,632,643,492 (27.63 GB)
>   Non-fatal FD errors:    1
>   SD Errors:              0
>   FD termination status:  Error
>   SD termination status:  OK
>   Termination:            *** Backup Error ***
> 
> 
> When I check this server, the client run before job script completed,
> all the database dumps were successful, and the ZFS snapshots that
> follow the database dumps completed as well.  However, Bacula stops
> returning the script's status.
> 
> This server was running fine up through the full backup done Monday
> morning, but now runs right back into this problem on every backup
> attempt today.  A reboot didn't help, and trying a full backup instead
> of an incremental made no difference.
> 
> I canceled one of the attempts and restarted it after removing the client
> run before script; it's now backing up files just fine.  So I have
> temporarily removed the client run before job and set up a cron job
> that runs 30 minutes before the backup to execute my database backups
> and ZFS snapshots.

Do smaller jobs help?  That is, if you do not have the RunBefore scripts,
does the job work?
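If the job only hangs with the RunBefore script in place, one thing that might be worth ruling out is a process started by the script that keeps the script's stdout or stderr open after the script itself has exited. This is an assumption on my part; nothing in the logs above confirms it. As far as I know, the FD collects the script's output through a pipe, so it can keep waiting as long as anything still holds that pipe open. A minimal sketch of the idea, where start_detached is a made-up helper and sleep stands in for a long-running child:

```sh
#!/bin/sh
# Hypothetical helper: start a long-running command in the background with
# stdin, stdout and stderr detached from the caller's pipe.
start_detached() {
    "$@" > /dev/null 2>&1 < /dev/null &
}

# The command substitution below reads from a pipe, much like a parent
# process collecting a script's output. It returns as soon as the subshell
# exits, because the background sleep no longer holds the pipe open.
out=$(start_detached sleep 3; echo "script finished")
echo "$out"
```

Without the redirects inside start_detached, the same command substitution would block until the background command exits, which is the kind of hang worth checking for.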

> I can find no errors logged on the server running the bacula-fd or on the
> bacula server, with the exception of the timeout error message.  I tried
> adding a heartbeat interval of 1 minute on the client, but that didn't
> help either.
> 
> -- 
> Thanks,
>    Dean E. Weimer
>    http://www.dweimer.net/
> 
> _______________________________________________
> Bacula-users mailing list
> Bacula-users AT lists.sourceforge DOT net
> https://lists.sourceforge.net/lists/listinfo/bacula-users

-- 
Dan Langille - http://langille.org

