Re: [Bacula-users] SD crashes

2012-02-14 10:22:50
From: Joe Nyland <joenyland AT me DOT com>
To: Bacula Users <bacula-users AT lists.sourceforge DOT net>
Date: Tue, 14 Feb 2012 15:18:18 +0000 (GMT)
On 13 Feb, 2012, at 02:30 PM, "Joe Nyland" <joenyland AT me DOT com> wrote:

On 13 Feb, 2012, at 02:11 PM, John Drescher <drescherjm AT gmail DOT com> wrote:

2012/2/13 Joe Nyland <joenyland AT me DOT com>:
> Hello everyone,
>
> I hope someone can suggest why I am seeing the following behaviour in my
> current Bacula setup:
>
> Since the tail end of last week, I have been having issues with my MySQL
> backups in Bacula, where they would randomly appear to 'crash', normally
> when performing a copy of a backup to another pool - but I'm not sure yet if
> this is the trigger.
>
> Running 'status dir' after one of these 'crashes' gives the following output
> for the running jobs:
>
> Running Jobs:
> Console connected at 12-Feb-12 15:53
> Console connected at 13-Feb-12 06:58
>  JobId Level   Name                       Status
> ======================================================================
>   2107 Full    WebServer1_MySQL_Copy.2012-02-13_04.30.00_28 is running
> <Crashed Job>
>   2108 Full    WebServer1_MySQL.2012-02-13_04.30.00_29 is running <Crashed
> Job>
>   2111 Full    MythTVServer1_MySQL.2012-02-13_05.00.00_32 is waiting for
> higher priority jobs to finish
>   2113 Full    TestServer_MySQL.2012-02-13_05.00.00_34 is waiting execution
>   2114 Full    MythTVServer1_MySQL_Copy.2012-02-13_05.30.00_35 is waiting
> execution
>   2115 Full    WebServer1_MySQL_Copy.2012-02-13_05.30.00_36 is waiting
> execution
>   2116 Full    WebServer1_MySQL.2012-02-13_05.30.00_37 has a fatal error
>   2117 Full    TestServer_MySQL_Copy.2012-02-13_05.30.00_38 is waiting
> execution
>   2121 Full    MythTVServer1_MySQL_Copy.2012-02-13_06.30.00_42 is waiting
> execution
>   2122 Full    WebServer1_MySQL_Copy.2012-02-13_06.30.00_43 is waiting
> execution
>   2123 Full    WebServer1_MySQL.2012-02-13_06.30.00_44 has a fatal error
>   2124 Full    TestServer_MySQL_Copy.2012-02-13_06.30.00_45 is waiting
> execution
>   2125 Full    MythTVServer1_MySQL.2012-02-13_07.00.00_47 has a fatal error
>   2126 Full    WebServer1_MySQL.2012-02-13_07.00.00_48 has a fatal error
> ====
>
> Once the above appears, I am unable to view the status of any storage
> resource on my SD:
>
> *status storage=FileServer1_Full
> Connecting to Storage daemon FileServer1_Full at FileServer1:9103
>
> FileServer1-sd Version: 5.0.1 (24 February 2010) x86_64-pc-linux-gnu ubuntu
> 10.04
> Daemon started 12-Feb-12 15:53, 92 Jobs run since started.
>  Heap: heap=1,671,168 smbytes=1,188,608 max_bytes=1,388,208 bufs=577
> max_bufs=994
> Sizes: boffset_t=8 size_t=8 int32_t=4 int64_t=8
>
> Running Jobs:
> Reading: Full Copy job WebServer1_MySQL_Copy JobId=2107
> Volume="WebServer1_MySQL_1325"
>     pool="WebServer1_MySQL" device="WebServer1_MySQL"
> (/mnt/backup/Bacula/Databases/WebServer1)
>     Files=4 Bytes=164,924 Bytes/sec=17
>     FDSocket closed
> ====
>
> Jobs waiting to reserve a drive:
> ====
>
> Terminated Jobs:
>  JobId  Level    Files      Bytes   Status   Finished        Name
> ===================================================================
>   2091  Full          2    92.45 K  OK       13-Feb-12 03:30
> TestServer_MySQL_Copy
>   2096  Full          5    2.258 M  OK       13-Feb-12 03:30
> MythTVServer1_MySQL_Copy
>   2098  Full          4    164.9 K  OK       13-Feb-12 03:30
> WebServer1_MySQL_Copy
>   2100  Full          2    92.45 K  OK       13-Feb-12 03:30
> TestServer_MySQL_Copy
>   2078  Full      1,145    2.942 G  OK       13-Feb-12 03:31 SVN_Copy
>   2102  Full          5    2.259 M  OK       13-Feb-12 04:01
> MythTVServer1_MySQL
>   2103  Full          4    164.9 K  OK       13-Feb-12 04:01
> WebServer1_MySQL
>   2104  Full          2    92.37 K  OK       13-Feb-12 04:01
> TestServer_MySQL
>   2105  Full          5    2.259 M  OK       13-Feb-12 04:30
> MythTVServer1_MySQL_Copy
>   2109  Full          2    92.37 K  OK       13-Feb-12 04:30
> TestServer_MySQL_Copy
> ====
>
> Device status:
> Device "Default" (/mnt/backup/Bacula) is not open.
> <snip>
> Device "WebServer1_Inc" (/mnt/backup/Bacula/WebServer1/Incremental) is not
> open.
> Device "WebServer1_MySQL" (/mnt/backup/Bacula/Databases/WebServer1) is
> mounted with:
>     Volume:      WebServer1_MySQL_1325
>     Pool:        WebServer1_MySQL
>     Media type:  File
>     Total Bytes Read=0 Blocks Read=0 Bytes/block=0
>     Positioned at File=0 Block=0
> Device "WebServer1_MySQL_Copy" (/mnt/mac_backup/Bacula/Databases/WebServer1)
> is not open.
> Device "WebServer1_Full_Copy" (/mnt/mac_backup/Bacula/WebServer1/Full) is
> not open.
> Device "WebServer1_Inc_Copy"
> (/mnt/mac_backup/Bacula/WebServer1/Incrementals) is not open.
> <snip>
> Device "SharedData_Diff" (/mnt/backup/Bacula/Shared/Differential) is not
> open.
> ====
>
> Used Volume status:
>
> NOTE: bconsole appears to crash here - no further output is produced, and
> bconsole does not respond to any key presses. I have to press Ctrl + C to
> exit bconsole. Furthermore, the only way I can clear out the failed jobs
> from the 'Running jobs queue' is to exit from bconsole, issue 'sudo service
> bacula-sd stop' twice, then restart the SD and restart bacula-director.
>
>
> For 4 of my clients, I run a MySQL backup hourly at 00:00, 01:00, etc. I
> then copy the MySQL backups to another storage resource on my SD at 00:30,
> 01:30, etc. The MySQL databases which I am backing up are relatively small,
> the biggest of which is my Bacula catalog (~160 MB), although this backup
> is currently disabled and the database is backed up outside of Bacula until
> I can resolve this issue.
>
> Here's the config for one of the clients' MySQL backups:
>
> JobDefs {
>   Name = DefaultBackup
>   Type = Backup
>   Accurate = yes
>   Level = Full
>   Client = FileServer1-fd
>   Messages = Standard
>   Pool = Default
>   Storage = Default
>   Priority = 10
>   Allow Duplicate Jobs = No
>   Cancel Lower Level Duplicates = yes
> }
>
> JobDefs {
>   Name = DefaultCopy
>   Type = Copy
>   Level = Full
>   Client = FileServer1-fd
>   Messages = Standard
>   Selection Type = PoolUncopiedJobs
>   Priority = 12
> }
>
> Job {
>   Name = TestServer_MySQL
>   Type = Backup
>   JobDefs = DefaultBackup
>   Client = TestServer-fd
>   FileSet = "MySQL Databases"
>   ClientRunBeforeJob = "/etc/bacula/scripts/client-scripts/mysql-backup.sh
> bacula_backup Gromit123"
>   ClientRunAfterJob = "/etc/bacula/scripts/client-scripts/mysql-backup.sh
> cleanup"
>   Schedule = "Hourly MySQL Database Schedule"
>   Messages = Standard
>   Pool = TestServer_MySQL
>   Storage = TestServer_MySQL
>   Enabled = No
> }
>
> Job {
>   Name = "TestServer_MySQL_Copy"
>   JobDefs = DefaultCopy
>   Type = Copy
>   Client = TestServer-fd
>   FileSet = "MySQL Databases"
>   Pool = TestServer_MySQL
>   Messages = Standard
>   Schedule = "Hourly MySQL Database Copy Schedule"
>   Storage = TestServer_MySQL
>   Enabled = No
> }
>
> Reading back through console messages leading up to the crash, there doesn't
> appear to be any suggestion for why the jobs have crashed, only messages
> about duplicate jobs not being allowed for the jobs which are queued after
> the crashed jobs at the top of the queue.
>
>
> If I can provide any further information to help diagnose this issue, please
> let me know and I will be able to provide it.
>

I would look at the log for the sd. One way to get this is to run
bacula-sd in a console with the debug -d 100 option enabled instead of
running it as a daemon. You can also google for bacula kaboom for more
debugging tips.
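
For anyone who wants the foreground run above to leave a timestamped log behind, here is a minimal Python sketch. It is not part of Bacula; the function name run_and_log and the log path are made up for illustration, and the command to run is supplied by the caller. It merges stderr into stdout, on the assumption that some of the daemon's debug output may go to stderr, so nothing is lost when the process dies.

```python
import datetime
import subprocess
import sys


def run_and_log(cmd, log_path):
    """Run cmd in the foreground, echoing each output line and appending it to log_path.

    stderr is merged into stdout so both streams end up in the log.
    Returns the command's exit status.
    """
    with open(log_path, "a", encoding="utf-8") as log:
        proc = subprocess.Popen(
            cmd,
            stdout=subprocess.PIPE,
            stderr=subprocess.STDOUT,  # capture both streams in one pipe
            text=True,
            errors="replace",
        )
        for line in proc.stdout:
            stamp = datetime.datetime.now().isoformat(timespec="seconds")
            entry = f"{stamp} {line.rstrip()}\n"
            sys.stdout.write(entry)
            log.write(entry)
            log.flush()  # keep the log current in case the daemon dies abruptly
        return proc.wait()


# Hypothetical usage (paths are examples, adjust to your installation):
# run_and_log(
#     ["bacula-sd", "-c", "/etc/bacula/bacula-sd.conf", "-f", "-d", "100"],
#     "/tmp/bacula-sd.debug.log",
# )
```

Flushing after every line costs a little throughput, but it means the last lines before a crash are on disk rather than sitting in a buffer.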


John
 
Hi John,

Thank you for your reply too - only just received it after replying to Adrian Reyer.

That sounds like a logical step to me too. I'll set this up later on, so that it's in place for when it happens again.

Thank you for your input.

Joe
 
Hello,

I've been running the SD using the following command (I know the combination of options I have used may be excessive, but I wanted as much chance of catching the error as I could!) since yesterday afternoon:
   sudo bacula-sd -c /etc/bacula/bacula-sd.conf -d 100 -dt -f -u bacula -g tape -m -v | tee -a /mnt/array/bacula-sd.screen.log

However, (as luck would have it) I've not seen the behaviour I originally reported whilst running with debug options.

Could running the SD with the combination of options above change its behaviour, or interfere with it in any way? I'm asking because I have re-enabled all of the backup jobs on the server, and I have still not seen it crash again.
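
Since the hang has not come back yet, one way to notice it when it does is to poll the status command with a timeout rather than watching bconsole by hand. The following is only a sketch: status_responds is a made-up helper, and it assumes the console program reads its commands from standard input and exits after "quit". A console that never returns within the timeout is treated as the hang described earlier in this thread.

```python
import subprocess


def status_responds(console_cmd, command_text, timeout_s=60):
    """Feed command_text to console_cmd on stdin; return (responded, output).

    responded is False if the console did not exit within timeout_s seconds.
    """
    try:
        result = subprocess.run(
            console_cmd,
            input=command_text,
            stdout=subprocess.PIPE,
            stderr=subprocess.STDOUT,
            text=True,
            timeout=timeout_s,
        )
    except subprocess.TimeoutExpired as exc:
        # The child is killed on timeout; keep whatever output was captured.
        partial = exc.output or ""
        if isinstance(partial, bytes):
            partial = partial.decode(errors="replace")
        return False, partial
    return True, result.stdout


# Hypothetical usage, with the storage name taken from the output above:
# ok, output = status_responds(
#     ["bconsole"], "status storage=FileServer1_Full\nquit\n", timeout_s=60
# )
# if not ok:
#     print("SD status did not return; last output was:\n" + output)
```

Run from cron every few minutes, something like this would at least record when the SD stops answering, which can then be matched against the debug log.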

Thanks,

Joe