2012/2/13 Joe Nyland <joenyland AT me DOT com>:
> Hello everyone,
>
> I hope someone can suggest why I am seeing the following behaviour in my
> current Bacula setup:
>
> Since the tail end of last week, I have been having issues with my MySQL
> backups in Bacula: they randomly appear to 'crash', usually while a backup
> is being copied to another pool, although I'm not yet sure that this is the
> trigger.
>
> Running 'status dir' after one of these 'crashes' gives the following output
> for the running jobs:
>
> Running Jobs:
> Console connected at 12-Feb-12 15:53
> Console connected at 13-Feb-12 06:58
> JobId Level Name Status
> ======================================================================
> 2107 Full WebServer1_MySQL_Copy.2012-02-13_04.30.00_28 is running <Crashed Job>
> 2108 Full WebServer1_MySQL.2012-02-13_04.30.00_29 is running <Crashed Job>
> 2111 Full MythTVServer1_MySQL.2012-02-13_05.00.00_32 is waiting for higher priority jobs to finish
> 2113 Full TestServer_MySQL.2012-02-13_05.00.00_34 is waiting execution
> 2114 Full MythTVServer1_MySQL_Copy.2012-02-13_05.30.00_35 is waiting execution
> 2115 Full WebServer1_MySQL_Copy.2012-02-13_05.30.00_36 is waiting execution
> 2116 Full WebServer1_MySQL.2012-02-13_05.30.00_37 has a fatal error
> 2117 Full TestServer_MySQL_Copy.2012-02-13_05.30.00_38 is waiting execution
> 2121 Full MythTVServer1_MySQL_Copy.2012-02-13_06.30.00_42 is waiting execution
> 2122 Full WebServer1_MySQL_Copy.2012-02-13_06.30.00_43 is waiting execution
> 2123 Full WebServer1_MySQL.2012-02-13_06.30.00_44 has a fatal error
> 2124 Full TestServer_MySQL_Copy.2012-02-13_06.30.00_45 is waiting execution
> 2125 Full MythTVServer1_MySQL.2012-02-13_07.00.00_47 has a fatal error
> 2126 Full WebServer1_MySQL.2012-02-13_07.00.00_48 has a fatal error
> ====
>
> Once the above appears, I am unable to view the status of any storage
> resource on my SD:
>
> *status storage=FileServer1_Full
> Connecting to Storage daemon FileServer1_Full at FileServer1:9103
>
> FileServer1-sd Version: 5.0.1 (24 February 2010) x86_64-pc-linux-gnu ubuntu 10.04
> Daemon started 12-Feb-12 15:53, 92 Jobs run since started.
> Heap: heap=1,671,168 smbytes=1,188,608 max_bytes=1,388,208 bufs=577 max_bufs=994
> Sizes: boffset_t=8 size_t=8 int32_t=4 int64_t=8
>
> Running Jobs:
> Reading: Full Copy job WebServer1_MySQL_Copy JobId=2107 Volume="WebServer1_MySQL_1325"
> pool="WebServer1_MySQL" device="WebServer1_MySQL" (/mnt/backup/Bacula/Databases/WebServer1)
> Files=4 Bytes=164,924 Bytes/sec=17
> FDSocket closed
> ====
>
> Jobs waiting to reserve a drive:
> ====
>
> Terminated Jobs:
> JobId Level Files Bytes Status Finished Name
> ===================================================================
> 2091 Full 2 92.45 K OK 13-Feb-12 03:30 TestServer_MySQL_Copy
> 2096 Full 5 2.258 M OK 13-Feb-12 03:30 MythTVServer1_MySQL_Copy
> 2098 Full 4 164.9 K OK 13-Feb-12 03:30 WebServer1_MySQL_Copy
> 2100 Full 2 92.45 K OK 13-Feb-12 03:30 TestServer_MySQL_Copy
> 2078 Full 1,145 2.942 G OK 13-Feb-12 03:31 SVN_Copy
> 2102 Full 5 2.259 M OK 13-Feb-12 04:01 MythTVServer1_MySQL
> 2103 Full 4 164.9 K OK 13-Feb-12 04:01 WebServer1_MySQL
> 2104 Full 2 92.37 K OK 13-Feb-12 04:01 TestServer_MySQL
> 2105 Full 5 2.259 M OK 13-Feb-12 04:30 MythTVServer1_MySQL_Copy
> 2109 Full 2 92.37 K OK 13-Feb-12 04:30 TestServer_MySQL_Copy
> ====
>
> Device status:
> Device "Default" (/mnt/backup/Bacula) is not open.
> <snip>
> Device "WebServer1_Inc" (/mnt/backup/Bacula/WebServer1/Incremental) is not open.
> Device "WebServer1_MySQL" (/mnt/backup/Bacula/Databases/WebServer1) is mounted with:
> Volume: WebServer1_MySQL_1325
> Pool: WebServer1_MySQL
> Media type: File
> Total Bytes Read=0 Blocks Read=0 Bytes/block=0
> Positioned at File=0 Block=0
> Device "WebServer1_MySQL_Copy" (/mnt/mac_backup/Bacula/Databases/WebServer1) is not open.
> Device "WebServer1_Full_Copy" (/mnt/mac_backup/Bacula/WebServer1/Full) is not open.
> Device "WebServer1_Inc_Copy" (/mnt/mac_backup/Bacula/WebServer1/Incrementals) is not open.
> <snip>
> Device "SharedData_Diff" (/mnt/backup/Bacula/Shared/Differential) is not open.
> ====
>
> Used Volume status:
>
> NOTE: bconsole appears to crash here - no further output is produced, and
> bconsole does not respond to any key presses. I have to press Ctrl+C to exit
> bconsole. Furthermore, the only way I can clear out the failed jobs from the
> 'Running jobs queue' is to exit from bconsole, issue 'sudo service
> bacula-sd stop' twice, then restart the SD and restart bacula-director.
>
>
> For 4 of my clients, I run an hourly MySQL backup at 00:00, 01:00, etc. I
> then copy the MySQL backups to another storage resource on my SD at 00:30,
> 01:30, etc. The MySQL databases I am backing up are relatively small; the
> biggest is my Bacula catalog at ~160 MB, although that backup is currently
> disabled and the database is backed up outside of Bacula until I can
> resolve this issue.
>
> Here's the config for one of the clients' MySQL backups:
>
> JobDefs {
> Name = DefaultBackup
> Type = Backup
> Accurate = yes
> Level = Full
> Client = FileServer1-fd
> Messages = Standard
> Pool = Default
> Storage = Default
> Priority = 10
> Allow Duplicate Jobs = No
> Cancel Lower Level Duplicates = yes
> }
>
> JobDefs {
> Name = DefaultCopy
> Type = Copy
> Level = Full
> Client = FileServer1-fd
> Messages = Standard
> Selection Type = PoolUncopiedJobs
> Priority = 12
> }
>
> Job {
> Name = TestServer_MySQL
> Type = Backup
> JobDefs = DefaultBackup
> Client = TestServer-fd
> FileSet = "MySQL Databases"
> ClientRunBeforeJob = "/etc/bacula/scripts/client-scripts/mysql-backup.sh bacula_backup Gromit123"
> ClientRunAfterJob = "/etc/bacula/scripts/client-scripts/mysql-backup.sh cleanup"
> Schedule = "Hourly MySQL Database Schedule"
> Messages = Standard
> Pool = TestServer_MySQL
> Storage = TestServer_MySQL
> Enabled = No
> }
>
> Job {
> Name = "TestServer_MySQL_Copy"
> JobDefs = DefaultCopy
> Type = Copy
> Client = TestServer-fd
> FileSet = "MySQL Databases"
> Pool = TestServer_MySQL
> Messages = Standard
> Schedule = "Hourly MySQL Database Copy Schedule"
> Storage = TestServer_MySQL
> Enabled = No
> }
>
> Reading back through the console messages leading up to the crash, there is
> no indication of why the jobs crashed, only messages about duplicate jobs
> not being allowed for the jobs queued behind the crashed jobs at the top of
> the queue.
>
>
> If I can provide any further information to help diagnose this issue, please
> let me know and I will be able to provide it.
>
I would look at the log for the SD. One way to get it is to run
bacula-sd in a console with the debug option -d 100 enabled, instead of
running it as a daemon. You can also search the web for "bacula kaboom"
for more debugging tips.
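
As a rough sketch of that debug run, assuming the stock Ubuntu package
paths (/usr/sbin/bacula-sd and /etc/bacula/bacula-sd.conf are assumptions
here, not taken from your post; sd_debug_cmd is just an illustrative helper
name), something like the following prints the command to use. -f keeps the
daemon in the foreground, -d sets the debug level, and -c names the config
file:

```shell
#!/bin/sh
# Assumed locations for an Ubuntu package install; adjust to your system.
SD_BIN="/usr/sbin/bacula-sd"
SD_CONF="/etc/bacula/bacula-sd.conf"

# Print the foreground debug command for a given debug level (default 100).
# Nothing is started here; the command is only printed for review.
sd_debug_cmd() {
    level="${1:-100}"
    printf '%s -f -d %s -c %s\n' "$SD_BIN" "$level" "$SD_CONF"
}

sd_debug_cmd 100
```

Stop the running SD first (sudo service bacula-sd stop), run the printed
command as root in a terminal, then let the copy job run again and watch
what the SD prints just before it hangs.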
John