Hello,
We are using Bacula 5.2.13-18 on CentOS 6, and from time to time bacula-sd crashes with the error below, causing all backups to fail until bacula-sd is started again:
Mar 3 06:59:00 XXXX bacula-sd: XXXX:storage:default: ABORTING due to ERROR in lockmgr.c:100#012Mutex lock failure. ERR=Invalid argument
Mar 3 06:59:00 XXXX bacula-sd: Bacula interrupted by signal 6: IOT trap
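To track how often this happens, the abort line can be picked apart with a small script. This is only a minimal sketch: the regular expression is our own and is written for this one message shape, and it assumes that "#012" is the way rsyslog writes an embedded newline (octal 012) in the message.

```python
import re

# Sample line copied from the syslog excerpt above (host name anonymized as XXXX).
LOG_LINE = (
    "Mar 3 06:59:00 XXXX bacula-sd: XXXX:storage:default: ABORTING due to "
    "ERROR in lockmgr.c:100#012Mutex lock failure. ERR=Invalid argument"
)

# Hypothetical pattern, written for this one message shape only.
# "#012" is treated as an escaped newline between location and message.
ABORT_RE = re.compile(
    r"ABORTING due to ERROR in (?P<file>[\w.]+):(?P<line>\d+)"
    r"(?:#012|\s)+(?P<message>.*?)\s*ERR=(?P<err>.*)$"
)


def parse_abort(line):
    """Return source file, line number, message and ERR text, or None."""
    match = ABORT_RE.search(line)
    if match is None:
        return None
    return {
        "file": match.group("file"),
        "line": int(match.group("line")),
        "message": match.group("message"),
        "err": match.group("err"),
    }


if __name__ == "__main__":
    print(parse_abort(LOG_LINE))
```

Lines that are not abort messages, such as the "interrupted by signal 6" line, return None and can simply be skipped.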
Setup:
3 Servers:
1 Bacula Director (extra machine)
1 Bacula Catalog Server (extra machine)
1 Bacula Storage Daemon (extra machine)
We have ~573 jobs (several TB, all full backups) to back up each day. Jobs are scheduled across the day according to when the server's load is lowest, and distributed evenly otherwise:
Time Jobs
0:00-1:00 35
1:00-2:00 121
2:00-3:00 93
3:00-4:00 60
4:00-5:00 46
5:00-6:00 71
6:00-7:00 60
7:00-8:00 43
8:00-9:00 32
9:00-10:00 12
10:00-11:00 7
11:00-12:00 3
12:00-13:00 5
13:00-14:00 2
14:00-15:00 7
15:00-16:00 8
16:00-17:00 7
17:00-18:00 3
18:00-19:00 2
19:00-20:00 3
20:00-21:00 11
21:00-22:00 14
22:00-23:00 28
23:00-24:00 25
Our SD is configured with 20 virtual drives in a backup-to-disk setup, allowing 20 concurrent backups to disk. Each backup job is stored as an individual file on the backend (so full backups can be accessed and restored through bls/bextract).
We also have an external “scripted” job that removes unused/purged volumes from disk.
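To give a feel for the load on the SD, the busiest window in the table can be compared with the 20 concurrent slots. This is a rough sketch only: it assumes the counts in the table are jobs started per hour, and the function names are made up for illustration.

```python
# Jobs per hourly window, copied from the table above (index 0 is 0:00-1:00).
JOBS_PER_HOUR = [35, 121, 93, 60, 46, 71, 60, 43, 32, 12, 7, 3,
                 5, 2, 7, 8, 7, 3, 2, 3, 11, 14, 28, 25]

# Number of backups the storage is configured to run at the same time.
CONCURRENT_SLOTS = 20


def busiest_window(jobs_per_hour):
    """Return (starting hour, job count) of the window with the most jobs."""
    hour = max(range(len(jobs_per_hour)), key=lambda h: jobs_per_hour[h])
    return hour, jobs_per_hour[hour]


def jobs_per_slot(job_count, slots):
    """Average number of jobs each slot has to take in one window."""
    return job_count / slots


if __name__ == "__main__":
    hour, count = busiest_window(JOBS_PER_HOUR)
    print(f"{hour}:00-{hour + 1}:00 has {count} jobs, "
          f"{jobs_per_slot(count, CONCURRENT_SLOTS):.2f} per slot")
```

In the 1:00-2:00 window this works out to roughly six jobs per drive within the hour, so drives are being reserved and released very frequently at that time.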
Bacula Director Configuration:
------------------------------
Storage {
Name = "XXXX:storage:default"
Address = HOSTNAME_OF_THE_SD_MACHINE
Password = "SECRET"
Device = "FileStorage"
Maximum Concurrent Jobs = 20
Media Type = File
Heartbeat Interval = 15
TLS Enable = no
}
Pool {
Name = " HOSTNAME_OF_THE_SD_MACHINE:pool:default"
Storage = "XXXX:storage:default"
# All Volumes will have the format standard.date.time to ensure they
# are kept unique throughout the operation and also aid quick analysis
# We won't use a counter format for this at the moment.
Label Format = "BACULA-${Job}.${Year}${Month:p/2/0/r}${Day:p/2/0/r}.${Hour:p/2/0/r}${Minute:p/2/0/r}.${JobId}"
Pool Type = Backup
# Clean up any we don't need, and keep them for a maximum of a month (in
# theory the same time period for weekly backups from the clients)
# Note the files for the old volumes will still remain on the disk but will
# be truncated to a zero size.
Recycle = No
Auto Prune = Yes
Action On Purge = Truncate
Volume Retention = 30 days
# Don't allow re-use of volumes; one volume per job only
Maximum Volume Jobs = 1
}
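To illustrate what the Label Format above produces, here is a small sketch that imitates the expansion outside of Bacula. It assumes that ${Job} expands to the job name and that ":p/2/0/r" pads the value to two characters with leading zeros. The job name, date and JobId in the usage line are invented.

```python
from datetime import datetime


def volume_name(job_name, started, job_id):
    """Approximate the Pool's Label Format expansion for one job.

    This imitates the format string only; it is not Bacula code.
    """
    return (
        f"BACULA-{job_name}."
        f"{started.year}{started.month:02d}{started.day:02d}."
        f"{started.hour:02d}{started.minute:02d}."
        f"{job_id}"
    )


if __name__ == "__main__":
    # Hypothetical job name, start time and JobId, for illustration only.
    print(volume_name("example-job", datetime(2014, 3, 3, 6, 59), 4711))
```

With Maximum Volume Jobs = 1, every job therefore ends up in its own uniquely named volume file.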
Bacula SD Configuration:
------------------------------
Autochanger {
Name = "FileStorage"
Changer Device = /dev/null
Changer Command = ""
Device = FileStorage-sd-0
Device = FileStorage-sd-1
Device = FileStorage-sd-2
Device = FileStorage-sd-3
Device = FileStorage-sd-4
Device = FileStorage-sd-5
Device = FileStorage-sd-6
Device = FileStorage-sd-7
Device = FileStorage-sd-8
Device = FileStorage-sd-9
Device = FileStorage-sd-10
Device = FileStorage-sd-11
Device = FileStorage-sd-12
Device = FileStorage-sd-13
Device = FileStorage-sd-14
Device = FileStorage-sd-15
Device = FileStorage-sd-16
Device = FileStorage-sd-17
Device = FileStorage-sd-18
Device = FileStorage-sd-19
Device = FileStorage-sd-20
}
Autochanger {
Name = "FileStorage-restore"
Changer Device = /dev/null
Changer Command = ""
Device = FileStorage-sd-restore-0
Device = FileStorage-sd-restore-1
Device = FileStorage-sd-restore-2
Device = FileStorage-sd-restore-3
Device = FileStorage-sd-restore-4
Device = FileStorage-sd-restore-5
Device = FileStorage-sd-restore-6
Device = FileStorage-sd-restore-7
Device = FileStorage-sd-restore-8
Device = FileStorage-sd-restore-9
Device = FileStorage-sd-restore-10
Device = FileStorage-sd-restore-11
Device = FileStorage-sd-restore-12
Device = FileStorage-sd-restore-13
Device = FileStorage-sd-restore-14
Device = FileStorage-sd-restore-15
Device = FileStorage-sd-restore-16
Device = FileStorage-sd-restore-17
Device = FileStorage-sd-restore-18
Device = FileStorage-sd-restore-19
Device = FileStorage-sd-restore-20
}
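As a sanity check, the Device entries of each Autochanger can be counted and compared with Maximum Concurrent Jobs in the Director's Storage resource. The following is a minimal sketch with a deliberately simple parser of our own; it only understands flat Autochanger blocks like the ones above and is not a general Bacula config parser.

```python
import re


def devices_per_changer(conf_text):
    """Map each Autochanger name to the list of Device names it references."""
    changers = {}
    # Only "Autochanger {" opens a block; "Autochanger = yes" is ignored.
    for block in re.finditer(r"Autochanger\s*\{(.*?)\}", conf_text, re.S):
        body = block.group(1)
        name = re.search(r'^\s*Name\s*=\s*"?([^"\n]+)"?', body, re.M)
        # Anchored at line start, so "Changer Device = ..." is not counted.
        devices = re.findall(r"^\s*Device\s*=\s*(\S+)", body, re.M)
        if name:
            changers[name.group(1).strip()] = devices
    return changers


# Sample text rebuilt from the first Autochanger above,
# with Device lines for indexes 0 through 20 as in the listing.
SAMPLE = (
    'Autochanger {\n'
    'Name = "FileStorage"\n'
    'Changer Device = /dev/null\n'
    'Changer Command = ""\n'
    + "".join(f"Device = FileStorage-sd-{i}\n" for i in range(21))
    + "}\n"
)

if __name__ == "__main__":
    for changer, devices in devices_per_changer(SAMPLE).items():
        print(changer, len(devices))
```

Running it against the real bacula-sd.conf instead of SAMPLE shows, per Autochanger, how many drives the SD actually offers.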
The backup drives are defined like this:
Device {
Name = FileStorage-sd-0 # Add a hyphen to SD/autochanger name & match with drive index
Device Type = File
Media Type = File #unique to each archive device path, different path, different mediatype
Archive Device = /bacula/data01
AutomaticMount = yes
AlwaysOpen = yes
RemovableMedia = yes
Autochanger = yes
Drive Index = 0
Maximum Concurrent Jobs = 1
Volume Poll Interval = 5
LabelMedia = yes
Spool Directory = /bacula/spool01
Autoselect = yes
Maximum Network Buffer Size = 65536
}
… 18 more…
Device {
Name = FileStorage-sd-20 # Add a hyphen to SD/autochanger name & match with drive index
Device Type = File
Media Type = File #unique to each archive device path, different path, different mediatype
Archive Device = /bacula/data01
AutomaticMount = yes
AlwaysOpen = yes
RemovableMedia = yes
Autochanger = yes
Drive Index = 20
Maximum Concurrent Jobs = 1
Volume Poll Interval = 5
LabelMedia = yes
Spool Directory = /bacula/spool01
Autoselect = yes
Maximum Network Buffer Size = 65536
}
The restore drives are defined like this:
Device {
Name = FileStorage-sd-restore-0 # Add a hyphen to SD/autochanger name & match with drive index
Device Type = File
Media Type = File #unique to each archive device path, different path, different mediatype
Archive Device = /bacula/data01
AutomaticMount = yes
AlwaysOpen = yes
RemovableMedia = yes
Autochanger = yes
Drive Index = 0
Maximum Concurrent Jobs = 1
Volume Poll Interval = 5
LabelMedia = yes
Spool Directory = /bacula/spool01
Autoselect = no
Maximum Network Buffer Size = 65536
}
Any idea what’s causing the bacula-sd crash? How can we debug this further?
Regards,
Robert