Re: [Bacula-users] Bacula SD 5.2.13 crash - Mutex lock failure. ERR=Invalid argument

2015-03-03 08:27:36
Subject: Re: [Bacula-users] Bacula SD 5.2.13 crash - Mutex lock failure. ERR=Invalid argument
From: "Clark, Patricia A." <clarkpa AT ornl DOT gov>
To: Robert Heinzmann <r.heinzmann AT freelancer.traviangames DOT com>, "bacula-users AT lists.sourceforge DOT net" <bacula-users AT lists.sourceforge DOT net>
Date: Tue, 3 Mar 2015 13:21:35 +0000
Is there any reason not to update to Bacula v7? It contains a number of fixes as 
well as new features. The version that you are running is nearly 2 years old; 
there were a few bug fixes along the way, but there have been no updates since 
April 2014.

Patti Clark
Linux System Administrator
R&D Systems Support Oak Ridge National Laboratory

From: Robert Heinzmann <r.heinzmann AT freelancer.traviangames DOT com>
Date: Tuesday, March 3, 2015 at 3:37 AM
To: "bacula-users AT lists.sourceforge DOT net" <bacula-users AT lists.sourceforge DOT net>
Subject: [Bacula-users] Bacula SD 5.2.13 crash - Mutex lock failure. ERR=Invalid argument

Hello,

We are using Bacula 5.2.13-18 on CentOS 6, and from time to time bacula-sd 
crashes with the following error, causing all backups to fail until bacula-sd 
is started again:

Mar  3 06:59:00 XXXX bacula-sd: XXXX:storage:default: ABORTING due to ERROR in lockmgr.c:100#012Mutex lock failure. ERR=Invalid argument
Mar  3 06:59:00 XXXX bacula-sd: Bacula interrupted by signal 6: IOT trap
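
In the syslog lines above, "#012" is most likely the octal escape that the syslog daemon substitutes for a line feed, so the abort message was originally two lines. A minimal Python sketch for decoding such escapes when reading the log; the helper name is hypothetical and not part of Bacula:

```python
import re

def decode_syslog_escapes(line: str) -> str:
    """Replace #ooo octal escapes (e.g. #012 for LF) with the characters they encode."""
    return re.sub(r"#([0-7]{3})", lambda m: chr(int(m.group(1), 8)), line)

raw = ("Mar  3 06:59:00 XXXX bacula-sd: XXXX:storage:default: ABORTING due to "
       "ERROR in lockmgr.c:100#012Mutex lock failure. ERR=Invalid argument")

decoded = decode_syslog_escapes(raw)
print(decoded)
```

Decoded this way, the first line names the source location (lockmgr.c:100) and the second carries the actual error text.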

Setup:

3 Servers:
  1 Bacula Director (extra machine)
  1 Bacula Catalog Server (extra machine)
  1 Bacula Storage Daemon (extra machine)

We have ~573 Jobs (some TB, all Full Backups) to back up each day. Jobs are 
scheduled across the day according to when the server's load is lowest, and 
distributed evenly otherwise:

Time Jobs
0:00-1:00 35
1:00-2:00 121
2:00-3:00 93
3:00-4:00 60
4:00-5:00 46
5:00-6:00 71
6:00-7:00 60
7:00-8:00 43
8:00-9:00 32
9:00-10:00 12
10:00-11:00 7
11:00-12:00 3
12:00-13:00 5
13:00-14:00 2
14:00-15:00 7
15:00-16:00 8
16:00-17:00 7
17:00-18:00 3
18:00-19:00 2
19:00-20:00 3
20:00-21:00 11
21:00-22:00 14
22:00-23:00 28
23:00-24:00 25

Our SD is configured with 20 virtual drives in a backup-to-disk setup, allowing 
20 concurrent backups to disk. Each Backup Job is written to an individual file 
in the backend (so full backups can be accessed and restored through 
bls/bextract). We also have an external “scripted” job, which cleans up 
unused / purged volumes from disk.
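
To put the schedule next to the 20-drive limit, here is a rough Python sketch. It assumes the counts in the table are job start times per hour and that at most 20 jobs run at once; both are assumptions for illustration, not measurements from the SD.

```python
# Jobs per hour, copied from the schedule table (index 0 = 0:00-1:00).
jobs_per_hour = [35, 121, 93, 60, 46, 71, 60, 43, 32, 12, 7, 3,
                 5, 2, 7, 8, 7, 3, 2, 3, 11, 14, 28, 25]
MAX_CONCURRENT = 20  # concurrent backups allowed by the setup described above

# Hour with the most job starts.
peak_hour = max(range(24), key=lambda h: jobs_per_hour[h])

# Hours in which more jobs start than there are drives, so some must wait.
busy_hours = [h for h, n in enumerate(jobs_per_hour) if n > MAX_CONCURRENT]

# Average runtime each job may take if the peak hour's jobs are to finish within it.
minutes_per_job = MAX_CONCURRENT * 60 / jobs_per_hour[peak_hour]

print(f"peak hour: {peak_hour}:00 with {jobs_per_hour[peak_hour]} jobs")
print(f"hours with more starts than drives: {busy_hours}")
print(f"average runtime budget at peak: {minutes_per_job:.1f} min per job")
```

Under these assumptions, the SD has jobs queued for a drive during a large part of the night, which is when drive reservation and volume handling are busiest.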


Bacula Director Configuration:
------------------------------

Storage {
  Name = "XXXX:storage:default"
  Address = HOSTNAME_OF_THE_SD_MACHINE
  Password = "SECRET"
  Device = "FileStorage"
  Maximum Concurrent Jobs = 20
  Media Type = File
  Heartbeat Interval = 15
  TLS Enable = no
}

Pool {
  Name = " HOSTNAME_OF_THE_SD_MACHINE:pool:default"
  Storage = "XXXX:storage:default"
  # All Volumes will have the format standard.date.time to ensure they
  # are kept unique throughout the operation and also aid quick analysis
  # We won't use a counter format for this at the moment.
  Label Format = "BACULA-${Job}.${Year}${Month:p/2/0/r}${Day:p/2/0/r}.${Hour:p/2/0/r}${Minute:p/2/0/r}.${JobId}"
  Pool Type = Backup
  # Clean up any we don't need, and keep them for a maximum of a month (in
  # theory the same time period for weekly backups from the clients)
  # Note the files for the old volumes will still remain on the disk but will
  # be truncated to a zero size.
  Recycle = No
  Auto Prune = Yes
  Action On Purge = Truncate
  Volume Retention = 30 days
  # Don't allow re-use of volumes; one volume per job only
  Maximum Volume Jobs = 1
}
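
For illustration, the volume names produced by the Label Format above can be approximated in Python. This sketch assumes that ":p/2/0/r" pads a value to width 2 with "0", right-justified, and that ${Job} expands to the job name; the job name and JobId used here are hypothetical.

```python
from datetime import datetime

def volume_label(job: str, job_id: int, when: datetime) -> str:
    """Approximate the Pool's Label Format for one job run."""
    # %m, %d, %H and %M are zero-padded to two digits, mirroring :p/2/0/r.
    return f"BACULA-{job}.{when:%Y%m%d}.{when:%H%M}.{job_id}"

# Hypothetical job name and JobId, for illustration only.
label = volume_label("web01-full", 12345, datetime(2015, 3, 3, 6, 59))
print(label)
```

Because the JobId is part of the name, each run gets a distinct volume, which matches the "Maximum Volume Jobs = 1" setting.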

Bacula SD Configuration:
------------------------------

Autochanger {
  Name = "FileStorage"
  Changer Device = /dev/null
  Changer Command = ""
    Device = FileStorage-sd-0
    Device = FileStorage-sd-1
    Device = FileStorage-sd-2
    Device = FileStorage-sd-3
    Device = FileStorage-sd-4
    Device = FileStorage-sd-5
    Device = FileStorage-sd-6
    Device = FileStorage-sd-7
    Device = FileStorage-sd-8
    Device = FileStorage-sd-9
    Device = FileStorage-sd-10
    Device = FileStorage-sd-11
    Device = FileStorage-sd-12
    Device = FileStorage-sd-13
    Device = FileStorage-sd-14
    Device = FileStorage-sd-15
    Device = FileStorage-sd-16
    Device = FileStorage-sd-17
    Device = FileStorage-sd-18
    Device = FileStorage-sd-19
    Device = FileStorage-sd-20

}


Autochanger {
  Name = "FileStorage-restore"
  Changer Device = /dev/null
  Changer Command = ""
    Device = FileStorage-sd-restore-0
    Device = FileStorage-sd-restore-1
    Device = FileStorage-sd-restore-2
    Device = FileStorage-sd-restore-3
    Device = FileStorage-sd-restore-4
    Device = FileStorage-sd-restore-5
    Device = FileStorage-sd-restore-6
    Device = FileStorage-sd-restore-7
    Device = FileStorage-sd-restore-8
    Device = FileStorage-sd-restore-9
    Device = FileStorage-sd-restore-10
    Device = FileStorage-sd-restore-11
    Device = FileStorage-sd-restore-12
    Device = FileStorage-sd-restore-13
    Device = FileStorage-sd-restore-14
    Device = FileStorage-sd-restore-15
    Device = FileStorage-sd-restore-16
    Device = FileStorage-sd-restore-17
    Device = FileStorage-sd-restore-18
    Device = FileStorage-sd-restore-19
    Device = FileStorage-sd-restore-20

}

Backup Drives like this:

Device {
  Name = FileStorage-sd-0 # Add a hyphen to SD/autochanger name & match with drive index
  Device Type = File
  Media Type = File #unique to each archive device path, different path, different mediatype
  Archive Device = /bacula/data01
  AutomaticMount = yes
  AlwaysOpen = yes
  RemovableMedia = yes
  Autochanger = yes
  Drive Index = 0
  Maximum Concurrent Jobs = 1
  Volume Poll Interval = 5
  LabelMedia = yes
  Spool Directory = /bacula/spool01
  Autoselect = yes
  Maximum Network Buffer Size = 65536
}

… 18 more…

Device {
  Name = FileStorage-sd-20 # Add a hyphen to SD/autochanger name & match with drive index
  Device Type = File
  Media Type = File #unique to each archive device path, different path, different mediatype
  Archive Device = /bacula/data01
  AutomaticMount = yes
  AlwaysOpen = yes
  RemovableMedia = yes
  Autochanger = yes
  Drive Index = 20
  Maximum Concurrent Jobs = 1
  Volume Poll Interval = 5
  LabelMedia = yes
  Spool Directory = /bacula/spool01
  Autoselect = yes
  Maximum Network Buffer Size = 65536
}
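
Since the backup Device resources shown differ only in Name and Drive Index, they could be generated from a template rather than maintained by hand. A minimal Python sketch, assuming the elided devices follow the same pattern as the two shown and that indices run from 0 through 20 as listed in the Autochanger resource:

```python
# Template taken from the Device resources shown above; only the index varies.
TEMPLATE = """Device {{
  Name = FileStorage-sd-{index}
  Device Type = File
  Media Type = File
  Archive Device = /bacula/data01
  AutomaticMount = yes
  AlwaysOpen = yes
  RemovableMedia = yes
  Autochanger = yes
  Drive Index = {index}
  Maximum Concurrent Jobs = 1
  Volume Poll Interval = 5
  LabelMedia = yes
  Spool Directory = /bacula/spool01
  Autoselect = yes
  Maximum Network Buffer Size = 65536
}}
"""

def render_devices(first: int, last: int) -> str:
    """Render one Device resource per drive index in [first, last]."""
    return "\n".join(TEMPLATE.format(index=i) for i in range(first, last + 1))

config = render_devices(0, 20)
print(config)
```

Generating the blocks this way keeps each Name and Drive Index pair consistent and makes it easy to count how many devices are actually defined.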

Restore Drives like this:

Device {
  Name = FileStorage-sd-restore-0 # Add a hyphen to SD/autochanger name & match with drive index
  Device Type = File
  Media Type = File #unique to each archive device path, different path, different mediatype
  Archive Device = /bacula/data01
  AutomaticMount = yes
  AlwaysOpen = yes
  RemovableMedia = yes
  Autochanger = yes
  Drive Index = 0
  Maximum Concurrent Jobs = 1
  Volume Poll Interval = 5
  LabelMedia = yes
  Spool Directory = /bacula/spool01
  Autoselect = no
  Maximum Network Buffer Size = 65536
}

Any idea what’s causing the bacula-sd crash? How can we debug this further?

Regards,
Robert

