[Bacula-users] Bacula SD 5.2.13 crash - Mutex lock failure. ERR=Invalid argument

2015-03-03 04:00:56
From: Robert Heinzmann <r.heinzmann AT freelancer.traviangames DOT com>
To: "bacula-users AT lists.sourceforge DOT net" <bacula-users AT lists.sourceforge DOT net>
Date: Tue, 3 Mar 2015 08:37:44 +0000

Hello,

 

We are running Bacula 5.2.13-18 on CentOS 6. From time to time bacula-sd crashes with the error below, and all backups fail until bacula-sd is started again:

 

Mar  3 06:59:00 XXXX bacula-sd: XXXX:storage:default: ABORTING due to ERROR in lockmgr.c:100#012Mutex lock failure. ERR=Invalid argument

Mar  3 06:59:00 XXXX bacula-sd: Bacula interrupted by signal 6: IOT trap

 

Setup:

 

3 Servers:

  1 Bacula Director (extra machine)

  1 Bacula Catalog Server (extra machine)

  1 Bacula Storage Daemon (extra machine)

 

We have ~573 jobs (some TB, all full backups) to back up each day. Jobs are scheduled for the time of minimum load on the server and distributed evenly otherwise:

 

Time          Jobs
0:00-1:00     35
1:00-2:00     121
2:00-3:00     93
3:00-4:00     60
4:00-5:00     46
5:00-6:00     71
6:00-7:00     60
7:00-8:00     43
8:00-9:00     32
9:00-10:00    12
10:00-11:00   7
11:00-12:00   3
12:00-13:00   5
13:00-14:00   2
14:00-15:00   7
15:00-16:00   8
16:00-17:00   7
17:00-18:00   3
18:00-19:00   2
19:00-20:00   3
20:00-21:00   11
21:00-22:00   14
22:00-23:00   28
23:00-24:00   25
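
To put the schedule in relation to the 20 concurrent drives, here is a rough sketch that finds the busiest hour and the average time each job may take if that hour's jobs are all to finish within the hour. The function names are made up for illustration, and the "everything finishes within the hour" model is only an assumption to get a feel for the load, not how Bacula actually schedules jobs.

```python
# Hourly job counts exactly as listed in the schedule above (0:00-1:00 first).
JOBS_PER_HOUR = [35, 121, 93, 60, 46, 71, 60, 43, 32, 12, 7, 3,
                 5, 2, 7, 8, 7, 3, 2, 3, 11, 14, 28, 25]

# Number of concurrent backup slots on the SD, as described in this post.
DRIVES = 20


def peak_hour(counts):
    """Return (start_hour, jobs) for the busiest one-hour window."""
    start = max(range(len(counts)), key=lambda h: counts[h])
    return start, counts[start]


def minutes_per_job(jobs, drives=DRIVES):
    """Average minutes available per job if `jobs` jobs must all finish
    within one hour on `drives` parallel slots (simplifying assumption)."""
    jobs_per_drive = jobs / drives
    return 60.0 / jobs_per_drive


if __name__ == "__main__":
    hour, jobs = peak_hour(JOBS_PER_HOUR)
    print(f"peak window {hour}:00-{hour + 1}:00 with {jobs} jobs")
    print(f"average time budget per job: {minutes_per_job(jobs):.1f} min")
```

With the counts above the busiest window is 1:00-2:00, where each of the 20 drives has to take roughly six jobs, so jobs queue for a drive during that hour.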

 

Our SD is configured with 20 virtual drives in a backup-to-disk setup, allowing 20 concurrent backups to disk. Each backup job is written to its own file on disk (so full backups can be accessed and restored through bls/bextract). We have an external “scripted” job which cleans up unused / purged volumes from disk.
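
For illustration only, a cleanup pass of this kind could look roughly like the sketch below. This is not the script in use here. It assumes that purged volumes are the zero-byte files left behind by "Action On Purge = Truncate", that volume files start with the "BACULA-" prefix from the Label Format, and that a minimum age (a made-up safety margin) protects files that were only just created. It only lists candidates and deletes nothing.

```python
import os
import time

# Archive directory taken from the Device resources in this post.
ARCHIVE_DIR = "/bacula/data01"


def purged_volume_candidates(archive_dir, prefix="BACULA-",
                             min_age_s=3600, now=None):
    """Return sorted paths of zero-byte volume files older than min_age_s.

    Dry run only: nothing is removed. `prefix` and `min_age_s` are
    illustrative assumptions, not values taken from a real script.
    """
    now = time.time() if now is None else now
    found = []
    for entry in os.scandir(archive_dir):
        # Only regular files whose name looks like one of our volumes.
        if not entry.is_file(follow_symlinks=False):
            continue
        if not entry.name.startswith(prefix):
            continue
        st = entry.stat(follow_symlinks=False)
        # Truncated (purged) volumes are empty; skip very recent files.
        if st.st_size == 0 and now - st.st_mtime >= min_age_s:
            found.append(entry.path)
    return sorted(found)


if __name__ == "__main__":
    if os.path.isdir(ARCHIVE_DIR):
        for path in purged_volume_candidates(ARCHIVE_DIR):
            print(path)
```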

 

 

Bacula Director Configuration:

------------------------------

 

Storage {

  Name = "XXXX:storage:default"

  Address = HOSTNAME_OF_THE_SD_MACHINE

  Password = "SECRET"

  Device = "FileStorage"

  Maximum Concurrent Jobs = 20

  Media Type = File

  Heartbeat Interval = 15

  TLS Enable = no

}

 

Pool {

  Name = " HOSTNAME_OF_THE_SD_MACHINE:pool:default"

  Storage = "XXXX:storage:default"

  # All Volumes will have the format standard.date.time to ensure they

  # are kept unique throughout the operation and also aid quick analysis

  # We won't use a counter format for this at the moment.

  Label Format = "BACULA-${Job}.${Year}${Month:p/2/0/r}${Day:p/2/0/r}.${Hour:p/2/0/r}${Minute:p/2/0/r}.${JobId}"

  Pool Type = Backup

  # Clean up any we don't need, and keep them for a maximum of a month (in

  # theory the same time period for weekly backups from the clients)

  # Note the files for the old volumes will still remain on the disk but will

  # be truncated to a zero size.

  Recycle = No

  Auto Prune = Yes

  Action On Purge = Truncate

  Volume Retention = 30 days

  # Don't allow re-use of volumes; one volume per job only

  Maximum Volume Jobs = 1

}

 

Bacula SD Configuration:

------------------------------

 

Autochanger {

  Name = "FileStorage"

  Changer Device = /dev/null

  Changer Command = ""

    Device = FileStorage-sd-0

    Device = FileStorage-sd-1

    Device = FileStorage-sd-2

    Device = FileStorage-sd-3

    Device = FileStorage-sd-4

    Device = FileStorage-sd-5

    Device = FileStorage-sd-6

    Device = FileStorage-sd-7

    Device = FileStorage-sd-8

    Device = FileStorage-sd-9

    Device = FileStorage-sd-10

    Device = FileStorage-sd-11

    Device = FileStorage-sd-12

    Device = FileStorage-sd-13

    Device = FileStorage-sd-14

    Device = FileStorage-sd-15

    Device = FileStorage-sd-16

    Device = FileStorage-sd-17

    Device = FileStorage-sd-18

    Device = FileStorage-sd-19

    Device = FileStorage-sd-20

 

}

 

 

Autochanger {

  Name = "FileStorage-restore"

  Changer Device = /dev/null

  Changer Command = ""

    Device = FileStorage-sd-restore-0

    Device = FileStorage-sd-restore-1

    Device = FileStorage-sd-restore-2

    Device = FileStorage-sd-restore-3

    Device = FileStorage-sd-restore-4

    Device = FileStorage-sd-restore-5

    Device = FileStorage-sd-restore-6

    Device = FileStorage-sd-restore-7

    Device = FileStorage-sd-restore-8

    Device = FileStorage-sd-restore-9

    Device = FileStorage-sd-restore-10

    Device = FileStorage-sd-restore-11

    Device = FileStorage-sd-restore-12

    Device = FileStorage-sd-restore-13

    Device = FileStorage-sd-restore-14

    Device = FileStorage-sd-restore-15

    Device = FileStorage-sd-restore-16

    Device = FileStorage-sd-restore-17

    Device = FileStorage-sd-restore-18

    Device = FileStorage-sd-restore-19

    Device = FileStorage-sd-restore-20

 

}

 

Backup Drives like this:

 

Device {

  Name = FileStorage-sd-0 # Add a hyphen to SD/autochanger name & match with drive index

  Device Type = File

  Media Type = File #unique to each archive device path, different path, different mediatype

  Archive Device = /bacula/data01

  AutomaticMount = yes

  AlwaysOpen = yes

  RemovableMedia = yes

  Autochanger = yes

  Drive Index = 0

  Maximum Concurrent Jobs = 1

  Volume Poll Interval = 5

  LabelMedia = yes

  Spool Directory = /bacula/spool01

  Autoselect = yes

  Maximum Network Buffer Size = 65536

}

 

… 18 more…

 

Device {

  Name = FileStorage-sd-20 # Add a hyphen to SD/autochanger name & match with drive index

  Device Type = File

  Media Type = File #unique to each archive device path, different path, different mediatype

  Archive Device = /bacula/data01

  AutomaticMount = yes

  AlwaysOpen = yes

  RemovableMedia = yes

  Autochanger = yes

  Drive Index = 20

  Maximum Concurrent Jobs = 1

  Volume Poll Interval = 5

  LabelMedia = yes

  Spool Directory = /bacula/spool01

  Autoselect = yes

  Maximum Network Buffer Size = 65536

}
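
The backup Device resources shown above differ only in Name and Drive Index. As a sketch, blocks of this shape can be generated with a small template. The helper name is made up, and it is an assumption that the devices not shown follow exactly the same pattern as the two that are.

```python
# Template mirroring the Device resources shown in this post.
# Doubled braces are literal braces for str.format().
DEVICE_TEMPLATE = """Device {{
  Name = {name}
  Device Type = File
  Media Type = File
  Archive Device = /bacula/data01
  AutomaticMount = yes
  AlwaysOpen = yes
  RemovableMedia = yes
  Autochanger = yes
  Drive Index = {index}
  Maximum Concurrent Jobs = 1
  Volume Poll Interval = 5
  LabelMedia = yes
  Spool Directory = /bacula/spool01
  Autoselect = {autoselect}
  Maximum Network Buffer Size = 65536
}}
"""


def device_blocks(prefix, indexes, autoselect=True):
    """Return one Device resource string per drive index.

    The device name is "<prefix>-<index>" so that it matches the
    Device entries listed in the Autochanger resource.
    """
    flag = "yes" if autoselect else "no"
    return [
        DEVICE_TEMPLATE.format(name=f"{prefix}-{i}", index=i, autoselect=flag)
        for i in indexes
    ]


if __name__ == "__main__":
    # Indexes 0 through 20, as listed in the Autochanger resource.
    print("\n".join(device_blocks("FileStorage-sd", range(21))))
```

The restore drives would use the same call with the prefix "FileStorage-sd-restore" and autoselect=False.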

 

Restore Drives like this:

 

Device {

  Name = FileStorage-sd-restore-0 # Add a hyphen to SD/autochanger name & match with drive index

  Device Type = File

  Media Type = File #unique to each archive device path, different path, different mediatype

  Archive Device = /bacula/data01

  AutomaticMount = yes

  AlwaysOpen = yes

  RemovableMedia = yes

  Autochanger = yes

  Drive Index = 0

  Maximum Concurrent Jobs = 1

  Volume Poll Interval = 5

  LabelMedia = yes

  Spool Directory = /bacula/spool01

  Autoselect = no

  Maximum Network Buffer Size = 65536

}

 

Any idea what’s causing the bacula-sd crash? How can we debug this further?

 

Regards,

Robert

 
