Bacula-users

[Bacula-users] 2.4.4 brings up problems with automatic volume recycling

2009-01-27 03:37:44
Subject: [Bacula-users] 2.4.4 brings up problems with automatic volume recycling
From: Andreas Unterkircher <unki AT netshadow DOT at>
To: bacula-users AT lists.sourceforge DOT net
Date: Tue, 27 Jan 2009 09:18:11 +0100
Hello,

We were using the 2.4.3 version of Bacula on Debian Etch for a long
time and two days ago upgraded to the 2.4.4 version.

Since then we encountered now the second day a problem with the
automatic volume recycling.

We use a backup to disk and have defined a pool as following so that
every job ends up on its separate volume:

Pool { 
  Name = DiskPool
  Pool Type = Backup
  Recycle = yes
  AutoPrune = yes
  Volume Retention = 40 days
  Maximum Volume Jobs = 1
  Label Format = BAC-
}

The default job-definition which is used by all our clients is then
using this DiskPool.

While this was working before, since upgrading to 2.4.4 suddendly all
scheduled jobs seem to end up on the same volume. So this is the status
of the director:



bacula-srv-dir Version: 2.4.4 (28 December 2008) i486-pc-linux-gnu
debian 4.0
Daemon started 26-Jan-09 16:52, 54 Jobs run since started.
 Heap: heap=13,135,872 smbytes=750,907 max_bytes=857,133 bufs=4,950
max_bufs=5,224

Scheduled Jobs:
Level          Type     Pri  Scheduled          Name
Volume
===================================================================================
Incremental    Backup    10  27-Jan-09 20:05    job_srv1 BAC-1280
Incremental    Backup    10  27-Jan-09 20:05    job_srv2 BAC-1280
Incremental    Backup    10  27-Jan-09 20:05    job_srv3 BAC-1280
Incremental    Backup    10  27-Jan-09 20:05    job_srv4 BAC-1280
Incremental    Backup    10  27-Jan-09 20:05    job_srv5 BAC-1280
Incremental    Backup    10  27-Jan-09 20:05    job_srv6 BAC-1280
Incremental    Backup    10  27-Jan-09 20:05    job_srv7 BAC-1280
...




As you can see, all jobs from now run into volume BAC-1280.

Furthermore, SD starts to block, when to many jobs are waiting for the
volume.

Running Jobs:
 JobId Level   Name                       Status
======================================================================
  9449 Increme  job_srv1.2009-01-27_04.05.00.46 is waiting on Storage
File
  9450 Increme  job_srv2.2009-01-27_04.05.00.47 is waiting on Storage
File
  9451 Increme  job_srv3.2009-01-27_04.05.00.48 is waiting on Storage
File
  9452 Increme  job_srv4.2009-01-27_04.05.00.49 is waiting on Storage
File
  9453 Increme  job_srv5.2009-01-27_04.05.00.50 is waiting on Storage
File
...

Further more jobs are then waiting "on max Storage jobs" (as it is
currently limited to 20 concurrent jobs).



When I strace bacula-sd, it just hangs arond:

child_stack=0xb690b4c4,
flags=CLONE_VM|CLONE_FS|CLONE_FILES|CLONE_SIGHAND|CLONE_THREAD|CLONE_SYSVSEM|CLONE_SETTLS|CLONE_PARENT_SETTID|CLONE_CHILD_CLEARTID|CLONE_DETACHED,
parent_tidptr=0xb690bbf8, {entry_number:6, base_addr:0xb690bbb0,
limit:1048575, seg_32bit:1, contents:0, read_exec_only:0,
limit_in_pages:1, seg_not_present:0, useable:1},
child_tidptr=0xb690bbf8) = 5400
[pid  8762] select(5, [4], NULL, NULL, NULL <unfinished ...>
[pid  5400] gettimeofday({1233041312, 545094}, {4294967236, 0}) = 0
[pid  5400] read(24, "", 4)             = 0
[pid  5400] time(NULL)                  = 1233041312
[pid  5400] stat64("/etc/localtime", {st_mode=S_IFREG|0644, st_size=801,
...}) = 0
[pid  5400] close(24)                   = 0
[pid  5400] gettimeofday({1233041312, 545604}, {4294967236, 0}) = 0
[pid  5400] clock_gettime(CLOCK_REALTIME, {1233041312, 545703646}) = 0
[pid  5400] futex(0x80c5d3c, FUTEX_WAIT, 427, {1, 454296354} <unfinished
...>
[pid  8765] <... futex resumed> )       = -1 ETIMEDOUT (Connection timed
out)
[pid  8765] futex(0x80c6b8c, FUTEX_WAKE, 1) = 0
[pid  8765] time(NULL)                  = 1233041313
[pid  8765] gettimeofday({1233041313, 691712}, {4294967236, 0}) = 0
[pid  8765] time(NULL)                  = 1233041313
[pid  8765] clock_gettime(CLOCK_REALTIME, {1233041313, 691982419}) = 0
[pid  8765] futex(0x80c5ac4, FUTEX_WAIT, 3783, {29, 999729581}
<unfinished ...>
[pid  5400] <... futex resumed> )       = -1 ETIMEDOUT (Connection timed
out)
[pid  5400] futex(0x80c5d20, FUTEX_WAKE, 1) = 0
[pid  5400] _exit(0)                    = ?
Process 5400 detached
[pid  8765] <... futex resumed> )       = -1 ETIMEDOUT (Connection timed
out)
[pid  8765] futex(0x80c6b8c, FUTEX_WAKE, 1) = 0
[pid  8765] time(NULL)                  = 1233041343
[pid  8765] gettimeofday({1233041343, 699190}, {4294967236, 0}) = 0
[pid  8765] time(NULL)                  = 1233041343
[pid  8765] clock_gettime(CLOCK_REALTIME, {1233041343, 699274056}) = 0
[pid  8765] futex(0x80c5ac4, FUTEX_WAIT, 3785, {29, 999915944}) = -1
ETIMEDOUT (Connection timed out)
[pid  8765] futex(0x80c6b8c, FUTEX_WAKE, 1) = 0
[pid  8765] time(NULL)                  = 1233041373
[pid  8765] gettimeofday({1233041373, 703431}, {4294967236, 0}) = 0
[pid  8765] time(NULL)                  = 1233041373
[pid  8765] clock_gettime(CLOCK_REALTIME, {1233041373, 703683524}) = 0
[pid  8765] futex(0x80c5ac4, FUTEX_WAIT, 3787, {29, 999747476}) = -1
ETIMEDOUT (Connection timed out)
[pid  8765] futex(0x80c6b8c, FUTEX_WAKE, 1) = 0


It has many open connections from bacula-dir and just two connection to
clients, which are already in a CLOSED_WAIT state.


bacula-sd 8762 bacula    4u  IPv4  502281              TCP
bacula-srv:bacula-sd (LISTEN)
bacula-sd 8762 bacula    5u  IPv4 1230722              TCP
bacula-srv:bacula-sd->bacula-srv:49618 (CLOSE_WAIT)
bacula-sd 8762 bacula    6u  IPv4 1230056              TCP
bacula-srv:bacula-sd->bacula-srv:49615 (ESTABLISHED)
bacula-sd 8762 bacula    7u  IPv4 1229112              TCP
bacula-srv:bacula-sd->bacula-srv:49608 (ESTABLISHED)
bacula-sd 8762 bacula    8u  IPv4 1230581              TCP
bacula-srv:bacula-sd->bacula-srv:49617 (ESTABLISHED)
bacula-sd 8762 bacula    9u  IPv4 1231672              TCP
bacula-srv:bacula-sd->bacula-srv:49627 (ESTABLISHED)
bacula-sd 8762 bacula   10u  IPv4 1229191              TCP
bacula-srv:bacula-sd->bacula-srv:49612 (ESTABLISHED)
bacula-sd 8762 bacula   11u  sock     0,5          1231512 can't
identify protocol
bacula-sd 8762 bacula   12u  IPv4 1229168              TCP
bacula-srv:bacula-sd->bacula-srv:49611 (ESTABLISHED)
bacula-sd 8762 bacula   13u  IPv4 1230075              TCP
bacula-srv:bacula-sd->bacula-srv:49616 (ESTABLISHED)
bacula-sd 8762 bacula   14u  IPv4 1231692              TCP
bacula-srv:bacula-sd->bacula-srv:49628 (ESTABLISHED)
bacula-sd 8762 bacula   15u  IPv4 1229203              TCP
bacula-srv:bacula-sd->bacula-srv:49613 (ESTABLISHED)
bacula-sd 8762 bacula   16u  IPv4 1474050              TCP
bacula-srv:bacula-sd->bacula-srv:44644 (ESTABLISHED)
bacula-sd 8762 bacula   17u  IPv4 1229603              TCP
bacula-srv:bacula-sd->bacula-srv:49614 (ESTABLISHED)
bacula-sd 8762 bacula   18u  IPv4 1474101              TCP
bacula-srv:bacula-sd->bacula-srv:44646 (ESTABLISHED)
bacula-sd 8762 bacula   19u  IPv4 1231538              TCP
bacula-srv:bacula-sd->bacula-srv:49626 (ESTABLISHED)
bacula-sd 8762 bacula   20u  IPv4 1228175              TCP
bacula-srv:bacula-sd->bacula-srv:49601 (ESTABLISHED)
bacula-sd 8762 bacula   21u  IPv4 1231523              TCP
bacula-srv:bacula-sd->srv1:51353 (CLOSE_WAIT)
bacula-sd 8762 bacula   22u  IPv4 1231524              TCP
bacula-srv:bacula-sd->srv2:56544 (CLOSE_WAIT)
bacula-sd 8762 bacula   31u  IPv4 1228181              TCP
bacula-srv:bacula-sd->bacula-srv:49602 (ESTABLISHED)
bacula-sd 8762 bacula   32u  IPv4 1228182              TCP
bacula-srv:bacula-sd->bacula-srv:49603 (ESTABLISHED)
bacula-sd 8762 bacula   33u  IPv4 1228184              TCP
bacula-srv:bacula-sd->bacula-srv:49604 (ESTABLISHED)
bacula-sd 8762 bacula   34u  IPv4 1228187              TCP
bacula-srv:bacula-sd->bacula-srv:49605 (ESTABLISHED)
bacula-sd 8762 bacula   35u  IPv4 1228190              TCP
bacula-srv:bacula-sd->bacula-srv:49606 (ESTABLISHED)
bacula-sd 8762 bacula   37u  IPv4 1228193              TCP
bacula-srv:bacula-sd->bacula-srv:49607 (ESTABLISHED)


Is it possible, that those dead-lock fixes introduced in 2.4.4 brought
up some other problems?

Regards,
Andreas

------------------------------------------------------------------------------
This SF.net email is sponsored by:
SourcForge Community
SourceForge wants to tell your story.
http://p.sf.net/sfu/sf-spreadtheword
_______________________________________________
Bacula-users mailing list
Bacula-users AT lists.sourceforge DOT net
https://lists.sourceforge.net/lists/listinfo/bacula-users

<Prev in Thread] Current Thread [Next in Thread>