Hello,
We were using the 2.4.3 version of Bacula on Debian Etch for a long
time and two days ago upgraded to the 2.4.4 version.
Since then we encountered now the second day a problem with the
automatic volume recycling.
We use a backup to disk and have defined a pool as following so that
every job ends up on its separate volume:
Pool {
Name = DiskPool
Pool Type = Backup
Recycle = yes
AutoPrune = yes
Volume Retention = 40 days
Maximum Volume Jobs = 1
Label Format = BAC-
}
The default job-definition which is used by all our clients is then
using this DiskPool.
While this was working before, since upgrading to 2.4.4 suddendly all
scheduled jobs seem to end up on the same volume. So this is the status
of the director:
bacula-srv-dir Version: 2.4.4 (28 December 2008) i486-pc-linux-gnu
debian 4.0
Daemon started 26-Jan-09 16:52, 54 Jobs run since started.
Heap: heap=13,135,872 smbytes=750,907 max_bytes=857,133 bufs=4,950
max_bufs=5,224
Scheduled Jobs:
Level Type Pri Scheduled Name
Volume
===================================================================================
Incremental Backup 10 27-Jan-09 20:05 job_srv1 BAC-1280
Incremental Backup 10 27-Jan-09 20:05 job_srv2 BAC-1280
Incremental Backup 10 27-Jan-09 20:05 job_srv3 BAC-1280
Incremental Backup 10 27-Jan-09 20:05 job_srv4 BAC-1280
Incremental Backup 10 27-Jan-09 20:05 job_srv5 BAC-1280
Incremental Backup 10 27-Jan-09 20:05 job_srv6 BAC-1280
Incremental Backup 10 27-Jan-09 20:05 job_srv7 BAC-1280
...
As you can see, all jobs from now run into volume BAC-1280.
Furthermore, SD starts to block, when to many jobs are waiting for the
volume.
Running Jobs:
JobId Level Name Status
======================================================================
9449 Increme job_srv1.2009-01-27_04.05.00.46 is waiting on Storage
File
9450 Increme job_srv2.2009-01-27_04.05.00.47 is waiting on Storage
File
9451 Increme job_srv3.2009-01-27_04.05.00.48 is waiting on Storage
File
9452 Increme job_srv4.2009-01-27_04.05.00.49 is waiting on Storage
File
9453 Increme job_srv5.2009-01-27_04.05.00.50 is waiting on Storage
File
...
Further more jobs are then waiting "on max Storage jobs" (as it is
currently limited to 20 concurrent jobs).
When I strace bacula-sd, it just hangs arond:
child_stack=0xb690b4c4,
flags=CLONE_VM|CLONE_FS|CLONE_FILES|CLONE_SIGHAND|CLONE_THREAD|CLONE_SYSVSEM|CLONE_SETTLS|CLONE_PARENT_SETTID|CLONE_CHILD_CLEARTID|CLONE_DETACHED,
parent_tidptr=0xb690bbf8, {entry_number:6, base_addr:0xb690bbb0,
limit:1048575, seg_32bit:1, contents:0, read_exec_only:0,
limit_in_pages:1, seg_not_present:0, useable:1},
child_tidptr=0xb690bbf8) = 5400
[pid 8762] select(5, [4], NULL, NULL, NULL <unfinished ...>
[pid 5400] gettimeofday({1233041312, 545094}, {4294967236, 0}) = 0
[pid 5400] read(24, "", 4) = 0
[pid 5400] time(NULL) = 1233041312
[pid 5400] stat64("/etc/localtime", {st_mode=S_IFREG|0644, st_size=801,
...}) = 0
[pid 5400] close(24) = 0
[pid 5400] gettimeofday({1233041312, 545604}, {4294967236, 0}) = 0
[pid 5400] clock_gettime(CLOCK_REALTIME, {1233041312, 545703646}) = 0
[pid 5400] futex(0x80c5d3c, FUTEX_WAIT, 427, {1, 454296354} <unfinished
...>
[pid 8765] <... futex resumed> ) = -1 ETIMEDOUT (Connection timed
out)
[pid 8765] futex(0x80c6b8c, FUTEX_WAKE, 1) = 0
[pid 8765] time(NULL) = 1233041313
[pid 8765] gettimeofday({1233041313, 691712}, {4294967236, 0}) = 0
[pid 8765] time(NULL) = 1233041313
[pid 8765] clock_gettime(CLOCK_REALTIME, {1233041313, 691982419}) = 0
[pid 8765] futex(0x80c5ac4, FUTEX_WAIT, 3783, {29, 999729581}
<unfinished ...>
[pid 5400] <... futex resumed> ) = -1 ETIMEDOUT (Connection timed
out)
[pid 5400] futex(0x80c5d20, FUTEX_WAKE, 1) = 0
[pid 5400] _exit(0) = ?
Process 5400 detached
[pid 8765] <... futex resumed> ) = -1 ETIMEDOUT (Connection timed
out)
[pid 8765] futex(0x80c6b8c, FUTEX_WAKE, 1) = 0
[pid 8765] time(NULL) = 1233041343
[pid 8765] gettimeofday({1233041343, 699190}, {4294967236, 0}) = 0
[pid 8765] time(NULL) = 1233041343
[pid 8765] clock_gettime(CLOCK_REALTIME, {1233041343, 699274056}) = 0
[pid 8765] futex(0x80c5ac4, FUTEX_WAIT, 3785, {29, 999915944}) = -1
ETIMEDOUT (Connection timed out)
[pid 8765] futex(0x80c6b8c, FUTEX_WAKE, 1) = 0
[pid 8765] time(NULL) = 1233041373
[pid 8765] gettimeofday({1233041373, 703431}, {4294967236, 0}) = 0
[pid 8765] time(NULL) = 1233041373
[pid 8765] clock_gettime(CLOCK_REALTIME, {1233041373, 703683524}) = 0
[pid 8765] futex(0x80c5ac4, FUTEX_WAIT, 3787, {29, 999747476}) = -1
ETIMEDOUT (Connection timed out)
[pid 8765] futex(0x80c6b8c, FUTEX_WAKE, 1) = 0
It has many open connections from bacula-dir and just two connection to
clients, which are already in a CLOSED_WAIT state.
bacula-sd 8762 bacula 4u IPv4 502281 TCP
bacula-srv:bacula-sd (LISTEN)
bacula-sd 8762 bacula 5u IPv4 1230722 TCP
bacula-srv:bacula-sd->bacula-srv:49618 (CLOSE_WAIT)
bacula-sd 8762 bacula 6u IPv4 1230056 TCP
bacula-srv:bacula-sd->bacula-srv:49615 (ESTABLISHED)
bacula-sd 8762 bacula 7u IPv4 1229112 TCP
bacula-srv:bacula-sd->bacula-srv:49608 (ESTABLISHED)
bacula-sd 8762 bacula 8u IPv4 1230581 TCP
bacula-srv:bacula-sd->bacula-srv:49617 (ESTABLISHED)
bacula-sd 8762 bacula 9u IPv4 1231672 TCP
bacula-srv:bacula-sd->bacula-srv:49627 (ESTABLISHED)
bacula-sd 8762 bacula 10u IPv4 1229191 TCP
bacula-srv:bacula-sd->bacula-srv:49612 (ESTABLISHED)
bacula-sd 8762 bacula 11u sock 0,5 1231512 can't
identify protocol
bacula-sd 8762 bacula 12u IPv4 1229168 TCP
bacula-srv:bacula-sd->bacula-srv:49611 (ESTABLISHED)
bacula-sd 8762 bacula 13u IPv4 1230075 TCP
bacula-srv:bacula-sd->bacula-srv:49616 (ESTABLISHED)
bacula-sd 8762 bacula 14u IPv4 1231692 TCP
bacula-srv:bacula-sd->bacula-srv:49628 (ESTABLISHED)
bacula-sd 8762 bacula 15u IPv4 1229203 TCP
bacula-srv:bacula-sd->bacula-srv:49613 (ESTABLISHED)
bacula-sd 8762 bacula 16u IPv4 1474050 TCP
bacula-srv:bacula-sd->bacula-srv:44644 (ESTABLISHED)
bacula-sd 8762 bacula 17u IPv4 1229603 TCP
bacula-srv:bacula-sd->bacula-srv:49614 (ESTABLISHED)
bacula-sd 8762 bacula 18u IPv4 1474101 TCP
bacula-srv:bacula-sd->bacula-srv:44646 (ESTABLISHED)
bacula-sd 8762 bacula 19u IPv4 1231538 TCP
bacula-srv:bacula-sd->bacula-srv:49626 (ESTABLISHED)
bacula-sd 8762 bacula 20u IPv4 1228175 TCP
bacula-srv:bacula-sd->bacula-srv:49601 (ESTABLISHED)
bacula-sd 8762 bacula 21u IPv4 1231523 TCP
bacula-srv:bacula-sd->srv1:51353 (CLOSE_WAIT)
bacula-sd 8762 bacula 22u IPv4 1231524 TCP
bacula-srv:bacula-sd->srv2:56544 (CLOSE_WAIT)
bacula-sd 8762 bacula 31u IPv4 1228181 TCP
bacula-srv:bacula-sd->bacula-srv:49602 (ESTABLISHED)
bacula-sd 8762 bacula 32u IPv4 1228182 TCP
bacula-srv:bacula-sd->bacula-srv:49603 (ESTABLISHED)
bacula-sd 8762 bacula 33u IPv4 1228184 TCP
bacula-srv:bacula-sd->bacula-srv:49604 (ESTABLISHED)
bacula-sd 8762 bacula 34u IPv4 1228187 TCP
bacula-srv:bacula-sd->bacula-srv:49605 (ESTABLISHED)
bacula-sd 8762 bacula 35u IPv4 1228190 TCP
bacula-srv:bacula-sd->bacula-srv:49606 (ESTABLISHED)
bacula-sd 8762 bacula 37u IPv4 1228193 TCP
bacula-srv:bacula-sd->bacula-srv:49607 (ESTABLISHED)
Is it possible, that those dead-lock fixes introduced in 2.4.4 brought
up some other problems?
Regards,
Andreas
------------------------------------------------------------------------------
This SF.net email is sponsored by:
SourcForge Community
SourceForge wants to tell your story.
http://p.sf.net/sfu/sf-spreadtheword
_______________________________________________
Bacula-users mailing list
Bacula-users AT lists.sourceforge DOT net
https://lists.sourceforge.net/lists/listinfo/bacula-users
|