Bacula-users

[Bacula-users] Storage is stuck at "Device is BLOCKED waiting to create a volume"

2017-03-29 10:44:02
Subject: [Bacula-users] Storage is stuck at "Device is BLOCKED waiting to create a volume"
From: Zdeněk Bělehrádek <zdenek.belehradek AT economia DOT cz>
To: bacula-users AT lists.sourceforge DOT net
Date: Wed, 29 Mar 2017 16:42:22 +0200

Hi,

We are using Bacula to back up our company's data. All storage servers
are ordinary Debian Jessie Linux machines with spinning disks; we don't
use tapes. The Bacula versions are 7.0.5+dfsg-4~bpo80+1 and
7.4.3+dfsg-1+sid1~bpo8+1 (we tried both).

We need two copies of each backup, placed in separate datacenters, so we
run periodic Copy jobs to mirror data between storages. We want to write
each backup to an odd-numbered storage and then copy it to an
even-numbered storage.

Our current configuration suffers from occasional deadlocks when Bacula
tries to read from and write to a single storage. I suspected this was
caused by a mistake in our config, where several storages have the same
Media Type (as documented at
http://www.bacula.org/7.4.x-manuals/en/main/Migration_Copy.html#SECTION002830000000000000000
).

For this reason we decided to create a new config in which every storage
has a Media Type different from every other. When I tested this new
config in a testing environment, jobs got stuck and never finished.
status storage=bacst2-stor showed:

    Device is BLOCKED waiting to create a volume for:
       Pool:        zdenek-test-pp_old-full-pool-mirror
       Media type:  File-storspec-mirror
    Available Space=5.323 GB

and never made any progress; the device is unusable for all jobs (they
simply wait). I tried to mount and label a new volume, but it didn't make
any difference. The only thing that helps is restarting the storage
daemon, which makes the stuck job fail.
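
To catch the moment the device gets stuck, the status output can be checked for the BLOCKED line. This is only a minimal sketch: the is_blocked helper and the commented-out polling loop are hypothetical, and the loop assumes the console is invoked as bacula-console, as elsewhere in this mail.

```shell
#!/bin/sh
# Hypothetical helper: exit 0 if the status text on stdin
# reports a blocked device, non-zero otherwise.
is_blocked() {
    grep -q 'Device is BLOCKED'
}

# Hypothetical polling loop, left commented out so the sketch has no side effects:
# while sleep 30; do
#     echo "status storage=bacst2-stor" | bacula-console | is_blocked \
#         && echo "bacst2-stor is blocked at $(date)"
# done

# Self-check against the status text quoted above.
sample='Device is BLOCKED waiting to create a volume for:
   Pool:        zdenek-test-pp_old-full-pool-mirror
   Media type:  File-storspec-mirror'
if printf '%s\n' "$sample" | is_blocked; then
    echo "blocked"
fi
```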

An strace of the storage daemon on bacst2 revealed that the director
connects to it, the two authenticate to each other, and the storage sends
"\0\0\0\0223000 OK Hello 305\n" to the director. The storage then reads
from the socket and never gets any reply; the thread just blocks in the
read() syscall indefinitely.
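
A note on reading that strace string: it looks like a length-prefixed frame. Assuming each message on the wire is preceded by a 4-byte big-endian length (an assumption based on the bytes themselves, not checked against the Bacula source), the leading \0\0\0\022 is the length and the payload starts at "3000 OK Hello". A rough sketch that checks the arithmetic:

```shell
#!/bin/sh
# Bytes exactly as strace printed them (copied from the output above):
# four prefix bytes followed by the text payload.
frame='\0\0\0\0223000 OK Hello 305\n'

# ASSUMPTION: the first four bytes are a big-endian (network order) length.
# od prints them as unsigned decimal values, one per byte.
set -- $(printf "$frame" | od -An -tu1 -N4)
len=$(( ($1 << 24) | ($2 << 16) | ($3 << 8) | $4 ))

# Count the bytes that follow the 4-byte prefix.
payload_len=$(printf "$frame" | tail -c +5 | wc -c | tr -d ' ')

echo "declared length: $len"
echo "payload length:  $payload_len"
```

Both values come out as 18, which matches "3000 OK Hello 305\n" and is consistent with the frame being complete rather than truncated.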

The strace of the director confirms this: a thread connects to the
storage, authenticates, reads the Hello and then never replies. Instead
it opens communication with bacst1 and starts sending commands. Even
after several minutes (test backups are several KB in size and usually
finish in a few seconds) the network socket to bacst2 is still open and
no communication is taking place.

I verified this with tcpdump and there's nothing suspicious: the
connection works normally, and the last packet sent is the Hello message
described above. Communication on that four-tuple then simply stops;
neither side sends anything and the connection is never closed.
There is no firewall or NAT between the servers; they are connected to
a single internal network.

I also tried upgrading our 7.0 install to the latest 7.4 from Debian;
the results are exactly the same.

Configuration and strace output are at:
https://drive.google.com/file/d/0B4bjslETcBa-ZHVkOHU4dlZCZ2s/view?usp=sharing

I can reliably replicate the issue by running (on director):

for i in $(seq 1 2) ; do
    for job in bacst1_storage-job --bacst1_storage-incremental-job-mirror \
            --bacst1_storage-full-job-mirror bacdir1_director-job \
            --bacdir1_director-incremental-job-mirror \
            --bacdir1_director-full-job-mirror ; do
        echo "run job=$job yes" | bacula-console
    done
done

Is this a known problem? Is there any workaround?
