Re: [Bacula-users] Storage is stuck at "Device is BLOCKED waiting to create a volume"

2017-04-03 12:32:39
Subject: Re: [Bacula-users] Storage is stuck at "Device is BLOCKED waiting to create a volume"
From: Zdeněk Bělehrádek <zdenek.belehradek AT economia DOT cz>
To: bacula-users AT lists.sourceforge DOT net
Date: Mon, 3 Apr 2017 18:31:33 +0200
User optiz0r on IRC helped me get trace files for all daemons; they are at
http://filebin.ca/3HoXMMcEo2rv/traces.tar.gz

The configuration used may be slightly different (the only difference I can
think of is setting Attribute Spooling = yes).

We noticed the following errors:
bacst1-sd.trace:bacst1-sd: device.c:232-1 getvolinfo failed. No new Vol:
Error getting Volume info: 1998 Volume "bacst1_storage-full-vol-0001"
catalog status is Used, but should be Append, Purged or Recycle.

In this run, the error is reported for every volume except
bacdir1_director-full-vol-0005, which is also the only volume whose
status is not Used (it is Append). Maybe that is significant?
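
To check which volumes are affected, the error lines can be pulled out of
a trace file with something like the sketch below. It assumes each message
sits on a single line in the trace (it is wrapped only in this mail) and
that the wording matches the error quoted above; the function name
rejected_volumes is made up for illustration.

```shell
#!/bin/sh
# Sketch: print "<volume> <catalog status>" for every volume rejected
# with error 1998. Reads the named trace files, or stdin if none given.
# Assumption: one message per line, worded like the error quoted above.
rejected_volumes() {
    sed -n 's/.*1998 Volume "\([^"]*\)" catalog status is \([A-Za-z]*\),.*/\1 \2/p' "$@" |
        sort -u
}
```

Usage would be along the lines of: rejected_volumes bacst1-sd.trace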


On 29.3.2017 at 16:42, Zdeněk Bělehrádek wrote:
> Hi,
> 
> We are using Bacula to back up our company's data. All storages are
> ordinary Debian Jessie Linux servers with spinning disks; we don't use
> tapes. The Bacula versions are 7.0.5+dfsg-4~bpo80+1 and
> 7.4.3+dfsg-1+sid1~bpo8+1 (we tried both).
> 
> We need 2 copies of each backup placed in separate datacenters, so we
> run periodic Copy jobs to mirror data between storages. We want to use
> the odd-numbered storages to make a backup and then copy it to an
> even-numbered storage.
> 
> Our current configuration suffers from occasional deadlocks when Bacula
> tries to read from and write to a single storage. I thought this was
> probably caused by a mistake in the config, where storages have the same
> Media Type (as documented at
> http://www.bacula.org/7.4.x-manuals/en/main/Migration_Copy.html#SECTION002830000000000000000
> ).
> 
> For this reason we decided to create a new config where every storage has
> a different Media Type from every other. When I tested this new config in
> a testing environment, jobs got stuck and never finished.
> status storage=bacst2-stor showed:
> 
>     Device is BLOCKED waiting to create a volume for:
>        Pool:        zdenek-test-pp_old-full-pool-mirror
>        Media type:  File-storspec-mirror
>     Available Space=5.323 GB
> 
> and never made progress; the device is unusable for all jobs (they
> simply wait). I tried to mount and label a new volume, but it didn't make
> any difference. The only thing that helps is to restart the storage daemon,
> which makes the stuck job fail.
> 
> Strace of the storage daemon on bacst2 revealed that the director connects
> to it, both authenticate to each other, and the storage sends
> "\0\0\0\0223000 OK Hello 305\n" to the director. The storage then reads
> from the socket and never gets any reply; the thread just blocks in the
> read() syscall indefinitely.
> 
> Strace of the director confirms this: a thread connects to the storage,
> authenticates, reads the Hello and then never replies. Instead it opens
> communication with bacst1 and starts sending commands. Even after
> several minutes (test backups are several KB in size and usually
> finish in a few seconds) the network socket to bacst2 is still open and
> no communication is taking place.
> 
> I verified this with tcpdump and there's nothing suspicious: the
> connection works normally, and the last packet sent is the Hello message
> described above. Communication on that four-tuple then simply stops;
> nobody sends anything and the connection is never closed.
> There is no firewall or NAT between the servers; they are connected to
> a single internal network.
> 
> I also tried to upgrade our 7.0 install to the latest 7.4 from Debian;
> the results are exactly the same.
> 
> Configuration and strace output are at:
> https://drive.google.com/file/d/0B4bjslETcBa-ZHVkOHU4dlZCZ2s/view?usp=sharing
> 
> I can reliably replicate the issue by running (on the director):
> 
> for i in `seq 1 2` ; do
> for job in bacst1_storage-job --bacst1_storage-incremental-job-mirror \
> --bacst1_storage-full-job-mirror bacdir1_director-job \
> --bacdir1_director-incremental-job-mirror \
> --bacdir1_director-full-job-mirror ; do
> echo "run job=$job yes" | bacula-console ; done ; done
> 
> Is this a known problem? Is there any workaround?
> 
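
A note on reading the strace excerpt quoted above: as far as I understand
the Bacula network protocol, every message is preceded by a 4-byte length
in network byte order, so "\0\0\0\022" is not garbage but the length prefix
of "3000 OK Hello 305\n" (strace prints the bytes as octal escapes). The
sketch below only checks that arithmetic; the variable names are made up
for illustration.

```shell
#!/bin/sh
# Sketch: compare the length prefix from the strace output with the
# actual payload length. Assumption: the prefix is a 4-byte big-endian
# integer whose first three bytes are zero, so only \022 matters.

# printf treats a leading 0 as octal, so 022 is converted to decimal.
prefix=$(printf '%d' 022)

# Payload as sent by the storage daemon, trailing newline included.
payload=$(printf '3000 OK Hello 305\n' | wc -c | tr -d '[:space:]')

echo "prefix=$prefix payload=$payload"
```

Both values come out as 18, which suggests the Hello message itself is
complete and the stall happens after it.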
