Bacula-users

Re: [Bacula-users] Storage is stuck at "Device is BLOCKED waiting to create a volume"

2017-04-03 12:52:03
Subject: Re: [Bacula-users] Storage is stuck at "Device is BLOCKED waiting to create a volume"
From: Kern Sibbald <kern AT sibbald DOT com>
To: Zdeněk Bělehrádek <zdenek.belehradek AT economia DOT cz>, bacula-users AT lists.sourceforge DOT net
Date: Mon, 3 Apr 2017 18:51:03 +0200
Hello,

The error you are getting should never happen, which means that 
something is seriously wrong with your Bacula installation.  A few of 
the multiple possibilities are:

1. Your DIR and SDs are not on the same version.  They *must* all be the 
same. With the little information you provided, for the moment this 
seems to be the most likely problem.

2. Your catalog is damaged.

3. Your Retention periods are too short and records are being removed 
from the catalog.

4. You have manually modified your catalog, so that now the records are 
not consistent.

5. Your catalog does not correspond to the Bacula Director version you 
are running.  This should be detected, but perhaps the catalog was later 
manually modified.

6. Someone manually, or some program, is removing Volume records from the 
catalog or changing them (this point is probably a duplicate of point 4).
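
As a rough illustration of point 1, a small Python sketch such as the
following could flag daemons whose version differs from the Director's. It
assumes the package version of each daemon has already been collected by
hand. The helper names (upstream_version, mismatched_daemons), the daemon
labels and the assignment of versions to hosts are hypothetical; only the
two version strings come from this thread.

```python
def upstream_version(package_version):
    """Strip the Debian packaging suffix, e.g. '7.4.3+dfsg-1+sid1~bpo8+1' -> '7.4.3'.

    Assumption: the upstream version is everything before the first '+'.
    """
    return package_version.split("+", 1)[0]


def mismatched_daemons(versions, reference="director"):
    """Return {daemon: upstream version} for daemons that differ from the reference."""
    expected = upstream_version(versions[reference])
    return {
        name: upstream_version(version)
        for name, version in versions.items()
        if upstream_version(version) != expected
    }


# Hypothetical inventory: which host runs which version is made up here.
reported = {
    "director": "7.4.3+dfsg-1+sid1~bpo8+1",
    "bacst1-sd": "7.4.3+dfsg-1+sid1~bpo8+1",
    "bacst2-sd": "7.0.5+dfsg-4~bpo80+1",
}

for name, version in mismatched_daemons(reported).items():
    print(f"{name} runs {version}, Director runs {upstream_version(reported['director'])}")
```

An empty result only means the version strings agree; it says nothing about
the state of the catalog (points 2 to 6).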

Best regards,

Kern
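
For the getvolinfo error quoted further down in this thread, a short Python
sketch like this one could list which volumes the Storage daemon rejected
and with what catalog status. The regular expression is modelled only on
the single trace line quoted in this thread, so the wording in other Bacula
versions is an assumption, and find_status_errors is a hypothetical helper
name.

```python
import re

# Modelled on the one trace message quoted in this thread (assumption:
# other versions may phrase the 1998 error differently).
STATUS_ERROR = re.compile(
    r'1998 Volume "(?P<volume>[^"]+)" catalog status is (?P<status>\w+), '
    r'but should be (?P<expected>[^.]+)\.'
)


def find_status_errors(trace_text):
    """Return sorted, de-duplicated (volume, status) pairs found in trace text."""
    # Long messages may be wrapped over several lines, so collapse all
    # whitespace runs to single spaces before matching.
    flat = " ".join(trace_text.split())
    return sorted({(m["volume"], m["status"]) for m in STATUS_ERROR.finditer(flat)})


sample = '''bacst1-sd: device.c:232-1 getvolinfo failed. No new Vol:
Error getting Volume info: 1998 Volume "bacst1_storage-full-vol-0001"
catalog status is Used, but should be Append, Purged or Recycle.'''

for volume, status in find_status_errors(sample):
    print(f"{volume}: catalog status {status}")
```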



On 04/03/2017 06:31 PM, Zdeněk Bělehrádek wrote:
> User optiz0r on IRC helped me get trace files for all daemons; they are at
> http://filebin.ca/3HoXMMcEo2rv/traces.tar.gz
>
> The configuration used may be slightly different (the only difference I can
> think of is setting Attribute Spooling = yes).
>
> We noticed the following errors:
> bacst1-sd.trace:bacst1-sd: device.c:232-1 getvolinfo failed. No new Vol:
> Error getting Volume info: 1998 Volume "bacst1_storage-full-vol-0001"
> catalog status is Used, but should be Append, Purged or Recycle.
>
> In this run, the error is reported for every volume except
> bacdir1_director-full-vol-0005, which is also the only volume whose
> status is not Used (it is Append). Maybe that is significant?
>
>
> On 29.3.2017 at 16:42, Zdeněk Bělehrádek wrote:
>> Hi,
>>
>> We are using Bacula to back up our company's data. All storages are
>> ordinary Debian Jessie Linux servers with spinning disks; we don't use
>> tapes. The Bacula versions are 7.0.5+dfsg-4~bpo80+1 and
>> 7.4.3+dfsg-1+sid1~bpo8+1 (we tried both).
>>
>> We need 2 copies of each backup placed in separate datacenters, so we
>> run periodic Copy jobs to mirror data between storages. We want to use
>> odd-numbered storages to make a backup, and then copy it to an
>> even-numbered storage.
>>
>> Our current configuration suffers from occasional deadlocks when Bacula
>> tries to read from and write to a single storage. I thought this was
>> probably caused by a mistake in the config, where storages have the same
>> Media Type (as documented at
>> http://www.bacula.org/7.4.x-manuals/en/main/Migration_Copy.html#SECTION002830000000000000000
>> ).
>>
>> For this reason we decided to create a new config where every storage has
>> a different Media Type from every other. When I tested this new config in
>> a testing environment, jobs got stuck and never finished.
>> status storage=bacst2-stor showed:
>>
>>      Device is BLOCKED waiting to create a volume for:
>>         Pool:        zdenek-test-pp_old-full-pool-mirror
>>         Media type:  File-storspec-mirror
>>      Available Space=5.323 GB
>>
>> and it never makes progress - the device is unusable for all jobs (they
>> simply wait). I tried to mount and label a new volume; it didn't make any
>> difference. The only thing that helps is to restart the storage daemon,
>> which makes the stuck job fail.
>>
>> Strace of the storage daemon on bacst2 revealed that the director connects
>> to it, both authenticate to each other and the storage sends "\0\0\0\0223000 OK
>> Hello 305\n" to the director. The storage then reads from the socket and never
>> gets any reply - the thread just blocks in the read() syscall indefinitely.
>>
>> Strace of the director confirms this - a thread connects to the storage,
>> authenticates, reads the Hello and then never replies. Instead it opens
>> communication with bacst1 and starts sending commands. Even after
>> several minutes (test backups are several KB in size and usually
>> finish in a few seconds) the network socket to bacst2 is still open and
>> no communication is taking place.
>>
>> I verified this with tcpdump and there's nothing suspicious - the
>> connection works normally, and the last packet sent is the Hello message
>> described above. Communication on that four-tuple then simply stops:
>> nobody sends anything, and the connection is never closed.
>> There is no firewall or NAT between the servers - they are connected to
>> a single internal network.
>>
>> I also tried upgrading our 7.0 install to the latest 7.4 from Debian;
>> the results are exactly the same.
>>
>> Configuration and strace output are at:
>> https://drive.google.com/file/d/0B4bjslETcBa-ZHVkOHU4dlZCZ2s/view?usp=sharing
>>
>> I can reliably replicate the issue by running (on director):
>>
>> for i in `seq 1 2` ; do
>> for job in bacst1_storage-job --bacst1_storage-incremental-job-mirror \
>> --bacst1_storage-full-job-mirror bacdir1_director-job \
>> --bacdir1_director-incremental-job-mirror \
>> --bacdir1_director-full-job-mirror ; do
>> echo "run job=$job yes" | bacula-console ; done ; done
>>
>> Is this a known problem? Is there any workaround?
>>
> ------------------------------------------------------------------------------
> Check out the vibrant tech community on one of the world's most
> engaging tech sites, Slashdot.org! http://sdm.link/slashdot
> _______________________________________________
> Bacula-users mailing list
> Bacula-users AT lists.sourceforge DOT net
> https://lists.sourceforge.net/lists/listinfo/bacula-users

