Re: [Bacula-users] Storage is stuck at "Device is BLOCKED waiting to create a volume"

2017-04-10 06:57:29
Subject: Re: [Bacula-users] Storage is stuck at "Device is BLOCKED waiting to create a volume"
From: Zdeněk Bělehrádek <zdenek.belehradek AT economia DOT cz>
To: bacula-users AT lists.sourceforge DOT net
Date: Mon, 10 Apr 2017 12:28:15 +0200
Hi,

1. It is 5 concurrently started copies of:
 a) 2 backup jobs
 b) 2 Copy jobs that copy Full backups from a)
 c) 2 Copy jobs that copy Incremental backups from a)
I can sometimes replicate the problem with just 2 copies of the above,
but with 5 copies it has been about 90 % reliable. The problem doesn't
occur every time, only when there are a lot of jobs at once.
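
A minimal sketch of how those 5 rounds could be queued. It assumes the six
job names from the loop quoted further down, with the leading "--" on its
wrapped lines treated as a mail-wrapping artifact (an assumption), and it
only prints the console commands instead of running them:

```shell
#!/bin/sh
# Sketch: print the console commands for N rounds of the six test jobs.
# Job names come from the quoted loop; dropping the leading "--" seen on
# its wrapped lines is an assumption.
JOBS="bacst1_storage-job
bacst1_storage-incremental-job-mirror
bacst1_storage-full-job-mirror
bacdir1_director-job
bacdir1_director-incremental-job-mirror
bacdir1_director-full-job-mirror"

# queue_commands ROUNDS: emit one "run job=... yes" line per job per round.
queue_commands() {
    rounds=$1
    i=1
    while [ "$i" -le "$rounds" ]; do
        for job in $JOBS; do
            echo "run job=$job yes"
        done
        i=$((i + 1))
    done
}

queue_commands 5
```

To stay closest to the quoted loop, which starts one console per command,
the output can be fed line by line:
queue_commands 5 | while read -r line; do echo "$line" | bacula-console; done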

2. These commands instruct your Docker daemon to download, but not run,
my containers from Docker Hub:

docker pull lyco/debug-bacdb:2017-04-07
docker pull lyco/debug-bacdir:2017-04-07
docker pull lyco/debug-bacst1:2017-04-07
docker pull lyco/debug-bacst:2017-04-07

You can skip this and just run them as described in point 4, because
the Docker daemon will download them for you.

Of course you have to install the docker package first; a detailed guide
is available, e.g., at
https://docs.docker.com/engine/installation/linux/ubuntu/#install-using-the-repository

3. The images are 150 to 250 MB each. The containers should require only
a few megabytes of RAM, except the bacdb container: PostgreSQL in it is
configured with 256 MB of shared_buffers. I am pretty sure they will run
on any reasonable developer machine.
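
If you want to check these numbers on your own machine, something like the
following should work. It is only a sketch: the function names are made up
for illustration, and the 'lyco/debug-*' pattern assumes the images were
pulled under the names given in point 2.

```shell
#!/bin/sh
# image_sizes: list the on-disk size of the downloaded debug images.
image_sizes() {
    docker images --filter 'reference=lyco/debug-*' \
        --format '{{.Repository}}:{{.Tag}} {{.Size}}'
}

# memory_usage: one-shot snapshot of the memory used by each running container.
memory_usage() {
    docker stats --no-stream --format '{{.Name}} {{.MemUsage}}'
}

# Example (needs a running Docker daemon):
#   image_sizes
#   memory_usage
```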

4.
docker run -d --network bactest --network-alias bacdb1.cent \
    lyco/debug-bacdb:2017-04-07
docker run -d --network bactest --network-alias bacdir1.cent \
    lyco/debug-bacdir:2017-04-07
docker run -d --network bactest --network-alias bacst1.cent \
    lyco/debug-bacst1:2017-04-07
docker run -d --network bactest --network-alias bacst2.cent \
    lyco/debug-bacst:2017-04-07
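
The same four commands can also be generated from a short script. This is
only a sketch: the file name start.sh is hypothetical, and the "docker
network create" line is an addition that assumes the bactest network does
not exist yet ("docker run --network" refuses to start a container on a
network that is missing). The script prints the commands so they can be
reviewed before anything is started.

```shell
#!/bin/sh
# Sketch: print the docker commands that bring up the four test containers.
TAG="2017-04-07"
NETWORK="bactest"

# containers: one "network-alias image" pair per line, as listed in point 4.
containers() {
    cat <<EOF
bacdb1.cent lyco/debug-bacdb
bacdir1.cent lyco/debug-bacdir
bacst1.cent lyco/debug-bacst1
bacst2.cent lyco/debug-bacst
EOF
}

# start_commands: print the commands instead of running them.
start_commands() {
    # Assumption: the network is not there yet and has to be created first.
    echo "docker network create $NETWORK"
    containers | while read -r name image; do
        echo "docker run -d --network $NETWORK --network-alias $name $image:$TAG"
    done
}

start_commands
```

Running "sh start.sh" only prints the five commands; "sh start.sh | sh -x"
executes them. If the network already exists, the first command reports an
error and the remaining ones still run.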

Now you have the containers running in the background. If you want to run
any command in a container, you have to know the ID of the running
container:

docker ps

Then you can run an interactive shell in the container:

docker exec -it <container_id> bash

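You might want to install strace, gdb, etc. in it.
Changes made in a running container are not written back to the image:
they are lost once the container is removed, and a new "docker run"
starts from the original image again.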

Explanation of flags:
-d: detach (run in background)
--network: connect containers to this virtual network (name is arbitrary)
--network-alias: give container DNS name in the virtual network
-it: interactive, allocate tty
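
The lookup with "docker ps" and the following "docker exec" can be combined
into one step. A minimal sketch, with made-up function names, assuming that
only one container is running per image (if there are more, the first one
listed is used):

```shell
#!/bin/sh
# Sketch: open a shell in the running container that was started from IMAGE.

# container_id IMAGE: print the ID of a running container created from IMAGE.
container_id() {
    docker ps --quiet --filter "ancestor=$1" | head -n 1
}

# shell_into IMAGE: look up the container and start an interactive bash in it.
shell_into() {
    id=$(container_id "$1")
    if [ -z "$id" ]; then
        echo "no running container for $1" >&2
        return 1
    fi
    docker exec -it "$id" bash
}

# Example: shell_into lyco/debug-bacdir:2017-04-07
```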

5. No, this is a misunderstanding. While Google does have some services
that can run containers, you don't need them (and you would have to pay
for them).

Basically, a container is a "super chroot": you run a process in an
isolated environment using your own kernel (i.e. it's not virtualized),
but possibly with different libraries, config files, network setup or
mounted disks. Because of this, you gain reproducibility: you can run the
same binary in the same environment, no matter what the underlying system
is, as long as it is a reasonably new Linux. You only need the container
images, i.e. files with the filesystem and metadata needed to run them.
This is what I uploaded to Docker Hub, and what is named
lyco/debug-bacst:2017-04-07 etc.
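
One way to see the "own kernel" point on a Linux host is to compare the
kernel release reported inside and outside a container. A small sketch; the
function name is made up, and it assumes the image contains uname and has
no entrypoint that would swallow the command:

```shell
#!/bin/sh
# same_kernel IMAGE: succeed if a container started from IMAGE reports the
# same kernel release as the host it runs on.
same_kernel() {
    host=$(uname -r)
    # --rm removes the throwaway container again once uname has finished.
    guest=$(docker run --rm "$1" uname -r)
    echo "host:      $host"
    echo "container: $guest"
    [ "$host" = "$guest" ]
}

# Example: same_kernel lyco/debug-bacdir:2017-04-07
```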

What I uploaded to Google Drive is a set of scripts and files that you
need if you want to recreate my images. It is there so you can see what I
did to make them and exactly what software I used, and to make it easier
to test any changes.

P. S.: I totally meant to post this to bacula-users and original poster
too. Sorry, reposting to list.

Dne 7.4.2017 v 20:41 Kern Sibbald napsal(a):
> Hello,
> 
> Well, it sounds like you have been working hard.
> 
> I am doing my development on an Ubuntu 16.04 machine, so I imagine it can
> handle docker containers as well as a lot of other stuff. However, I
> have never used a container, and I am assuming that you want me to do
> so.  I am willing to try, but here are a few questions:
> 
> 1. It seems like you need four images and there are apparently 5 Bacula
> jobs I need to start. Is that correct?
> 
> 2. What is the command I would use to get the images downloaded to my
> machine?
> 
> 3. Approximately how big is their total size?
> 
> 4. Once I have them here, what is the command(s) I use to start them?
> 
> 5. You seem to say that I can run them on a google drive.  How do I do
> that?
> 
> I am a bit concerned.  This seems to be a very big setup -- that is not
> something simple.  I'll take a look at it, but if it is overly complex,
> please don't count on me.  I don't have much spare time, and I don't do
> support work (takes much too long), but if I can clearly see a bug,
> there is a good chance that I can fix it.
> 
> The typical test situation that I deal with is anything similar to the
> test files in <bacula-source>/regress/tests.  Your setup for the moment
> seems to be more complicated (hopefully I am mistaken).
> 
> Best regards,
> 
> Kern
> 
> 
> 
> On 04/07/2017 05:00 PM, Zdeněk Bělehrádek wrote:
>> Hi,
>>
>> I managed to replicate the problem in a set of Docker containers based
>> upon the semiofficial Debian Jessie container and jessie-backports
>> packages.
>>
>> Images are:
>> REPOSITORY                 TAG                 IMAGE ID
>> lyco/debug-bacdb           2017-04-07          345265e86294
>> lyco/debug-bacst2          2017-04-07          6e355bf8c0ba
>> lyco/debug-bacst1          2017-04-07          f9ff4567bd26
>> lyco/debug-bacdir          2017-04-07          e1565bff29ec
>>
>> Start them with the command
>> ./run 2017-04-07
>> from the tarball below (or see the note), then exec bash in bacdir and run
>>
>> for i in `seq 1 5` ; do \
>> for job in bacst1_storage-job \
>> --bacst1_storage-incremental-job-mirror \
>> --bacst1_storage-full-job-mirror bacdir1_director-job \
>> --bacdir1_director-incremental-job-mirror \
>> --bacdir1_director-full-job-mirror ; do \
>> echo "run job=$job yes" | bacula-console ; done ; done
>>
>> This is meant to simulate the situation where a long-running backup
>> delays other jobs until Copy jobs start running too. This kind of
>> situation happens in our production too and is the source of the
>> problems that forced me to write this new config I am trying to debug.
>>
>> If you want to recreate the containers yourselves (e. g. to check there
>> isn't any problem with my packages etc.), you can download the scripts
>> and configs that I am using to create these containers as a tarball:
>>
>> https://drive.google.com/file/d/0B4bjslETcBa-c0M4N3hueDg2OEE/view?usp=sharing
>>
>>
>> The configuration is copied from the testing environment, looks like the
>> config that I would like to use in production, and has been changed only
>> minimally (enabled logging to files, enabled access to DB not based on
>> hostnames). The containers themselves aren't exactly a best-practices
>> showcase (things like using shell instead of init), but it shouldn't
>> matter for Bacula.
>>
>> Don't worry about passwords, I already changed them in my setup.
>>
>> Note: the run command just runs the images in a common network with DNS
>> names bacdir1.cent, bacdb1.cent, bacst1.cent and bacst2.cent.
>>
>> Dne 4.4.2017 v 16:09 Kern Sibbald napsal(a):
>>> Hello,
>>>
>>> Well, I am out of ideas.
>>>
>>> Yes, Bacula has a bugs database, and you can report it, but at this
>>> point it appears unlikely that it is a bug; otherwise someone else would
>>> have the same problem.  I will need to have a way to reproduce the
>>> problem. You can try turning on level 200 debug in the Director, and
>>> when the problem arises, do an llist on all volumes (note that is llist
>>> with double l).  Also provide your bacula-dir.conf and bacula-sd.conf.
>>> That may show some problem. The main point is for you to prove that
>>> there are other suitable Volumes that are available.  If doing those
>>> things does not uncover a problem, and I cannot reproduce it (currently
>>> the case), there will not be much more that I can do.
>>>
>>> Best regards,
>>>
>>> Kern
>>>
>>>
>>> On 04/04/2017 01:48 PM, Zdeněk Bělehrádek wrote:
>>>> Hi, thanks for your reply.
>>>>
>>>> Ad 1: they are the same, specifically 7.4.3+dfsg-1+sid1~bpo8+1 from
>>>> jessie-backports (I just verified it). For this test, even the FDs were
>>>> this version.
>>>>
>>>> Ad 2: I worked with a clean catalog:
>>>>    - stop director and storages
>>>>    - psql: drop database bacula
>>>>    - psql: create database bacula owner bacula
>>>>    - PGPASSWORD=XXXXX db_name=bacula \
>>>>      /usr/share/bacula-director/make_postgresql_tables \
>>>>      -U bacula -h bacdb1.cent -d bacula
>>>>    - start director and storages, enable trace
>>>>
>>>> To be sure, I checked the PostgreSQL logs, and there is only one error,
>>>> repeating every time bacula runs a job:
>>>> Apr  3 20:00:01 bacdb1 postgres[10867]: [24-1] 2017-04-03 20:00:01 CEST [10867-43] bacula@bacula ERROR:  table "delcandidates" does not exist
>>>> Apr  3 20:00:01 bacdb1 postgres[10867]: [24-2] 2017-04-03 20:00:01 CEST [10867-44] bacula@bacula STATEMENT:  DROP TABLE DelCandidates
>>>>
>>>> I don't know why bacula tries to delete nonexistent tables, but looking
>>>> at the source code, this query is used only when pruning jobs to clean
>>>> up temporary tables. I think it is harmless.
>>>>
>>>> I ran dbcheck against my catalog, and it found 2 orphaned clients (one
>>>> is not accessible in the testing env and not needed, one has its job
>>>> stuck) and 2 orphaned filesets (both have jobs that didn't run yet). So
>>>> no errors there either.
>>>>
>>>> The server is an OpenStack virtual server running on our infrastructure;
>>>> there were no crashes nor any problems I know of.
>>>>
>>>> Is there any other way to check for catalog damage?
>>>>
>>>> Ad 3: I run the jobs manually after setting up the new catalog; it takes
>>>> only a few minutes. My retention periods are 7 days minimum.
>>>>
>>>> Ad 4: I do not edit the catalog manually. I was using bacula-web to
>>>> display contents of the catalog, so to be sure I just re-ran the test
>>>> with a clean catalog and bacula-web disabled, and the bug is still here.
>>>>
>>>> Ad 5: I created it fresh by running make_postgresql_tables (from the
>>>> bacula package) in an empty database.
>>>>
>>>> root@bacdir1:~# dpkg -S /usr/share/bacula-director/make_postgresql_tables
>>>> bacula-director-pgsql: /usr/share/bacula-director/make_postgresql_tables
>>>> root@bacdir1:~# dpkg -l bacula-director-pgsql | grep "^ii"
>>>> ii  bacula-director-pgsql  7.4.3+dfsg-1+sid1~bpo8+1  amd64  network backup service - PostgreSQL storage for Director
>>>> [PREP]root@bacdir1:~# grep /usr/share/bacula-director/make_postgresql_tables -e Version
>>>> INSERT INTO Version (VersionId) VALUES (15);
>>>>
>>>> Ad 6: there are 3 programs that could do it automatically: bacula
>>>> director, bacula-web (I disabled it) and nagios check (we don't run
>>>> nagios in the test environment). I am quite sure nobody except bacula
>>>> can do it. And yes, I am sure none of my co-workers could mess with the
>>>> catalog either, I did ask.
>>>>
>>>>
>>>> Looking at the above, I am starting to think it may be a bug in Bacula.
>>>> Should I report it? Where?
>>>>
>>>> With regards,
>>>> Zdeněk Bělehrádek
>>>>
>>>> Dne 3.4.2017 v 18:51 Kern Sibbald napsal(a):
>>>>> Hello,
>>>>>
>>>>> The error you are getting should never happen, which means that
>>>>> something is seriously wrong with your Bacula installation.  A few of
>>>>> the multiple possibilities are:
>>>>>
>>>>> 1. Your DIR and SDs are not on the same version.  They *must* all be
>>>>> the same. With the little information you provided, for the moment
>>>>> this seems to be the most likely problem.
>>>>>
>>>>> 2. Your catalog is damaged.
>>>>>
>>>>> 3. Your Retention periods are too short and records are being removed
>>>>> from the catalog.
>>>>>
>>>>> 4. You have manually modified your catalog, so that now the records
>>>>> are not consistent.
>>>>>
>>>>> 5. Your catalog does not correspond to the Bacula Director version you
>>>>> are running.  This should be detected, but perhaps the catalog was
>>>>> later manually modified.
>>>>>
>>>>> 6. Either manually or some program is removing Volume records from the
>>>>> catalog or changing them (this point is probably a duplication of
>>>>> point 4)
>>>>>
>>>>> Best regards,
>>>>>
>>>>> Kern
>>>>>
> 
