Bacula-users

Re: [Bacula-users] Device is BLOCKED - renamed Bugged or not?

From: Kern Sibbald <kern AT sibbald DOT com>
To: "Ana Emília M. Arruda" <emiliaarruda AT gmail DOT com>
Date: Mon, 27 Apr 2015 08:07:14 +0200
Hello Ana,

One of the big race conditions that is not yet solved, because it requires a major rewrite and is waiting on me having some free time, is the case where two jobs attempt to use the same drive at the same time for different Volumes.  This leads to a BLOCKED condition on one of the jobs until the other job finishes.

The workaround for that problem is, for jobs that can contend for the same drive but use different Volumes (pools), to ensure that they do not all start at the same time.  That is, if you start 50-100 jobs at the same time, and there are 20 that run concurrently in the SD, then you increase the chances of an initial drive-assignment conflict.

If instead you start those jobs at 1-2 minute intervals, you will not have that particular issue.  Generally, it just requires slightly different schedules.
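
For example, two Schedule resources along these lines in bacula-dir.conf (the names and times are just placeholders) stagger two groups of jobs by five minutes:

  Schedule {
    Name = "NightlyGroupA"
    Run = Full sun at 23:00              # group A starts on the hour
    Run = Incremental mon-sat at 23:00
  }
  Schedule {
    Name = "NightlyGroupB"
    Run = Full sun at 23:05              # group B starts 5 minutes later
    Run = Incremental mon-sat at 23:05
  }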

Best regards,
Kern

On 27.04.2015 03:04, Ana Emília M. Arruda wrote:

I'm glad to read such good news. Thank you, Kern.

I have been trying to understand this issue, which a Bacula user has been facing. As Kern said, it is really difficult to replicate. We noticed that his backups worked fine for days, then suddenly a "DEVICE is blocked" message appeared. Some details about his configuration:

1) 3 pools used by 20 or more concurrent jobs;
2) an autochanger with 10 drives (to avoid interleaving, each device was configured with Maximum Concurrent Jobs = 1; see the sketch after this list);
3) jobs with different priorities and various scheduled times;
4) groups of jobs using different pools.
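
Roughly, the autochanger and one of its Device resources in bacula-sd.conf looked like this sketch (the names, paths, and media type here are invented):

  Autochanger {
    Name = "TapeLibrary"
    Device = Drive-0, Drive-1            # ... and so on through Drive-9
    Changer Command = "/opt/bacula/scripts/mtx-changer %c %o %S %a %d"
    Changer Device = /dev/sg10
  }
  Device {
    Name = Drive-0
    Drive Index = 0
    Media Type = LTO-6
    Archive Device = /dev/nst0
    Autochanger = yes
    Maximum Concurrent Jobs = 1          # one job per drive, no interleaving
  }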

He noticed that he was having issues with a slot mismatch. That is, before his backups started, the output of "mtx-changer listall" showed the media/slot information just as it was in Bacula's Catalog. Then, after a day of backup jobs running, he noticed that "mtx-changer listall" showed different information from the Catalog.

The issue here seemed to be the autochanger timeout configuration. His autochanger has a 900-second timeout, so we set the Maximum Changer Wait, Maximum Rewind Wait, and Maximum Open Wait directives to 900 seconds and adjusted the mtx-changer script to match. It seems that this solved the slot mismatch problem.
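
In each Device resource that meant roughly the following (a sketch; 900 seconds matches the changer's own timeout):

  Device {
    Name = Drive-0
    # ... other Drive-0 directives as in the sketch above ...
    Maximum Changer Wait = 900           # all three values are in seconds
    Maximum Rewind Wait = 900
    Maximum Open Wait = 900
  }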

We thought that this was also causing the "DEVICE is blocked" issue, but we cannot confirm that yet.

He also made some modifications to his schedules and pools. Now all the jobs have the same priority and the same schedule, and they use just one pool on a given day.
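
The new Schedule selects a single pool per day, along these lines (a sketch; the pool names and times are invented):

  Schedule {
    Name = "OnePoolPerDay"
    Run = Level=Full Pool=PoolA sun at 23:00
    Run = Level=Incremental Pool=PoolB mon-wed at 23:00
    Run = Level=Incremental Pool=PoolC thu-sat at 23:00
  }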

We are going to monitor this new configuration, and perhaps we can post the results here.

Best regards,
Ana

On Sat, Apr 25, 2015 at 2:50 AM, Kern Sibbald <kern AT sibbald DOT com> wrote:
In my last email, I forgot to mention that, as you point out, the
problem can also result from a design issue.  The resolution of
problems that stem from design issues falls under my point 2.  If we
have a good test case that shows the problem, even if it results from a
design decision, most of the time we can find a solution -- in some
cases, we have added new directives, but in most cases, a bit more
programming/logic can fix the problem.

One of the biggest issues that I have with the current SD algorithm is
that during the drive(s) reservation process (prior to starting the SD
job) once a write drive is assigned, it cannot be changed.  Changing a
drive when multiple simultaneous jobs are writing is a non-trivial
problem.  There are solutions, but they require rather profound changes
to the SD, which I have been planning for at least 5 years -- all the
underlying code and algorithms now exist, so it is a matter of time.

Best regards,
Kern

On 24.04.2015 22:07, Josh Fisher wrote:
> I guess it is semantics, but I was just pointing out that it was not a
> coding issue, but rather a design issue/choice.
>
> You can divide the jobs into different pools and then give jobs in the
> same pools different priorities. The pools allow multiple jobs (from
> different pools) to run concurrently, while the priorities serialize the
> jobs within each pool. Far from desirable, but it does work.
>
> In any case, I agree that all of the ways of using multiple drives
> concurrently seem unwieldy.  It would be nice if both device and volume
> assignment were done as a single atomic operation every time that a job
> selected a volume. In other words, when the job needs a volume, it looks
> for both an AVAILABLE volume and an AVAILABLE device at the same time,
> and only one job at a time can make a volume-device selection. That is
> easier said than done, of course.
>
> On 4/24/2015 1:09 PM, Clark, Patricia A. wrote:
>> To avoid hijacking the question and to address whether it's a bug or not:
>>
>> Why it's a bug - a new backup job requesting media that is unavailable
>> because it is already in use, whether for a backup or a recovery, is a bug
>> when other perfectly good media is available.  One should not need to
>> create separate pools; otherwise you will need a separate pool for each
>> job to ensure this situation never happens.  The real issue here is how and
>> when the communication happens between the director and the storage
>> daemon.  If both of these jobs start within a short period of each other
>> (usually on the same schedule), that's when the second job will request
>> media that has already been assigned by the SD, but not communicated to
>> the director prior to the second job starting.  That gap is what creates
>> the contention for media.  I have also had tapes pulled out from
>> underneath a job resulting in "NULL" volume name and failed jobs.  So, if
>> not separate pools, then there's using separate schedules for each job,
>> also not desirable.  I have used offset schedules for groups of jobs in
>> order to reduce the number of contentions.  If nothing else, if media is
>> not available within a reasonable period of time of the request, the
>> director and/or the SD should decide to look for another.
>>
>> Patti Clark
>> Linux System Administrator
>> R&D Systems Support, Oak Ridge National Laboratory
>>
>>
>>
>> On 4/24/15, 11:02 AM, "Josh Fisher" <jfisher AT pvct DOT com> wrote:
>>
>>> On 4/24/2015 9:14 AM, Clark, Patricia A. wrote:
>>>> This is a known bug that has been reported, but still exists.  The job
>>>> wants the tape in use by another job that is using it in drive 0.
>>> I'm not convinced that this is a bug. By design, Bacula allows more than
>>> one job to simultaneously write to the same volume. When a job looks for
>>> the next volume to write on, it cannot exclude volumes that are already
>>> in use by another job. Note that this is not just at job start up, but
>>> any time a volume is needed. What causes the catch-22 is that each job
>>> is assigned a single device (tape drive) only once at job start up. If
>>> two jobs, each writing to a different device, require the same volume,
>>> then one job must wait until the volume can be moved into its assigned
>>> device. So it is not a bug in the implementation, but rather a design
>>> choice.
>>>
>>> From the perspective of using a multiple-drive changer, it would seem
>>> that it is a bug to allow multiple jobs to simultaneously write to the
>>> same volume, but Bacula must work with all kinds of hardware. If the
>>> implementation were changed to disallow simultaneous writes to the same
>>> volume, then concurrent jobs with a single drive changer would be
>>> impossible.
>>>
>>> Bacula does allow resolving this issue through the use of pools. By
>>> segregating jobs that are to be run concurrently into different pools,
>>> the situation where two jobs want the same volume at the same time is
>>> avoided altogether.  So is this a bug, or is it a configuration error?