Re: [Bacula-users] Disk based backup using vchanger, volumes being marked as Error
2014-08-05 08:15:28
On 8/5/2014 1:36 AM, Kern Sibbald wrote:
Hello Josh,
Please see below ...
On 08/04/2014 06:43 PM, Josh Fisher wrote:
On 8/1/2014 12:27 PM, Joseph Dickson wrote:
Greetings :-)
I've run into this problem with Bacula in a previous
installation, and I can't recall whether it was ever resolved.
I'm using Bacula for disk-based backups only, with vchanger
managing my virtual library.
I've configured a vchanger library with 100 slots and 8
drives, and have set Maximum Volume Bytes = 100G in the
pool definition I am using, to limit each slot in the
library to 100 GB. I have also set Maximum Concurrent
Jobs = 2 on each of the virtual tape drive devices
in my storage daemon config, so that only two jobs can
write to a device at a time, to minimize interleaving.
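A minimal sketch of the configuration described above (resource
names are illustrative, not taken from the poster's actual config;
the device path is the one that appears in the log output later in
this thread):

```
# bacula-dir.conf -- pool capping each volume (slot) at 100 GB
Pool {
  Name = Full-Pool              # illustrative name
  Pool Type = Backup
  Maximum Volume Bytes = 100G   # limits each slot's volume file to 100 GB
}

# bacula-sd.conf -- one of the 8 vchanger virtual drives, allowing
# at most two concurrent writers to reduce interleaving
Device {
  Name = chg1-drive-1
  Media Type = File
  Device Type = File
  Archive Device = /var/lib/bacula/chg1/1/drive1
  Autochanger = yes
  Maximum Concurrent Jobs = 2
}
```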
Everything works perfectly as long as I only kick off a few
jobs at a time. However, when my main backup windows
run and 30 or 40 backup jobs start, I often end up with
jobs that output the following sequence in the logs:
Have you set PreferMountedVolumes=no in the Job resource in
bacula-dir.conf? If three jobs start and want to write to volumes in
the same pool, then all three can be assigned the same volume.
In fact, if PreferMountedVolumes=yes (the default), then all
three WILL be assigned the same volume unless the pool restricts
the maximum number of jobs that the volume may contain. However,
your device (drive) restricts the maximum concurrent jobs to 2.
Therefore one of those three jobs will not be able to select the
drive where the volume is mounted and will be forced to select
another, unused drive. That third job will nevertheless select
the same volume as the other two and attempt to move the volume
from the drive it is in into the drive it has been assigned
to. The configuration has a built-in race condition.
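For reference, the directive in question lives in the Job resource
of bacula-dir.conf; a sketch, using the job name that appears in the
log output below:

```
# bacula-dir.conf -- tell the director NOT to favor volumes that are
# already mounted in a drive when assigning a volume to this job
Job {
  Name = "job-evolvereports-main"
  # ... other Job directives ...
  Prefer Mounted Volumes = no
}
```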
I have recently done quite a bit of work to try to avoid race
conditions such as the one you describe above. Does this still
happen on version 7.0.x? I ask because there is now code that
*should* detect this and explicitly make the third job (as you
describe above) wait. Now, it is possible that there is some code
path in the SD where the new code does not apply, so I cannot
exclude problems, but if any exist in 7.0.x I would like to know
so I can work on it some more. With the new code, the Volume will
be moved around, but at least it should be done correctly without
a deadlock or failure.
I haven't had a chance to update to 7.0.x yet, so I can't say. My
thought is that the volume itself should have a "Maximum Concurrent
Jobs" setting, in addition to the SD Device. Better still, it could
be automated by forcing the volume's maximum concurrency to match
that of the SD device at mount time. That should eliminate the need
for "Prefer Mounted Volumes" altogether: once the maximum number of
concurrent jobs had selected the volume, subsequent jobs would
reject it as unavailable, and so see the drive it is mounted in as
unavailable at drive selection time. Once a drive is selected, that
volume would be viewed as unavailable and rejected during volume
selection, at least until one of the jobs using the volume ends. So
by setting "Maximum Concurrent Jobs" to 1, one could guarantee that
a volume is never selected by more than one job at a time.
Best regards,
Kern
Setting PreferMountedVolumes=no causes the three jobs to select
a drive that is NOT already mounted with a volume from the pool.
This allows jobs writing to the same pool to select different
volumes from the pool, rather than all selecting the same next
available volume. This has its own caveats: it doesn't
necessarily prevent two jobs from selecting the same volume in
some cases, meaning that they will want to swap the volume back
and forth between drives, which is another type of race
condition. I have used this method successfully for a pool
containing only full backups, by setting PreferMountedVolumes=no
in the job resource and MaximumVolumeJobs=1 in the pool
resource. Since Bacula selects the volume for a job atomically,
this forces an exclusive set of volumes for each job,
thus preventing the race condition. Concurrency is then
limited only by the number of drives, but at the "expense" of
creating a greater number of smaller volume files. I quote
"expense" because on a disk vchanger it isn't usually a big
issue to have more volume files. Doing this with a tape
autochanger would use many more tapes and be truly more
expensive. Of course, unlimited concurrency is theoretical, since
the hardware limits the USEFUL concurrency.
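The workaround described above boils down to two directives working
together; a sketch (resource names are illustrative):

```
# bacula-dir.conf

# Job resource: prefer a drive that is NOT already mounted with a
# volume from this pool
Job {
  Name = "full-backup-job"        # illustrative name
  Prefer Mounted Volumes = no
  # ... other Job directives ...
}

# Pool resource: allow only one job per volume, so each concurrent
# job is forced onto its own exclusive volume
Pool {
  Name = "Full-Pool"              # illustrative name
  Pool Type = Backup
  Maximum Volume Jobs = 1
}
```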
31-Jul 21:00 bacula1-dir JobId 692: Start Backup JobId 692, Job=job-evolvereports-main.2014-07-31_21.00.00_48
31-Jul 21:00 bacula1-dir JobId 692: Using Device "chg1-drive-1" to write.
31-Jul 21:00 evolvereports-fd JobId 692: DIR and FD clocks differ by 50 seconds, FD automatically compensating.
31-Jul 21:05 bacula1-sd JobId 692: 3307 Issuing autochanger "unload slot 74, drive 1" command.
31-Jul 21:06 bacula1-sd JobId 692: Warning: Volume "chg1_0001_0066" wanted on "chg1-drive-1" (/var/lib/bacula/chg1/1/drive1) is in use by device "chg1-drive-3" (/var/lib/bacula/chg1/3/drive3)
31-Jul 21:06 bacula1-sd JobId 692: Warning: Volume "chg1_0001_0066" not on file device "chg1-drive-1" (/var/lib/bacula/chg1/1/drive1).
31-Jul 21:06 bacula1-sd JobId 692: Marking Volume "chg1_0001_0066" in Error in Catalog.
31-Jul 21:06 bacula1-sd JobId 692: Warning: Volume "chg1_0001_0066" not on file device "chg1-drive-1" (/var/lib/bacula/chg1/1/drive1).
31-Jul 21:06 bacula1-sd JobId 692: Marking Volume "chg1_0001_0066" in Error in Catalog.
31-Jul 21:06 bacula1-sd JobId 692: Warning: mount.c:212 Open of file device "chg1-drive-1" (/var/lib/bacula/chg1/1/drive1) Volume "chg1_0001_0066" failed: ERR=file_dev.c:172 Could not open(/var/lib/bacula/chg1/1/drive1,OPEN_READ_WRITE,0640): ERR=No such file or directory
31-Jul 21:06 bacula1-sd JobId 692: 3307 Issuing autochanger "unload slot 71, drive 2" command.
31-Jul 21:06 bacula1-sd JobId 692: 3304 Issuing autochanger "load slot 71, drive 1" command.
31-Jul 21:06 bacula1-sd JobId 692: 3305 Autochanger "load slot 71, drive 1", status is OK.
31-Jul 21:06 bacula1-sd JobId 692: Volume "chg1_0001_0071" previously written, moving to end of data.
31-Jul 21:06 bacula1-sd JobId 692: Ready to append to end of Volume "chg1_0001_0071" size=8,003,988,010
This ends up marking my perfectly usable volume as Error in
the catalog. Is this something that everyone runs into?
Is there any fix? As I recall from when I looked into it a
few years back, the issue was the order and timing of
volume and device selection, but it's definitely been a
while.
My bacula-sd.conf file is here:
Any guidance would be appreciated!
Thanks,