Re: [Bacula-users] Disk based backup using vchanger, volumes being marked as Error
2014-08-05 08:15:28
On 8/5/2014 1:36 AM, Kern Sibbald wrote:
Hello Josh,
Please see below ...
On 08/04/2014 06:43 PM, Josh Fisher wrote:
On 8/1/2014 12:27 PM, Joseph Dickson wrote:
Greetings :-)
I've run into this problem with Bacula in a previous
installation, and I can't recall whether it was ever resolved.
I'm using Bacula for disk-based backups only, with vchanger
managing my virtual library.
I've configured a vchanger library with 100 slots and 8
drives, and have set Maximum Volume Bytes = 100G in the
pool definition I am using, to limit each slot in the
library to 100 GB. I have also set Maximum Concurrent
Jobs = 2 on each of the virtual tape drive devices
in my storage daemon config, so that only two jobs can
write to a device at a time, to minimize interleaving.
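A minimal sketch of the configuration described above (resource
names are illustrative, not taken from the poster's actual config;
the device path is the one that appears in the log output later in
this thread):

```
# bacula-dir.conf -- pool capping each volume (slot) at 100 GB
Pool {
  Name = Full-Pool              # illustrative name
  Pool Type = Backup
  Maximum Volume Bytes = 100G   # limits each slot's volume file to 100 GB
}

# bacula-sd.conf -- one of the 8 vchanger virtual drives, allowing
# at most two concurrent writers to reduce interleaving
Device {
  Name = chg1-drive-1
  Media Type = File
  Device Type = File
  Archive Device = /var/lib/bacula/chg1/1/drive1
  Autochanger = yes
  Maximum Concurrent Jobs = 2
}
```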
Everything works perfectly as long as I only kick off a few
jobs at a time. However, when my main backup windows
run and 30 or 40 backup jobs start, I often end up with
jobs that output the following sequence in the logs:
Have you set PreferMountedVolumes=no in the Job resource in
bacula-dir.conf? If three jobs start and want to write to volumes in
the same pool, then all three can be assigned the same volume.
In fact, if PreferMountedVolumes=yes (the default), then all
three WILL be assigned the same volume unless the pool restricts
the maximum number of jobs that the volume may contain. However,
your device (drive) restricts the maximum concurrent jobs to 2.
Therefore one of those three jobs will not be able to select the
drive where the volume is mounted and will be forced to select
another, unused drive. That third job will nevertheless select
the same volume as the other two and attempt to move the volume
from the drive it is in into the drive it has been assigned
to. The configuration has a built-in race condition.
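For reference, the directive in question lives in the Job resource
of bacula-dir.conf; a sketch, using the job name that appears in the
log output below:

```
# bacula-dir.conf -- tell the director NOT to favor volumes that are
# already mounted in a drive when assigning a volume to this job
Job {
  Name = "job-evolvereports-main"
  # ... other Job directives ...
  Prefer Mounted Volumes = no
}
```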
I have recently done quite a bit of work to try to avoid race
conditions such as the one you describe above. Does this still
happen on version 7.0.x? I ask because there is now code that
*should* detect this and explicitly make the third job (as you
describe above) wait. Now, it is possible that there is some code
path in the SD where the new code does not apply, so I cannot
exclude problems, but if any exist in 7.0.x I would like to know
so I can work on it some more. With the new code, the Volume will
be moved around, but at least it should be done correctly without
a deadlock or failure.
I haven't had a chance to update to 7.0.x yet, so I can't say. My
thought is that the volume itself should have a "Maximum Concurrent
Jobs" setting, in addition to the SD Device. Better still, it could
be automated by forcing the volume's maximum concurrency to match
that of the SD device at mount time. That should eliminate the need
for "Prefer Mounted Volumes" altogether: once the maximum number of
concurrent jobs had selected the volume, subsequent jobs would
reject it as unavailable, and so see the drive it is mounted in as
unavailable at drive selection time. Once a drive is selected, that
volume would be viewed as unavailable and rejected during volume
selection, at least until one of the jobs using the volume ends. So
by setting "Maximum Concurrent Jobs" to 1, one could guarantee that
a volume is never selected by more than one job at a time.
Best regards,
Kern
Setting PreferMountedVolumes=no causes the three jobs to select
a drive that is NOT already mounted with a volume from the pool.
This allows jobs writing to the same pool to select different
volumes from the pool, rather than all selecting the same next
available volume. This has its own caveats: it doesn't
necessarily prevent two jobs from selecting the same volume in
some cases, meaning that they will want to swap the volume back
and forth between drives, which is another type of race
condition. I have used this method successfully for a pool
containing only full backups, by setting PreferMountedVolumes=no
in the job resource and MaximumVolumeJobs=1 in the pool
resource. Since Bacula selects the volume for a job atomically,
this forces an exclusive set of volumes for each job,
thus preventing the race condition. Concurrency is then
limited only by the number of drives, but at the "expense" of
creating a greater number of smaller volume files. I quote
"expense" because on a disk vchanger it isn't usually a big
issue to have more volume files. Doing this with a tape
autochanger would use many more tapes and be truly more
expensive. Of course, unlimited concurrency is theoretical, since
the hardware limits the USEFUL concurrency.
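The workaround described above boils down to two directives working
together; a sketch (resource names are illustrative):

```
# bacula-dir.conf

# Job resource: prefer a drive that is NOT already mounted with a
# volume from this pool
Job {
  Name = "full-backup-job"        # illustrative name
  Prefer Mounted Volumes = no
  # ... other Job directives ...
}

# Pool resource: allow only one job per volume, so each concurrent
# job is forced onto its own exclusive volume
Pool {
  Name = "Full-Pool"              # illustrative name
  Pool Type = Backup
  Maximum Volume Jobs = 1
}
```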
31-Jul 21:00 bacula1-dir JobId 692: Start Backup JobId 692, Job=job-evolvereports-main.2014-07-31_21.00.00_48
31-Jul 21:00 bacula1-dir JobId 692: Using Device "chg1-drive-1" to write.
31-Jul 21:00 evolvereports-fd JobId 692: DIR and FD clocks differ by 50 seconds, FD automatically compensating.
31-Jul 21:05 bacula1-sd JobId 692: 3307 Issuing autochanger "unload slot 74, drive 1" command.
31-Jul 21:06 bacula1-sd JobId 692: Warning: Volume "chg1_0001_0066" wanted on "chg1-drive-1" (/var/lib/bacula/chg1/1/drive1) is in use by device "chg1-drive-3" (/var/lib/bacula/chg1/3/drive3)
31-Jul 21:06 bacula1-sd JobId 692: Warning: Volume "chg1_0001_0066" not on file device "chg1-drive-1" (/var/lib/bacula/chg1/1/drive1).
31-Jul 21:06 bacula1-sd JobId 692: Marking Volume "chg1_0001_0066" in Error in Catalog.
31-Jul 21:06 bacula1-sd JobId 692: Warning: Volume "chg1_0001_0066" not on file device "chg1-drive-1" (/var/lib/bacula/chg1/1/drive1).
31-Jul 21:06 bacula1-sd JobId 692: Marking Volume "chg1_0001_0066" in Error in Catalog.
31-Jul 21:06 bacula1-sd JobId 692: Warning: mount.c:212 Open of file device "chg1-drive-1" (/var/lib/bacula/chg1/1/drive1) Volume "chg1_0001_0066" failed: ERR=file_dev.c:172 Could not open(/var/lib/bacula/chg1/1/drive1,OPEN_READ_WRITE,0640): ERR=No such file or directory
31-Jul 21:06 bacula1-sd JobId 692: 3307 Issuing autochanger "unload slot 71, drive 2" command.
31-Jul 21:06 bacula1-sd JobId 692: 3304 Issuing autochanger "load slot 71, drive 1" command.
31-Jul 21:06 bacula1-sd JobId 692: 3305 Autochanger "load slot 71, drive 1", status is OK.
31-Jul 21:06 bacula1-sd JobId 692: Volume "chg1_0001_0071" previously written, moving to end of data.
31-Jul 21:06 bacula1-sd JobId 692: Ready to append to end of Volume "chg1_0001_0071" size=8,003,988,010
This ends up marking my perfectly usable volume as Error in
the catalog. Is this something that everyone runs into?
Is there any fix? As I recall from when I looked into it a
few years back, the issue was the order and timing of
volume and device selection, but it's definitely been a
while.
My bacula-sd.conf file is here:
Any guidance would be appreciated!
Thanks,