Bacula-users

Re: [Bacula-users] bacula mixing up slots and volumes with virtual multiple drive autochanger running on Amazon AWS storage gateway

2014-11-19 14:28:12
Subject: Re: [Bacula-users] bacula mixing up slots and volumes with virtual multiple drive autochanger running on Amazon AWS storage gateway
From: "Kelley, Jared" <jkelley AT popcap DOT com>
To: "heitor AT bacula.com DOT br" <heitor AT bacula.com DOT br>
Date: Wed, 19 Nov 2014 18:51:34 +0000
Thank you for your reply. I removed that setting and separated my jobs into 3 different pools, and all my jobs are running smoothly now. Prefer Mounted Volumes was the issue.
Now I am experiencing interleaving on a single volume/tape drive when multiple jobs are scheduled at the same time in the same pool, rather than Bacula using different volumes and different drives for those jobs.

Here is my issue:

I have 10 jobs that kick off at the same time every day. Each job takes anywhere from 10 to 30 minutes. The pool behind these jobs has 4 tape drives and 4 volumes (virtual tapes) assigned to it. The 10 jobs all kick off at the same time and they all write to the same volume, apparently (I assume) interleaving the data. See my paste below; I've bolded start times, end times, pools, and tapes. One can see these jobs all completed near each other and ran for over an hour, all to the same tape/volume. I assume that if I set "prefer mounted volumes = no" for these jobs, they would use all 4 drives in the pool and write to separate volumes instead of interleaving to a single drive and volume. But based on the manual, the reported problems, and my own experience with 'prefer mounted volumes = no', I hesitate to use that.

My question:
Is there another way to get multiple jobs scheduled at the same time to run concurrently, using the different drives and volumes assigned to the pool, instead of all writing at the same time and interleaving the data onto a single drive and volume? Or is prefer mounted volumes = no the only way to solve this?

Any help is greatly appreciated.
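For what it's worth, one idea I've been looking at (just a sketch, not something I've verified works on 5.2.6; the device name and path below are placeholders from my setup) is capping each Storage Daemon drive at one concurrent job, so simultaneous jobs in the same pool would be forced onto separate drives and volumes instead of interleaving on one:

```conf
# bacula-sd.conf sketch (UNVERIFIED): one Device per virtual drive, each
# limited to a single concurrent job. Device name and Archive Device path
# are placeholders; I have not confirmed this directive's behavior on 5.2.6.
Device {
  Name = Drive-1
  Drive Index = 0
  Media Type = LTO
  Archive Device = /dev/nst0
  Autochanger = yes
  Automatic Mount = yes
  Maximum Concurrent Jobs = 1   # at most one job writing to this drive
}
```

If that directive is honored, a second simultaneous job could not join Drive-1 and would have to mount its own volume on another drive, which is the behavior I'm after.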

19-Nov 10:06 backup02-sd JobId 925: Job write elapsed time = 01:01:23, Transfer rate = 1.489 M Bytes/second

19-Nov 10:06 backup02-dir JobId 925: Bacula backup02-dir 5.2.6 (21Feb12):

  Build OS:               x86_64-pc-linux-gnu ubuntu 14.04

  JobId:                  925

  Job:                    labsdb1.2014-11-19_09.05.00_21

  Backup Level:           Full (upgraded from Differential)

  Client:                 "labsdb1" 2.4.4 (28Dec08) x86_64-pc-linux-gnu,debian,lenny/sid

  FileSet:                "Database Backup" 2014-11-07 22:02:50

  Pool:                   "database" (From Job resource)

  Catalog:                "MyCatalog" (From Client resource)

  Storage:                "Tape3" (From Pool resource)

  Scheduled time:         19-Nov-2014 09:05:00

  Start time:             19-Nov-2014 09:05:02

  End time:               19-Nov-2014 10:06:26

  Elapsed time:           1 hour 1 min 24 secs

  Priority:               10

  FD Files Written:       2

  SD Files Written:       2

  FD Bytes Written:       5,486,802,781 (5.486 GB)

  SD Bytes Written:       5,486,802,988 (5.486 GB)

  Rate:                   1489.4 KB/s

  Software Compression:   None

  VSS:                    no

  Encryption:             no

  Accurate:               no

  Volume name(s):         AAAACF5F6A

  Volume Session Id:      155

  Volume Session Time:    1415816791

  Last Volume Bytes:      1,348,311,508,992 (1.348 TB)

  Non-fatal FD errors:    0

  SD Errors:              0

  FD termination status:  OK

  SD termination status:  OK

  Termination:            Backup OK


19-Nov 10:06 backup02-dir JobId 925: Begin pruning Jobs older than 1 year .

19-Nov 10:06 backup02-dir JobId 925: No Jobs found to prune.

19-Nov 10:06 backup02-dir JobId 925: Begin pruning Files.

19-Nov 10:06 backup02-dir JobId 925: No Files found to prune.

19-Nov 10:06 backup02-dir JobId 925: End auto prune.


19-Nov 10:41 backup02-sd JobId 919: Job write elapsed time = 01:36:26, Transfer rate = 2.570 M Bytes/second

19-Nov 10:41 backup02-dir JobId 919: Bacula backup02-dir 5.2.6 (21Feb12):

  Build OS:               x86_64-pc-linux-gnu ubuntu 14.04

  JobId:                  919

  Job:                    shopd1.2014-11-19_09.05.00_15

  Backup Level:           Full (upgraded from Differential)

  Client:                 "shopdb1" 5.0.2 (28Apr10) x86_64-pc-linux-gnu,debian,squeeze/sid

  FileSet:                "varlibDatabase Backup" 2014-11-06 20:20:52

  Pool:                   "database" (From Job resource)

  Catalog:                "MyCatalog" (From Client resource)

  Storage:                "Tape3" (From Pool resource)

  Scheduled time:         19-Nov-2014 09:05:00

  Start time:             19-Nov-2014 09:05:00

  End time:               19-Nov-2014 10:41:26

  Elapsed time:           1 hour 36 mins 26 secs

  Priority:               10

  FD Files Written:       3

  SD Files Written:       3

  FD Bytes Written:       14,875,191,024 (14.87 GB)

  SD Bytes Written:       14,875,192,011 (14.87 GB)

  Rate:                   2570.9 KB/s

  Software Compression:   None

  VSS:                    no

  Encryption:             yes

  Accurate:               no

  Volume name(s):         AAAACF5F6A

  Volume Session Id:      149

  Volume Session Time:    1415816791

  Last Volume Bytes:      1,375,507,574,784 (1.375 TB)

  Non-fatal FD errors:    0

  SD Errors:              0

  FD termination status:  OK

  SD termination status:  OK

  Termination:            Backup OK


19-Nov 10:41 backup02-dir JobId 919: Begin pruning Jobs older than 7 years .

19-Nov 10:41 backup02-dir JobId 919: No Jobs found to prune.

19-Nov 10:41 backup02-dir JobId 919: Begin pruning Files.

19-Nov 10:41 backup02-dir JobId 919: No Files found to prune.

19-Nov 10:41 backup02-dir JobId 919: End auto prune.



19-Nov 11:10 backup02-sd JobId 917: Job write elapsed time = 02:05:32, Transfer rate = 3.894 M Bytes/second

19-Nov 11:10 backup02-dir JobId 917: Bacula backup02-dir 5.2.6 (21Feb12):

  Build OS:               x86_64-pc-linux-gnu ubuntu 14.04

  JobId:                  917

  Job:                    pushdb.2014-11-19_09.05.00_13

  Backup Level:           Full (upgraded from Differential)

  Client:                 "pushdb1" 5.0.2 (28Apr10) x86_64-pc-linux-gnu,debian,6.0.6

  FileSet:                "varlibDatabase Backup" 2014-11-06 20:20:52

  Pool:                   "database" (From Job resource)

  Catalog:                "MyCatalog" (From Client resource)

  Storage:                "Tape3" (From Pool resource)

  Scheduled time:         19-Nov-2014 09:05:00

  Start time:             19-Nov-2014 09:05:00

  End time:               19-Nov-2014 11:10:33

  Elapsed time:           2 hours 5 mins 33 secs

  Priority:               10

  FD Files Written:       2

  SD Files Written:       2

  FD Bytes Written:       29,334,385,684 (29.33 GB)

  SD Bytes Written:       29,334,385,894 (29.33 GB)

  Rate:                   3894.1 KB/s

  Software Compression:   None

  VSS:                    no

  Encryption:             no

  Accurate:               no

  Volume name(s):         AAAACF5F6A

  Volume Session Id:      147

  Volume Session Time:    1415816791

  Last Volume Bytes:      1,397,835,371,520 (1.397 TB)

  Non-fatal FD errors:    1

  SD Errors:              0

  FD termination status:  OK

  SD termination status:  OK

  Termination:            Backup OK 


19-Nov 11:10 backup02-dir JobId 917: Begin pruning Jobs older than 1 year .

19-Nov 11:10 backup02-dir JobId 917: No Jobs found to prune.

19-Nov 11:10 backup02-dir JobId 917: Begin pruning Files.

19-Nov 11:10 backup02-dir JobId 917: No Files found to prune.

19-Nov 11:10 backup02-dir JobId 917: End auto prune.


19-Nov 11:26 backup02-sd JobId 915: Job write elapsed time = 02:21:25, Transfer rate = 3.655 M Bytes/second

19-Nov 11:26 backup02-dir JobId 915: Bacula backup02-dir 5.2.6 (21Feb12):

  Build OS:               x86_64-pc-linux-gnu ubuntu 14.04

  JobId:                  915

  Job:                    ecommdb1.2014-11-19_09.05.00_11

  Backup Level:           Differential, since=2014-11-08 09:05:03

  Client:                 "ecommdb" 2.4.4 (28Dec08) x86_64-pc-linux-gnu,debian,lenny/sid

  FileSet:                "Database Backup" 2014-11-07 22:02:50

  Pool:                   "database" (From Job resource)

  Catalog:                "MyCatalog" (From Client resource)

  Storage:                "Tape3" (From Pool resource)

  Scheduled time:         19-Nov-2014 09:05:00

  Start time:             19-Nov-2014 09:05:00

  End time:               19-Nov-2014 11:26:27

  Elapsed time:           2 hours 21 mins 27 secs

  Priority:               10

  FD Files Written:       1

  SD Files Written:       1

  FD Bytes Written:       31,016,183,168 (31.01 GB)

  SD Bytes Written:       31,016,183,295 (31.01 GB)

  Rate:                   3654.6 KB/s

  Software Compression:   None

  VSS:                    no

  Encryption:             no

  Accurate:               no

  Volume name(s):         AAAACF5F6A

  Volume Session Id:      145

  Volume Session Time:    1415816791

  Last Volume Bytes:      1,407,107,681,280 (1.407 TB)

  Non-fatal FD errors:    0

  SD Errors:              0

  FD termination status:  OK

  SD termination status:  OK

  Termination:            Backup OK




From: "heitor AT bacula.com DOT br" <heitor AT bacula.com DOT br>
Date: Thursday, November 13, 2014 at 12:24 PM
To: "Kelley, Jared" <jkelley AT popcap DOT com>
Cc: "bacula-users AT lists.sourceforge DOT net" <bacula-users AT lists.sourceforge DOT net>
Subject: Re: [Bacula-users] bacula mixing up slots and volumes with virtual multiple drive autochanger running on Amazon AWS storage gateway


My settings for each job have prefer mounted volumes = no, so jobs should kick off until there are 10 jobs running and all 10 drives are in use. But this is sometimes not the case.

Mr. Jared: I think you should never use prefer mounted volumes = no. It is not a fully reliable Bacula feature; it will eventually lead you to glitches, and this warning appears in the Bacula web manual.
If you want to write to multiple volumes at the same time, submit jobs scheduled at the same time to different pools.
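For example, a minimal sketch of what that could look like in bacula-dir.conf (pool, storage, and JobDefs names here are illustrative placeholders, not taken from your configuration):

```conf
# bacula-dir.conf sketch: one pool per group of simultaneous jobs, so each
# group mounts its own volume. All names below are placeholders.
Pool {
  Name = database-a
  Pool Type = Backup
  Storage = Tape3         # Pool-level Storage overrides the Job's storage
}
Pool {
  Name = database-b
  Pool Type = Backup
  Storage = Tape3
}
Job {
  Name = labsdb1-backup
  JobDefs = "DefaultJob"  # hypothetical shared defaults
  Pool = database-a       # jobs scheduled together point at different pools
}
Job {
  Name = shopdb1-backup
  JobDefs = "DefaultJob"
  Pool = database-b
}
```

With jobs spread over pools like this, each pool selects its own volume, so concurrent jobs end up on different volumes without needing prefer mounted volumes = no.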

Regards,
=========================================================================
Heitor Medrado de Faria
Only a few days left - Bacula Telepresence Training: http://www.bacula.com.br/?p=2174
==========================================================================


From: "Jared Kelley" <jkelley AT popcap DOT com>
To: bacula-users AT lists.sourceforge DOT net
Sent: Thursday, November 13, 2014 17:39:31
Subject: [Bacula-users] bacula mixing up slots and volumes with virtual multiple drive autochanger running on Amazon AWS storage gateway

I'm curious whether anyone has any knowledge of this, or whether I've come upon a bug in Bacula.
Bacula version: 5.2.6
OS info:

cat /etc/debian_version 

jessie/sid


lsb_release -a

Distributor ID: Ubuntu

Description:    Ubuntu 14.04.1 LTS

Release:        14.04

Codename:       trusty


I'm using the Amazon AWS Storage Gateway for my autochanger and tape drives.

This allows for 1,600 slots, 10 drives, and 1 autochanger.


I am able to run up to 10 jobs at one time, writing to 10 different virtual tapes on 10 different drives, concurrently.

The problem is, after many jobs have run successfully over a period of a few days, I start getting errors such as:



12-Nov 20:05 backup02-sd JobId 774: 3304 Issuing autochanger "load slot 1, drive 1" command.

12-Nov 20:10 backup02-sd JobId 774: Fatal error: 3992 Bad autochanger "load slot 1, drive 1": ERR=Child exited with code 1.

Results=Loading media from Storage Element 1 into drive 1...Source Element Address 20000 is Empty

12-Nov 20:10 svc3-02-fd JobId 774: Fatal error: job.c:1817 Bad response to Append Data command. Wanted 3000 OK data

 got 3903 Error append data


OR

12-Nov 19:10 backup02-sd JobId 772: Please mount Volume "AAAACF5F6A" or label a new one for:

    Job:          nfs:tape.2014-11-12_18.43.30_05

    Storage:      "Drive-2" (/dev/tape/by-path/ip-10.<IP>-iscsi-iqn.1997-05.com.amazon:sgw-20b85d49-tapedrive-02-lun-0-nst)

    Pool:         Default

    Media type:   LTO


12-Nov 19:18 backup02-sd JobId 772: Warning: Volume "AAAACF5F6A" wanted on "Drive-2" (/dev/tape/by-path/ip-10.<IP>:3260-iscsi-iqn.1997-05.com.amazon:sgw-20b85d49-tapedrive-02-lun-0-nst) is in use by device "Drive-1" (/dev/tape/by-path/ip-10.<IP>-iscsi-iqn.1997-05.com.amazon:sgw-20b85d49-tapedrive-01-lun-0-nst)

12-Nov 19:18 backup02-sd JobId 772: Warning: mount.c:217 Open device "Drive-2" (/dev/tape/by-path/ip-10.5.66.22:3260-iscsi-iqn.1997-05.com.amazon:sgw-20b85d49-tapedrive-02-lun-0-nst) Volume "AAAACF5F6A" failed: ERR=dev.c:506 Unable to open device "Drive-2" (/dev/tape/by-path/ip-10.<IP>-iscsi-iqn.1997-05.com.amazon:sgw-20b85d49-tapedrive-02-lun-0-nst): ERR=No medium found



The above 2 jobs failed with an error, but this job, with similar warnings in the log file, ended up completing successfully with no intervention from me.



13-Nov 08:35 backup02-sd JobId 785: Warning: mount.c:217 Open device "Drive-6" (/dev/tape/by-path/ip-10.<IP>:3260-iscsi-iqn.1997-05.com.amazon:sgw-20b85d49-tapedrive-06-lun-0-nst) Volume "AAAAC95F6C" failed: ERR=dev.c:506 Unable to open device "Drive-6" (/dev/tape/by-path/ip-10.<IP>:3260-iscsi-iqn.1997-05.com.amazon:sgw-20b85d49-tapedrive-06-lun-0-nst): ERR=No medium found


13-Nov 08:35 backup02-sd JobId 785: Please mount Volume "AAAAC95F6C" or label a new one for:

    Job:          rvsdb1-02:daily.2014-11-13_08.05.00_26

    Storage:      "Drive-6" (/dev/tape/by-path/ip-10.5.66.22:3260-iscsi-iqn.1997-05.com.amazon:sgw-20b85d49-tapedrive-06-lun-0-nst)

    Pool:         Default

    Media type:   LTO

13-Nov 08:40 backup02-sd JobId 785: 3307 Issuing autochanger "unload slot 5, drive 2" command.

13-Nov 08:40 backup02-sd JobId 785: 3304 Issuing autochanger "load slot 5, drive 5" command.

13-Nov 08:40 backup02-sd JobId 785: 3305 Autochanger "load slot 5, drive 5", status is OK.

13-Nov 08:40 backup02-sd JobId 785: Volume "AAAAD55F70" previously written, moving to end of data.

13-Nov 08:40 backup02-sd JobId 785: Ready to append to end of Volume "AAAAD55F70" at file=985.


It appears this last job, JobId 785, was able to load slot 5 in a different drive and continue writing to a different volume (AAAAD55F70) than the one it originally wanted (AAAAC95F6C).

My settings for each job have prefer mounted volumes = no, so jobs should kick off until there are 10 jobs running and all 10 drives are in use. But this is sometimes not the case.

I have seen jobs fail, as with JobIds 772 and 774, as well as complete, as with JobId 785.

When I run list volumes, there are times when two separate volumes are shown in the same slot.

I then run update slots scan, and one of the duplicate slots becomes 0. There isn't a slot #0, so I run update slots again, and the correct slots and volumes are then listed in the console, matching the output of:

/etc/bacula/scripts/mtx-changer /dev/sg12 listall
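For reference, the resync workaround above as I run it in bconsole (the storage resource name here is a placeholder for mine):

```conf
# bconsole sketch: resync the catalog's slot/volume mapping with what the
# (virtual) changer actually holds. "Tape3" is a placeholder storage name.
update slots scan storage=Tape3    # physically read each tape's label (slow)
update slots storage=Tape3         # re-read the changer's barcode inventory
list volumes                       # verify slots now match mtx-changer listall
```

The scan pass is what turns the phantom duplicate into slot 0; the second update slots then restores the correct mapping.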


It seems Bacula is mixing up the slots, and which volumes are in each slot, all by itself. My only conclusion is that, over time, Bacula is unable to successfully manage 1,600 slots with 10 drives and concurrent jobs using the first 10 of those 1,600 slots with the 10 drives. Of course, all tapes are virtual tapes being written to Amazon AWS, and the virtual changer is a Storage Gateway running on VMware, with iSCSI targets used by Bacula running on a bare-metal server - successfully, up to a point.


Any help or input on this issue would be greatly appreciated. If I can provide any further information that would be helpful, please let me know.


Thanks in advance


Jk



_______________________________________________
Bacula-users mailing list
Bacula-users AT lists.sourceforge DOT net
https://lists.sourceforge.net/lists/listinfo/bacula-users
