Bacula-users

Re: [Bacula-users] Large backup to tape?

From: mark.bergman AT uphs.upenn DOT edu
To: Erich Weiler <weiler AT soe.ucsc DOT edu>
Date: Thu, 08 Mar 2012 15:30:26 -0500
In the message dated: Thu, 08 Mar 2012 09:38:33 PST,
The pithy ruminations from Erich Weiler on 
<Re: [Bacula-users] Large backup to tape?> were:
=> Thanks for the suggestions!
=> 
=> We have a couple more questions that I hope have easy answers.  So, it's 
=> been strongly suggested by several folks now that we back up our 200TB 
=> of data in smaller chunks.  This is our structure:
=> 
=> We have our 200TB in one directory.  From there we have about 10,000 
=> subdirectories that each have two files in it, ranging in size between 
=> 50GB and 300GB (an estimate).  All of those 10,000 directories add up 
=> to about 200TB.  It will grow to 3 or so petabytes in size over the next 
=> few years.


Hmmm...maybe I'm misunderstanding your data structure. Let me do a
little math:

        10000 directories * 2 files each = 20000 files
        200TB / 20000 files = 10GB/file

        3PB(projected size) / 10GB(per file) = 314572 files

        314572 files / 2 (files per directory) = 157286 directories
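Spelled out in Python (the 314572 figure implies binary units, i.e. 1 PB = 1024 * 1024 GB; that assumption is mine, not stated in the thread):

```python
# Back-of-envelope numbers from above, in binary units (1 PB = 1024*1024 GB).
files_now = 10_000 * 2                    # 2 files per directory
gb_per_file = 200 * 1024 / files_now      # ~10 GB/file
files_projected = 3 * 1024 * 1024 // 10   # 3 PB at ~10 GB/file -> 314572
dirs_projected = files_projected // 2     # 2 files per directory -> 157286
print(gb_per_file, files_projected, dirs_projected)
```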

What filesystem are you using? Many filesystems have serious performance
problems when there are 10K objects (files, subdirectories) in a
directory.

I am absolutely not a DBA, but I'd take a close look at the database you
are using for bacula, and open a discussion with the developers regarding
tables, indices, and performance with such a wide & shallow directory
tree.

=> 
=> Does anyone have an idea of how to break that up logically within 
=> bacula, such that we could just do a bunch of smaller "Full" backups of 


Sure. I'm assuming that your directories have some kind of logical
naming convention:

        /data/AAA000123
        /data/AAA000124
             :
             :
        /data/ZZZ999999

In that case, you can create multiple logical filesets within bacula,
for example:

        /data/AAA0[0-4]*
        /data/AAA0[5-9]*
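One way to express such a fileset (a sketch only: bacula's File directive does not do shell globbing itself, so this leans on the documented "|program" file-list form to let a shell expand the pattern; the names come from the hypothetical layout above):

```conf
# Hypothetical FileSet covering /data/AAA00* through /data/AAA04*.
# The "\\|" prefix tells the Director to run the command and use its
# output (one path per line) as the file list.
FileSet {
  Name = "Data-AAA0-0to4"
  Include {
    Options {
      signature = MD5
    }
    File = "\\|sh -c 'ls -d /data/AAA0[0-4]*'"
  }
}
```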

Since each of these filesets would be backed up from the same fileserver
(unless you're using a clustered filesystem), you'd want to restrict
backup concurrency to avoid running too many jobs at once.

You could do:

        directories per fileset =
                acceptable backup window (in hours) * backup rate (GB/hr)
                 /  20GB (from earlier calculation of average 20GB
                          per subdirectory)

        # of filesets = total directories / directories per fileset
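With made-up numbers plugged in (the 12-hour window and 250 GB/hr rate are assumptions for illustration, not figures from this thread):

```python
# Hypothetical sizing: none of these numbers come from the thread.
window_hr = 12          # acceptable backup window
rate_gb_hr = 250        # sustained backup rate
gb_per_dir = 20         # average from the earlier calculation
total_dirs = 10_000     # current directory count

dirs_per_fileset = window_hr * rate_gb_hr // gb_per_dir     # 150
num_filesets = -(-total_dirs // dirs_per_fileset)           # ceiling -> 67
print(dirs_per_fileset, num_filesets)
```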

The patterns used to determine the filesets would need to be based on
both the current subdirectory names and the subdirectories to be added
in the future, in order to keep the fileset sizes balanced. In other
words, if the new data subdirectories will all be in the range AAABBB*,
then you'll need to do something to split that range into reasonably
sized chunks, not one 3PB fileset.


=> smaller chunks of the data?  The data will never change, and will just 
=> be added to.  As in, we will be adding more subdirectories with 2 files 
=> in them to the main directory, but will never delete or change any of 
=> the old data.

=> 
=> Is there a way to tell bacula to "back up all this, but do it in small 
=> 6TB chunks" or something?  So we would avoid the massive 200TB single 
=> backup job + hundreds of (eventual) small incrementals?  Or some other idea?

You could use bacula's ability to generate filesets dynamically from an
external program. The external program could use a knapsack algorithm to
take all the directories and divide them into sets, with each set sized
to meet your acceptable backup window. The algorithm would need to be
'stable', so that directory "AAA000123" is placed with the same other
subdirectories each time.
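A minimal sketch of such a generator (hypothetical names and sizes; a simple first-fit pass over the sorted directory names, which stays stable as long as new directories sort after the existing ones):

```python
def make_filesets(dir_sizes, max_gb):
    """Group directories into filesets no larger than max_gb.

    dir_sizes maps directory name -> size in GB.  Sorting by name keeps
    the assignment deterministic, so an existing directory lands in the
    same fileset on every run (provided new names sort after old ones).
    """
    filesets, totals = [[]], [0]
    for name in sorted(dir_sizes):
        size = dir_sizes[name]
        # Start a new fileset when this directory won't fit in the current one.
        if filesets[-1] and totals[-1] + size > max_gb:
            filesets.append([])
            totals.append(0)
        filesets[-1].append(name)
        totals[-1] += size
    return filesets

# Hypothetical sizes (GB) for a few of the /data subdirectories:
sizes = {"AAA000123": 150, "AAA000124": 60, "AAA000125": 300,
         "AAA000126": 220, "AAA000127": 90}
print(make_filesets(sizes, 400))
# -> [['AAA000123', 'AAA000124'], ['AAA000125'], ['AAA000126', 'AAA000127']]
```

The output of that program (one path per line) can then be fed to bacula as a file-list.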

See the description of a "file-list" in:

        http://www.bacula.org/5.2.x-manuals/en/main/main/Configuring_Director.html#SECTION002370000000000000000

Mark

=> 
=> Thanks again for all the feedback!  Please "reply-all" to this email 
=> when replying.
=> 
=> -erich
=> 
=> On 3/1/12 10:18 AM, mark.bergman AT uphs.upenn DOT edu wrote:
=> > In the message dated: Wed, 29 Feb 2012 20:23:14 PST,
=> > The pithy ruminations from Erich Weiler on
=> > <[Bacula-users] Large backup to tape?>  were:
=> > =>  Hey Y'all,
=> > =>
=> > =>  So I have a Dell ML6010 tape library that holds 41 LTO-5 tapes, all
=> >
=> > I've got a Dell ML6010, so I can offer some specific suggestions.
=> >
=> >    [SNIP!]
=> >
=> > =>
=> > =>  The fileset I'm backing up is about 200TB large total (each file is
=> > =>  about 300GB big).  So, not only will it use every tape in the tape
=> > =>  library (41 tapes), but we'll have to refill the tape library about 6
=> > =>  times to get the whole thing backed up.  After that I want to just do
=> >
=> > I agree with the other suggestions to break up the dataset into smaller
=> > chunks.
=> >
=> >
=> >    [SNIP!]
=> >
=> > =>
=> > =>  So, I guess I have a couple basic questions.  When it uses all the
=> > =>  tapes in the library in a single job (200TB! 41 tapes only hold
=> > =>  60TB), will it
=> >
=> > It'll depend a lot on the compressibility of your data.
=> >
=> > =>  simply pause, send me an email saying it's waiting for new media,
=> > =>  then I load 41 new tapes?  Then tell it to resume, and it uses the
=> > =>  next 41, ad nauseam?
=> >
=> > Yes, sort of.
=> >
=> > You'll get lots of mail from bacula about needing to change tapes.
=> >
=> > In my experience, changing tapes in the library while a backup is
=> > running must
=> > be done very carefully. I suggest that you not use the native ML6010 tools
=> > (touch pad on the library or web interface) to move tapes to-and-from the
=> > mailbox slots. Our procedure is:
=> >
=> >    use mtx to transfer full tapes from library slots to the mailbox slots
=> >
=> >    remove the full tapes from the mailbox slots
=> >
=> >    add new tapes to the mailbox slots
=> >
=> >    allow the library to scan the new tapes, then choose to add them
=> >    to "partition 1" (or whatever you have named your non-system partition
=> >    within the library)
=> >
=> >    use mtx to transfer the new tapes from the mailbox slots to available
=> >    slots in the library
=> >
=> >    when complete, run "update slots" from within the Bacula 'bconsole'
=> >
=> >    if the tapes have never been used within Bacula before, run "label
=> >    barcodes" from within 'bconsole'
=> >    
=> > =>
=> > =>  And, if I want to make 2 copies of the tapes, can I simply configure 2
=> > =>  differently named jobs that each backup the same fileset?
=> > =>
=> > =>  Also, do I need to manually "label" the tapes (electronically) as
=> > =>  I load
=> > =>  them, or will the fact that the autoloader automatically reads the new
=> > =>  barcodes be enough?
=> >
=> > You will need to logically label the tapes (writing a Bacula header to each
=> > tape). This can be done automatically with "label barcodes".
=> > =>
=> > =>  Thanks for any hints.  And, if you know any "gotchas" I should watch 
for
=> > =>  during this process, please let me know!  I don't want bacula expiring
=> > =>  the tapes ever, or re-using them, as the data will never change and we
=> > =>  need to keep it forever.
=> >
=> > Set the file/volume/job retention times to something really long. For
=> > us, "10 years" =~ "infinite", under the theory that after 10 years
=> > we'll have moved to
=> > different tape hardware and the old data will need to be transferred to the
=> > new media somehow.
=> >
=> > Make a backup of the Bacula database as soon as the backup is complete.
=> > Save that to both a backup tape and to some other media (external hard
=> > drive? multiple Blu-ray discs? punch cards?) so that you can recover
=> > data if there's
=> > ever a problem with the database--you do NOT want to be in a position of
=> > needing to "bscan" ~100x LTO5 tapes in order to rebuild the database.
=> >
=> > Mark
=> >
=> > =>
=> > =>  Many thanks,
=> > =>  erich
=> > =>
=> 
=> 
------------------------------------------------------------------------------
=> Virtualization & Cloud Management Using Capacity Planning
=> Cloud computing makes use of virtualization - but cloud computing 
=> also focuses on allowing computing to be delivered as a service.
=> http://www.accelacomm.com/jaw/sfnl/114/51521223/
=> _______________________________________________
=> Bacula-users mailing list
=> Bacula-users AT lists.sourceforge DOT net
=> https://lists.sourceforge.net/lists/listinfo/bacula-users
=> 



