Subject: Re: [Bacula-users] Bacula tape format vs. rsync on deduplicated file systems
From: Robert LeBlanc <robert AT leblancnet DOT us>
To: Eric Bollengier <eric.bollengier AT baculasystems DOT com>
Date: Fri, 28 May 2010 11:24:03 -0600
On Fri, May 28, 2010 at 10:48 AM, Eric Bollengier <eric.bollengier AT baculasystems DOT com> wrote:

First, thank you for the kind replies; this is helping me to make sure I see the big picture.

On Friday, May 28, 2010 at 16:42, Robert LeBlanc wrote:
> On Fri, May 28, 2010 at 12:32 AM, Eric Bollengier <eric.bollengier AT baculasystems DOT com> wrote:
> > Hello Robert,
> > What would be the result if you do Incremental backup instead of full
> > backup ?
> > Imagine that you have 1% changes by day, it will give something like
> > total_size = 30GB + 30GB*0.01 * nb_days
> > (instead of 30GB * nb_days)
> >
> > I'm quite sure it will give a "compression" like 19:1 for 20 backups...
> >
> > This kind of comparison is the big argument of dedup companies, "do 20
> > full backup, and you will have 20:1 dedup ratio", but do 19 incremental
> > + 1 full and this ratio will fall down to 1:1... (It's not exactly true
> > either, because
> > you can save space with multiple systems having the same data)
>
> The idea was to in some ways simulate a few things all at once. This kind
> of test could show how multiple similar OSes could dedupe (20 Windows OSes,
> for example: you only have to store those bits once for any number of
> Windows machines); using Bacula's incrementals, you have to store the bits
> once per machine

In this particular case, you can use the BaseJob file-level deduplication, which
allows you to store only one version of each OS. (But I admit that if the
system can do it automatically, it's better.)

I agree. I haven't looked into BaseJobs yet because they are not the easiest thing to understand, and since I'm very pressed for time, I don't have a lot of time to commit to reading. I plan on understanding it, but when a system can do it automatically and transparently, I like that a lot.
 
> and then again when you do your next full each week or month.

Why do you want to schedule a Full backup every week? With the Accurate option, you
can adopt incremental-forever (a Differential can limit the number of
incrementals needed for a restore).

If it's to have multiple copies of a particular file (which I like to advise
when using tapes), then since deduplication will turn multiple copies into a
single instance, I think the result is very similar.

We are using Accurate jobs on a few machines; however, I have not scheduled the roll-ups yet, as I haven't had time to read the manual thoroughly. I need to do it soon, since I have months of incrementals without any fulls in between. I do like having multiple copies of my files on tape; on disk, not so much. The reason is that I've had tapes go bad, whereas with disk I have a lot of redundancy built in.

> It also was to show how much you could save when doing your fulls
> each week or month; a similar effect would happen for the differentials
> too. It wasn't meant to be all-inclusive, but just to show some trends
> that I was interested in.

Yes, but comparing 20 full backups with 20 full copies on deduplication is
like comparing apples and oranges... At least it should appear somewhere that
you chose the worst case for Bacula and the best case for deduplication :-)

Please remember that the Bacula tape files were on a lessfs file system, so the same amount of data was written using rsync and Bacula, just in different formats on lessfs. So the best-case scenario is that they should have had the same dedupe rate. The idea was to see how both formats fared on lessfs.
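
To make the size comparison concrete, here is a rough back-of-the-envelope sketch in Python, using the hypothetical numbers from earlier in the thread (30GB of data, 1% daily change, 20 backups). Real ratios on lessfs will obviously differ:

# Rough sizing sketch: 20 daily backups of a 30 GB data set with ~1% change/day.
# The numbers are the hypothetical ones from this thread, not measured values.

data_gb = 30.0       # size of the data set
change_rate = 0.01   # fraction of the data that changes per day
days = 20            # number of backups kept

# 20 full backups with no deduplication: every run stores everything again.
fulls_raw = data_gb * days

# 1 full + 19 incrementals: each incremental stores only the changed data.
incrementals = data_gb + data_gb * change_rate * (days - 1)

# 20 fulls on a block-level dedup store: roughly one copy of the unchanged
# blocks plus the changed blocks from each day (ignoring metadata overhead).
fulls_deduped = data_gb + data_gb * change_rate * (days - 1)

print(f"20 fulls, raw:            {fulls_raw:6.1f} GB")
print(f"1 full + 19 incrementals: {incrementals:6.1f} GB")
print(f"20 fulls, deduplicated:   {fulls_deduped:6.1f} GB")
print(f"apparent dedup ratio:     {fulls_raw / fulls_deduped:.1f}:1")

Which is really Eric's point: the impressive ~17:1 ratio mostly reflects storing the same full 20 times, and an incremental scheme lands in roughly the same place without any deduplication at all.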
 
> In our environment, since everything is virtual,
> we don't save the OS data, and only try to save the minimum that we need,
> that doesn't work for everyone though.

Yes, this is another very common way to do it, and I agree that sometimes you
can't do that.

It's also very practical to just rsync the whole disk and let LessFS do its
job. If you want to browse the backup, it's just a directory. With Bacula,
since incremental/full/differential backups are presented in a virtual tree,
that's not needed.

Understandable. In a disaster recovery situation with Bacula, if the on-disk format were a tree, you could browse to the latest backup of your catalog, import it, and off you go. Right now, I have no clue which of the 100 tapes I have holds the latest catalog backup; I would have to scan them all, and if the backup spans tapes, I would have to figure out what order to scan the tapes in to recover the backup, which could take forever. Now that I've thought about it, I think it's time for a new pool for catalog backups, sigh.

> I think in some ways each dedupe file system can work very well with each
> file stored on its own instead of being in a stream. That way the start of the
> file is always on a boundary that the deduplication file system uses. I
> think you might be able to use sparse files for a stream and always sparse
> up to the block alignment,

I'm not very familiar with sparse files, but I'm pretty sure that the "sparse
unit" is a block. So if a block is completely empty, OK, but if you have some bytes used
inside that block, it will take 4KB.

I'm not an expert with sparse files, so I'm not sure what the limitations are. My experience is with VMs, where a sparse file is created: the file appears to have all of its space allocated, but it does not actually take that space on the fs. How much "fast-forwarding" you can do in a sparse file, I'm not sure, but quite a bit, as evidenced by its use with VMs. I'm thinking of the Bacula sparse file format as being like a VM sparse disk. I guess you could put an FS in the sparse file and that should handle alignment to a point, but it seems like a lot of overhead just to encapsulate the data.
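
For what it's worth, here is a small Python sketch of how sparse allocation behaves on a typical Linux filesystem (the path is just an example, and the 4KB figure assumes the common default block size):

import os

# Create a file with a huge "apparent" size but only 200 real bytes, by
# seeking far past the end before writing. The path is hypothetical.
path = "/tmp/sparse-test.img"

with open(path, "wb") as f:
    f.seek(10 * 1024 * 1024 * 1024)   # "fast-forward" 10 GB without writing anything
    f.write(b"x" * 200)               # 200 real bytes at the very end

st = os.stat(path)
print("apparent size :", st.st_size, "bytes")           # what ls -l reports
print("allocated     :", st.st_blocks * 512, "bytes")   # what the file really occupies

os.remove(path)

On ext3/ext4 the allocated size comes back as roughly one 4KB block (plus a little metadata), which matches Eric's point: the sparse unit is a block, so 200 used bytes still cost a full block on a non-deduplicating, non-compressing fs.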

> that would make the stream file look really large
> compared to what it actually uses on a non-deduped file system. I still
> think that if Bacula lays the data down in the same file structure as on the
> client, organized by jobID, with some small Bacula files to hold permissions,
> etc., it would be the most flexible for all dedupe file systems,
> because it would be individual files like they are expecting.

Yes, that would be a way to do it, but we still have the problem of alignment and
free space in blocks. If I remember well, LessFS uses LZO to compress data,
so we can imagine that a 4KB block with only 200 bytes used should be very small at
the end. This could be a very interesting test: just write X blocks with 200
(random) bytes each, and see whether it takes X*4KB or ~ X*compress(200 bytes).

It would also allow storing metadata in special blocks. So the basic
modification would be to start each new file's data stream in a new block :)

I don't know the details, but maybe a lessfs guy could clarify this some.
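
If anyone wants to run the test Eric describes, a quick Python sketch might look like this (the mount point is hypothetical; it writes X records of 200 random bytes, each padded with zeros out to a 4KB boundary, then compares the apparent size with what the filesystem says it allocated):

import os

# Sketch of the proposed test: X "blocks", each with 200 random bytes at the
# start and zero padding out to 4 KB. On a compressing/deduplicating fs such
# as lessfs, the padding should mostly disappear. The mount point is an example.
path = "/mnt/lessfs/block-align-test.dat"
block_size = 4096
records = 1000  # X

with open(path, "wb") as f:
    for _ in range(records):
        payload = os.urandom(200)                    # 200 random bytes
        f.write(payload.ljust(block_size, b"\x00"))  # zero-pad to the 4 KB boundary

st = os.stat(path)
print("apparent size :", st.st_size, "bytes")            # records * 4096
print("allocated     :", st.st_blocks * 512, "bytes")    # what is really stored

If the allocated size (or du on the underlying lessfs data store, which may be more reliable than st_blocks through FUSE) ends up much closer to records*200 than to records*4096, then the space "wasted" by starting each file's data on a block boundary is largely reclaimed by the LZO compression, as Eric suggests.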
 
> > The compression on disk is better; on the network layer and the remote IO
> > disk system, this is another story. BackupPC is smarter on this part (but
> > has problems with big sets of files).
>
> I'm not sure I understand exactly what you mean. I understand that BackupPC
> can cause a file system to not mount because it exhausts the number of hard
> links the fs can support.

Yes, this is true (at least on ext3). What I'm saying is that when you rsync to a new
directory, you have to read the entire disk (30GB in your case) and
transmit it over the network. With an incremental, you just read and transfer
the modified data (1 to 10% of the 30GB).

I'm not sure about BackupPC, but it can certainly avoid transferring files if they
have not changed.

Yes, in the real world I would not rsync into a new directory; it was just to have something comparable to the full backup that Bacula was doing and see how well both methods would dedupe compared to each other.

> Luckily, with a deduplication file system, you don't
> have this problem, because you just copy the bits and the fs does the work
> of finding the duplicates. A dedupe fs can even store only a small part of
> a file (if most of the file is duplicate and only a small part is unique)
> where BackupPC would have to write that whole file.

Yes, for sure. Do you have an idea of which kinds of files have only a few bytes
that change over time? (Database files, C/C++ files, ...) For example,
big OpenOffice files are compressed, and the data can change almost everywhere.

Mostly database-type files (logs especially), system log files, uncompressed TIFF files (we have a lot of those), large DNA sequences, etc. Most other files will change a significant portion of the file when modified.

> That is one of the reasons I asked if it could be implemented. If there is
> anything I know about OSS, it's that there are some amazing people with an
> ability to think so far outside the box that these things have not been able to
> stop the progress of OSS.

One thing that can stop the progress of OSS is software patents... Luckily, most
of the Bacula code is written in Europe and the copyright is owned by FSF
Europe, where software patents are not valid, but who knows what the software lobby
can do...

I have recently been torn over a company that has some good innovations and hardware, but whose political agenda is to send a takedown for any little patent that is infringed (and not to everyone; they are targeting certain companies). I like the hardware, but I don't like their stance on patents when they have stolen their fair share in the past. It's a double standard that really bugs me.
 
Robert LeBlanc
Life Sciences & Undergraduate Education Computer Support
Brigham Young University


------------------------------------------------------------------------------

_______________________________________________
Bacula-users mailing list
Bacula-users AT lists.sourceforge DOT net
https://lists.sourceforge.net/lists/listinfo/bacula-users