Bacula-users

Re: [Bacula-users] Bacula tape format vs. rsync on deduplicated file systems

2010-05-28 12:50:01
From: Eric Bollengier <eric.bollengier AT baculasystems DOT com>
To: Robert LeBlanc <robert AT leblancnet DOT us>
Date: Fri, 28 May 2010 18:48:46 +0200
On Friday 28 May 2010 16:42:01, Robert LeBlanc wrote:
> On Fri, May 28, 2010 at 12:32 AM, Eric Bollengier
> <eric.bollengier AT baculasystems DOT com> wrote:
> > Hello Robert,
> > What would be the result if you do Incremental backup instead of full
> > backup ?
> > Imagine that you have 1% changes by day, it will give something like
> > total_size = 30GB + 30GB*0.01 * nb_days
> > (instead of 30GB * nb_days)
> > 
> > I'm quite sure it will give a "compression" like 19:1 for 20 backups...
> > 
> > This kind of comparison is the big argument of dedup companies: "do 20
> > full backups, and you will have a 20:1 dedup ratio". But do 19
> > incrementals + 1 full and this ratio falls to about 1:1... (That's not
> > exactly true either, because you can still save space when multiple
> > systems hold the same data)
> 
> The idea was to in some ways simulate a few things all at once. This kind
> of test could show how multiple similar OSes could dedupe (20 Windows OS
> for example, you only have to store those bits once for any number of
> Windows machines), using Bacula's incrementals, you have to store the bits
> once per machine

In this particular case, you can use the BaseJob file level deduplication that 
allows you to store only one version of each OS. (But I admit that if the 
system can do it automatically, it's better)


> and then again when you do your next full each week or month.

Why do you want to schedule a Full backup every week? With the Accurate option,
you can adopt an Incremental-forever scheme (a Differential can limit the
number of Incrementals needed for a restore).

If the goal is to keep multiple copies of a particular file (which I like to
advise when using tapes), deduplication will turn those copies into a single
instance anyway, so I think the result is very similar.

> It also was to show how much you could save when doing your fulls
> each week or month, a similar effect would happen for the differentials
> too. It wasn't meant to be all inclusive, but just to show some trends
> that I was interested in.

Yes, but comparing 20 full backups with 20 deduplicated full copies is like
comparing apples and oranges... At the very least, it should be stated
somewhere that you chose the worst case for Bacula and the best case for
deduplication :-)
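
To put rough numbers on this, here is a small sketch of the estimate quoted at
the top of this message (30GB full, 1% change per day). The function names are
only illustrative, and it assumes 1 full followed by 19 incrementals, so the
result differs slightly from plugging nb_days = 20 into the formula above:

```python
def fulls_only_gb(full_gb, nb_backups):
    """Space used when every backup is a full backup."""
    return full_gb * nb_backups


def full_plus_incrementals_gb(full_gb, change_rate, nb_backups):
    """Space used by 1 full followed by (nb_backups - 1) incrementals."""
    return full_gb + full_gb * change_rate * (nb_backups - 1)


full_gb = 30.0        # size of one full backup, from the example above
change_rate = 0.01    # assumed 1% of the data changes per day
nb_backups = 20

fulls = fulls_only_gb(full_gb, nb_backups)
mixed = full_plus_incrementals_gb(full_gb, change_rate, nb_backups)
ratio = fulls / mixed

print(f"20 fulls:               {fulls:.1f} GB")
print(f"1 full + 19 increments: {mixed:.1f} GB")
print(f"space ratio:            {ratio:.1f}:1")
```

With these assumptions the ratio comes out a little under 17:1, which is the
same order of magnitude as the rough 19:1 figure.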

> In our environment, since everything is virtual,
> we don't save the OS data, and only try to save the minimum that we need,
> that doesn't work for everyone though.

Yes, this is another very common approach, and I agree that sometimes you
can't do that.

It's also very practical to just rsync the whole disk and let LessFS do its
job. If you want to browse the backup, it's just a directory. With Bacula, this
isn't needed, since incremental/full/differential backups are presented in a
virtual tree.

> 
> > > [image: backup.png]
> > > 
> > > This chart shows that using the sync method, the data's compression
> > > grew in almost a linear fashion, while the Bacula data stayed close
> > > to 1x compression. My suspicion is that since the Bacula tape format
> > > inserts job information regularly into the stream file and lessfs
> > > uses a fixed block size, lessfs is not able to find much unique data
> > > in the Bacula stream.
> > 
> > You are right, we have a current project to add a new device format
> > that will be compatible with a dedup layer. I don't know yet how it
> > will work, because I can imagine that each dedup system works
> > differently, and finding a common denominator won't be easy. A first
> > proof of concept will certainly use LessFS (it is already on my
> > radar). But as you said, depending on block size, alignment, etc.,
> > it's not so easy.
> 
> I think in some ways, each dedupe file system can work very well with each
> file as its own instead of being in a stream. That way the start of the
> file is always on a boundary that the deduplication file system uses. I
> think you might be able to use sparse files for a stream and always sparse
> up the block alignment,

I'm not very familiar with sparse files, but I'm pretty sure that the "sparse
unit" is a block. So an empty block costs nothing, but as soon as a few bytes
are used inside a block, it takes the full 4KB.
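
This is easy to check on a given file system. A minimal sketch, assuming a
POSIX system where st_blocks is reported in 512-byte units (the file name and
sizes are arbitrary): it creates a 1MB hole followed by 200 bytes, then
compares the apparent size with the space actually allocated.

```python
import os
import tempfile


def allocated_bytes(path):
    """Space actually allocated on disk (st_blocks counts 512-byte units)."""
    return os.stat(path).st_blocks * 512


def make_sparse_file(path, hole_size, payload):
    """Seek past a hole of hole_size bytes, then write a small payload."""
    with open(path, "wb") as f:
        f.seek(hole_size)
        f.write(payload)


with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "sparse.bin")
    make_sparse_file(path, 1024 * 1024, b"x" * 200)
    apparent = os.path.getsize(path)
    allocated = allocated_bytes(path)

print(f"apparent size: {apparent} bytes")
print(f"allocated:     {allocated} bytes")
```

On a file system that supports sparse files, the allocated figure should be
about one block for the 200 bytes rather than the full apparent size, but the
exact number depends on the file system.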

> that would make the stream file look really large
> compared to what it actually uses on a non deduped file system. I still
> think if Bacula lays the data down in the same file structure as on the
> client organized by jobID with some small bacula files to hold permissions,
> etc I think it would be the most flexible for all dedupe file systems
> because it would be individual files like they are expecting.

Yes, that would be one way to do it, but we would still have the problem of
alignment and unused space in blocks. If I remember correctly, LessFS uses LZO
to compress data, so we can imagine that a 4KB block holding only 200 bytes
would end up very small. This could be a very interesting test: just write X
blocks with 200 (random) bytes each, and see whether it takes X*4KB or
~ X*compress(200 bytes).
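
A rough offline version of that test could look like the sketch below. It is
only an estimate: zlib stands in for LZO (which is not in the Python standard
library), and compressing each 4KB block independently is an assumption about
how the file system behaves. A real measurement would write the blocks to a
LessFS mount and look at the space used.

```python
import os
import zlib

BLOCK_SIZE = 4096     # assumed file system block size
PAYLOAD_SIZE = 200    # useful bytes per block, as in the test described above


def make_block(payload_size=PAYLOAD_SIZE, block_size=BLOCK_SIZE):
    """One block: random payload followed by zero padding."""
    return os.urandom(payload_size) + b"\0" * (block_size - payload_size)


def estimate(nb_blocks):
    """Return (raw bytes, bytes after per-block compression)."""
    raw_total = 0
    compressed_total = 0
    for _ in range(nb_blocks):
        block = make_block()
        raw_total += len(block)
        # zlib is a stand-in for LZO here, not what LessFS actually uses
        compressed_total += len(zlib.compress(block))
    return raw_total, compressed_total


nb_blocks = 1000
raw_total, compressed_total = estimate(nb_blocks)

print(f"X * 4KB:           {raw_total} bytes")
print(f"X * compress(4KB): {compressed_total} bytes")
```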

It would also allow metadata to be stored in special blocks. So the basic
modification would be to start every new file data stream in a new block :)
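
To illustrate why the alignment matters for a fixed-block dedup layer, here is
a toy simulation. It is not the Bacula volume format: a single hypothetical
header per job (of a different length for each job) stands in for the job
information, and SHA-256 over 4KB blocks stands in for the dedup engine. The
same three files are packed twice, once back-to-back and once with every
stream starting on a block boundary.

```python
import hashlib
import os

BLOCK_SIZE = 4096  # assumed fixed block size of the dedup layer


def block_hashes(stream, block_size=BLOCK_SIZE):
    """Set of hashes of the fixed-size blocks making up a stream."""
    return {
        hashlib.sha256(stream[i:i + block_size]).digest()
        for i in range(0, len(stream), block_size)
    }


def pack_unaligned(header, files):
    """Header and file data written back-to-back."""
    return header + b"".join(files)


def pack_aligned(header, files, block_size=BLOCK_SIZE):
    """Header and each file padded so the next one starts on a new block."""
    out = bytearray()
    for chunk in [header, *files]:
        out += chunk
        out += b"\0" * (-len(out) % block_size)
    return bytes(out)


# The same file data backed up by two jobs with different (made-up) headers
files = [os.urandom(10000), os.urandom(5000), os.urandom(8192)]
header_job1 = b"J" * 100
header_job2 = b"K" * 137

shared_unaligned = len(
    block_hashes(pack_unaligned(header_job1, files))
    & block_hashes(pack_unaligned(header_job2, files))
)
shared_aligned = len(
    block_hashes(pack_aligned(header_job1, files))
    & block_hashes(pack_aligned(header_job2, files))
)

print(f"blocks shared, unaligned: {shared_unaligned}")
print(f"blocks shared, aligned:   {shared_aligned}")
```

In the unaligned layout the 37-byte difference between the headers shifts all
the data, so no block matches; in the aligned layout every file block is found
again, at the cost of the padding.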


> > > Although Data Domain's variable block size feature allows it much
> > > better compression of Bacula data, rsync still achieved an almost 2x
> > > greater compression over Bacula.
> > 
> > The compression on disk is better; on the network layer and the remote
> > disk IO system, it's another story. BackupPC is smarter on this part
> > (but has problems with big sets of files).
> 
> I'm not sure I understand exactly what you mean. I understand that BackupPC
> can cause a file system to not mount because it exhausts the number of hard
> links the fs can support.

Yes, this is true (at least on ext3). What I'm saying is that when you rsync to
a new directory, you have to read the entire disk (30GB in your case) and
transmit it over the network. With an incremental, you only read and transfer
the modified data (1 to 10% of the 30GB).

I'm not sure about BackupPC, but it can certainly avoid transferring files that
have not changed.

> Luckily, with a deduplication file system, you don't
> have this problem, because you just copy the bits and the fs does the work
> of finding the duplicates. A dedupe fs can even only store a small part of
> a file (if most of the file is duplicate and only a small part is unique)
> where BackupPC would have to write that whole file.

Yes, for sure. Do you have an idea of which kinds of files have only a few
bytes that change over time? (Database files, C/C++ files, ...). For example,
big OpenOffice files are compressed, so data can change almost everywhere.

> I don't want Bacula to
> adopt what BackupPC is doing, I think it's a step backwards.
> 
> > > In conclusion, lessfs is a great file system and can benefit from
> > > variable block sizes, if it can be added, for both regular data and
> > > Bacula data. Bacula could also greatly benefit by providing a format
> > > similar to a native file system on lessfs and even a good benefit on
> > > DataDomain.
> > 
> > Yes, variable block size and dynamic alignment seem to be the cutting
> > edge of the technology, but they are also heavily covered by patents
> > (and those companies are not very friendly). And I can imagine that
> > it's easy to ask for them, and a little more complex to implement :-)
> 
> One of the reasons I mentioned if it could be implemented. If there is
> anything I know about OSS, it is that there are some amazing people with an
> ability to think so outside the box that these things have not been able to
> stop the progress of OSS.

One thing can stop the progress of OSS: software patents... Fortunately, most
of the Bacula code is written in Europe and the copyright is owned by FSF
Europe, where software patents are not valid, but who knows what the software
lobby can do...

Bye

> Robert LeBlanc
> Life Sciences & Undergraduate Education Computer Support
> Brigham Young University

-- 
Need professional help and support for Bacula ?
Visit http://www.baculasystems.com
