Subject: Re: [Bacula-users] Bacula tape format vs. rsync on deduplicated file systems
From: Robert LeBlanc <robert AT leblancnet DOT us>
To: Eric Bollengier <eric.bollengier AT baculasystems DOT com>
Date: Fri, 28 May 2010 08:42:01 -0600
On Fri, May 28, 2010 at 12:32 AM, Eric Bollengier <eric.bollengier AT baculasystems DOT com> wrote:
Hello Robert,
What would be the result if you did Incremental backups instead of Full backups?
Imagine that you have 1% of changes per day; it will give something like
total_size = 30GB + 30GB * 0.01 * nb_days
(instead of 30GB * nb_days)

I'm quite sure it will give a "compression" like 19:1 for 20 backups...

This kind of comparison is the big selling point of dedup companies: "do 20 full
backups, and you will have a 20:1 dedup ratio". But do 19 incrementals + 1 full
and this ratio will fall to about 1:1... (It's not exactly true either, because
you can still save space across multiple systems holding the same data.)
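
Plugging Eric's example numbers into the formula makes the point concrete. A
quick sketch (Python; the 30GB full and 1% daily change rate are just the
figures above, and with them the ratio comes out around 17:1, approaching 20:1
as the change rate shrinks):

    full_gb = 30.0  # size of one Full backup (Eric's example)
    rate = 0.01     # 1% of the data changes per day
    n_days = 20     # number of backups kept

    logical = full_gb * n_days                  # 20 Fulls as the dedup box sees them
    stored = full_gb + full_gb * rate * n_days  # one Full plus the daily changes

    print(f"20 deduped Fulls: {logical:.0f}GB logical / {stored:.0f}GB stored"
          f" = {logical / stored:.1f}:1")
    print(f"1 Full + 19 Incrementals: ~{stored:.0f}GB stored, ratio ~1:1")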

The idea was, in some ways, to simulate a few things at once. This kind of test can show how well multiple similar OSes dedupe (with 20 Windows machines, for example, you only have to store those bits once no matter how many machines there are), whereas with Bacula's incrementals you have to store the bits once per machine and then again at the next full each week or month. It was also meant to show how much you could save on the fulls themselves; a similar effect would apply to differentials too. It wasn't meant to be all-inclusive, just to show some trends I was interested in. In our environment, since everything is virtual, we don't save the OS data and only try to save the minimum that we need, though that doesn't work for everyone.


> [image: backup.png]
>
> This chart shows that with the rsync method, the data's compression grew in
> an almost linear fashion, while the Bacula data stayed close to 1x
> compression. My suspicion is that since the Bacula tape format regularly
> inserts job information into the stream file and lessfs uses a fixed block
> size, lessfs is not able to find much duplicate data in the Bacula
> stream.

You are right; we have a current project to add a new device format that will
be compatible with a dedup layer. I don't know yet how it will work, because I
can imagine that each dedup system works differently, and finding a common
denominator won't be easy. A first proof of concept will certainly use LessFS
(it is already on my radar). But as you said, with block size, alignment,
etc., it's not so easy.
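
The alignment problem is easy to demonstrate. Here is a toy sketch (Python; the
16-byte job header is hypothetical, not Bacula's actual volume format): hash
the same megabyte of data once as plain 4KB blocks and once behind a small
interleaved header, the way a fixed-block dedup layer such as lessfs would see
it, and essentially nothing dedupes.

    import hashlib
    import os

    BLOCK = 4096  # fixed dedup block size (lessfs-style)

    def block_hashes(data):
        # Hash fixed-size blocks the way a fixed-block dedup layer would.
        return {hashlib.sha1(data[i:i + BLOCK]).digest()
                for i in range(0, len(data), BLOCK)}

    payload = os.urandom(1 << 20)               # 1MB of "file data"
    as_files = payload                          # laid down as plain files
    as_stream = b"JOB-HEADER:0042\n" + payload  # same data behind 16 bytes of metadata

    shared = block_hashes(as_files) & block_hashes(as_stream)
    print(f"blocks shared after the 16-byte shift: {len(shared)} of 256")
    # Prints 0 of 256: once the data shifts off the block grid, every
    # 4KB block hashes differently and nothing dedupes.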

I think in some ways each dedupe file system can work very well with each file stored on its own instead of inside a stream; that way the start of every file falls on a boundary that the deduplication file system uses. You might be able to use sparse files for a stream and always pad up to the block alignment, although that would make the stream file look really large compared to what it actually uses on a non-deduped file system. I still think that if Bacula laid the data down in the same file structure as on the client, organized by jobID, with some small Bacula files to hold permissions, etc., that would be the most flexible for all dedupe file systems, because it would present individual files the way they are expecting.
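
The sparse idea could look something like this (purely a sketch of the concept,
not an existing Bacula feature; the file names are made up). Each client file
is appended to the stream and the writer then seeks forward to the next block
boundary, so the padding is a hole that costs nothing on disk:

    import os

    BLOCK = 4096  # block size of the dedup file system underneath

    def append_aligned(vol, path):
        # Append one client file, then seek forward so the next file
        # starts on a BLOCK boundary; the gap stays a sparse hole.
        with open(path, "rb") as src:
            vol.write(src.read())
        pad = (-vol.tell()) % BLOCK
        if pad:
            vol.seek(pad, os.SEEK_CUR)
            vol.truncate()  # extend to the new offset without writing data

    with open("volume.img", "wb") as vol:
        for name in ("etc.tar", "home.tar"):  # hypothetical job files
            append_aligned(vol, name)
    # "ls -ls volume.img" now shows an apparent size larger than the blocks
    # actually allocated -- the "really large" stream file described above.
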
> Although Data Domain's variable block size feature allows it much
> better compression of Bacula data, rsync still achieved almost 2x
> greater compression than Bacula.

The compression on disk is better; on the network layer and the remote disk IO
system, it is another story. BackupPC is smarter on this part (but has
problems with big sets of files).

I'm not sure I understand exactly what you mean. I understand that BackupPC can cause a file system to fail to mount because it exhausts the number of hard links the fs can support. Luckily, with a deduplicating file system you don't have this problem: you just copy the bits and the fs does the work of finding the duplicates. A dedupe fs can even store only the small part of a file that is unique (when most of the file is duplicate), where BackupPC would have to write the whole file. I don't want Bacula to adopt what BackupPC is doing; I think it's a step backwards.
> In conclusion, lessfs is a great file system and, for both regular data and
> Bacula data, can benefit from variable block sizes if that can be added.
> Bacula could also greatly benefit by providing a format similar to a native
> file system on lessfs, and would see a good benefit on Data Domain too.

Yes, variable block size and dynamic alignment seem to be the cutting edge of
the technology, but they are also heavily covered by patents (and those
companies are not very friendly). And I can imagine that it's easy to ask for
these features, and a little more complex to implement them :-)
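
For what it's worth, the core idea behind variable block sizes
(content-defined chunking) fits in a few lines, even if the patented
production variants are far more involved. A toy sketch using a shift-based
rolling hash; all the constants are just illustrative:

    import os

    MASK = (1 << 12) - 1  # match 12 low bits => ~4KB average chunks
    MIN_CHUNK = 512       # lower bound so chunks cannot get absurdly small

    def chunks(data):
        # Cut wherever the low bits of a rolling hash line up, so chunk
        # boundaries follow the content itself and survive insertions
        # that would break fixed-block alignment. In 32 bits each byte's
        # influence shifts out after 32 more bytes: an implicit window.
        h, start = 0, 0
        for i, byte in enumerate(data):
            h = ((h << 1) + byte) & 0xFFFFFFFF
            if i - start + 1 >= MIN_CHUNK and (h & MASK) == MASK:
                yield data[start:i + 1]
                h, start = 0, i + 1
        if start < len(data):
            yield data[start:]

    data = os.urandom(1 << 20)
    shifted = b"7 bytes" + data  # a small insertion at the front
    a, b = set(chunks(data)), set(chunks(shifted))
    print(f"chunks still shared despite the shift: {len(a & b)} of {len(a)}")
    # Nearly all chunks survive the insertion; a fixed-block scheme shares none.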

That's one of the reasons I said "if it can be added". If there is anything I know about OSS, it is that there are some amazing people with an ability to think so far outside the box that these things have not been able to stop the progress of OSS.

Robert LeBlanc
Life Sciences & Undergraduate Education Computer Support
Brigham Young University

_______________________________________________
Bacula-users mailing list
Bacula-users AT lists.sourceforge DOT net
https://lists.sourceforge.net/lists/listinfo/bacula-users