Re: [Bacula-users] Bacula tape format vs. rsync on deduplicated file systems

2010-05-28 02:36:58
From: Eric Bollengier <eric.bollengier AT baculasystems DOT com>
To: bacula-users AT lists.sourceforge DOT net
Date: Fri, 28 May 2010 08:32:38 +0200
Hello Robert,

On Thursday 27 May 2010 23:18:12 Robert LeBlanc wrote:
> Spurred by the discussion last month on the Bacula mailing list about
> needing a new archive format when storing Bacula data on disks, I decided
> to do a little test.
> 
> The test set-up:
> * One lightly used file system ~30GB of mostly unchanging data, a good mix
> of documents, executables, images, videos etc.
> * Snapshot the file system using LVM, then use rsync and Bacula to backup
> the data.
> * The deduplication file system of choice was lessfs on EXT4 since it is
> available to anyone.
> * Three different block sizes for lessfs (16K, 32K and 64K) to see how much
> difference there would be between each block size
> * Bacula archive size was set to 10G so that one backup would span multiple
> volumes, very common in our environment
> * Test the final results with our DataDomain box
> 
> I took six snapshots of the original file system over the course of about
> 2.5 weeks. I then rsynced a copy to a non-deduplicating file system, then
> rsynced it to each of the three lessfs file systems (one for each block
> size) in a folder specified by the date. This would cause rsync to create
> a new copy of the data each time, since each rsync was in its own folder
> (named by the date of the rsync) instead of just syncing the changes after
> the first time. Bacula would do a full backup of the file system snapshot
> every time as well to a non-deduplicating file system, and those were
> rsynced to the three lessfs file systems without any folder structure so that
> only one copy of a volume would exist on each lessfs file system. At the
> conclusion of the test, I decided to dump the final raw rsync and Bacula
> data onto our DataDomain box as a comparison.

What would the result be if you ran Incremental backups instead of Full
backups? If you assume that 1% of the data changes per day, you get something
like
total_size = 30GB + 30GB * 0.01 * nb_days
(instead of 30GB * nb_days)

I'm quite sure it will give a "compression" like 19:1 for 20 backups...

This kind of comparison is the big argument of dedup companies: "do 20 full 
backups, and you will have a 20:1 dedup ratio". But do 19 incrementals + 1 full 
and this ratio falls to about 1:1... (That's not exactly true either, because
you can still save space when multiple systems hold the same data.)
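
A minimal sketch of the arithmetic above, assuming the figures used in this
thread (30GB full backup, 1% daily change, 20 days). The function names are
only for illustration, and the incremental model is the simple formula given
above, not a measurement.

```python
def full_only_size(full_gb, nb_days):
    """Space used when every daily backup is a Full."""
    return full_gb * nb_days


def full_plus_incremental_size(full_gb, change_rate, nb_days):
    """Space used by one Full plus daily Incrementals.

    Follows the formula from the mail:
    total_size = full + full * change_rate * nb_days
    """
    return full_gb + full_gb * change_rate * nb_days


full_gb = 30.0       # size of the file system in the test
change_rate = 0.01   # assumed: 1% of the data changes per day
nb_days = 20

fulls = full_only_size(full_gb, nb_days)
incr = full_plus_incremental_size(full_gb, change_rate, nb_days)
ratio = fulls / incr

print(f"{nb_days} Full backups:      {fulls:.1f} GB")
print(f"1 Full + Incrementals: {incr:.1f} GB")
print(f"Apparent 'dedup' ratio: {ratio:.1f}:1")
```

With these exact assumptions the ratio comes out at roughly 17:1, the same
order of magnitude as the rough estimate above; a lower change rate pushes it
closer to 20:1.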

> [image: backup.png]
> 
> This chart shows that using the rsync method, the data's compression grew in
> an almost linear fashion, while the Bacula data stayed close to 1x
> compression. My suspicion is that since the Bacula tape format inserts job
> information regularly into the stream file and lessfs uses a fixed block
> size, lessfs is not able to find much duplicate data in the Bacula
> stream.

You are right. We have a project under way to add a new device format that 
will be compatible with a dedup layer. I don't know yet how it will work, 
because I imagine that each dedup system works differently, and finding a 
common denominator won't be easy. A first proof of concept will certainly use 
LessFS (it is already on my radar). But as you said, depending on block 
size, alignment, etc., it's not so easy.
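
To illustrate why alignment matters for a fixed-block dedup layer, here is a
small sketch. It is not the real Bacula volume format: the per-file header in
as_stream() is a made-up layout, and the 16K block size is just one of the
sizes from the test. It compares storing files individually (rsync-style)
with storing them concatenated in one stream, after a 10-byte change to the
first file.

```python
import hashlib
import os

BLOCK = 16 * 1024  # fixed dedup block size (one of the sizes in the test)


def blocks(data, size=BLOCK):
    """Hashes of fixed-size blocks, as a fixed-block dedup layer would see them."""
    return {hashlib.sha256(data[i:i + size]).digest()
            for i in range(0, len(data), size)}


def as_stream(files, job_id):
    """Hypothetical archive layout: a small per-file header, then the file data."""
    out = bytearray()
    for name, data in files.items():
        out += f"job={job_id} file={name} len={len(data)}\n".encode()
        out += data
    return bytes(out)


def new_fraction(old_hashes, new_hashes):
    """Share of the blocks in the new backup that are not already stored."""
    return len(new_hashes - old_hashes) / len(new_hashes)


# Day 1: three files of random (incompressible) data.
day1 = {f"file{i}": os.urandom(20 * BLOCK) for i in range(3)}
# Day 2: only file0 changes; 10 bytes are added at its start.
day2 = dict(day1, file0=b"0123456789" + day1["file0"])

# rsync-style: every file is stored on its own, so blocks align per file.
per_file_old = set().union(*(blocks(d) for d in day1.values()))
per_file_new = set().union(*(blocks(d) for d in day2.values()))

# Stream-style: all files are concatenated, so a shift in one file
# moves the block boundaries of everything that follows it.
stream_old = blocks(as_stream(day1, job_id=1))
stream_new = blocks(as_stream(day2, job_id=2))

per_file_new_share = new_fraction(per_file_old, per_file_new)
stream_new_share = new_fraction(stream_old, stream_new)

print(f"per-file layout: {per_file_new_share:.0%} new blocks")
print(f"stream layout:   {stream_new_share:.0%} new blocks")
```

In the per-file layout only the changed file has to be stored again, while in
the stream layout the 10-byte shift propagates to every following block, so
nothing matches the previous backup. A variable block size system avoids this
by choosing block boundaries from the content itself.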

> Although Data Domain's variable block size feature allows it much
> better compression of Bacula data, rsync still achieved an almost 2x
> greater compression over Bacula.

The compression on disk is better; on the network layer and the remote disk 
I/O system, this is another story. BackupPC is smarter in this respect (but 
has problems with big sets of files).

> In conclusion, lessfs is a great file system and can benefit from variable
> block sizes, if it can be added, for both regular data and Bacula data.
> Bacula could also greatly benefit by providing a format similar to a native
> file system on lessfs and even a good benefit on DataDomain.

Yes, variable block size and dynamic alignment seem to be the leading edge of 
the technology, but they are also heavily covered by patents (and those 
companies are not very friendly). And I can imagine that it's easy to ask for 
them, but a little more complex to implement :-)

Bye

> Robert LeBlanc
> Life Sciences & Undergraduate Education Computer Support
> Brigham Young University
