Subject: Re: [BackupPC-users] improving the deduplication ratio
From: Tino Schwarze <backuppc.lists AT tisc DOT de>
To: backuppc-users AT lists.sourceforge DOT net
Date: Mon, 14 Apr 2008 20:20:36 +0200
On Mon, Apr 14, 2008 at 10:09:57AM +0200, Ludovic Drolez wrote:

> > How long are you willing to have your backups and restores take? If  
> > you do more processing on the backed up files, you'll take a greater  
> 
> Not true:
> - working with fixed-size chunks may improve speed, because the
> algorithms can be optimized for one chunk size (MD5, compression, etc.)
> - if you implement block-level deduplication so that only the last
> 64 kB of a log file are backed up, instead of the full 5 MB file, do
> you think it will take longer to write 64 kB than 5 MB?
> 
> Combining file-level and block-level deduplication will improve both
> BackupPC's performance and its space savings.

Hm. Rsync has a --block-size option, so this should be doable. Of
course, you shouldn't underestimate the cost of managing a lot of small
files: my pool already has about 5 million files, some of them pretty
large, and with chunking it would hold even more files, which means
more seeking and more lookups to piece files back together from their
blocks.
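
To put numbers on that, here is a minimal Python sketch of fixed-size
chunking into a hash-keyed pool. It is purely illustrative, not
BackupPC code; the 64 KiB chunk size and the plain per-chunk MD5 naming
are assumptions made for the example:

import hashlib
import os

CHUNK_SIZE = 64 * 1024  # 64 KiB chunks - illustrative, not a BackupPC setting

def store_chunks(path, pool_dir):
    """Split a file into fixed-size chunks and store each chunk once,
    keyed by its MD5 digest. Returns the list of digests (the "recipe")
    needed to reassemble the file on restore."""
    recipe = []
    with open(path, "rb") as f:
        while True:
            chunk = f.read(CHUNK_SIZE)
            if not chunk:
                break
            digest = hashlib.md5(chunk).hexdigest()
            chunk_path = os.path.join(pool_dir, digest)
            if not os.path.exists(chunk_path):  # new content -> one more pool file
                with open(chunk_path, "wb") as out:
                    out.write(chunk)
            recipe.append(digest)
    return recipe

For the 5 MB log file that only grew by 64 kB, all but the last chunk
is already in the pool, so only one new chunk gets written - that is
the win. The flip side is that the file now occupies roughly 80 pool
entries instead of one.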

IIRC, the rsync-style backup currently works like this: the remote file
is rsync'ed against a previous version of that file from the same host.
After the backup is done, BackupPC_link goes through the received files
and either links them into the pool if they're new, or removes them and
replaces them with hardlinks to the existing pool copies.
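
In rough Python, the pooling step comes down to something like this
sketch (a simplification - the real BackupPC_link handles compression,
hash collisions and a nested pool directory layout; hashing the whole
file with MD5 here is just for brevity):

import hashlib
import os

def pool_file(received_path, pool_dir):
    """Either add a freshly received file to the pool (new content) or
    replace it with a hardlink to the existing pool copy, so the pool
    file's hardlink count tracks how many backups still use it."""
    md5 = hashlib.md5()
    with open(received_path, "rb") as f:
        for block in iter(lambda: f.read(1 << 20), b""):
            md5.update(block)
    pool_path = os.path.join(pool_dir, md5.hexdigest())
    if os.path.exists(pool_path):
        os.unlink(received_path)           # duplicate: drop our copy ...
        os.link(pool_path, received_path)  # ... and hardlink the pooled one instead
    else:
        os.link(received_path, pool_path)  # new content: link it into the pool
    return pool_path

The hardlink count is what keeps cleanup cheap: a pool file whose link
count has dropped to 1 is no longer referenced by any backup and can be
removed.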

File chunking would add a new abstraction layer: a file would have to
be split into chunks on backup and reassembled on restore. Currently
you can go to a host's backup directory, take a file and use it
directly if it is uncompressed; if it's compressed, you've got to use
BackupPC_zcat anyway. Whether a file in the pool is still used by some
backup is currently tracked by the file system itself via the hardlink
count. So either we drop that hardlink scheme altogether (which would
make pool cleanup very expensive), or we need to invent a sane way to
hardlink the 31250 chunks of a 2 GB file into a directory. And there
are files a lot larger than 2 GB around here - I've got some VMware
images in backup (which shouldn't be there, I know), and I'm not fond
of adding another 500000 files to the file system just because one
image is split into 64 kB chunks.
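
The chunk counts are just file size divided by chunk size. A quick
back-of-the-envelope sketch, assuming 64 kB = 64000 bytes (which is
what reproduces the figures above) and a roughly 32 GB image - the
image size is inferred from the 500000 figure, not measured:

CHUNK_SIZE = 64 * 1000  # 64 kB (decimal) - reproduces the chunk counts quoted above

def chunk_count(file_size_bytes):
    """How many pool entries one file becomes with fixed-size chunking."""
    return -(-file_size_bytes // CHUNK_SIZE)  # ceiling division

print(chunk_count(2 * 10**9))   # 2 GB file           -> 31250 chunks
print(chunk_count(32 * 10**9))  # ~32 GB VMware image -> 500000 chunks (assumed size)

Even spread over hash-prefix subdirectories, that is half a million
extra inodes and hardlinks to create, stat and eventually clean up for
a single backed-up file.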

But this is just my guess - one of the developers would need to think
this through and lay out the consequences.

Bye,

Tino.

-- 
"There is no way to peace. Peace is the way." (Mahatma Gandhi)

www.craniosacralzentrum.de
www.forteego.de

_______________________________________________
BackupPC-users mailing list
BackupPC-users AT lists.sourceforge DOT net
List:    https://lists.sourceforge.net/lists/listinfo/backuppc-users
Wiki:    http://backuppc.wiki.sourceforge.net
Project: http://backuppc.sourceforge.net/