Subject: Re: [BackupPC-users] Backing up a BackupPC server
From: Les Mikesell <lesmikesell AT gmail DOT com>
To: backuppc-users AT lists.sourceforge DOT net
Date: Tue, 02 Jun 2009 11:48:54 -0500
Tino Schwarze wrote:
> 
>> The first thing needed would be to demonstrate that there would be an 
>> advantage to a database approach - like some benchmarks showing an 
>> improvement in throughput in the TB size range and measurements of the 
>> bandwidth needed for remote replication.
> 
> In my experience, BackupPC is mainly I/O bound. It produces a lot of
> seeks within the block device system (for directory and hash lookup).
> This might actually benefit from a relational database - you'd just do
> the appropriate SELECT, have some indices in place, etc. Of course,
> there's still that "how to store and query the directory hierarchies
> efficiently" problem.

Yes, you are asking for magic that doesn't exist here.  A skilled DBA 
can work a little bit of magic by placing tables that need concurrent 
access on different physical drives, but not everyone will have either a 
large number of drives or a DBA available for the task.
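
For what it's worth, the "appropriate SELECT" part is easy to picture -
a minimal sketch assuming SQLite, with made-up table and column names
(nothing here reflects an actual BackupPC schema); the hard part is
everything around it:

    import sqlite3

    con = sqlite3.connect("pool.db")
    con.execute("""CREATE TABLE IF NOT EXISTS pool (
        digest BLOB PRIMARY KEY,   -- content hash of the pooled file
        size   INTEGER NOT NULL,   -- uncompressed size in bytes
        path   TEXT NOT NULL       -- where the compressed copy lives
    )""")

    some_digest = b"\x00" * 16     # placeholder for a real file digest
    # One indexed lookup instead of a directory walk plus several seeks.
    row = con.execute("SELECT path, size FROM pool WHERE digest = ?",
                      (some_digest,)).fetchone()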

> Maybe someone should propose a real design, then we may check how to map
> BackupPC's access patterns to the database structure. It might turn out
> to be really complex - I'm just wondering how to store files,
> directories, attributes, the pool, and a particular backup number. We
> currently create the directory structure for each backup so we can
> store the attrib file (to keep track of deleted files, at least). We'd
> have to do the same in the database; there's no other solution, IMO.
> 
> I suppose you could only benchmark something after implementing a
> sufficiently complex part of the problem.

Or, benchmark some simple approximation handling the expected amount of 
data.  If it turns out to be impractically slow (as I suspect it 
will...) then you don't need to consider it any more.
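
Something like this would do for a first pass - a rough Python/SQLite
sketch where the row count and the random digests are stand-ins for
real pool data:

    import os, sqlite3, time

    N = 1_000_000                 # stand-in for the expected pool size
    con = sqlite3.connect(":memory:")
    con.execute("CREATE TABLE pool (digest BLOB PRIMARY KEY, size INTEGER)")
    con.executemany("INSERT INTO pool VALUES (?, ?)",
                    ((os.urandom(16), i) for i in range(N)))
    con.commit()

    t0 = time.time()
    for _ in range(100_000):      # random probes, mostly misses
        con.execute("SELECT size FROM pool WHERE digest = ?",
                    (os.urandom(16),)).fetchone()
    print("lookups/sec:", int(100_000 / (time.time() - t0)))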

> Another idea: Do we have performance metrics of BackupPC? It might be
> useful to check what operations take most of the time. Is it pool
> lookups? File decompression? Directory traversal for incrementals?

I think it is pretty well balanced most of the time.  But you have to 
consider the operation.  Worst case will probably be handling large 
files with small changes (like database dumps, mailboxes or growing 
logfiles) where rsync will end up transferring just the differences but 
the server will reconstruct the entire file copy, compress it, and make 
a new pool entry that is unlikely to be reused.
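
Back-of-the-envelope, with invented numbers:

    # A 10 GB mailbox with ~5 MB changed since the last run:
    wire = 5 * 2**20              # what rsync actually sends
    disk = 2 * 10 * 2**30         # server reads the old copy, writes a new one
    print(disk // wire)           # 4096 - local I/O dwarfs the network traffic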

> If, for example, we figure out that hash lookups and checksum reading
> of hash files etc. are expensive, a little database (actually a
> hashtable) might suffice - sort of a memcached which keeps track of pool
> files, their size and checksum. This might be doable (maybe disabled by
> default if it requires additional setup) and work like a cache.

I think the hashing scheme is already pretty efficient.
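
That said, if someone wants to measure the cache idea, the core of it
is tiny - a sketch, where the fields worth caching are my guess:

    # digest -> (size, checksum), kept in RAM (or memcached) so a
    # "have we pooled this file already?" check never touches the disk.
    pool_cache = {}

    def lookup(digest):
        return pool_cache.get(digest)          # None means not pooled yet

    def remember(digest, size, checksum):
        pool_cache[digest] = (size, checksum)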

>> Personally I think the way to make things better would be to have a 
>> filesystem that does block-level de-duplication internally. Then most of 
>> what backuppc does won't even be necessary.   There were some 
>> indications that this would be added to ZFS at some point, but I don't 
>> know how the Oracle acquisition will affect those plans.
> 
> I don't think that belongs in the file system. In my opinion, a file
> system should be tuned for one purpose: managing space and files. It
> should not care about file contents in any way.

From the outside it wouldn't care about the contents - it just wouldn't 
use duplicate space to store duplicate contents.  Think of it as 
copy-on-write space, much like memory works (except that it actively 
looks for matches).  The same sort of content hashing scheme that backuppc 
uses to match files would be used at the block level.  You might not 
want this on every filesystem because of the overhead, but consider the 
advantage in the case of backups of growing logfiles.
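
The growing-logfile case shows the payoff: all of yesterday's blocks
hash the same, so only the new tail costs space. A toy illustration
(fixed 4 KB blocks, hash choice arbitrary):

    import hashlib, os

    BLOCK = 4096
    store = {}                    # digest -> block contents

    def dedup_write(data):
        """Store only blocks we haven't seen; return their digests."""
        refs = []
        for i in range(0, len(data), BLOCK):
            block = data[i:i + BLOCK]
            d = hashlib.sha1(block).digest()
            store.setdefault(d, block)         # duplicates cost nothing
            refs.append(d)
        return refs

    day1 = os.urandom(8 * BLOCK)               # yesterday's logfile
    dedup_write(day1)
    dedup_write(day1 + os.urandom(BLOCK))      # today: one block appended
    print(len(store))                          # 9, not 17 - only the tail is new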

>> Meanwhile, if someone has time to kill doing benchmark measurements, 
>> using ZFS with incremental send/receive to maintain a remote filesystem 
>> snapshot would be interesting.  Or perhaps making a vmware vmdk disk 
>> with many small (say 1 or 2 gig) elements and running backuppc in a 
>> virtual machine.  Then for replication, stop the virtual machine and 
>> rsync the directory containing the disk image files.  This might even be 
>> possible without stopping if you can figure out how vmware snapshots work.
> 
> You don't want heavy I/O in VMware without directly attached SAN storage
> or a similarly expensive setup.

You can afford to waste a little CPU these days - throw something fast 
at it.
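
Going back to the ZFS experiment: the incremental send/receive loop is
only a few commands. A sketch - dataset, snapshot and host names are
invented, and the remote side must already hold the previous snapshot:

    import subprocess

    FS, PREV, CUR = "tank/backuppc", "daily.1", "daily.0"

    subprocess.run(["zfs", "snapshot", f"{FS}@{CUR}"], check=True)
    send = subprocess.Popen(["zfs", "send", "-i", f"@{PREV}", f"{FS}@{CUR}"],
                            stdout=subprocess.PIPE)
    subprocess.run(["ssh", "backup-host", "zfs", "receive", "-F", FS],
                   stdin=send.stdout, check=True)
    send.wait()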

> I'd rather propose a patch to rsync adding --treat-blockdev-as-files.
> This would require block-level checksum generation on _both_ sides,
> though, so it's rather I/O and CPU intensive. 
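
For scale, here's roughly what each side would have to do - a naive
sketch; a real patch would use rsync's rolling checksums rather than
this full-block compare:

    import hashlib

    BLOCK = 1 << 20               # 1 MB; block size is arbitrary here

    def block_sums(dev_path):
        """Checksum every block of a device - the full read of *both*
        copies is where the I/O and CPU cost comes from."""
        sums = []
        with open(dev_path, "rb") as dev:
            while True:
                block = dev.read(BLOCK)
                if not block:
                    break
                sums.append(hashlib.md5(block).digest())
        return sums

    # Each side runs block_sums() over its copy; only blocks whose
    # checksums differ need to cross the wire.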

Also, rsync normally builds a new copy, so you need twice the space at 
the remote side - or, if you let it rebuild in place, you have a likely 
scenario where the site disaster you were trying to protect against 
happens mid-copy, leaving you with no working version.  But disk space 
is cheap too - you could image-copy your archive to a local file, then 
rsync that to a remote file on a filesystem with enough space for both 
copies.

> Then, DRBD might be the
> way to go - it already takes note of changed parts of the disk (but
> that's a guess).

Not sure how well this works over remote networks - might be worth a 
try, but again a live copy is likely to be corrupted along with the 
master unless you can cycle between two remote copies.
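
If you do cycle, it's easy to script - a sketch with invented paths,
alternating targets by day so a failed transfer can only hurt the copy
being rewritten:

    import datetime, subprocess

    # Alternate targets by day: a disaster mid-transfer can only damage
    # the copy being rewritten, never the last complete one.
    targets = ["/remote/copy-a", "/remote/copy-b"]
    dest = targets[datetime.date.today().toordinal() % 2]
    subprocess.run(["rsync", "--inplace", "/local/archive.img",
                    "backup-host:" + dest], check=True)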

-- 
   Les Mikesell
    lesmikesell AT gmail DOT com

