Subject: Re: [BackupPC-users] Renaming files causes retransfer?
From: Holger Parplies <wbppc AT parplies DOT de>
To: backuppc users list <backuppc-users AT lists.sourceforge DOT net>
Date: Wed, 20 Apr 2011 20:01:28 +0200
Hi,

martin f krafft wrote on 2011-04-17 16:43:07 +0200 [Re: [BackupPC-users] 
Renaming files causes retransfer?]:
> also sprach John Rouillard <rouilj-backuppc AT renesys DOT com> 
> [2011.04.17.1625 +0200]:
> > > In terms of backuppc, this means that the files will have to be
> > > transferred again, completely, right?
> > 
> > Correct.
> 
> Actually, I just did a test, using iptables to count bytes between
> the two hosts, and then renamed a 33M file. backuppc, using rsync,
> only transferred 370k. Hence I think that it actually does *not*
> transfer the whole file.

it always feels strange to contradict reality, but, in theory, there is no way
to get around transferring the file.

For the rsync algorithm to work, you need a local reference copy of the
file you want to transfer. While you and I know that there *is* a local copy,
BackupPC would need to know (a) that there is and (b) where to find it. The
only available information at the point in time where this decision needs to
be made is the (new) file name. For this, there is no candidate in the
reference backup (or any other backup, for that matter). So the file needs to
be transferred in full.
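
Schematically (a hypothetical Python sketch, not actual BackupPC code): the
only key available for picking a reference is the path the client reports,
and a renamed file's path simply isn't in the reference backup.

    def pick_reference(reference_backup, client_path):
        # reference_backup: mapping of client paths (from the last backup)
        # to locally stored copies that rsync could diff against.
        # A renamed file arrives under a new path, so this lookup fails
        # even though the identical content exists under the old name.
        return reference_backup.get(client_path)

    previous = {"data/report.iso": "/path/to/stored/copy"}
    print(pick_reference(previous, "data/report.iso"))          # found
    print(pick_reference(previous, "data/report-renamed.iso"))  # None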

We'd all like to be able to choose an existing *pool file* as reference - this
would save us transfers of *any* file already existing in the pool (e.g. from
other hosts). Unfortunately, this is technically not possible without a
specialized BackupPC client.

> (btw, I also think that what I wrote in
> http://comments.gmane.org/gmane.comp.sysutils.backup.backuppc.general/24352
> is wrong, but I shall follow up on this when I have verified my
> findings).

Is that a backuppc-users thread I somehow missed? I see where your question
is going now, so I'll go into a bit more detail (not sure if any of this was
already mentioned in that thread).

1.) BackupPC uses already existing transfer methods for the sake of not
    needing to install anything non-mainstream on the clients. In your case,
    that is probably ssh + rsync.
    Consequently, BackupPC is limited to what the rsync protocol will
    allow, which does *not* include, "hey, send me the 1st and 8th 128kB
    chunk of the file before I'll tell you the checksum I have on my side".
    Such a request just doesn't make any sense for standalone rsync. We need
    to select a candidate before we can start transferring blocks that don't
    match (and skip blocks that do). It's really quite obvious, if you think
    about it, and it only gets more complicated (but doesn't change) if you go
    into the details of which rsync end plays which role in the file delta
    exchange.

    The same is basically true for tar and smb. The remote end decides what
    data to transfer (which is the whole file or nothing), and you can take
    it or ignore it, but you can't prevent it from being transferred.
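
    Coming back to the rsync case: a much simplified model of the exchange
    (a hypothetical Python sketch; the real protocol uses a rolling scan
    with a weak and a strong checksum per block) might look like this.
    Without a local reference file there are no block checksums to offer,
    so everything arrives as literal data:

        import hashlib

        BLOCK = 128 * 1024

        def block_sums(reference_data):
            # Receiver side: checksum each block of its local reference copy.
            return {hashlib.md5(reference_data[i:i + BLOCK]).hexdigest(): i // BLOCK
                    for i in range(0, len(reference_data), BLOCK)}

        def make_delta(new_data, sums):
            # Sender side: ("match", block#) for blocks the receiver already
            # has, ("literal", bytes) for everything else.
            delta = []
            for i in range(0, len(new_data), BLOCK):
                chunk = new_data[i:i + BLOCK]
                digest = hashlib.md5(chunk).hexdigest()
                delta.append(("match", sums[digest]) if digest in sums
                             else ("literal", chunk))
            return delta

        # No reference copy -> no checksums -> the whole file goes as literals.
        delta = make_delta(b"x" * (3 * BLOCK), block_sums(b""))
        print([kind for kind, _ in delta])   # ['literal', 'literal', 'literal']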

2.) BackupPC reads the first 1MB of each file into memory. It needs to do so
    to determine the pool file name. That should not be a problem memory-wise.
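
    As a sketch of why the first 1MB is enough (the exact recipe lives in
    BackupPC's library code and may differ in detail; this only illustrates
    the idea): the digest covers the file length plus the 1st and, for
    larger files, the 8th 128kB chunk, all of which lie within the first 1MB.

        import hashlib

        CHUNK = 128 * 1024

        def pool_digest(first_mb, file_length):
            # Everything hashed here lies inside the first 1MB of the file.
            md5 = hashlib.md5()
            md5.update(str(file_length).encode())
            md5.update(first_mb[:CHUNK])                   # 1st 128kB chunk
            if file_length > 8 * CHUNK:
                md5.update(first_mb[7 * CHUNK:8 * CHUNK])  # 8th 128kB chunk
            return md5.hexdigest()

        print(pool_digest(b"\0" * (1024 * 1024), 5 * 1024 * 1024))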

3.) BackupPC cannot, obviously, read an arbitrarily large file into memory. It
    also wants to avoid unnecessary (possibly extremely large) writes to the
    pool FS. So it does this:
    - Determine pool file candidates (possibly several, in case of pool
      collisions).
    - Read pool file candidates in parallel with the network transfer.
    - As soon as something doesn't match, discard the respective candidate.
    - If that was the last available candidate, copy everything so far (which
      *did* match) from that candidate to a new file.
      We need to get this content from somewhere, and the network stream is,
      obviously, not seekable, so we can't re-get it from there (but then, we
      don't need to and wouldn't want to, because, hopefully, our local disk
      is faster ;-).
    - If the whole candidate file matched our complete network stream, we
      have a pool match and only need to link to that.
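
    A simplified, hypothetical sketch of that matching loop (candidates and
    the stream reduced to plain byte strings; the real code obviously works
    on files and handles lengths, I/O errors, etc.):

        def receive_file(stream_blocks, candidates):
            """Return ("pool_match", data) or ("new_file", data)."""
            new_data = bytearray()
            offset = 0
            for block in stream_blocks:
                survivors = [c for c in candidates
                             if c[offset:offset + len(block)] == block]
                if candidates and not survivors:
                    # The last candidate just failed: replay the prefix that
                    # *did* match from the local (seekable) copy, not from
                    # the network stream, which we can't rewind.
                    new_data += candidates[0][:offset]
                candidates = survivors
                if not candidates:
                    new_data += block
                offset += len(block)
            full = [c for c in candidates if len(c) == offset]
            if full:
                return ("pool_match", bytes(full[0]))   # just link to it
            if candidates:
                # Candidates were longer than the stream: no full match.
                new_data = bytearray(candidates[0][:offset])
            return ("new_file", bytes(new_data))

        blocks = [b"a" * 4, b"b" * 4]
        print(receive_file(blocks, [b"a" * 4 + b"b" * 4]))   # pool match
        print(receive_file(blocks, []))                      # renamed/new file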

4.) There *was* an attempt to write a specialized BackupPC client (BackupPCd)
    quite a while back. I believe this was given up for lack of human
    resources. I always found this matter rather interesting, but I've never
    gotten around to even taking a look at the code, let alone doing anything
    with it.

I hope that clears things up a bit.

Regards,
Holger
