BackupPC-users

Re: [BackupPC-users] Why does backuppc transfer files already in the pool

2010-08-29 11:20:06
Subject: Re: [BackupPC-users] Why does backuppc transfer files already in the pool
From: martin f krafft <madduck AT madduck DOT net>
To: "General list for user discussion, questions and support" <backuppc-users AT lists.sourceforge DOT net>
Date: Sun, 29 Aug 2010 17:16:43 +0200
[Replying to multiple messages]

also sprach Jeffrey J. Kosowsky <backuppc AT kosowsky DOT org> [2010.08.29.0358 +0200]:
> You are ignoring pool collisions which are very real and not
> altogether infrequent. For example a constant length log or database
> file could easily have the same 1st and 8th 128k block but still be
> different.
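
To make the collision concrete, here is a minimal sketch. It assumes, as described in this thread, that the pool name depends only on the file size plus the 1st and 8th 128k blocks. The function name partial_digest and the exact hashing order are illustrative assumptions, not BackupPC's actual code.

```python
import hashlib

CHUNK = 128 * 1024  # 128 KiB


def partial_digest(data):
    """Hash the length plus at most two 128 KiB chunks (simplified model)."""
    md5 = hashlib.md5()
    md5.update(str(len(data)).encode())
    if len(data) <= 2 * CHUNK:
        md5.update(data)                    # small file: hash everything
    else:
        md5.update(data[:CHUNK])            # 1st chunk
        end = min(len(data), 8 * CHUNK)
        md5.update(data[end - CHUNK:end])   # 8th chunk, or the last one if shorter
    return md5.hexdigest()


# Two constant-length "log files" that differ only in the 4th chunk.
size = 12 * CHUNK
file_a = bytes(b"x" * size)
changed = bytearray(file_a)
changed[3 * CHUNK] = ord("y")
file_b = bytes(changed)

same_pool_name = partial_digest(file_a) == partial_digest(file_b)
same_content = hashlib.md5(file_a).digest() == hashlib.md5(file_b).digest()
print(same_pool_name, same_content)
```

Under these assumptions both files map to the same pool name although their contents differ, which is why a candidate pool file still has to be compared against the incoming data.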

Good example. So let's forget about this case and concentrate on my
optimisation suggestion:

>  > 2. Assuming that the two 128k block checksums and the file size are
>  >    not collision-free (they probably aren't), backuppc should really
>  >    uncompress the pool file and employ rsync's rolling checksum to
>  >    update the file (in memory). If there were any changes, then it
>  >    should write out the NewFile to disk; in the absence of changes,
>  >    it should create the hardlink.
> 
> While I don't understand all the details of rsync checksums, you seem
> to be missing the fact that when using rsync on the cpool, the actual
> block and full-file rsync checksums are appended to the end of the
> cpool file. Therefore, it is not necessary to always uncompress the
> file but rather it is sufficient just to read out the stored checksums
> (though with checksum caching you can choose to have a predetermined
> fraction of the files checked each time). Note I may not be describing
> this totally accurately but hopefully you get the point.

Yeah, I get the point, but the bottleneck in my case is not the
uncompression, but the fact that the peer must send the entire file
over a slow link, even though it's already present remotely.
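
For reference, the rolling checksum mentioned in the quoted suggestion can be sketched as follows. This is the textbook weak checksum from the rsync algorithm description, assuming a modulus of 2^16 and no per-byte offset. It is not taken from File::RsyncP, and the function names are illustrative.

```python
M = 1 << 16  # modulus used for both halves of the weak checksum


def weak_checksum(block):
    """Compute (a, b) for a block from scratch."""
    a = sum(block) % M
    b = sum((len(block) - i) * x for i, x in enumerate(block)) % M
    return a, b


def roll(a, b, out_byte, in_byte, block_len):
    """Slide the window one byte: drop out_byte, append in_byte."""
    a = (a - out_byte + in_byte) % M
    b = (b - block_len * out_byte + a) % M
    return a, b


# Check that rolling agrees with recomputing from scratch.
data = bytes(range(256)) * 4
n = 64
a, b = weak_checksum(data[:n])
rolled_ok = True
for k in range(1, len(data) - n + 1):
    a, b = roll(a, b, data[k - 1], data[k - 1 + n], n)
    if (a, b) != weak_checksum(data[k:k + n]):
        rolled_ok = False
print(rolled_ok)
```

The point of the rolling form is that each one-byte shift costs constant time, so a sender can search for matching blocks at every offset without rehashing the whole window.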



also sprach Les Mikesell <lesmikesell AT gmail DOT com> [2010.08.29.0102 +0200]:
> On 8/28/10 3:22 PM, martin f krafft wrote:
> > also sprach Les Mikesell<lesmikesell AT gmail DOT com>  [2010.08.28.2151 +0200]:
> >> If it is one or a few files or constrained to a directory that
> >> you know you already have backed up locally, why not just
> >> exclude it on the remote machines?
> >
> > It happens regularly.
> 
> But if it is under your control, you might arrange it to be under
> an excluded directory.

I am dealing with u.s.e.r.s. That stands for: "unpredictable
sometimes emotionally regressive species". So no. ;)

> > Don't you think BackupPC could be optimised *iff* rsyncp could
> > ask the peer mid-transfer to calculate the whole file checksum
> > (it could just ask that anyway, but that would increase the
> > client load)?
> 
> You are working with a stock rsync on the other end, so I don't
> think that's an option - and rsync's checksums aren't the same as
> the hash used to build the pool filenames.

Okay, let's assume for a moment that we cannot ask the peer to
calculate the hashsum mid-transfer. What else could we do?

To recap: I would like to avoid having to transfer an entire file if
chances are high that it's already in the pool.

What I think BackupPC is doing right now is:

1. It starts receiving a file

2. After a certain time, it has enough information to take a guess
   at the corresponding pool file, and opens it.

3. What seems to happen now is weird: the FileIO method fileDeltaRxNext is
   called repeatedly, but at the same time, the client keeps sending data, not
   checksums.

See the following demonstration: I created a backup host with just a TESTFILE,
1.5Mb of hex "aabbccddeeff…" and backed it up. I then copied that
file to NEWFILE and ran another full backup.

  391f184ac1937f245a19652816d10d0e  NEWFILE
  391f184ac1937f245a19652816d10d0e  TESTFILE

The following is the strace output of the second run, grepped like
this:

  egrep 'log (Receiving: |tmp/backuppc-test/NEWFILE)|cpool'

and interspersed with my comments:

    3932  write(8, "log tmp/backuppc-test/NEWFILE: s"..., 65 <unfinished ...>
    3928  <... read resumed> "log tmp/backuppc-test/NEWFILE: s"..., 65536) = 65

# The first bytes are arriving. I don't know what 0xfc0f0007 is, but
# it appears all over the place.

    3932  write(8, "log Receiving: fc0f0007aabbccdde"..., 4096) = 4096
    3928  <... read resumed> "log Receiving: fc0f0007aabbccdde"..., 65536) = 8192
    3932  write(8, "log tmp/backuppc-test/NEWFILE: b"..., 79) = 79
    3932  write(8, "log tmp/backuppc-test/NEWFILE: o"..., 126 <unfinished ...>
    3932  write(8, "log tmp/backuppc-test/NEWFILE: w"..., 60 <unfinished ...>
    3932  write(8, "log Receiving: fc0f0007ccddeeffa"..., 4096 <unfinished ...>
    3928  <... read resumed> "log tmp/backuppc-test/NEWFILE: b"..., 65536) = 8457
    3932  write(8, "log tmp/backuppc-test/NEWFILE: b"..., 79) = 79
    3932  write(8, "log tmp/backuppc-test/NEWFILE: w"..., 60) = 60
    3932  write(8, "log tmp/backuppc-test/NEWFILE: b"..., 79 <unfinished ...>
    3932  write(8, "log tmp/backuppc-test/NEWFILE: w"..., 60 <unfinished ...>
    3932  write(8, "log Receiving: fc0f0007aabbccdde"..., 4096 <unfinished ...>
    3928  <... read resumed> "log tmp/backuppc-test/NEWFILE: b"..., 65536) = 4374
    3932  write(8, "log tmp/backuppc-test/NEWFILE: b"..., 79) = 79
    3932  write(8, "log tmp/backuppc-test/NEWFILE: w"..., 60) = 60
    3932  write(8, "log tmp/backuppc-test/NEWFILE: b"..., 79) = 79
    3932  write(8, "log tmp/backuppc-test/NEWFILE: w"..., 60) = 60
    3932  write(8, "log Receiving: fc0f0007eeffaabbc"..., 4096 <unfinished ...>
    3928  <... read resumed> "log tmp/backuppc-test/NEWFILE: b"..., 65536) = 12427

    […]

# A few dozen equivalent lines later:

    3932  write(8, "log Receiving: fc0f0007aabbccdde"..., 4096) = 4096
    3932  write(8, "log tmp/backuppc-test/NEWFILE: b"..., 79 <unfinished ...>
    3932  write(8, "log tmp/backuppc-test/NEWFILE: w"..., 60) = 60
    3928  <... read resumed> "log tmp/backuppc-test/NEWFILE: b"..., 65536) = 65536
    3932  write(8, "log tmp/backuppc-test/NEWFILE: b"..., 79) = 79
    3932  write(8, "log tmp/backuppc-test/NEWFILE: w"..., 60) = 60

# Oh look, we found a possible match in the pool:

    3932  stat("/var/lib/backuppc/cpool/1/b/5/1b56172076f0f811087ed07b4c7dda9b", {st_mode=S_IFREG|0600, st_size=29415, ...}) = 0
    3932  open("/var/lib/backuppc/cpool/1/b/5/1b56172076f0f811087ed07b4c7dda9b", O_RDONLY) = 6
    3932  stat("/var/lib/backuppc/cpool/1/b/5/1b56172076f0f811087ed07b4c7dda9b_0",  <unfinished ...>

# But we keep receiving data (note: "eeffaabbccddeeffaabbccddeeff"),
# not checksums:

    3928  read(6, "log tmp/backuppc-test/NEWFILE: b"..., 65536) = 139
    3932  write(8, "log Receiving: fc0f0007eeffaabbc"..., 4096) = 4096
    3928  <... read resumed> "log Receiving: fc0f0007eeffaabbc"..., 65536) = 8192
    3932  write(8, "log tmp/backuppc-test/NEWFILE: b"..., 79) = 79
    3932  write(8, "log tmp/backuppc-test/NEWFILE: w"..., 60) = 60
    3932  write(8, "log tmp/backuppc-test/NEWFILE: b"..., 79) = 79
    3932  write(8, "log tmp/backuppc-test/NEWFILE: w"..., 60) = 60
    3932  write(8, "log Receiving: fc0f0007ccddeeffa"..., 4096) = 4096
    3928  <... read resumed> "log tmp/backuppc-test/NEWFILE: b"..., 65536) = 37142
    3932  write(8, "log tmp/backuppc-test/NEWFILE: b"..., 79) = 79
    3932  write(8, "log tmp/backuppc-test/NEWFILE: w"..., 60) = 60
    3932  write(8, "log tmp/backuppc-test/NEWFILE: b"..., 79) = 79
    3932  write(8, "log tmp/backuppc-test/NEWFILE: w"..., 60) = 60

# More data: "eeffaabbc"

    3932  write(8, "log Receiving: fc0f0007aabbccdde"..., 4096) = 4096
    3928  <... read resumed> "log tmp/backuppc-test/NEWFILE: b"..., 65536) = 65536
    3932  write(8, "log tmp/backuppc-test/NEWFILE: b"..., 79) = 79
    3932  write(8, "log tmp/backuppc-test/NEWFILE: w"..., 60 <unfinished ...>
    3932  write(8, "log tmp/backuppc-test/NEWFILE: b"..., 79) = 79
    3932  write(8, "log tmp/backuppc-test/NEWFILE: w"..., 60) = 60

    […]

# Hundreds of lines later, even more data:

    3932  write(8, "log Receiving: fc0f0007aabbccdde"..., 4096) = 4096
    3932  write(8, "log tmp/backuppc-test/NEWFILE: b"..., 79) = 79
    3932  write(8, "log tmp/backuppc-test/NEWFILE: w"..., 60) = 60
    3932  write(8, "log tmp/backuppc-test/NEWFILE: b"..., 79) = 79
    3932  write(8, "log tmp/backuppc-test/NEWFILE: w"..., 60) = 60
    3932  write(8, "log Receiving: fc0f0007eeffaabbc"..., 4096) = 4096
    3932  write(8, "log tmp/backuppc-test/NEWFILE: b"..., 79) = 79
    3932  write(8, "log tmp/backuppc-test/NEWFILE: w"..., 60) = 60
    3932  write(8, "log tmp/backuppc-test/NEWFILE: b"..., 79) = 79
    3932  write(8, "log tmp/backuppc-test/NEWFILE: w"..., 60) = 60
    3932  write(8, "log Receiving: fc0f0007ccddeeffa"..., 4096) = 4096
    3932  write(8, "log tmp/backuppc-test/NEWFILE: b"..., 79) = 79
    3932  write(8, "log tmp/backuppc-test/NEWFILE: w"..., 60) = 60
    3932  write(8, "log tmp/backuppc-test/NEWFILE: b"..., 79) = 79
    3932  write(8, "log tmp/backuppc-test/NEWFILE: w"..., 60) = 60
    3932  write(8, "log tmp/backuppc-test/NEWFILE: b"..., 79) = 79
    3932  write(8, "log tmp/backuppc-test/NEWFILE: w"..., 60) = 60
    3932  write(8, "log tmp/backuppc-test/NEWFILE: b"..., 75) = 75

# And now, 2Mb have been transferred, so we can finally discard all
# the received data and hardlink instead.

    3928  read(6, "log Receiving: fc0f0007ccddeeffa"..., 65536) = 65536
    3932  write(8, "log tmp/backuppc-test/NEWFILE go"..., 111) = 111
    3932  link("/var/lib/backuppc/cpool/1/b/5/1b56172076f0f811087ed07b4c7dda9b", "/srv/backuppc/pc/charade.madduck.net/new/f%2f/ftmp/fbackuppc-test/fNEWFILE") = 0




Do you see what I mean?



also sprach Jeffrey J. Kosowsky <backuppc AT kosowsky DOT org> [2010.08.29.0404 +0200]:
> But the rsync block and file md4 checksums (and yes it's md4 for
> the rsync <30 protocol required by perl-File-RsyncP) are appended
> to the end of each cpool file.

Here's what I think should happen instead:

As soon as at least one candidate file in the pool is found:

1. Keep the received data (2×128k) in memory;

2. Somehow convince the peer that we actually have the file and that
   it should continue sending block checksums, or however the rsync
   protocol actually works;

3. Compute our own checksums and keep going while they match what
   the peer sends;

4. If we reach the file's end, hardlink, and be DONE.

5. If we receive a checksum different from what our pool file has
   (either the checksum cache or by computing checksums over the
   uncompressed file), then it means that the peer's file is
   different from what we have in the pool. In this case, we can
   reconstitute the actual file to this point from the data saved in
   step (1.), the blocks from the pool file with matching checksums,
   and what we have just received.

6. Now we have to convince the peer to send real data again.
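
The steps above can be sketched as a small simulation. This is only a model of the proposed flow, not of anything rsync or File::RsyncP does today. The peer is simulated locally, MD5 stands in for the MD4 block checksums, the data buffered in step 1 is not modelled separately, and the function names are made up for illustration.

```python
import hashlib


def block_digest(block):
    # Stand-in for the per-block checksum (the thread mentions MD4).
    return hashlib.md5(block).digest()


def receive(peer_file, pool_file, block_size):
    """Return (action, data, literal_bytes) for one simulated transfer.

    While checksums match, only checksums "cross the wire" and the data is
    taken from the pool candidate. After the first mismatch, every remaining
    block is counted as literal data sent by the peer.
    """
    rebuilt = bytearray()
    literal_bytes = 0
    matching = len(peer_file) == len(pool_file)
    for off in range(0, len(peer_file), block_size):
        peer_block = peer_file[off:off + block_size]
        pool_block = pool_file[off:off + block_size]
        if matching and block_digest(peer_block) == block_digest(pool_block):
            rebuilt += pool_block            # step 3: checksums agree
        else:
            matching = False                 # steps 5 and 6: fall back to data
            rebuilt += peer_block
            literal_bytes += len(peer_block)
    action = "hardlink" if matching else "newfile"   # step 4 or step 5
    return action, bytes(rebuilt), literal_bytes


pool = b"a" * 1000
identical = receive(pool, pool, 256)
modified_file = pool[:600] + b"b" + pool[601:]
modified = receive(modified_file, pool, 256)
print(identical[0], identical[2])
print(modified[0], modified[2])
```

In this model an identical file costs no literal data at all, while a modified file costs only the data from the first differing block onwards. Whether a stock rsync peer can be persuaded to switch modes like this is exactly the open question in steps 2 and 6.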


Does this make sense?

-- 
martin | http://madduck.net/ | http://two.sentenc.es/
 
"there are two major products that come out of berkeley: lsd and unix."
 one caused me an addiction
                                                             -- fyodor
 
spamtraps: madduck.bogus AT madduck DOT net

