BackupPC-users

Re: [BackupPC-users] RsyncP problem

From: "Harald Amtmann" <[email protected]>
To: "General list for user discussion, questions and support" <backuppc-users AT lists.sourceforge DOT net>
Date: Mon, 07 Dec 2009 19:29:07 +0100
So, for anyone who cares (nobody on this list seems to have noticed), I found 
this post from 2006 stating and analyzing my exact problem:

http://www.topology.org/linux/backuppc.html
On this site, search for "Design flaw: Avoidable re-transmission of massive 
amounts of data."


For future reference and archiving, I quote here in full:

"2006-6-7:
During the last week while using BackupPC in earnest, I have noticed a very 
serious design flaw which is totally avoidable by making a small change to the 
software. First I will describe the flaw with an example.

   1. First I back up the rsyncd "module" home from computer client1 to computer 
server1 using the "rsyncd" method. This uses the following line in the server1 
config.pl file:

      $Conf{RsyncShareName} = ['home'];

   2. Then I do an incremental back-up of module "home" from client1 to 
server1. This back-up correctly sends only the changes in the file-system 
module "home" over the network. So the back-up is very quick.
   3. Now I modify the variable $Conf{RsyncShareName} on server1 as follows:

      $Conf{RsyncShareName} = ['home', 'home1'];

   4. Next, I make an incremental back-up. Naturally, the home module is sent 
very efficiently over the LAN and home1 is sent in full, essentially 
uncompressed. Well, this isn't quite natural. In fact, it's quite avoidable, 
but I'll explain why this is so later.
   5. Now I make a second incremental back-up of home and home1. Since I have 
already backed up these two modules, I expect them both to be very quick. But 
this does not happen. In fact, all of home1 is sent in full over the LAN, which 
in my case takes about 10 hours. This is a real nuisance. This problem occurs 
even if I have this in the config.pl file on server1:

      $Conf{IncrFill} = 1;

   6. Next, I make a full back-up. This sends only the changes to home over the 
LAN, but sends the full contents of home1, uncompressed, over the LAN, even 
though I have already sent this module in full twice.
   7. Now when I make future back-ups, the modules home and home1 are both sent 
efficiently and quickly. 

The design flaw here is crystal clear. Consider a single file home1/xyz.txt. 
The authors have designed the BackupPC system so that the file home1/xyz.txt is 
sent in full from client1 to server1 unless

   1. the file home1/xyz.txt is already on server1 with the identical path in 
the identical module home1, and
   2. the back-up in which home1/xyz.txt exists is a full back-up, not an 
incremental back-up. 

If the above conditions do not both hold, the full file is transmitted by 
rsyncd on client1; then it is discarded by server1 if it is already present on 
server1 in either the same path in an earlier back-up, or in any path at all in 
any other module in any kind of earlier back-up. So the software correctly 
discards duplicate files when they arrive on server1, but they are still 
transmitted anyway.
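The two conditions above can be sketched as a small decision function (a 
minimal illustration with made-up names, not BackupPC's actual code):

```python
# Sketch of the transmission rule described above: a file is skipped only
# if the identical path exists in the identical share in a previous *full*
# backup. Content already pooled elsewhere on the server does not count.
def must_transmit(path, share, previous_backups):
    """previous_backups: list of dicts like
    {"type": "full" or "incr", "share": str, "files": set of paths}"""
    for backup in previous_backups:
        if (backup["type"] == "full"
                and backup["share"] == share
                and path in backup["files"]):
            return False  # reference copy exists: only deltas are sent
    return True  # otherwise the whole file goes over the wire

# Example: xyz.txt exists on the server, but only under another share's
# full backup and under this share's incremental, so it is resent in full.
backups = [
    {"type": "full", "share": "home", "files": {"xyz.txt"}},
    {"type": "incr", "share": "home1", "files": {"xyz.txt"}},
]
print(must_transmit("xyz.txt", "home1", backups))  # True: sent again in full
print(must_transmit("xyz.txt", "home", backups))   # False: delta transfer only
```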

The cure for this design flaw is very easy indeed, and it would save me several 
days of saturated LAN bandwidth when I make back-ups. It's very sad that the 
authors did not design the software correctly. Here is how the software design 
flaw can be fixed.

   1. When an rsync file-system module module1 is to be transmitted from 
client1 to server1, first transmit the hash (e.g. MD5) of each file from 
client1 to server1. This can be done (a) on a file by file basis, (b) for all 
the files in module1 at the same time, or (c) in bundles of say, a few hundred 
or thousand hashes at a time.
   2. The BackupPC server server1 matches the received file hashes with the 
global hash table of all files on server1, both full back-up files and 
incremental back-up files.
   3. Then server1 requests rsyncd on client1 to only transmit the files which 
are not already present on server1. Notice that the files on server1 do not 
have to be in the same path in the same module on server1 in a full back-up, 
which is the case in the current BackupPC software design.
   4. Then client1 sends only the files which are requested, which are the 
files which are not already present on server1. 
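The four steps proposed above amount to a hash-first exchange. A minimal 
sketch (hypothetical helper names; this is not BackupPC's or rsync's actual 
protocol):

```python
# Sketch of the proposed hash-first protocol: the client sends content
# hashes, the server answers with the subset it does not already have in
# its pool, and only those files cross the wire.
import hashlib

def client_hashes(files):
    """files: dict path -> bytes. Returns dict path -> MD5 hex digest."""
    return {p: hashlib.md5(data).hexdigest() for p, data in files.items()}

def server_request_missing(hashes, pool):
    """pool: set of digests already stored anywhere on the server,
    regardless of path, share, or backup type (full or incremental)."""
    return [p for p, h in hashes.items() if h not in pool]

# Example: the server already pooled the contents of xyz.txt from an
# earlier backup of another share, so only the genuinely new file is
# requested and transmitted.
client_files = {"home1/xyz.txt": b"hello", "home1/new.txt": b"world"}
pool = {hashlib.md5(b"hello").hexdigest()}
to_send = server_request_missing(client_hashes(client_files), pool)
print(to_send)  # ['home1/new.txt']
```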

The above design concept would make BackupPC much more efficient even under 
normal circumstances where the variable $Conf{RsyncShareName} is unchanging. At 
present, rsyncd will only refrain from sending a file if it is present in the 
same path in the same module in a previous full back-up. If server1 already has 
the same identical file in any other location, the file is sent by rsyncd and 
then discarded after it arrives.

If the above serious design flaw is not fixed, it will not do much harm to 
people whose files are rarely changing and rarely moving. But if, for example, 
you move a directory tree from one place to another, BackupPC will re-send the 
whole lot across the LAN, and then it will discard the files when they arrive 
on the BackupPC server. This will keep on happening until after you have made a 
full back-up of the files in the new location. 
"


-------- Original Message --------
> Date: Thu, 22 Oct 2009 22:31:32 +0200
> From: "Harald Amtmann" <[email protected]>
> To: backuppc-users AT lists.sourceforge DOT net
> Subject: [BackupPC-users] RsyncP problem

> My problem is still that RsyncP with rsyncd on the client retransmits
> unchanged files. I reduced the test case:
> 
> 1) Full backup. All files are transmitted. This is the log output from the
> client:
> 
> 2009/10/22 21:35:44 [3820] connect from UNKNOWN (192.168.5.9)
> 2009/10/22 21:35:55 [3820] rsync on . from baggub@unknown (192.168.5.9)
> 2009/10/22 21:35:56 [3820] send unknown [192.168.5.9] docsnsettings
> (baggub) .musikproject/musikCube_u.ini 1913 <f???????
> 2009/10/22 21:35:57 [3820] send unknown [192.168.5.9] docsnsettings
> (baggub) .musikproject/musik_collected_u.db 157696 <f???????
> 2009/10/22 21:39:32 [3820] send unknown [192.168.5.9] docsnsettings
> (baggub) .musikproject/musik_u.db 28868608 <f???????
> 2009/10/22 21:39:32 [3820] sent 28836048 bytes  received 61235 bytes 
> total size 29028217
> 
> As you can see, roughly 30 MB are transmitted.
> 
> 2) Incremental backup:
> 
> 2009/10/22 21:40:46 [3940] 192.168.5.9 is not a known address for
> "localhost": spoofed address?
> 2009/10/22 21:40:46 [3940] connect from UNKNOWN (192.168.5.9)
> 2009/10/22 21:40:57 [3940] rsync on . from baggub@unknown (192.168.5.9)
> 2009/10/22 21:40:57 [3940] sent 212 bytes  received 674 bytes  total size
> 29028217
> 
> Almost nothing is transmitted, as the client only checks the timestamps.
> 
> 3) Another full backup: This looks exactly like the output of 1). All data
> is sent over the wire again. The rsync summary states that about 30 MB are
> transmitted.
> 
> 4) Experiment:
> 
> For testing, I added "--checksum" to $Conf{RsyncArgs}. Then I reran a full
> backup:
> 
> 2009/10/22 21:55:09 [2172] rsync on . from baggub@unknown (192.168.5.9)
> 2009/10/22 21:55:10 [2172] send unknown [192.168.5.9] docsnsettings
> (baggub) .musikproject/musikCube_u.ini 1913 <f???????
> 2009/10/22 21:55:11 [2172] send unknown [192.168.5.9] docsnsettings
> (baggub) .musikproject/musik_collected_u.db 157696 <f???????
> 2009/10/22 21:55:11 [2172] sent 158068 bytes  received 762 bytes  total
> size 29028217
> 
> Interestingly, this time, only the two small files get retransmitted, the
> big one is left out.
> 
> I then restored my configuration to include the complete client pc,
> keeping the --checksum parameter. Sadly, now all I get is fileListReceived 
> errors
> on the server, so this didn't help either.
> 
> And for the record, I tried both rsync 2.6.8 and 3.0.4 on the client.
> 
> Craig, is this expected behaviour? Why does the full backup retransmit
> everything every time?
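The "--checksum" experiment in the quoted message corresponds to a config.pl 
change roughly like the following (a sketch in the same $Conf{...} style as 
the fragments above; the surrounding arguments depend on your existing 
$Conf{RsyncArgs} and are deliberately not filled in):

```perl
$Conf{RsyncArgs} = [
    # ... your existing rsync arguments ...
    '--checksum',   # compare files by content checksum, not size+mtime
];
```

Note that --checksum only changes how rsync decides whether a file is 
"unchanged"; it still forces a full read of every file on both ends.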
 


_______________________________________________
BackupPC-users mailing list
BackupPC-users AT lists.sourceforge DOT net
List:    https://lists.sourceforge.net/lists/listinfo/backuppc-users
Wiki:    http://backuppc.wiki.sourceforge.net
Project: http://backuppc.sourceforge.net/
