Would appending the file size to the md5 (either literally or
notionally) further decrease the astronomically small chance of a
non-purposely constructed collision?
I don't think so. It would also cause the digests to differ from the full-file digests in rsync 3.x. (In BackupPC 3.x, adding the size helped because only a subset of the file contents was used for files larger than 256K. However, I regret adding the file size at the start of the digest calculation rather than at the end!)
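To make the 3.x behavior concrete, here is a hedged sketch (not the actual BackupPC code) of a partial-file digest that hashes the size plus only the first 256K of content, with an `append_size` option illustrating the size-at-the-end alternative; the function name and exact layout are illustrative assumptions:

```python
import hashlib

CHUNK = 256 * 1024  # 3.x only hashed a subset of files larger than 256K


def digest_3x_style(data: bytes, append_size: bool = False) -> str:
    """Illustrative sketch of a 3.x-style partial-file digest.

    Hashes the file size plus at most the first 256K of content.
    append_size=False mimics 3.x (size at the start of the calculation);
    append_size=True shows the size-at-the-end alternative.
    """
    h = hashlib.md5()
    size = str(len(data)).encode()
    if not append_size:
        h.update(size)        # 3.x: size folded in first
    h.update(data[:CHUNK])    # only a prefix of large files is hashed
    if append_size:
        h.update(size)        # alternative: size folded in last
    return h.hexdigest()
```

Note how two large files with the same size and the same first 256K collide by design under this scheme, which is why including the size at all was a useful disambiguator in 3.x.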
Also, you note a number of features that should significantly speed up
backups (including rsync 3.0, C code, the --checksum flag, etc.). Have you
done any benchmarking relative to v3.x?
It depends. For slow clients and initial backups (which involve a lot of compression CPU time), there isn't much difference. On faster clients I have seen significant speedups. Other than initial compression, I suspect the server load is significantly lower than with 3.x, but I haven't made measurements.
One place where 4.x is slower is that it doesn't implement block checksum caching. So if you go back to --ignore-times fulls, there is a lot more work to do on the server compared to 3.x. That case offsets some or all of the other performance gains. I haven't decided whether block checksum caching is worth implementing, because --checksum is quite convenient.
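The work difference can be sketched as follows. This is an illustrative Python comparison (not rsync's actual implementation, and the block size is an assumption): an --ignore-times full makes the server recompute a digest per block of every file, which is what a checksum cache would avoid, while a --checksum full needs only one whole-file digest per file:

```python
import hashlib

BLOCK = 2048  # illustrative block size; rsync chooses its own


def block_checksums(data: bytes) -> list:
    """Per-block digests: the work the server redoes on every
    --ignore-times full when there is no block checksum cache."""
    return [hashlib.md5(data[i:i + BLOCK]).hexdigest()
            for i in range(0, len(data), BLOCK)]


def whole_file_digest(data: bytes) -> str:
    """One digest per file, as with --checksum fulls; an unchanged
    file can then be matched without transferring its contents."""
    return hashlib.md5(data).hexdigest()
```

For a large file this is many block digests versus a single file digest, which is why the missing cache mainly hurts the --ignore-times case.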
Here are two examples using --checksum fulls:
- A fast client (MacBook Pro with a flash drive): a full is about 3x faster, an incremental is about 4x faster. An initial backup with an empty pool is no faster. An initial backup with a populated pool is maybe 30% faster (since --checksum allows any pool file to be matched), with a big saving in network traffic. The reported BackupPC speed is >100MB/sec for the full (but obviously not much data is being transferred because of --checksum).
- Backing up part of my BackupPC server to the same server (an old Xeon with 3ware RAID): a full is about 30% faster, an incremental is about 3x faster (another backup was running during these tests).
Craig