Subject: Re: [BackupPC-users] Bad md5sums due to zero size (uncompressed) cpool files - WEIRD BUG
From: "Jeffrey J. Kosowsky" <backuppc AT kosowsky DOT org>
To: Holger Parplies <wbppc AT parplies DOT de>
Date: Wed, 05 Oct 2011 21:35:27 -0400
Holger Parplies wrote at about 17:41:48 +0200 on Wednesday, October 5, 2011:
 > Hi,
 > 
 > Jeffrey J. Kosowsky wrote on 2011-10-04 18:58:51 -0400 [[BackupPC-users] Bad 
 > md5sums due to zero size (uncompressed) cpool files - WEIRD BUG]:
 > > After the recent thread on bad md5sum file names, I ran a check on all
 > > my 1.1 million cpool files to check whether the md5sum file names are
 > > correct.
 > > 
 > > I got a total of 71 errors out of 1.1 million files:
 > > [...]
 > > - 68 of the 71 were *zero* sized when decompressed
 > > [...]
 > > Each such cpool file has anywhere from 2 to several thousand links
 > > [...]
 > > It turns out though that none of those zero-length decompressed cpool
 > > files were originally zero length but somehow they were stored in the
 > > pool as zero length with an md5sum that is correct for the original
 > > non-zero length file.
 > > [...]
 > > Now it seems unlikely that the files were corrupted after the backups
 > > were completed since the header and trailers are correct and there is
 > > no way that the filesystem would just happen to zero out the data
 > > while leaving the header and trailers intact (including checksums).
 > > [...]
 > > Also, on my latest full backup a spot check shows that the files are
 > > backed up correctly to the right non-zero length cpool file which of
 > > course has the same (now correct) partial file md5sum. Though as you
 > > would expect, that cpool file has a _0 suffix since the earlier zero
 > > length is already stored (incorrectly) as the base of the chain.
 > > [...]
 > > In summary, what could possibly cause BackupPC to truncate the data
 > > sometime between reading the file/calculating the partial file md5sum
 > > and compressing/writing the file to the cpool?
 > 
 > the first and only thing that springs to my mind is a full disk. In some
 > situations, BackupPC needs to create a temporary file (RStmp, I think) to
 > reconstruct the remote file contents. This file can become quite large, I
 > suppose. Independent of that, I remember there is *at least* an "incorrect
 > size" fixup which needs to copy already written content to a different hash
 > chain (because the hash turns out to be incorrect *after*
 > transmission/compression). Without looking closely at the code, I could
 > imagine (but am not sure) that this could interact badly with a full disk:
 > 
 > * output file is already open, headers have been written
 > * huge RStmp file is written, filling up the disk
 > * received file contents are for some reason written to disk (which doesn't
 >   work - no space left) and read back for writing into the output file
 >   (giving zero-length contents)
 > * trailing information is written to the output file - this works, because
 >   there is enough space left in the already allocated block for the file
 > * RStmp file gets removed and the rest of the backup continues without
 >   apparent error
 > 
 > Actually, for the case I tried to invent above, this doesn't seem to fit, but
 > the general idea could apply - at least the symptoms are "correct content
 > stored somewhere but read back incorrectly". This would mean the result of a
 > write operation would have to be unchecked by BackupPC somewhere (or handled
 > incorrectly).
 > 
 > So, the question is: have you been running BackupPC with an almost full disk?

Nope - disk has plenty of space...

 > Would there be at least one file in the backup set whose *uncompressed*
 > size is large in comparison to the reserved space (->
 > DfMaxUsagePct)?

Nothing large by today's standards - I don't back up any large databases
or video files.

 > 
 > For the moment, that's the most concrete thing I can think of. Of course,
 > writing to a temporary location might be fine and reading could fail (you
 > haven't modified your BackupPC code to use a signal handler for some arbitrary
 > purposes, have you? ;-). Or your Perl version could have an obscure bug that
 > occasionally trashes the contents of a string. Doesn't sound very likely,
 > though.
 > 
 > What *size* are the original files?

About half are attrib files of normal directories, so they are quite
small. One I just checked was a kernel Documentation file of < 20K.

 > 
 > Ah, yes. How many backups are (or rather were) you running in parallel? No one
 > said the RStmp needs to be created by the affected backup ...

I don't run more than 2-3 backups in parallel.
And again, my disk is far from full (about 60% used of a 250GB partition),
and the files with errors so far all seem to be small.

I do have the partition mounted over NFS, but I'm now using an updated
kernel on both machines (kernel 2.6.32), so it's not the same buggy NFS
behavior I had years ago with an old 2.6.12 kernel.

But still, I would think an NFS error would trash the entire file, not
just the data portion of a compressed file...

Looking at the timestamps of the bad pool files, the errors occurred in
the Feb-April time frame (note this pool was started in February) and
there have been no errors since then. But the errors are sprinkled
across ~10 different days during that period, so whatever happened,
happened several times. I haven't really changed/added many files since
April other than normal daily logs, mail spools, and a few files that I
have been editing, so it could be that the rare event simply hasn't
recurred because I haven't added many new files to the pool.
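
In case anyone wants to do the same kind of timestamp survey on their
own flagged files, a quick Python sketch along these lines would do it
(it just assumes you already have a list of suspect pool files, one
path per line on stdin):

import datetime
import os
import sys

# Print the day each suspect pool file was last modified, so that
# clustering like the Feb-April pattern above stands out.
for line in sys.stdin:
    path = line.strip()
    if not path:
        continue
    day = datetime.date.fromtimestamp(os.stat(path).st_mtime)
    print(day.isoformat(), path)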

Finally, remember it's possible that many people are having this
problem but just don't know it, since the only way one would know is to
actually compute the partial file md5sums of all the pool files and/or
restore & test one's backups. Since the error affects only 71 out of
1.1 million files, it's possible that no one has ever noticed...

It would be interesting if other people would run a test on their
pools to see if they have similar issues (remember I only tested my
pool in response to the recent thread from the user who was having
issues with his pool)...
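
If anyone wants a starting point, here is a rough sketch of the cheap
half of such a test - untested Python, so treat it as an outline rather
than a finished tool. It only walks the cpool and flags files whose
data decompresses to zero bytes (the symptom in 68 of my 71 bad files);
a full test would also recompute the partial file md5sum (File2MD5 in
BackupPC::Lib, if I remember right) and compare it against the file
name, which this does not do. It assumes cpool files are zlib streams
possibly followed by non-zlib data (e.g. cached rsync checksums), and
the cpool path below is just an example - adjust it for your install:

import os
import sys
import zlib

def decompressed_size(path):
    # Return the number of bytes the leading zlib stream(s) in the file
    # decompress to, ignoring any trailing non-zlib data.
    with open(path, "rb") as f:
        data = f.read()
    total = 0
    while data:
        d = zlib.decompressobj()
        try:
            total += len(d.decompress(data))
        except zlib.error:
            break               # trailing data that is not a zlib stream
        total += len(d.flush())
        if not d.eof:
            break               # truncated stream
        data = d.unused_data    # handle concatenated streams, if any
    return total

def main(cpool_root):
    for dirpath, _dirs, files in os.walk(cpool_root):
        for name in files:
            path = os.path.join(dirpath, name)
            try:
                if os.path.getsize(path) > 0 and decompressed_size(path) == 0:
                    print("zero-length when decompressed:", path)
            except OSError as err:
                print("cannot read", path, "-", err, file=sys.stderr)

if __name__ == "__main__":
    main(sys.argv[1] if len(sys.argv) > 1 else "/var/lib/backuppc/cpool")

You could get the same per-file answer with BackupPC_zcat <file> | wc -c,
but over 1.1 million files a single pass like the above is much faster.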
