Re: [BackupPC-users] BackupPC_dump hangs with: .: size doesn't match (12288 vs 17592185913344)

2011-04-04 12:59:53
From: Holger Parplies <wbppc AT parplies DOT de>
To: John Rouillard <rouilj-backuppc AT renesys DOT com>
Date: Mon, 4 Apr 2011 18:59:00 +0200
Hi,

John Rouillard wrote on 2011-03-31 15:20:23 +0000 [[BackupPC-users] 
BackupPC_dump hangs with: .: size doesn't match (12288 vs 17592185913344)]:
> [...]
> I get a bunch of output (the share being backed up /etc on a centos
> 5.5. box) which ends with:
> 
>   attribSet: dir=f%2fetc exists
>   attribSet(dir=f%2fetc, file=zshrc, size=640, placeholder=1)
>   Starting file 0 (.), blkCnt=134217728, blkSize=131072, remainder=0
>   .: size doesn't match (12288 vs 17592185913344)

At first glance, this would appear to be an indication of something I have
been suspecting for a long time: corruption - caused by whatever - in an
attrib file leading to the SIGALRM abort. If I remember correctly, someone
(presumably File::RsyncP) would ordinarily try to allocate space for the file
(though that doesn't seem to make sense, so I probably remember incorrectly)
and either give up when that fails or refrain from trying in the first
place, because the amount is obviously insane.
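
The quoted log lines allow a quick arithmetic check. The sketch below assumes
(a guess from the numbers alone, not verified against the File::RsyncP or
BackupPC source) that the second size in the message is derived from the
"Starting file" line as (blkCnt - 1) * blkSize + remainder. The function name
is made up for illustration.

```python
# Values copied from the quoted log lines.
BLK_CNT = 134217728             # blkCnt
BLK_SIZE = 131072               # blkSize
REMAINDER = 0                   # remainder
ATTRIB_SIZE = 12288             # first number in "size doesn't match"
REPORTED_SIZE = 17592185913344  # second number in "size doesn't match"


def size_from_blocks(blk_cnt: int, blk_size: int, remainder: int) -> int:
    """Hypothetical reconstruction: every block but the last is full,
    and the last block holds `remainder` bytes."""
    if blk_cnt <= 0:
        return 0
    return (blk_cnt - 1) * blk_size + remainder


derived = size_from_blocks(BLK_CNT, BLK_SIZE, REMAINDER)
print(derived, derived == REPORTED_SIZE)
# Both announced values are exact powers of two.
print(BLK_CNT == 2**27, BLK_SIZE == 2**17)
```

If that assumption holds, the insane value follows directly from the block
count and block size announced for '.', so those two values would be worth
tracing back in addition to the attrib file contents.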

The weird thing in this case is that we're seeing a directory. There is
absolutely no reason (unless I am missing something) to worry about the
*size* of a directory. The value is entirely file system dependent and
not even necessarily an indication of the *current* number of entries in
the directory. In any case, you restore the contents of a directory by
restoring the files in it, and you (incrementally) back up a directory by
determining if any files have changed or been added. The *size* of a
directory will not help with that decision.
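
To illustrate the point, here is a minimal sketch of such a decision that
never looks at the directory's own size. It is not BackupPC's logic; the
function name, the snapshot layout (name -> (size, mtime)) and all sample
values except the size of zshrc are made up for illustration.

```python
from typing import Dict, List, Tuple

# name -> (size, mtime) for the files in one directory
Snapshot = Dict[str, Tuple[int, int]]


def files_to_back_up(previous: Snapshot, current: Snapshot) -> List[str]:
    """Return the names that are new or whose (size, mtime) differ."""
    return sorted(
        name for name, meta in current.items()
        if previous.get(name) != meta
    )


# Hypothetical state of /etc at the reference backup and now.
previous = {"zshrc": (640, 1300000000), "hosts": (187, 1290000000)}
current = {
    "zshrc": (640, 1300000000),   # unchanged
    "hosts": (201, 1300766985),   # changed
    "motd": (0, 1300766000),      # added
}
print(files_to_back_up(previous, current))
```

Deleted files are ignored here to keep the sketch short; the decision is made
purely per entry, which is why the directory size carries no information.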

Then again, the problematic file (or attrib file entry) may or may not be the
last one reported (maybe it's the first one not reported?).

> [...] I have had similar hanging issues before
> but usully scheduling a full backup or removing a prior backup or two
> in the chain will let things work again. However I would like to
> actually get this fixed this time around as it seems to be occurring
> more often recently (on different backuppc servers and against
> different hosts).

I agree with you there. This is probably one of the most frustrating problems
to be encountered with BackupPC, because there is no obvious cause and nothing
obvious to correct (throwing away part of your backup history for no better
reason than "after that it works again" is somewhat unsatisfactory).

The reason this matter has not been investigated any further so far seems to
be that it is usually "solved" by removing the reference backup (I believe
simply running a full backup will encounter the same problem again), because
people tend to want to get their backups back up and running. There are two
things to think about here:

1.) Why does attrib file corruption cause the backup to hang? Is there no
    sane(r) way to deal with the situation?
2.) How does the attrib file get corrupted in the first place?

Presuming it *is* attrib file corruption. Could you please send me a copy of
the attrib file off-list?

> If I dump the root attrib file (where /etc starts) for either last
> successful or the current (partial) failing backup I see:
> 
>   '/etc' => {
>     'uid' => 0,
>     'mtime' => 1300766985,
>     'mode' => 16877,
>     'size' => 12288,
>     'sizeDiv4GB' => 0,
>     'type' => 5,
>     'gid' => 0,
>     'sizeMod4GB' => 12288
>   },

I would expect the interesting part to be the '.' entry in the attrib file for
'/etc' (f%2fetc/attrib of the last successful backup, that is). And I would be
curious about how the attrib file was decoded, because I'd implement decoding
differently from how BackupPC does, though BackupPC's method does appear to be
well tested ;-).
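
For reference, the quoted entry for '/etc' can be sanity-checked on its own.
The sketch below only reads the values from the dump; that 'size' is
sizeDiv4GB * 4 GiB + sizeMod4GB is an assumption based on the field names,
and the helper name is made up.

```python
import stat
from datetime import datetime, timezone

# Entry copied from the quoted dump of the root attrib file.
entry = {
    "uid": 0, "gid": 0, "mtime": 1300766985, "mode": 16877,
    "size": 12288, "sizeDiv4GB": 0, "sizeMod4GB": 12288, "type": 5,
}

FOUR_GB = 2**32


def combined_size(e: dict) -> int:
    """Assumed relation between the split size fields and 'size'."""
    return e["sizeDiv4GB"] * FOUR_GB + e["sizeMod4GB"]


# The split fields agree with 'size'.
print(combined_size(entry) == entry["size"])
# The mode describes a directory with ordinary permissions.
print(oct(entry["mode"]), stat.S_ISDIR(entry["mode"]),
      stat.filemode(entry["mode"]))
# Modification time in UTC.
print(datetime.fromtimestamp(entry["mtime"], tz=timezone.utc).isoformat())
```

Under that assumption the entry is consistent with itself (a directory,
mode 0755, size 12288), and 12288 is the first of the two numbers in the
"size doesn't match" message, not the insane one.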

> [...] the last few lines of strace show:
> 
> [...]
>   19368 15:00:38.199634 select(1, [0], [], NULL, {60, 0}) = 0 (Timeout)
>     <59.994597>

I believe this is the result of File::RsyncP having given up on the transfer
because of either a failing malloc() or a suppressed malloc(). I'll have to
find some time to check in more detail. I vaguely remember it was a rather
complicated matter, and there was never really enough evidence to support the
theory that corrupted attrib files were the cause. But I sure would like to get to
the bottom of this :-).
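
For readers less used to strace output: the quoted line is a select() call
waiting up to 60 seconds for file descriptor 0 to become readable and
returning 0, i.e. nothing arrived. The sketch below reproduces that pattern
on a POSIX system, using a pipe and a short timeout instead of stdin and 60
seconds; the helper name is made up.

```python
import os
import select
import time


def wait_readable(fd: int, timeout_s: float) -> bool:
    """Return True if fd becomes readable within timeout_s seconds."""
    readable, _, _ = select.select([fd], [], [], timeout_s)
    return bool(readable)


# A pipe stands in for the connection to the peer.
r, w = os.pipe()

start = time.monotonic()
timed_out = not wait_readable(r, 0.2)   # nothing has been written yet
elapsed = time.monotonic() - start

os.write(w, b"x")
has_data = wait_readable(r, 0.2)        # one byte is pending now

os.close(r)
os.close(w)
print(timed_out, has_data)
```

A process that only ever sees the timeout branch is waiting for data the
other side never sends, which fits a transfer that was silently given up.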

Regards,
Holger
