BackupPC-users

Re: [BackupPC-users] BackupPC_dump hangs with: .: size doesn't match (12288 vs 17592185913344)

2011-04-06 15:09:31
Subject: Re: [BackupPC-users] BackupPC_dump hangs with: .: size doesn't match (12288 vs 17592185913344)
From: John Rouillard <rouilj-backuppc AT renesys DOT com>
To: backuppc-users AT lists.sourceforge DOT net
Date: Wed, 6 Apr 2011 19:07:28 +0000
Sadly I have some bad news.

I would love to add more info to this, but it appears the
@#^$@&#$#% backup decided to stop failing at 10PM on April
1st. Backups have been working fine since 8-(.  Sort of a
reverse april fools joke I guess.

However my comments are inline below as I have attempted to
gather more info for the next time this happens. 

On Mon, Apr 04, 2011 at 06:59:00PM +0200, Holger Parplies wrote:
> John Rouillard wrote on 2011-03-31 15:20:23 +0000 [[BackupPC-users] 
> BackupPC_dump hangs with: .: size doesn't match (12288 vs 17592185913344)]:
> > [...]
> > I get a bunch of output (the share being backed up /etc on a centos
> > 5.5. box) which ends with:
> > 
> >   attribSet: dir=f%2fetc exists
> >   attribSet(dir=f%2fetc, file=zshrc, size=640, placeholder=1)
> >   Starting file 0 (.), blkCnt=134217728, blkSize=131072, remainder=0
> >   .: size doesn't match (12288 vs 17592185913344)
> 
> at first glance, this would appear to be an indication of
> something I have been suspecting for a long time:
> corruption - caused by whatever - in an attrib file
> leading to the SIGALRM abort. [...]
>
> The weird thing in this case is that we're seeing a
> directory. There is absolutely no reason (unless I am
> missing something) to worry about the *size* of a
> directory. [...] The
> *size* of a directory will not help with that [how to
> restore a directory] decision.
> 
> Then again, the problematic file (or attrib file entry)
> may or may not be the last one reported (maybe it's the
> first one not reported?).
> 
> > [...] I have had similar hanging issues before
> > but usully scheduling a full backup or removing a prior backup or two
> > in the chain will let things work again. However I would like to
> > actually get this fixed this time around as it seems to be occurring
> > more often recently (on different backuppc servers and against
> > different hosts).
> 
> I agree with you there. This is probably one of the most
> frustrating problems to be encountered with BackupPC,
> because there is no obvious cause and nothing obvious to
> correct (throwing away part of your backup history for no
> better reason than "after that it works again" is somewhat
> unsatisfactory).
> 
> The reason not to investigate this matter any further so
> far seems to have been that it is usually "solved" by
> removing the reference backup (I believe simply running a
> full backup will encounter the same problem again),

That is correct, the last good backup was on:

   2011-03-22 22:00 at level 2

the next should have been a level 3 incremental on
2011-03-23. That incremental ran according to the logs but
failed on sigALRM and is missing from the completed backup
list (I assume because it failed and was not retained). Then
the full backup started on the 24th as I expected.

> because people tend to want to get their backups back up
> and running. There are two things to think about here:
> 
> 1.) Why does attrib file corruption cause the backup to
>     hang? Is there no sane(r) way to deal with the situation?
> 2.) How does the attrib file get corrupted in the first
>     place?

The underlying filesystem is a hardware raid 5 (3ware) with
4 1T disks that is scrubbed/verified weekly. Because of the
size of the backup partition 1.9T, I haven't gotten an fsck
to complete on it, but I have no indication of any memory or
disk hardware/raid controller issues that could scramble the
data.

> Presuming it *is* attrib file corruption. Could you please
> send me a copy of the attrib file off-list?

I will send you the attrib file for the top level and the
etc directory from the last completed backup before the
problem started.

> > If I dump the root attrib file (where /etc starts) for either last
> > successful or the current (partial) failing backup I see:
> > 
> >   '/etc' => {
> >     'uid' => 0,
> >     'mtime' => 1300766985,
> >     'mode' => 16877,
> >     'size' => 12288,
> >     'sizeDiv4GB' => 0,
> >     'type' => 5,
> >     'gid' => 0,
> >     'sizeMod4GB' => 12288
> >   },
> 
> I would expect the interesting part to be the '.' entry in
> the attrib file for '/etc' (f%2fetc/attrib of the last
> successful backup, that is).

If I dump .../293/f%2fetc/etc attrib file:

   -rw-r----- 2 backup dev 1094 Mar 22 22:00 attrib

for the incremental from 2011-03-22 (the last working on
prior to the full that started working on the 1st of April),
I don't see any entry for '.'. It looks like only
subdirectories and files are listed there.  There is one
oddity. One of the files listed in the attrib file as:

  'vgconfig_backup_VolGroup00' => {
    'uid' => 0,
    'mtime' => 1300756921,  ## Tue Mar 22 01:22:01 2011
    'mode' => 33152,
    'size' => 2945,
    'sizeDiv4GB' => 0,
    'type' => 0,
    'gid' => 0,
    'sizeMod4GB' => 2945
  },

looks like this in ls -l:

  -rw-r-----  2 backup dev 1085 Mar 23 22:00 fvgconfig_backup_VolGroup00

(note the 23 march mod time). With only two links, I am not
sure what is happening there. That file should be a
different file every day (due to an embedded timestamp).

> And I would be curious about
> how the attrib file was decoded, because I'd implement
> decoding differently from how BackupPC does, though
> BackupPC's method does appear to be well tested ;-).

Yeah I'm afraid I don't know. but the attribPrint script
isn't thowing any errors.

> > [...] the last few lines of strace show:
> > 
> > [...]
> >   19368 15:00:38.199634 select(1, [0], [], NULL, {60, 0}) = 0 (Timeout)
> >     <59.994597>
> 
> I believe this is the result of File::RsyncP having given
> up on the transfer because of either a failing malloc() or
> a suppressed malloc().

Yup that seems reasonable.

> I'll have to find some time to check in more detail. I
> vaguely remember it was a rather complicated matter, and
> there was never really enough evidence to support that
> corrupted attrib files were really the cause. But I sure
> would like to get to the bottom of this :-).

Agreed.

-- 
                                -- rouilj

John Rouillard       System Administrator
Renesys Corporation  603-244-9084 (cell)  603-643-9300 x 111

------------------------------------------------------------------------------
Xperia(TM) PLAY
It's a major breakthrough. An authentic gaming
smartphone on the nation's most reliable network.
And it wants your games.
http://p.sf.net/sfu/verizon-sfdev
_______________________________________________
BackupPC-users mailing list
BackupPC-users AT lists.sourceforge DOT net
List:    https://lists.sourceforge.net/lists/listinfo/backuppc-users
Wiki:    http://backuppc.wiki.sourceforge.net
Project: http://backuppc.sourceforge.net/

<Prev in Thread] Current Thread [Next in Thread>