Subject: Re: [BackupPC-users] Incremental dumps hanging with 'Can't get rsync digests' & 'Can't call method "isCached"'
From: "Jeffrey J. Kosowsky" <backuppc AT kosowsky DOT org>
To: BackupPC Users List <backuppc-users AT lists.sourceforge DOT net>
Date: Mon, 27 Oct 2008 10:14:51 -0400
Holger Parplies wrote at about 13:57:35 +0100 on Monday, October 27, 2008:
 > Jeffrey J. Kosowsky wrote on 2008-10-27 01:18:24 -0400 [Re: [BackupPC-users] 
 > Incremental dumps hanging with 'Can't get rsync digests' & 'Can't call 
 > method "isCached"']:
 > sorry about that. I experimented on tar backups, which apparently don't
 > (redundantly) store the file type bits in the "mode" entry. rsync obviously
 > does. That's ok (i.e. expected, once you read the code). It's due to the way
 > file type information is transferred in the respective protocols/data streams.
 > I should check the validity of the file type bits in the mode entry 
 > separately.
 > I'll add that soon. For the moment, I suggest you change line 134 to
 > 
 >      or not exists $permmap {$list [$i + 2] & 07777}
 > 
 > (just adding the " & 07777").
Works like a charm!
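For anyone following along, the mask works because the file-type bits sit above the permission bits in st_mode, so "& 07777" strips the type and keeps only the permissions (including setuid/setgid/sticky). A quick illustration in Python, just to show the arithmetic (not part of the script itself):

```python
# The high bits of st_mode encode the file type (04xxxx = directory,
# 10xxxx = regular file); masking with 0o7777 strips them, leaving
# only the permission bits.
mode_dir = 0o040755    # directory with rwxr-xr-x
mode_file = 0o100644   # regular file with rw-r--r--

assert mode_dir & 0o7777 == 0o755
assert mode_file & 0o7777 == 0o644
print(oct(mode_dir & 0o7777))   # -> 0o755
```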
 > 
 > > I tried doing -X "40755,100644" etc. to exclude these perms but it
 > > didn't seem to have any effect
 > 
 > You probably mean "040755,0100644", but I'll change my eval() to oct(),
 > because I doubt anyone uses non-octal numeric modes anyway. eval() was not
 > a good idea to begin with. The patch should fix these two, but the remark
 > applies to any other values you might want to exclude until I change it.

Ahhhh, of course... it was about 2:30 AM my time, if that is any excuse ;)

 > > Also, the debug output for the Users list looks good except for the
 > > last 2 elements: 
 > >     0755 (which looks like a perm)
 > 
 > Yes, that is strange, especially the leading zero. You didn't specify a
 > "-u 0755" option, did you? :)
You are good! I may have been guilty of that -- I think I may have used it by
accident once instead of -X -- see the late-hour excuse above ;)

 > 
 > >     65534 (which seems like the max number for a uid)
 > > Similarly the last element of the group list is: 65534.
 > 
 > That's nobody and nogroup - don't they appear in your /etc/passwd and
 > /etc/group?
Again you are right!

 > Jeffrey J. Kosowsky wrote on 2008-10-27 04:00:27 -0400 [Re: [BackupPC-users] 
 > Incremental dumps hanging with 'Can't get rsync digests' & 'Can't call 
 > method "isCached"']:
 > > Interesting -- I ended up having to reboot (which of course required a
 > > restart of the backuppc service) and the problem went away.
 > > 
 > > This is the second time this has happened to me.
 > > I suspect (in a fuzzy type of way) that somehow this may have been
 > > caused by my rebooting the nfs server (which is mounted on
 > > /var/lib/BackupPC) without doing something like restarting the
 > > backuppc service - the result was that for some time there may have
 > > been a stale nfs link hanging around and it is possible that this
 > > occurred during the middle of a backup.
 > 
 > Normally, rebooting the NFS server should *not* lead to stale NFS mounts. In
 > my experience that happens when device numbers (on the NFS server) change
 > (though I vaguely remember seeing an unexpected instance of that myself
 > lately). Try to fix it and you will save yourself a lot of headaches (like
 > adding a hook to remount the FS, but that's another thread).
 > 
 > This probably means you shouldn't back up to an NFS-mounted pool (which you
 > probably shouldn't do for performance reasons anyway).

What is the alternative if you don't have room on your server and
can't "afford" something fancier, like a SAN?
For me, using NAS is very economical given the cost of drives and the
existence of cheap embedded Linux NAS devices. Maybe I am missing an
easy, better alternative.

Since backup speed is not that important for me (the speed of the
network is the primary limiting factor), I really like the flexibility
of using NAS. But due to BackupPC's reliance on hard links, I'm not
sure of any simpler solution than using NFS (and even for that I had
to recompile the kernel on my NAS to enable kernel NFS rather than
just user-space NFS).

 > What mount options are you using (esp. hard/soft, intr/nointr, tcp/udp)?
noexec,nosuid,nodev,intr,_netdev,async,timeo=25,soft
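(For context, the full /etc/fstab entry looks roughly like this -- the server name and export path below are placeholders, not my actual setup. Note that "hard,intr,tcp" is the combination usually recommended for a pool filesystem, since "soft" mounts can return I/O errors on timeout in the middle of a write:)

```
# /etc/fstab sketch -- hypothetical server/path; adjust to your setup.
# 'soft,timeo=25' risks EIO (and pool corruption) on a slow server;
# 'hard,intr' retries forever but stays interruptible by signals.
nas:/export/backuppc  /var/lib/BackupPC  nfs  noexec,nosuid,nodev,intr,_netdev,async,timeo=25,soft  0 0
```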

Note: I expect the incidence of "stale NFS" handles (and reboots of my
NAS device in general) to go down now that I am mostly done
configuring and customizing it.
 > 
 > > I also may have killed the BackupPC_dump process using 'kill -9' when I was
 > > unable to kill it from the web interface.
 > 
 > SIGKILL is a bad habit to get into. You should try SIGINT first (though it
 > probably won't work in the "stale NFS file handle" case). If you can't access
 > the pool, killing BackupPC_dump is unlikely to do any additional harm :).

True.
 > 
 > > Still... it would be nice to get some type of email or other warning
 > > when a backup freezes up because conceivably one could be unaware of
 > > this issue for days...
 > 
 > The BackupPC daemon could report backups running for an "unusually long" time
 > (for a configurable value of "unusually long") by e-mail. I would strongly
 > argue against aborting them (like $Conf{ClientTimeout} does), because the
 > daemon has even less control over what is actually happening on the network
 > level than BackupPC_dump, but optionally informing the admin seems 
 > reasonable.
 > It should be possible to turn these warnings off, though.

Ideally, one could use a slightly better measure than just elapsed
time, since that varies a lot with the size of the backup and the
network speed between server and client.
A better (and more accurate) measure of the underlying network
situation and backup progress might be a minimum backup "speed",
measuring progress either in files processed (which may not work well
for large files) or in MB transferred. This could optionally be
configurable on a host-by-host basis to take different network speeds
into account.
Clearly a hung (or nearly hung) backup would have a speed at (or near)
zero, and that is exactly what I am trying to prevent -- backups that
seem to be running but are in fact making zero progress.

This would also have the added bonus of alerting you to backups (or
machines) that are particularly slow but not hung so that you could
optimize your network or backup strategy.
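The speed check described above could be sketched as a small shell watchdog. Everything here is hypothetical -- BackupPC has no such hook, and the sampled directory path is an assumption about the pool layout; the idea is just to sample the size of the in-progress backup twice and compare the growth rate against a per-host minimum:

```shell
#!/bin/sh
# Hypothetical minimum-rate check for a running backup.
# rate_check BYTES_BEFORE BYTES_AFTER INTERVAL_SECS MIN_BYTES_PER_SEC
rate_check() {
    rate=$(( ($2 - $1) / $3 ))
    if [ "$rate" -lt "$4" ]; then
        echo "WARN: ${rate} B/s below minimum ${4} B/s"
    else
        echo "OK: ${rate} B/s"
    fi
}

# Usage sketch (assumed pool layout -- sample a host's "new" dir twice):
#   dir=/var/lib/BackupPC/pc/myhost/new
#   before=$(du -sb "$dir" | cut -f1); sleep 300
#   after=$(du -sb "$dir" | cut -f1)
#   rate_check "$before" "$after" 300 10240
rate_check 0 1048576 60 10240   # 1 MiB in 60 s, well above 10 KiB/s -> OK
```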
 
 > To sum it up, your problem appears to be NFS server related ("stale NFS file
 > handle"), not due to corrupted attrib files (though a crashing NFS server
 > could lead to corruption of an attrib file, I guess). Thank you for the
 > feedback on my script anyway.
 > 
Good summation. I will continue to keep an eye on things (now that my
NFS server is more stable) to make sure nothing else is going on...

Thanks soooo much for your detailed and thoughtful responses!!!!

_______________________________________________
BackupPC-users mailing list
BackupPC-users AT lists.sourceforge DOT net
List:    https://lists.sourceforge.net/lists/listinfo/backuppc-users
Wiki:    http://backuppc.wiki.sourceforge.net
Project: http://backuppc.sourceforge.net/
