BackupPC-users

Re: [BackupPC-users] An idea to fix both SIGPIPE and memory issues with rsync

2009-12-15 15:30:53
Subject: Re: [BackupPC-users] An idea to fix both SIGPIPE and memory issues with rsync
From: Les Mikesell <lesmikesell AT gmail DOT com>
To: "General list for user discussion, questions and support" <backuppc-users AT lists.sourceforge DOT net>
Date: Tue, 15 Dec 2009 14:28:50 -0600
Robin Lee Powell wrote:
> On Tue, Dec 15, 2009 at 02:33:06PM +0100, Holger Parplies wrote:
>> Robin Lee Powell wrote on 2009-12-15 00:22:41 -0800:
>>> Oh, I agree; in an ideal world, it wouldn't be an issue.  I'm
>>> afraid I don't live there.  :)
>> none of us do, but you're having problems. We aren't. 
> 
> How many of you are backing up trees as large as I am?  So far,
> everyone who has commented on the matter has said it's not even
> close.

I think most other people broke up their large runs on directory 
boundaries already.  I sort-of recall someone posting a script to do it 
dynamically as things changed some time ago.
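
A minimal sketch of what splitting on directory boundaries could look like (this is not the script mentioned above, which I have not seen). It only lists the first-level subdirectories of a tree so that each can be set up as its own smaller share; the function name and the example path are hypothetical, and -mindepth/-maxdepth assume a GNU or BSD find.

```shell
#!/bin/sh
# Print every first-level subdirectory of the tree given as $1, one per line,
# sorted, so each can be backed up as a separate (smaller) run.
# Note: files sitting directly in the top directory are NOT listed and would
# need a share of their own.
list_top_level_dirs() {
    find "$1" -mindepth 1 -maxdepth 1 -type d | sort
}

# Example (hypothetical path):
#   list_top_level_dirs /srv/data
```

Re-running this from cron and regenerating the share list would follow the tree as directories come and go, at the cost of having to feed the result into the host's configuration yourself.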

>> The suggestion that your *software* is probably misconfigured in
>> addition to the *hardware* being flakey makes a lot of sense to
>> me. 
> 
> Certainly possible, but if it is I genuinely have no idea where the
> misconfiguration might be.  Also note that only the incrementals
> seem to fail; the initial fulls ran Just Fine (tm).  One of them
> took 31 hours.

You did mention firewalls in the path, I think.  Is there any 
possibility that the incremental directory scan takes so long before 
finding a change that the firewall times out the connection because 
there is no activity?  If that is happening, turning on ssh keepalives 
might help.
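
For illustration, here is roughly what turning on ssh keepalives looks like with the OpenSSH client. ServerAliveInterval and ServerAliveCountMax are real OpenSSH client options; the values (60 seconds, 5 missed replies) and the wrapper name are assumptions chosen for the example, not tuned recommendations.

```shell
#!/bin/sh
# Run ssh with protocol-level keepalives: send a probe through the encrypted
# channel after 60 seconds of silence, and give up after 5 unanswered probes.
# The probes count as traffic, so an idle-timeout firewall sees activity.
ssh_keepalive() {
    ssh -o ServerAliveInterval=60 -o ServerAliveCountMax=5 "$@"
}

# Example (hypothetical host):
#   ssh_keepalive root@client.example.com uptime
```

In BackupPC the ssh invocation normally lives in the configured client command, so the two -o options would be added there rather than in a wrapper; the same settings can also go in the backup user's ssh config file.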

> 
> read(0, "", 8184)                       = 0
> select(2, NULL, [1], [1], {60, 0})      = 1 (out [1], left {60, 0})
> write(1, "K\0\0\10rsync: connection unexpectedly closed (179 bytes received so far) [sender]\n", 79) = -1 EPIPE (Broken pipe)
> --- SIGPIPE (Broken pipe) @ 0 (0) ---

This looks like rsync thinks the other side closed the connection.

> The really fun part is that the date when the strace exited (was
> doing "strace -p NUM ; date") is 6 hours before the BackupPC server
> claims that the backup aborted.  My ClientTimeout is set to 72000;
> both backups aborted significantly *after* the twenty hour mark.
> It's not relevant anyways, though; the connection was clearly
> broken on the client end long before BackupPC timed out.

That seems reasonable if the intermediate firewall broke the connection 
by sending a RST to the client while silently dropping things toward the 
server.

> I'm totally willing to accept that the problem might be hardware or
> software config on my end, but:
> 
> 1.  It seems to only happen with incrementals.

That points to long silent periods as a possibility.  What happens if 
you force a full instead of an incremental?  It should be somewhat 
slower because the client must read everything but the checksum xfer 
activity may keep the connection alive.

> 2.  I have no idea even where to look; everything looks fine at a
> system level as I understand it.  I don't have the networking skill
> to debug the networking end (the two machines are separate RFC 1918
> address ranges, with a load balancer/firewall associated with each
> (2 total) between them, plus a bunch of switches and so on).

Running 'tcpdump host otherhost and port 22' (where otherhost is the 
server's name or IP when run on the client, and the client's when run on 
the server) would show you the ssh stream packets.  The only interesting 
part is how it ends: a packet with the RST flag, or just nothing for a 
very long time.  If you have access to the firewall configs, you might 
also look at how long idle connections are permitted to remain open.
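
A small sketch of that capture, with the filter expressions built by helper functions so they are easy to reuse on both machines. The function names and the placeholder otherhost are hypothetical; the RST test uses the standard pcap filter syntax tcp[tcpflags] & tcp-rst.

```shell
#!/bin/sh
# Filter for the whole ssh stream to or from the other machine.
ssh_filter() {
    printf 'host %s and port 22' "$1"
}

# Same stream, but only packets that have the RST flag set.
ssh_rst_filter() {
    printf 'host %s and port 22 and tcp[tcpflags] & tcp-rst != 0' "$1"
}

# Usage (needs root; -n skips name lookups, -tttt prints date and time):
#   tcpdump -n -tttt "$(ssh_filter otherhost)"
#   tcpdump -n -tttt "$(ssh_rst_filter otherhost)"
```

If the RST-only capture on one side stays empty while the backup dies, that side saw silence rather than a reset, which is the asymmetric case described above.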

> Given that, it seems completely bizarre to me that you all are, I
> dunno, morally offended? that I proposed increasing BackupPC's
> resilience to transient errors as a solution.

No, we are just being pragmatic.  Even if such a change can be made, it 
isn't going to happen overnight - perhaps not for years.  So, from this 
side it seems bizarre that you aren't doing the practical things to work 
around your problem.  The most obvious thing might be to move the server 
so it is on the same network as the clients.  Or you could break the runs 
on directory boundaries, or try the hack I suggested of removing the 
--ignore-times option on fulls to make them fast enough to be practical 
all the time, giving you the saved partials you need.

> I'm certainly not interested in maintaining my own patches or fork.
> I'd like to think that if I made my idea run-time optional y'all
> would roll it in, but the response has been so negative I'm worried.
> Also, it's a lot of work.  -_-

My change would involve commenting out a couple of lines.  Then you'd 
have to put them back if you ever want the checksum compare to verify 
the backup copy.

-- 
   Les Mikesell
    lesmikesell AT gmail DOT com



