Re: [BackupPC-users] An idea to fix both SIGPIPE and memory issues with

On Tue, Dec 15, 2009 at 02:33:06PM +0100, Holger Parplies wrote:
> Robin Lee Powell wrote on 2009-12-15 00:22:41 -0800:
> > Oh, I agree; in an ideal world, it wouldn't be an issue.  I'm
> > afraid I don't live there.  :)
> 
> none of us do, but you're having problems. We aren't. 

How many of you are backing up trees as large as I am?  So far,
everyone who has commented on the matter has said it's not even
close.

> The suggestion that your *software* is probably misconfigured in
> addition to the *hardware* being flakey makes a lot of sense to
> me. 

Certainly possible, but if it is I genuinely have no idea where the
misconfiguration might be.  Also note that only the incrementals
seem to fail; the initial fulls ran Just Fine (tm).  One of them
took 31 hours.

For what it's worth, here's what a client strace says before things
crack on one of my larger incrementals; commentary welcome.

- ------------------

open("[customer dir]", O_RDONLY|O_NONBLOCK|O_DIRECTORY) = 3
fstat(3, {st_mode=S_IFDIR|0777, st_size=3864, ...}) = 0
fcntl(3, F_SETFD, FD_CLOEXEC)           = 0
getdents(3, /* 12 entries */, 4096)     = 616
lstat("[customer dir]/Sapphire_Pearl_Pendant_selector.jpg", 
{st_mode=S_IFREG|0644, st_size=1363, ...}) = 0
lstat("[customer dir]/Sapphire_Pearl_Pendant_gallery.jpg", 
{st_mode=S_IFREG|0644, st_size=5482, ...}) = 0
lstat("[customer dir]/Sapphire_Pearl_Pendant_shop_banner.jpg", 
{st_mode=S_IFREG|0644, st_size=19358, ...}) = 0
lstat("[customer dir]/Sapphire_Pearl_Pendant_library.jpg", 
{st_mode=S_IFREG|0644, st_size=2749, ...}) = 0
lstat("[customer dir]/Sapphire_Pearl_Pendant_badge.jpg", {st_mode=S_IFREG|0644, 
st_size=8073, ...}) = 0
lstat("[customer dir]/Sapphire_Pearl_Pendant_browse.jpg", 
{st_mode=S_IFREG|0644, st_size=2352, ...}) = 0
lstat("[customer dir]/Sapphire_Pearl_Pendant_display.jpg", 
{st_mode=S_IFREG|0644, st_size=33957, ...}) = 0
lstat("[customer dir]/Sapphire_Pearl_Pendant_segment.jpg", 
{st_mode=S_IFREG|0644, st_size=1152, ...}) = 0
lstat("[customer dir]/Sapphire_Pearl_Pendant.JPG", {st_mode=S_IFREG|0644, 
st_size=88733, ...}) = 0
lstat("[customer dir]/Sapphire_Pearl_Pendant_market_banner.jpg", 
{st_mode=S_IFREG|0644, st_size=21168, ...}) = 0
getdents(3, /* 0 entries */, 4096)      = 0
close(3)                                = 0
gettimeofday({1260864378, 747386}, NULL) = 0
gettimeofday({1260864378, 747429}, NULL) = 0
mmap(NULL, 20398080, PROT_READ|PROT_WRITE, MAP_PRIVATE|MAP_ANONYMOUS, -1, 0) = 
0x2ba2eb836000
munmap(0x2ba2eb836000, 20398080)        = 0
select(2, NULL, [1], [1], {60, 0})      = 1 (out [1], left {60, 0})
write(1, 
"\217\2\0\7badge.jpgO/\0\0g\272\341I:3\17shop_banner.jpg\334\332\0\0h\272\341I:3\nbrowse.jpg\10\16\0\0i\272\341I:3\vlibrary.jpg\\\17\0\0h\272\341I:3\fselector.jpg\16\10\0\0j\272\341I:3\21market_banner.jpg\312\346\0\0f\272\341I:2\4.jpgx\255\7\0e\272\341I:2\f_display.jpg\r\235\0\0g\272\341I:3\vsegment.jpgk\6\0\0j\272\341I:$$9642/Banner_-BODIE-Piano_display.jpg\3447\0\0jc\342I\272=\tbadge.jpg\211\30\0\0\272=\vsegment.jpgn\4\0\0\272?\nlector.jpg>\5\0\0:>\16hop_banner.jpg\226d\0\0-$\270J:=\nbrowse.jpgs\10\0\0jc\342I\272=\vlibrary.jpgO\10\0\0\272<\4.jpg\342p\0\0\272<\f_gallery.jpg.\22\0\0\272=\21market_banner.jpg\25\203\0\0:$(2987/Sapphire_Pearl_Pendant_selector.jpgS\5\0\0\344~\341I\272@\vgallery.jpgj\25\0\0:@\17shop_banner.jpg\236K\0\0\343~\341I\272@\vlibrary.jpg\275\n\0\0:@\tbadge.jpg\211\37\0\0\342~\341I:A\trowse.jpg0\t\0\0\344~\341I:@\vdisplay.jpg\245\204\0\0\343~\341I:@\vsegment.jpg\200\4\0\0\344~\341I:?\4.JPG\235Z\1\0\342~\341I\272?\22_market_banner.jpg\260R\0\0\0\0\0\0\0",
 659) = 659
select(1, [0], [], NULL, {60, 0})       = 0 (Timeout)
select(1, [0], [], NULL, {60, 0})       = 0 (Timeout)
select(1, [0], [], NULL, {60, 0})       = 0 (Timeout)
select(1, [0], [], NULL, {60, 0})       = 0 (Timeout)
select(1, [0], [], NULL, {60, 0})       = 0 (Timeout)
select(1, [0], [], NULL, {60, 0})       = 0 (Timeout)
select(1, [0], [], NULL, {60, 0})       = 0 (Timeout)
select(1, [0], [], NULL, {60, 0})       = 0 (Timeout)
select(1, [0], [], NULL, {60, 0})       = 0 (Timeout)
select(1, [0], [], NULL, {60, 0})       = 0 (Timeout)
select(1, [0], [], NULL, {60, 0})       = 0 (Timeout)
select(1, [0], [], NULL, {60, 0})       = 0 (Timeout)
select(1, [0], [], NULL, {60, 0})       = 0 (Timeout)
select(1, [0], [], NULL, {60, 0})       = 0 (Timeout)
[snip lots of timeouts]
select(1, [0], [], NULL, {60, 0})       = 0 (Timeout)
select(1, [0], [], NULL, {60, 0})       = 0 (Timeout)
select(1, [0], [], NULL, {60, 0})       = 0 (Timeout)
select(1, [0], [], NULL, {60, 0})       = 1 (in [0], left {1, 212000})
read(0, "", 8184)                       = 0
select(2, NULL, [1], [1], {60, 0})      = 1 (out [1], left {60, 0})
write(1, "K\0\0\10rsync: connection unexpectedly closed (179 bytes received so 
far) [sender]\n", 79) = -1 EPIPE (Broken pipe)
--- SIGPIPE (Broken pipe) @ 0 (0) ---
write(2, "rsync: writefd_unbuffered failed to write 79 bytes [sender]: Broken 
pipe (32)", 77) = -1 EPIPE (Broken pipe)
--- SIGPIPE (Broken pipe) @ 0 (0) ---
rt_sigaction(SIGUSR1, {SIG_IGN}, NULL, 8) = 0
rt_sigaction(SIGUSR2, {SIG_IGN}, NULL, 8) = 0
write(2, "rsync error: errors with program diagnostics (code 13) at log.c(237) 
[sender=3.0.5]", 83) = -1 EPIPE (Broken pipe)
--- SIGPIPE (Broken pipe) @ 0 (0) ---
rt_sigaction(SIGUSR1, {SIG_IGN}, NULL, 8) = 0
rt_sigaction(SIGUSR2, {SIG_IGN}, NULL, 8) = 0
gettimeofday({1260871590, 977217}, NULL) = 0
select(0, NULL, NULL, NULL, {0, 100000}) = 0 (Timeout)
gettimeofday({1260871591, 76871}, NULL) = 0
exit_group(13)                          = ?

- ------------------

And here's the entirety of the xfer log errors:

- ------------------

incr backup started back to 2009-12-12 00:00:08 (backup #0) for directory /
Running: /usr/bin/ssh -p 8416 -i /engineyard/backuppc/.ssh/id_dsa -o 
StrictHostKeyChecking=no -q -x -l root 65.74.174.196 /usr/bin/rsync --server 
--sender --numeric-ids --perms --owner --group -D --links --times 
--block-size=2048 --recursive . /
Xfer PIDs are now 24552
Got remote protocol 30
Negotiated protocol version 28
Checksum seed is 1260871202
Got checksumSeed 0x4b275e22
Sent include: /[dir]
Sent include: /[dir]/[dir]
Sent include: /[dir]/[dir]/[dir]
Sent include: /[dir]/[dir]/[dir]/[dir]
Sent exclude: /*
Sent exclude: /[dir]/*
Sent exclude: /[dir]/[dir]/*
Sent exclude: /[dir]/[dir]/[dir]/*
Got file list: 5099022 entries
Child PID is 32756
Xfer PIDs are now 24552,32756
Sending csums, cnt = 5099022, phase = 0
Read EOF: 
Tried again: got 0 bytes
Child is aborting
Parent read EOF from child: fatal error!
Sending csums, cnt = 2251037, phase = 1
Done: 0 files, 0 bytes
Got fatal error during xfer (Child exited prematurely)
Backup aborted (Child exited prematurely)

- ------------------

I see no sign of system level troubles on either client or server.
Unfortunately, I wasn't running strace on the other large backup,
but it also failed, at a completely different time.

The really fun part is that the date when the strace exited (was
doing "strace -p NUM ; date") is 6 hours before the BackupPC server
claims that the backup aborted.  My ClientTimeout is set to 72000;
both backups aborted significantly *after* the twenty hour mark.
It's not relevant anyways, though; the connection was clearly
broken on the client end long before BackupPC timed out.

I'm totally willing to accept that the problem might be hardware or
software config on my end, but:

1.  It seems to only happen with incrementals.

2.  I have no idea even where to look; everything looks fine at a
system level as I understand it.  I don't have the networking skill
to debug the networking end (the two machines are seperate RFC 1918
address ranges, with a load balancer/firewall associated with each
(2 total) between them, plus a bunch of switches and so on).

Given that, it seems completely bizarre to me that you all are, I
dunno, morally offended? that I proposed increases BackupPC's
resilience to transient errors as a solution.

In the meantime, I guess I'll go try sharding things and hope it
doesn't overload the client's running production software too much,
because I don't see another way out of this.I

'm certainly not interested in maintaining my own patches or fork.
I'd like to think that if I made my idea run-time optional y'all
would roll it in, but the response has been so negative I'm worried.
Also, it's a lot of work.  -_-

-Robin

-- 
They say:  "The first AIs will be built by the military as weapons."
And I'm  thinking:  "Does it even occur to you to try for something
other  than  the default  outcome?"  See http://shrunklink.com/cdiz
http://www.digitalkingdom.org/~rlpowell/ *** http://www.lojban.org/

------------------------------------------------------------------------------
This SF.Net email is sponsored by the Verizon Developer Community
Take advantage of Verizon's best-in-class app development support
A streamlined, 14 day to market process makes app distribution fast and easy
Join now and get one step closer to millions of Verizon customers
http://p.sf.net/sfu/verizon-dev2dev 
_______________________________________________
BackupPC-users mailing list
BackupPC-users AT lists.sourceforge DOT net
List:    https://lists.sourceforge.net/lists/listinfo/backuppc-users
Wiki:    http://backuppc.wiki.sourceforge.net
Project: http://backuppc.sourceforge.net/
Re: [BackupPC-users] An idea to fix both SIGPIPE and memory issues with rsync