BackupPC-users

Re: [BackupPC-users] An idea to fix both SIGPIPE and memory issues with rsync

2009-12-16 14:17:58
Subject: Re: [BackupPC-users] An idea to fix both SIGPIPE and memory issues with rsync
From: Robin Lee Powell <rlpowell AT digitalkingdom DOT org>
To: "General list for user discussion, questions and support" <backuppc-users AT lists.sourceforge DOT net>
Date: Wed, 16 Dec 2009 11:15:37 -0800
I want to start by saying that I appreciate all the help and
suggestions y'all have given on something that's obviously not your
problem.  :)  Unfortunately, it looks like this problem is (1) far
more interesting than I thought and (2) might be in BackupPC itself.

On Tue, Dec 15, 2009 at 11:29:46AM -0900, Chris Robertson wrote:
> My guess would be that your firewalls are set up to close
> "inactive" TCP sessions.  Try adding "-o ServerAliveInterval=60"
> to your RsyncClientCmd (so it looks something like "$sshPath -C -q
> -x -o ServerAliveInterval=60 -l root $host $rsyncPath $argList+")
> and see if that solves your problem.

Also suggested by several other people.

Well, umm, it kept the connection open.  Now I've got [counts] 8
BackupPC_dump processes doing absolutely nothing.

The one I was stracing and tcpduming and so on looks fine: ssh is
sending the keepalives, and they're being responded to
appropriately.  The rsync is up on the client, the dump is up on the
server, all procs are connected to each other correctly.

They're just not *doing* anything.  Nothing has errored out; BackupPC
thinks everything is fine.

Some of the places this is happening are very small backups that
usually take a matter of minutes.

Suddenly this isn't looking like a networking problem anymore; the
networking appears to be just fine.  This is looking like a BackupPC
problem.  My version, by the way, is 3.1.0.  It's like at some point
BackupPC forgot that it was supposed to be trying to back anything
up.

Is it possible the nightly jobs are configured in some weird way
that broke things?

Anyways, the one that has the problem consistently *also* always has
it in exactly the same place; I was watching it in basically every
way possible, so here comes the debugging stuff.  As will become
obvious, I probably need to turn the ssh debugging down and the
BackupPC debugging up.

If, for some reason, you want raw copies of any of this, let me know
and I'll post them.

On the machine in question, we're backing up /pictures/.
/pictures/agate/shared/pictures/ has about a thousand subdirs that
each have about a thousand subdirs; the *vast* majority of the files
on the file system are in there.

The backup always stops doing anything right after lstat of the last
file in /pictures/agate/shared/pictures/0216/2987/.  According to ls
-U, /pictures/agate/shared/pictures/0216/ is the last directory in
/pictures/agate/shared/pictures/, and
/pictures/agate/shared/pictures/0216/2987/ is the last directory in
/pictures/agate/shared/pictures/0216/.  Now, there are other
directories in /pictures/agate/shared/ that ls -U after pictures/,
but with rsync 3.0 on the client, I'm assuming those are handled in
parallel.

So it looks like the backup is stalling right after finishing
checking all the files.

No "new" dir is made.  There's a NewFileList, but it's empty.

XferLog.z is mostly useless; too many -v in my ssh call apparently,
but here's the last bit anyways (copied from less):

- -------------
browse.jpg<CE>^P^@^@w<85>@I:1^Kdisplay.jpgp<A3>^@^@y<85>@I:1^Ksegment.jpg<8D>^G^@^@t<85>@I:$^Y3187/DSCN1886_gallery.jpg<D4>^R^@^@}^V@I:2^Pfor_cropping.jpg<94>Y^@^@<88>^V@I:2^Klibrary.jpg<8A>
^@^@<8A>^V@I:2^Qmarket_banner.jpgN<C1>^@^@<8C>^V@I:2
browse.jpg~^H^@^@<80>^V@I:1^D.JPG^T<A0>^I^@<8C>^V@I:1
_badge.jpg5@^@<87>^V@I:2^Lselector.jpg'^E^@^@<84>^V@I:3^Nhop_banner.jpg<F7><88>^@^@<89>^V@I:2^Kdisplay.jpg<BF>x^@^@<82>^V@I:2^Ksegment.jpgl^D^@^@^?^V@I:%(968/food-wrapping-mr-gallery-x_badge.jpg<D5>b^@^@^\$@I:D^Kgallery.jpg^NB^@^@^Z$@I:D^Qmarket_banner.jpg<A8><D1>^@^@^^$@I:D^Klibrary.jpg<AB>^^^@^@^]$@I<BA>D^Oshop_banner.jpg<8C><CB>^@^@:D
browse.jpg&^Z^@^@@I:C^D.jpg<E3>^C^A^@^^$@I:C^L_segment.jpg<AD>
^@^@@I:D^Pfor_cropping.jpg!*^A^@^Z$@I:D^Lselector.jpg<96>^M^@^@^\$@I:D^Kdisplay.jpgR1^A^@@I:%^K539/061.JPG<DC>f^@^@d
 @I:,^L_gallery.jpgr^X^@^@a @I:-^Klibrary.jpg<D7>^I^@^@c 
@I:-^Pfor_cropping.jpg<F9>U^@^@` @I:-^Lselector.jpg<8C>^E^@^@b 
@I:-^Qmarket_banner.jpg<D2>l^@^@d @I:-^Oshop_banner.jpg^Hp^@^@c @I:-
browse.jpg<EE>^I^@^@b @I:-^Ksegment.jpg<8D>^D^@^@a @I:-^Kdisplay.jpgF@^@^@b 
@I:-^Ibadge.jpg]$^@^@c @I:$^T7026/blog_browse.jpg<D9>^O
^@^@<D7>o@I:.^Klibrary.jpg<EC>^R^@^@<D9>o@I:.^Lselector.jpg=^H^@^@<D8>o@I:.^Kgallery.jpg,&^@^@<D5>o@I:.^Qmarket_banner.jpgl
- -------------

The only relevant, and last, LOG line is: 2009-12-16 00:12:00 incr backup 
started back to 2009-12-12 00:00:08 (backup #0) for directory /

Current time on the box is Wed Dec 16 10:54:14 PST 2009
(boxes agree to within a second; didn't test beyond flipping back
and forth and typing "date").

Last bit of strace of the dump proc on the SERVER: backuppc  4607 24499  0 
00:12 ?        00:00:07 /usr/bin/perl /usr/local/bin/BackupPC_dump -i 
ey02-s00416_pictures

- -------------
04:34:54.081334 read(7, 
"nner.jpgDX\1\0mW\342I\272-\vlibrary.jpg\177\24\0\0:,\4.JPGN 
\t\0kW\342I:,\22_market_banner.jpg^\253\1\0lW\342I:$\0376998/P1010001_1_shop_banner.jpg\234\274\0\0\257F\342I:3\4.JPG\25j\t\0\255F\342I:3\f_library.jpg\340\n\0\0\256F\342I:4\21market_banner.jpg\255\356\0\0\255F\342I:4\vgallery.jpg\0\31\0\0\257F\342I:4\vsegment.jpg<\5\0\0\260F\342I\2726\nlector.jpgK\6\0\0:4\vdisplay.jpg\212\214\0\0\256F\342I:4\nbrowse.jpgU\v\0\0\257F\342I:5\10adge.jpg]$\0\0\256F\342I:$%0716/skpstonespecanbowl4_selector.jpg\10\6\0\0\347F\341I:<\4.jpg\252\313\4\0\344F\341I:<\f_gallery.jpgP\27\0\0\347F\341I:=\21market_banner.jpg\360\235\0\0\345F\341I:=\17shop_banner.jpgC\230\0\0\346F\341I:=\tbadge.jpgG\"\0\0\345F\341I:=\vlibrary.jpg\346\v\0\0\346F\341I:=\nbrowse.jpg+\n\0\0\347F\341I:=\vsegment.jpg\326\4\0\0\350F\341I:=\vdisplay.jpgh\223\0\0\345F\341I:$\0319342/CIMG8427_library.jpgd\16\0\0\1[\342I:2\fselector.jpg@\6\0\0\3[\342I:2\tbadge.jpg<0\0\0\1[\342I\2722\21market_banner.jpg\350\260\0\0:2\vgallery.jpgV\37\0\0\3[\342I:1\4.JPGR\210\2\0\0[\342I:1\20_shop_banner.jpg\260\243\0\0\2[\342I:2\vdisplay.jpg\212\311\0\0\1[\342I:2\vsegment.jpg\32\5\0\0\3[\342I\2722\nbrowse.jpg\363\v\0\0:$&7348/fuchsia_on_blue_2.JPG_display.jpg\346<\1\0\351K\342I\272?\tbadge.jpgH[\0\0:?\vsegment.jpg\305\t\0\0\353K\342I:>\4.jpg\257\360\22\0\350K\342I:>\f_library.jpg:\26\0\0\352K\342I\272?\vgallery.jpgf<\0\0:?\fselector.jpgS\f\0\0\353K\342I:@\16hop_banner.jpgA|\1\0\352K\342I:?\nbrowse.jpg\227\27\0\0\353K\342I:?\21market_banner.jpg`\301\1\0\351K\342I:$$0516/IMG_7919-copy_market_banner.jpg\363\366\0\0*D\341I:7\fselector.jpg4\n\0\0,D\341I:6\4.jpg\312y\10\0)D\341I:6\f_segment.jpg\371\7\0\0,D\341I\2727\nbrowse.jpg3\24\0\0:7\17shop_banner.jpg^\343\0\0+D\341I:7\vdisplay.jpgv\201\1\0*D\341I\2727\tbadge.jpg,K\0\0:7\vgallery.jpg\0223\0\0+D\341I\2727\vlibrary.jpg\352\23\0\0:$\0379142/CIMG8401_market_banner.jpg\205\351\0\0\253X\342I:2\vgallery.jpg\352#\0\0\254X\342I\2722\nbrowse.jpgq\r\0\0\2722\17shop_banner.jpg\264\330\0\0\2722\tbadge.jpg\3677\0\0\2722\vlibrary.jpg\225\17\0\0:2\fselector.jpg\271\6\0\0\255X\342I:1\4.JPG~\333\2\0\253X\342I:1\f_display.jpg\207\371\0\0\254X\342I:2\vsegment.jpg[\5\0\0\255X\342I:$)3457/ByNanasHands_437_gal"...,
 65536) = 4092
04:34:54.081334 select(8, [7], NULL, [7], NULL) = 1 (in [7])
04:35:06.849495 read(7, "\200\f\0\7"..., 65536) = 4
04:35:06.849495 select(8, [7], NULL, [7], NULL) = 1 (in [7])
04:35:06.849495 read(7, 
".jpg\3306\0\0002]\342I:7\21market_banner.jpg\323<\1\0000]\342I:$\0333057/IMG_3601new_browse.jpg\216\t\0\0\244\205\341I:5\vdisplay.jpg\274S\0\0\240\205\341I:5\vsegment.jpg\346\4\0\0\246\205\341I:7\nlector.jpg\315\5\0\0\245\205\341I:4\4.jpgy\304\22\0\235\205\341I:4\22_market_banner.jpg\r\252\0\0\236\205\341I:5\vgallery.jpgk\23\0\0\243\205\341I:5\tbadge.jpg\17\33\0\0\237\205\341I:5\vlibrary.jpg'\n\0\0\241\205\341I:5\17shop_banner.jpg\331u\0\0\242\205\341I:$\0265867/banner_browse.jpg\225\f\0\0\17*\342I\2720\vsegment.jpg\330\5\0\0\2720\vdisplay.jpg\30!\0\0\2720\21market_banner.jpg}\\\0\0\2720\vgallery.jpg\357\26\0\0\2720\vlibrary.jpgC\5\0\0\272/\4.JPG\t/\0\0\272/\r_selector.jpg\2\7\0\0\2720\tbadge.jpg\233\36\0\0\2720\17shop_banner.jpg\300?\0\0:$\0262887/williamsburg3.JPG\327v\5\0.{\341I:6\f_display.jpg\323\267\0\0/{\341I\2727\tbadge.jpg\34.\0\0:7\17shop_banner.jpg\227\344\0\0000{\341I:8\negment.jpg&\6\0\0001{\341I:9\nlector.jpg|\7\0\0000{\341I\2727\nbrowse.jpg\330\f\0\0:7\vlibrary.jpg\267\16\0\0/{\341I:7\21market_banner.jpg'\f\1\0.{\341I:7\vgallery.jpg\236\36\0\0000{\341I:$\0267919/Joy_2_segment.jpg-\7\0\0\312O\342I\272/\vdisplay.jpg\266\315\0\0\272/\20for_cropping.jpgg\227\0\0\272/\tbadge.jpg\2762\0\0\2720\trowse.jpg\27\20\0\0\272.\4.JPGZ\256\2\0\272.\20_shop_banner.jpgY\270\0\0\2720\velector.jpg\360\10\0\0\272/\vlibrary.jpg\234\22\0\0\272/\vgallery.jpg.#\0\0\272/\21market_banner.jpg\332\311\0\0:$\0320316/P1010406_selector.jpg\352\5\0\0\276B\341I:2\nbrowse.jpg\346\n\0\0\275B\341I:2\vdisplay.jpgY\275\0\0\273B\341I:2\vsegment.jpg\24\5\0\0\277B\341I:3\16hop_banner.jpg\3649\1\0\274B\341I:1\4.JPG\214$\20\0\271B\341I:1\f_library.jpg~\v\0\0\274B\341I:2\tbadge.jpg\r/\0\0\272B\341I\2722\21market_banner.jpg\310\261\1\0:2\vgallery.jpgb\35\0\0\275B\341I:$\0343257/S-54b_market_banner.jpg\245\17\0\0\233\223\341I:/\vlibrary.jpg6\v\0\0\234\223\341I\272/\vgallery.jpg\325\23\0\0\272/\17shop_banner.jpg\240\17\0\0\272/\vdisplay.jpg*[\0\0\272/\vsegment.jpg\310\4\0\0\272/\nbrowse.jpg\320\t\0\0:.\4.jpg\255T\1\0\233\223\341I:.\r_selector.jpg\33\6\0\0\234\223\341I\272/\tbadge.jpg\215\33\0\0:$\0309742/Jophiel_segment.jpg\364\5\0\0\0i\342I:1\vdisplay.jpg\23V\0\0\354h\342I:1\tbadge.jpg\321\"\0\0\377h\342I:1\17shop_banner.jpg\235\\\0\0\354h\342I:2\velector.jpgm\7\0\0\0i\342I:0\4.jpg\30a\2\0\353h\342"...,
 65536) = 3200
04:35:06.849495 select(8, [7], NULL, [7], NULL
- -------------

Note the select that's been sitting there for 6+ hours.

lsof, same proc, on the SERVER:

- -------------
xc88-s00071 ~ # lsof -p 4607
COMMAND    PID     USER   FD   TYPE             DEVICE     SIZE     NODE NAME
BackupPC_ 4607 backuppc  cwd    DIR                8,1      624        2 /
BackupPC_ 4607 backuppc  rtd    DIR                8,1      624        2 /
BackupPC_ 4607 backuppc  txt    REG                8,1  1104232    14724 
/usr/bin/perl5.8.8
BackupPC_ 4607 backuppc  mem    REG                8,1   115056    76305 
/lib64/ld-2.6.1.so
BackupPC_ 4607 backuppc  mem    REG                8,1    93392    75381 
/lib64/libpthread-2.6.1.so
BackupPC_ 4607 backuppc  mem    REG                8,1    84904    76357 
/lib64/libnsl-2.6.1.so
BackupPC_ 4607 backuppc  mem    REG                8,1    14632    76341 
/lib64/libdl-2.6.1.so
BackupPC_ 4607 backuppc  mem    REG                8,1   526568    75404 
/lib64/libm-2.6.1.so
BackupPC_ 4607 backuppc  mem    REG                8,1    22728    75346 
/lib64/libcrypt-2.6.1.so
BackupPC_ 4607 backuppc  mem    REG                8,1    10592    76264 
/lib64/libutil-2.6.1.so
BackupPC_ 4607 backuppc  mem    REG                8,1  1305848    76332 
/lib64/libc-2.6.1.so
BackupPC_ 4607 backuppc  mem    REG                8,1    31136    29639 
/usr/lib64/perl5/5.8.8/x86_64-linux/auto/Data/Dumper/Dumper.so
BackupPC_ 4607 backuppc  mem    REG                8,1    10288    29623 
/usr/lib64/perl5/5.8.8/x86_64-linux/auto/Cwd/Cwd.so
BackupPC_ 4607 backuppc  mem    REG                8,1    14288    29711 
/usr/lib64/perl5/5.8.8/x86_64-linux/auto/Fcntl/Fcntl.so
BackupPC_ 4607 backuppc  mem    REG                8,1    22592    29891 
/usr/lib64/perl5/5.8.8/x86_64-linux/auto/Socket/Socket.so
BackupPC_ 4607 backuppc  mem    REG                8,1    18440    29666 
/usr/lib64/perl5/5.8.8/x86_64-linux/auto/Digest/MD5/MD5.so
BackupPC_ 4607 backuppc  mem    REG                8,1    40080    29690 
/usr/lib64/perl5/5.8.8/x86_64-linux/auto/Encode/Encode.so
BackupPC_ 4607 backuppc  mem    REG                8,1    18592    29610 
/usr/lib64/perl5/5.8.8/x86_64-linux/auto/IO/IO.so
BackupPC_ 4607 backuppc  mem    REG                8,1    26904   209613 
/usr/lib64/perl5/vendor_perl/5.8.8/x86_64-linux/auto/List/Util/Util.so
BackupPC_ 4607 backuppc  mem    REG                8,1    18560    29648 
/usr/lib64/perl5/5.8.8/x86_64-linux/auto/File/Glob/Glob.so
BackupPC_ 4607 backuppc  mem    REG                8,1    67944   209571 
/usr/lib64/perl5/vendor_perl/5.8.8/x86_64-linux/auto/Compress/Raw/Zlib/Zlib.so
BackupPC_ 4607 backuppc  mem    REG                8,1    84368    76323 
/lib64/libz.so.1.2.3
BackupPC_ 4607 backuppc  mem    REG                8,1    22688   211593 
/usr/lib64/perl5/vendor_perl/5.8.8/x86_64-linux/auto/File/RsyncP/Digest/Digest.so
BackupPC_ 4607 backuppc  mem    REG                8,1    52064   211597 
/usr/lib64/perl5/vendor_perl/5.8.8/x86_64-linux/auto/File/RsyncP/FileList/FileList.so
BackupPC_ 4607 backuppc  mem    REG                8,1    31544    76340 
/lib64/libnss_compat-2.6.1.so
BackupPC_ 4607 backuppc  mem    REG                8,1    43496    75398 
/lib64/libnss_nis-2.6.1.so
BackupPC_ 4607 backuppc  mem    REG                8,1    43448    75374 
/lib64/libnss_files-2.6.1.so
BackupPC_ 4607 backuppc    0w   CHR                1,3               607 
/dev/null
BackupPC_ 4607 backuppc    1w  FIFO                0,6          15292289 pipe
BackupPC_ 4607 backuppc    2w  FIFO                0,6          15292289 pipe
BackupPC_ 4607 backuppc    3w   REG               8,33     2726   430785 
/backups/pc/ey02-s00416_pictures/LOG.122009
BackupPC_ 4607 backuppc    4w   REG               8,33 44026496 11706115 
/backups/pc/ey02-s00416_pictures/XferLOG.z
BackupPC_ 4607 backuppc    5w   REG               8,33        0 12301398 
/backups/pc/ey02-s00416_pictures/NewFileList
BackupPC_ 4607 backuppc    7u  unix 0xffff88002497f5c0          15292306 socket
- -------------

Last bits of strace of the ssh/rsyc proc on the SERVER: backuppc  4610  4607  0 
00:12 ?        00:00:01 /usr/bin/ssh -p 8416 -i 
/engineyard/backuppc/.ssh/id_dsa -o StrictHostKeyChecking=no -o 
ServerAliveInterval=60 -vvv -x -l root 65.74.174.196 /usr/bin/rsync --server 
--sender --numeric-ids --perms --owner --group -D --links --times 
--block-size=2048 --recursive . /

- -------------
10:54:07.576492 write(3, 
"\221\221\241\"\310\217\321\26\3\302\f\255ne8.\235\17\350~\201\234\227\305\34\200D\271\323\2463K\35\355C\355\3226\340\252hm\336\310\267-J!?\273\362\301\333\335\237'H\32\361F\370\262W\373"...,
 64) = 64
10:54:07.576492 select(7, [3 4], [], NULL, {60, 0}) = 1 (in [3], left {60, 0})
10:54:07.576492 read(3, 
"\3377\335\202\35\5\212_\361\205R\2176\vJ>\306Q\352\370(\224\342+I\211\234\231\250<\220L"...,
 8192) = 32
10:54:07.576492 select(7, [3 4], [], NULL, {60, 0}) = 0 (Timeout)
10:55:07.577250 select(7, [3 4], [3], NULL, {60, 0}) = 1 (out [3], left {60, 0})
10:55:07.577250 write(3, 
"\320\212\6\230!\253\314>j\300\356\1^\306\351\217\230f\3044G\f\235y\263>\223\352i\222\262\361\34\344G\231\v\215EyBs\300$\343\325\200\23R\216\332E\264e\217\304\336\314Q]\227$BM"...,
 64) = 64
10:55:07.577250 select(7, [3 4], [], NULL, {60, 0}) = 1 (in [3], left {60, 0})
10:55:07.577250 read(3, 
"$1\277{\241\253\3k\351\35\225\242\30\2\2770o\335\v\35\342\333%\251\312[\262x#y9O"...,
 8192) = 32
10:55:07.577250 select(7, [3 4], [], NULL, {60, 0}) = 0 (Timeout)
10:56:07.578008 select(7, [3 4], [3], NULL, {60, 0}) = 1 (out [3], left {60, 0})
10:56:07.578008 write(3, 
"\350]\373a\246G\21\255\352\246\317\27\"\252\207\256\243\337\33\343\334N2\232\350\316B\316'\204\345pq!\5\3R\217mH\225\335?e\302wI\25\0327\357V\373\350\376\235\330_\222\250\270\6\371\350"...,
 64) = 64
10:56:07.578008 select(7, [3 4], [], NULL, {60, 0}) = 1 (in [3], left {60, 0})
10:56:07.578008 read(3, 
"~\f\325\212\323u\227\375X\226W`>\3258G\2471\311\306S;\3\37\273\23M\263q$\v\324"...,
 8192) = 32
10:56:07.578008 select(7, [3 4], [], NULL, {60, 0}
- -------------

Note the rather different timestamps.

lsof, same proc, on the SERVER:

- -------------
xc88-s00071 ~ # lsof -p 4607
COMMAND    PID     USER   FD   TYPE             DEVICE     SIZE     NODE NAME
BackupPC_ 4607 backuppc  cwd    DIR                8,1      624        2 /
BackupPC_ 4607 backuppc  rtd    DIR                8,1      624        2 /
BackupPC_ 4607 backuppc  txt    REG                8,1  1104232    14724 
/usr/bin/perl5.8.8
BackupPC_ 4607 backuppc  mem    REG                8,1   115056    76305 
/lib64/ld-2.6.1.so
BackupPC_ 4607 backuppc  mem    REG                8,1    93392    75381 
/lib64/libpthread-2.6.1.so
BackupPC_ 4607 backuppc  mem    REG                8,1    84904    76357 
/lib64/libnsl-2.6.1.so
BackupPC_ 4607 backuppc  mem    REG                8,1    14632    76341 
/lib64/libdl-2.6.1.so
BackupPC_ 4607 backuppc  mem    REG                8,1   526568    75404 
/lib64/libm-2.6.1.so
BackupPC_ 4607 backuppc  mem    REG                8,1    22728    75346 
/lib64/libcrypt-2.6.1.so
BackupPC_ 4607 backuppc  mem    REG                8,1    10592    76264 
/lib64/libutil-2.6.1.so
BackupPC_ 4607 backuppc  mem    REG                8,1  1305848    76332 
/lib64/libc-2.6.1.so
BackupPC_ 4607 backuppc  mem    REG                8,1    31136    29639 
/usr/lib64/perl5/5.8.8/x86_64-linux/auto/Data/Dumper/Dumper.so
BackupPC_ 4607 backuppc  mem    REG                8,1    10288    29623 
/usr/lib64/perl5/5.8.8/x86_64-linux/auto/Cwd/Cwd.so
BackupPC_ 4607 backuppc  mem    REG                8,1    14288    29711 
/usr/lib64/perl5/5.8.8/x86_64-linux/auto/Fcntl/Fcntl.so
BackupPC_ 4607 backuppc  mem    REG                8,1    22592    29891 
/usr/lib64/perl5/5.8.8/x86_64-linux/auto/Socket/Socket.so
BackupPC_ 4607 backuppc  mem    REG                8,1    18440    29666 
/usr/lib64/perl5/5.8.8/x86_64-linux/auto/Digest/MD5/MD5.so
BackupPC_ 4607 backuppc  mem    REG                8,1    40080    29690 
/usr/lib64/perl5/5.8.8/x86_64-linux/auto/Encode/Encode.so
BackupPC_ 4607 backuppc  mem    REG                8,1    18592    29610 
/usr/lib64/perl5/5.8.8/x86_64-linux/auto/IO/IO.so
BackupPC_ 4607 backuppc  mem    REG                8,1    26904   209613 
/usr/lib64/perl5/vendor_perl/5.8.8/x86_64-linux/auto/List/Util/Util.so
BackupPC_ 4607 backuppc  mem    REG                8,1    18560    29648 
/usr/lib64/perl5/5.8.8/x86_64-linux/auto/File/Glob/Glob.so
BackupPC_ 4607 backuppc  mem    REG                8,1    67944   209571 
/usr/lib64/perl5/vendor_perl/5.8.8/x86_64-linux/auto/Compress/Raw/Zlib/Zlib.so
BackupPC_ 4607 backuppc  mem    REG                8,1    84368    76323 
/lib64/libz.so.1.2.3
BackupPC_ 4607 backuppc  mem    REG                8,1    22688   211593 
/usr/lib64/perl5/vendor_perl/5.8.8/x86_64-linux/auto/File/RsyncP/Digest/Digest.so
BackupPC_ 4607 backuppc  mem    REG                8,1    52064   211597 
/usr/lib64/perl5/vendor_perl/5.8.8/x86_64-linux/auto/File/RsyncP/FileList/FileList.so
BackupPC_ 4607 backuppc  mem    REG                8,1    31544    76340 
/lib64/libnss_compat-2.6.1.so
BackupPC_ 4607 backuppc  mem    REG                8,1    43496    75398 
/lib64/libnss_nis-2.6.1.so
BackupPC_ 4607 backuppc  mem    REG                8,1    43448    75374 
/lib64/libnss_files-2.6.1.so
BackupPC_ 4607 backuppc    0w   CHR                1,3               607 
/dev/null
BackupPC_ 4607 backuppc    1w  FIFO                0,6          15292289 pipe
BackupPC_ 4607 backuppc    2w  FIFO                0,6          15292289 pipe
BackupPC_ 4607 backuppc    3w   REG               8,33     2726   430785 
/backups/pc/ey02-s00416_pictures/LOG.122009
BackupPC_ 4607 backuppc    4w   REG               8,33 44026496 11706115 
/backups/pc/ey02-s00416_pictures/XferLOG.z
BackupPC_ 4607 backuppc    5w   REG               8,33        0 12301398 
/backups/pc/ey02-s00416_pictures/NewFileList
BackupPC_ 4607 backuppc    7u  unix 0xffff88002497f5c0          15292306 socket
- -------------

rsync/ssh on the CLIENT:

root     21661  3254  0 00:12 ?        00:00:01 sshd: root@notty
root     21663 21661 30 00:12 ?        03:18:21 /usr/bin/rsync --server 
--sender --numeric-ids --perms --owner --group -D --links --times 
--block-size=2048 --recursive . /

last bits of strace from rsync/ssh on the CLIENT:

- -------------
04:34:54.502841 
lstat("pictures/agate/shared/pictures/0216/2987/Sapphire_Pearl_Pendant_segment.jpg",
 {st_mode=S_IFREG|0644, st_size=1152, ...}) = 0
04:34:54.503024 
lstat("pictures/agate/shared/pictures/0216/2987/Sapphire_Pearl_Pendant.JPG", 
{st_mode=S_IFREG|0644, st_size=88733, ...}) = 0
04:34:54.503208 
lstat("pictures/agate/shared/pictures/0216/2987/Sapphire_Pearl_Pendant_market_banner.jpg",
 {st_mode=S_IFREG|0644, st_size=21168, ...}) = 0
04:34:54.503392 getdents(3, /* 0 entries */, 4096) = 0
04:34:54.503459 close(3)                = 0
04:34:54.503523 gettimeofday({1260966894, 503548}, NULL) = 0
04:34:54.503583 gettimeofday({1260966894, 503608}, NULL) = 0
04:34:54.503648 mmap(NULL, 20447232, PROT_READ|PROT_WRITE, 
MAP_PRIVATE|MAP_ANONYMOUS, -1, 0) = 0x2b3d5b87c000
04:35:06.795465 munmap(0x2b3d5b87c000, 20447232) = 0
04:35:06.844403 select(2, NULL, [1], [1], {60, 0}) = 1 (out [1], left {60, 0})
04:35:06.844557 write(1, "\200\f\0\7", 4) = 4
04:35:06.844633 select(2, NULL, [1], [1], {60, 0}) = 1 (out [1], left {60, 0})
04:35:06.844739 write(1, 
".jpg\3306\0\0002]\342I:7\21market_banner.jpg\323<\1\0000]\342I:$\0333057/IMG_3601new_browse.jpg\216\t\0\0\244\205\341I:5\vdisplay.jpg\274S\0\0\240\205\341I:5\vsegment.jpg\346\4\0\0\246\205\341I:7\nlector.jpg\315\5\0\0\245\205\341I:4\4.jpgy\304\22\0\235\205\341I:4\22_market_banner.jpg\r\252\0\0\236\205\341I:5\vgallery.jpgk\23\0\0\243\205\341I:5\tbadge.jpg\17\33\0\0\237\205\341I:5\vlibrary.jpg\'\n\0\0\241\205\341I:5\17shop_banner.jpg\331u\0\0\242\205\341I:$\0265867/banner_browse.jpg\225\f\0\0\17*\342I\2720\vsegment.jpg\330\5\0\0\2720\vdisplay.jpg\30!\0\0\2720\21market_banner.jpg}\\\0\0\2720\vgallery.jpg\357\26\0\0\2720\vlibrary.jpgC\5\0\0\272/\4.JPG\t/\0\0\272/\r_selector.jpg\2\7\0\0\2720\tbadge.jpg\233\36\0\0\2720\17shop_banner.jpg\300?\0\0:$\0262887/williamsburg3.JPG\327v\5\0.{\341I:6\f_display.jpg\323\267\0\0/{\341I\2727\tbadge.jpg\34.\0\0:7\17shop_banner.jpg\227\344\0\0000{\341I:8\negment.jpg&\6\0\0001{\341I:9\nlector.jpg|\7\0\0000{\341I\2727\nbrowse.jpg\330\f\0\0:7\vlibrary.jpg\267\16\0\0/{\341I:7\21market_banner.jpg\'\f\1\0.{\341I:7\vgallery.jpg\236\36\0\0000{\341I:$\0267919/Joy_2_segment.jpg-\7\0\0\312O\342I\272/\vdisplay.jpg\266\315\0\0\272/\20for_cropping.jpgg\227\0\0\272/\tbadge.jpg\2762\0\0\2720\trowse.jpg\27\20\0\0\272.\4.JPGZ\256\2\0\272.\20_shop_banner.jpgY\270\0\0\2720\velector.jpg\360\10\0\0\272/\vlibrary.jpg\234\22\0\0\272/\vgallery.jpg.#\0\0\272/\21market_banner.jpg\332\311\0\0:$\0320316/P1010406_selector.jpg\352\5\0\0\276B\341I:2\nbrowse.jpg\346\n\0\0\275B\341I:2\vdisplay.jpgY\275\0\0\273B\341I:2\vsegment.jpg\24\5\0\0\277B\341I:3\16hop_banner.jpg\3649\1\0\274B\341I:1\4.JPG\214$\20\0\271B\341I:1\f_library.jpg~\v\0\0\274B\341I:2\tbadge.jpg\r/\0\0\272B\341I\2722\21market_banner.jpg\310\261\1\0:2\vgallery.jpgb\35\0\0\275B\341I:$\0343257/S-54b_market_banner.jpg\245\17\0\0\233\223\341I:/\vlibrary.jpg6\v\0\0\234\223\341I\272/\vgallery.jpg\325\23\0\0\272/\17shop_banner.jpg\240\17\0\0\272/\vdisplay.jpg*[\0\0\272/\vsegment.jpg\310\4\0\0\272/\nbrowse.jpg\320\t\0\0:.\4.jpg\255T\1\0\233\223\341I:.\r_selector.jpg\33\6\0\0\234\223\341I\272/\tbadge.jpg\215\33\0\0:$\0309742/Jophiel_segment.jpg\364\5\0\0\0i\342I:1\vdisplay.jpg\23V\0\0\354h\342I:1\tbadge.jpg\321\"\0\0\377h\342I:1\17shop_banner.jpg\235\\\0\0\354h\342I:2\velector.jpgm\7\0\0\0i\342I:0\4.jpg\30a\2\0\353h"...,
 3200) = 3200
04:35:06.845208 select(1, [0], [], NULL, {60, 0}) = 0 (Timeout)
04:36:06.844621 select(1, [0], [], NULL, {60, 0}) = 0 (Timeout)
04:37:06.949968 select(1, [0], [], NULL, {60, 0}) = 0 (Timeout)
04:38:06.950773 select(1, [0], [], NULL, {60, 0}) = 0 (Timeout)
04:39:06.951859 select(1, [0], [], NULL, {60, 0}) = 0 (Timeout)
04:40:06.952896 select(1, [0], [], NULL, {60, 0}) = 0 (Timeout)
04:41:06.953942 select(1, [0], [], NULL, {60, 0}) = 0 (Timeout)
04:42:07.181208 select(1, [0], [], NULL, {60, 0}) = 0 (Timeout)
[snip]
10:53:10.572467 select(1, [0], [], NULL, {60, 0}) = 0 (Timeout)
10:54:11.041651 select(1, [0], [], NULL, {60, 0}) = 0 (Timeout)
10:55:11.282693 select(1, [0], [], NULL, {60, 0}) = 0 (Timeout)
10:56:11.523736 select(1, [0], [], NULL, {60, 0}) = 0 (Timeout)
10:57:11.524637 select(1, [0], [], NULL, {60, 0}) = 0 (Timeout)
10:58:11.525718 select(1, [0], [], NULL, {60, 0}
- -------------

lsof of same on CLIENT:

- -------------
ey02-s00416 ~ # lsof -p 21663
COMMAND   PID USER   FD   TYPE             DEVICE    SIZE      NODE NAME
rsync   21663 root  cwd    DIR                8,1     760         2 /
rsync   21663 root  rtd    DIR                8,1     760         2 /
rsync   21663 root  txt    REG                8,1  372408    519144 
/usr/bin/rsync
rsync   21663 root  mem    REG                8,1  114952     78335 
/lib64/ld-2.6.1.so
rsync   21663 root  mem    REG                8,1   31296     65085 
/lib64/libacl.so.1.1.0
rsync   21663 root  mem    REG                8,1   31424     35956 
/usr/lib64/libpopt.so.0.0.0
rsync   21663 root  mem    REG                8,1 1305744     78362 
/lib64/libc-2.6.1.so
rsync   21663 root  mem    REG                8,1   18696     99779 
/lib64/libattr.so.1.1.0
rsync   21663 root    0u  unix 0xffff8800b4546c40         175163806 socket
rsync   21663 root    1u  unix 0xffff8800b4546c40         175163806 socket
rsync   21663 root    2u  unix 0xffff880044935740         175163808 socket
- -------------

And, finally, tcpdump on the CLIENT:

- -------------
04:34:53.525749 IP (tos 0x8, ttl 56, id 19535, offset 0, flags [DF], proto TCP 
(6), length 52) 129.177.74.65.static.xc88.engineyard.com.42585 > 
ey02-s00416.ey02.engineyard.com.ssh: ., cksum 0xf81e (correct), 54385:54385(0) 
ack 115876872 win 501 <nop,nop,timestamp144817076 513742551>
04:34:53.525792 IP (tos 0x8, ttl 64, id 35494, offset 0, flags [DF], proto TCP 
(6), length 1284) ey02-s00416.ey02.engineyard.com.ssh > 
129.177.74.65.static.xc88.engineyard.com.42585: P 115878320:115879552(1232) ack 
54385 win 132 <nop,nop,timestamp 513742551 144817076>
04:34:53.526944 IP (tos 0x8, ttl 56, id 19536, offset 0, flags [DF], proto TCP 
(6), length 52) 129.177.74.65.static.xc88.engineyard.com.42585 > 
ey02-s00416.ey02.engineyard.com.ssh: ., cksum 0xeda6 (correct), 54385:54385(0) 
ack 115879552 win 501 <nop,nop,timestamp144817076 513742551>
04:34:54.074567 IP (tos 0x8, ttl 64, id 35495, offset 0, flags [DF], proto TCP 
(6), length 100) ey02-s00416.ey02.engineyard.com.ssh> 
129.177.74.65.static.xc88.engineyard.com.42585: P 115879552:115879600(48) ack 
54385 win 132 <nop,nop,timestamp 513742688 144817076>
04:34:54.075133 IP (tos 0x8, ttl 64, id 35496, offset 0, flags [DF], proto TCP 
(6), length 2948) ey02-s00416.ey02.engineyard.com.ssh > 
129.177.74.65.static.xc88.engineyard.com.42585: . 115879600:115882496(2896) ack 
54385 win 132 <nop,nop,timestamp 513742688 144817076>
04:34:54.076020 IP (tos 0x8, ttl 56, id 19537, offset 0, flags [DF], proto TCP 
(6), length 52) 129.177.74.65.static.xc88.engineyard.com.42585 > 
ey02-s00416.ey02.engineyard.com.ssh: ., cksum 0xe6bc (correct), 54385:54385(0) 
ack 115881048 win 501 <nop,nop,timestamp144817213 513742688>
04:34:54.076051 IP (tos 0x8, ttl 64, id 35498, offset 0, flags [DF], proto TCP 
(6), length 1284) ey02-s00416.ey02.engineyard.com.ssh > 
129.177.74.65.static.xc88.engineyard.com.42585: P 115882496:115883728(1232) ack 
54385 win 132 <nop,nop,timestamp 513742689 144817213>
04:34:54.077032 IP (tos 0x8, ttl 56, id 19538, offset 0, flags [DF], proto TCP 
(6), length 52) 129.177.74.65.static.xc88.engineyard.com.42585 > 
ey02-s00416.ey02.engineyard.com.ssh: ., cksum 0xdc44 (correct), 54385:54385(0) 
ack 115883728 win 501 <nop,nop,timestamp144817213 513742688>
04:35:06.844658 IP (tos 0x8, ttl 64, id 35499, offset 0, flags [DF], proto TCP 
(6), length 100) ey02-s00416.ey02.engineyard.com.ssh> 
129.177.74.65.static.xc88.engineyard.com.42585: P 115883728:115883776(48) ack 
54385 win 132 <nop,nop,timestamp 513745881 144817213>
04:35:06.845281 IP (tos 0x8, ttl 64, id 35500, offset 0, flags [DF], proto TCP 
(6), length 2948) ey02-s00416.ey02.engineyard.com.ssh > 
129.177.74.65.static.xc88.engineyard.com.42585: . 115883776:115886672(2896) ack 
54385 win 132 <nop,nop,timestamp 513745881 144817213>
04:35:06.846300 IP (tos 0x8, ttl 56, id 19539, offset 0, flags [DF], proto TCP 
(6), length 52) 129.177.74.65.static.xc88.engineyard.com.42585 > 
ey02-s00416.ey02.engineyard.com.ssh: ., cksum 0xbd7a (correct), 54385:54385(0) 
ack 115885224 win 501 <nop,nop,timestamp144820406 513745881>
04:35:06.846353 IP (tos 0x8, ttl 64, id 35502, offset 0, flags [DF], proto TCP 
(6), length 404) ey02-s00416.ey02.engineyard.com.ssh> 
129.177.74.65.static.xc88.engineyard.com.42585: P 115886672:115887024(352) ack 
54385 win 132 <nop,nop,timestamp 513745881 144820406>
04:35:06.847088 IP (tos 0x8, ttl 56, id 19540, offset 0, flags [DF], proto TCP 
(6), length 52) 129.177.74.65.static.xc88.engineyard.com.42585 > 
ey02-s00416.ey02.engineyard.com.ssh: ., cksum 0xb672 (correct), 54385:54385(0) 
ack 115887024 win 501 <nop,nop,timestamp144820406 513745881>
04:36:06.846945 IP (tos 0x8, ttl 56, id 19541, offset 0, flags [DF], proto TCP 
(6), length 116) 129.177.74.65.static.xc88.engineyard.com.42585 > 
ey02-s00416.ey02.engineyard.com.ssh: P 54385:54449(64) ack 115887024 win 501 
<nop,nop,timestamp 144835406 513745881>
04:36:06.846970 IP (tos 0x8, ttl 64, id 35503, offset 0, flags [DF], proto TCP 
(6), length 52) ey02-s00416.ey02.engineyard.com.ssh > 
129.177.74.65.static.xc88.engineyard.com.42585: ., cksum 0x4273 (correct), 
115887024:115887024(0) ack 54449 win 132 <nop,nop,timestamp 513760881 144835406>
04:36:06.847037 IP (tos 0x8, ttl 64, id 35504, offset 0, flags [DF], proto TCP 
(6), length 84) ey02-s00416.ey02.engineyard.com.ssh > 
129.177.74.65.static.xc88.engineyard.com.42585: P 115887024:115887056(32) ack 
54449 win 132 <nop,nop,timestamp 513760881 144835406>
04:36:06.886586 IP (tos 0x8, ttl 56, id 19542, offset 0, flags [DF], proto TCP 
(6), length 52) 129.177.74.65.static.xc88.engineyard.com.42585 > 
ey02-s00416.ey02.engineyard.com.ssh: ., cksum 0x40d8 (correct), 54449:54449(0) 
ack 115887056 win 501 <nop,nop,timestamp144835416 513760881>
04:37:06.847775 IP (tos 0x8, ttl 56, id 19543, offset 0, flags [DF], proto TCP 
(6), length 116) 129.177.74.65.static.xc88.engineyard.com.42585 > 
ey02-s00416.ey02.engineyard.com.ssh: P 54449:54513(64) ack 115887056 win 501 
<nop,nop,timestamp 144850406 513760881>
04:37:06.847878 IP (tos 0x8, ttl 64, id 35505, offset 0, flags [DF], proto TCP 
(6), length 84) ey02-s00416.ey02.engineyard.com.ssh > 
129.177.74.65.static.xc88.engineyard.com.42585: P 115887056:115887088(32) ack 
54513 win 132 <nop,nop,timestamp 513775881 144850406>
04:37:06.848515 IP (tos 0x8, ttl 56, id 19544, offset 0, flags [DF], proto TCP 
(6), length 52) 129.177.74.65.static.xc88.engineyard.com.42585 > 
ey02-s00416.ey02.engineyard.com.ssh: ., cksum 0xcb51 (correct), 54513:54513(0) 
ack 115887088 win 501 <nop,nop,timestamp144850406 513775881>
[snip]
10:53:07.575258 IP (tos 0x8, ttl 56, id 20295, offset 0, flags [DF], proto TCP 
(6), length 116) 129.177.74.65.static.xc88.engineyard.com.42585 > 
ey02-s00416.ey02.engineyard.com.ssh: P 78513:78577(64) ack 115899088 win 501 
<nop,nop,timestamp 150490516 519400964>
10:53:07.575383 IP (tos 0x8, ttl 64, id 35883, offset 0, flags [DF], proto TCP 
(6), length 84) ey02-s00416.ey02.engineyard.com.ssh > 
129.177.74.65.static.xc88.engineyard.com.42585: P 115899088:115899120(32) ack 
78577 win 132 <nop,nop,timestamp 519415964 150490516>
10:53:07.576056 IP (tos 0x8, ttl 56, id 20296, offset 0, flags [DF], proto TCP 
(6), length 52) 129.177.74.65.static.xc88.engineyard.com.42585 > 
ey02-s00416.ey02.engineyard.com.ssh: ., cksum 0x1e64 (correct), 78577:78577(0) 
ack 115899120 win 501 <nop,nop,timestamp 150490516 519415964>
10:54:07.576336 IP (tos 0x8, ttl 56, id 20297, offset 0, flags [DF], proto TCP 
(6), length 116) 129.177.74.65.static.xc88.engineyard.com.42585 > 
ey02-s00416.ey02.engineyard.com.ssh: P 78577:78641(64) ack 115899120 win 501 
<nop,nop,timestamp 150505516 519415964>
10:54:07.576514 IP (tos 0x8, ttl 64, id 35884, offset 0, flags [DF], proto TCP 
(6), length 84) ey02-s00416.ey02.engineyard.com.ssh > 
129.177.74.65.static.xc88.engineyard.com.42585: P 115899120:115899152(32) ack 
78641 win 132 <nop,nop,timestamp 519430964 150505516>
- -------------

Suggestions welcome, but it sure doesn't look like any kind of
networking problem to me.  The only place I can see where there
might have been some kind of communication breakdown is between
BackupPC_dump on the server and the ssh/rsync on the server; it
looks like BackupPC_dump is waiting for more data but rsync is done
sending it.

The last thing the client rsync sent was:

04:35:06.844739 write(1, 
".jpg\3306\0\0002]\342I:7\21market_banner.jpg\323<\1\0000[snip]"..., 3200) = 
3200

And the last thing BackupPC_dump saw before the select that never ends was:

04:35:06.849495 read(7, 
".jpg\3306\0\0002]\342I:7\21market_banner.jpg\323<\1\0000[snip]"..., 65536) = 
3200

So clearly their last communication was in accord.

It just looks like BackupPC_dump has no idea that rsync is done.  So
after a while, the NAT deletes the apparently "dead" connection,
except with keepalives that doesn't happen, so we just sit here
forever.

The other 7 BackupPC_dump sessions seem to be in the same state:
select stuck on FD 7.

I'm quite stumped.  Any ideas?

-Robin

-- 
They say:  "The first AIs will be built by the military as weapons."
And I'm  thinking:  "Does it even occur to you to try for something
other  than  the default  outcome?"  See http://shrunklink.com/cdiz
http://www.digitalkingdom.org/~rlpowell/ *** http://www.lojban.org/

------------------------------------------------------------------------------
This SF.Net email is sponsored by the Verizon Developer Community
Take advantage of Verizon's best-in-class app development support
A streamlined, 14 day to market process makes app distribution fast and easy
Join now and get one step closer to millions of Verizon customers
http://p.sf.net/sfu/verizon-dev2dev 
_______________________________________________
BackupPC-users mailing list
BackupPC-users AT lists.sourceforge DOT net
List:    https://lists.sourceforge.net/lists/listinfo/backuppc-users
Wiki:    http://backuppc.wiki.sourceforge.net
Project: http://backuppc.sourceforge.net/

<Prev in Thread] Current Thread [Next in Thread>