Hi all:
Having a problem with backuppc running over ssh with rsync.
I am backing up a 38GB file daily (database dump). There were some
changes on the database server, so I started a new full dump. After
two days (40+ hours) it had still not completed. This is over a GB
network link. I restarted the full dump thinking it had gotten hung.
I tried to find the partly written file on the backup server, and
found some very odd things.
The BackupPC_dump perl process (pid 16784) is showing this sequence in
strace:
alarm(172800) = 172800
select(16, [11], NULL, [11], NULL) = 1 (in [11])
read(11, "BjGq8Ax1SfiCcrVqaRuPzV5js\n /c3By"..., 65536) = 8188
alarm(172800) = 172800
alarm(172800) = 172800
alarm(172800) = 172800
select(16, [11], NULL, [11], NULL) = 1 (in [11])
read(11, "\374\17\0\7tQsU3/6Vsd/9z\n 5/k/8AH6xrT/X"..., 65536) = 12288
alarm(172800) = 172800
alarm(172800) = 172800
alarm(172800) = 172800
alarm(172800) = 172800
alarm(172800) = 172800
alarm(172800) = 172800
select(16, [11], NULL, [11], NULL) = 1 (in [11])
read(11, "\374\17\0\7HvnkYehOv7k\n Dz+T/vuo5EevevG"..., 65536) = 12288
alarm(172800) = 172800
alarm(172800) = 172800
alarm(172800) = 172800
lsof on that process shows the following file descriptors:
BackupPC_ 16784 backup 0w CHR 1,3 1937 /dev/null
BackupPC_ 16784 backup 1w FIFO 0,6 846787883 pipe
BackupPC_ 16784 backup 2w FIFO 0,6 846787883 pipe
BackupPC_ 16784 backup 3u IPv4 846787907 TCP
backup:59406->ldap:ldaps(ESTABLISHED)
BackupPC_ 16784 backup 4w REG 9,2 2171 47693861
/backup-pc/pc/ldap01.bos1.renesys.com/LOG.022009
BackupPC_ 16784 backup 5w REG 9,2 0 47695023
/backup-pc/pc/ldap01.bos1.renesys.com/XferLOG.z
BackupPC_ 16784 backup 6w REG 9,2 2228 47695267
/backup-pc/pc/ldap01.bos1.renesys.com/NewFileList
BackupPC_ 16784 backup 7w REG 9,2 8634251394 36864007
/backup-pc/pc/ldap01.bos1.renesys.com/new/f%2fdata%2fbak/ffedora-ds/fcurrent/fuserRoot/fid2entry.db4
BackupPC_ 16784 backup 8u REG 9,2 37303115776 34717713
/backup-pc/pc/ldap01.bos1.renesys.com/new/f%2fdata%2fbak/RStmp
BackupPC_ 16784 backup 9u unix 0xf6f12c80 846791450 socket
BackupPC_ 16784 backup 11u unix 0xf6835c80 846791422 socket
ls -l of the file on file descriptor 7:
-rw-r----- 1 backup dev 8634251394 Feb 6 15:10
/backup-pc/pc/ldap01.bos1.renesys.com/new/f%2fdata%2fbak/ffedora-ds/fcurrent/fuserRoot/fid2entry.db4
while the one on fd 8 shows:
-rw-r----- 1 backup dev 37303115776 Feb 5 18:34
/backup-pc/pc/ldap01.bos1.renesys.com/new/f%2fdata%2fbak/RStmp
What is the RStmp file? That one grew pretty quickly to it's current
size given the start time of the backup (18:08 on feb 5). If I run
file (1) against that file it identifies it as
/backup-pc/pc/ldap01.bos1.renesys.com/new/f%2fdata%2fbak/RStmp:
Berkeley DB (Btree, version 9, native byte-order)
While file on the fid2entry.db4 file shows:
/backup-pc/pc/ldap01.bos1.renesys.com/new/f%2fdata%2fbak/ffedora-ds/fcurrent/fuserRoot/fid2entry.db4:
data
I could almost believ that I looking at the actual rsynced data file
(in RStmp) which is a little shorter than the source file possibly due
to sparse file compression or something that makes it a bit shorter by
suppressing blocks of nulls? But running a cmp -l between the source
file and the RStmp file shows differences before the end of the RStmp
file.
The backup server has 8GB of memory available, is almost 100% idle and
I am seeing no I/O wait in top. I don't see the perl/BackupPC_dump
process showing up in the top output either. The server is running
centos 5.2 on an ext3 filesystem.
The rsync on the host being backed up looks like:
write(1, "\374\17\0\7", 4) = 4
select(2, NULL, [1], [1], {60, 0}) = 1 (out [1], left {60, 0})
write(1, " fLv/BPD4v6neeJr74fazcy31hf2r3em"..., 4092) = 4092
select(2, NULL, [1], [1], {60, 0}) = 1 (out [1], left {60, 0})
write(1, "\374\17\0\7", 4) = 4
select(2, NULL, [1], [1], {60, 0}) = 1 (out [1], left {60, 0})
write(1, "\266.\27\0\1\0\346\37\0\7dn: nsuniqueid=4ed5424"..., 4092) =
4092
select(2, NULL, [1], [1], {60, 0}) = 1 (out [1], left {60, 0})
write(1, "\374\17\0\7", 4) = 4
select(2, NULL, [1], [1], {60, 0}) = 1 (out [1], left {60, 0})
write(1, "DkmvilSqKpyWd+x9aqkHHmjJHg3x28V/"..., 4092) = 4092
select(2, NULL, [1], [1], {60, 0}) = 0 (Timeout)
select(2, NULL, [1], [1], {60, 0}) = 0 (Timeout)
select(2, NULL, [1], [1], {60, 0}) = 0 (Timeout)
with a bunch of timeout's but it does go back into cycles of
select/writes.
lsof shows:
COMMAND PID USER FD TYPE DEVICE SIZE NODE NAME
rsync 8865 root 0u unix 0xffff81005cc16380 625717825 socket
rsync 8865 root 1u unix 0xffff81005cc16380 625717825 socket
rsync 8865 root 2u unix 0xffff8101871c4ec0 625717827 socket
rsync 8865 root 3r REG 253,6 38268567552 13828180
/data/bak/fedora-ds/old/userRoot/id2entry.db4
I am not seeing any reads from fd3, but it's possible it has a bufer
full of data and is just writing thet buffer.
ls -l of fd 3's file:
-rw------- 1 fedorads fedorads 38268567552 Feb 4 22:42
/data/bak/fedora-ds/old/userRoot/id2entry.db4
The status display in the web interface reports:
Host Type User Start Time
Command PID Xfer PID
ldap01.bos1.renesys.com full sysops 2009-02-05 18:08
BackupPC_dump -f ldap01.bos1.renesys.com 16567 16783, 16784
So anybody got a clue as to what is happening here, or what I should
be looking for to try to find out why this is hung, or what is really
happening under the covers?
This is pretty much a show stopper for us.
Thanks for any help.
--
-- rouilj
John Rouillard
System Administrator
Renesys Corporation
603-244-9084 (cell)
603-643-9300 x 111
------------------------------------------------------------------------------
Create and Deploy Rich Internet Apps outside the browser with Adobe(R)AIR(TM)
software. With Adobe AIR, Ajax developers can use existing skills and code to
build responsive, highly engaging applications that combine the power of local
resources and data with the reach of the web. Download the Adobe AIR SDK and
Ajax docs to start building applications today-http://p.sf.net/sfu/adobe-com
_______________________________________________
BackupPC-users mailing list
BackupPC-users AT lists.sourceforge DOT net
List: https://lists.sourceforge.net/lists/listinfo/backuppc-users
Wiki: http://backuppc.wiki.sourceforge.net
Project: http://backuppc.sourceforge.net/
|