BackupPC-users

[BackupPC-users] 38GB file backup is hanging backuppc.

2009-02-06 10:43:33
Subject: [BackupPC-users] 38GB file backup is hanging backuppc.
From: John Rouillard <rouilj-backuppc AT renesys DOT com>
To: "General list for user discussion, questions and support" <backuppc-users AT lists.sourceforge DOT net>
Date: Fri, 6 Feb 2009 15:41:40 +0000
Hi all:

Having a problem with backuppc running over ssh with rsync.

I am backing up a 38GB file daily (database dump). There were some
changes on the database server, so I started a new full dump. After
two days (40+ hours) it had still not completed. This is over a GB
network link. I restarted the full dump thinking it had gotten hung.

I tried to find the partly written file on the backup server, and
found some very odd things.

The BackupPC_dump perl process (pid 16784) is showing this sequence in
strace:

  alarm(172800)                           = 172800
  select(16, [11], NULL, [11], NULL)      = 1 (in [11])
  read(11, "BjGq8Ax1SfiCcrVqaRuPzV5js\n /c3By"..., 65536) = 8188
  alarm(172800)                           = 172800
  alarm(172800)                           = 172800
  alarm(172800)                           = 172800
  select(16, [11], NULL, [11], NULL)      = 1 (in [11])
  read(11, "\374\17\0\7tQsU3/6Vsd/9z\n 5/k/8AH6xrT/X"..., 65536) = 12288
  alarm(172800)                           = 172800
  alarm(172800)                           = 172800
  alarm(172800)                           = 172800
  alarm(172800)                           = 172800
  alarm(172800)                           = 172800
  alarm(172800)                           = 172800
  select(16, [11], NULL, [11], NULL)      = 1 (in [11])
  read(11, "\374\17\0\7HvnkYehOv7k\n Dz+T/vuo5EevevG"..., 65536) = 12288
  alarm(172800)                           = 172800
  alarm(172800)                           = 172800
  alarm(172800)                           = 172800

lsof on that process shows the following file descriptors:

  BackupPC_ 16784 backup    0w   CHR        1,3                  1937 /dev/null
  BackupPC_ 16784 backup    1w  FIFO        0,6             846787883 pipe
  BackupPC_ 16784 backup    2w  FIFO        0,6             846787883 pipe
  BackupPC_ 16784 backup    3u  IPv4  846787907                   TCP 
backup:59406->ldap:ldaps(ESTABLISHED)
  BackupPC_ 16784 backup    4w   REG        9,2        2171  47693861
     /backup-pc/pc/ldap01.bos1.renesys.com/LOG.022009
  BackupPC_ 16784 backup    5w   REG        9,2           0  47695023
     /backup-pc/pc/ldap01.bos1.renesys.com/XferLOG.z
  BackupPC_ 16784 backup    6w   REG        9,2        2228  47695267
     /backup-pc/pc/ldap01.bos1.renesys.com/NewFileList
  BackupPC_ 16784 backup    7w   REG        9,2  8634251394  36864007
     
/backup-pc/pc/ldap01.bos1.renesys.com/new/f%2fdata%2fbak/ffedora-ds/fcurrent/fuserRoot/fid2entry.db4
  BackupPC_ 16784 backup    8u   REG        9,2 37303115776  34717713
     /backup-pc/pc/ldap01.bos1.renesys.com/new/f%2fdata%2fbak/RStmp
  BackupPC_ 16784 backup    9u  unix 0xf6f12c80             846791450 socket
  BackupPC_ 16784 backup   11u  unix 0xf6835c80             846791422 socket

ls -l of the file on file descriptor 7: 

   -rw-r----- 1 backup dev 8634251394 Feb  6 15:10
     
/backup-pc/pc/ldap01.bos1.renesys.com/new/f%2fdata%2fbak/ffedora-ds/fcurrent/fuserRoot/fid2entry.db4

while the one on fd 8 shows:

   -rw-r----- 1 backup dev 37303115776 Feb  5 18:34
     /backup-pc/pc/ldap01.bos1.renesys.com/new/f%2fdata%2fbak/RStmp

What is the RStmp file? That one grew pretty quickly to it's current
size given the start time of the backup (18:08 on feb 5). If I run
file (1) against that file it identifies it as

  /backup-pc/pc/ldap01.bos1.renesys.com/new/f%2fdata%2fbak/RStmp:
  Berkeley DB (Btree, version 9, native byte-order)

While file on the fid2entry.db4 file shows:

  
/backup-pc/pc/ldap01.bos1.renesys.com/new/f%2fdata%2fbak/ffedora-ds/fcurrent/fuserRoot/fid2entry.db4:
 data

I could almost believ that I looking at the actual rsynced data file
(in RStmp) which is a little shorter than the source file possibly due
to sparse file compression or something that makes it a bit shorter by
suppressing blocks of nulls? But running a cmp -l between the source
file and the RStmp file shows differences before the end of the RStmp
file.

The backup server has 8GB of memory available, is almost 100% idle and
I am seeing no I/O wait in top. I don't see the perl/BackupPC_dump
process showing up in the top output either.  The server is running
centos 5.2 on an ext3 filesystem.

The rsync on the host being backed up looks like:

  write(1, "\374\17\0\7", 4)              = 4
  select(2, NULL, [1], [1], {60, 0})      = 1 (out [1], left {60, 0})
  write(1, " fLv/BPD4v6neeJr74fazcy31hf2r3em"..., 4092) = 4092
  select(2, NULL, [1], [1], {60, 0})      = 1 (out [1], left {60, 0})
  write(1, "\374\17\0\7", 4)              = 4
  select(2, NULL, [1], [1], {60, 0})      = 1 (out [1], left {60, 0})
  write(1, "\266.\27\0\1\0\346\37\0\7dn: nsuniqueid=4ed5424"..., 4092) =
  4092
  select(2, NULL, [1], [1], {60, 0})      = 1 (out [1], left {60, 0})
  write(1, "\374\17\0\7", 4)              = 4
  select(2, NULL, [1], [1], {60, 0})      = 1 (out [1], left {60, 0})
  write(1, "DkmvilSqKpyWd+x9aqkHHmjJHg3x28V/"..., 4092) = 4092
  select(2, NULL, [1], [1], {60, 0})      = 0 (Timeout)
  select(2, NULL, [1], [1], {60, 0})      = 0 (Timeout)
  select(2, NULL, [1], [1], {60, 0})      = 0 (Timeout)

with a bunch of timeout's but it does go back into cycles of
select/writes.

lsof shows:

COMMAND  PID USER   FD   TYPE             DEVICE        SIZE      NODE NAME
rsync   8865 root    0u  unix 0xffff81005cc16380             625717825 socket
rsync   8865 root    1u  unix 0xffff81005cc16380             625717825 socket
rsync   8865 root    2u  unix 0xffff8101871c4ec0             625717827 socket
rsync   8865 root    3r   REG              253,6 38268567552  13828180
    /data/bak/fedora-ds/old/userRoot/id2entry.db4

I am not seeing any reads from fd3, but it's possible it has a bufer
full of data and is just writing thet buffer.

ls -l of fd 3's file:

   -rw------- 1 fedorads fedorads 38268567552 Feb  4 22:42 
/data/bak/fedora-ds/old/userRoot/id2entry.db4

The status display in the web interface reports:

Host     Type    User    Start Time  
         Command         PID     Xfer PID
ldap01.bos1.renesys.com         full    sysops  2009-02-05 18:08
         BackupPC_dump -f ldap01.bos1.renesys.com       16567   16783, 16784

So anybody got a clue as to what is happening here, or what I should
be looking for to try to find out why this is hung, or what is really
happening under the covers?

This is pretty much a show stopper for us.

Thanks for any help.

--
                                -- rouilj

John Rouillard
System Administrator
Renesys Corporation
603-244-9084 (cell)
603-643-9300 x 111

------------------------------------------------------------------------------
Create and Deploy Rich Internet Apps outside the browser with Adobe(R)AIR(TM)
software. With Adobe AIR, Ajax developers can use existing skills and code to
build responsive, highly engaging applications that combine the power of local
resources and data with the reach of the web. Download the Adobe AIR SDK and
Ajax docs to start building applications today-http://p.sf.net/sfu/adobe-com
_______________________________________________
BackupPC-users mailing list
BackupPC-users AT lists.sourceforge DOT net
List:    https://lists.sourceforge.net/lists/listinfo/backuppc-users
Wiki:    http://backuppc.wiki.sourceforge.net
Project: http://backuppc.sourceforge.net/