amandad keeps dying on me...

Hi all,

I'm using Debian/stable, amanda 2.5.1p1 (note 2.5.1, NOT 2.5.2).

For some reason amandad keeps dying on me, and I can't find any
explanation for it in any of my logs.  Currently, I still have the
following processes running:

  amanda2:/var/log/amanda/amandad# ps -ef |grep backup
  backup   25763  9089  0 05:31 pts/1    00:00:00 /bin/sh /usr/sbin/amdump offsite amanda2
  backup   25772 25763  0 05:31 pts/1    00:00:00 /usr/lib/amanda/planner offsite amanda2
  backup   25773 25763  0 05:31 pts/1    00:00:00 /usr/lib/amanda/driver offsite amanda2
  backup   25774 25773  0 05:31 pts/1    00:00:00 taper offsite
  backup   25775 25773  0 05:31 pts/1    00:00:00 dumper0 offsite
  backup   25776 25773  0 05:31 pts/1    00:00:00 dumper1 offsite
  backup   25777 25773  0 05:31 pts/1    00:00:00 dumper2 offsite
  backup   25778 25773  0 05:31 pts/1    00:00:00 dumper3 offsite
  backup   25779 25773  0 05:31 pts/1    00:00:00 dumper4 offsite
  backup   25780 25773  0 05:31 pts/1    00:00:00 dumper5 offsite
  backup   25781 25773  0 05:31 pts/1    00:00:00 dumper6 offsite
  backup   25782 25773  0 05:31 pts/1    00:00:00 dumper7 offsite
  backup   25783 25773  0 05:31 pts/1    00:00:00 dumper8 offsite
  backup   25784 25773  0 05:31 pts/1    00:00:00 dumper9 offsite
  backup   25785 25773  0 05:31 pts/1    00:00:00 dumper10 offsite
  backup   25786 25773  0 05:31 pts/1    00:00:00 dumper11 offsite
  backup   25787 25773  0 05:31 pts/1    00:00:00 dumper12 offsite
  backup   25788 25773  0 05:31 pts/1    00:00:00 dumper13 offsite
  backup   25789 25773  0 05:31 pts/1    00:00:00 dumper14 offsite
  backup   25790 25773  0 05:31 pts/1    00:00:00 dumper15 offsite
  backup   25791 25774  0 05:31 pts/1    00:00:00 taper offsite
  backup   29382 29381  0 12:47 pts/3    00:00:00 -sh

The client in question is 'amanda2', which is NFS mounting several
file systems from an OnStor NFS server.

amstatus reports:

   Using /var/log/amanda/offsite/amdump.1 from Wed Sep  5 05:31:03 EDT 2007

   amanda2:/                    0        0g waiting to flush
   amanda2:/                    0        0g estimate done
   amanda2:/home                0        2g waiting to flush
   amanda2:/home                0        2g estimate done
   amanda2:/nfs                 0        0g waiting to flush
   amanda2:/nfs                 0        0g estimate done
   amanda2:/nfs/RT              1       39g waiting to flush
   amanda2:/nfs/RT              0      582g estimate done
   amanda2:/nfs/archive         0       11g waiting to flush
   amanda2:/nfs/archive         0       11g estimate done
   amanda2:/nfs/backups         0        0g waiting to flush
   amanda2:/nfs/backups         0        0g estimate done
   amanda2:/nfs/builds          1        5g waiting to flush
   amanda2:/nfs/builds          0      117g estimate done
   amanda2:/nfs/debian          0       22g waiting to flush
   amanda2:/nfs/debian          0       22g estimate done
   amanda2:/nfs/patent          0        1g waiting to flush
   amanda2:/nfs/patent          0        1g estimate done
   amanda2:/nfs/release         0      236g estimate done
   amanda2:/nfs/software        0       24g estimate done
   amanda2:/nfs/system          0       10g estimate done
   amanda2:/nfs/user            0        0g estimate done
   amanda2:/nfs/user/ad         0       74g partial estimate done
   amanda2:/nfs/user/assar                  getting estimate
   amanda2:/nfs/user/eh                     getting estimate
   amanda2:/nfs/user/il                     getting estimate
   amanda2:/nfs/user/mp                     getting estimate
   amanda2:/nfs/user/qt                     getting estimate
   amanda2:/nfs/user/uz                     getting estimate
   amanda2:/usr                 0        0g waiting to flush
   amanda2:/usr                 0        0g estimate done
   amanda2:/var                 0        1g waiting to flush
   amanda2:/var                 0        1g estimate done

   SUMMARY          part      real  estimated
                              size       size
   partition       :  33
   estimated       :  16                 1085g
   flush           :  11        85g
   failed          :   0                    0g           (  0.00%)
   wait for dumping:   0                    0g           (  0.00%)
   dumping to tape :   0                    0g           (  0.00%)
   dumping         :   0         0g         0g (  0.00%) (  0.00%)
   dumped          :   0         0g         0g (  0.00%) (  0.00%)
   wait for writing:   0         0g         0g (  0.00%) (  0.00%)
   wait to flush   :  11        85g        85g (100.00%) (  0.00%)
   writing to tape :   0         0g         0g (  0.00%) (  0.00%)
   failed to tape  :   0         0g         0g (  0.00%) (  0.00%)
   taped           :   0         0g         0g (  0.00%) (  0.00%)
   16 dumpers idle : not-idle
   taper idle
   network free kps:   1048576
   holding space   :      1700g (100.00%)
    0 dumpers busy :  0:00:00  (  0.00%)

But no estimate is actually being done: there is no tar process
running, and amandad is not running.  The last thing in the amandad
log is:

  amandad: time 21432.640: sending PREP pkt:
  <<<<<
  OPTIONS features=fffffeff9ffeffffff7f;
  / 0 SIZE 116610
  / 1 SIZE 680
  /usr 0 SIZE 359660
  /usr 1 SIZE 1630
  /var 0 SIZE 1878760
  /var 1 SIZE 12860
  /home 0 SIZE 2500110
  /home 1 SIZE 610
  /nfs 0 SIZE 10
  /nfs 1 SIZE 10
  /nfs/RT 0 SIZE 611267600
  /nfs/RT 1 SIZE 41678340
  /nfs/RT 2 SIZE 108760
  /nfs/archive 0 SIZE 12575130
  /nfs/archive 1 SIZE 33840
  /nfs/backups 0 SIZE 68940
  /nfs/backups 1 SIZE 10
  /nfs/builds 0 SIZE 123579430
  /nfs/builds 1 SIZE 5636750
  /nfs/debian 0 SIZE 23200280
  /nfs/debian 1 SIZE 24390
  /nfs/patent 0 SIZE 1458840
  /nfs/patent 1 SIZE 42680
  /nfs/release 0 SIZE 247633410
  /nfs/release 1 SIZE 505700
  /nfs/software 0 SIZE 25182370
  /nfs/software 1 SIZE 5580
  /nfs/system 0 SIZE 10536630
  /nfs/system 1 SIZE 9300
  /nfs/user 0 SIZE 10
  /nfs/user 1 SIZE 10
  /nfs/user/ad 0 SIZE 78236310
  >>>>>
  amandad: dgram_send_addr(addr=0xbf863d00, dgram=0xb7ec5084)
  amandad: time 21432.654: (sockaddr_in *)0xbf863d00 = { 2, 854, 10.0.0.4 }
  amandad: dgram_send_addr: 0xb7ec5084->socket = 0
  amandad: time 21600.669: /usr/lib/amanda/sendsize timed out waiting for REP data
  amandad: time 21600.669: sending NAK pkt:
  <<<<<
  ERROR timeout on reply pipe
  >>>>>
  amandad: dgram_send_addr(addr=0xbf863d00, dgram=0xb7ec5084)
  amandad: time 21600.669: (sockaddr_in *)0xbf863d00 = { 2, 854, 10.0.0.4 }
  amandad: dgram_send_addr: 0xb7ec5084->socket = 0
  security_close(handle=0x804edf8, driver=0xb7ec40e0 (BSD))
  amandad: time 21604.668: pid 26163 finish time Wed Sep  5 11:31:11 2007
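
To make the timing in that log easier to read, here is a minimal Python
sketch that pulls the "time" stamps out of the lines quoted above and
reports how long amandad sat between the last PREP packet and the
timeout.  The helper names (parse_events, first_time) are my own, and
I'm assuming the "time" field counts seconds since amandad started:

```python
import re

# Excerpt of the amandad debug log quoted above (stamped lines only).
LOG = """\
amandad: time 21432.640: sending PREP pkt:
amandad: time 21432.654: (sockaddr_in *)0xbf863d00 = { 2, 854, 10.0.0.4 }
amandad: time 21600.669: /usr/lib/amanda/sendsize timed out waiting for REP data
amandad: time 21600.669: sending NAK pkt:
amandad: time 21604.668: pid 26163 finish time Wed Sep  5 11:31:11 2007
"""

# Matches "amandad: time <seconds>: <message>".
STAMP = re.compile(r"^amandad: time (\d+\.\d+): (.*)$")


def parse_events(text):
    """Return (seconds, message) for every time-stamped log line."""
    events = []
    for line in text.splitlines():
        m = STAMP.match(line.strip())
        if m:
            events.append((float(m.group(1)), m.group(2)))
    return events


def first_time(events, needle):
    """Time stamp of the first event whose message contains `needle`."""
    return next(t for t, msg in events if needle in msg)


events = parse_events(LOG)
prep = first_time(events, "sending PREP")
timeout = first_time(events, "timed out")
print(f"last PREP sent at {prep:.3f}s, timeout at {timeout:.3f}s")
print(f"gap between them: {timeout - prep:.3f}s")
```

So sendsize was still handing back partial results (the PREP packet)
less than three minutes before amandad gave up on it.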

Do these last two sections mean that amandad exited gracefully
because etimeout was reached?  And if so, why aren't amdump, the
planner, and the driver being notified of this so they can start
dumping?  I'm assuming amandad is not notifying them, since they're
still running but not actively doing anything.
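
As a sanity check on the timeout theory, here is a rough Python sketch
that works backwards from the "finish time" line to when amandad must
have started, and compares that with the amdump start time from the
amstatus header.  It again assumes the "time" field is seconds since
amandad started; the variable names are just for illustration:

```python
from datetime import datetime, timedelta

# Values copied from the amandad log and the amstatus header above.
finish = datetime(2007, 9, 5, 11, 31, 11)      # "finish time Wed Sep  5 11:31:11 2007"
amdump_start = datetime(2007, 9, 5, 5, 31, 3)  # "amdump.1 from Wed Sep  5 05:31:03 EDT 2007"
finish_offset = 21604.668                      # "time 21604.668: pid 26163 finish ..."
timeout_offset = 21600.669                     # "time 21600.669: ... timed out ..."

# Assumption: "time" is seconds elapsed since amandad started.
amandad_start = finish - timedelta(seconds=finish_offset)
timeout_at = amandad_start + timedelta(seconds=timeout_offset)

print("amandad started about:", amandad_start.time())
print("seconds after amdump start:",
      (amandad_start - amdump_start).total_seconds())
print("timeout fired about:", timeout_at.time())
print("hours amandad waited:", round(timeout_offset / 3600, 3))
```

The numbers line up: amandad started a few seconds after amdump and
gave up almost exactly 6 hours (21600 s) later.  That is a suspiciously
round figure, which makes me think a fixed limit was hit rather than
anything crashing, but whether that limit is etimeout or something
internal to amandad I can't tell.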

I've attached strace to all the processes listed above, and they're
all still 'wait'ing for something.  I assume they're waiting for
amandad to report back the estimates, which it can't do, since it's
no longer running!

-- 
Thanks,
Paul