Hi all,
I'm using Debian/stable, amanda 2.5.1p1 (note 2.5.1, NOT 2.5.2).
For some reason amandad keeps dying on me. I can't find any reason in
any of my logs for this. Currently, I still have the following
processes running:
amanda2:/var/log/amanda/amandad# ps -ef |grep backup
backup 25763 9089 0 05:31 pts/1 00:00:00 /bin/sh /usr/sbin/amdump
offsite amanda2
backup 25772 25763 0 05:31 pts/1 00:00:00 /usr/lib/amanda/planner
offsite amanda2
backup 25773 25763 0 05:31 pts/1 00:00:00 /usr/lib/amanda/driver
offsite amanda2
backup 25774 25773 0 05:31 pts/1 00:00:00 taper offsite
backup 25775 25773 0 05:31 pts/1 00:00:00 dumper0 offsite
backup 25776 25773 0 05:31 pts/1 00:00:00 dumper1 offsite
backup 25777 25773 0 05:31 pts/1 00:00:00 dumper2 offsite
backup 25778 25773 0 05:31 pts/1 00:00:00 dumper3 offsite
backup 25779 25773 0 05:31 pts/1 00:00:00 dumper4 offsite
backup 25780 25773 0 05:31 pts/1 00:00:00 dumper5 offsite
backup 25781 25773 0 05:31 pts/1 00:00:00 dumper6 offsite
backup 25782 25773 0 05:31 pts/1 00:00:00 dumper7 offsite
backup 25783 25773 0 05:31 pts/1 00:00:00 dumper8 offsite
backup 25784 25773 0 05:31 pts/1 00:00:00 dumper9 offsite
backup 25785 25773 0 05:31 pts/1 00:00:00 dumper10 offsite
backup 25786 25773 0 05:31 pts/1 00:00:00 dumper11 offsite
backup 25787 25773 0 05:31 pts/1 00:00:00 dumper12 offsite
backup 25788 25773 0 05:31 pts/1 00:00:00 dumper13 offsite
backup 25789 25773 0 05:31 pts/1 00:00:00 dumper14 offsite
backup 25790 25773 0 05:31 pts/1 00:00:00 dumper15 offsite
backup 25791 25774 0 05:31 pts/1 00:00:00 taper offsite
backup 29382 29381 0 12:47 pts/3 00:00:00 -sh
The client in question is 'amanda2', which is NFS mounting several
file systems from an OnStor NFS server.
amstatus reports:
Using /var/log/amanda/offsite/amdump.1 from Wed Sep 5 05:31:03 EDT 2007
amanda2:/ 0 0g waiting to flush
amanda2:/ 0 0g estimate done
amanda2:/home 0 2g waiting to flush
amanda2:/home 0 2g estimate done
amanda2:/nfs 0 0g waiting to flush
amanda2:/nfs 0 0g estimate done
amanda2:/nfs/RT 1 39g waiting to flush
amanda2:/nfs/RT 0 582g estimate done
amanda2:/nfs/archive 0 11g waiting to flush
amanda2:/nfs/archive 0 11g estimate done
amanda2:/nfs/backups 0 0g waiting to flush
amanda2:/nfs/backups 0 0g estimate done
amanda2:/nfs/builds 1 5g waiting to flush
amanda2:/nfs/builds 0 117g estimate done
amanda2:/nfs/debian 0 22g waiting to flush
amanda2:/nfs/debian 0 22g estimate done
amanda2:/nfs/patent 0 1g waiting to flush
amanda2:/nfs/patent 0 1g estimate done
amanda2:/nfs/release 0 236g estimate done
amanda2:/nfs/software 0 24g estimate done
amanda2:/nfs/system 0 10g estimate done
amanda2:/nfs/user 0 0g estimate done
amanda2:/nfs/user/ad 0 74g partial estimate done
amanda2:/nfs/user/assar getting estimate
amanda2:/nfs/user/eh getting estimate
amanda2:/nfs/user/il getting estimate
amanda2:/nfs/user/mp getting estimate
amanda2:/nfs/user/qt getting estimate
amanda2:/nfs/user/uz getting estimate
amanda2:/usr 0 0g waiting to flush
amanda2:/usr 0 0g estimate done
amanda2:/var 0 1g waiting to flush
amanda2:/var 0 1g estimate done
SUMMARY part real estimated
size size
partition : 33
estimated : 16 1085g
flush : 11 85g
failed : 0 0g ( 0.00%)
wait for dumping: 0 0g ( 0.00%)
dumping to tape : 0 0g ( 0.00%)
dumping : 0 0g 0g ( 0.00%) ( 0.00%)
dumped : 0 0g 0g ( 0.00%) ( 0.00%)
wait for writing: 0 0g 0g ( 0.00%) ( 0.00%)
wait to flush : 11 85g 85g (100.00%) ( 0.00%)
writing to tape : 0 0g 0g ( 0.00%) ( 0.00%)
failed to tape : 0 0g 0g ( 0.00%) ( 0.00%)
taped : 0 0g 0g ( 0.00%) ( 0.00%)
16 dumpers idle : not-idle
taper idle
network free kps: 1048576
holding space : 1700g (100.00%)
0 dumpers busy : 0:00:00 ( 0.00%)
But there is no estimate being done. There is no tar process running,
amandad is not running. The last thing in the amandad log is:
amandad: time 21432.640: sending PREP pkt:
<<<<<
OPTIONS features=fffffeff9ffeffffff7f;
/ 0 SIZE 116610
/ 1 SIZE 680
/usr 0 SIZE 359660
/usr 1 SIZE 1630
/var 0 SIZE 1878760
/var 1 SIZE 12860
/home 0 SIZE 2500110
/home 1 SIZE 610
/nfs 0 SIZE 10
/nfs 1 SIZE 10
/nfs/RT 0 SIZE 611267600
/nfs/RT 1 SIZE 41678340
/nfs/RT 2 SIZE 108760
/nfs/archive 0 SIZE 12575130
/nfs/archive 1 SIZE 33840
/nfs/backups 0 SIZE 68940
/nfs/backups 1 SIZE 10
/nfs/builds 0 SIZE 123579430
/nfs/builds 1 SIZE 5636750
/nfs/debian 0 SIZE 23200280
/nfs/debian 1 SIZE 24390
/nfs/patent 0 SIZE 1458840
/nfs/patent 1 SIZE 42680
/nfs/release 0 SIZE 247633410
/nfs/release 1 SIZE 505700
/nfs/software 0 SIZE 25182370
/nfs/software 1 SIZE 5580
/nfs/system 0 SIZE 10536630
/nfs/system 1 SIZE 9300
/nfs/user 0 SIZE 10
/nfs/user 1 SIZE 10
/nfs/user/ad 0 SIZE 78236310
>>>>>
amandad: dgram_send_addr(addr=0xbf863d00, dgram=0xb7ec5084)
amandad: time 21432.654: (sockaddr_in *)0xbf863d00 = { 2, 854, 10.0.0.4 }
amandad: dgram_send_addr: 0xb7ec5084->socket = 0
amandad: time 21600.669: /usr/lib/amanda/sendsize timed out waiting for REP
data
amandad: time 21600.669: sending NAK pkt:
<<<<<
ERROR timeout on reply pipe
>>>>>
amandad: dgram_send_addr(addr=0xbf863d00, dgram=0xb7ec5084)
amandad: time 21600.669: (sockaddr_in *)0xbf863d00 = { 2, 854, 10.0.0.4 }
amandad: dgram_send_addr: 0xb7ec5084->socket = 0
security_close(handle=0x804edf8, driver=0xb7ec40e0 (BSD))
amandad: time 21604.668: pid 26163 finish time Wed Sep 5 11:31:11 2007
Do these last 2 sections mean that amandad has gracefully exited
because etimeout has been reached? And if so, why isn't amdump, the
planner, or the driver being notified of this so they can start
dumping. I'm assuming they're not being notified by amandad, since
they're still running, but not actively doing anything.
I've connected to all the processes listed above via strace, and
they're all still 'wait'ing for something, I assume amandad to report
back the estimates, which it can't do, since it's no longer running!
--
Thanks,
Paul
|