I use Amanda 2.4.5. The server is a CentOS 4.2 box. The clients are
Linux (Red Hat EL 3, CentOS 3.3 and CentOS 4.x) and Solaris 9 boxes.
This morning, it looks like Amanda had trouble doing the backup
during the night. Normally, the backup takes a couple of hours or so.
When I arrived at the office this morning, the usual tape contents
list (DLT.ps) had not been printed and the Amanda report was not in my
e-mail.
An amstatus DailySet1 shows that amanda is still running:
-----------------------------------------------------------
Using /usr/local/var/amanda/log/DailySet1/amdump from Tue Mar 28 01:30:03 EST 2006
cahors:/ 0 planner: [disk /, all estimate timed out]
cahors:/disk2 0 planner: [disk /disk2, all estimate timed out]
cahors:/disk3 0 planner: [disk /disk3, all estimate timed out]
cahors:/disk4 0 planner: [disk /disk4, all estimate timed out]
cahors:/disk5 0 planner: [disk /disk5, all estimate timed out]
cahors:/disk6 0 planner: [disk /disk6, all estimate timed out]
cahors:/disk7 0 planner: [disk /disk7, all estimate timed out]
cahors:/disk8 0 planner: [disk /disk8, all estimate timed out]
cahors:/disk9 0 planner: [disk /disk9, all estimate timed out]
chablis:/ 1 84m finished (7:34:33)
gobelet:/ 0 772m dump done (8:14:45), wait for writing to tape
gobelet:/disk1 1 600m finished (7:45:30)
lnx-que-amanda:/ 0 1320m finished (8:05:56)
lnx-que-webpublic:/criqdata/web/copie_securite 1 1m finished (8:07:21)
lnx-que-webpublic:/criqdata/web/securite 0 0m finished (7:31:14)
lnx-que-webpublic:/criqdata/web/webdav 0 0m finished (7:30:58)
lnx-que-webpublic:/criqutil 0 0m finished (7:31:28)
lnx-que-webpublic:/etc 1 1m finished (7:30:10)
lnx-que-webpublic:/home 1 0m finished (7:30:48)
lnx-que-wforms1:/WEB_DEP_6i 0 27m finished (7:35:27)
lnx-que-wforms1-dev:/WEB_DEP_6i 0 27m finished (7:35:03)
lnx-que-wforms1-dev:/disk1/criqdata/web/sites_web 0 0m finished (7:30:12)
madiran:/ 1 2m dump done (9:20:31), wait for writing to tape
madiran:/data1 1 0m dump done (9:22:16), wait for writing to tape
madiran:/data2 1 0m dump done (9:22:02), wait for writing to tape
madiran:/data3 0 0m dump done (9:20:47), wait for writing to tape
madiran:/data4 0 0m dump done (9:21:47), wait for writing to tape
madiran:/data5 0 0m dump done (9:21:32), wait for writing to tape
madiran:/data6 0 0m dump done (9:20:32), wait for writing to tape
madiran:/data7 1 0m dump done (9:21:17), wait for writing to tape
madiran:/data8 1 0m dump done (9:21:02), wait for writing to tape
madiran:/disk1 1 1796m writing to tape (9:20:03)
produc-new:/ 1 0m finished (7:35:30)
produc-new:/disk1 1 3m finished (7:30:43)
produc-new:/disk10 1 0m finished (7:30:56)
produc-new:/disk11 0 1443m finished (9:20:01)
produc-new:/disk2 1 0m finished (8:07:23)
produc-new:/disk3 1 0m finished (7:35:38)
produc-new:/disk4 1 0m finished (7:31:26)
produc-new:/disk5 0 0m finished (7:31:12)
produc-new:/disk6 1 0m finished (7:35:36)
produc-new:/disk7 0 0m finished (7:35:34)
produc-new:/disk8 0 0m finished (7:35:33)
produc-new:/disk9 1 1m finished (7:30:07)
produc-new:/export/home 1 0m finished (7:30:46)
riesling:/ 1 97m finished (8:07:18)
sol:/ 0 763m dump done (8:25:25), wait for writing to tape
sol:/data1 0 0m dump done (8:25:50), wait for writing to tape
sol:/data2 0 0m dump done (8:26:05), wait for writing to tape
sol:/disk1 1 5m dump done (8:25:50), wait for writing to tape
sol:/disk1/RDBMS_BACKUP/PWEB/arch_save 1 3754m finished (9:01:50)
sol:/disk1/RDBMS_BACKUP/PWEB/data_save 0 718m dump done (8:18:30), wait for writing to tape
SUMMARY part real estimated
size size
partition : 52
estimated : 43 10923m
flush : 0 0m
failed : 9 0m ( 0.00%)
wait for dumping: 0 0m ( 0.00%)
dumping to tape : 0 0m ( 0.00%)
dumping : 0 0m 0m ( 0.00%) ( 0.00%)
dumped : 43 11422m 10923m (104.57%) (104.57%)
wait for writing: 15 2262m 2262m ( 99.99%) ( 20.71%)
wait to flush : 0 0m 0m (100.00%) ( 0.00%)
writing to tape : 1 1796m 1708m (105.17%) ( 16.45%)
failed to tape : 0 0m 0m ( 0.00%) ( 0.00%)
taped : 27 7363m 6953m (105.90%) ( 67.41%)
tape 1 : 27 7363m 6953m ( 49.09%) DailySet1-008
6 dumpers idle : not-idle
taper writing, tapeq: 15
network free kps: 807000
holding space : 173594m ( 97.71%)
dumper0 busy : 0:27:57 ( 5.92%)
dumper1 busy : 1:49:52 ( 23.27%)
dumper2 busy : 0:08:36 ( 1.82%)
dumper3 busy : 0:40:12 ( 8.52%)
dumper4 busy : 1:28:18 ( 18.70%)
dumper5 busy : 0:42:57 ( 9.10%)
taper busy : 1:44:34 ( 22.15%)
0 dumpers busy : 6:01:30 ( 76.55%) not-idle: 5:59:46 ( 99.52%)
start-wait: 0:01:43 ( 0.48%)
1 dumper busy : 0:20:41 ( 4.38%) client-constrained: 0:20:26 ( 98.75%)
start-wait: 0:00:15 ( 1.24%)
2 dumpers busy : 0:34:00 ( 7.20%) client-constrained: 0:34:00 (100.00%)
3 dumpers busy : 0:11:05 ( 2.35%) client-constrained: 0:11:05 (100.00%)
4 dumpers busy : 0:32:58 ( 6.98%) client-constrained: 0:32:58 (100.00%)
5 dumpers busy : 0:06:21 ( 1.35%) no-bandwidth: 0:06:21 (100.00%)
6 dumpers busy : 0:05:23 ( 1.14%) no-dumpers: 0:04:36 ( 85.50%)
not-idle: 0:00:46 ( 14.50%)
[amanda@lnx-que-amanda ~]$
-----------------------------------------------------------------------
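To pick the failed entries out of a long amstatus listing like the one
above, here is a minimal Python sketch. It assumes the listing has been
pasted into a string; timed_out_disks is a name made up for
illustration and is not part of Amanda. Wrapped lines are joined before
matching, since the entries above are split across two lines.

```python
import re

# Matches one amstatus entry such as
#   cahors:/disk2 0 planner: [disk /disk2, all estimate timed out]
# once wrapped lines have been joined with single spaces.
TIMED_OUT = re.compile(
    r"([\w.-]+):(/\S*)\s+\d+\s+planner:\s+"
    r"\[disk\s+\S+,\s+all estimate timed out\]"
)


def timed_out_disks(amstatus_text):
    """Return (host, disk) pairs whose estimates all timed out.

    Hypothetical helper for illustration only, not an Amanda tool.
    """
    # Collapse every run of whitespace (including line breaks) to one space
    # so that entries wrapped by the mail client match again.
    joined = " ".join(amstatus_text.split())
    return TIMED_OUT.findall(joined)


sample = """\
cahors:/ 0 planner: [disk /,
all estimate timed out]
cahors:/disk2 0 planner: [disk
/disk2, all estimate timed out]
chablis:/ 1 84m finished (7:34:33)
"""

for host, disk in timed_out_disks(sample):
    print(host, disk)
```

Run against the full listing, this should report only the nine cahors
entries, which is what points me at that host.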
I suspect a problem with the host "cahors", since the status shows
that all of its estimates timed out. I looked at the Amanda processes
running on cahors and here's what I have:
ps -ef | grep amanda
amanda 8405 193 0 01:29:57 ? 0:00 amandad
amanda 8406 8405 0 01:29:57 ? 0:00 /usr/local/libexec/sendsize
amanda 8407 8405 0 0:00 <defunct>
amanda 8408 8406 0 01:29:57 ? 0:00 /usr/local/libexec/sendsize
For some reason, it looks like I have a defunct child of amandad and a
couple of sendsize processes that have been hung there since 01:29 this
morning. My backup begins at around 01:30.
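To make the parent/child relationships in that ps listing explicit,
here is a minimal Python sketch. It assumes the ps lines are pasted in
as text; process_tree is a name made up for illustration, not an Amanda
tool. It takes the last field of each line as the command, which is
only adequate here because none of these commands carry arguments.

```python
from collections import defaultdict


def process_tree(ps_output, user="amanda"):
    """Map each parent PID to its (pid, command) children.

    Hypothetical helper for illustration; expects `ps -ef` style lines
    that start with: UID PID PPID C ...
    """
    children = defaultdict(list)
    for line in ps_output.splitlines():
        fields = line.split()
        if len(fields) < 5 or fields[0] != user:
            continue
        if not (fields[1].isdigit() and fields[2].isdigit()):
            continue
        pid, ppid = int(fields[1]), int(fields[2])
        # Last field stands in for the command (no arguments assumed).
        children[ppid].append((pid, fields[-1]))
    return dict(children)


sample = """\
amanda 8405 193 0 01:29:57 ? 0:00 amandad
amanda 8406 8405 0 01:29:57 ? 0:00 /usr/local/libexec/sendsize
amanda 8407 8405 0 0:00 <defunct>
amanda 8408 8406 0 01:29:57 ? 0:00 /usr/local/libexec/sendsize
"""

tree = process_tree(sample)
for ppid in sorted(tree):
    for pid, command in tree[ppid]:
        print(f"{ppid} -> {pid} {command}")
```

This shows amandad (8405) as the parent of one sendsize (8406) and the
defunct process (8407), with the second sendsize (8408) a child of the
first.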
I don't understand why Amanda is still running. Did it wait for that
host until the estimates timed out before dumping the other hosts?
Also, please note that cahors is a new host that I added to my config
last week. It backed up fine last week. However, the Amanda version on
the cahors client is 2.4.5p1 while the server is at 2.4.5. Could that
be a problem? I don't want to have to reinstall 2.4.5p1 everywhere, and
I can't find the 2.4.5 sources anywhere (the disk where I kept them on
my PC crashed).
Why are those processes on cahors "hung"? Should I kill them?
Thanks