Strange amanda problem this morning

2006-03-28 09:49:14
Subject: Strange amanda problem this morning
From: "Guy Dallaire" <clepeterd AT gmail DOT com>
To: amanda-users <amanda-users AT amanda DOT org>
Date: Tue, 28 Mar 2006 09:39:39 -0500
I use Amanda 2.4.5. The server is a CentOS 4.2 box. The clients are a
mix of Linux (Red Hat EL 3, CentOS 3.3 and CentOS 4.x) and Solaris 9 boxes.

This morning, it looks like Amanda had trouble completing the backup
during the night. Normally, the backup takes a couple of hours or so.

When I arrived this morning at the office, the usual tape contents
list (DLT.ps) was not printed and the amanda report was not in my
e-mail.

Running amstatus DailySet1 shows that amanda is still running:

-----------------------------------------------------------

Using /usr/local/var/amanda/log/DailySet1/amdump from Tue Mar 28 01:30:03 EST 2006

cahors:/                                          0 planner: [disk /, all estimate timed out]
cahors:/disk2                                     0 planner: [disk /disk2, all estimate timed out]
cahors:/disk3                                     0 planner: [disk /disk3, all estimate timed out]
cahors:/disk4                                     0 planner: [disk /disk4, all estimate timed out]
cahors:/disk5                                     0 planner: [disk /disk5, all estimate timed out]
cahors:/disk6                                     0 planner: [disk /disk6, all estimate timed out]
cahors:/disk7                                     0 planner: [disk /disk7, all estimate timed out]
cahors:/disk8                                     0 planner: [disk /disk8, all estimate timed out]
cahors:/disk9                                     0 planner: [disk /disk9, all estimate timed out]
chablis:/                                         1       84m finished (7:34:33)
gobelet:/                                         0      772m dump done (8:14:45), wait for writing to tape
gobelet:/disk1                                    1      600m finished (7:45:30)
lnx-que-amanda:/                                  0     1320m finished (8:05:56)
lnx-que-webpublic:/criqdata/web/copie_securite    1        1m finished (8:07:21)
lnx-que-webpublic:/criqdata/web/securite          0        0m finished (7:31:14)
lnx-que-webpublic:/criqdata/web/webdav            0        0m finished (7:30:58)
lnx-que-webpublic:/criqutil                       0        0m finished (7:31:28)
lnx-que-webpublic:/etc                            1        1m finished (7:30:10)
lnx-que-webpublic:/home                           1        0m finished (7:30:48)
lnx-que-wforms1:/WEB_DEP_6i                       0       27m finished (7:35:27)
lnx-que-wforms1-dev:/WEB_DEP_6i                   0       27m finished (7:35:03)
lnx-que-wforms1-dev:/disk1/criqdata/web/sites_web 0        0m finished (7:30:12)
madiran:/                                         1        2m dump done (9:20:31), wait for writing to tape
madiran:/data1                                    1        0m dump done (9:22:16), wait for writing to tape
madiran:/data2                                    1        0m dump done (9:22:02), wait for writing to tape
madiran:/data3                                    0        0m dump done (9:20:47), wait for writing to tape
madiran:/data4                                    0        0m dump done (9:21:47), wait for writing to tape
madiran:/data5                                    0        0m dump done (9:21:32), wait for writing to tape
madiran:/data6                                    0        0m dump done (9:20:32), wait for writing to tape
madiran:/data7                                    1        0m dump done (9:21:17), wait for writing to tape
madiran:/data8                                    1        0m dump done (9:21:02), wait for writing to tape
madiran:/disk1                                    1     1796m writing to tape (9:20:03)
produc-new:/                                      1        0m finished (7:35:30)
produc-new:/disk1                                 1        3m finished (7:30:43)
produc-new:/disk10                                1        0m finished (7:30:56)
produc-new:/disk11                                0     1443m finished (9:20:01)
produc-new:/disk2                                 1        0m finished (8:07:23)
produc-new:/disk3                                 1        0m finished (7:35:38)
produc-new:/disk4                                 1        0m finished (7:31:26)
produc-new:/disk5                                 0        0m finished (7:31:12)
produc-new:/disk6                                 1        0m finished (7:35:36)
produc-new:/disk7                                 0        0m finished (7:35:34)
produc-new:/disk8                                 0        0m finished (7:35:33)
produc-new:/disk9                                 1        1m finished (7:30:07)
produc-new:/export/home                           1        0m finished (7:30:46)
riesling:/                                        1       97m finished (8:07:18)
sol:/                                             0      763m dump done (8:25:25), wait for writing to tape
sol:/data1                                        0        0m dump done (8:25:50), wait for writing to tape
sol:/data2                                        0        0m dump done (8:26:05), wait for writing to tape
sol:/disk1                                        1        5m dump done (8:25:50), wait for writing to tape
sol:/disk1/RDBMS_BACKUP/PWEB/arch_save            1     3754m finished (9:01:50)
sol:/disk1/RDBMS_BACKUP/PWEB/data_save            0      718m dump done (8:18:30), wait for writing to tape

SUMMARY          part      real  estimated
                           size       size
partition       :  52
estimated       :  43                10923m
flush           :   0         0m
failed          :   9                    0m           (  0.00%)
wait for dumping:   0                    0m           (  0.00%)
dumping to tape :   0                    0m           (  0.00%)
dumping         :   0         0m         0m (  0.00%) (  0.00%)
dumped          :  43     11422m     10923m (104.57%) (104.57%)
wait for writing:  15      2262m      2262m ( 99.99%) ( 20.71%)
wait to flush   :   0         0m         0m (100.00%) (  0.00%)
writing to tape :   1      1796m      1708m (105.17%) ( 16.45%)
failed to tape  :   0         0m         0m (  0.00%) (  0.00%)
taped           :  27      7363m      6953m (105.90%) ( 67.41%)
  tape 1        :  27      7363m      6953m ( 49.09%) DailySet1-008
6 dumpers idle  : not-idle
taper writing, tapeq: 15
network free kps:    807000
holding space   :    173594m ( 97.71%)
 dumper0 busy   :  0:27:57  (  5.92%)
 dumper1 busy   :  1:49:52  ( 23.27%)
 dumper2 busy   :  0:08:36  (  1.82%)
 dumper3 busy   :  0:40:12  (  8.52%)
 dumper4 busy   :  1:28:18  ( 18.70%)
 dumper5 busy   :  0:42:57  (  9.10%)
   taper busy   :  1:44:34  ( 22.15%)
 0 dumpers busy :  6:01:30  ( 76.55%)            not-idle:  5:59:46  ( 99.52%)
                                               start-wait:  0:01:43  (  0.48%)
 1 dumper busy  :  0:20:41  (  4.38%)  client-constrained:  0:20:26  ( 98.75%)
                                               start-wait:  0:00:15  (  1.24%)
 2 dumpers busy :  0:34:00  (  7.20%)  client-constrained:  0:34:00  (100.00%)
 3 dumpers busy :  0:11:05  (  2.35%)  client-constrained:  0:11:05  (100.00%)
 4 dumpers busy :  0:32:58  (  6.98%)  client-constrained:  0:32:58  (100.00%)
 5 dumpers busy :  0:06:21  (  1.35%)        no-bandwidth:  0:06:21  (100.00%)
 6 dumpers busy :  0:05:23  (  1.14%)          no-dumpers:  0:04:36  ( 85.50%)
                                                 not-idle:  0:00:46  ( 14.50%)
[amanda@lnx-que-amanda ~]$

-----------------------------------------------------------------------
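
For a long listing like the one above, the timed-out entries can be
pulled out with a small filter. The following is only a minimal sketch
in Python, assuming the amstatus output was saved as plain text. The
function name timed_out_disks and the regular expression are
illustrative and not part of Amanda; the line layout is assumed from
the listing above, including lines wrapped by the mail client.

```python
import re
from collections import defaultdict

# Assumed layout: "host:/disk   <level> <status text>" at the start of a line.
ENTRY = re.compile(r"^(?P<host>[^\s:]+):(?P<disk>\S+)\s+(?P<level>\d+)\s+(?P<rest>.*)$")


def timed_out_disks(status_text):
    """Return {host: [disk, ...]} for entries reporting an estimate timeout."""
    entries = []
    joinable = False  # True while the previous line belongs to a disk entry
    for line in status_text.splitlines():
        if ENTRY.match(line):
            entries.append(line.strip())
            joinable = True
        elif line.strip() and joinable:
            # A non-entry line right after an entry is treated as its wrapped tail.
            entries[-1] += " " + line.strip()
        else:
            # A blank line or unrelated text ends the current entry.
            joinable = False

    failed = defaultdict(list)
    for entry in entries:
        match = ENTRY.match(entry)
        if "estimate timed out" in match.group("rest"):
            failed[match.group("host")].append(match.group("disk"))
    return dict(failed)
```

Fed the listing above, this should group all nine failed disks under
cahors and report nothing for the other hosts.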

I suspect a problem with the host "cahors", since the status shows
that all estimates timed out for it. I looked at the amanda processes
running on cahors, and here is what I found:

 ps -ef | grep amanda

  amanda  8405   193  0 01:29:57 ?        0:00 amandad
  amanda  8406  8405  0 01:29:57 ?        0:00 /usr/local/libexec/sendsize
  amanda  8407  8405  0                   0:00 <defunct>
  amanda  8408  8406  0 01:29:57 ?        0:00 /usr/local/libexec/sendsize

For some reason, it looks like I have a defunct child of amandad and a
couple of sendsize processes that have been hung there since 01:29 this
morning. My backup starts at around 01:30.
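
To make the parent/child relationships in a listing like this easier to
read, the PID and PPID columns can be paired up programmatically. This
is a rough sketch, assuming `ps -ef` style columns (UID, PID, PPID, ...)
as shown above. The helper names parse_ps and defunct_with_parents are
hypothetical, and taking the last field as the command is a
simplification that ignores command arguments.

```python
def parse_ps(ps_text):
    """Map PID -> (PPID, command) from `ps -ef` style lines."""
    procs = {}
    for line in ps_text.splitlines():
        fields = line.split()
        # Need UID, PID, PPID, C and a command; skip headers and other noise.
        if len(fields) < 5 or not (fields[1].isdigit() and fields[2].isdigit()):
            continue
        # Simplification: the last whitespace-separated field is the command.
        procs[int(fields[1])] = (int(fields[2]), fields[-1])
    return procs


def defunct_with_parents(procs):
    """Return [(pid, parent_pid, parent_command)] for every <defunct> entry."""
    result = []
    for pid, (ppid, cmd) in sorted(procs.items()):
        if cmd == "<defunct>":
            parent_cmd = procs.get(ppid, (None, "unknown"))[1]
            result.append((pid, ppid, parent_cmd))
    return result
```

On the four lines above, this pairs the defunct PID 8407 with its
parent 8405 (amandad).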

I still don't know why amanda is still running. Did it wait for that
host to time out before dumping the other hosts?

Also, please note that cahors is a new host that I added to my config
last week. It backed up fine last week. However, the amanda version on
the cahors client is 2.4.5p1 while the server is at 2.4.5. Could that
be a problem? I don't want to have to reinstall 2.4.5p1 everywhere, and
I can't find the 2.4.5 sources anywhere (the disk where I kept them on
my PC crashed).

Why are those processes on cahors "hung"? Should I kill them?

Thanks

