Amanda-Users

Re: Strange amanda problem this morning

2006-03-28 10:24:50
From: Paul Bijnens <paul.bijnens AT xplanation DOT com>
To: Guy Dallaire <clepeterd AT gmail DOT com>
Date: Tue, 28 Mar 2006 17:20:56 +0200
On 2006-03-28 16:39, Guy Dallaire wrote:
I use amanda 2.4.5. The server is a CentOS 4.2 box. The clients are a mix
of Linux (Red Hat EL 3, CentOS 3.3 and CentOS 4.x) and Solaris 9 boxes.

This morning, it looks like amanda had trouble doing the backup
during the night. Normally, the backup takes a couple of hours or so.

When I arrived this morning at the office, the usual tape contents
list (DLT.ps) was not printed and the amanda report was not in my
e-mail.

An amstatus DailySet1 shows that amanda is still running:

-----------------------------------------------------------

Using /usr/local/var/amanda/log/DailySet1/amdump from Tue Mar 28
01:30:03 EST 2006

cahors:/            0 planner: [disk /, all estimate timed out]
cahors:/disk2       0 planner: [disk /disk2, all estimate timed out]
cahors:/disk3       0 planner: [disk /disk3, all estimate timed out]
cahors:/disk4       0 planner: [disk /disk4, all estimate timed out]
cahors:/disk5       0 planner: [disk /disk5, all estimate timed out]
cahors:/disk6       0 planner: [disk /disk6, all estimate timed out]
cahors:/disk7       0 planner: [disk /disk7, all estimate timed out]
cahors:/disk8       0 planner: [disk /disk8, all estimate timed out]
cahors:/disk9       0 planner: [disk /disk9, all estimate timed out]
[...]
 0 dumpers busy :  6:01:30  ( 76.55%)            not-idle:  5:59:46  ( 99.52%)
                                               start-wait:  0:01:43  (  0.48%)


There were 0 dumpers busy for 6 hours, so I guess the server spent
those 6 hours waiting for the estimates: 6 hours is 21600 seconds,
or 2400 seconds for each DLE (cahors has 9 DLEs).
Could it be that you have "etimeout 2400" in your amanda.conf?
(A positive value = "the amount of time *per disk* on a given client...")
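
To make that arithmetic explicit, here is a small sketch in Python. The function name is mine, not something that exists in Amanda, and the value 2400 is only my guess at what is in your amanda.conf. The handling of a negative value (a total per client instead of a per-disk time) is how I read the etimeout entry in the amanda.conf man page.

```python
def planner_wait_seconds(etimeout: int, n_disks: int) -> int:
    """Upper bound on how long planner waits for one client's estimates.

    Positive etimeout: seconds per disk, multiplied by the number of disks.
    Negative etimeout: total seconds for the whole client (as I read the
    amanda.conf man page).
    """
    if etimeout < 0:
        return -etimeout
    return etimeout * n_disks


# cahors has 9 DLEs; 2400 is an assumed value, not taken from your config.
wait = planner_wait_seconds(2400, 9)
print(wait, "seconds =", wait / 3600, "hours")
```

With 9 DLEs and 2400 seconds per disk this gives the 6 hours seen in the amstatus output.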


I'm suspecting a problem with the host "cahors" as the status shows
that all estimate timed out. I have looked at the amanda process
running on cahors and here's what I have:

 ps -ef | grep amanda

  amanda  8405   193  0 01:29:57 ?        0:00 amandad
  amanda  8406  8405  0 01:29:57 ?        0:00 /usr/local/libexec/sendsize
  amanda  8407  8405  0                   0:00 <defunct>
  amanda  8408  8406  0 01:29:57 ?        0:00 /usr/local/libexec/sendsize

For some reason, it looks like I have a defunct child of amandad and a
couple of sendsize processes that have been hanging there since 01:29
this morning. My backup begins at around 01:30.

The defunct process is probably not the problem.
The time you see is the time that the programs started.

You will probably find more information in the client debug files on
cahors: /tmp/amanda/sendsize.datetime.debug .
Try to find out what those processes are doing right now with
"strace -p PID"   (or "truss" on Solaris).



I still don't know why amanda is still running. Did it wait for that
host until it timed out before dumping the other hosts ?

Your etimeout is probably too large, if my assumption of
 "etimeout 2400" is correct, of course.


Also, please note that cahors is a new host that I added to my config
last week. It backed up fine last week. But the amanda version of the
cahors client is 2.4.5p1 while the server is at 2.4.5, could it be a
problem ? I don't want to have to reinstall 2.4.5p1 everywhere and I
can't find the 2.4.5 sources anywhere (the disk where I kept it on my
PC crashed)

I do not expect a version mismatch or incompatibility between client
and server.  That would be the first time I noticed something like that
in 7 years of experience. (I currently use a mix of 2.4.4xx, 2.4.5xx,
2.5.0b2 and 2.5.0, and have never noticed any problem with that mix.)


Why are those processes on cahors "hung", should I kill them ?

Probably, but first find out why it takes so long...

Sometimes an unresponsive NFS server can hang gnutar while it is walking
the filesystem tree.  Are there any other messages in /var/log/messages?

If those disks are filled with many small files, then doing an estimate
with gnutar really can take a long time.
Normally Amanda can ask for 1, 2 or 3 estimates for a disk (level 0,
current level, and current level plus 1).  If you gradually added each
disk to cahors last week, then this may be the first time that Amanda
needs to estimate each disk multiple times, which takes too long.
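
As a rough illustration of how the number of estimate passes can grow, here is a simplified model in Python. It only encodes what I wrote above (level 0, current level, current level plus 1, with duplicates removed); it is not planner's real logic, and the assumption that a never-dumped disk gets only a level 0 estimate, as well as the dump history used below, are mine.

```python
from typing import List, Optional


def estimate_levels(last_level: Optional[int]) -> List[int]:
    """Levels requested for one disk in this simplified model.

    last_level is None for a disk that has never been dumped (assumed to
    need only a level 0 estimate), otherwise the level of its last dump.
    """
    if last_level is None:
        return [0]
    # Level 0, current level, current level plus 1, duplicates removed.
    return sorted({0, last_level, last_level + 1})


# Hypothetical history: 9 disks, all new last week versus all at level 1 now.
new_disks = [None] * 9
known_disks = [1] * 9

passes_new = sum(len(estimate_levels(lvl)) for lvl in new_disks)
passes_known = sum(len(estimate_levels(lvl)) for lvl in known_disks)
print(passes_new, "estimate passes when every disk is new")
print(passes_known, "estimate passes once every disk has a dump history")
```

In this model the same 9 disks go from 9 gnutar estimate runs to 27, which could explain why last week's runs finished in time and this one did not.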

Is cahors a fast host, with separate disks?  If yes, you could
add "maxdumps 2" or even more for that host (and preferably indicate a
spindle number in the disklist entries for cahors).

Or switch to the faster estimate options available in 2.4.5
for the DLE's of cahors: "estimate calcsize", or "estimate server"
(see man amanda.conf).



--
Paul Bijnens, xplanation Technology Services        Tel  +32 16 397.511
Technologielaan 21 bus 2, B-3001 Leuven, BELGIUM    Fax  +32 16 397.512
http://www.xplanation.com/          email:  Paul.Bijnens AT xplanation DOT com
***********************************************************************
* I think I've got the hang of it now:  exit, ^D, ^C, ^\, ^Z, ^Q, ^^, *
* F6, quit, ZZ, :q, :q!, M-Z, ^X^C, logoff, logout, close, bye, /bye, *
* stop, end, F3, ~., ^]c, +++ ATH, disconnect, halt,  abort,  hangup, *
* PF4, F20, ^X^X, :D::D, KJOB, F14-f-e, F8-e,  kill -1 $$,  shutdown, *
* init 0, kill -9 1, Alt-F4, Ctrl-Alt-Del, AltGr-NumLock, Stop-A, ... *
* ...  "Are you sure?"  ...   YES   ...   Phew ...   I'm out          *
***********************************************************************

