Amanda-Users

Re: Estimate phase failed for all disks: server not backed up...

2005-05-19 11:44:27
Subject: Re: Estimate phase failed for all disks: server not backed up...
From: Brian Cuttler <brian AT wadsworth DOT org>
To: amanda-users AT amanda DOT org
Date: Thu, 19 May 2005 11:29:25 -0400
I had a problem with one of my amanda servers not hearing back
from one of the clients during estimate phase. It was my own
fault: for non-amanda reasons I'd started amdump during daylight
hours, when the machine was more heavily loaded.

I was able to get a very good idea of how long it took the estimate
phase to complete by examining the /tmp/amanda files on the client.

I'm not recommending this as a practice, but as long as the info
is in the log files... The estimate phase completes when the last
child terminates - AFAIK, YMMV

[samar] /tmp/amanda 104> grep termin sendsize.20050516210006.debug
sendsize[5891]: time 6.609: child 6150 terminated normally
sendsize[6146]: time 7.461: asking killpgrp to terminate
sendsize[6146]: time 23.051: asking killpgrp to terminate
sendsize[5891]: time 24.063: child 6146 terminated normally
sendsize[6149]: time 30.378: asking killpgrp to terminate
sendsize[6149]: time 68.221: asking killpgrp to terminate
sendsize[5891]: time 69.232: child 6149 terminated normally
sendsize[5891]: time 143.366: child 6175 terminated normally
sendsize[5891]: time 300.214: child 6186 terminated normally
sendsize[5891]: time 356.362: child 6207 terminated normally
sendsize[5891]: time 1341.432: child 6160 terminated normally
sendsize[5891]: time 1796.592: child 6166 terminated normally
sendsize[5891]: time 1797.537: child 6364 terminated normally
sendsize[5891]: time 1820.820: child 6367 terminated normally
sendsize[5891]: time 1820.944: child 6376 terminated normally
sendsize[5891]: time 1924.346: child 6216 terminated normally
sendsize[5891]: time 2810.064: child 6379 terminated normally
sendsize[5891]: time 3041.239: child 6151 terminated normally
sendsize[5891]: time 3617.385: child 6320 terminated normally
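
For anyone who would rather not eyeball the grep output, here is a rough
Python sketch of the same idea: scan the sendsize debug lines and report
when the last child terminated. It assumes the "time N.NNN" field is
seconds elapsed since sendsize started, and that the lines look like the
ones shown above. The function names and the SAMPLE text are my own
illustration, not part of amanda.

```python
import re

# Matches lines such as:
#   sendsize[5891]: time 3617.385: child 6320 terminated normally
# Assumption: the "time" field is seconds elapsed since sendsize started.
CHILD_DONE = re.compile(
    r"^sendsize\[(?P<parent>\d+)\]: time (?P<secs>\d+\.\d+): "
    r"child (?P<child>\d+) terminated"
)


def child_finish_times(lines):
    """Return {child_pid: seconds} for every 'child N terminated' line."""
    times = {}
    for line in lines:
        match = CHILD_DONE.match(line.strip())
        if match:
            times[int(match.group("child"))] = float(match.group("secs"))
    return times


def estimate_duration(lines):
    """Seconds until the last child terminated, or None if none were seen."""
    times = child_finish_times(lines)
    return max(times.values()) if times else None


# A few lines copied from the grep output above (hypothetical test input).
SAMPLE = """\
sendsize[5891]: time 6.609: child 6150 terminated normally
sendsize[6146]: time 7.461: asking killpgrp to terminate
sendsize[5891]: time 3041.239: child 6151 terminated normally
sendsize[5891]: time 3617.385: child 6320 terminated normally
"""

total = estimate_duration(SAMPLE.splitlines())
if total is not None:
    print(f"last child finished after {total:.0f} s (about {total / 60:.0f} min)")
```

To run it against a real file, pass an open file object (for example the
sendsize.*.debug file under /tmp/amanda) to estimate_duration() instead
of SAMPLE. The result is only a rough guide for choosing a timeout; if I
remember the amanda.conf man page correctly, the relevant parameter is
etimeout and it is applied per disk, so check that before copying the
number in.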

On Thu, May 19, 2005 at 11:12:09AM -0400, Jon LaBadie wrote:
> On Thu, May 19, 2005 at 10:47:26AM -0400, Guy Dallaire wrote:
> > Here is what I have in my amanda log this morning:
> > 
> > FAILURE AND STRANGE DUMP SUMMARY:
> >   planner: ERROR Estimate timeout from sol
> >   sol        /data2 lev 0 FAILED [disk /data2, all estimate timed out]
> >   sol        /data1 lev 0 FAILED [disk /data1, all estimate timed out]
> >   sol        /disk1 lev 0 FAILED [disk /disk1, all estimate timed out]
> >   sol        / lev 0 FAILED [disk /, all estimate timed out]
> > 
> > What might be wrong here ? The first time I ran amanda, it backed up
> > this server without a problem.  The timeout parameter is at the
> > standard 300 secs. I bumped it up to 600 seconds for the next run but
> > I'm worried.
> > 
> > I did not change anything to the config, except maybe yesterday I did
> > an "amadmin force sol /" because I decided to start using gnu-tar
> > instead of ufsdump on all my root file systems.
> 
> raise them way up, say 6000 sec, just to see if it simply is slow.
> 
> BTW in general it is best to introduce a lot of things a few DLEs at
> a time.  This avoids the problem of massive level 0's all in one dump.
> Spread them out like amanda will eventually.  Add a couple from sol,
> a couple from mercury, one or two from venus, ...  Then tomorrow
> a few more.
> 
> -- 
> Jon H. LaBadie                  jon AT jgcomp DOT com
>  JG Computing
>  4455 Province Line Road        (609) 252-0159
>  Princeton, NJ  08540-4322      (609) 683-7220 (fax)
---
   Brian R Cuttler                 brian.cuttler AT wadsworth DOT org
   Computer Systems Support        (v) 518 486-1697
   Wadsworth Center                (f) 518 473-6384
   NYS Department of Health        Help Desk 518 473-0773