Re: estimate timeouts at 6hrs?

* Jean-Louis Martineau <martineau AT zmanda DOT com> [20070611 10:00]:
> amandad has a hard limit of 6h (see REP_TIMEOUT in amandad-src/amandad.c)
> when waiting for the reply from sendsize.
> 
> Try the attached patch; it resets the timeout after each estimate.

Thanks Jean-Louis.
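
That would square with the numbers. A rough sanity check, as a sketch
only: it assumes the overall estimate window is etimeout multiplied by
the number of DLEs (the reasoning in my original message quoted below)
and that amandad gives up after the 6h REP_TIMEOUT you describe. The
function names are made up for illustration and are not Amanda code.

```python
# Sketch only: the constant and the formula come from this thread,
# not from reading the Amanda source.
REP_TIMEOUT_S = 6 * 60 * 60  # amandad's hard reply limit: 6h, per Jean-Louis


def expected_estimate_window(etimeout_s, n_dles):
    """Window I expected: etimeout per DLE times the number of DLEs."""
    return etimeout_s * n_dles


def effective_estimate_window(etimeout_s, n_dles, rep_timeout_s=REP_TIMEOUT_S):
    """Window that actually applies if amandad stops waiting at rep_timeout_s."""
    return min(expected_estimate_window(etimeout_s, n_dles), rep_timeout_s)


expected = expected_estimate_window(5600, 77)
effective = effective_estimate_window(5600, 77)
print(f"expected:  {expected} s = {expected / 3600:.1f} h")
print(f"effective: {effective} s = {effective / 3600:.1f} h")
```

With etimeout=5600 and 77 DLEs the expected window is about 120h, but
the cap brings it down to the 6:00 that shows up in the report.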

Would that explain why I see a lot of runaway processes after
sendsize times out? Over the weekend I had a situation where more than
90 gnutar processes were left around with init as their parent, like
the following:

UID    PID      PPID  C    STIME    TTY     TIME  CMD
root   23243074 1     0    16:22:41 ?       11:40 gtar --create --file -
--directory /data/mafalda/mafalda1/susanita/jen/anxiety_

The relevant debug file showed:

runtar.20070610162241.debug
runtar: debug 1 pid 23243074 ruid 666 euid 0: start at Sun Jun 10
16:22:41 2007
runtar: time 0.002: version 2.5.2-20070523
/usr/freeware/bin/tar version: tar (GNU tar) 1.13.25

config: stk_80-conf1
runtar: debug 1 pid 23243074 ruid 0 euid 0: rename at Sun Jun 10
16:22:41 2007
running: /usr/freeware/bin/tar: 'gtar' '--create' '--file' '-'
'--directory'
'/data/mafalda/mafalda1/susanita/jen/anxiety_version1/sub115'
'--one-file-system' '--listed-incremental'
'/opt/amanda/amanda1/var/amanda/gnutar-lists/yoricksub115_1.new'
'--sparse'
'--ignore-failed-read' '--totals' '.' 
runtar: time 0.020: pid 23243074 finish time Sun Jun 10 16:22:41 2007


I've seen this with both xfsdump and gnutar.
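
For anyone wanting to check for the same leftovers, here is a minimal
sketch that picks the orphans out of `ps -ef` output. It assumes the
eight-column layout shown above (UID PID PPID C STIME TTY TIME CMD) and
an STIME without embedded spaces. The helper name find_orphans and the
second sample row are made up for illustration.

```python
import os


def find_orphans(ps_output, commands=("gtar", "xfsdump")):
    """Return (pid, cmd) for processes whose parent is init (PPID 1)
    and whose command name is one of `commands`."""
    orphans = []
    for line in ps_output.splitlines()[1:]:  # skip the header row
        fields = line.split(None, 7)         # CMD keeps its own spaces
        if len(fields) < 8:
            continue                         # blank or unexpected line
        pid, ppid, cmd = fields[1], fields[2], fields[7].strip()
        name = os.path.basename(cmd.split()[0])
        if ppid == "1" and name in commands:
            orphans.append((int(pid), cmd))
    return orphans


# First row is from the listing above; the second row is a hypothetical
# gtar that still has a live parent and so must not be reported.
sample = """\
UID    PID      PPID  C    STIME    TTY     TIME  CMD
root   23243074 1     0    16:22:41 ?       11:40 gtar --create --file -
root   4242     4200  0    16:30:00 ?       0:05 gtar --create --file -
"""

for pid, cmd in find_orphans(sample):
    print(pid, cmd)
```

On a live system the input could come from
subprocess.run(["ps", "-ef"], capture_output=True, text=True).stdout,
though the column layout may differ between platforms.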

thanks, jf

> 
> Jean-Louis
> 
> Jean-Francois Malouin wrote:
> >Hi,
> >
> >A new problem that has me stumped: all the amdumps from client to server
> >(same host running 2.5.2-20070623) have failed due to the estimate timing
> >out after 6:00h. This happened in all of the configs that I run,
> >even though the etimeout in each Amanda config is set to a
> >ridiculous value: in one case etimeout=5600 and I have 77 DLEs, which
> >should not time out for ~120h! Could anything else cause this?
> >
> >FAILURE AND STRANGE DUMP SUMMARY:
> >  yorick  /data/bigml/bigml1                  lev 0  FAILED [disk
> >/data/bigml/bigml1, all estimate timed out]
> >...
> >  yorick  /data/nih/nih1/                     lev 0  FAILED [disk
> >/data/nih/nih1/, all estimate timed out]
> > planner: ERROR Request to yorick failed: EOF on read from yorick
> >
> >
> >STATISTICS:
> >                          Total       Full      Incr.
> >                        --------   --------   --------
> >Estimate Time (hrs:min)    6:00
> >Run Time (hrs:min)        15:07
> >Dump Time (hrs:min)       15:14      14:59       0:15
> >
> >
> >jf
> >  
> 



-- 
<° ><