Re: Estimate Timeouts

* Garnet Harris <garnet.harris AT ekasystems DOT com> [20091030 13:33]:
> I am experimenting with amanda and am having problems understanding how  
> the "estimate timeout" works.  What I am seeing is not what I expect  
> based on the explanation in the amanda.conf page.
>
> Client and server are on the same machine.
>
> 36 - DLEs  (some set to calcsize, most use tar for estimates)
>
> all DLEs are set to "always full" (to remove any confusion caused by  
> incremental backups)
>
> etimeout = 900
>
> My understanding is that amanda will allow 900 seconds per DLE.  So it  
> should timeout in 32,400 seconds (900 x 36) or 9 hours.  According to  
> the report after amdump runs, the estimate phase is over 11 hours.
>
> Looking at the sendsize log, the last "estimate time for" the 28th DLE  
> is at time 30,708.  And, the time stamp on the sendsize file matches:  
> approximately 8.5 hours after amanda started.  (So far so good.)  
> However, the first runtar log doesn't appear until another 1.5 hours  
> later.  Which means amanda didn't do anything for 1.5 hours.  (The lag  
> is greater when allowing incremental backups.)
>
> Looking at the planner log on the server side, there is a "dgram_recv"
> entry with a matching timestamp for each "estimate time" entry on the
> client side for the first 27 DLEs.
>
> client sendsize log:
>
> sendsize[8661]: time 9899.506: estimate time for home_q level 0: 12.033
> sendsize[8668]: time 13962.655: estimate time for home_r level 0: 4063.141
> sendsize[9335]: time 30708.356: estimate time for home_s level 0: 16745.578
>
>
> server planner log:
>
> time 9899.665: dgram_recv(dgram=0xb805c764, timeout=0, fromaddr=0xb806c750)
> time 9899.665: (sockaddr_in6 *)0xb806c750 = { 10, 10080,  
> ::ffff:192.168.0.247 }
> time 13962.833: dgram_recv(dgram=0xb805c764, timeout=0, fromaddr=0xb806c750)
> time 13962.833: (sockaddr_in6 *)0xb806c750 = { 10, 10080,  
> ::ffff:192.168.0.247 }
> time 21600.190: dgram_recv(dgram=0xb805c764, timeout=0, fromaddr=0xb806c750)
> time 21600.211: (sockaddr_in6 *)0xb806c750 = { 10, 10080,  
> ::ffff:192.168.0.247 }
> time 40082.633: security_seterror(handle=0x80721f8, driver=0xb804a720  
> (BSD) error=timeout waiting for REP)
> time 40082.665: security_close(handle=0x80721f8, driver=0xb804a720 (BSD))
> time 40082.719: pid 6922 finish time Sun Oct 25 19:08:03 2009
>
>
> Something is happening at 21600 (6 hours).  The server receives a
> datagram from somewhere (there is no corresponding entry in the sendsize
> log) and stops looking for estimates from the client.  It then waits
> another 5 hours before it starts the actual backup.
>
> Any idea what is happening at 21600?

Been there, seen that.

Look at REP_TIMEOUT = (6*60*60), set in amandad-src/amandad.c, i.e. 6 hours
or 21600 seconds. That matches the dgram_recv at time 21600 in your planner
log, and it is what you are seeing. You either have to figure out why your
clients are so slow or recompile the client with a bigger REP_TIMEOUT.
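
To make the arithmetic concrete, here is a small sketch in Python. It is
not Amanda code: the names are made up for illustration, the completion times
are copied from the sendsize excerpt above, and it assumes that an estimate
only reaches the server if it finishes before REP_TIMEOUT expires.

```python
# Illustrative only: these names are hypothetical, not Amanda identifiers.
REP_TIMEOUT = 6 * 60 * 60  # seconds; the value quoted from amandad-src/amandad.c

# (DLE, completion time in seconds) copied from the sendsize log excerpt.
estimates = [
    ("home_q", 9899.506),
    ("home_r", 13962.655),
    ("home_s", 30708.356),
]

def arrives_in_time(finished_at, rep_timeout=REP_TIMEOUT):
    """Assumption: an estimate is delivered only if it finishes before REP_TIMEOUT."""
    return finished_at < rep_timeout

# The budget the original poster expected: etimeout (900 s) times 36 DLEs.
etimeout_budget = 900 * 36

print(f"etimeout budget {etimeout_budget} s vs REP_TIMEOUT {REP_TIMEOUT} s")
for dle, finished_at in estimates:
    status = "delivered" if arrives_in_time(finished_at) else "lost"
    print(f"{dle}: finished at {finished_at:.0f} s -> {status}")
```

Under that assumption the 32400 s etimeout budget never comes into play,
because the 21600 s REP_TIMEOUT is reached first; home_s, which finishes at
about 30708 s, is the first estimate to miss the cutoff.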

I'm sure Dustin or Jean-Louis will chime in if I'm mistaken.

hth,
jf


>
> -- 
> Garnet Harris                           TEL: +301 515 7118
> Eka Systems                             FAX: +301 515 4965
> 20201 Century Blvd., Suite 250
> Germantown, MD   20874                  garnet.harris AT ekasystems DOT com

-- 
<° >< Jean-François Malouin          McConnell Brain Imaging Centre        
Systems/Network Administrator       Montréal Neurological Institute
3801 Rue Université, Suite WB219          Montréal, Québec, H3A 2B4
Phone: 514-398-8924                               Fax: 514-398-8948