* Garnet Harris <garnet.harris AT ekasystems DOT com> [20091030 13:33]:
> I am experimenting with amanda and am having problems understanding how
> the "estimate timeout" works. What I am seeing is not what I expect
> based on the explaination in the amanda.conf page.
>
> Client and server are on the same machine.
>
> 36 - DLEs (some set to calcsize, most use tar for estimates)
>
> all DLEs are set to "always full" (to remove any confusion caused by
> incremental backups)
>
> etimeout = 900
>
> My understanding is that amanda will allow 900 seconds per DLE. So it
> should timeout in 32,400 seconds (900 x 36) or 9 hours. According to
> the report after amdump runs, the estimate phase is over 11 hours.
>
> Looking at the sendsize log, the last "estimate time for" the 28th DLE
> is at time 30,708. And, the time stamp on the sendsize file matches:
> approximately 8.5 hours after amanda started. (So far so good.)
> However, the first runtar log doesn't appear until another 1.5 hours
> later. Which means amanda didn't do anything for 1.5 hours. (The lag
> is greater when allowing incremental backups.)
>
> Looking at the planner log on the server side, there is a "dgram_recv"
> with a matching for each "estimate time" entry on the client side for
> the first 27 DLEs.
>
> client sendsize log:
>
> sendsize[8661]: time 9899.506: estimate time for home_q level 0: 12.033
> sendsize[8668]: time 13962.655: estimate time for home_r level 0: 4063.141
> sendsize[9335]: time 30708.356: estimate time for home_s level 0: 16745.578
>
>
> server planner log:
>
> time 9899.665: dgram_recv(dgram=0xb805c764, timeout=0, fromaddr=0xb806c750)
> time 9899.665: (sockaddr_in6 *)0xb806c750 = { 10, 10080,
> ::ffff:192.168.0.247 }
> time 13962.833: dgram_recv(dgram=0xb805c764, timeout=0, fromaddr=0xb806c750)
> time 13962.833: (sockaddr_in6 *)0xb806c750 = { 10, 10080,
> ::ffff:192.168.0.247 }
> time 21600.190: dgram_recv(dgram=0xb805c764, timeout=0, fromaddr=0xb806c750)
> time 21600.211: (sockaddr_in6 *)0xb806c750 = { 10, 10080,
> ::ffff:192.168.0.247 }
> time 40082.633: security_seterror(handle=0x80721f8, driver=0xb804a720
> (BSD) error=timeout waiting for REP)
> time 40082.665: security_close(handle=0x80721f8, driver=0xb804a720 (BSD))
> time 40082.719: pid 6922 finish time Sun Oct 25 19:08:03 2009
>
>
> Something is happening at 21600 (6 hours). The server recevies a dmesg
> from somewhere (there is no corresponding entry in the sendsize log) and
> stops looking for estimates from the client. Then waits another 5 hours
> before it starts the actual backup.
>
> Any idea what is happening at 21600?
been there seen that.
Look at REP_TIMEOUT = (6*60*60) set in amandad-src/amandad.c, ie 6hrs.
This is what you are seeing. You either have to figure out why your
clients are so slow or recompile the client with a bigger REP_TIMEOUT.
I'm sure Dustin or Jean-Louis will chime in if I'm mistaken.
hth,
jf
>
> --
> Garnet Harris TEL: +301 515 7118
> Eka Systems FAX: +301 515 4965
> 20201 Century Blvd., Suite 250
> Germantown, MD 20874 garnet.harris AT ekasystems DOT com
--
<° >< Jean-François Malouin McConnell Brain Imaging Centre
Systems/Network Administrator Montréal Neurological Institute
3801 Rue Université, Suite WB219 Montréal, Québec, H3A 2B4
Phone: 514-398-8924 Fax: 514-398-8948
|