Re: BUG (was: Re: Handitarded....odd (partial) estimate timeout errors.)

--On January 5, 2006 4:49:53 PM +0100 Paul Bijnens<paul.bijnens AT xplanation DOT com> wrote:

Michael Loftis wrote:



Paul asked for the logs, it seems like there's an amanda bug.  The units


Yes, indeed, there is a bug in Amanda!
You have 236 DLE's for that host, and from my reading of the code
the REQuest UDP packet is limited to 32K instead of 64K (see planner.c
lines 1377-1383)  (Need to update the documentation!)


Woot, I'm NOT crazy! :D

...did I just say woot?  My apologies.

It seems that the planner splits up the REQuest packet into separate
UDP-packets when exceeding MAX_DGRAM/2, i.e. 32K.
Your first request was 32580 bytes.  Adding the next string to that
request would have exceeded the 32768 limit.
The reason for the division by 2 seems to be to reserve space for error replies
on each of those.
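
To make the splitting concrete, here is a rough Python sketch of the greedy packing described above. It is not the planner.c code: the function name, the 200-byte line length and the exact comparison (strictly greater than the limit) are assumptions for illustration only.

```python
MAX_DGRAM = 64 * 1024        # 64K datagram ceiling mentioned above
REQ_LIMIT = MAX_DGRAM // 2   # 32768, the effective split threshold

def split_requests(lines, limit=REQ_LIMIT):
    """Greedily pack request lines into packets of at most `limit` bytes.

    Hypothetical helper: a new packet is started whenever adding the
    next line would push the current one past the limit.
    """
    packets, current, size = [], [], 0
    for line in lines:
        n = len(line.encode())
        if current and size + n > limit:
            packets.append(current)
            current, size = [], 0
        current.append(line)
        size += n
    if current:
        packets.append(current)
    return packets

# 236 made-up request lines of 200 bytes each (real lines vary in length).
lines = ["x" * 199 + "\n" for _ in range(236)]
packets = split_requests(lines)
print([len(p) for p in packets])
```

With these invented line lengths the host's DLEs end up in two REQuest packets, which is the situation the client cannot handle.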

I knew it was size related, but my packets were significantly smaller than MAX_DGRAM. This definitely explains it.

However, the amandad client only expects one and only one REQuest packet.
Any other REQuest packet coming from the same connection (5-tuple:
protocol, remotehost, remoteport, localhost, localport) and having
a type "REQ" is considered a duplicate.
It should actually test for the handle and sequence to be identical
too. It does not.
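
The difference between the two checks can be sketched in Python as follows. This only models the behaviour described above, it is not the amandad source: the `Packet` fields, the function names and the sample host names, handles and sequence numbers are all made up for illustration.

```python
from typing import NamedTuple

class Packet(NamedTuple):
    # Connection 5-tuple
    proto: str
    remote_host: str
    remote_port: int
    local_host: str
    local_port: int
    # Protocol-level fields
    ptype: str
    handle: str
    sequence: int

def conn_key(p):
    """5-tuple plus packet type: all that the described check compares."""
    return (p.proto, p.remote_host, p.remote_port,
            p.local_host, p.local_port, p.ptype)

def is_duplicate_current(seen, pkt):
    """Described behaviour: same connection and type means duplicate."""
    return any(conn_key(pkt) == conn_key(p) for p in seen)

def is_duplicate_strict(seen, pkt):
    """Suggested behaviour: handle and sequence must match as well."""
    return any(conn_key(pkt) == conn_key(p)
               and pkt.handle == p.handle
               and pkt.sequence == p.sequence
               for p in seen)

# Two distinct REQ packets from the same server socket (invented values).
first = Packet("udp", "server", 800, "client", 10080, "REQ", "000-A", 1)
second = Packet("udp", "server", 800, "client", 10080, "REQ", "000-B", 2)
seen = [first]
print(is_duplicate_current(seen, second), is_duplicate_strict(seen, second))
```

Under the current check the second REQ is thrown away as a duplicate; with handle and sequence compared it would be recognised as a new request.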

It's not fixed quickly either:  when receiving the first "REQ" packet,
the amandad client forks and execs the request program (sendsize in
this case) and reads the results from a pipe.

By the time the second, non-identical request comes in (with different
handle, sequence -- which is currently not checked), sendsize is already
started and cannot be given additional DLE's to estimate.


As a temporary workaround, you could shorten the exclude-list string for
that host by creating a symlink:

    ln -s /etc/amanda/exclude.gtar /.excl

Yeah...This will help for a time. Hopefully long enough for a patch to fix amandad. I'll have to create a separate type for this server, since we've got well over a hundred now and they all share that main backup type. I figured shortening the UDP packets somehow would help, I knew it was just odd that it wasn't quite right and I seemed to be running into the problem way too early :)

and use that as exclude-list: this shortens each line by 20 bytes, which
would shrink the packet to fit again. (236 DLE's * 20  = 4720 bytes
less in a REQuest UDP for that host!)
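
For reference, the same arithmetic in Python, using only the figures quoted in this thread. The 20-byte saving per line is taken as stated rather than re-measured, and the total request size for the host is not given, so this does not prove the shortened request fits in one packet.

```python
DLE_COUNT = 236            # DLEs for the host, as stated above
BYTES_SAVED_PER_DLE = 20   # per-line saving quoted above (assumed, not re-measured)
REQ_LIMIT = 32768          # MAX_DGRAM/2 split threshold
FIRST_REQUEST = 32580      # size of the first request packet quoted earlier

saved = DLE_COUNT * BYTES_SAVED_PER_DLE
headroom = REQ_LIMIT - FIRST_REQUEST   # bytes left unused in the first packet

print(saved, headroom)
```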

Anyway....I'm getting a headache thinking about it :)  all my other DLEs
seem ok for that host, and the ones that it misses are not always
exactly the same, but all seem to be non-calcsize estimated.


Just bad luck for those entries that happen to go in the end of the
queue.  On the other hand, when really unlucky, you could have up to
three estimates for each DLE, overflowing even the 4K we saved by
shrinking the exclude string...

Like I said, hopefully by then either the hackers (or myself) will have put together a patch. ... I see three ways to fix this...one of which I don't know will fix, what about turning wait=yes to wait=no in my xinetd.conf? Not sure what that would break. The other two involve code...multiple sendsize's, *or* a protocol change to wait for a 'final start' packet, or an amandad change to wait a few extra seconds before starting the actual sendsize, coalescing the results.

And you're right, the other ways aren't easy...one involves possibly breaking the protocol too.