Amanda-Users

Re: Estimate timeout error

2004-12-03 12:32:18
Subject: Re: Estimate timeout error
From: Nick Danger <nick AT hackermonkey DOT com>
To: amanda-users AT amanda DOT org
Date: Fri, 03 Dec 2004 11:48:50 -0500
There is a PIX between the two, but I'm backing up a bunch (10?) of Linux and Solaris servers in the same areas of the network to this same amanda server without any issues, so I don't believe it to be a firewall issue. There is no iptables running on either host (both Linux in this case).

In the amandad.XXX.debug log I have the following lines, which I'm assuming are the error report of the problem. Now the question is how to fix it :-)

-Nick


amandad: time 0.010: amandahosts security check passed
amandad: time 0.010: running service "/usr/lib/amanda/sendsize"
amandad: time 182.436: sending REP packet:
----
Amanda 2.4 REP HANDLE 005-40813308 SEQ 1102082216
OPTIONS features=fffffeff9ffe0f;
/ 0 SIZE 301197
/ 1 SIZE 100
/u00 0 SIZE 143930
/u00 1 SIZE 41411
/usr 0 SIZE 880958
/usr 1 SIZE 79
/usr/local 0 SIZE 174
/usr/local 1 SIZE 47
/var 0 SIZE 299300
/var 1 SIZE 2857
----

amandad: time 192.437: dgram_recv: timeout after 10 seconds
amandad: time 192.437: waiting for ack: timeout, retrying
amandad: time 202.439: dgram_recv: timeout after 10 seconds
amandad: time 202.439: waiting for ack: timeout, retrying
amandad: time 212.441: dgram_recv: timeout after 10 seconds
amandad: time 212.442: waiting for ack: timeout, retrying
amandad: time 222.444: dgram_recv: timeout after 10 seconds
amandad: time 222.444: waiting for ack: timeout, retrying
amandad: time 232.446: dgram_recv: timeout after 10 seconds
amandad: time 232.446: waiting for ack: timeout, giving up!
amandad: time 232.446: pid 21896 finish time Fri Dec  3 09:01:32 2004
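
For anyone comparing their own debug file against the excerpt above, here is a rough Python sketch that pulls out the REP packet body and counts the ack retries. It assumes the log layout shown here (a "sending REP packet" line, the packet between two "----" lines, then "waiting for ack" lines). The function name is made up for illustration, and the byte count is only an approximation of what goes on the wire.

```python
def summarize_amandad_log(text):
    """Return (packet_bytes, retries, gave_up) for the first REP packet in the log.

    Assumes the layout of the excerpt above; not an official amanda tool.
    """
    packet = []
    seen_rep = False      # a "sending REP packet" line has been seen
    in_packet = False     # currently between the two "----" delimiters
    retries = 0
    gave_up = False
    for line in text.splitlines():
        if "sending REP packet" in line:
            seen_rep = True
            continue
        if seen_rep and line.strip() == "----":
            if in_packet:
                seen_rep = False   # closing delimiter: stop collecting
            in_packet = not in_packet
            continue
        if in_packet:
            packet.append(line)
        elif "waiting for ack: timeout, retrying" in line:
            retries += 1
        elif "waiting for ack: timeout, giving up" in line:
            gave_up = True
    # Approximate size: packet lines joined with newlines, as printed in the log.
    packet_bytes = len(("\n".join(packet) + "\n").encode("ascii")) if packet else 0
    return packet_bytes, retries, gave_up
```

Feeding it the contents of the debug file (for example via open(path).read(), with the path being whatever your amandad debug file is called) gives a quick view of how large the reply was and whether amandad ever got its ACK.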


Paul Bijnens wrote:

Nick Danger wrote:


Nope - still a problem. The error is still as below:

FAILURE AND STRANGE DUMP SUMMARY:

 dominion.h /var lev 0 FAILED [Estimate timeout from dominion.xxx]
 dominion.h /usr/local lev 0 FAILED [Estimate timeout from dominion.xxx]
 dominion.h /usr lev 0 FAILED [Estimate timeout from dominion.xxx]
 dominion.h /u00 lev 0 FAILED [Estimate timeout from dominion.xxx]
 dominion.h / lev 0 FAILED [Estimate timeout from dominion.xxx]

I have the timeout in amanda.conf set to an ungodly high number of

etimeout -12000         # total number of seconds for estimates.

[...]

sendsize: debug 1 pid 26242 ruid 33 euid 33: start at Thu Dec 2 11:25:07 2004
sendsize: version 2.4.4p1

[...]

sendsize: time 172.473: pid 26242 finish time Thu Dec  2 11:27:59 2004


The estimate really takes only 173 seconds.  That means that etimeout
is plenty (better lower it again to normal values).

The problem seems to be in the reply packet.

I've already seen problems with a UDP-packet overflow, but that's
unlikely.  That problem happened with older versions where the UDP
size was only 8Kbyte or so. Currently it is 64K, but it could be
limited by the OS too, of course.  The reply packet is usually larger
than the request packet, because it contains 1 to 3 lines for each
DLE (level 0, current level, current plus 1).
In amandad.DATETIME.debug, you can find the request packet, and the
reply packet.
Any weird limitation on UDP packet size on one of the hosts (or
intermediate routers/firewalls)?
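
One quick way to probe for such a limit on a single host is to push datagrams of increasing size through the local UDP stack. The Python sketch below does that over loopback only, so it says nothing about the routers or firewalls in between; for those, the sender and receiver would have to run on the two hosts. The function name and the default timeout are assumptions made for illustration.

```python
import socket

def udp_roundtrip_ok(size, host="127.0.0.1", timeout=2.0):
    """Send one UDP datagram of `size` bytes to a local receiver.

    Returns True if the datagram arrives intact, False if the OS refuses
    to send it or nothing arrives within `timeout` seconds.
    """
    receiver = socket.socket(socket.AF_INET, socket.SOCK_DGRAM)
    sender = socket.socket(socket.AF_INET, socket.SOCK_DGRAM)
    try:
        receiver.bind((host, 0))        # port 0: let the OS pick a free port
        receiver.settimeout(timeout)
        payload = b"x" * size
        sender.sendto(payload, receiver.getsockname())
        data, _ = receiver.recvfrom(65535)
        return data == payload
    except OSError:
        # Covers "message too long" on send and the receive timeout.
        return False
    finally:
        sender.close()
        receiver.close()
```

Calling it with a few sizes (say 1000, 8000, 32000, 60000) shows roughly where the local stack stops accepting datagrams. Anything above the IPv4 UDP payload maximum of 65507 bytes is expected to fail on any host.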

Another problem could be the iptables modules for amanda, where a
bug has already been introduced twice.  I don't know the current
status of that bug.  If you don't need them, do not use the amanda
iptables modules.  Try "lsmod | grep amanda".  (Check intermediate
firewalls too!)

Maybe try a network traffic dump (with tcpdump or a similar program)
on client *and* server?