2.4.2p2 client, 2.4.4p3 server: timeout from amandad...

Hello everyone,

I'm in the process of migrating a full backup configuration from one machine 
to another. The problem with the old one is threefold:

- it's a mess, which happens to be our main internal LAN server,
- it's obsolete software-wise,
- the DAT changer attached to it now has insufficient capacity.

It is such a mess that I cannot afford to install a newer Amanda version on 
it just to see whether that would cure the problem. Some dumps are left on 
the holding disk, but I can copy those around.

So, the new machine has a brand new DAT72x6 HP changer (DAT24x6 on the old 
one), I've copied the old Amanda configuration to the new one and adapted 
some settings (tapetype, in essence). Now the configuration looks like this:

--------------------
org "One2team"          
mailto "[email protected]"
dumpuser "amanda"       
inparallel 4            
netusage  5000 Kbps     
dumpcycle 0 days        
runspercycle 1 days     
tapecycle 2 tapes       
bumpsize 20 Mb          
bumpdays 1              
bumpmult 4              
etimeout 1200           
runtapes 1              
tpchanger "/usr/lib/amanda/chg-zd-mtx"
tapedev "/dev/nst0"            
changerfile "changer.conf"
changerdev "/dev/sg1"
tapetype HP-DAT72              
labelstr "^full-[0-9][0-9]*$"  
holdingdisk hd1 {
    comment "main holding disk"
    directory "/var/lib/amanda/full/dumps"
    use -1024 Mb       
    }
reserve 30 
infofile "/var/lib/amanda/full/info" 
logdir   "/var/lib/amanda/full/logs" 
indexdir "/var/lib/amanda/full/index"
define tapetype HP-DAT72 {
    comment "Produced by tapetype prog (hardware compression off)"
    length 37511 mbytes
    filemark 625 kbytes
    speed 1758 kps
}
----

The dump types are defined like this (relevant settings only AFAICS):

----
define dumptype global {
    comment "Global definitions"
    exclude "./tmp"
}
define dumptype root-tar {
    global
    program "GNUTAR"
    comment "root partitions dumped with tar"
    compress none
    index
    exclude list "/etc/amanda/exclude.gtar"
    priority low
}
define dumptype comp-root-tar {
    root-tar
    comment "Root partitions with compression"
    compress server fast
}
--------------------


The list of filesystems represents 24 GB total (compressed with gzip). The 
problem is this: everything works fine as long as I leave out the two largest 
directories (8.4 GB and 10 GB uncompressed on disk, respectively), and it 
fails as soon as I include either of them, because _amandad_, not amdump, 
times out. I get this in the amandad logfile:

--------------------
amandad: debug 1 pid 6636 ruid 33 euid 33 start time Wed May  3 12:15:04 2006
amandad: version 2.4.2p2
amandad: build: VERSION="Amanda-2.4.2p2"
[blah blah]
Amanda 2.4 REQ HANDLE 000-9006B109 SEQ 1146651283
SECURITY USER amanda
SERVICE sendsize
OPTIONS features=fffffeff9ffe0f;maxdumps=1;hostname=crios.olympe.o2t;
GNUTAR /usr/local 0 1970:1:1:0:0:0 -1 exclude-file=./tmp
[more blah]
sending ack:
----
Amanda 2.4 ACK HANDLE 000-9006B109 SEQ 1146651283
----

bsd security: remote host circe.olympe.o2t user amanda local user amanda
amandahosts security check passed
amandad: running service "/usr/lib/amanda/sendsize"
amandad: sending REP packet:
----
Amanda 2.4 REP HANDLE 000-9006B109 SEQ 1146651283
OPTIONS maxdumps=1;
/etc 0 SIZE 5990
/var/named 0 SIZE 40
[goes on and reports the rest]
----

amandad: dgram_recv: timeout after 10 seconds
amandad: waiting for ack: timeout, retrying
amandad: dgram_recv: timeout after 10 seconds
amandad: waiting for ack: timeout, retrying
amandad: dgram_recv: timeout after 10 seconds
amandad: waiting for ack: timeout, retrying
amandad: dgram_recv: timeout after 10 seconds
amandad: waiting for ack: timeout, retrying
amandad: dgram_recv: timeout after 10 seconds
amandad: waiting for ack: timeout, giving up!
amandad: pid 6636 finish time Wed May  3 12:20:06 2006
--------------------
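
To put numbers on the log above, here is a minimal Python sketch (my own 
throwaway helper, not part of Amanda; the names parse_stamp and summarize are 
made up for this post). It only reads the start/finish timestamps and the 
"timeout after N seconds" lines quoted above and works out how the run splits 
up:

```python
import re
from datetime import datetime

# Only the timing-related lines of the amandad debug log quoted above.
LOG = """\
amandad: debug 1 pid 6636 ruid 33 euid 33 start time Wed May  3 12:15:04 2006
amandad: dgram_recv: timeout after 10 seconds
amandad: waiting for ack: timeout, retrying
amandad: dgram_recv: timeout after 10 seconds
amandad: waiting for ack: timeout, retrying
amandad: dgram_recv: timeout after 10 seconds
amandad: waiting for ack: timeout, retrying
amandad: dgram_recv: timeout after 10 seconds
amandad: waiting for ack: timeout, retrying
amandad: dgram_recv: timeout after 10 seconds
amandad: waiting for ack: timeout, giving up!
amandad: pid 6636 finish time Wed May  3 12:20:06 2006
"""

STAMP_FORMAT = "%a %b %d %H:%M:%S %Y"  # e.g. "Wed May 3 12:15:04 2006"


def parse_stamp(log, keyword):
    """Return the datetime that follows '<keyword> time' in the log."""
    match = re.search(keyword + r" time (.+)", log)
    # Collapse the double space that pads single-digit days ("May  3").
    return datetime.strptime(" ".join(match.group(1).split()), STAMP_FORMAT)


def summarize(log):
    """Return (total run, time waiting for the ACK, remainder) in seconds."""
    elapsed = parse_stamp(log, "finish") - parse_stamp(log, "start")
    total = int(elapsed.total_seconds())
    waits = re.findall(r"dgram_recv: timeout after (\d+) seconds", log)
    ack_wait = sum(int(seconds) for seconds in waits)
    return total, ack_wait, total - ack_wait


total, ack_wait, remainder = summarize(LOG)
print(f"amandad ran for {total} s in total")
print(f"{ack_wait} s of that were spent waiting for the server's ACK")
print(f"{remainder} s had already passed when the first ACK wait began")
```

With the log above this gives 302 s, 50 s and 252 s: the REP packet went out 
a little over four minutes after amandad started, and the last 50 s are the 
five unanswered 10-second waits. That those 252 s are mostly sendsize running 
is my assumption; the parts of the log I elided would have to confirm it.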

Reproducible at will: amandad always times out after 5 minutes. Meanwhile, 
amdump stays there waiting for... well, I don't know what, frankly, but I 
have to interrupt it with Ctrl-C and run amcleanup afterwards.

What I've already done is increase the etimeout parameter on the server side: 
I set it to 1200 instead of the default value, 300. But that didn't help. Out 
of despair I even tried changing this value in the old server's config files, 
in case amandad would try and read them :p But no.

It should also be noted that the client machine is such a mess that my 
predecessor of a sysadmin created 6 aliases for interface eth0... I had to 
bind amandad specifically to the address I wanted so that dumps could work in 
the first place. But I don't see this having an influence here, since smaller 
backups work perfectly...

I'd appreciate any hint on this one!

Thanks,
-- 
Francis Galiegue, fg AT one2team DOT com
One2team - 12bis rue de la Pierre Levée, 75011 PARIS - 0143381980