Amanda-Users

Intermittent "FAILED [cannot read header: got 0 instead of 32768]" errors

2008-06-13 05:25:44
Subject: Intermittent "FAILED [cannot read header: got 0 instead of 32768]" errors
From: "Wallace, Michael" <Michael.Wallace AT ubht.nhs DOT uk>
To: <amanda-users AT amanda DOT org>
Date: Fri, 13 Jun 2008 09:43:15 +0100
Hello,

I have Amanda 2.5.1 running on a Mandriva 2008 server and 8 clients
which are running either Amanda 2.4 or Amanda 2.5.1.  Generally
everything is working okay but I am getting the following intermittent
error which occurs on one or more DLEs:

FAILURE AND STRANGE DUMP SUMMARY:
  HOSTNAME    /opt          lev 1  FAILED [cannot read header: got 0
instead of 32768]
  HOSTNAME    /opt          lev 1  FAILED [too many dumper retry:
"[request failed: timeout waiting for ACK]"]

The corresponding information in /tmp/amanda/sendbackup... on the client
is:

sendbackup: debug 1 pid 14475 ruid 203 euid 203: start at Sat May 31
00:13:12 2008
/usr/local/libexec/sendbackup: version 2.4.3
  parsed request as: program `DUMP'
                     disk `/opt'
                     device `/opt'
                     level 1
                     since 2008:5:29:23:29:11
                     options `|;auth=BSD;compress-fast;index;'
sendbackup: try_socksize: send buffer size is 65536
sendbackup: time 0.004: stream_server: waiting for connection:
0.0.0.0.40506
sendbackup: time 0.004: stream_server: waiting for connection:
0.0.0.0.40507
sendbackup: time 0.005: stream_server: waiting for connection:
0.0.0.0.40508
sendbackup: time 0.005: waiting for connect on 40506, then 40507, then
40508
sendbackup: time 30.015: stream_accept: timeout after 30 seconds
sendbackup: time 30.015: timeout on data port 40506
sendbackup: time 60.025: stream_accept: timeout after 30 seconds
sendbackup: time 60.025: timeout on mesg port 40507
sendbackup: time 90.035: stream_accept: timeout after 30 seconds
sendbackup: time 90.035: timeout on index port 40508
sendbackup: time 90.035: pid 14475 finish time Sat May 31 00:14:42 2008

Sometimes this happens for all disks on a single host and sometimes it
only happens on one or two.  There doesn't seem to be any correlation
between dump levels, disk, disk size or host, and I can't see any other
patterns in the failures.  These failures seem to occur on average 1 day
out of 3, but only one or two hosts will fail each time.  The problems
only started when we upgraded from Amanda version 2.5.0 to 2.5.1 on the
server as part of an upgrade to Mandriva 2008.

I've had a search through the mailing list archive and similar errors
seem to point to network issues.  Capturing the packets between the
server and one of the hosts gave the following packet trace:

Amanda 2.5 REQ HANDLE 000-0000000b SEQ 1213076860
SECURITY USER amanda
SERVICE sendbackup
OPTIONS features=fffffeff9ffeffffff7f;hostname=HOSTNAME;
DUMP /opt  1 2008:5:29:23:29:11 OPTIONS |;auth=BSD;compress-fast;index;
Amanda 2.4 ACK HANDLE 000-0000000b SEQ 1213076860
Amanda 2.4 REP HANDLE 000-0000000b SEQ 1213076860
CONNECT DATA 40506 MESG 40507 INDEX 40508
OPTIONS features=fffffeff9f00;
Amanda 2.5 REQ HANDLE 000-0000000b SEQ 1213076860
SECURITY USER amanda
SERVICE sendbackup
OPTIONS features=fffffeff9ffeffffff7f;hostname=HOSTNAME;
DUMP /opt  1 2008:5:29:23:29:11 OPTIONS |;auth=BSD;compress-fast;index;
Amanda 2.4 REP HANDLE 000-0000000b SEQ 1213076860
CONNECT DATA 40506 MESG 40507 INDEX 40508
OPTIONS features=fffffeff9f00;
Amanda 2.5 REQ HANDLE 000-0000000b SEQ 1213076860
SECURITY USER amanda
SERVICE sendbackup
OPTIONS features=fffffeff9ffeffffff7f;hostname=HOSTNAME;
DUMP /opt  1 2008:5:29:23:29:11 OPTIONS |;auth=BSD;compress-fast;index;
...

For comparison, a packet trace for a successful connection between the
same client and server looked like this:

Amanda 2.5 REQ HANDLE 000-00000000 SEQ 1212388333
SECURITY USER amanda
SERVICE sendbackup
OPTIONS features=fffffeff9ffeffffff7f;hostname=HOSTNAME;
DUMP /export/home1  0 1970:1:1:0:0:0 OPTIONS
|;auth=BSD;srvcomp-fast;index;
Amanda 2.4 ACK HANDLE 000-00000000 SEQ 1212388333
Amanda 2.4 REP HANDLE 000-00000000 SEQ 1212388333
CONNECT DATA 40403 MESG 40404 INDEX 40405
OPTIONS features=fffffeff9f00;
Amanda 2.5 ACK HANDLE 000-00000000 SEQ 1212388333

So it looks like the client is responding but the server is not
receiving the packet.  The packet capture software (Wireshark) is
running on the same machine as the Amanda server so the packet is
definitely being received by the server, but for some reason the Amanda
process doesn't receive it and it sends out another request - this
repeats until it eventually gives up.

Does anyone have any suggestions as to what might be going on here?
We've been trying to work out what the problem could be for some time
now so any ideas would be very welcome.
 
Regards,

Michael

<Prev in Thread] Current Thread [Next in Thread>