Re: data timeout error

Guy Dallaire wrote:

One of my DLEs fails to back up correctly from time to time. It happened
twice this week. This is a bit annoying because it changes every day
(it's an Oracle cold/hot backup destination for database files) and I
really need to get this stuff backed up every day. This is the only
DLE exhibiting this behavior.

I have to use another tape in the morning and re-backup this DLE
(which normally works fine).

It's not really huge.
The error I have in the amanda report is:

FAILURE AND STRANGE DUMP SUMMARY:

sol /disk1/RDBMS_BACKUP lev 0 FAILED [data timeout]

I'm using Amanda 2.4.5. The tape server is CentOS 4.2 (RHEL 4 clone)
and the client with the failing DLE is a Sun Solaris 9 box.

Is there a timeout I can configure for this? What might cause this?
The client is in the DMZ, but I have FW1 rules allowing for the backup
UDP and TCP ports.



You should also take a look in the amanda debug files on the client.

One thing that came up recently again, and could apply here, is
the timeout of the MESG connection.

When the backup starts, there are actually 2 or 3 tcp connections
between the client and the server.
One is the DATA connection, and I doubt that it is that one that
is timing out.
A second TCP connection is the MESG connection, carrying the stderr
output of the backup command.  If there are no errors, the only thing
that is sent over that connection is the summary at the end with
the total number of bytes transferred.
When you have indexing enabled, there is a third connection, INDEX,
carrying the table of contents of the backup.

As I already mentioned, the MESG connection does not carry much traffic,
and if you have a few very large files, the INDEX connection has
the same problem.  A firewall could easily time out these connections
when they carry no traffic.

A solution for this is to increase the TCP keepalive frequency, which
artificially generates traffic on the connection.
For Linux with a /proc pseudo filesystem, you can do:

   echo 900 >/proc/sys/net/ipv4/tcp_keepalive_time

This sets the TCP keepalive time to 900 seconds (instead of the default
7200 seconds). Other OSes probably have commands to accomplish the same.
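
As a rough sketch of what that could look like on both ends: the script
below shows the current Linux value and only prints the commands that
would change it, since changing it needs root. The sysctl form is
equivalent to the echo into /proc; neither survives a reboot unless the
setting is also put in /etc/sysctl.conf. For the Solaris client, the ndd
parameter tcp_keepalive_interval (in milliseconds) should be the
counterpart as far as I know, but please check that against the
documentation of your release. KEEPALIVE_SECS and the helper function
names are only illustrative.

```shell
#!/bin/sh
# Sketch: report the current TCP keepalive time and print (not run) the
# commands that would lower it. Assumes a POSIX shell.
KEEPALIVE_SECS=900                      # illustrative value from the text

# Linux exposes the keepalive time, in seconds, under /proc.
keepalive_file=/proc/sys/net/ipv4/tcp_keepalive_time

current_keepalive() {
    if [ -r "$keepalive_file" ]; then
        cat "$keepalive_file"
    else
        echo "unknown"
    fi
}

# Assumption: Solaris ndd takes this value in milliseconds.
secs_to_ms() {
    echo $(( $1 * 1000 ))
}

echo "current keepalive time: $(current_keepalive)"
echo "Linux:   sysctl -w net.ipv4.tcp_keepalive_time=$KEEPALIVE_SECS"
echo "Solaris: ndd -set /dev/tcp tcp_keepalive_interval $(secs_to_ms "$KEEPALIVE_SECS")"
```

Run the printed command as root on the machine in question, and keep
the value below the idle timeout configured on the firewall.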


It could also be that FW1 recognizes keepalive packets, and does not
count those as traffic either...

The fact that this does not always happen may be because the
backup runs just around the limit.  Heavy load on the Amanda
server pushes it over, and when you run it again in the morning, it is
the only backup, so it is much faster.

Certainly take a look at the /tmp/amanda/*.debug files too.
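
A quick way to skim those files, as a sketch only: the function below
lists the debug files newest first and prints the lines that look
suspicious. The search patterns are my guess at what is worth looking
for, not an exhaustive list of Amanda messages, and scan_debug_files
and DEBUG_DIR are made-up names. It assumes a POSIX shell and a grep
that understands -E.

```shell
#!/bin/sh
# Sketch: show suspicious lines from Amanda client debug files.
DEBUG_DIR=${DEBUG_DIR:-/tmp/amanda}     # path as given in the text

scan_debug_files() {
    dir=$1
    # ls -t sorts by modification time, newest first.
    for f in $(ls -t "$dir"/*.debug 2>/dev/null); do
        # -i: ignore case, -n: prefix matches with line numbers
        if grep -i -n -E 'timeout|timed out|error' "$f" >/dev/null; then
            echo "== $f"
            grep -i -n -E 'timeout|timed out|error' "$f"
        fi
    done
}

scan_debug_files "$DEBUG_DIR"
```

Compare the timestamps in the files from a failed night with those
from a good night; that should show how long the connections sat idle.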

--
Paul Bijnens, Xplanation                            Tel  +32 16 397.511
Technologielaan 21 bus 2, B-3001 Leuven, BELGIUM    Fax  +32 16 397.512
http://www.xplanation.com/          email:  Paul.Bijnens AT xplanation DOT com
***********************************************************************
* I think I've got the hang of it now:  exit, ^D, ^C, ^\, ^Z, ^Q, F6, *
* quit,  ZZ, :q, :q!,  M-Z, ^X^C,  logoff, logout, close, bye,  /bye, *
* stop, end, F3, ~., ^]c, +++ ATH, disconnect, halt,  abort,  hangup, *
* PF4, F20, ^X^X, :D::D, KJOB, F14-f-e, F8-e,  kill -1 $$,  shutdown, *
* kill -9 1,  Alt-F4,  Ctrl-Alt-Del,  AltGr-NumLock,  Stop-A,  ...    *
* ...  "Are you sure?"  ...   YES   ...   Phew ...   I'm out          *
***********************************************************************