RE: dumper issue - timeout problem?

Thank you, Paul. You're absolutely right.
I've configured the Solaris TCP stack to decrease the keep-alive interval 
("ndd -set /dev/tcp tcp_time_wait_interval 240000") and everything runs fine 
now.

Regards

-----Original Message-----
From: Paul Bijnens [mailto:paul.bijnens AT xplanation DOT com]
Sent: Tuesday, March 21, 2006 19:58
To: Edson Noboru Yamada; Mailing List Amanda User
Subject: Re: dumper issue - timeout problem?


Edson Noboru Yamada wrote:
> 
> I've been facing a problem when trying to back up one of our clients.
> The backup starts normally, but after some time the following messages
> show up in the taper log:
> 
> 
> dumper: stream_client: our side is 0.0.0.0.45740
> driver: result time 553.824 from dumper0: FAILED 01-00002 [mesg read: Connection reset by peer]
> dumper: kill index command
> taper: reader-side: got label DMX224 filenum 1

Note: it is the "mesg" channel that was closed by the peer.
Probably because it was idle for too long.


> 
> 
> On the client side, I can read something like this on the sendbackup log:
> 
> sendbackup-gnutar: time 0.248: /usr/local/libexec/runtar: pid 15147
> sendbackup: time 0.309: started index creator: "/usr/bin/tar -tf - 2>/dev/null | sed -e 's/^\.//'"
> sendbackup: time 301.700: index tee cannot write [Broken pipe]
> sendbackup: time 301.700: pid 15145 finish time Tue Mar 21 15:39:18 2006
> sendbackup: time 301.712: 124: strange(?): sendbackup: index tee cannot write [Broken pipe]

The index channel was closed by the server after the mesg channel broke down.
Because the client did not yet need to send anything through the mesg
channel, it did not notice that.  But then it tried to write to the index
channel, which the server had already closed.


> 
> 
> I've already tried turning off indexing and the holding disk, without success.
> 
> One important thing I've noticed is that the error always occurs after 300
> seconds.
> Is there some tunable timeout I'm forgetting?
> 
> Additional info: strangely, the backup appears successful, even when this
> message shows up.
> The same client is able to back up other file systems, and the one that
> fails the most is the / filesystem.
> 
> Any ideas?

Is it the problem described here:

http://wiki.zmanda.com/index.php/Amdump_fails_to_backup_large_DLEs

      Send TCP keepalive probes sooner (on Linux, this sets the idle time
      before the first probe to 90 seconds):

   echo 90 > /proc/sys/net/ipv4/tcp_keepalive_time
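To check whether a host's current setting is short enough, something like the following sketch could be used. It assumes that whatever resets the idle connection does so after roughly 300 seconds, as observed in this thread; the function name check_keepalive and its two arguments are hypothetical, not part of Amanda or the kernel.

```shell
#!/bin/sh
# Hypothetical helper (not part of Amanda): compare the Linux keepalive idle
# time with an assumed idle limit after which the connection gets reset.
#   $1 = file holding the keepalive time in seconds
#        (default: /proc/sys/net/ipv4/tcp_keepalive_time)
#   $2 = assumed idle limit in seconds (default: 300, as seen in this thread)
check_keepalive() {
    file="${1:-/proc/sys/net/ipv4/tcp_keepalive_time}"
    limit="${2:-300}"

    # Read the current value; give up with status 2 if the file is unreadable.
    current=$(cat "$file") || return 2

    if [ "$current" -lt "$limit" ]; then
        echo "ok: first probe after ${current}s, before the ${limit}s idle limit"
    else
        echo "too long: first probe after ${current}s, idle limit is ${limit}s"
        return 1
    fi
}
```

Calling check_keepalive with no arguments reads the live Linux value. On Solaris the comparable tunable is tcp_keepalive_interval, which is read and set with ndd and is expressed in milliseconds rather than seconds.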


-- 
Paul Bijnens, Xplanation                            Tel  +32 16 397.511
Technologielaan 21 bus 2, B-3001 Leuven, BELGIUM    Fax  +32 16 397.512
http://www.xplanation.com/          email:  Paul.Bijnens AT xplanation DOT com
***********************************************************************
* I think I've got the hang of it now:  exit, ^D, ^C, ^\, ^Z, ^Q, F6, *
* quit,  ZZ, :q, :q!,  M-Z, ^X^C,  logoff, logout, close, bye,  /bye, *
* stop, end, F3, ~., ^]c, +++ ATH, disconnect, halt,  abort,  hangup, *
* PF4, F20, ^X^X, :D::D, KJOB, F14-f-e, F8-e,  kill -1 $$,  shutdown, *
* kill -9 1,  Alt-F4,  Ctrl-Alt-Del,  AltGr-NumLock,  Stop-A,  ...    *
* ...  "Are you sure?"  ...   YES   ...   Phew ...   I'm out          *
***********************************************************************