Re: client was working, now suddenly is getting self check "host down?"

On Monday, 02.06.2003 at 10:43 +0100, Martin Hepworth wrote:

> >>>Any ideas of why a client would work for a while then randomly not
> >>>be able to do a selfcheck? The other amanda client is still working
> >>>great...
> >>
> >>I have a random problem like this as well running RH Linux.  The
> >>client occasionally fails amcheck in the afternoon. (Backups run at
> >>night.)  When I look at portland, the client, I find the selfcheck
> >>task "stuck" and I am unable to kill it, even with kill -9.  See if
> >>you have the same problem.  On the client, try
> >>
> >>ps -ef | grep amand
> >>
> >>or grep with whatever your amanda user account is.
> >>
> >>If you see selfcheck running, you'll be unable to get amcheck on the
> >>server to finish until it's gone.  Just something to check.
> >
> >
> >Interesting to see this problem reported - I've had this happen
> >sporadically too.  The 'host down' error relates to the localhost and
> >it leaves 'selfcheck' and 'amandad' running in the background.  The
> >server is RH Linux 7.3, running AMANDA 2.4.2p2.
> >
> >However, killing those processes does not make everything better.
> >The problem seems unrelated to the AMANDA configuration.  The last
> >time it happened here, we were fortunate enough to have a
> >'maintenance window' and rebooted the server and after that amcheck
> >ran without complaint.  However, given that this is a production
> >server, rebooting is not a good solution.
> 
> Do you use 'localhost' or 'hostname' in the disk list?
> 
> I prefer to use 'hostname' because if you move the amanda server to
> 'someotherhostname', all the tapes etc. still reflect the correct
> hosts!

We use 'localhost' ... :-)

> What do the debug logs in /tmp/amanda say when this happens? Also,
> does anything in /var/log/messages indicate anything odd at this
> time?

Nothing obviously helpful - the only differences between a working and
a non-working copy of the selfcheck and amcheck debugs are the
timestamps and process IDs.  The amandad debugs show "amandad:
dgram_recv: timeout after 10 seconds" a few times, followed a little
later by an "amandad: waiting for ack: timeout, giving up!" message.
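
For anyone wanting to look for the same messages, here is a rough
sketch of how the debug files could be searched.  The function name
amandad_timeouts is made up for illustration, and the file name
pattern amandad*debug is an assumption about how the debug files are
named under /tmp/amanda; adjust it to match what is actually there.

```shell
# Hypothetical helper: print the names of amandad debug files that
# contain either of the two timeout messages quoted above.
# Assumption: debug files are named amandad*debug in the given directory.
amandad_timeouts() {
    dir="${1:-/tmp/amanda}"
    grep -l -E 'dgram_recv: timeout|waiting for ack: timeout' \
        "$dir"/amandad*debug 2>/dev/null
}
```

Running "amandad_timeouts" with no argument searches /tmp/amanda; the
timestamps of the files it lists can then be compared with
/var/log/messages.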

As I said this is a Problem That Goes Away By Itself.  I was wondering
if there was some sort of DNS thing going on, but I couldn't get
anywhere with that ...
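
One simple thing that can be checked on the name side is which
addresses the hosts file gives for 'localhost'.  This is only a sketch:
the function name host_entries is made up, and it reads the hosts file
only, so it says nothing about what DNS or any other lookup source
would return.

```shell
# Hypothetical helper: print every address that a hosts-format file
# lists for the given name.  Comments (full-line or trailing) are
# stripped before matching.
# Usage: host_entries NAME [FILE]   (FILE defaults to /etc/hosts)
host_entries() {
    awk -v name="$1" '
        { sub(/#.*/, "") }                  # drop comments
        { for (i = 2; i <= NF; i++)         # fields after the address
              if ($i == name) print $1 }
    ' "${2:-/etc/hosts}"
}
```

For example, "host_entries localhost" shows whether 'localhost' is
listed against more than one address.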

Dave.
-- 
Dave Ewart
Dave.Ewart AT cancer.org DOT uk
Computing Manager, Epidemiology Unit, Oxford
Cancer Research UK
PGP: CC70 1883 BD92 E665 B840 118B 6E94 2CFD 694D E370