Amanda-Users

Re: Amanda Runs Hanging

2009-01-15 12:09:01
Subject: Re: Amanda Runs Hanging
From: Jim Summers <jsummers AT bachman.cs.ou DOT edu>
To: amanda-users <amanda-users AT amanda DOT org>
Date: Thu, 15 Jan 2009 10:59:30 -0600
Jim Summers wrote:

Hello All,

I started having my nightly backups hang. The backups have been working fine until Thursday of last week.

The server is amanda-2.5.2p1 on rhel4 and the client that I believe is involved is amanda-2.6.0p2 on centos5. There are other clients too, all running amanda-2.5.2p1 on rhel4 or fc8.

The run never seems to complete the estimates or come out of the planner phase, so it never actually dumps or backs up anything. The next day, when the scheduled amcheck runs, I get an email saying that amdump or amflush is still running and that I should run amcleanup if needed.

Running amcleanup returns "Results Missing" for all of the DLEs on each of the hosts.

So I am not sure where the issue is. I increased etimeout from 300 to 3600 and the results seem to be the same. I also recompiled the client and server to use only IPv4. Still the same results.

I found the following error in the amandad.xxxxx.debug file on the client:

 >>>>>
1231933529.123236: amandad: dgram_send_addr(addr=0x5b524d0, dgram=0x2af76db07d08)
1231933529.123257: amandad: (sockaddr_in *)0x5b524d0 = { 2, 889, 129.15.11.211 }
1231933529.123274: amandad: dgram_send_addr: 0x2af76db07d08->socket = 0
1231933835.099966: amandad: /usr/local/libexec/amanda/sendsize timed out waiting for REP data
1231933835.100027: amandad: sending NAK pkt:
<<<<<
ERROR timeout on reply pipe
 >>>>>
1231933835.100056: amandad: dgram_send_addr(addr=0x5b524d0, dgram=0x2af76db07d08)
1231933835.100076: amandad: (sockaddr_in *)0x5b524d0 = { 2, 889, 129.15.11.211 }
1231933835.100093: amandad: dgram_send_addr: 0x2af76db07d08->socket = 0
1231933835.100219: amandad: security_close(handle=0x5b52490, driver=0x2af76dafe3e0 (BSD))
1231933864.098431: amandad: pid 19716 finish time Wed Jan 14 05:51:04 2009

==================
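
For anyone reading a log like this, the timestamps at the start of each amandad line appear to be Unix epoch seconds, so the length of the silence before the timeout can be computed directly. The following is only a rough sketch: the regular expression and the helper names (parse, largest_gap) are my own assumptions, not part of Amanda, and the sample lines are copied from the excerpt above.

```python
import re
from datetime import datetime, timezone

# Sample lines copied from the amandad debug excerpt in this message.
LOG = """\
1231933529.123274: amandad: dgram_send_addr: 0x2af76db07d08->socket = 0
1231933835.099966: amandad: /usr/local/libexec/amanda/sendsize timed out waiting for REP data
1231933835.100027: amandad: sending NAK pkt:
"""

# Assumed line layout: "<epoch seconds>: amandad: <message>".
ENTRY = re.compile(r"^(\d+\.\d+): amandad: (.*)$")


def parse(text):
    """Return (epoch_seconds, message) pairs for lines matching ENTRY."""
    entries = []
    for line in text.splitlines():
        match = ENTRY.match(line.strip())
        if match:
            entries.append((float(match.group(1)), match.group(2)))
    return entries


def largest_gap(entries):
    """Return (gap_seconds, earlier_entry, later_entry) for the longest silence."""
    pairs = zip(entries, entries[1:])
    return max(((b[0] - a[0], a, b) for a, b in pairs), key=lambda t: t[0])


entries = parse(LOG)
gap, before, after = largest_gap(entries)
when = datetime.fromtimestamp(after[0], tz=timezone.utc)
print(f"{gap:.1f} s of silence before: {after[1]}")
print("logged at (UTC):", when.strftime("%Y-%m-%d %H:%M:%S"))
```

On these lines the longest gap is roughly 306 seconds, ending at the "timed out waiting for REP data" entry.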

and then at the end of the planner.xxxxxx.debug file on the tape server I see:

planner: time 11076.021: (sockaddr_in *)0x627530 = { 2, 10080, 129.15.11.173 }
planner: time 11092.842: dgram_recv(dgram=0x617544, timeout=0, fromaddr=0x627530)
planner: time 11092.842: (sockaddr_in *)0x627530 = { 2, 10080, 129.15.11.173 }
planner: time 21324.684: dgram_recv(dgram=0x617544, timeout=0, fromaddr=0x627530)
planner: time 21324.705: (sockaddr_in *)0x627530 = { 2, 10080, 129.15.11.173 }
planner: time 21630.663: dgram_recv(dgram=0x617544, timeout=0, fromaddr=0x627530)
planner: time 21630.663: (sockaddr_in *)0x627530 = { 2, 10080, 129.15.11.173 }

==================

The odd thing is that there is no dgram_recv for the last entry in the log. From there everything just seems to stop.
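
The planner entries carry a relative "time N:" offset rather than an epoch timestamp, so long pauses can be picked out by differencing consecutive offsets. This is a sketch under the assumption that the offset is in seconds since planner started; the pattern and the helper names (offsets, long_gaps) and the 60-second threshold are made up for illustration, and the sample lines are the planner entries quoted in this message.

```python
import re

# Planner entries quoted in this message, one per line.
PLANNER_LOG = """\
planner: time 11076.021: (sockaddr_in *)0x627530 = { 2, 10080, 129.15.11.173 }
planner: time 11092.842: dgram_recv(dgram=0x617544, timeout=0, fromaddr=0x627530)
planner: time 11092.842: (sockaddr_in *)0x627530 = { 2, 10080, 129.15.11.173 }
planner: time 21324.684: dgram_recv(dgram=0x617544, timeout=0, fromaddr=0x627530)
planner: time 21324.705: (sockaddr_in *)0x627530 = { 2, 10080, 129.15.11.173 }
planner: time 21630.663: dgram_recv(dgram=0x617544, timeout=0, fromaddr=0x627530)
planner: time 21630.663: (sockaddr_in *)0x627530 = { 2, 10080, 129.15.11.173 }
"""

# Assumed line layout: "planner: time <offset>: <message>".
TIME = re.compile(r"^planner: time (\d+\.\d+): (.*)$")


def offsets(text):
    """Return (offset_seconds, message) pairs for lines matching TIME."""
    found = []
    for line in text.splitlines():
        match = TIME.match(line.strip())
        if match:
            found.append((float(match.group(1)), match.group(2)))
    return found


def long_gaps(entries, threshold=60.0):
    """Return (gap_seconds, next_message) for pauses longer than threshold."""
    gaps = []
    for earlier, later in zip(entries, entries[1:]):
        delta = later[0] - earlier[0]
        if delta > threshold:
            gaps.append((round(delta, 3), later[1]))
    return gaps


planner_gaps = long_gaps(offsets(PLANNER_LOG))
for seconds, message in planner_gaps:
    print(f"{seconds:10.3f} s pause before {message}")
```

With these seven lines, two pauses exceed the threshold, and both end at a dgram_recv entry.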

Any ideas or suggestions?

Please let me know and I can provide more debug if needed.

TIA


I found a thread suggesting that the estimate might be timing out and recommending the calcsize program for the estimate instead. I made that switch and the amdump run is now running.

Thanks



--
Jim Summers
School of Computer Science-University of Oklahoma
-------------------------------------------------
