Re: got FAILED for no apparent reason

Rodrigo Ventura wrote:

It happened again. This time the timeout was on localhost! There is no
firewall involved, both in this case as well as in the last (mail from
July 12th).

FAILURE AND STRANGE DUMP SUMMARY:
  omni       //new/E$ lev 0 FAILED [Estimate timeout from omni]
  omni       //new/C$ lev 0 FAILED [Estimate timeout from omni]
  omni       /var/spool/imap/user/uz lev 0 FAILED [Estimate timeout from omni]
  omni       /var/spool/imap/user/nt lev 0 FAILED [Estimate timeout from omni]
[...]

The timeout setting at amanda.conf is 300 (the default, I
believe). There is comment stating that:
# a positive number will be multiplied by the number of filesystems on
# each host; a negative number will be taken as an absolute total time-out.
# The default is 5 minutes per filesystem.

Let's look at the debug files:

-----[amandad.20050901030010000.debug]----------------------------------------
[...]
//new/C$ 2 SIZE 874751
//new/E$ 0 SIZE 7786744
//new/E$ 1 SIZE 467994
//new/E$ 2 SIZE 467994
----


4 * 300 sec = 1200 seconds timeout


amandad: time 14935.917: dgram_recv: timeout after 10 seconds


but apparently it took almost 15000 seconds...

That is very strange too. smbclient uses the superfast builtin
"du" command to do the estimates (except for very old smblient versions).
On which filesystem did it spend all that time?  You did not show that.

-----[sendsize.20050901030010.debug]----------------------------------------
[...]
sendsize[22886]: estimate time for //new/E$ level 2: 0.422


And indeed, this particular estimate took less than half a second.

sendsize[22886]: estimate size for //new/E$ level 2: 467994 KB
sendsize[22886]: time 14925.824: waiting for /usr/bin/smbclient "//new/E$" child
sendsize[22886]: time 14925.824: after /usr/bin/smbclient "//new/E$" wait
sendsize[22886]: time 14925.824: done with amname '//new/E$', dirname 
'//new/E$', spindle -1
sendsize[1525]: time 14925.827: child 22886 terminated normally
sendsize: time 14925.834: pid 1525 finish time Thu Sep  1 07:08:56 2005



What can we gather from this? sendsize finishes 50 seconds before
amandad; the amandad timeouts start 10 seconds before sendize
finishes. What is happening here?


???  amandad timeouts start 10 seconds AFTER sendsize finishes.
That looks completely normal to me.
Can you find out which filesystem(s) did took the majority of the
15000 seconds?


--
Paul Bijnens, Xplanation                            Tel  +32 16 397.511
Technologielaan 21 bus 2, B-3001 Leuven, BELGIUM    Fax  +32 16 397.512
http://www.xplanation.com/          email:  Paul.Bijnens AT xplanation DOT com
***********************************************************************
* I think I've got the hang of it now:  exit, ^D, ^C, ^\, ^Z, ^Q, ^^, *
* F6, quit, ZZ, :q, :q!, M-Z, ^X^C, logoff, logout, close, bye, /bye, *
* stop, end, F3, ~., ^]c, +++ ATH, disconnect, halt,  abort,  hangup, *
* PF4, F20, ^X^X, :D::D, KJOB, F14-f-e, F8-e,  kill -1 $$,  shutdown, *
* init 0, kill -9 1, Alt-F4, Ctrl-Alt-Del, AltGr-NumLock, Stop-A, ... *
* ...  "Are you sure?"  ...   YES   ...   Phew ...   I'm out          *
***********************************************************************