Amanda-Users

amdump is failing for localhost

2005-09-05 06:10:06
Subject: amdump is failing for localhost
From: Rodrigo Ventura <yoda AT isr.ist.utl DOT pt>
To: amanda-users AT amanda DOT org
Date: Mon, 05 Sep 2005 10:32:02 +0100
Hello all. My backup setup has been working fine for the last
months. Now I'm getting systematic FAILED dumps on the local
filesystem. What changed? There is a user process that is running a
scientific simulation program which uses about 500M of virtual memory
footprint (the machine has 520M RAM and 1G swap). This may be causing
timeouts, and I could increase the timeouts value, but from an
inspection to the debug files, I'm not sure if it is just a timeout
issue.

Therefore, I'm sending this mail to ask for help on figuring out what
is wrong, from the log files.

Ok, let's start with the amanda mail report:

----------------------------------------------------------------------
FAILURE AND STRANGE DUMP SUMMARY:
  omni       //new/E$ lev 0 FAILED [Estimate timeout from omni]
  omni       //new/C$ lev 0 FAILED [Estimate timeout from omni]
  omni       /var/spool/imap/user/uz lev 0 FAILED [Estimate timeout from omni]
  omni       /var/spool/imap/user/nt lev 0 FAILED [Estimate timeout from omni]
[...]
----------------------------------------------------------------------

(we can forget for now the smb backups, and restrict our attention to
the local filesystems)

The following file is amandad.20050904195400000.debug, which I assume
is amandad daemon running on localhost, attending requests to
sendsize/backups:

----------------------------------------------------------------------
[...]
amandad: time 0.000: got packet:
--------
Amanda 2.4 REQ HANDLE 000-789C0608 SEQ 1125860041
SECURITY USER amanda
SERVICE sendsize
OPTIONS features=fffffeff9ffe0f;maxdumps=1;hostname=omni;
GNUTAR //new/E$ 0 1970:1:1:0:0:0 -1 OPTIONS |;bsd-auth;compress-fast;index;
GNUTAR //new/E$ 1 2005:8:30:3:5:35 -1 OPTIONS |;bsd-auth;compress-fast;index;
GNUTAR //new/E$ 2 2005:8:31:3:50:7 -1 OPTIONS |;bsd-auth;compress-fast;index;
GNUTAR //new/C$ 0 1970:1:1:0:0:0 -1 OPTIONS 
|;bsd-auth;compress-fast;index;exclude-file=./pagefile.sys;
GNUTAR //new/C$ 1 2005:8:28:3:18:35 -1 OPTIONS 
|;bsd-auth;compress-fast;index;exclude-file=./pagefile.sys;
GNUTAR //new/C$ 2 2005:8:31:3:53:48 -1 OPTIONS 
|;bsd-auth;compress-fast;index;exclude-file=./pagefile.sys;
GNUTAR /var/spool/imap/user/uz /var/spool/imap/user 0 1970:1:1:0:0:0 -1 OPTIONS 
|;bsd-auth;compress-fast;index;exclude-file=./[a-t]*;
GNUTAR /var/spool/imap/user/uz /var/spool/imap/user 1 2005:8:29:3:20:17 -1 
OPTIONS |;bsd-auth;compress-fast;index;exclude-file=./[a-t]*;
[...]
--------

amandad: time 0.000: sending ack:
----
Amanda 2.4 ACK HANDLE 000-789C0608 SEQ 1125860041
----

amandad: time 0.001: bsd security: remote host omni.isr.ist.utl.pt user amanda 
local user amanda
amandad: time 0.001: amandahosts security check passed
amandad: time 0.001: running service "/usr/local/amanda/libexec/sendsize"
amandad: time 12571.560: sending REP packet:
----
Amanda 2.4 REP HANDLE 000-789C0608 SEQ 1125860041
OPTIONS features=fffffeff9ffe0f;
/ 0 SIZE 4974600
/ 1 SIZE 678600
/boot 0 SIZE 6750
/boot 1 SIZE 10
/usr 0 SIZE 3312330
/usr 1 SIZE 174220
[...]
----

amandad: time 12581.571: dgram_recv: timeout after 10 seconds
amandad: time 12581.571: waiting for ack: timeout, retrying
amandad: time 12591.572: dgram_recv: timeout after 10 seconds
amandad: time 12591.572: waiting for ack: timeout, retrying
amandad: time 12601.572: dgram_recv: timeout after 10 seconds
amandad: time 12601.572: waiting for ack: timeout, retrying
amandad: time 12611.572: dgram_recv: timeout after 10 seconds
amandad: time 12611.572: waiting for ack: timeout, retrying
amandad: time 12621.572: dgram_recv: timeout after 10 seconds
amandad: time 12621.572: waiting for ack: timeout, giving up!
amandad: time 12621.602: pid 2657 finish time Sun Sep  4 23:24:21 2005
----------------------------------------------------------------------

Now, let's check sendsize.20050904195400.debug:

----------------------------------------------------------------------
sendsize: debug 1 pid 2659 ruid 92 euid 92: start at Sun Sep  4 19:54:00 2005
sendsize: version 2.4.4p4
sendsize[2659]: time 0.232: waiting for any estimate child: 1 running
sendsize[2664]: time 0.232: calculating for amname '/', dirname '/', spindle -1
sendsize[2664]: time 0.232: getting size via gnutar for / level 0
sendsize[2664]: time 0.513: spawning /usr/local/amanda/libexec/runtar in 
pipeline
sendsize[2664]: argument list: /bin/tar --create --file /dev/null --directory / 
--one-file-system --listed-incremental 
/usr/local/amanda/var/amanda/gnutar-lists/omni__0.new --sparse 
--ignore-failed-read --totals --exclude-from 
/tmp/amanda/sendsize._.20050904195400.exclude .
sendsize[2664]: time 91.573: /bin/tar: 
./var/amavis/tmp/spamassassin.149.wxdvgp.tmp: Warning: Cannot stat: No such 
file or directory
sendsize[2664]: time 108.878: /bin/tar: 
./var/spool/mqueue-mta/local/tfj84IdMD1018683: Warning: Cannot stat: No such 
file or directory
[...]
sendsize[1498]: estimate time for //new/E$ level 2: 0.444
sendsize[1498]: estimate size for //new/E$ level 2: 872525 KB
sendsize[1498]: time 12571.433: waiting for /usr/bin/smbclient "//new/E$" child
sendsize[1498]: time 12571.433: after /usr/bin/smbclient "//new/E$" wait
sendsize[1498]: time 12571.433: done with amname '//new/E$', dirname 
'//new/E$', spindle -1
sendsize[2659]: time 12571.433: child 1498 terminated normally
sendsize: time 12571.438: pid 2659 finish time Sun Sep  4 23:23:31 2005
----------------------------------------------------------------------


Conclusions: sendsize finishes after 12571.438 secs, while amandad
starts complaining about timeouts at time 12581.571.

The timeouts could be caused by the abnormal CPU/swap load, but these
timeouts are not visible in the logs.

What do you suggest me to do next to trackdown the problem?

Cheers,

Rodrigo Ventura

-- 

*** Rodrigo Martins de Matos Ventura <yoda AT isr.ist.utl DOT pt>
***  Web page: http://www.isr.ist.utl.pt/~yoda
***   Teaching Assistant and PhD Student at ISR:
***    Instituto de Sistemas e Robotica, Polo de Lisboa
***     Instituto Superior Tecnico, Lisboa, PORTUGAL
*** PGP fingerprint = 0119 AD13 9EEE 264A 3F10  31D3 89B3 C6C4 60C6 4585

<Prev in Thread] Current Thread [Next in Thread>
  • amdump is failing for localhost, Rodrigo Ventura <=