Amanda-Users

Re: got FAILED for no apparent reason

2005-09-05 07:18:36
Subject: Re: got FAILED for no apparent reason
From: Paul Bijnens <paul.bijnens AT xplanation DOT com>
To: Rodrigo Ventura <yoda AT isr.ist.utl DOT pt>
Date: Mon, 05 Sep 2005 13:03:33 +0200
Rodrigo Ventura wrote:
Hi.

Meanwhile I sent a mail to amanda-users re-reporting the problem. Here

I just read it.

From the timings below, it seems to me that the scientific simulation
program is also doing heavy I/O, or at least causing very heavy swapping.

I wouldn't be surprised if those "state of the art scientific programs"
used pre-historic dumb-user algorithms ("it works on these three lines
of input, so it should work on these million lines too")  :-)

Note that RAM access is measured in nanoseconds, while disk access is
measured in milliseconds!  When the "working set" of a program gets
larger than the physical RAM, the kernel can do little more than swap,
and "access some variable" then takes about 100000 times longer.
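
To put rough numbers on that gap, here is a back-of-the-envelope sketch. The two latencies are illustrative assumptions (about 100 ns for a RAM access, about 10 ms for a disk access), not measurements from this machine.

```python
# Illustrative latencies (assumptions, not measured on this machine).
RAM_ACCESS_NS = 100            # roughly 100 nanoseconds per RAM access
DISK_ACCESS_NS = 10_000_000    # roughly 10 milliseconds per disk access, in ns


def slowdown_factor(ram_ns: int, disk_ns: int) -> int:
    """Return how many times slower a disk access is than a RAM access."""
    return disk_ns // ram_ns


if __name__ == "__main__":
    factor = slowdown_factor(RAM_ACCESS_NS, DISK_ACCESS_NS)
    print(f"A swapped-out access is about {factor} times slower")
```

With other hardware the exact factor changes, but it stays several orders of magnitude.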


goes the data relative to that: I have 14 filesystems to dump on
localhost, so that total timeout should be 300*14=4200 seconds, right?
Doing a grep on the sendsize log, I get:

$ grep "estimate time" sendsize.20050904195400.debug
sendsize[2664]: estimate time for / level 0: 8169.854
sendsize[2664]: estimate time for / level 1: 342.415

root is indeed taking a very long time.
Possible causes:
  - some unresponsive filesystems mounted
  - using gnutar and having many small files
  - and of course, using the disk for something else while
    trying to backup...


sendsize[28610]: estimate time for /boot level 0: 0.186
sendsize[28610]: estimate time for /boot level 1: 0.021
sendsize[28613]: estimate time for /usr level 0: 1200.347
sendsize[28613]: estimate time for /usr level 1: 830.321
sendsize[28613]: estimate time for /usr level 2: 899.342

/usr is also taking a long time.

Could that be because / and /usr are on the same disk
as the swap area?


sendsize[32577]: estimate time for /root level 0: 21.309
sendsize[32577]: estimate time for /root level 1: 2.288
sendsize[32602]: estimate time for /home/ag level 0: 50.686
sendsize[32602]: estimate time for /home/ag level 1: 2.806
sendsize[32636]: estimate time for /home/hm level 0: 127.386
sendsize[32636]: estimate time for /home/hm level 1: 544.951
sendsize[1152]: estimate time for /home/nt level 0: 96.014
sendsize[1152]: estimate time for /home/nt level 1: 4.326
sendsize[1226]: estimate time for /home/uz level 0: 73.265
sendsize[1226]: estimate time for /home/uz level 1: 2.615
sendsize[1305]: estimate time for /var/spool/imap/user/ag level 0: 87.474
sendsize[1305]: estimate time for /var/spool/imap/user/ag level 2: 4.176
sendsize[1305]: estimate time for /var/spool/imap/user/ag level 3: 4.861
sendsize[1393]: estimate time for /var/spool/imap/user/hm level 0: 20.776
sendsize[1393]: estimate time for /var/spool/imap/user/hm level 1: 5.285
sendsize[1393]: estimate time for /var/spool/imap/user/hm level 2: 4.355
sendsize[1458]: estimate time for /var/spool/imap/user/nt level 0: 11.698
sendsize[1458]: estimate time for /var/spool/imap/user/nt level 2: 1.072
sendsize[1458]: estimate time for /var/spool/imap/user/nt level 3: 0.868
sendsize[1465]: estimate time for /var/spool/imap/user/uz level 0: 21.152
sendsize[1465]: estimate time for /var/spool/imap/user/uz level 1: 3.358
sendsize[1465]: estimate time for /var/spool/imap/user/uz level 2: 2.961
sendsize[1486]: estimate time for //new/C$ level 0: 22.735
sendsize[1486]: estimate time for //new/C$ level 1: 1.289
sendsize[1486]: estimate time for //new/C$ level 2: 1.182
sendsize[1498]: estimate time for //new/E$ level 0: 4.540
sendsize[1498]: estimate time for //new/E$ level 1: 0.410
sendsize[1498]: estimate time for //new/E$ level 2: 0.444
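
A listing like this is easier to judge when sorted by elapsed time. The following is a minimal sketch that assumes the "estimate time for DISK level N: SECONDS" line format shown above; the function names are made up for illustration and are not part of Amanda.

```python
import re

# Matches lines such as:
#   sendsize[2664]: estimate time for / level 0: 8169.854
ESTIMATE_RE = re.compile(r"estimate time for (\S+) level (\d+): ([\d.]+)")


def parse_estimates(lines):
    """Return a list of (disk, level, seconds) tuples from sendsize debug lines."""
    results = []
    for line in lines:
        match = ESTIMATE_RE.search(line)
        if match:
            disk, level, seconds = match.groups()
            results.append((disk, int(level), float(seconds)))
    return results


def slowest_first(estimates):
    """Sort estimates by elapsed time, longest first."""
    return sorted(estimates, key=lambda item: item[2], reverse=True)


if __name__ == "__main__":
    # A few lines copied from the log above; read the real debug file instead.
    sample = [
        "sendsize[2664]: estimate time for / level 0: 8169.854",
        "sendsize[28613]: estimate time for /usr level 0: 1200.347",
        "sendsize[1486]: estimate time for //new/C$ level 0: 22.735",
    ]
    for disk, level, seconds in slowest_first(parse_estimates(sample)):
        print(f"{seconds:10.3f} s  level {level}  {disk}")
```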

It seems that the level 0 estimate for / is the one taking the longest.
The tail of that log is:

$ tail sendsize.20050904195400.debug
sendsize[1498]: time 12571.432:                 59992 blocks of size 262144. 29027 blocks available
sendsize[1498]: time 12571.432: Total number of bytes: 893464856
sendsize[1498]: time 12571.433: .....
sendsize[1498]: estimate time for //new/E$ level 2: 0.444
sendsize[1498]: estimate size for //new/E$ level 2: 872525 KB
sendsize[1498]: time 12571.433: waiting for /usr/bin/smbclient "//new/E$" child
sendsize[1498]: time 12571.433: after /usr/bin/smbclient "//new/E$" wait
sendsize[1498]: time 12571.433: done with amname '//new/E$', dirname '//new/E$', spindle -1
sendsize[2659]: time 12571.433: child 1498 terminated normally
sendsize: time 12571.438: pid 2659 finish time Sun Sep  4 23:23:31 2005

It takes 12571.438 secs for the estimates; much greater than
4200.
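
The comparison can be written out as a small sketch. It takes the figures from this mail (300 seconds per disk, 14 disks, 12571.438 seconds elapsed) and assumes, as above, that the total budget is simply the per-disk timeout times the number of disks.

```python
ETIMEOUT_PER_DISK = 300       # seconds per disk, as used in the calculation above
DISK_COUNT = 14               # filesystems dumped on localhost
ELAPSED_SECONDS = 12571.438   # total time reported at the end of the sendsize log


def total_budget(etimeout_per_disk: int, disk_count: int) -> int:
    """Total time allowed for all estimates on one client."""
    return etimeout_per_disk * disk_count


def overrun(elapsed: float, budget: float) -> float:
    """Seconds by which the estimates exceeded the budget (0 if within it)."""
    return max(0.0, elapsed - budget)


if __name__ == "__main__":
    budget = total_budget(ETIMEOUT_PER_DISK, DISK_COUNT)
    print(f"budget:  {budget} s")
    print(f"elapsed: {ELAPSED_SECONDS} s")
    print(f"overrun: {overrun(ELAPSED_SECONDS, budget):.3f} s")
```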

If this is correct, then I should increase the estimate timeout, maybe
ten-fold. But I'm still not sure that is the problem. Is it worthwhile
to try with a giant timeout and see what happens?

Maybe set it like "etimeout -16000" (a negative value is taken as the
total time to wait for the client, instead of a per-disk time).
But remember that after the estimate, the backup itself starts,
which also uses the disks heavily.
And because this is the amanda server itself, the holdingdisk is
worked very hard too.  Make sure the holdingdisk can feed the bytes fast
enough to the tapedrive; otherwise, you'll end up with a thrashing tape
drive (the shoeshining effect of having to stop, rewind a little, restart),
making your tapedrive ready for the trashcan in 2 months or less.
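
As a rough check of that condition, here is a minimal sketch. The throughput figures are hypothetical placeholders, not measurements of this server or of any particular drive; substitute the rates you actually observe.

```python
def can_keep_streaming(holding_disk_mb_s: float, tape_mb_s: float) -> bool:
    """True if the holding disk supplies data at least as fast as the tape consumes it."""
    return holding_disk_mb_s >= tape_mb_s


if __name__ == "__main__":
    TAPE_MB_S = 3.0          # hypothetical streaming rate of the tape drive
    DISK_IDLE_MB_S = 30.0    # hypothetical holding disk rate on a quiet machine
    DISK_BUSY_MB_S = 2.0     # hypothetical rate while the machine is swapping

    for label, rate in (("idle", DISK_IDLE_MB_S), ("busy", DISK_BUSY_MB_S)):
        verdict = "streams" if can_keep_streaming(rate, TAPE_MB_S) else "shoeshines"
        print(f"holding disk {label}: {rate} MB/s -> tape {verdict}")
```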

With such a heavy load, you should probably disable software compression too.

Amanda itself does not need a very fast PC.   My amanda server is
a 333 MHz linux machine with 128 Mbyte RAM.  But I do have a large
(80 GByte) and fast (7200 RPM) dedicated holdingdisk (and soon it
will have a 250 GByte holdingdisk), and two AIT-1 tapes.
I back up about 280 Gbyte in total with that little PC that would
take 5 minutes to boot XP on it.
Is there any possibility of migrating the amanda server to another machine?
Or of migrating the scientific simulation to another machine?

--
Paul Bijnens, Xplanation                            Tel  +32 16 397.511
Technologielaan 21 bus 2, B-3001 Leuven, BELGIUM    Fax  +32 16 397.512
http://www.xplanation.com/          email:  Paul.Bijnens AT xplanation DOT com
***********************************************************************
* I think I've got the hang of it now:  exit, ^D, ^C, ^\, ^Z, ^Q, ^^, *
* F6, quit, ZZ, :q, :q!, M-Z, ^X^C, logoff, logout, close, bye, /bye, *
* stop, end, F3, ~., ^]c, +++ ATH, disconnect, halt,  abort,  hangup, *
* PF4, F20, ^X^X, :D::D, KJOB, F14-f-e, F8-e,  kill -1 $$,  shutdown, *
* init 0, kill -9 1, Alt-F4, Ctrl-Alt-Del, AltGr-NumLock, Stop-A, ... *
* ...  "Are you sure?"  ...   YES   ...   Phew ...   I'm out          *
***********************************************************************


