Amanda-Users

Re: incremental backup not working

2006-06-07 04:36:33
Subject: Re: incremental backup not working
From: Paul Bijnens <paul.bijnens AT xplanation DOT com>
To: mario.lobo AT ipad.com DOT br
Date: Wed, 07 Jun 2006 10:26:32 +0200
On 2006-06-07 00:07, Mario Lobo wrote:
Hi to all; Forgive me for such a long post but I needed to post as much info as I had so you guys could have a better picture of what is going on. The logs and config files below will detail the full story. Short story: First full backup of the cycle works great, down to the last byte !! First incremental backup of the cycle, only 3 clients work, all the others fail :(.

I have only partial answers.  No answers to your real problem.



No firewall issues involved. The server and the clients are open at the firewall to see each other, otherwise, how would the full backup work, right?. And I checked if the firewall rules were there, before running the incremental.

Firewall is still not to be ruled out.
On the first run, Amanda had to estimate only full dumps. On subsequent
runs Amanda will estimate more levels: it can ask for an estimate
for level 0, level N, and level N+1.
So on the second run, it will ask for a level 0 estimate + a level 1
estimate for all the DLE's.

Why?  Because Amanda will spread the level 0 dumps over a complete dumpcycle.
On the very first run, it needs to do them all (most of the time people have
trouble here, because their tape is not large enough to hold it all).
But already during the second run Amanda will schedule level 0 dumps
for about 1/runspercycle of the total amount of level 0 dumps.

Amanda just notices that 6 days from now, an enormous amount of
level 0 needs to be done, and she likes to spread the work over the
complete dumpcycle. That's why some of the DLE's get a level 0 earlier
now.  But after a complete dumpcycle the situation calms down, until
the data moves/grows, and Amanda reshuffles the DLE's a bit again.
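
As a rough illustration of that spreading (a simplified sketch, not Amanda's actual planner logic), the per-run share of level 0 data is about the total divided by runspercycle. The 106 GB total and runspercycle 5 come from this thread; the function name is made up for the example.

```python
def level0_target_per_run(total_level0_gb, runspercycle):
    """Approximate share of level 0 data per run if spread evenly.

    This is only the 1/runspercycle rule of thumb; the real planner
    also weighs tape size, priorities and estimate sizes.
    """
    return total_level0_gb / runspercycle


total_gb = 106      # approximate size of a full dump of all DLE's
runspercycle = 5    # from the posted amanda.conf

share = level0_target_per_run(total_gb, runspercycle)
print(f"about {share:.1f} GB of level 0 dumps per run")
```

So promoting one large DLE such as intranet:/home on the second run is in line with that target.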



Well, my questions are spread throughout the info below. I've been learning and experimenting with AMANDA for more than 20 days straight and I've looked at all the docs, wiki, google, etc., and I don't know where else to look for info that could help me fix this.

Thank you beforehand for your help.

Mario Lobo

[SETUP]===========================================
1 Server machine: backup
9 Client machines: backup, dtwmaster, intranet, jaboatao, moreno, olinda, recife, servidor, spyket-lab
Obs - The backup server is backing up a windows share, mounted on itself.

[amanda.conf ]=========================================
...
dumpcycle 7
runspercycle 5
tapecycle 6
runtapes 6

The above "could" result in using all of your tapes in a single
run.  That means the next run will need to overwrite some of
those tapes already.  If they happen to contain any level 0 dump,
then the incrementals depending on it will be worth much less too.

But luckily a full dump of everything is about 106 GB, and fits in about
2 vtapes of 57 GB, as you defined below.
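
A quick back-of-the-envelope check of those numbers, as a sketch only. The values are taken from the posted config and the figures above; the conversion of 1024 MB per GB is my assumption, and the helper name is invented for the example.

```python
import math

TAPE_LENGTH_MB = 57344   # "length 57344 mbytes" in the HARD-DISK tapetype
FULL_DUMP_GB = 106       # approximate size of a full dump of everything
RUNTAPES = 6             # max tapes one run may use
TAPECYCLE = 6            # tapes in rotation


def tapes_needed(dump_gb, tape_mb):
    """Number of vtapes needed for a dump, assuming 1024 MB per GB."""
    return math.ceil(dump_gb * 1024 / tape_mb)


needed = tapes_needed(FULL_DUMP_GB, TAPE_LENGTH_MB)
print(f"full dump needs about {needed} vtapes")

# The risk described above: one run is allowed to use every tape in rotation.
if RUNTAPES >= TAPECYCLE:
    print("a single run could use the whole tapecycle")
```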

...
etimeout 350

I don't believe this is the problem, but make really sure that
sendsize on each client has finished in 350 seconds multiplied
by the number of DLE's on that host.
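
To make that budget concrete, here is a small sketch. It follows my understanding of the amanda.conf documentation for this version: a positive etimeout is per DLE on a client, a negative one is a total per client. The DLE counts per host below are hypothetical, since the full disklist is not shown here.

```python
def estimate_budget_seconds(etimeout, dle_count):
    """Seconds the planner would wait for one client's estimates.

    Positive etimeout: per-DLE, so multiply by the number of DLE's.
    Negative etimeout: taken as a total per client.
    """
    if etimeout < 0:
        return -etimeout
    return etimeout * dle_count


ETIMEOUT = 350  # from the posted amanda.conf

# Hypothetical DLE counts, for illustration only.
dles_per_host = {"intranet": 2, "recife": 3}

for host, count in dles_per_host.items():
    budget = estimate_budget_seconds(ETIMEOUT, count)
    print(f"{host}: sendsize must finish within {budget} s")
```

Compare those budgets with the elapsed times in the sendsize debug files on each client.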


...
define tapetype HARD-DISK {
    comment "Dump onto hard disk"
    length 57344 mbytes
}
...
define dumptype full-windows {
    global
    program "GNUTAR"
    comment "FULL Windows dump with tar and no compression"
    options no-compress
    priority high
    fallback_splitsize 128m
}



Just curious:  you have "fallback_splitsize" everywhere, but that
parameter is only used when you have "tape_splitsize" set, which it is not.

All your dumptypes are named "full-..." while nothing
in the dumptype forces them to full dumps only.  Strange.
And full-windows is actually "compress no", which makes it different
from the others...
The string "options ..." is very old syntax, and one of these days it
could be removed from the parser.


... backup /gravata { full-windows estimate calcsize }

Is "backup" a Windows computer with cygwin?  Just curious.


[FULL BACKUP - success ]==================================
*** running su - amanda -c "amdump test"
The backup starts at 16:04:35 and the last dump is finished at 20:04:52.


...

[INCREMENTAL BACKUP - failure ]==============================
*** running `su - amanda -c "amdump test"`, a day after the full backup, without doing ANYTHING between the two runs.
*** All level 0 backups here were supposed to be level 1! What could be causing this?

Some level 0 dumps will be promoted (= scheduled early)
to spread the amount of level 0 dumps over the dumpcycle.


Using /amanda/test/incrdump.1 from Sat Jun 3 18:49:58 BRT 2006
backup:/gravata (level 1 correct !) 1 520k finished (19:00:36)

What is strange here is that the estimate took about 10 minutes.
Considering your "etimeout 350" above, and the number of DLE's of each
host, that seems to me like even the initial "feature" exchange using
the "noop" service failed for most hosts.




dtwmaster:/db (level 0 !! why ??) 0 planner: [hmm, disk was stranded on waitq]

"stranded on waitq" means that when the estimate finished, there were
no estimates (or not all estimates) received for this DLE; it was still
in the queue of DLE's that were waiting for results from the client.



[...]
intranet:/etc (level 1 correct !) 1 71k finished (20:44:43)
intranet:/home (level 0 ! why promoted?) 0 33014946k finished (20:44:43)

I already explained what "promoted" means.


[...]
ERROR planner Request to servidor failed: timeout waiting for ACK
ERROR planner Request to recife failed: timeout waiting for ACK
ERROR planner Request to spyket-lab failed: timeout waiting for ACK
ERROR planner Request to moreno failed: timeout waiting for ACK
ERROR planner Request to olinda failed: timeout waiting for ACK
ERROR planner Request to dtwmaster failed: timeout waiting for ACK

So actually, these hosts never replied at all.
I would check the firewall rules, as well as the xinetd configuration,
on those hosts.
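
To get the list of hosts to check, something like the following would do. It is only a sketch: the regular expression is written against the ERROR lines exactly as quoted in this thread, and the function name is made up.

```python
import re

# Matches planner errors of the form quoted above.
ACK_TIMEOUT = re.compile(r"Request to (\S+) failed: timeout waiting for ACK")


def hosts_without_ack(log_text):
    """Return the sorted, de-duplicated hosts that never sent an ACK."""
    return sorted(set(ACK_TIMEOUT.findall(log_text)))


sample = """\
ERROR planner Request to servidor failed: timeout waiting for ACK
ERROR planner Request to recife failed: timeout waiting for ACK
ERROR planner Request to spyket-lab failed: timeout waiting for ACK
ERROR planner Request to moreno failed: timeout waiting for ACK
ERROR planner Request to olinda failed: timeout waiting for ACK
ERROR planner Request to dtwmaster failed: timeout waiting for ACK
"""

for host in hosts_without_ack(sample):
    print(host)
```

Those are the six clients whose firewall and xinetd setup deserve a second look.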


INFO planner Full dump of intranet:/home promoted from 6 days ahead.
*** why did the above happen?

Because otherwise, 6 days from now, there would be too much work.


[...]
---------------------------------------------------------------------
Lines taken from the amandad.xxx.debug file, from one of the FAILed clients (recife) during the FAILED incremental backup
---------------------------------------------------------------------
*** This is the full log of the incremental session !!
amandad: debug 1 pid 21199 ruid 11026 euid 11026: start at Sat Jun 3 19:09:07 2006

The strange thing is that this is about 20 minutes later than the time at
which amstatus said the backup began... The server had even finished
backup:/gravata already at 19:00:36.  Or maybe the clocks are not synchronized?
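
For reference, the gaps can be worked out from the timestamps in the posted logs. This is a sketch that assumes all three times are on the same day (Sat Jun 3 2006) and in the same timezone, which is exactly what is in doubt if the clocks differ.

```python
from datetime import datetime, timedelta

FMT = "%Y-%m-%d %H:%M:%S"


def gap(earlier, later):
    """Time elapsed between two 'YYYY-MM-DD HH:MM:SS' timestamps."""
    return datetime.strptime(later, FMT) - datetime.strptime(earlier, FMT)


run_start = "2006-06-03 18:49:58"      # amstatus: "Using ... from Sat Jun 3 18:49:58"
gravata_done = "2006-06-03 19:00:36"   # backup:/gravata finished
client_noop = "2006-06-03 19:09:07"    # amandad start on recife

print("noop after run start:   ", gap(run_start, client_noop))
print("noop after gravata done:", gap(gravata_done, client_noop))
```

Running `date` on the server and on recife at the same moment would show whether the offset is real or just clock drift.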

Amanda first does a "feature" exchange with each client.  It does this
by sending a dummy request "SERVICE noop" to each client.  As part
of the reply, it gets back an "ACK" with the feature bitmask.
Then the server can restrict its request to the features that that
client can handle.

Next Amanda sends a "SERVICE sendsize" request to each client.
Is that really missing?  Is there no file amandad.DATETIME.debug containing
a "SERVICE sendsize"?

Then, when the estimates of all the hosts have been collected, the
server sends "SERVICE sendbackup" one by one for each DLE.
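
One way to check which of these requests actually reached a client is to scan its amandad debug files for "SERVICE" lines. A minimal sketch, assuming the text of the debug files has already been read into one string; the function names are invented for the example.

```python
import re

# Order in which the server is expected to contact a client.
EXPECTED_ORDER = ["noop", "sendsize", "sendbackup"]


def services_seen(debug_text):
    """All service names requested in the debug text, in order of appearance."""
    return re.findall(r"SERVICE (\w+)", debug_text)


def missing_services(debug_text):
    """Expected services that never show up in the debug text."""
    seen = set(services_seen(debug_text))
    return [name for name in EXPECTED_ORDER if name not in seen]


# Excerpt from the recife log quoted in this thread.
excerpt = (
    "amandad: time 0.002: accept recv REQ pkt: <<<<< "
    "SERVICE noop OPTIONS features=fffffeff9ffeffff07;"
)

print("seen:   ", services_seen(excerpt))
print("missing:", missing_services(excerpt))
```

If sendsize is missing from every debug file on a client, the server's second request never arrived there.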

Why is the "noop" request received 9 minutes after the backup of another
DLE has already finished?



amandad: version 2.5.0p2
amandad: build: VERSION="Amanda-2.5.0p2"
amandad: BUILT_DATE="Sun May 21 14:31:04 BRT 2006"
amandad: BUILT_MACH="Linux spyket-lab 2.6.13.4 #10 Tue Nov 1 11:03:01 BRT 2005 i686 i686 i386 GNU/Linux"
amandad: CC="gcc"
amandad: CONFIGURE_COMMAND="'./configure' '--libexecdir=/usr/local/libexec/amanda' '--with-amandahosts' '--with-fqdn' '--with-dump-honor-nodump' '--with-buffered-dump' '--disable-libtool' '--prefix=/usr/local' '--with-user=amandabck' '--with-group=amandabck' '--with-gnutar-listdir=/usr/local/var/amanda/gnutar-lists' '--with-gnutar=/bin/gtar' '--without-server' '--prefix=/usr/local'"
amandad: paths: bindir="/usr/local/bin" sbindir="/usr/local/sbin"
amandad: libexecdir="/usr/local/libexec/amanda"
amandad: mandir="/usr/local/man" AMANDA_TMPDIR="/tmp/amanda"
amandad: AMANDA_DBGDIR="/tmp/amanda"
amandad: CONFIG_DIR="/usr/local/etc/amanda" DEV_PREFIX="/dev/"
amandad: RDEV_PREFIX="/dev/r" DUMP="/sbin/dump"
amandad: RESTORE="/sbin/restore" VDUMP=UNDEF VRESTORE=UNDEF
amandad: XFSDUMP=UNDEF XFSRESTORE=UNDEF VXDUMP=UNDEF VXRESTORE=UNDEF
amandad: SAMBA_CLIENT="/usr/bin/smbclient" GNUTAR="/bin/gtar"
amandad: COMPRESS_PATH="/bin/gzip" UNCOMPRESS_PATH="/bin/gzip"
amandad: LPRCMD="/usr/bin/lpr" MAILER="/usr/bin/Mail"
amandad: listed_incr_dir="/usr/local/var/amanda/gnutar-lists"
amandad: defs: DEFAULT_SERVER="spyket-lab"
amandad: DEFAULT_CONFIG="DailySet1"
amandad: DEFAULT_TAPE_SERVER="spyket-lab"
amandad: DEFAULT_TAPE_DEVICE="null:" HAVE_MMAP HAVE_SYSVSHM
amandad: LOCKING=POSIX_FCNTL SETPGRP_VOID DEBUG_CODE
amandad: AMANDA_DEBUG_DAYS=4 BSD_SECURITY RSH_SECURITY USE_AMANDAHOSTS
amandad: CLIENT_LOGIN="amandabck" FORCE_USERID HAVE_GZIP
amandad: COMPRESS_SUFFIX=".gz" COMPRESS_FAST_OPT="--fast"
amandad: COMPRESS_BEST_OPT="--best" UNCOMPRESS_OPT="-dc"
amandad: time 0.002: accept recv REQ pkt: <<<<< SERVICE noop OPTIONS features=fffffeff9ffeffff07;
amandad: time 0.002: creating new service: /usr/local/libexec/amanda/noop OPTIONS features=fffffeff9ffeffff07;
amandad: time 0.005: sending ACK pkt: <<<<<
amandad: time 0.006: sending REP pkt: <<<<< OPTIONS features=fffffeff9ffeffff07;
amandad: time 0.006: received ACK pkt: <<<<<
amandad: time 29.999: pid 21199 finish time Sat Jun 3 19:09:37 2006
=====================================================

The above exchange seems perfectly normal.  Except the time does not
match the time on the server.  Are you 100% sure this is the right
file?



*** Why does it start the noop SERVICE on the incremental, instead of the sendbackup SERVICE, like in the full dump?

First a "noop", then "sendsize", then multiple "sendbackup".



--
Paul Bijnens, xplanation Technology Services        Tel  +32 16 397.511
Technologielaan 21 bus 2, B-3001 Leuven, BELGIUM    Fax  +32 16 397.512
http://www.xplanation.com/          email:  Paul.Bijnens AT xplanation DOT com
***********************************************************************
* I think I've got the hang of it now:  exit, ^D, ^C, ^\, ^Z, ^Q, ^^, *
* F6, quit, ZZ, :q, :q!, M-Z, ^X^C, logoff, logout, close, bye, /bye, *
* stop, end, F3, ~., ^]c, +++ ATH, disconnect, halt,  abort,  hangup, *
* PF4, F20, ^X^X, :D::D, KJOB, F14-f-e, F8-e,  kill -1 $$,  shutdown, *
* init 0, kill -9 1, Alt-F4, Ctrl-Alt-Del, AltGr-NumLock, Stop-A, ... *
* ...  "Are you sure?"  ...   YES   ...   Phew ...   I'm out          *
***********************************************************************

