Re: error recovery

Rodrigo Ventura wrote:

I was finally able to get past the estimate timeout problem I was having
(solution: a combination of reducing CPU/swap-intensive processes and
increasing the etimeout value).


Fine!

However, a couple of problems came up:

I got failed dumps (I was expecting this) because of end-of-tape:

======================================================================
These dumps were to tape ISR005.
*** A TAPE ERROR OCCURRED: [[writing file: No space left on device]].
Some dumps may have been left in the holding disk.
Run amflush to flush them to tape.
The next tape Amanda expects to use is: ISR001.
[...]
  omni       /home/nt lev 0 FAILED [out of tape]
  omni       / lev 1 FAILED [can't dump no-hold disk in degraded mode]
[...]
NOTES:
  planner: Last full dump of omni:/home/ag on tape ISR005 overwritten on this run.
  planner: Last full dump of omni:/var/spool/imap/user/hm on tape ISR005 overwritten on this run.
  planner: Last full dump of omni://new/E$ on tape ISR005 overwritten on this run.
  planner: Incremental of damiao:/ bumped to level 2.
  planner: Incremental of omni:/var/spool/imap/user/hm bumped to level 2.
  planner: gtisr /usr 20050906 0 [dumps too big, 398656 KB, full dump delayed]
  planner: damiao /usr 20050906 0 [dumps too big, 598668 KB, full dump delayed]
  planner: gtisr / 20050906 0 [dumps too big, 716448 KB, full dump delayed]
  planner: omni / 20050906 0 [dumps too big, 3878447 KB, full dump delayed]
  planner: omni /home/hm 20050906 0 [dumps too big, 6360061 KB, full dump delayed]
  planner: omni //new/E$ 20050906 0 [dumps too big, 3386713 KB, full dump delayed]
  planner: omni /home/ag 20050906 0 [dumps too big, 5373985 KB, full dump delayed]
  planner: omni /var/spool/imap/user/hm 20050906 0 [dumps too big, 5761742 KB, full dump delayed]
  taper: tape ISR005 kb 30871904 fm 21 writing file: No space left on device
  driver: going into degraded mode because of tape error.
[...]
DUMP SUMMARY:

                                     DUMPER STATS            TAPER STATS
HOSTNAME     DISK        L ORIG-KB OUT-KB COMP% MMM:SS  KB/s MMM:SS  KB/s
-------------------------- --------------------------------- ------------
damiao       /           0 1141360 429995  37.7  13:48 519.3   3:012370.4
damiao       /boot       0    4930   3654  74.1   0:12 302.0   0:021759.7
damiao       /raid1      1    3600    290   8.1   0:35   8.2   0:02 130.4
damiao       /raid1/www  1   22370   8621  38.5   1:15 114.4   0:061473.3
damiao       /usr        1    5510    532   9.7   1:04   8.3   0:09  59.4
gtisr        /           1   31710   3496  11.0   0:47  74.7   0:021938.1
gtisr        /boot       0    6820   4709  69.0   0:031360.6   0:022932.4
gtisr        /usr        1    5650    562   9.9   0:25  22.7   0:14  39.4
omni         /           1 FAILED ---------------------------------------
omni         //new/C$    0 32633101670382  51.2  16:381673.1  11:022523.3
omni         //new/E$    1  961248 103532  10.8   3:12 538.2   0:422470.3
omni         /boot       0    6750   5038  74.6   0:031458.7   0:031511.8
omni         /home/ag    1   11510   2649  23.0   0:45  58.5   0:05 576.7
omni         /home/hm    1 12223101195526  97.8   7:352629.9   7:502542.7
omni         /home/nt    0 1493210012752875  85.4  60:573487.4  FAILED  ----
omni         /home/uz    0 28596102189626  76.6  14:052590.4  13:542626.1
omni         /root       0  432700 347637  80.3   2:172542.0   2:152578.8
omni         /usr        0 33126401578130  47.6  29:11 901.0  10:352485.7
omni         -ap/user/ag 0 78292705183322  66.2  45:381892.9  33:282581.7
omni         -ap/user/hm 2  345390 191659  55.5   2:461157.9   1:132615.7
omni         -ap/user/nt 0 67581704649497  68.8  31:582423.7  29:542591.1
omni         -ap/user/uz 0 20160101059867  52.6  28:49 613.0   7:012517.0


One remark here:  specify a "columnspec" in amanda.conf to avoid
running columns together.  I have (on one long line!):

columnspec "HostName=0:9,Disk=1:18,Level=1:1,OrigKB=1:8,OutKB=1:7,Compress=1:5,DumpTime=1:6,DumpRate=1:6,TapeTime=1:6,TapeRate=1:6"

(brought to you by Amanda version 2.4.4p4)
======================================================================


Now, I have a bunch of questions about this report:

1. what's the meaning of "Some dumps may have been left in the holding
disk."? I mean, can I expect them to be written to tape by the next
amdump? Or do I have to call amcleanup/amflush (which one?) to write
them to tape? In the latter case, I'll be overwriting a tape of my
cycle. I have one spare tape (my cycle is 4 tapes, and I have a
magazine of 5 loaded).


When amanda warns you that some dumps may have been left in the
holding disk, you should check. Amanda could not put all the files on
tape, which means that some dumps are probably still on the holding
disk. I haven't looked at that part of the source code yet, but I
guess that at that point the program cannot be 100% sure that this is
really the case. In my experience I have never found an empty holding
disk after that warning, but I suppose some corner case could produce
the warning even with nothing left in the holding disk.


2. what does it mean to go "into degraded mode"?


When the tape is full, or an I/O error occurred, or there is no
suitable tape, amanda goes to plan B.
She now tries to dump as much as possible to the holding disk only,
and usually switches to incremental dumps instead of full dumps. The
"reserve" parameter in amanda.conf says how much of the holding disk
space to reserve for these incremental backups in "degraded" mode. The
default is 100%, but if you have a really large holding disk, you can
let amanda use some of the space for full dumps too by lowering the
reserve below 100%.
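As an illustration, lowering the reserve is a one-line setting in
amanda.conf ("reserve" is a percentage of the holding disk space). The
fragment below is only a sketch: the value 30 is an arbitrary example,
not a recommendation, and should be sized against your own holding
disk and the typical size of your incrementals.

```
# amanda.conf (fragment)
# Reserve 30% of the holding disk for incrementals in degraded mode;
# the remaining space may then also be used for full dumps.
# (30 is an example value, not a recommendation.)
reserve 30
```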


3. let's take omni:/home/ag for instance; in NOTES amanda says "Last
full dump of omni:/home/ag on tape ISR005 overwritten on this run",
but in DUMP SUMMARY it says

                                     DUMPER STATS            TAPER STATS
HOSTNAME     DISK        L ORIG-KB OUT-KB COMP% MMM:SS  KB/s MMM:SS  KB/s
-------------------------- --------------------------------- ------------
omni         /home/ag    1   11510   2649  23.0   0:45  58.5   0:05 576.7

meaning it was level 1. My interpretation is: since the last full dump
was overwritten, and the current one is level 1, I LOST the backup. Is
this correct? (Shall I panic?)


Yes.  :-)


4. How can I figure out the *current* *state* of the whole backup
system? I'd like to know which full backups exist on which tapes,
which levels are where, and whether I am able to recover everything
from the current tape set. The amanda mail reports are great, but they
only reflect the last amdump operation. I'd like to know about the
actual state of the whole tape set.


amoverview is a nice utility.
amadmin also has some subcommands to find such things out, like:
 amadmin xxx due        # when the next level 0 is planned
 amadmin xxx balance    # how much tape amanda expects to use each run

and of course "amadmin xxx info" gives you many details about the
backups of individual disks.


5. I read somewhere in the amanda docs that whenever there is an
end-of-tape, the next tape is automatically loaded and overwritten
with the dumps that did not fit on the previous one. Is this true?
However, in the amanda report I quoted, "These dumps were to tape
ISR005" and "The next tape Amanda expects to use is: ISR001", which
seems ok (I only have ISR001-ISR005). So can I conclude that amanda
*only* used one tape, and that the dumps that did not fit are on the
holding disk?


Two separate remarks:
1. amanda will automatically flush leftover dumps from the holding disk
to tape during a normal amdump run if you specify "autoflush yes" in
amanda.conf.

2. When you specify a changer device instead of a single tape device,
you may set runtapes to more than 1, and amanda will then use up to
that many tapes (possibly fewer, if everything fits on fewer tapes).
Besides a real tape changer, amanda can also use two or more tape
drives together and handle them as a changer, or even a "manual"
changer, where the operator has to change the tape by hand during a
run when amanda asks for it.
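Put together, the two remarks might look like this in amanda.conf.
This is a hedged sketch: "chg-manual" is just one example of a changer
script, and whichever changer you use may need extra settings (device
names, a changer config file) that depend on your hardware and Amanda
version, so check its documentation before relying on it.

```
# amanda.conf (fragment), a sketch only
autoflush yes            # write leftover holding-disk dumps to tape during the next amdump
runtapes  2              # allow one run to use up to 2 tapes (only useful with a changer)
tpchanger "chg-manual"   # example changer: the operator swaps tapes when amanda asks
```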


--
Paul Bijnens, Xplanation                            Tel  +32 16 397.511
Technologielaan 21 bus 2, B-3001 Leuven, BELGIUM    Fax  +32 16 397.512
http://www.xplanation.com/          email:  Paul.Bijnens AT xplanation DOT com