Re: [Bacula-users] critical error -- tape labels get corrupted, previous backups unreadable

2012-02-07 15:07:35
From: Martin Simmons <martin AT lispworks DOT com>
To: bacula-users AT lists.sourceforge DOT net
Date: Tue, 7 Feb 2012 20:05:50 GMT
>>>>> On Mon, 06 Feb 2012 20:15:10 -0500, mark bergman said:
> 
> In the message dated: Mon, 06 Feb 2012 12:43:41 GMT,
> The pithy ruminations from Martin Simmons on 
> <Re: [Bacula-users] critical error -- tape labels get corrupted, previous
> backups unreadable> were:
> 
> => 
> => 
> => > When the current backup is finished, I'll extract the beginning data
> => > on each of 003231 and 000312. Is there anything you recommend in terms
> => > of checking the data on tape to determine whether the tape begins with
> => > random garbage (possibly caused by the shutdown, startup, scsi reset,
> => > etc.) or if it begins with valid bacula data that happened to overwrite
> => > the label instead of being appended?
> => 
> => Do you have a File device defined in the SD?  If so, label a new File
> => volume
> 
> No.
> 
> => and then append the data from the start of the tape to the end of the file
> => volume using dd and cat.  You can then examine the file volume using
> => bls -v -j (the File label will allow bls to read it).
> 
> 
> Can I do this against a tape directly?

You could try copying the data from a freshly labeled tape, appending the data
from the start of the bad tape, and then writing the result back to the start
of another tape.

It would be much simpler to add the File device though.
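
For the File device route, the dd and cat step could look roughly like the
sketch below.  It is untested against real hardware; the function name and
all paths are made up for illustration, and it assumes the tape is already
positioned at its start and that 64k is at least as large as the block size
used on the tape.

```shell
#!/bin/sh
# Sketch: build a File volume that bls can read by appending the first
# blocks of a suspect tape to a freshly labeled File volume.
# The function name and every path passed in are hypothetical examples.

# build_probe_volume LABELED_VOLUME TAPE_DEVICE OUTPUT [BLOCKS]
build_probe_volume() {
    labeled=$1          # freshly labeled, otherwise empty File volume
    tape=$2             # non-rewinding tape device, positioned at its start
    out=$3              # new file that will hold label + tape data
    blocks=${4:-1024}   # number of 64k reads to copy from the tape

    # Save the start of the tape to a temporary file with dd.
    tmp=$(mktemp) || return 1
    dd if="$tape" of="$tmp" bs=64k count="$blocks" 2>/dev/null || {
        rm -f "$tmp"
        return 1
    }

    # Concatenate: File label first, tape data after it.
    cat "$labeled" "$tmp" > "$out"
    status=$?
    rm -f "$tmp"
    return $status
}
```

As far as I know, bls looks the device up in bacula-sd.conf, so the output
file would need to sit in the Archive Device directory of the File device
before running bls -v -j on it.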


> 
> => 
> => 
> => > Does anyone have suggestions of how to troubleshoot this further,
> => > or how to make the daemon startup process more resistant to causing
> => > any corruption?
> => 
> => The important information missing is whether 000312 was already corrupted
> => at 01-Feb 20:11.  You could add some commands to the startup part of
> 
> 
> Hmmm....The only way that I could imagine that happening is if:
> 
>       bacula loads the tape as needed
> 
>       bacula reads the volume label
> 
>       {somehow the tape is rewound, either when the tape is first loaded, or
>       after some backups are written}
> 
>       bacula writes to tape

Yes.
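
If that sequence did happen, the start of the tape should hold ordinary
Bacula data blocks rather than a volume label.  A rough first check on a
dumped copy of the first block might look like the sketch below.  It relies
on my recollection of the on-tape format (a "BB01"/"BB02" identifier at
byte offset 12 of each block header, and the string "Bacula 1.0 immortal"
inside a volume label record), so treat those details as assumptions to
verify against the Bacula source.  The function name is invented.

```shell
#!/bin/sh
# Sketch: classify the first block of a tape dump.
# Prints "label"   if it looks like a Bacula block containing a volume label,
#        "data"    if it looks like a Bacula block without a label,
#        "unknown" if it does not look like a Bacula block at all.
# The offset and the two marker strings are assumptions (see text above).

# classify_dump FILE
classify_dump() {
    f=$1

    # Read the 4-byte block identifier expected at byte offset 12.
    # NUL bytes are dropped so the shell can hold the value safely.
    magic=$(dd if="$f" bs=1 skip=12 count=4 2>/dev/null | tr -d '\000')

    case $magic in
        BB01|BB02) ;;                    # plausible Bacula block header
        *) echo unknown; return 0 ;;
    esac

    # Look for the label identifier within the first 1024 bytes.
    if dd if="$f" bs=1024 count=1 2>/dev/null |
        LC_ALL=C grep -q 'Bacula 1.0 immortal'; then
        echo label
    else
        echo data
    fi
}
```

A result of "data" on a tape that should start with a label would be
consistent with an overwrite after an unexpected rewind; "unknown" would
point more towards garbage.  bls on a File volume remains the more reliable
check.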


> The only thing outside of bacula that touches the tape drive in any way is the
> /etc/init.d/bacula-sd script, which unloads any tapes before starting the
> daemon & after shutting down the daemon.
> 
> => /etc/init.d/bacula-sd script before it unloads all tapes.  E.g. do mt status,
> => mt rewind and grab a copy of the first few blocks on any loaded tapes.
> 
> Sure. I'm thinking that I may modify /opt/bacula/scripts/mtx-changer to
> replace the "unload" operation with:
> 
>       mt rewind
>       dd if=$TAPE of=/opt/bacula/working/dump_$VOLUMEID.`date '+%Y-%m-%d_%T'` ibs=64k count=1024
>       mtx -f $ctl load $slot $drive
> 
> Is that a suitable number of blocks to dump?

Yes, that should be plenty.
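
If it helps, the rewind-and-dump step could be wrapped in a small function
so that the init script and mtx-changer can share it.  This is only a
sketch: the function name and arguments are invented, it assumes a
non-rewinding device node such as the one bacula-sd already uses, and it
skips the rewind when the argument is not a character device so that it can
be tried out on an ordinary file first.

```shell
#!/bin/sh
# Sketch: save the first blocks of a loaded tape before it is unloaded.
# The function name, arguments and output naming are illustrative only.

# dump_tape_head DEVICE OUTDIR VOLUMEID [BLOCKS]
# Prints the name of the dump file on success.
dump_tape_head() {
    dev=$1
    outdir=$2
    volid=$3
    blocks=${4:-1024}

    # Timestamp without colons, to keep the file name easy to handle.
    stamp=$(date '+%Y-%m-%d_%H%M%S')
    out="$outdir/dump_${volid}.$stamp"

    # Rewind only when DEVICE is a real character device (a tape drive).
    if [ -c "$dev" ]; then
        mt -f "$dev" rewind || return 1
    fi

    # Copy up to BLOCKS reads of 64k each from the start of the tape.
    dd if="$dev" of="$out" ibs=64k count="$blocks" 2>/dev/null || return 1
    echo "$out"
}
```

Calling it just before the existing unload command, with the drive's
device, /opt/bacula/working and the volume name as arguments, would leave
one dump file per unload.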


> => 
> => Also, you say that infrastructure1 server crashes.  Maybe the crash caused
> => the tape to be rewound and some buffer flushed to start of the tape?
> 
> I can't see how...
> 
>       if there was unwritten data in a buffer within the memory of the
>       server infrastructure1, then when the server crashes it wouldn't
>       get written to tape. The 'infrastructure' machines are part of
>       an HA cluster...in this crash, the other nodes determined that
>       infrastructure1 had lost communication with the quorum disk,
>       and they powered off the node...even if that action reset the
>       fibre loop and caused the tape library to rewind both tapes
>       (unlikely), I don't know how any buffers on the infrastructure1
>       server could be written when the power was out.
> 
>       if there was unwritten data in a buffer within the memory of
>       the tape library, then I believe it must be written before any
>       rewind command will be honored. If infrastructure1 sends
>       data to the tape drive, that data is buffered, infrastructure1 then
>       crashes, infrastructure2 runs /etc/init.d/bacula-sd (which ejects tapes,
>       thereby rewinding them)...the data within the buffer in the tape
>       drive would still be written before the rewind/eject command was
>       executed.

Yes, that would be true in an ideal world.  OTOH, it probably depends on the
nature of the crash.  All kinds of undesirable things might happen before the
crash itself (including SCSI resets etc).

__Martin
