Bacula-users

Re: [Bacula-users] critical error -- tape labels get corrupted, previous backups unreadable

2012-02-06 20:17:14
From: mark.bergman AT uphs.upenn DOT edu
To: Martin Simmons <martin AT lispworks DOT com>
Date: Mon, 06 Feb 2012 20:15:10 -0500
In the message dated: Mon, 06 Feb 2012 12:43:41 GMT,
The pithy ruminations from Martin Simmons on
<Re: [Bacula-users] critical error -- tape labels get corrupted, previous backups unreadable> were:


Martin,

Thanks again for continuing to respond...I appreciate the feedback and
troubleshooting help.


=> >>>>> On Fri, 03 Feb 2012 20:04:44 -0500, mark bergman said:
=> > 
=> > I've added more logging to /etc/init.d/bacula-sd to confirm when tapes are
=> > ejected and to timestamp the SCSI release commands.
=> > 
=> > Is it possible that bacula flagged tapes 003231 and 000312 as being in
=> > the drives because they were loaded when the server crashed, even though
=> > they were later ejected (outside of bacula's control)? Could this cause
=> > bacula to believe that the tapes were at EOT when they do get loaded, and
=> > bacula then immediately begins writing (corrupting the label)? [Unlikely
=> > that bacula would try to write before reading the label, and would then
=> > read the label after corrupting the tapes.]
=> 
=> I don't see how this could happen.  Bacula issues a rewind command when it

I don't see how it could happen either....but I'm searching for any
explanation.

=> mounts a tape and should then know that the tape is at the start.

That's what I'd expect too.


=> 
=> 
=> > When the current backup is finished, I'll extract the beginning data
=> > on each of 003231 and 000312. Is there anything you recommend in terms
=> > of checking the data on tape to determine whether the tape begins with
=> > random garbage (possibly caused by the shutdown, startup, scsi reset,
=> > etc.) or if it begins with valid bacula data that happened to overwrite
=> > the label instead of being appended?
=> 
=> Do you have a File device defined in the SD?  If so, label a new File volume

No.

=> and then append the data from the start of the tape to the end of the file
=> volume using dd and cat.  You can then examine the file volume using bls -v -j
=> (the File label will allow bls to read it).


Can I do this against a tape directly?
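
The dd-and-cat step Martin describes could look roughly like the sketch
below. It is only an illustration: the function name, the paths, and the
default block count are assumptions, and it presumes a File volume has
already been labelled through bacula and that the tape has been rewound.

```shell
#!/bin/sh
# Sketch only: names, paths, and counts are hypothetical.

# append_tape_head SRC DUMP FILEVOL [COUNT]
# Copy up to COUNT blocks from SRC (e.g. a rewound non-rewinding tape
# device) into DUMP, then append DUMP to an already-labelled File volume.
append_tape_head() {
    src=$1; dump=$2; filevol=$3; count=${4:-1024}
    # ibs=64k: read at most 64 KiB per tape block
    dd if="$src" of="$dump" ibs=64k count="$count" 2>/dev/null || return 1
    # Append after the existing File label; never overwrite the volume
    cat "$dump" >> "$filevol"
}
```

It might be called as "append_tape_head /dev/nst0 /tmp/000312.head
/path/to/filevolume", after which the File volume can be examined with
bls -v -j as suggested above. Keeping the intermediate dump file means
the raw bytes remain available for later comparison.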

=> 
=> 
=> > Does anyone have suggestions of how to troubleshoot this further,
=> > or how to make the daemon startup process more resistant to causing
=> > any corruption?
=> 
=> The important information missing is whether 000312 was already corrupted at
=> 01-Feb 20:11.  You could add some commands to the startup part of


Hmmm....The only way that I could imagine that happening is if:

        bacula loads the tape as needed

        bacula reads the volume label

        {somehow the tape is rewound, either when the tape is first loaded, or
        after some backups are written}

        bacula writes to tape

The only thing outside of bacula that touches the tape drive in any way is the
/etc/init.d/bacula-sd script, which unloads any tapes before starting the
daemon & after shutting down the daemon.

=> /etc/init.d/bacula-sd script before it unloads all tapes.  E.g. do mt status,
=> mt rewind and grab a copy of the first few blocks on any loaded tapes.
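
The three steps suggested here (mt status, mt rewind, copy the first
blocks) might be wrapped in a helper along these lines. This is a sketch
under assumptions: the function name, the output naming scheme, and the
MT override variable are hypothetical, and the device path has to match
the local drive.

```shell
#!/bin/sh
# Sketch only: assumes mt(1) and dd(1) are available.

# snapshot_tape DEVICE OUTDIR [COUNT]
# Save the drive status, rewind, and copy up to COUNT blocks from the
# start of the tape into a timestamped file. Prints the dump file name.
snapshot_tape() {
    dev=$1; outdir=$2; count=${3:-1024}
    stamp=$(date '+%Y-%m-%d_%H%M%S')
    out="$outdir/head_$(basename "$dev")_$stamp"
    # Record the status before the tape is moved; a failure is not fatal
    ${MT:-mt} -f "$dev" status > "$out.status" 2>&1
    ${MT:-mt} -f "$dev" rewind || return 1
    # 64 KiB reads, matching the ibs=64k used elsewhere in this thread
    dd if="$dev" of="$out.bin" ibs=64k count="$count" 2>/dev/null || return 1
    echo "$out.bin"
}
```

Because the status is captured before the rewind, the saved file shows
where the drive thought the tape was positioned at startup. The tape is
left positioned after the copied blocks, so it should be rewound or
unloaded afterwards.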

Sure. I'm thinking that I may modify /opt/bacula/scripts/mtx-changer to
replace the "unload" operation with:

        mt rewind
        dd if=$TAPE of=/opt/bacula/working/dump_$VOLUMEID.`date '+%Y-%m-%d_%T'` ibs=64k count=1024
        mtx -f $ctl unload $slot $drive

Is that a suitable number of blocks to dump? I've got the dumps from 5
corrupted tapes, and I'm trying to see if they have anything in common (for
example, maybe the first 128k is corrupted, followed by valid data from dumps
that should have been appended to the tape).
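
One way to look for a common pattern across the dumps is sketched below.
It rests on two assumptions that should be checked first: that bacula's
version-2 block headers carry the ASCII marker "BB02" (as recalled from
the bacula source, not confirmed in this thread), and that the local
grep supports the -a, -b and -o options. The function names are
hypothetical.

```shell
#!/bin/sh
# Sketch only: helpers for comparing saved tape dumps.

# first_bacula_block FILE
# Print the byte offset of the first "BB02" marker in FILE (assumed to
# be bacula's block header ID), or nothing if the marker is absent.
first_bacula_block() {
    LC_ALL=C grep -abo 'BB02' "$1" | head -n 1 | cut -d: -f1
}

# block_sums FILE
# Print "<chunk number> <checksum>" for every 64 KiB chunk of FILE, so
# that two dumps can be compared chunk by chunk with diff.
block_sums() {
    size=$(( $(wc -c < "$1") ))
    n=0
    while [ $((n * 65536)) -lt "$size" ]; do
        sum=$(dd if="$1" bs=65536 skip="$n" count=1 2>/dev/null | cksum | cut -d' ' -f1)
        echo "$n $sum"
        n=$((n + 1))
    done
}
```

If first_bacula_block reports a similar non-zero offset for every
corrupted tape, that would be consistent with a fixed amount of garbage
preceding otherwise valid data. Running diff on the block_sums output of
two dumps shows which 64 KiB chunks they share.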

=> 
=> Also, you say that infrastructure1 server crashes.  Maybe the crash caused the
=> tape to be rewound and some buffer flushed to start of the tape?

I can't see how...

        if there was unwritten data in a buffer within the memory of the
        server infrastructure1, then when the server crashes it wouldn't
        get written to tape. The 'infrastructure' machines are part of
        an HA cluster...in this crash, the other nodes determined that
        infrastructure1 had lost communication with the quorum disk,
        and they powered off the node...even if that action reset the
        fibre loop and caused the tape library to rewind both tapes
        (unlikely), I don't know how any buffers on the infrastructure1
        server could be written when the power was out.

        if there was unwritten data in a buffer within the memory of
        the tape library, then I believe it must be written before any
        rewind command will be honored. If infrastructure1 sends
        data to the tape drive, that data is buffered, infrastructure1 then
        crashes, infrastructure2 runs /etc/init.d/bacula-sd (which ejects tapes,
        thereby rewinding them)...the data within the buffer in the tape
        drive would still be written before the rewind/eject command was
        executed.
        
Thanks again for your help,

Mark

=> 
=> __Martin
=> 

_______________________________________________
Bacula-users mailing list
Bacula-users AT lists.sourceforge DOT net
https://lists.sourceforge.net/lists/listinfo/bacula-users