Re: [Bacula-users] critical error -- tape labels get corrupted, previous backups unreadable

2012-02-07 12:00:44
Subject: Re: [Bacula-users] critical error -- tape labels get corrupted, previous backups unreadable
From: Brian Debelius <bdebelius AT intelesyscorp DOT com>
To: mark.bergman AT uphs.upenn DOT edu
Date: Tue, 07 Feb 2012 11:58:13 -0500
Although it will generate lots of output, have you tried turning on
debugging on the DIR and SD to see if anything shows up there?
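
In case it helps, here is a minimal sketch of one way to turn debugging on
from bconsole. The level (200) and the Storage resource name (LTO-Library)
are placeholders I made up, not values from this thread, and trace=1 is
assumed to send the SD output to a trace file in its working directory.
The function only prints the commands, so they can be reviewed first.

```shell
#!/bin/sh
# Print the bconsole commands that raise the debug level on the Director
# and on one Storage daemon. Nothing is sent to Bacula by this function.
bacula_debug_cmds() {
    level="${1:-200}"            # debug level; higher means more output
    storage="${2:-LTO-Library}"  # hypothetical Storage resource name
    printf 'setdebug level=%s dir\n' "$level"
    printf 'setdebug level=%s trace=1 storage=%s\n' "$level" "$storage"
}

# Usage (not run here): bacula_debug_cmds 200 MyStorage | bconsole
```

Running the same commands with level=0 afterwards should switch the extra
output off again.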


On 2/6/2012 8:15 PM, mark.bergman AT uphs.upenn DOT edu wrote:
> In the message dated: Mon, 06 Feb 2012 12:43:41 GMT,
> The pithy ruminations from Martin Simmons on
> <Re: [Bacula-users] critical error -- tape labels get corrupted, previous
> backups unreadable> were:
>
>
> Martin,
>
> Thanks again for continuing to respond...I appreciate the feedback and
> troubleshooting help.
>
>
> =>  >>>>>  On Fri, 03 Feb 2012 20:04:44 -0500, mark bergman said:
> =>  >
> =>  >  I've added more logging to /etc/init.d/bacula-sd to confirm when tapes
> =>  >  are ejected and to timestamp the SCSI release commands.
> =>  >
> =>  >  Is it possible that bacula flagged tapes 003231 and 000312 as being in
> =>  >  the drives because they were loaded when the server crashed, even though
> =>  >  they were later ejected (outside of bacula's control)? Could this cause
> =>  >  bacula to believe that the tapes were at EOT when they do get loaded, and
> =>  >  bacula then immediately begins writing (corrupting the label)? [Unlikely
> =>  >  that bacula would try to write before reading the label, and would then
> =>  >  read the label after corrupting the tapes.]
> =>
> =>  I don't see how this could happen.  Bacula issues a rewind command when it
>
> I don't see how it could happen either....but I'm searching for any
> explanation.
>
> =>  mounts a tape and should then know that the tape is at the start.
>
> That's what I'd expect too.
>
>
> =>
> =>
> =>  >  When the current backup is finished, I'll extract the beginning data
> =>  >  on each of 003231 and 000312. Is there anything you recommend in terms
> =>  >  of checking the data on tape to determine whether the tape begins with
> =>  >  random garbage (possibly caused by the shutdown, startup, scsi reset,
> =>  >  etc.) or if it begins with valid bacula data that happened to overwrite
> =>  >  the label instead of being appended?
> =>
> =>  Do you have a File device defined in the SD?  If so, label a new File volume
>
> No.
>
> =>  and then append the data from the start of the tape to the end of the file
> =>  volume using dd and cat.  You can then examine the file volume using
> =>  bls -v -j (the File label will allow bls to read it).
>
>
> Can I do this against a tape directly?
>
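
For a first look without a File device, a raw dump can at least be scanned
for the marker strings Bacula is assumed to write. This is only a sketch:
it assumes volume labels contain the text "Bacula 1.0 immortal" and that
block headers carry the identifier "BB02", and both should be checked
against the installed version. The function name scan_dump is made up.

```shell
#!/bin/sh
# Return success if the first 64 KiB of file $1 contains fixed string $2.
has_text() {
    dd if="$1" bs=65536 count=1 2>/dev/null |
        LC_ALL=C grep -a -F -- "$2" >/dev/null
}

# Summarise what the start of a dump looks like:
#   label=yes   -> the assumed volume-label text is present
#   blockid=yes -> the assumed block-header identifier is present
scan_dump() {
    label=no
    blockid=no
    if has_text "$1" 'Bacula 1.0 immortal'; then label=yes; fi
    if has_text "$1" 'BB02'; then blockid=yes; fi
    echo "label=$label blockid=$blockid"
}
```

Under those assumptions, "label=no blockid=yes" would hint at valid Bacula
blocks written over the label, and "label=no blockid=no" at garbage. It is
a rough filter, not a substitute for bls.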
> =>
> =>
> =>  >  Does anyone have suggestions of how to troubleshoot this further,
> =>  >  or how to make the daemon startup process more resistant to causing
> =>  >  any corruption?
> =>
> =>  The important information missing is whether 000312 was already corrupted
> =>  at 01-Feb 20:11.  You could add some commands to the startup part of
>
>
> Hmmm....The only way that I could imagine that happening is if:
>
>       bacula loads the tape as needed
>
>       bacula reads the volume label
>
>       {somehow the tape is rewound, either when the tape is first loaded, or
>       after some backups are written}
>
>       bacula writes to tape
>
> The only thing outside of bacula that touches the tape drive in any way is the
> /etc/init.d/bacula-sd script, which unloads any tapes before starting the
> daemon and after shutting down the daemon.
>
> =>  /etc/init.d/bacula-sd script before it unloads all tapes.  E.g. do mt status,
> =>  mt rewind and grab a copy of the first few blocks on any loaded tapes.
>
> Sure. I'm thinking that I may modify /opt/bacula/scripts/mtx-changer to
> replace the "unload" operation with:
>
>       mt rewind
>       dd if=$TAPE of=/opt/bacula/working/dump_$VOLUMEID.`date '+%Y-%m-%d_%T'` ibs=64k count=1024
>       mtx -f $ctl unload $slot $drive
>
> Is that a suitable number of blocks to dump? I've got the dumps from 5
> corrupted tapes, and I'm trying to see if they have anything in common (for
> example, maybe the first 128k is corrupted, followed by valid data from dumps
> that should have been appended to the tape).
>
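
The rewind-and-dump step above could be wrapped in a small function so the
same code serves both mtx-changer and the init script. This is a sketch
under assumptions: the function name and its arguments are hypothetical,
the device is expected to be a non-rewinding tape node, and the rewind is
skipped for anything that is not a character device so the function can be
tried against an ordinary file first.

```shell
#!/bin/sh
# Copy the first blocks of a loaded tape to a timestamped file.
# Prints the path of the dump on success.
dump_tape_head() {
    dev="$1"            # tape device node (or a plain file when testing)
    outdir="$2"         # directory that receives the dump
    volume="$3"         # volume name or barcode, used only in the file name
    count="${4:-1024}"  # number of 64k reads, as in the snippet above
    stamp="$(date '+%Y-%m-%d_%H%M%S')"
    out="$outdir/dump_${volume}.${stamp}"
    # Rewind only real character devices; plain files start at offset 0.
    if [ -c "$dev" ]; then
        mt -f "$dev" rewind || return 1
    fi
    dd if="$dev" of="$out" ibs=64k count="$count" 2>/dev/null || return 1
    echo "$out"
}

# Usage (not run here): dump_tape_head "$TAPE" /opt/bacula/working "$VOLUMEID"
```

The timestamp uses %H%M%S rather than %T only to keep colons out of the
file name. The function does not unload the tape, so the mtx call would
still follow it.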
> =>
> =>  Also, you say that infrastructure1 server crashes.  Maybe the crash caused
> =>  the tape to be rewound and some buffer flushed to start of the tape?
>
> I can't see how...
>
>       if there was unwritten data in a buffer within the memory of the
>       server infrastructure1, then when the server crashes it wouldn't
>       get written to tape. The 'infrastructure' machines are part of
>       an HA cluster...in this crash, the other nodes determined that
>       infrastructure1 had lost communication with the quorum disk,
>       and they powered off the node...even if that action reset the
>       fibre loop and caused the tape library to rewind both tapes
>       (unlikely), I don't know how any buffers on the infrastructure1
>       server could be written when the power was out.
>
>       if there was unwritten data in a buffer within the memory of
>       the tape library, then I believe it must be written before any
>       rewind command will be honored. If infrastructure1 sends
>       data to the tape drive, that data is buffered, infrastructure1 then
>       crashes, infrastructure2 runs /etc/init.d/bacula-sd (which ejects tapes,
>       thereby rewinding them)...the data within the buffer in the tape
>       drive would still be written before the rewind/eject command was
>       executed.
>
> Thanks again for your help,
>
> Mark
>
> =>
> =>  __Martin
> =>
>
> ------------------------------------------------------------------------------
> Keep Your Developer Skills Current with LearnDevNow!
> The most comprehensive online learning library for Microsoft developers
> is just $99.99! Visual Studio, SharePoint, SQL - plus HTML5, CSS3, MVC3,
> Metro Style Apps, more. Free future releases when you subscribe now!
> http://p.sf.net/sfu/learndevnow-d2d
> _______________________________________________
> Bacula-users mailing list
> Bacula-users AT lists.sourceforge DOT net
> https://lists.sourceforge.net/lists/listinfo/bacula-users

