Bacula-users

Re: [Bacula-users] critical error -- tape labels get corrupted, previous backups unreadable

2012-01-24 18:26:58
From: mark.bergman AT uphs.upenn DOT edu
To: Steve Ellis <ellis AT brouhaha DOT com>
Date: Tue, 24 Jan 2012 18:25:09 -0500
In the message dated: Tue, 24 Jan 2012 14:30:44 PST,
The pithy ruminations from Steve Ellis on
<Re: [Bacula-users] critical error -- tape labels get corrupted, previous
backups unreadable> were:
=> On 1/24/12 2:22 PM, mark.bergman AT uphs.upenn DOT edu wrote:
=> > In the message dated: Tue, 24 Jan 2012 19:09:15 GMT,
=> > The pithy ruminations from Martin Simmons on
=> >
=> >
=> > Thanks for replying.
=> >
=> >
=> > <Re: [Bacula-users] critical error -- tape labels get corrupted, previous
=> > backups unreadable>  were:
=> > =>  >>>>>  On Mon, 23 Jan 2012 18:47:31 -0500, mark bergman said:
=> > =>  >
=> > =>  >  I'm experiencing a critical problem where tape labels on volumes
=> > =>  >  with data get corrupted, leaving all data on the tape inaccessible
=> > =>  >  to bacula.
=> > =>  >
=> > =>  >  I'm running bacula 5.2.2 built from source, under Linux (CentOS 5.7
=> > =>  >  x86_64).
=> > =>  >
=> > =>  >  This problem has happened with approximately 15 tapes over
=> > =>  >  approximately 6 months, mostly new LTO-4 media, but some LTO-3 media
=> > =>  >  that's being reused. The problem is sporadic, appearing in
=> > =>  >  approximately 1 out of 60 tapes per week.
=> > =>  >
=> > =>  >  I do not think the issue is related to the physical media or the
=> > =>  >  tape drives. One tape was last written successfully when in drive 0,
=> > =>  >  then appears corrupt when a later job tries to use it in drive 1.
=> > =>  >  Another tape was last written successfully when in drive 1, then
=> > =>  >  appears corrupt when a later job tries to use it in drive 0.
=> > =>
=> > =>  Why do you think it isn't a hardware problem?
=> > =>
=> >
=> > I don't think it's a hardware problem because:
=> >
=> >    the vast majority of tape access (read or write) doesn't result
=> >    in corrupted labels
=> >
=> >    there aren't SCSI, tape, or bacula errors reported during backups
=> >    (within Bacula, the OS, or the tape library console)
=> >
=> >    the tapes are readable--though the data is not usable by bacula
=> >
=> >    the problem occurs on tapes that have been written and read in
=> >    both drives (this doesn't rule out some common element in the
=> >    tape library)
=> >
=> Perhaps someone else already suggested this and I missed it--this looks 
=> like somehow the tapes were rewound behind bacula's back--could that 
=> explain the behavior you are seeing?

Thanks for suggesting this. I appreciate the feedback.

Yeah, it would explain the symptom, but if I understand it correctly,
this would require:

        bacula loads a tape with a valid label

        writes N backup jobs to the tape

        "something" rewinds the tape

        bacula writes to the beginning of the tape, corrupting the label (but
        believing the job to be successful)

        bacula unloads the tape

        at some later point, bacula loads the tape for another
        job and cannot read the label

It is difficult to think of a scenario where "something rewinds" but
does not unload the tape.
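
To make the sequence above concrete, here is a toy model of it in Python. It
is only a sketch: the class and function names are made up for illustration,
nothing here is Bacula code, and it assumes the usual sequential-tape
behaviour where a write at the current head position discards whatever
followed that position.

```python
# Toy model only: a "tape" is a list of blocks, and the drive has a head
# position. ToyDrive and run_scenario are hypothetical names for this sketch.

class ToyDrive:
    def __init__(self):
        self.blocks = []      # block 0 is expected to hold the volume label
        self.position = 0     # where the next write will land

    def write(self, block):
        # Assumption: writing at the head position truncates everything
        # after it, as on sequential tape media.
        del self.blocks[self.position:]
        self.blocks.append(block)
        self.position += 1

    def rewind(self):
        self.position = 0

    def label_ok(self):
        return bool(self.blocks) and self.blocks[0].startswith("LABEL:")


def run_scenario(external_rewind):
    drive = ToyDrive()
    drive.write("LABEL:VOL001")      # tape is loaded with a valid label
    for n in range(3):               # N backup jobs are appended
        drive.write(f"job-{n}")
    if external_rewind:
        drive.rewind()               # "something" rewinds the tape
    drive.write("job-3")             # the writer believes it is appending
    return drive.label_ok(), list(drive.blocks)
```

With external_rewind=False the label and all four jobs survive. With
external_rewind=True the only block left is the last job, which matches the
symptom: no readable label and the earlier jobs unreachable.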

We don't have any software other than bacula that reads from or writes to tape.

Attempts to access the tape drives (not the autochanger) manually with 'mt'
while bacula-sd is running are blocked as bacula-sd has a lock on the tape
devices.

It is possible to use "mtx" to unload tapes from the drives while bacula is
running, and I believe that unloading an LTO tape implies that it is rewound.

However, for that to overwrite a label, all of the following would have to
be true: the tape is unloaded without the "in changer" flag being updated in
the database, "update slots" is not called after the unload, bacula tries to
append to the same tape, and the tape is loaded without triggering an attempt
to read the label, so that the 'append' overwrites the beginning of the tape.
I can't think of any scenario like that...but maybe that's possible.

I may just change the bacula-dir and bacula-sd init scripts to call
mtx-changer and unload all drives before starting either daemon. This would
help ensure consistency, regardless of which daemon starts first or is later
restarted.
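
A rough sketch of the "unload all drives first" logic, in Python rather than
init-script shell so it is easy to test. It only builds the mtx-changer
command lines from the text printed by "mtx status" and does not run
anything. The function name, the device paths, and the drive-to-device
mapping are assumptions for illustration. The argument order follows the
usual Changer Command convention (changer device, command, slot, archive
device, drive index), so check it against your own mtx-changer script.

```python
import re

# Matches "mtx status" lines for drives that currently hold a tape, e.g.
#   Data Transfer Element 0:Full (Storage Element 2 Loaded):VolumeTag = ...
DRIVE_LINE = re.compile(
    r"Data Transfer Element (\d+):Full \(Storage Element (\d+) Loaded\)"
)


def unload_commands(status_text, changer_dev, tape_devs, script="mtx-changer"):
    """Return one argv list per loaded drive found in status_text.

    tape_devs maps drive index -> archive device; it is a hypothetical
    mapping supplied by the caller, not something read from Bacula.
    """
    commands = []
    for drive, slot in DRIVE_LINE.findall(status_text):
        commands.append(
            [script, changer_dev, "unload", slot, tape_devs[int(drive)], drive]
        )
    return commands
```

An init script could capture the "mtx status" output, pass it to something
like this with, say, /dev/sg3 as the changer and /dev/nst0 and /dev/nst1 as
the drives (example paths only), and run each resulting command before
starting bacula-sd or bacula-dir.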

Thanks,

Mark

=> -se
=> 
