Bacula-users

Re: [Bacula-users] critical error -- tape labels get corrupted, previous backups unreadable

2012-02-03 20:07:27
Subject: Re: [Bacula-users] critical error -- tape labels get corrupted, previous backups unreadable
From: mark.bergman AT uphs.upenn DOT edu
To: Martin Simmons <martin AT lispworks DOT com>
Date: Fri, 03 Feb 2012 20:04:44 -0500
In the message dated: Mon, 30 Jan 2012 19:12:11 GMT,
The pithy ruminations from Martin Simmons on 
<Re: [Bacula-users] critical error -- tape labels get corrupted, previous backu
ps unreadable> were:
=> >>>>> On Wed, 25 Jan 2012 16:55:24 -0500, mark bergman said:
=> > 
=> > The fact that the server rebooted between job 42856 and job 42879 makes
=> > me wonder if something in the fibre/SCSI initialization or bacula-sd
=> > startup causes tapes to be rewound but not ejected.
=> 
=> The bacula-sd startup will rewind the tape, but Bacula should know about 
that.
=> 
=> If there is a fibre/SCSI reset without Bacula knowing about it, then that
=> could cause it.

The problem happened again. Two tapes seem to be corrupted. (I say
"seem to be" because bacula loads the correct tape, but then asks for
another tape or to have that media labeled. I cannot check the suspect
tapes as a backup is running, so 'btape' is unusable.)

Here's a timetable of the events from the bacula-dir and mtx logs,
with comments {in curly brackets}:

        {bacula-sd/bacula-dir server working successfully for ~8 days}
        {mtx log shows which volumes are loaded}
    20120129-06:46:20 Doing mtx -f /dev/changer-ml6000 load 4 0
        {VolumeTag=003231}
    20120130-00:37:40 Doing mtx -f /dev/changer-ml6000 load 7 1
        {VolumeTag=000312}
    20120130-00:38:08 Device /dev/tape1-ml6000 - not ready, retrying...
    20120130-00:38:09 Device /dev/tape1-ml6000 - not ready, retrying...
    20120131-15:59:55 Doing mtx -f /dev/changer-ml6000 -- to get count of slots
    20120131-15:59:55 Parms: /dev/changer-ml6000 slots 0 /dev/tape0-ml6000 0
    20120131-16:00:05 Doing mtx -f /dev/changer-ml6000 -- to list volumes
    20120131-16:00:05 Parms: /dev/changer-ml6000 list 0 /dev/tape0-ml6000 0
        {no mtx log entries until 20120201-19:41:10}


    01-Feb 06:45 sbia-infr-vbacula JobId 43324: Start Backup JobId 43324, 
Job=corfu-full.2012-02-01_06.45.00_00
    01-Feb 06:45 sbia-infr-vbacula JobId 43324: Using Device "ml6000-drv1"
    01-Feb 06:45 sbia-infr-vbacula JobId 43324: Committing spooled data to 
Volume "000312". Despooling 180,926,597 bytes ...
        {last successful backup to volume 000312, drive 1}

    01-Feb 06:45 sbia-infr-vbacula JobId 43325: Start Backup JobId 43325, 
Job=archive.2012-02-01_06.45.00_01
    01-Feb 06:45 sbia-infr-vbacula JobId 43325: Using Device "ml6000-drv0"
    01-Feb 06:45 sbia-infr-vbacula JobId 43325: Committing spooled data to 
Volume "003231". Despooling 404 bytes ...
        {last backup before infrastructure1 shutdown}


    Feb 1 18:44 infrastructure1 server running bacula-sd & bacula-dir crashes

    Feb  1 18:47 bacula-sd starts on infrastructure 3 (HA-initiated startup)
            the /etc/init.d/bacula-sd script was previously
            modified to do the following before starting the daemon
            and after shutting down the daemon:

                unload all tapes

                release any SCSI reservations on each tape drive
                        sg_prevent --allow --verbose /dev/tape0-ml6000
                        sg_prevent --allow --verbose /dev/tape1-ml6000


    Feb  1 18:47 bacula-dir starts on infrastructure 3 (HA-initiated startup)

    Feb 1 18:49 infrastructure1 server starts

    Feb  1 18:52:25 bacula-dir stopped on infrastructure 3 (manual shutdown)

    Feb  1 18:52:25 bacula-sd stopped on infrastructure 3 (manual shutdown)

    Feb 1 18:53 bacula-sd daemon started on infrastructure1 (normal startup), 
PID 8083

        {mtx logs show query of devices in changer}
    20120201-18:53:47 Doing mtx -f /dev/changer-ml6000 0 -- to find what is 
loaded
    20120201-18:53:47 Parms: /dev/changer-ml6000 loaded 0 /dev/tape0-ml6000 0
    20120201-18:53:57 Doing mtx -f /dev/changer-ml6000 0 -- to find what is 
loaded
    20120201-18:53:57 Parms: /dev/changer-ml6000 loaded 0 /dev/tape0-ml6000 0

    Feb 1 18:54 bacula-dir daemon started on infrastructure1 (normal startup), 
PID 8118

        {first backup begins on bacula-sd/bacula-dir server}

    01-Feb 19:41 sbia-infr-vbacula JobId 43326: Start Backup JobId 43326, 
Job=agora-full.2012-02-01_19.41.00_02
    01-Feb 19:41 sbia-infr-vbacula JobId 43326: Using Device "ml6000-drv0"

        {this should be empty from /etc/init.d/bacula-sd}

    01-Feb 19:41 sbia-infr-vbacula JobId 43326: 3301 Issuing autochanger 
"loaded? drive 0" command.
    01-Feb 19:41 sbia-infr-vbacula JobId 43326: 3302 Autochanger "loaded?  
drive 0", result: nothing loaded.
    01-Feb 19:41 sbia-infr-vbacula JobId 43326: 3304 Issuing autochanger "load 
slot 27, drive 0" command.

    20120201-19:41:10 Doing mtx -f /dev/changer-ml6000 load 27 0
        {there was no unload command issued by bacula...implying that the
        modified /etc/init.d/bacula-sd startup script unloaded the tape from
        slot 0}

    01-Feb 19:41 sbia-infr-vbacula JobId 43326: 3305 Autochanger "load slot 27, 
drive 0", status is OK.
    01-Feb 19:41 sbia-infr-vbacula JobId 43326: Volume "000335" previously 
written, moving to end of data.
    01-Feb 19:42 sbia-infr-vbacula JobId 43326: Ready to append to end of 
Volume "000335" at file=63.
    01-Feb 19:42 sbia-infr-vbacula JobId 43326: Spooling data ...

        {backup to volume 000335 in slot 0 is OK}

        {next backup job starts, uses drive 1}

    01-Feb 20:11 sbia-infr-vbacula JobId 43328: Start Backup JobId 43328, 
Job=home-2.2012-02-01_20.11.00_05
    01-Feb 20:11 sbia-infr-vbacula JobId 43328: Using Device "ml6000-drv1" 
    01-Feb 20:11 sbia-infr-vbacula JobId 43328: 3301 Issuing autochanger 
"loaded?  drive 1" command.
    01-Feb 20:11 sbia-infr-vbacula JobId 43328: 3302 Autochanger "loaded? drive 
1", result: nothing loaded.

        {as expected, nothing is loaded in drive 1}

    01-Feb 20:11 sbia-infr-vbacula JobId 43328: 3304 Issuing autochanger "load 
slot 7, drive 1" command.

    20120201-20:11:10 Doing mtx -f /dev/changer-ml6000 load 7 1
        {load required as the tape was unloaded by /etc/init.d/bacula-sd}

    01-Feb 20:11 sbia-infr-vbacula JobId 43328: 3305 Autochanger "load slot 7, 
drive 1", status is OK.
    01-Feb 20:11 sbia-infr-vbacula JobId 43328: Please mount Volume "000312" or 
label a new one for:

        {bacula cannot read the label on tape 000312}

        {used bconsole to change status of volume 000312 to "Disabled", bacula
        requests, loads, and uses a different tape}

    01-Feb 21:00 sbia-infr-vbacula JobId 43328: Recycled volume "000313"
    01-Feb 21:00 sbia-infr-vbacula JobId 43328: Using Volume "000313" from 
'Scratch' pool.
    01-Feb 21:00 sbia-infr-vbacula JobId 43328: 3307 Issuing autochanger 
"unload slot 7, drive 1" command.
    01-Feb 21:01 sbia-infr-vbacula JobId 43328: 3304 Issuing autochanger "load 
slot 14, drive 1" command.
    01-Feb 21:01 sbia-infr-vbacula JobId 43328: 3305 Autochanger "load slot 14, 
drive 1", status is OK.
    01-Feb 21:01 sbia-infr-vbacula JobId 43328: Recycled volume "000313" on 
device "ml6000-drv1" (/dev/tape1-ml6000), all previous data lost.
    01-Feb 21:01 sbia-infr-vbacula JobId 43328: Spooling data ...


        {after several backups are written to tape 00313 in drive 1, bacula
        attempts to use 003231 for a job in a different pool}

    02-Feb 06:45 sbia-infr-vbacula JobId 43347: Start Backup JobId 43347, 
Job=archive.2012-02-02_06.45.00_34
    02-Feb 06:48 sbia-infr-vbacula JobId 43347: 3307 Issuing autochanger 
"unload slot 14, drive 1" command.
    02-Feb 06:50 sbia-infr-vbacula JobId 43347: Using Device "ml6000-drv1"
    02-Feb 06:50 sbia-infr-vbacula JobId 43347: 3301 Issuing autochanger 
"loaded? drive 1" command.
    02-Feb 06:50 sbia-infr-vbacula JobId 43347: 3302 Autochanger "loaded?  
drive 1", result: nothing loaded.
    02-Feb 06:50 sbia-infr-vbacula JobId 43347: 3304 Issuing autochanger "load 
slot 4, drive 1" command.
    02-Feb 06:50 sbia-infr-vbacula JobId 43347: 3305 Autochanger "load slot 4, 
drive 1", status is OK.
    02-Feb 06:51 sbia-infr-vbacula JobId 43347: Please mount Volume "003231" or 
label a new one for:


Based on this, it looks like the corruption is somehow associated with
some combination of the bacula-sd and bacula-dir stopping and starting
or with the SCSI release command....but that was as close as I was to
finding a solution earlier.

I've added more logging to /etc/init.d/bacula-sd to confirm when tapes are
ejected and to timestamp the SCSI release commands.

Is it possible that bacula flagged tapes 003231 and 000312 as being in
the drives because they were loaded when the server crashed, even though
they were later ejected (outside of bacula's control)? Could this cause
bacula to believe that the tapes were at EOT when they do get loaded, and
bacula then immediately begins writing (corrupting the label)? [Unlikely
that bacula would try to write before reading the label, and would then
read the label after corrupting the tapes.]

When the current backup is finished, I'll extract the beginning data
on each of 003231 and 000312. Is there anything you recommend in terms
of checking the data on tape to determine whether the tape begins with
random garbage (possibly caused by the shutdown, startup, scsi reset,
etc.) or if it begins with valid bacula data that happened to overwrite
the label instead of being appended?

Does anyone have suggestions of how to troubleshoot this further,
or how to make the daemon startup process more resistant to causing
any corruption?

Any help is greatly appreciated.

Thanks,

Mark

=> 
=> 
=> > Is there any other information that I can provide? Do you suggest running
=> > bacula-sd or bacula-dir at any particular debugging level to try to get
=> > more detail if this happens again?
=> 
=> I suspect that are no debug messages that would show anything useful here,
=> because bacula-sd is probably unaware of the problem.
=> 
=> If you have some way to query the actual tape position periodically then that
=> would be useful (I don't know if Linux has something similar, but FreeBSD's 
mt
=> status command reports this and it can be used on the control mode device 
even
=> if Bacula has the drive open).
=> 
=> __Martin
> 

------------------------------------------------------------------------------
Try before you buy = See our experts in action!
The most comprehensive online learning library for Microsoft developers
is just $99.99! Visual Studio, SharePoint, SQL - plus HTML5, CSS3, MVC3,
Metro Style Apps, more. Free future releases when you subscribe now!
http://p.sf.net/sfu/learndevnow-dev2
_______________________________________________
Bacula-users mailing list
Bacula-users AT lists.sourceforge DOT net
https://lists.sourceforge.net/lists/listinfo/bacula-users