Bacula-users

Re: [Bacula-users] critical error -- tape labels get corrupted, previous backups unreadable

2012-01-24 17:24:22
From: mark.bergman AT uphs.upenn DOT edu
To: Martin Simmons <martin AT lispworks DOT com>
Date: Tue, 24 Jan 2012 17:22:21 -0500
Thanks for replying.

In the message dated: Tue, 24 Jan 2012 19:09:15 GMT,
The pithy ruminations from Martin Simmons on
<Re: [Bacula-users] critical error -- tape labels get corrupted, previous backups unreadable> were:
=> >>>>> On Mon, 23 Jan 2012 18:47:31 -0500, mark bergman said:
=> > 
=> > I'm experiencing a critical problem where tape labels on volumes with data
=> > get corrupted, leaving all data on the tape inaccessible to bacula.
=> > 
=> > I'm running bacula 5.2.2 built from source, under Linux (CentOS 5.7
=> > x86_64).
=> > 
=> > This problem has happened with approximately 15 tapes over approximately 6
=> > months, mostly new LTO-4 media, but some LTO-3 media that's being reused.
=> > The problem is sporadic, appearing in approximately 1 out of 60 tapes
=> > per week.
=> > 
=> > I do not think the issue is related to the physical media or the tape
=> > drives. One tape was last written successfully when in drive 0, then appears
=> > corrupt when a later job tries to use it in drive 1. Another tape was last
=> > written successfully when in drive 1, then appears corrupt when a later job
=> > tries to use it in drive 0.
=> 
=> Why do you think it isn't a hardware problem?
=> 

I don't think it's a hardware problem because:

        the vast majority of tape access (read or write) doesn't result
        in corrupted labels

        there aren't SCSI, tape, or bacula errors reported during backups
        (within Bacula, the OS, or the tape library console)

        the tapes are readable--though the data is not usable by bacula

        the problem occurs on tapes that have been written and read in
        both drives (this doesn't rule out some common element in the
        tape library)

=> Bacula only looks at the label when a volume is mounted, so it could be
=> written unsuccessfully but you wouldn't know that until later.

Interesting... thanks for bringing this up...I'm checking the logs
(which only go back to early Dec) to see if any of the corrupted tapes
got unloaded, reloaded and then written to successfully....

[pause]

Some of the tapes that are not corrupt have gone through multiple
load/unload cycles for different jobs....so the act of reloading a tape in
order to append new jobs does not always cause corruption in the label.

The initial label (via "label barcodes", assigning the tape to the
Scratch pool) must be valid, or bacula would detect the corruption
when it later loads the tape and uses it for its first job.

Does bacula actually relabel a tape with a working pool (Full,
Incremental, or Archive), or will it continue to have the pool name
(Scratch) that was assigned during "label barcodes"?

I'm also going to load all the other tapes in the changer and
check their labels with btape and dump.

[pause]

Done. I loaded each of the 35 tapes in the changer, dumped the first ~640MB
from it, and used 'btape' to read its label. There were no I/O errors and
all tapes were 'readable', though 4 have corrupt labels.

=> 
=> 
=> > Here are the log records for a particular volume. It was labeled about
=> > Dec 22, 2011. First used on Jan 4 2012. Used successfully for 10 jobs
=> > (350.49GB), then the label was corrupted.
=> > 
=> > ------------------------------
=> > 04-Jan 06:24 sbia-infr-vbacula JobId 42676: Using Volume "004090" from 'Scratch' pool.
=> > 04-Jan 06:25 sbia-infr-vbacula JobId 42676: Wrote label to prelabeled Volume "004090" on devic


What does "Wrote label to prelabeled Volume" mean, exactly? Should the label
be changed from when it was first written with "label barcodes"?

The labels on new tapes that have been used but are not corrupted show a pool
of "Scratch". However, the database shows that those tapes have been used for
jobs in the working pools (Full, Incremental, Archive). For example, the
database shows that volume 004056 is 'full', has ~1513GB of data from 4
backups, and is in the "Full" pool, but the label shows the pool as "Scratch".

It appears that:

        label barcodes successfully labels a tape and puts it into the Scratch
        pool

        when the tape is first used, bacula logs the message "Wrote
        label to prelabeled Volume", but the label is not updated ('btape
        readlabel' reports Scratch), and jobs can be written to the tape

        at some time the label on some tapes gets corrupted

        when the tape is reloaded, bacula detects the corruption 


=> e "ml6000-drv1" (/dev/tape1-ml6000)
=> > 04-Jan 06:25 sbia-infr-vbacula JobId 42676: New volume "004090" mounted on device "ml6000-drv1" (/dev/tape1-ml6000) at 04-Jan-2012 06:25.
=> 
=> Is /dev/tape1-ml6000 a non-rewinding device (like /dev/nst0)?

Yes, so successive writes to the same tape without unloading should
append. 

The tape hardware consists of 2x LTO-4 drives in a 40 slot Dell ML6000
(rebranded Adic) changer.

=> 
=> 
=> > At this point, the volume 004090 is unusable.  Running 'btape' on that volume reports
=> > ----------------------------
=> > [root@sbia-infr1 working]# ../bin/btape -v ml6000-drv0
=> > Tape block granularity is 1024 bytes.
=> > btape: butil.c:290 Using device: "ml6000-drv0" for writing.
=> > 23-Jan 18:14 btape JobId 0: 3301 Issuing autochanger "loaded? drive 0" command.
=> > 23-Jan 18:14 btape JobId 0: 3302 Autochanger "loaded? drive 0", result is Slot 9.
=> > btape: btape.c:477 open device "ml6000-drv0" (/dev/tape0-ml6000): OK
=> > *readlabel
=> > btape: btape.c:526 Volume has no label.
=> > 
=> > Volume Label:
=> > Id                : **error**VerNo             : 0
=> > VolName           : 
=> > PrevVolName       : 
=> > VolFile           : 0
=> > LabelType         : Unknown 0
=> > LabelSize         : 0
=> > PoolName          : 
=> > MediaType         : 
=> > PoolType          : 
=> > HostName          : 
=> > Date label written: -4712-01-01 at 00:00
=> > ----------------------------
=> > 
=> > 
=> > 
=> > 
=> > However, there _is_ data on the tape. I'm able to read the tape via dd
=> > (ibs=64k). The ASCII data at the beginning of the tape shows fragments of the
=> > Bacula label and data that corresponds to some of the backups:
=> 
=> The output of
=> 
=> od -tx1 /tmp/vol4090.header | head -n 40

Sure.

0000000 00 6a 49 7b 00 00 01 88 00 00 00 00 42 42 30 32
0000020 00 00 00 2b 4f 0e 1c 72 ff ff ff fc ff ff 58 6b
0000040 00 00 00 9a 42 61 63 75 6c 61 20 31 2e 30 20 69
0000060 6d 6d 6f 72 74 61 6c 0a 00 00 00 00 0b 00 00 a7
0000100 95 00 04 b6 67 65 24 6f 0e 00 00 00 00 00 00 00
0000120 00 41 72 63 68 69 76 65 00 42 61 63 6b 75 70 00
0000140 61 72 63 68 69 76 65 00 73 62 69 61 2d 69 6e 66
0000160 72 2d 76 6e 66 73 32 00 61 72 63 68 69 76 65 2e
0000200 32 30 31 32 2d 30 31 2d 31 33 5f 30 36 2e 34 35
0000220 2e 30 30 5f 30 31 00 61 72 63 68 69 76 65 00 00
0000240 00 00 42 00 00 00 49 74 6e 49 70 52 68 4e 35 4a
0000260 56 46 58 68 77 2f 2b 30 2f 73 41 73 44 00 ff ff
0000300 ff fb ff ff 58 6b 00 00 00 be 42 61 63 75 6c 61
0000320 20 31 2e 30 20 69 6d 6d 6f 72 74 61 6c 0a 00 00
0000340 00 00 0b 00 00 a7 95 00 04 b6 67 65 26 18 80 00
0000360 00 00 00 00 00 00 00 41 72 63 68 69 76 65 00 42
0000400 61 63 6b 75 70 00 61 72 63 68 69 76 65 00 73 62
0000420 69 61 2d 69 6e 66 72 2d 76 6e 66 73 32 00 61 72
0000440 63 68 69 76 65 2e 32 30 31 32 2d 30 31 2d 31 33
0000460 5f 30 36 2e 34 35 2e 30 30 5f 30 31 00 61 72 63
0000500 68 69 76 65 00 00 00 00 42 00 00 00 49 74 6e 49
0000520 70 52 68 4e 35 4a 56 46 58 68 77 2f 2b 30 2f 73
0000540 41 73 44 00 00 00 00 00 00 00 00 00 00 00 00 00
0000560 00 00 00 00 00 00 00 00 00 00 00 54 00 00 00 53
0000600 00 00 00 00 00 00 00 54 00 00 00 00 00 00 00 00
0000620 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
*
0176000

=> 
=> might be useful, to see why Bacula rejects it.

I hope that's more useful to you than me! :) 
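
For readers who want to decode the start of that dump by hand, here is a
minimal Python sketch. It rests on an assumption that is not confirmed
anywhere in this thread: that the tape uses Bacula's "BB02" block layout, a
24-byte big-endian block header (checksum, block size, block number, 4-byte
id, session id, session time) followed by a 12-byte record header (file
index, stream, data length). The label-type numbering in LABEL_TYPES is
likewise recalled from Bacula's documentation and should be checked against
the source. The function name parse_first_record is hypothetical.

```python
import struct

# First 36 bytes of the od dump of volume 004090, copied from this thread.
DUMP_HEX = (
    "00 6a 49 7b 00 00 01 88 00 00 00 00 42 42 30 32 "
    "00 00 00 2b 4f 0e 1c 72 ff ff ff fc ff ff 58 6b "
    "00 00 00 9a"
)

# ASSUMPTION: Bacula's negative file-index values for label records.
LABEL_TYPES = {
    -1: "PRE_LABEL", -2: "VOL_LABEL", -3: "EOM_LABEL",
    -4: "SOS_LABEL", -5: "EOS_LABEL", -6: "EOT_LABEL",
}


def parse_first_record(raw):
    """Decode the first block header and record header of a tape dump."""
    # ASSUMED block header: 3 x uint32, 4-byte id, 2 x uint32 (big-endian).
    (checksum, block_size, block_number,
     block_id, session_id, session_time) = struct.unpack(">III4sII", raw[:24])
    # ASSUMED record header: int32 file index, int32 stream, uint32 length.
    file_index, stream, data_len = struct.unpack(">iiI", raw[24:36])
    return {
        "checksum": checksum,
        "block_size": block_size,
        "block_number": block_number,
        "block_id": block_id,
        "session_id": session_id,
        "session_time": session_time,
        "file_index": file_index,
        "stream": stream,
        "data_len": data_len,
        "record_type": LABEL_TYPES.get(file_index, "data or unknown"),
    }


info = parse_first_record(bytes.fromhex(DUMP_HEX))
for key, value in info.items():
    print(f"{key:13}: {value}")
```

If the layout assumption holds, the first record on this tape decodes with
file index -4, which in the numbering above is a start-of-session label
rather than a volume label. That reading depends entirely on the assumed
layout, so it is worth verifying against the Bacula source before drawing
conclusions from it.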

I can provide dumps from headers of 3 other tapes with corrupted labels.
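
For comparing several such dumps, the 'od -tx1' text can be turned back into
bytes. The sketch below is an illustration only: the function name
od_to_bytes is hypothetical, it assumes the dump starts at offset 0, and it
assumes od's usual conventions (octal offsets in the first column, a lone
"*" meaning the previous row repeats up to the next offset). The second row
of the sample is synthetic.

```python
def od_to_bytes(text):
    """Rebuild the raw bytes from 'od -tx1' output that starts at offset 0."""
    out = bytearray()
    last_row = b""
    repeat = False
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue
        if line == "*":
            # od collapsed one or more rows identical to the previous one.
            repeat = True
            continue
        fields = line.split()
        offset = int(fields[0], 8)  # the offset column is octal
        if repeat and last_row:
            # Re-expand the collapsed rows up to the offset of this line.
            while len(out) < offset:
                out += last_row[: offset - len(out)]
        repeat = False
        row = bytes(int(h, 16) for h in fields[1:])
        out += row
        if row:
            last_row = row
    return bytes(out)


# First row copied from the dump in this thread; second row is synthetic.
SAMPLE = """\
0000000 00 6a 49 7b 00 00 01 88 00 00 00 00 42 42 30 32
0000020 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
*
0000100
"""

data = od_to_bytes(SAMPLE)
print(len(data), data[12:16])
```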

Thanks,

Mark

=> 
=> __Martin

_______________________________________________
Bacula-users mailing list
Bacula-users AT lists.sourceforge DOT net
https://lists.sourceforge.net/lists/listinfo/bacula-users