Bacula-users

Re: [Bacula-users] critical error -- tape labels get corrupted, previous backups unreadable

2012-01-25 14:08:15
Subject: Re: [Bacula-users] critical error -- tape labels get corrupted, previous backups unreadable
From: Martin Simmons <martin AT lispworks DOT com>
To: bacula-users AT lists.sourceforge DOT net
Date: Wed, 25 Jan 2012 19:05:44 GMT
>>>>> On Tue, 24 Jan 2012 17:22:21 -0500, mark bergman said:
> 
> In the message dated: Tue, 24 Jan 2012 19:09:15 GMT,
> The pithy ruminations from Martin Simmons on 

> I don't think it's a hardware problem because:
> 
>       the vast majority of tape access (read or write) doesn't result
>       in corrupted labels
> 
>       there aren't SCSI, tape, or bacula errors reported during backups
>       (within Bacula, the OS, or the tape library console)
> 
>       the tapes are readable--though the data is not usable by bacula
> 
>       the problem occurs on tapes that have been written and read in
>       both drives (this doesn't rule out some common element in the
>       tape library)

OK.


> => Bacula only looks at the label when a volume is mounted, so it could be
> => written unsuccessfully but you wouldn't know that until later.
> 
> Interesting... thanks for bringing this up...I'm checking the logs
> (which only go back to early Dec) to see if any of the corrupted tapes
> got unloaded, reloaded and then written to successfully....
> 
> [pause]
> 
> Some of the tapes that are not corrupt have gone through multiple
> load/unload cycles for different jobs....so the act of reloading a tape in
> order to append new jobs does not always cause corruption in the label.
> 
> The initial label (via "label barcodes", assigning the tape to the
> Scratch pool) must be valid, or bacula would detect the corruption
> when it later loads the tape and uses is for its first job.
> 
> Does bacula actually relabel a tape with a working pool (Full,
> Incremental, or Archive), or will it continue to have the pool name
> (Scratch) that was assigned during "label barcodes"?
> 
> I'm also going to load all the other tapes in the changer and
> check their labels with btape and dump.
> 
> [pause]
> 
> Done. I loaded, dumped the first ~640MB from the tape, and used 'btape' to
> read the label from each tape in the changer (35). There were no I/O errors,
> all tapes were 'readable', though 4 have corrupt labels.
> 
> => 
> => 
> => > Here are the log records for a particular volume. It was labeled about
> => > Dec 22, 2011. First used on Jan 4 2012. Used successfully for 10 jobs
> => > (350.49GB), then the label was corrupted.
> => > 
> => > ------------------------------
> => > 04-Jan 06:24 sbia-infr-vbacula JobId 42676: Using Volume "004090" from 
> 'Scratch' pool.
> => > 04-Jan 06:25 sbia-infr-vbacula JobId 42676: Wrote label to prelabeled 
> Volume "004090" on devic
> 
> 
> What does "Wrote label to prelabeled Volume" mean, exactly?

Bacula always rewrites the label when a prelabeled volume is used for the
first time (but not when appending).  This is necessary because the Bacula
label is stored in the same logical tape file as the backup data, so there is
no way to start writing after the existing label.


>                                                             Should the label
> be changed from when it was first written with "label barcodes"?
> 
> The labels on new tapes that have been used but are not corrupted show a pool
> of "Scratch". However, the database shows that those tapes have been used for
> jobs in the working pools (Full, Incremental, Archive). For example, the
> database shows that volume 004056 is 'full', has ~1513GB of data from 4 
> backups, and
> is in the "Full" pool, but the label shows the pool as "Scratch".

Leaving Scratch in the tape label sounds like a bug.


> => 
> => 
> => > At this point, the volume 004090 is unusable.  Running 'btape' on that 
> volume reports 
> => > ----------------------------
> => > [root@sbia-infr1 working]# ../bin/btape -v ml6000-drv0
> => > Tape block granularity is 1024 bytes.
> => > btape: butil.c:290 Using device: "ml6000-drv0" for writing.
> => > 23-Jan 18:14 btape JobId 0: 3301 Issuing autochanger "loaded? drive 0"
> => > command.
> => > 23-Jan 18:14 btape JobId 0: 3302 Autochanger "loaded? drive 0", result 
> is Slot
> => > 9.
> => > btape: btape.c:477 open device "ml6000-drv0" (/dev/tape0-ml6000): OK
> => > *readlabel
> => > btape: btape.c:526 Volume has no label.
> => > 
> => > Volume Label:
> => > Id                : **error**VerNo             : 0
> => > VolName           : 
> => > PrevVolName       : 
> => > VolFile           : 0
> => > LabelType         : Unknown 0
> => > LabelSize         : 0
> => > PoolName          : 
> => > MediaType         : 
> => > PoolType          : 
> => > HostName          : 
> => > Date label written: -4712-01-01 at 00:00
> => > ----------------------------
> => > 
> => > 
> => > 
> => > 
> => > However, there _is_ data on the tape. I'm able to read the tape via dd
> => > (ibs=64k). The ASCII data at the beginning of the tape shows fragments 
> of the
> => > Bacula label and data that corresponds to some of the backups:
> => 
> => The output of
> => 
> => od -tx1 /tmp/vol4090.header | head -n 40
> 
> Sure.
> 
> 0000000 00 6a 49 7b 00 00 01 88 00 00 00 00 42 42 30 32
> 0000020 00 00 00 2b 4f 0e 1c 72 ff ff ff fc ff ff 58 6b
> 0000040 00 00 00 9a 42 61 63 75 6c 61 20 31 2e 30 20 69
> 0000060 6d 6d 6f 72 74 61 6c 0a 00 00 00 00 0b 00 00 a7
> 0000100 95 00 04 b6 67 65 24 6f 0e 00 00 00 00 00 00 00
> 0000120 00 41 72 63 68 69 76 65 00 42 61 63 6b 75 70 00
> 0000140 61 72 63 68 69 76 65 00 73 62 69 61 2d 69 6e 66
> 0000160 72 2d 76 6e 66 73 32 00 61 72 63 68 69 76 65 2e
> 0000200 32 30 31 32 2d 30 31 2d 31 33 5f 30 36 2e 34 35
> 0000220 2e 30 30 5f 30 31 00 61 72 63 68 69 76 65 00 00
> 0000240 00 00 42 00 00 00 49 74 6e 49 70 52 68 4e 35 4a
> 0000260 56 46 58 68 77 2f 2b 30 2f 73 41 73 44 00 ff ff
> 0000300 ff fb ff ff 58 6b 00 00 00 be 42 61 63 75 6c 61
> 0000320 20 31 2e 30 20 69 6d 6d 6f 72 74 61 6c 0a 00 00
> 0000340 00 00 0b 00 00 a7 95 00 04 b6 67 65 26 18 80 00
> 0000360 00 00 00 00 00 00 00 41 72 63 68 69 76 65 00 42
> 0000400 61 63 6b 75 70 00 61 72 63 68 69 76 65 00 73 62
> 0000420 69 61 2d 69 6e 66 72 2d 76 6e 66 73 32 00 61 72
> 0000440 63 68 69 76 65 2e 32 30 31 32 2d 30 31 2d 31 33
> 0000460 5f 30 36 2e 34 35 2e 30 30 5f 30 31 00 61 72 63
> 0000500 68 69 76 65 00 00 00 00 42 00 00 00 49 74 6e 49
> 0000520 70 52 68 4e 35 4a 56 46 58 68 77 2f 2b 30 2f 73
> 0000540 41 73 44 00 00 00 00 00 00 00 00 00 00 00 00 00
> 0000560 00 00 00 00 00 00 00 00 00 00 00 54 00 00 00 53
> 0000600 00 00 00 00 00 00 00 54 00 00 00 00 00 00 00 00
> 0000620 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
> *
> 0176000
> 
> => 
> => might be useful, to see why Bacula rejects it.
> 
> I hope that's more useful to you than me! :) 

That contains the following:

Begin Job Session Record:
JobId             : 42901
VerNum            : 11
PoolName          : Archive
PoolType          : Backup
JobName           : archive
ClientName        : sbia-infr-vnfs2
Job (unique name) : archive.2012-01-13_06.45.00_01
FileSet           : archive
JobType           : B
JobLevel          : I
Date written      : 13-Jan-2012 11:45

End Job Session Record:
JobId             : 42901
VerNum            : 11
PoolName          : Archive
PoolType          : Backup
JobName           : archive
ClientName        : sbia-infr-vnfs2
Job (unique name) : archive.2012-01-13_06.45.00_01
FileSet           : archive
JobType           : B
JobLevel          : I
JobFiles          : 0
JobBytes          : 0
StartBlock        : 0
EndBlock          : 0
StartFile         : 84
EndFile           : 83
JobErrors         : 0
JobStatus         : T
Date written      : 13-Jan-2012 11:45

The records are also marked as a "continuation" but I don't know if that has
any significance.

According to your previous email, that job was indeed written to the tape:

> 12-Jan 12:56 sbia-infr-vbacula JobId 42879: Committing spooled data to Volume 
> "004090". Despooling 5,914 bytes ...
> 13-Jan 06:45 sbia-infr-vbacula JobId 42901: Committing spooled data to Volume 
> "004090". Despooling 404 bytes ...

Did anything unusual happen around (or between) those lines in the log?

Was the tape in the same drive for both jobs?

It looks very much like the tape was rewound between them.

__Martin

------------------------------------------------------------------------------
Keep Your Developer Skills Current with LearnDevNow!
The most comprehensive online learning library for Microsoft developers
is just $99.99! Visual Studio, SharePoint, SQL - plus HTML5, CSS3, MVC3,
Metro Style Apps, more. Free future releases when you subscribe now!
http://p.sf.net/sfu/learndevnow-d2d
_______________________________________________
Bacula-users mailing list
Bacula-users AT lists.sourceforge DOT net
https://lists.sourceforge.net/lists/listinfo/bacula-users