Bacula-users

Re: [Bacula-users] critical error -- tape labels get corrupted, previous backups unreadable

2012-01-25 16:57:37
Subject: Re: [Bacula-users] critical error -- tape labels get corrupted, previous backups unreadable
From: mark.bergman AT uphs.upenn DOT edu
To: Martin Simmons <martin AT lispworks DOT com>
Date: Wed, 25 Jan 2012 16:55:24 -0500
In the message dated: Wed, 25 Jan 2012 19:05:44 GMT,
The pithy ruminations from Martin Simmons on 
<Re: [Bacula-users] critical error -- tape labels get corrupted, previous backu
ps unreadable> were:
=> >>>>> On Tue, 24 Jan 2012 17:22:21 -0500, mark bergman said:
=> > 
=> > In the message dated: Tue, 24 Jan 2012 19:09:15 GMT,
=> > The pithy ruminations from Martin Simmons on 
=> 
=> > I don't think it's a hardware problem because:
=> > 
=> >    the vast majority of tape access (read or write) doesn't result
=> >    in corrupted labels
=> > 
=> >    there aren't SCSI, tape, or bacula errors reported during backups
=> >    (within Bacula, the OS, or the tape library console)
=> > 
=> >    the tapes are readable--though the data is not usable by bacula
=> > 
=> >    the problem occurs on tapes that have been written and read in
=> >    both drives (this doesn't rule out some common element in the
=> >    tape library)
=> 
=> OK.
=> 
=> 

        [SNIP!]

=> > => 
=> > => > Here are the log records for a particular volume. It was labeled about
=> > => > Dec 22, 2011. First used on Jan 4 2012. Used successfully for 10 jobs
=> > => > (350.49GB), then the label was corrupted.
=> > => > 
=> > => > ------------------------------
=> > => > 04-Jan 06:24 sbia-infr-vbacula JobId 42676: Using Volume "004090" 
from 'Scratch' pool.
=> > => > 04-Jan 06:25 sbia-infr-vbacula JobId 42676: Wrote label to prelabeled 
Volume "004090" on 
=> devic
=> > 
=> > 
=> > What does "Wrote label to prelabeled Volume" mean, exactly?
=> 
=> Bacula always rewrites the label when a prelabeled volume is used for the
=> first time (but not when appending).  This is necessary because the Bacula
=> label is stored in the same logical tape file as the backup data, so there is
=> no way to start writing after the existing label.
=> 
=> 
=> >                                                             Should the 
label
=> > be changed from when it was first written with "label barcodes"?
=> > 
=> > The labels on new tapes that have been used but are not corrupted show a 
pool
=> > of "Scratch". However, the database shows that those tapes have been used 
for
=> > jobs in the working pools (Full, Incremental, Archive). For example, the
=> > database shows that volume 004056 is 'full', has ~1513GB of data from 4 
backups, and
=> > is in the "Full" pool, but the label shows the pool as "Scratch".
=> 
=> Leaving Scratch in the tape label sounds like a bug.


All of the uncorrupted tapes have the label Scratch, even those that
have been used for many jobs.

I will report this as a bug.

=> 
=> 
=> > => 
=> > => 
=> > => > At this point, the volume 004090 is unusable.  Running 'btape' on 
that volume reports 

        [SNIP!]

=> > => 
=> > => The output of
=> > => 
=> > => od -tx1 /tmp/vol4090.header | head -n 40
=> > 
=> > Sure.
=> > 
=> > 0000000 00 6a 49 7b 00 00 01 88 00 00 00 00 42 42 30 32

        [SNIP!]

=> > 0000620 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
=> > *
=> > 0176000
=> > 
=> > => 
=> > => might be useful, to see why Bacula rejects it.
=> > 
=> > I hope that's more useful to you than me! :) 
=> 
=> That contains the following:
=> 
=> Begin Job Session Record:
=> JobId             : 42901
=> VerNum            : 11
=> PoolName          : Archive
=> PoolType          : Backup
=> JobName           : archive
=> ClientName        : sbia-infr-vnfs2
=> Job (unique name) : archive.2012-01-13_06.45.00_01
=> FileSet           : archive
=> JobType           : B
=> JobLevel          : I
=> Date written      : 13-Jan-2012 11:45
=> 
=> End Job Session Record:
=> JobId             : 42901
=> VerNum            : 11
=> PoolName          : Archive
=> PoolType          : Backup
=> JobName           : archive
=> ClientName        : sbia-infr-vnfs2
=> Job (unique name) : archive.2012-01-13_06.45.00_01
=> FileSet           : archive
=> JobType           : B
=> JobLevel          : I
=> JobFiles          : 0
=> JobBytes          : 0
=> StartBlock        : 0
=> EndBlock          : 0
=> StartFile         : 84
=> EndFile           : 83
=> JobErrors         : 0
=> JobStatus         : T
=> Date written      : 13-Jan-2012 11:45
=> 
=> The records are also marked as a "continuation" but I don't know if that has
=> any significance.
=> 
=> According to your previous email, that job was indeed written to the tape:
=> 
=> > 12-Jan 12:56 sbia-infr-vbacula JobId 42879: Committing spooled data to 
Volume "004090". Despoo
=> ling 5,914 bytes ...
=> > 13-Jan 06:45 sbia-infr-vbacula JobId 42901: Committing spooled data to 
Volume "004090". Despoo
=> ling 404 bytes ...
=> 
=> Did anything unusual happen around (or between) those lines in the log?

Not that I can see. Many other backups ran (some failed due to concurrency
issues, and the fd seemed to be down on one client) in that interval.

No other jobs accessed ml600-drv1 (which had volume 004090 at 12-Jan 12:56).

=> 
=> Was the tape in the same drive for both jobs?

Yes, and not unloaded between those jobs.

Luckily I already had mtx logging enabled. As far as I can tell, volume 004090 
was
loaded into drive1 at 20120110-05:46:40. There were no other mtx commands
referring to drive1 until 20120113-19:15:04 when the tape was unloaded.

The tape from the same slot was loaded again on 20120115-16:53:52, when the
corruption was detected.

=> 
=> It looks very much like the tape was rewound between them.

Hmmm....I don't know what could do that, besides bacula itself. We don't
use any other software that access tapes.

According to the database (query 14), media 004090 contains:

+--------+---------+---------------------+------+-------+-----------+-------------------+--------+
| JobId  | Name    | StartTime           | Type | Level | Files     | Bytes     
        | Status |
+--------+---------+---------------------+------+-------+-----------+-------------------+--------+
| 42,676 | archive | 2012-01-03 04:45:48 | B    | F     | 3,431,911 | 
1,752,538,705,339 | T      |
| 42,724 | archive | 2012-01-05 06:46:58 | B    | I     |         1 |           
      0 | T      |
| 42,746 | archive | 2012-01-06 06:50:18 | B    | I     |         1 |           
      0 | T      |
| 42,768 | archive | 2012-01-07 06:47:35 | B    | I     |         1 |           
      0 | T      |
| 42,790 | archive | 2012-01-08 06:47:38 | B    | I     |         1 |           
      0 | T      |
| 42,812 | archive | 2012-01-09 06:45:04 | B    | I     |         1 |           
      0 | T      |
| 42,831 | archive | 2012-01-10 05:46:30 | B    | D     |    88,332 |    
35,222,612,986 | T      |
| 42,856 | archive | 2012-01-11 06:45:04 | B    | I     |         1 |           
      0 | T      |
| 42,879 | archive | 2012-01-12 06:45:04 | B    | I     |        36 |           
      0 | T      |
| 42,901 | archive | 2012-01-13 06:45:01 | B    | I     |         0 |           
      0 | T      |
+--------+---------+---------------------+------+-------+-----------+-------------------+--------+

The zero-byte backups are normal--this is an archive fileset, and it changes
very rarely.


Here is an analysis of the logs of all activity involving tape 004090 and 
simultaneous system events:
-----------------------------------------------------------------------------------------------------

04-Jan 06:24 sbia-infr-vbacula JobId 42676: Using Volume "004090" from 
'Scratch' pool.
04-Jan 06:25 sbia-infr-vbacula JobId 42676: 3305 Autochanger "load slot 9, 
drive 1", status is OK.
        initial load of volume to continue job 42676


04-Jan 08:23 sbia-infr-vbacula JobId 42676: Committing spooled data to Volume 
"004090". Despooling 37,003,975,390 bytes ...
04-Jan 08:28 sbia-infr-vbacula JobId 42676: Despooling elapsed time = 00:05:16, 
Transfer rate = 117.1 M Bytes/second
04-Jan 08:28 sbia-infr-vbacula JobId 42676: Sending spooled attrs to the 
Director. Despooling 1,255,585,818 bytes ...
        job 42676 completed

04-Jan 20:11 sbia-infr-vbacula JobId 42705: 3307 Issuing autochanger "unload 
slot 9, drive 1" command.
        volume 004090 unloaded

05-Jan 06:47 sbia-infr-vbacula JobId 42724: 3305 Autochanger "load slot 9, 
drive 1", status is OK.
05-Jan 06:47 sbia-infr-vbacula JobId 42724: Volume "004090" previously written, 
moving to end of data.
05-Jan 06:49 sbia-infr-vbacula JobId 42724: Ready to append to end of Volume 
"004090" at file=69.
        tape loaded to run job 42724, label read correctly

05-Jan 11:05 sbia-infr-vbacula JobId 42724: Sending spooled attrs to the 
Director. Despooling 160 bytes ...
        job 42724 complete

05-Jan 11:05 sbia-infr-vbacula JobId 42723: 3307 Issuing autochanger "unload 
slot 9, drive 1" command.
        tape 004090 unloaded

06-Jan 06:50 sbia-infr-vbacula JobId 42746: 3305 Autochanger "load slot 9, 
drive 1", status is OK.
06-Jan 06:51 sbia-infr-vbacula JobId 42746: Volume "004090" previously written, 
moving to end of data.
06-Jan 06:52 sbia-infr-vbacula JobId 42746: Ready to append to end of Volume 
"004090" at file=70.
        tape loaded to run job 42746, label read correctly

06-Jan 12:08 sbia-infr-vbacula JobId 42746: Sending spooled attrs to the 
Director. Despooling 160 bytes ...
        job 42746 complete

06-Jan 20:11 sbia-infr-vbacula JobId 42748: 3307 Issuing autochanger "unload 
slot 9, drive 1" command.
        tape 004090 unloaded

07-Jan 06:48 sbia-infr-vbacula JobId 42768: 3305 Autochanger "load slot 9, 
drive 1", status is OK.
07-Jan 06:48 sbia-infr-vbacula JobId 42768: Volume "004090" previously written, 
moving to end of data.
07-Jan 06:50 sbia-infr-vbacula JobId 42768: Ready to append to end of Volume 
"004090" at file=71.
        tape loaded to run job 42768, label read correctly
        
07-Jan 12:07 sbia-infr-vbacula JobId 42768: Sending spooled attrs to the 
Director. Despooling 160 bytes ...
        job 42768 complete

07-Jan 20:11 sbia-infr-vbacula JobId 42769: 3307 Issuing autochanger "unload 
slot 9, drive 1" command.
        tape 004090 unloaded

08-Jan 06:48 sbia-infr-vbacula JobId 42790: 3305 Autochanger "load slot 9, 
drive 1", status is OK.
08-Jan 06:48 sbia-infr-vbacula JobId 42790: Volume "004090" previously written, 
moving to end of data.
08-Jan 06:50 sbia-infr-vbacula JobId 42790: Ready to append to end of Volume 
"004090" at file=72.
        tape loaded to run job 42790, label read correctly

08-Jan 11:40 sbia-infr-vbacula JobId 42790: Sending spooled attrs to the 
Director. Despooling 160 bytes ...
        job 42790 complete

09-Jan 06:45 sbia-infr-vbacula JobId 42812: Start Backup JobId 42812, 
Job=archive.2012-01-09_06.45.00_08
09-Jan 06:45 sbia-infr-vbacula JobId 42812: Using Device "ml6000-drv1"
09-Jan 06:45 sbia-infr-vbacula JobId 42812: Spooling data ...
09-Jan 11:12 sbia-infr-vbacula JobId 42812: Job write elapsed time = 04:27:31, 
Transfer rate = 0  Bytes/second
09-Jan 11:12 sbia-infr-vbacula JobId 42812: Committing spooled data to Volume 
"004090". Despooling 495 bytes ...
09-Jan 11:12 sbia-infr-vbacula JobId 42812: Despooling elapsed time = 00:00:01, 
Transfer rate = 495  Bytes/second
09-Jan 11:12 sbia-infr-vbacula JobId 42812: Sending spooled attrs to the 
Director. Despooling 160 bytes ...
        tape remained in drive 1, used for job 42812


09-Jan 22:07 sbia-infr-vbacula JobId 42818: 3307 Issuing autochanger "unload 
slot 9, drive 1" command.
        tape 004090 unloaded

10-Jan 05:47 sbia-infr-vbacula JobId 42831: 3305 Autochanger "load slot 9, 
drive 1", status is OK.
10-Jan 05:47 sbia-infr-vbacula JobId 42831: Volume "004090" previously written, 
moving to end of data.
10-Jan 05:49 sbia-infr-vbacula JobId 42831: Ready to append to end of Volume 
"004090" at file=74.
        tape loaded to run job 42831, label read correctly
        
10-Jan 23:58 sbia-infr-vbacula JobId 42831: Sending spooled attrs to the 
Director. Despooling 32,744,074 bytes ...
        job 42831 complete

11-Jan 06:45 sbia-infr-vbacula JobId 42856: Start Backup JobId 42856, 
Job=archive.2012-01-11_06.45.00_11
11-Jan 06:45 sbia-infr-vbacula JobId 42856: Using Device "ml6000-drv1"
11-Jan 06:45 sbia-infr-vbacula JobId 42856: Spooling data ...
11-Jan 12:58 sbia-infr-vbacula JobId 42856: Job write elapsed time = 06:13:44, 
Transfer rate = 0  Bytes/second
11-Jan 12:58 sbia-infr-vbacula JobId 42856: Committing spooled data to Volume 
"004090". Despooling 495 bytes ...
11-Jan 12:58 sbia-infr-vbacula JobId 42856: Despooling elapsed time = 00:00:01, 
Transfer rate = 495  Bytes/second
11-Jan 12:58 sbia-infr-vbacula JobId 42856: Sending spooled attrs to the 
Director. Despooling 160 bytes ...
        tape remained in drive 1, used for job 42856

** Jan 11 17:45 server rebooted **

12-Jan 06:45 sbia-infr-vbacula JobId 42879: Start Backup JobId 42879, 
Job=archive.2012-01-12_06.45.00_35
12-Jan 06:45 sbia-infr-vbacula JobId 42879: Using Device "ml6000-drv1"
12-Jan 06:45 sbia-infr-vbacula JobId 42879: Volume "004090" previously written, 
moving to end of data.
12-Jan 06:45 sbia-infr-vbacula JobId 42879: Ready to append to end of Volume 
"004090" at file=83.
12-Jan 06:45 sbia-infr-vbacula JobId 42879: Spooling data ...
12-Jan 12:56 sbia-infr-vbacula JobId 42879: Job write elapsed time = 06:10:45, 
Transfer rate = 0  Bytes/second
12-Jan 12:56 sbia-infr-vbacula JobId 42879: Committing spooled data to Volume 
"004090". Despooling 5,914 bytes ...
12-Jan 12:56 sbia-infr-vbacula JobId 42879: Despooling elapsed time = 00:00:01, 
Transfer rate = 5.914 K Bytes/second
12-Jan 12:56 sbia-infr-vbacula JobId 42879: Sending spooled attrs to the 
Director. Despooling 10,271 bytes ...
        tape remained in drive 1, used for job 42876
        

13-Jan 06:45 sbia-infr-vbacula JobId 42901: Start Backup JobId 42901, 
Job=archive.2012-01-13_06.45.00_01
13-Jan 06:45 sbia-infr-vbacula JobId 42901: Using Device "ml6000-drv1"
13-Jan 06:45 sbia-infr-vbacula JobId 42901: Spooling data ...
13-Jan 06:45 sbia-infr-vbacula JobId 42901: Job write elapsed time = 00:00:01, 
Transfer rate = 0  Bytes/second
13-Jan 06:45 sbia-infr-vbacula JobId 42901: Committing spooled data to Volume 
"004090". Despooling 404 bytes ...
13-Jan 06:45 sbia-infr-vbacula JobId 42901: Despooling elapsed time = 00:00:01, 
Transfer rate = 404  Bytes/second
13-Jan 06:45 sbia-infr-vbacula JobId 42901: Sending spooled attrs to the 
Director. Despooling 0 bytes ...
        tape remained in drive 1, used for job 42901


13-Jan 19:15 sbia-infr-vbacula JobId 42902: 3307 Issuing autochanger "unload 
slot 9, drive 1" command.
        tape 004090 unloaded


14-Jan 06:45 sbia-infr-vbacula JobId 42924: Start Backup JobId 42924, 
Job=archive.2012-01-14_06.45.00_32
15-Jan 16:52 sbia-infr-vbacula JobId 42924: 3307 Issuing autochanger "unload 
slot 13, drive 1" command.
15-Jan 16:53 sbia-infr-vbacula JobId 42924: Using Device "ml6000-drv1"
15-Jan 16:53 sbia-infr-vbacula JobId 42924: 3301 Issuing autochanger "loaded? 
drive 1" command.
15-Jan 16:53 sbia-infr-vbacula JobId 42924: 3302 Autochanger "loaded? drive 1", 
result: nothing loaded.
15-Jan 16:53 sbia-infr-vbacula JobId 42924: 3304 Issuing autochanger "load slot 
9, drive 1" command.
15-Jan 16:54 sbia-infr-vbacula JobId 42924: 3305 Autochanger "load slot 9, 
drive 1", status is OK.
15-Jan 16:54 sbia-infr-vbacula JobId 42924: Please mount Volume "004090" or 
label a new one for:
        attempt to use tape for job 42924, label corrupt

-----------------------------------------------------------------------------------------------------


The fact that the server rebooted between job 42856 and job 42879 makes
me wonder if something in the fibre/SCSI initialization or bacula-sd
startup causes tapes to be rewound but not ejected.

I find it interesting that the label was overwritten by job 42901. If the
server reboot (and SCSI bus reset, fibre loop initialization, bacula-sd
restart, bacula-dir restart) caused the tape[s] in the drive[s] to rewind,
wouldn't the label on tape 004090 have been overwritten by job 42879?

I don't see any indication that the tape was ejected or rewound between
jobs 42879 and 42901, yet that's where the corruption seems to have
happened.

Is there any other information that I can provide? Do you suggest running
bacula-sd or bacula-dir at any particular debugging level to try to get
more detail if this happens again?

I plan to do the following:

        change the daemon startup order, so that bacula-sd starts BEFORE 
bacula-dir

        change /etc/init.d/bacula-sd to unload all tape drives before starting 
the
        daemon

Thanks for helping look into this,

Mark


=> 
=> __Martin

------------------------------------------------------------------------------
Keep Your Developer Skills Current with LearnDevNow!
The most comprehensive online learning library for Microsoft developers
is just $99.99! Visual Studio, SharePoint, SQL - plus HTML5, CSS3, MVC3,
Metro Style Apps, more. Free future releases when you subscribe now!
http://p.sf.net/sfu/learndevnow-d2d
_______________________________________________
Bacula-users mailing list
Bacula-users AT lists.sourceforge DOT net
https://lists.sourceforge.net/lists/listinfo/bacula-users