Bacula-users

[Bacula-users] help needed with tracking down tape drive problem

2008-10-04 15:12:33
Subject: [Bacula-users] help needed with tracking down tape drive problem
From: W.Braun AT seg DOT de
To: bacula-users AT lists.sourceforge DOT net
Date: Mon, 29 Sep 2008 15:47:04 +0200
Hello, Bacula-Users!

I am aware that this is more or less off-topic, but I need the experience
of other tape drive users to help. ;-)

The company I am working for has been using Bacula on a Debian Etch system
together with a Tandberg StorageLibrary T40 containing two IBM LTO 3 drives
(full height) for nearly a year now. (Bacula is self-compiled, not one of
Debian's packages.)

Two months ago, the nightly backup and migration jobs (from file storage
onto tape) suddenly got stuck. The symptom: Bacula marks tapes as being in
error when they are reused, that is, when they are fast-forwarded to the
end of previously written data. (Tapes that were recycled and are therefore
written from the beginning are no problem.) Additionally, the T40 presents
some "RAS tickets" to us (including "drive diagnostics required"), which
correlate in time with the error marking. This happens on _both_ tape
drives of the library.

An example:
·Excerpt from Bacula's messages:
26-Jul 00:17 s008-sd JobId 28678: Volume "SE0032L3" previously written, moving to end of data.
26-Jul 00:26 s008-sd JobId 28678: Error: Unable to position to end of data on device "lib03-drive0" (/dev/nst0): ERR=dev.c:896 ioctl MTEOM error on "lib03-drive0" (/dev/nst0). ERR=Input/output error.
26-Jul 00:26 s008-sd JobId 28678: Marking Volume "SE0032L3" in Error in Catalog.

·Corresponding kernel messages:
Jul 26 00:22:55 s008 kernel: st0: Current: sense key: Medium Error
Jul 26 00:22:55 s008 kernel:     Additional sense: Recorded entity not found
Jul 26 00:22:55 s008 kernel: Info fld=0x7ffb4c
Jul 26 00:26:48 s008 kernel: st0: Current: sense key: Medium Error
Jul 26 00:26:48 s008 kernel:     Additional sense: Recorded entity not found

·Excerpt from the T40's "RAS Ticket Log" (its clock was about an hour
behind at that time, so don't be surprised by the timestamps):
07.25.08 23:21:24 R6003  Drive 1 Tape Alert 3. Hard Read and Write errors. Volume Tag: SE0032
07.25.08 23:21:24 R6039  Drive 1 Tape Alert 39. Drive diagnostic required.


Neither software nor hardware had been changed in the days (weeks)
before the first occurrence. As a first test, I defined a small test job
(of just 3 GB) and ran it. The problem occurred repeatedly, also with brand
new tapes (well, also a year old, but freshly unpacked): running the job,
running it again, ejecting the tape, running the job again. Bacula
reinserts the tape, tries to append data -> "error" ...

After some days of testing, including a run of IBM's drive test program,
which found no problem (confirmed by Tandberg after analysing the binary
log files), we brought the tape library to Tandberg for checking. We handed
over two of our tapes "in error state" with it (one of them was one of the
"new" ones). These tapes generated RAS tickets in their drives, too. They
wrote some terabytes onto their tapes, found no problem, and gave us back
the library together with two of their own, "working" tapes. (At that time,
because of a lack of information, they thought that the problem was our
tapes. In fact, this showed that the tapes themselves get written in some
faulty manner.)

Meanwhile, after reading Bacula's messages more carefully and running some
other manual tests, which showed that the error (more precisely: the
symptom) occurs only when seeking forward to EOD, but cannot be reproduced
on every try, I wrote some small Bash scripts for more extensive testing:
one using btape (in case it is a Bacula problem) and one using mt together
with dd. The latter script also produced the error (on the command
'mt eod'), so it seems to be off-topic for this list now ...
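
A minimal sketch of what such an mt/dd test loop could look like. This is
not the original script (which was not posted), only an illustration under
assumptions: the mt-st variant of mt (which accepts 'eod'), a non-rewinding
device such as /dev/nst0, and a scratch tape that may be overwritten. The
names TAPE, RUNS, BLOCKS and run_eod_test are hypothetical.

```shell
#!/bin/bash
# Sketch of an mt/dd append test. WARNING: overwrites the tape in $TAPE.
# TAPE, RUNS, BLOCKS and run_eod_test are illustrative names (assumptions).
TAPE=${TAPE:-/dev/nst0}   # non-rewinding tape device
RUNS=${RUNS:-20}          # number of append attempts
BLOCKS=${BLOCKS:-16384}   # 64 KiB blocks per write (16384 = 1 GiB)

run_eod_test() {
    failures=0
    i=1
    mt -f "$TAPE" rewind || return 2
    # Write a first file so that there is data to seek past.
    dd if=/dev/urandom of="$TAPE" bs=64k count="$BLOCKS" 2>/dev/null || return 2
    while [ "$i" -le "$RUNS" ]; do
        mt -f "$TAPE" rewind || return 2
        # Seeking to end of data is the step that fails in the report (MTEOM).
        if mt -f "$TAPE" eod; then
            dd if=/dev/urandom of="$TAPE" bs=64k count="$BLOCKS" 2>/dev/null ||
                echo "run $i: append failed"
        else
            failures=$((failures + 1))
            echo "run $i: seek to EOD failed at $(date '+%Y-%m-%d %H:%M:%S')"
        fi
        i=$((i + 1))
    done
    echo "failures: $failures of $RUNS"
}
```

After sourcing the file, 'RUNS=100 run_eod_test' would run 100 attempts; the
printed timestamps can then be compared with the kernel log and the
library's RAS tickets.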

Annoyingly, the error occurs in only around 5 per cent of the tries, but
seemingly nearly every time Bacula itself is used. In the last few weeks,
using the mt/dd test script alone, the error has always been accompanied by
3 RAS tickets ("hard read and write errors" and twice "drive diagnostics
required").

First question: Has anybody had the same or similar problems?

So far, we have replaced the following to eliminate faulty or troublesome
components (software too, although nothing had been changed before). But
after every one of these changes I started one of my test scripts and got
the already well-known error behaviour again! :-|

·Bacula (up to version 2.4.2)
·firmware of tape library
·firmware of tape drives
 1. with a newer version from Tandberg's web site
 2. with a newer version already approved by Tandberg, but not yet
    officially released at that time
 3. with an even newer version from IBM's web site
·SCSI cables and terminators
·the SCSI HBA
 1. with the same model from the same manufacturer (Adaptec)
 2. with a model from another manufacturer (LSI), to check for a
    problematic kernel driver
·Debian's kernel from 2.6.18 up to 2.6.24 ("etchnhalf")
·one tape drive (temporarily, with another drive by courtesy of Tandberg)
·the computer's main board (had been planned for some time already for
 performance reasons)
·even our storage configuration for Bacula regarding "Hardware End of
 Medium" and "Fast Forward Space File", although btape's test found no
 problem

We also ruled out some other software and hardware components by using
·Knoppix live DVD (version 5.3.1)
·openSUSE live CD 11.0 instead of Knoppix, because Knoppix is based on
 Debian <:-)
·even another computer: a normal workstation, fitted with a SCSI HBA
 (again: once Adaptec, once LSI), connected to the T40's drives and
 booted from the Knoppix DVD

Still left to try:
·changing my test script to detect the error case and then read the data
 back from tape for comparison (to see whether and where the data itself
 becomes corrupted)
·placing the backup system in another room to rule out some weird
 electromagnetic interference (admittedly an odd idea)
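
The read-back comparison from the first item could be sketched roughly as
follows. This is only an illustration, not an existing script: it assumes
that a reference copy of the written data is kept on disk, that the tape is
already positioned at the start of the tape file to check, and that the
drive runs in variable block mode. TAPE and verify_tape_file are
hypothetical names.

```shell
#!/bin/bash
# Sketch: compare the tape file at the current position with a reference copy.
# TAPE and verify_tape_file are illustrative names (assumptions).
TAPE=${TAPE:-/dev/nst0}   # non-rewinding tape device

verify_tape_file() {
    # $1: reference file on disk that was written to tape earlier,
    #     e.g. with: dd if=<reference> of=$TAPE bs=64k
    ref=$1
    # dd reads up to the next filemark; cmp -s compares silently.
    # A read error truncates the stream and therefore shows up as a mismatch.
    if dd if="$TAPE" bs=64k 2>/dev/null | cmp -s - "$ref"; then
        echo "OK: $ref"
    else
        echo "MISMATCH: $ref"
        return 1
    fi
}
```

To check a particular tape file, the tape would first be positioned with
'mt -f /dev/nst0 rewind' followed by 'mt -f /dev/nst0 fsf N', and then
verify_tape_file called with the matching reference file.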

Second question: Do you have any other ideas about what to test, and _how_?

"Desperate" greetings :-} from Germany,
Wolfgang Braun
