Hello, Bacula-Users!
I am aware that this is more or less off-topic, but I need the experience
of other tape drive users. ;-)
The company I work for has been using Bacula on a Debian Etch system
together with a Tandberg StorageLibrary T40 containing two IBM LTO 3 drives
(full height) for nearly a year now. (Bacula is self-compiled, not one of
Debian's packages.)
Two months ago, the nightly backup and migration jobs (from file storage
onto tape) suddenly got stuck. The symptom: Bacula marks tapes with an
error when they are reused, that is, when they are fast-forwarded to the
end of previously written data. (Tapes that were recycled and are therefore
written from the beginning are no problem.) In addition, the T40 presents
some "RAS tickets" to us (including "drive diagnostics required"),
correlated in time with the error marking. This happens on _both_ tape
drives of the library.
An example:
·Excerpt from Bacula's messages:
26-Jul 00:17 s008-sd JobId 28678: Volume "SE0032L3" previously written,
moving to end of data.
26-Jul 00:26 s008-sd JobId 28678: Error: Unable to position to end of data
on device "lib03-drive0" (/dev/nst0): ERR=dev.c:896 ioctl MTEOM error on
"lib03-drive0" (/dev/nst0). ERR=Input/output error.
26-Jul 00:26 s008-sd JobId 28678: Marking Volume "SE0032L3" in Error in
Catalog.
·Corresponding kernel messages:
Jul 26 00:22:55 s008 kernel: st0: Current: sense key: Medium Error
Jul 26 00:22:55 s008 kernel: Additional sense: Recorded entity not found
Jul 26 00:22:55 s008 kernel: Info fld=0x7ffb4c
Jul 26 00:26:48 s008 kernel: st0: Current: sense key: Medium Error
Jul 26 00:26:48 s008 kernel: Additional sense: Recorded entity not found
·Excerpt from the T40's "RAS Ticket Log" (its clock was about an hour
behind at that time, so don't be surprised by the timestamps):
07.25.08 23:21:24 R6003 Drive 1 Tape Alert 3. Hard Read and Write errors. Volume Tag: SE0032
07.25.08 23:21:24 R6039 Drive 1 Tape Alert 39. Drive diagnostic required.
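The time correlation between the kernel messages and the RAS tickets can be checked mechanically. The following is only a rough sketch: it assumes the library clock was exactly one hour behind (the mail only says "around an hour"), that the RAS date format is MM.DD.YY, and that the syslog lines belong to 2008. The 5-minute tolerance and all function names are illustrative choices, not anything taken from the T40 or from Bacula.

```python
from datetime import datetime, timedelta

# Kernel timestamps copied from the excerpt above. Syslog lines carry no
# year, so 2008 is assumed (taken from the RAS ticket date).
kernel_events = [
    datetime(2008, 7, 26, 0, 22, 55),
    datetime(2008, 7, 26, 0, 26, 48),
]
ras_raw = "07.25.08 23:21:24"  # RAS ticket timestamp, assumed MM.DD.YY

ASSUMED_CLOCK_LAG = timedelta(hours=1)  # "around an hour" behind, not measured
TOLERANCE = timedelta(minutes=5)        # arbitrary matching window


def ras_to_host_time(stamp, lag=ASSUMED_CLOCK_LAG):
    """Convert a RAS ticket timestamp to (approximate) host time."""
    return datetime.strptime(stamp, "%m.%d.%y %H:%M:%S") + lag


def matching_events(ras_stamp, events, tolerance=TOLERANCE):
    """Return the host events that fall within the tolerance window."""
    corrected = ras_to_host_time(ras_stamp)
    return [e for e in events if abs(e - corrected) <= tolerance]


corrected = ras_to_host_time(ras_raw)
matches = matching_events(ras_raw, kernel_events)
print(corrected, matches)
```

With these assumptions the ticket lands about a minute and a half before the first "Medium Error" line, which fits a ticket raised while the drive was still seeking to EOD.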
No software or hardware had been changed in the days (or weeks) before the
first occurrence. As a first test, I defined a small test job (just 3 GB)
and ran it. The problem occurred repeatedly, also with brand-new tapes
(well, also a year old, but freshly unpacked): running the job, running it
again, ejecting the tape, running the job again: Bacula reinserts the tape,
tries to append data -> "error" ...
After some days of testing, including IBM's drive test program, which
found no problem (confirmed by Tandberg after analysing the binary log
files), we brought the tape library to Tandberg for checking. We handed
over two of our tapes "in error state" along with it (one of them was one
of the "new" ones). These tapes generated RAS tickets in their drives as
well. They wrote a few terabytes onto their tapes, found no problem, and
gave us back the library together with two of their own, "working" tapes.
(At that time, lacking some information, they thought that the problem was
our tapes. In fact, this showed that the tapes themselves are being
written in some faulty manner.)
Meanwhile, after reading Bacula's messages more closely and running some
other manual tests, which showed that the error (more precisely: the
symptom) occurs only when seeking forward to EOD but cannot be reproduced
on every try, I wrote some small Bash scripts for more extensive testing:
one using btape (in case it is a Bacula problem) and one using mt together
with dd. The latter script also produced the error (on the command
'mt eod'), so the problem seems to be off-topic for this list now ...
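For readers who want to reproduce this kind of test, here is a minimal sketch of an mt/dd loop. It is not the original Bash script (which is not shown in this mail). The cycle order, the block size, the use of /dev/urandom as test data, and all function names are assumptions; only the device name /dev/nst0 and the 'mt eod' step come from the text above. The command runner is injectable so that the loop can be tried without a tape drive.

```python
import subprocess

DEVICE = "/dev/nst0"  # non-rewinding tape device, as in the log excerpt


def build_cycle(device=DEVICE, blocks=1024):
    """Commands for one cycle: rewind, seek to end of data, append data."""
    return [
        ["mt", "-f", device, "rewind"],
        ["mt", "-f", device, "eod"],
        # Assumed test payload: random data in 64k blocks.
        ["dd", "if=/dev/urandom", f"of={device}", "bs=64k", f"count={blocks}"],
    ]


def run_real(cmd):
    """Run a command for real and return its exit status."""
    return subprocess.run(cmd).returncode


def run_cycles(n, run=run_real, device=DEVICE):
    """Run n cycles and return a list of (cycle number, failed command)."""
    failures = []
    for cycle in range(1, n + 1):
        for cmd in build_cycle(device):
            if run(cmd) != 0:
                failures.append((cycle, " ".join(cmd)))
                break  # skip the rest of this cycle, start over with rewind
    return failures
```

Calling `run_cycles(100)` on the backup host would then list every cycle in which a step failed. It overwrites nothing before EOD but does append data, so it should only be used on scratch tapes.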
Annoyingly, the error occurs in only about 5 per cent of the tries, but
seemingly nearly every time Bacula itself is used. In the last few weeks,
using the mt/dd test script alone, the error has always been accompanied by
3 RAS tickets ("hard read and write errors" and twice "drive diagnostics
required").
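A failure rate of about 5 per cent means that a handful of clean runs after a component swap proves very little. As a rough guide, and assuming the tries are independent with a constant failure probability (which may well not hold here), the number of consecutive clean runs needed before a swap can be considered a likely fix can be estimated like this. The function name and the confidence levels are illustrative.

```python
import math


def clean_runs_needed(failure_rate, confidence=0.95):
    """Smallest n with (1 - failure_rate)**n <= 1 - confidence.

    If the fault were still present, the chance of seeing n clean runs
    in a row would then be at most 1 - confidence.
    """
    return math.ceil(math.log(1 - confidence) / math.log(1 - failure_rate))


runs_95 = clean_runs_needed(0.05)
runs_99 = clean_runs_needed(0.05, confidence=0.99)
print(runs_95, runs_99)
```

Under these assumptions, roughly 59 clean tries in a row are needed for 95 per cent confidence and 90 for 99 per cent.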
First question: Has anybody had the same or similar problems?
So far, we have replaced the following to rule out faulty or troublesome
components (software too, even though nothing had been changed before).
But: after each of these changes I started one of my test scripts and got
the already well-known error behaviour! :-|
·Bacula (up to version 2.4.2)
·firmware of the tape library
·firmware of the tape drives
  1. with a newer version from Tandberg's web site
  2. with a newer version already approved by Tandberg, but not yet
     officially released at that time
  3. with an even newer version from IBM's web site
·SCSI cables and terminators
·the SCSI HBA
  1. with the same model from the same manufacturer (Adaptec)
  2. with a model from another manufacturer (LSI), to check for a
     problematic kernel driver
·Debian's kernel, from 2.6.18 up to 2.6.24 ("etchnhalf")
·one tape drive (temporarily; another drive by courtesy of Tandberg)
·the computer's main board (this had been planned for some time already
 for performance reasons)
·even our storage configuration for Bacula regarding "Hardware End of
 Medium" and "Fast Forward Space File", even though btape's test found no
 problem
We also ruled out some other software and hardware components by using
·Knoppix live DVD (version 5.3.1)
·openSUSE live CD 11.0 instead of Knoppix, because Knoppix is based on
 Debian <:-)
·even another computer: a normal workstation, fitted with a SCSI HBA
 (again: once Adaptec, once LSI), connected to the T40's drives and booted
 from the Knoppix DVD
Still left to try:
·changing my test script to detect the error case and then read the data
 back from tape for comparison (to see whether and where the data itself
 becomes corrupted)
·placing the backup system in another room to rule out some weird
 electromagnetic influences (admittedly an odd idea)
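The read-back comparison from the first item could look roughly like the sketch below. It assumes that the data read back from tape has first been copied to an ordinary file with dd, so that both sides are regular files that return full blocks on every read. The function name and the 64k block size are illustrative and not part of any existing script.

```python
import io


def first_mismatch(expected, actual, block_size=64 * 1024):
    """Return the byte offset of the first difference, or None if equal.

    Both arguments are binary file objects. A stream that ends early
    counts as a mismatch at the offset where it ends.
    """
    offset = 0
    while True:
        a = expected.read(block_size)
        b = actual.read(block_size)
        if a != b:
            for i, (x, y) in enumerate(zip(a, b)):
                if x != y:
                    return offset + i
            # Common prefix is identical, so one stream is shorter.
            return offset + min(len(a), len(b))
        if not a:
            return None  # both streams ended without a difference
        offset += len(a)


# Small in-memory demonstration with one byte flipped at offset 70000.
original = b"a" * 100000
damaged = original[:70000] + b"b" + original[70001:]
demo_offset = first_mismatch(io.BytesIO(original), io.BytesIO(damaged))
print(demo_offset)
```

Knowing the offset of the first difference would show whether the damage sits at the append point (the old EOD position) or somewhere inside the freshly written data.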
Second question: Do you have any other ideas about what and _how_ to test?
"Desperate" greetings :-} from Germany,
Wolfgang Braun