Hi,
I'm unable to recover all the data on an SDLT-2 tape from drive 2 of 4,
but I *can* recover the same data from any of the other drives in the
same library!!! When using the afflicted drive, the recovery fails on
the first 277 files but then succeeds with the remaining ones. However,
the recovery is completely successful when run on any of the other 3
drives. Go figure! All the drives are the same type and have the same
firmware. Has anyone ever seen this situation before? I've provided all
details and error messages further below. This tape was originally
labeled on this library, and there have been no changes to the software
or equipment.
1. How could this phenomenon happen?
The closest I've seen to this type of behavior was one time when I had a
problem inventorying several tapes in an LTO tape library. There was one
drive wherein NetWorker would complain when loading 3 different LTO
tapes, but I could inventory these same tapes in the other LTO drives
with no complaints. Other than those 3 tapes, though, all the other
tapes worked fine in that drive. It was the weirdest thing.
2. Quantum suggested running their diagnostic tool against the drive,
but there are no guarantees that it will fail the test. If it does,
great, but if not, I still don't trust it. So they're sending me a new
drive, and I was just gonna swap it out and retry. Does this sound
reasonable? What conclusions can I draw from all this?
<<< Details >>>
We're running NetWorker 7.2.2 with a Solaris primary backup server and a
Linux storage node with an attached Quantum M1800 SDLT tape library (4
SDLT-600 drives). Hardware compression is enabled. No client side
compression is used. Drives 1-2 are daisy chained and attached to
channel A of a dual channel HBA, and drives 3-4 are likewise daisy
chained and use channel B of the same HBA. The *afflicted* drive is
drive 2 of 4. All backups use the storage node, and data is written
directly to tape, no VTL. All the drives have the same firmware and are
identical. The NetWorker 'inquire' tool reports the same information for
each drive, and each is configured in nwadmin the same way. They all
have the same settings. We use variable block size as indicated in our
/etc/stinit.def file which I've provided at the end. Also, nwadmin
indicates that the Volume block size is 128 KB for all our labeled
tapes, including the one I was working with. Furthermore, the drive has
not indicated that it needs to be cleaned, but I went ahead and cleaned
it and retried, with the same problem.
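
A rough way to reason about which component the symptoms point at, given
the cabling described above: list the components each drive's data path
goes through, then keep only those used by the failing drive and by none
of the working ones. The component labels below are illustrative names,
not anything NetWorker or the library reports, and the model is an
assumption in that it leaves out the cable segment and terminator specific
to drive 2's position in the daisy chain, which would be just as unique to
that drive as the drive itself.

```python
# Data paths as described in the post; labels are illustrative only.
PATHS = {
    1: {"hba", "channel_A", "drive_1"},
    2: {"hba", "channel_A", "drive_2"},
    3: {"hba", "channel_B", "drive_3"},
    4: {"hba", "channel_B", "drive_4"},
}
FAILING = {2}  # drives on which the recover fails


def suspects(paths, failing):
    """Components shared by every failing drive and used by no working drive."""
    bad = set.intersection(*(paths[d] for d in failing))
    good = set().union(*(paths[d] for d in paths if d not in failing))
    return bad - good


print(suspects(PATHS, FAILING))
```

Under these assumptions the HBA and channel A drop out, because drive 1
uses both and works, which leaves only what is private to drive 2.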
I recently ran a large level full backup (934 GB) that spanned 4 SDLT-2
tapes. I then cloned the save set. mminfo shows nothing suspect as far
as the save set flags or clone flags. I then marked the original save
set 'suspect' and then spot checked by using nwrecover to recover a few
directories from the clone copy - all OK. But then I decided to recover
one more directory, and then ... Wham! NetWorker threw out a bunch of
error messages like this:
<<< ERROR MESSAGES >>>
Recovering 5111 files within /path1/data into /path2/recover_test
Volumes needed (all on-line):
vol_c001 at rd=snode:/dev/nst1
Requesting 5111 file(s), this may take a while...
Error encountered on the following files by NSR server `server': can not
read record 5142 of file 30 on sdlt600 tape vol_c001
./dir1/file1 @ Wed Mar 21 16:33:30 2007
./dir1/file2 @ Wed Mar 21 16:33:30 2007
./dir1/file3 @ Wed Mar 21 16:33:30 2007
..................................... ad infinitum
./dir1/file277 @ Wed Mar 21 16:33:30 2007
nwrecover: Requesting remaining 4834 file(s) from NSR server `server'
./dir1/file278
./dir1/file279
./dir1/file280
........................................ ad infinitum
./dir/file5111
nwrecover: Unable to read checksum from save stream
Received 153 file(s) from NSR server `server'
Recover errors with 277 file(s)
Recover completion time: Thu Mar 29 16:31:58 2007
NetWorker also logged an error like this to both the /nsr/logs/messages
and daemon.log files:
Mar 29 15:35:48 server root: [ID 702911 daemon.notice] NetWorker media:
(notice) Volume "vol_c001" on device "rd=snode:/dev/nst1": Cannot decode
block. Verify the device configuration. Tape positioning by record is
disabled.
NetWorker then unloaded the tape, changed the 'Enabled' field from 'Yes'
to 'Service' and then loaded the tape into another drive and recovered
the remaining files. I then re-tested using each of the other 3 devices
and got no complaints. Not one peep! I then tried a save set recover, and
it fails too, with the following error:
recover: xdr checksum failed
Error encountered by NSR server `server': can not read record 1 of file
2 on sdlt600 tape vol_c001
Received 0 file(s) from NSR server `server'
Recover completion time: Fri Mar 30 17:54:27 2007
Finally, I changed the original save set back to 'notsuspect',
re-enabled the drive and tried recovering the same data in the same
drive. It loads the original tape, positions itself but then fails
immediately, generating the following messages in /nsr/logs/daemon.log:
04/01/07 23:05:11 nsrd: media notice: Volume "vol001" on device
"rd=snode:/dev/nst1": Cannot decode block. Verify the device
configuration. Tape positioning by record is disabled.
04/01/07 23:05:11 nsrd: media info: can not read record 3689 of file 30
on sdlt600 tape vol001
and the following in the messages log:
Apr 1 23:05:11 server root: [ID 702911 daemon.notice] NetWorker media:
(warning) rd=snode:/dev/nst1 reading: Success
Apr 1 23:05:12 server last message repeated 5 times
Apr 1 23:05:12 server root: [ID 702911 daemon.notice] NetWorker media:
(notice) Volume "vol001" on device "rd=snode:/dev/nst1": Cannot decode
block. Verify the device configuration. Tape positioning by record is
disabled.
Apr 1 23:05:12 server root: [ID 702911 daemon.notice] NetWorker media:
(info) can not read record 3689 of file 30 on sdlt600 tape vol001
Apr 1 23:05:12 server root: [ID 702911 daemon.notice] NetWorker media:
(warning) rd=snode:/dev/nst1 reading: Success
Apr 1 23:05:14 server last message repeated 14 times
Apr 1 23:05:14 server root: [ID 702911 daemon.notice] NetWorker device
disabled: (warning) Device rd=snode:/dev/nst1 is automatically disabled.
Apr 1 23:05:14 server root: [ID 702911 daemon.notice] consecutive
errors (21) exceeded the maximum consecutive errors allowed.
Apr 1 23:05:14 server root: [ID 702911 daemon.notice] Please fix the
device or set a higher value for the Max consecutive errors
Apr 1 23:05:14 server root: [ID 702911 daemon.notice] attribute in the
device resource.
It then ejects the tape, puts the device back into 'Service', and then
loads the appropriate clone volume (the one I'd previously played with)
into another drive and recovers all the data just dandy, no errors. I
then retried the recovery using the original tape but this time in
another drive, and this works flawlessly, no errors.
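
The auto-disable in the log above reads like a plain consecutive-error
threshold. Here is a minimal sketch of that kind of counter. It is a
hypothetical model, not NetWorker's actual code: the limit of 20 is an
assumed value for 'Max consecutive errors', the reset-on-success rule is
also an assumption, and first_disable_index is a made-up name. For what
it's worth, the "reading: Success" warnings in the messages log add up to
6 + 15 = 21, the same number the disable message reports, though that may
or may not be what NetWorker is counting.

```python
def first_disable_index(results, max_consecutive=20):
    """Return the index of the operation at which the device would be
    disabled, or None if the limit is never exceeded.

    results: iterable of booleans, True for a successful read.
    max_consecutive: assumed limit; disable once the streak exceeds it.
    """
    streak = 0
    for i, ok in enumerate(results):
        streak = 0 if ok else streak + 1  # any success resets the streak
        if streak > max_consecutive:
            return i
    return None


# 21 failed reads in a row trips an assumed limit of 20 on the last one.
print(first_disable_index([False] * 21))
```

If the counter really does reset on any success, that would also fit the
"Consecutive errors" field showing 0 despite the scattered errors logged
earlier in the week.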
I should note that I've been using this tape library since Nov 2006, and
I've seen only one error on drives 1 and 3 and none on 4. Drive 2,
however, has had several errors during the last week on several tapes,
not just the ones I was testing. The /nsr/logs/messages and daemon.log
files recorded some messages for these. However, the "Consecutive
errors" under the device under nwadmin still showed 0 until I tried to
recover the aforementioned data. The errors I've seen in the log files
for drive 2 have all been like this:
Mar 22 01:08:06 server root: [ID 702911 daemon.notice] NetWorker media:
(notice) Volume "volume_name" on device "rd=snode:/dev/nst1": Cannot
decode block. Verify the device configuration. Tape positioning by
record is disabled
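
To compare drives over time, the "Cannot decode block" notices can be
tallied per device straight from the log text. A minimal sketch, assuming
the message wording shown in the excerpts above; count_decode_errors and
SAMPLE are names made up for this example, and the sample is just two of
the messages quoted in this post. Whitespace is collapsed first so that
messages wrapped across lines (as in this email) still match.

```python
import re
from collections import Counter

# Captures the device name from a "Cannot decode block" notice.
PATTERN = re.compile(r'on device\s+"([^"]+)":\s+Cannot\s+decode\s+block')


def count_decode_errors(log_text):
    """Count 'Cannot decode block' notices per device."""
    flat = " ".join(log_text.split())  # undo line wrapping
    return Counter(PATTERN.findall(flat))


SAMPLE = '''Mar 29 15:35:48 server root: [ID 702911 daemon.notice] NetWorker media:
(notice) Volume "vol_c001" on device "rd=snode:/dev/nst1": Cannot decode
block. Verify the device configuration. Tape positioning by record is
disabled.
Apr 1 23:05:12 server root: [ID 702911 daemon.notice] NetWorker media:
(notice) Volume "vol001" on device "rd=snode:/dev/nst1": Cannot decode
block. Verify the device configuration. Tape positioning by record is
disabled.
'''

for device, n in count_decode_errors(SAMPLE).items():
    print(device, n)
```

Run against the full /nsr/logs/messages contents, the same function would
give a per-drive count to set alongside the error history described above.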
If anyone has any ideas or recommendations, I'd love to hear them.
Thanks.
George
<<< /etc/stinit.def >>>
# Global Keywords and Values
drive-buffering=1
#scsi2logical=1
no-wait=0
buffering=0
async-writes=0
read-ahead=1
two-fms=0
auto-lock=0
fast-eom=1
can-bsr=1
noblklimits=0
# can-partitions=0
# QUANTUM SDLT600
manufacturer=QUANTUM model="SDLT600" {
timeout=3600 # 1 hour timeout
long-timeout=14400 # 4 hour long timeout
can-partitions=0
mode1 blocksize=0 density=0x4A compression=1 # SDLT600 density, compression on
mode2 blocksize=0 density=0x4A compression=0 # SDLT600 density, compression off
mode3 blocksize=0 density=0x49 compression=1 # SDLT320 density, compression on
mode4 blocksize=0 density=0x48 compression=1 # SDLT220 density, compression on
}
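
A quick way to double-check the variable block size claim is to pull the
mode lines out of the file and confirm that each one has blocksize=0,
which is how stinit.def expresses variable block mode. A minimal sketch;
parse_modes is a made-up helper that only understands the simple
"modeN key=value ..." lines shown above, not the full stinit.def syntax.

```python
import re

MODE_RE = re.compile(r"^\s*mode(\d)\s+(.*)$")


def parse_modes(text):
    """Map mode number -> {key: value} for each 'modeN ...' line."""
    modes = {}
    for line in text.splitlines():
        line = line.split("#", 1)[0]  # drop trailing comments
        m = MODE_RE.match(line)
        if m:
            pairs = (kv.split("=", 1) for kv in m.group(2).split())
            modes[int(m.group(1))] = dict(pairs)
    return modes


SAMPLE = """\
mode1 blocksize=0 density=0x4A compression=1 # SDLT600 density, compression on
mode2 blocksize=0 density=0x4A compression=0 # SDLT600 density, compression off
mode3 blocksize=0 density=0x49 compression=1 # SDLT320 density, compression on
mode4 blocksize=0 density=0x48 compression=1 # SDLT220 density, compression on
"""

modes = parse_modes(SAMPLE)
all_variable = all(m["blocksize"] == "0" for m in modes.values())
print(all_variable)
```

This only checks what the file asks for; what the drive is actually set
to at run time is a separate question.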
--
George Sinclair - NOAA/NESDIS/National Oceanographic Data Center
SSMC3 4th Floor Rm 4145 | Voice: (301) 713-3284 x210
1315 East West Highway | Fax: (301) 713-3301
Silver Spring, MD 20910-3282 | Web Site: http://www.nodc.noaa.gov/
- Any opinions expressed in this message are NOT those of the US Govt. -