[Networker] Very strange problem when recovering data -- need help

Hi,

I'm unable to recover all the data on an SDLT-2 tape from drive 2 of 4, but I *can* recover the same data from any of the other drives in the same library!!! When using the afflicted drive, the recovery fails on the first 277 files but then succeeds with the remaining ones. However, the recover is completely successful when run on any of the other 3 drives. Go figure! All the drives are the same type and have the same firmware. Has anyone ever seen this situation before? I've provided all details and error messages further below. This tape was originally labeled on this library, and there have been no changes to the software or equipment.


1. How could this phenomenon happen?

The closest I've seen to this type of behavior was one time when I had a problem inventorying several tapes in an LTO tape library. There was one drive wherein NetWorker would complain when loading 3 different LTO tapes, but I could inventory these same tapes in the other LTO drives with no complaints. Other than those 3 tapes, though, all the other tapes worked fine in that drive. It was the weirdest thing.

2. Quantum suggested running their diagnostic tool against the drive, but there are no guarantees that it will fail the test. If it does, great, but if not, I still don't trust it. So they're sending me a new drive, and I was just gonna swap it out and retry. Does this sound reasonable? What conclusions can I draw from all this?


<<< Details >>>

We're running NetWorker 7.2.2 using a Solaris primary backup server and a Linux storage node with an attached Quantum M1800 SDLT tape library (4 SDLT-600 drives). Hardware compression is enabled. No client side compression is used. Drives 1-2 are daisy chained and attached to channel A of a dual channel HBA, and drives 3-4 are likewise daisy chained and use channel B of the same HBA. The *afflicted* drive is drive 2 of 4. All backups use the storage node, and data is written directly to tape, no VTL. All the drives have the same firmware and are identical. The NetWorker 'inquire' tool reports the same information for each drive, and each is configured in nwadmin the same way. They all have the same settings. We use variable block size as indicated in our /etc/stinit.def file, which I've provided at the end. Also, nwadmin indicates that the Volume block size is 128 KB for all our labeled tapes, including the one I was working with. Furthermore, the drive has not indicated that it needs to be cleaned, but I went ahead and cleaned it and re-tried, and got the same problem.

I recently ran a large level full backup (934 GB) that spanned 4 SDLT-2 tapes. I then cloned the save set. mminfo shows nothing suspect as far as the save set flags or clone flags. I then marked the original save set 'suspect' and then spot checked by using nwrecover to recover a few directories from the clone copy - all OK. But then I decided to recover one more directory, and then ... Wham! NetWorker threw out a bunch of error messages like this:


<<< ERROR MESSAGES >>>

Recovering 5111 files within /path1/data into /path2/recover_test

Volumes needed (all on-line):
      vol_c001 at rd=snode:/dev/nst1
Requesting 5111 file(s), this may take a while...

Error encountered on the following files by NSR server `server': can not read record 5142 of file 30 on sdlt600 tape vol_c001

  ./dir1/file1 @ Wed Mar 21 16:33:30 2007
  ./dir1/file2 @ Wed Mar 21 16:33:30 2007
  ./dir1/file3 @ Wed Mar 21 16:33:30 2007
  ..................................... ad infinitum
  ./dir1/file277 @ Wed Mar 21 16:33:30 2007
nwrecover: Requesting remaining 4834 file(s) from NSR server `server'
./dir1/file278
./dir1/file279
./dir1/file280
........................................ ad infinitum
./dir/file5111

nwrecover: Unable to read checksum from save stream
Received 153 file(s) from NSR server `server'
Recover errors with 277 file(s)
Recover completion time: Thu Mar 29 16:31:58 2007

NetWorker also logged an error like this to both the /nsr/logs/messages and daemon.log files:

Mar 29 15:35:48 server root: [ID 702911 daemon.notice] NetWorker media: (notice) Volume "vol_c001" on device "rd=snode:/dev/nst1": Cannot decode block. Verify the device configuration. Tape positioning by record is disabled.

NetWorker then unloaded the tape, changed the 'Enabled' field from 'Yes' to 'Service' and then loaded the tape into another drive and recovered the remaining files. I then re-tested using each of the other 3 devices and no complaints. Not one peep! I then tried a save set recover and it fails, too, with the following error:


recover: xdr checksum failed

Error encountered by NSR server `server': can not read record 1 of file 2 on sdlt600 tape vol_c001

Received 0 file(s) from NSR server `server'
Recover completion time: Fri Mar 30 17:54:27 2007

Finally, I changed the original save set back to 'notsuspect', re-enabled the drive and tried recovering the same data in the same drive. It loads the original tape, positions itself but then fails immediately, generating the following messages in /nsr/logs/daemon.log:

04/01/07 23:05:11 nsrd: media notice: Volume "vol001" on device "rd=snode:/dev/nst1": Cannot decode block. Verify the device configuration. Tape positioning by record is disabled.
04/01/07 23:05:11 nsrd: media info: can not read record 3689 of file 30 on sdlt600 tape vol001


and the following in the messages log:

Apr 1 23:05:11 server root: [ID 702911 daemon.notice] NetWorker media: (warning) rd=snode:/dev/nst1 reading: Success

Apr  1 23:05:12 server last message repeated 5 times

Apr 1 23:05:12 server root: [ID 702911 daemon.notice] NetWorker media: (notice) Volume "vol001" on device "rd=snode:/dev/nst1": Cannot decode block. Verify the device configuration. Tape positioning by record is disabled.
Apr 1 23:05:12 server root: [ID 702911 daemon.notice] NetWorker media: (info) can not read record 3689 of file 30 on sdlt600 tape vol001
Apr 1 23:05:12 server root: [ID 702911 daemon.notice] NetWorker media: (warning) rd=snode:/dev/nst1 reading: Success

Apr  1 23:05:14 server last message repeated 14 times

Apr 1 23:05:14 server root: [ID 702911 daemon.notice] NetWorker device disabled: (warning) Device rd=snode:/dev/nst1 is automatically disabled.
Apr 1 23:05:14 server root: [ID 702911 daemon.notice] consecutive errors (21) exceeded the maximum consecutive errors allowed.
Apr 1 23:05:14 server root: [ID 702911 daemon.notice] Please fix the device or set a higher value for the Max consecutive errors
Apr 1 23:05:14 server root: [ID 702911 daemon.notice] attribute in the device resource.

It then ejects the tape, puts the device back into 'Service', and then loads the appropriate clone volume (the one I'd previously played with) into another drive and recovers all the data just dandy, no errors. I then retried the recovery using the original tape but this time in another drive, and this works flawlessly, no errors.

I should note that I've been using this tape library since Nov 2006, and I've seen only one error on drives 1 and 3 and none on 4. Drive 2, however, has had several errors during the last week on several tapes, not just the ones I was testing. The /nsr/logs/messages and daemon.log files recorded some messages for these. However, the "Consecutive errors" under the device under nwadmin still showed 0 until I tried to recover the aforementioned data. The errors I've seen in the log files for drive 2 have all been like this:

Mar 22 01:08:06 server root: [ID 702911 daemon.notice] NetWorker media: (notice) Volume "volume_name" on device "rd=snode:/dev/nst1": Cannot decode block. Verify the device configuration. Tape positioning by record is disabled
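
To compare drives, notices like this one could be tallied per device straight from the log files. The following is only a minimal Python sketch: the regular expression and the name count_decode_errors are illustrative (not NetWorker tooling), and it assumes each log message sits on a single line in the file.

```python
import re
from collections import Counter

# Pulls the device name out of a "Cannot decode block" notice.
# The pattern is an assumption based on the log excerpts quoted in this post.
DECODE_ERR = re.compile(r'on device "(?P<device>[^"]+)": Cannot\s*decode\s*block')


def count_decode_errors(lines):
    """Return a Counter mapping device name -> number of decode-block notices."""
    counts = Counter()
    for line in lines:
        match = DECODE_ERR.search(line)
        if match:
            counts[match.group("device")] += 1
    return counts


if __name__ == "__main__":
    # Two sample lines in the format shown in this post.
    sample = [
        'Mar 22 01:08:06 server root: [ID 702911 daemon.notice] NetWorker media: '
        '(notice) Volume "volume_name" on device "rd=snode:/dev/nst1": '
        'Cannot decode block. Verify the device configuration.',
        'Apr 1 23:05:12 server root: [ID 702911 daemon.notice] NetWorker media: '
        '(warning) rd=snode:/dev/nst1 reading: Success',
    ]
    for device, n in count_decode_errors(sample).items():
        print(device, n)
```

Feeding it the lines of /nsr/logs/daemon.log (for example via open(path) as the iterable) would show whether the notices cluster on one device path.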


If anyone has any ideas or recommendations, I'd love to hear them.

Thanks.

George

<<< /etc/stinit.def >>>
# Global Keywords and Values
drive-buffering=1
#scsi2logical=1
no-wait=0
buffering=0
async-writes=0
read-ahead=1
two-fms=0
auto-lock=0
fast-eom=1
can-bsr=1
noblklimits=0
# can-partitions=0

# QUANTUM SDLT600
manufacturer=QUANTUM model="SDLT600" {
timeout=3600 # 1 hour timeout
long-timeout=14400 # 4 hour long timeout
can-partitions=0

mode1 blocksize=0 density=0x4A compression=1 # SDLT600 density, compression on
mode2 blocksize=0 density=0x4A compression=0 # SDLT600 density, compression off
mode3 blocksize=0 density=0x49 compression=1 # SDLT320 density, compression on
mode4 blocksize=0 density=0x48 compression=1 # SDLT220 density, compression on
}
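
As a quick cross-check that every mode above really uses blocksize=0 (the variable block size mentioned earlier), the mode lines can be parsed mechanically. This is a rough Python sketch, not part of stinit; parse_modes and fixed_block_modes are hypothetical helper names, and it assumes one "modeN key=value ..." entry per line.

```python
import re

# A 'modeN' line, capturing everything up to an optional '#' comment.
MODE_LINE = re.compile(r'^\s*mode(?P<num>\d+)\s+(?P<body>[^#]*)')


def parse_modes(text):
    """Return {mode_number: {key: value}} for each 'modeN ...' line in text."""
    modes = {}
    for line in text.splitlines():
        m = MODE_LINE.match(line)
        if not m:
            continue
        pairs = (item.split("=", 1) for item in m.group("body").split() if "=" in item)
        modes[int(m.group("num"))] = dict(pairs)
    return modes


def fixed_block_modes(modes):
    """List mode numbers whose blocksize is anything other than 0."""
    return [n for n, s in sorted(modes.items()) if s.get("blocksize") != "0"]


if __name__ == "__main__":
    sample = (
        'manufacturer=QUANTUM model="SDLT600" {\n'
        'mode1 blocksize=0 density=0x4A compression=1 # SDLT600 density, compression on\n'
        'mode2 blocksize=0 density=0x4A compression=0 # SDLT600 density, compression off\n'
        '}\n'
    )
    print(fixed_block_modes(parse_modes(sample)))
```

An empty list means no mode in the file pins a fixed block size.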



--
George Sinclair - NOAA/NESDIS/National Oceanographic Data Center
SSMC3 4th Floor Rm 4145       | Voice: (301) 713-3284 x210
1315 East West Highway        | Fax:   (301) 713-3301
Silver Spring, MD 20910-3282  | Web Site:  http://www.nodc.noaa.gov/

- Any opinions expressed in this message are NOT those of the US Govt. -
