[Networker] Very strange problem when recovering data -- need help

2007-04-01 19:55:48
Subject: [Networker] Very strange problem when recovering data -- need help
From: George Sinclair <George.Sinclair AT NOAA DOT GOV>
To: NETWORKER AT LISTSERV.TEMPLE DOT EDU
Date: Sun, 1 Apr 2007 19:54:41 -0400
Hi,

I'm unable to recover all the data on an SDLT-2 tape from drive 2 of 4, but I *can* recover the same data from any of the other drives in the same library!!! When using the afflicted drive, the recovery fails on the first 277 files but then succeeds with the remaining ones. However, the recovery is completely successful when run on any of the other 3 drives. Go figure! All the drives are the same type and have the same firmware. Has anyone ever seen this situation before? I've provided all details and error messages further below. This tape was originally labeled on this library, and there have been no changes to the software or equipment.

1. How could this phenomenon happen?

The closest I've seen to this type of behavior was one time when I had a problem inventorying several tapes in an LTO tape library. There was one drive wherein NetWorker would complain when loading 3 different LTO tapes, but I could inventory these same tapes in the other LTO drives with no complaints. Other than those 3 tapes, though, all the other tapes worked fine in that drive. It was the weirdest thing.

2. Quantum suggested running their diagnostic tool against the drive, but there are no guarantees that it will fail the test. If it does, great, but if not, I still don't trust it. So they're sending me a new drive, and I was just gonna swap it out and retry. Does this sound reasonable? What conclusions can I draw from all this?

<<< Details >>>
We're running NetWorker 7.2.2 with a Solaris primary backup server and a Linux storage node with an attached Quantum M1800 SDLT tape library (4 SDLT-600 drives). Hardware compression is enabled. No client-side compression is used. Drives 1-2 are daisy-chained and attached to channel A of a dual-channel HBA, and drives 3-4 are likewise daisy-chained and use channel B of the same HBA. The *afflicted* drive is drive 2 of 4. All backups use the storage node, and data is written directly to tape, no VTL. All the drives have the same firmware and are identical. The NetWorker 'inquire' tool reports the same information for each drive, and each is configured in nwadmin the same way. They all have the same settings. We use variable block size, as indicated in our /etc/stinit.def file, which I've provided at the end. Also, nwadmin indicates that the Volume block size is 128 KB for all our labeled tapes, including the one I was working with. Furthermore, the drive has not indicated that it needs to be cleaned, but I went ahead and cleaned it and retried, with the same result.

I recently ran a large level full backup (934 GB) that spanned 4 SDLT-2 tapes. I then cloned the save set. mminfo shows nothing suspect as far as the save set flags or clone flags. I then marked the original save set 'suspect' and then spot checked by using nwrecover to recover a few directories from the clone copy - all OK. But then I decided to recover one more directory, and then ... Wham! NetWorker threw out a bunch of error messages like this:

<<< ERROR MESSAGES >>>

Recovering 5111 files within /path1/data into /path2/recover_test

Volumes needed (all on-line):
      vol_c001 at rd=snode:/dev/nst1
Requesting 5111 file(s), this may take a while...
Error encountered on the following files by NSR server `server': can not read record 5142 of file 30 on sdlt600 tape vol_c001
  ./dir1/file1 @ Wed Mar 21 16:33:30 2007
  ./dir1/file2 @ Wed Mar 21 16:33:30 2007
  ./dir1/file3 @ Wed Mar 21 16:33:30 2007
  ..................................... ad infinitum
  ./dir1/file277 @ Wed Mar 21 16:33:30 2007
nwrecover: Requesting remaining 4834 file(s) from NSR server `server'
./dir1/file278
./dir1/file279
./dir1/file280
........................................ ad infinitum
./dir/file5111

nwrecover: Unable to read checksum from save stream
Received 153 file(s) from NSR server `server'
Recover errors with 277 file(s)
Recover completion time: Thu Mar 29 16:31:58 2007

NetWorker also logged an error like this to both the /nsr/logs/messages and daemon.log files:

Mar 29 15:35:48 server root: [ID 702911 daemon.notice] NetWorker media: (notice) Volume "vol_c001" on device "rd=snode:/dev/nst1": Cannot decode block. Verify the device configuration. Tape positioning by record is disabled.

NetWorker then unloaded the tape, changed the 'Enabled' field from 'Yes' to 'Service', loaded the tape into another drive, and recovered the remaining files. I then re-tested using each of the other 3 devices and got no complaints. Not one peep! I then tried a save set recover, and it fails too, with the following error:

recover: xdr checksum failed
Error encountered by NSR server `server': can not read record 1 of file 2 on sdlt600 tape vol_c001
Received 0 file(s) from NSR server `server'
Recover completion time: Fri Mar 30 17:54:27 2007

Finally, I changed the original save set back to 'notsuspect', re-enabled the drive and tried recovering the same data in the same drive. It loads the original tape, positions itself but then fails immediately, generating the following messages in /nsr/logs/daemon.log:

04/01/07 23:05:11 nsrd: media notice: Volume "vol001" on device "rd=snode:/dev/nst1": Cannot decode block. Verify the device configuration. Tape positioning by record is disabled.
04/01/07 23:05:11 nsrd: media info: can not read record 3689 of file 30 on sdlt600 tape vol001

and the following in the messages log:

Apr 1 23:05:11 server root: [ID 702911 daemon.notice] NetWorker media: (warning) rd=snode:/dev/nst1 reading: Success
Apr  1 23:05:12 server last message repeated 5 times
Apr 1 23:05:12 server root: [ID 702911 daemon.notice] NetWorker media: (notice) Volume "vol001" on device "rd=snode:/dev/nst1": Cannot decode block. Verify the device configuration. Tape positioning by record is disabled.
Apr 1 23:05:12 server root: [ID 702911 daemon.notice] NetWorker media: (info) can not read record 3689 of file 30 on sdlt600 tape vol001
Apr 1 23:05:12 server root: [ID 702911 daemon.notice] NetWorker media: (warning) rd=snode:/dev/nst1 reading: Success
Apr  1 23:05:14 server last message repeated 14 times
Apr 1 23:05:14 server root: [ID 702911 daemon.notice] NetWorker device disabled: (warning) Device rd=snode:/dev/nst1 is automatically disabled.
Apr 1 23:05:14 server root: [ID 702911 daemon.notice] consecutive errors (21) exceeded the maximum consecutive errors allowed.
Apr 1 23:05:14 server root: [ID 702911 daemon.notice] Please fix the device or set a higher value for the Max consecutive errors
Apr 1 23:05:14 server root: [ID 702911 daemon.notice] attribute in the device resource.
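
For what it's worth, the figure of 21 in the "consecutive errors" message can be reproduced from the messages-log excerpt above. The sketch below is only a guess at the bookkeeping: it assumes each "(warning) ... reading: Success" line counts as one error and that a syslog "last message repeated N times" line stands for N further copies of the line before it. The post does not say how NetWorker actually counts, and the function name `count_warnings` is made up for illustration.

```python
import re

# Message text from the messages-log excerpt above (timestamps trimmed).
# Assumption: every "(warning)" line is one error, and a syslog
# "last message repeated N times" line adds N copies of the previous line.
LOG_LINES = [
    "NetWorker media: (warning) rd=snode:/dev/nst1 reading: Success",
    "last message repeated 5 times",
    'NetWorker media: (notice) Volume "vol001" on device "rd=snode:/dev/nst1": Cannot decode block.',
    "NetWorker media: (info) can not read record 3689 of file 30 on sdlt600 tape vol001",
    "NetWorker media: (warning) rd=snode:/dev/nst1 reading: Success",
    "last message repeated 14 times",
]

REPEAT_RE = re.compile(r"last message repeated (\d+) times")


def count_warnings(lines):
    """Count (warning) messages, expanding syslog repeat markers."""
    total = 0
    previous_was_warning = False
    for line in lines:
        repeat = REPEAT_RE.search(line)
        if repeat:
            # A repeat marker refers to the line immediately before it.
            if previous_was_warning:
                total += int(repeat.group(1))
            continue
        previous_was_warning = "(warning)" in line
        if previous_was_warning:
            total += 1
    return total


print(count_warnings(LOG_LINES))  # (1 + 5) + (1 + 14) = 21
```

Under that assumption the tally matches the 21 reported when the device was disabled.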

It then ejects the tape, puts the device back into 'Service', and then loads the appropriate clone volume (the one I'd previously played with) into another drive and recovers all the data just dandy, no errors. I then retried the recovery using the original tape but this time in another drive, and this works flawlessly, no errors.
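
The outcomes described so far can be laid out as a small tape-by-drive table to see which component is common to every failure. This is a minimal sketch of that reasoning, not a diagnostic tool: "other" lumps drives 1, 3 and 4 together, a recovery that needed a second drive to finish is treated as a failure for drive 2, and the helper name `always_failing` is made up for illustration.

```python
from collections import defaultdict

# Outcomes as described in this post. "other" stands for drives 1, 3 and 4,
# which are not distinguished here (a simplification).
RESULTS = [
    # (tape, drive, recovered_ok)
    ("vol_c001 (clone)", "drive 2", False),
    ("vol_c001 (clone)", "other", True),
    ("vol001 (original)", "drive 2", False),
    ("vol001 (original)", "other", True),
]


def always_failing(results, index):
    """Return the values at position `index` for which every test failed."""
    outcomes = defaultdict(list)
    for row in results:
        outcomes[row[index]].append(row[2])
    return sorted(key for key, oks in outcomes.items() if not any(oks))


print("tapes that fail everywhere: ", always_failing(RESULTS, 0))
print("drives that fail everywhere:", always_failing(RESULTS, 1))
```

Both tapes read fine somewhere, while drive 2 fails on both, so the table points at the drive or something specific to its position (cabling, its place in the daisy chain) rather than at the media. Adding a row for the replacement drive in the same slot would help separate those two possibilities.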

I should note that I've been using this tape library since Nov 2006, and I've seen only one error on drives 1 and 3 and none on 4. Drive 2, however, has had several errors during the last week on several tapes, not just the ones I was testing. The /nsr/logs/messages and daemon.log files recorded some messages for these. However, the "Consecutive errors" under the device under nwadmin still showed 0 until I tried to recover the aforementioned data. The errors I've seen in the log files for drive 2 have all been like this:

Mar 22 01:08:06 server root: [ID 702911 daemon.notice] NetWorker media: (notice) Volume "volume_name" on device "rd=snode:/dev/nst1": Cannot decode block. Verify the device configuration. Tape positioning by record is disabled
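
To check whether these notices really cluster on one device and span several volumes, the log can be tallied per device. The following is a rough sketch that assumes the notice always has the shape shown in this post; the regular expression and the function name `decode_errors` are illustrative, and the sample lines are copied from the excerpts above with the timestamps trimmed.

```python
import re
from collections import Counter

# Pattern for the "Cannot decode block" notice in the format shown in this
# post. Whitespace before "on device" is treated as optional.
DECODE_RE = re.compile(
    r'Volume "(?P<volume>[^"]+)"\s*on device "(?P<device>[^"]+)": '
    r"Cannot decode block"
)

# Sample lines copied from this post (timestamps and syslog tags trimmed).
SAMPLE = [
    'NetWorker media: (notice) Volume "vol_c001" on device "rd=snode:/dev/nst1": Cannot decode block.',
    'nsrd: media notice: Volume "vol001" on device "rd=snode:/dev/nst1": Cannot decode block.',
    'NetWorker media: (warning) rd=snode:/dev/nst1 reading: Success',
]


def decode_errors(lines):
    """Return (error count per device, volumes seen per device)."""
    counts = Counter()
    volumes = {}
    for line in lines:
        match = DECODE_RE.search(line)
        if match:
            device = match.group("device")
            counts[device] += 1
            volumes.setdefault(device, set()).add(match.group("volume"))
    return counts, volumes


counts, volumes = decode_errors(SAMPLE)
for device, count in counts.most_common():
    print(device, count, sorted(volumes[device]))
```

Against the real log, the same function could be fed an open file object for /nsr/logs/messages instead of `SAMPLE`, since it only iterates over lines.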

If anyone has any ideas or recommendations, I'd love to hear them.

Thanks.

George

<<< /etc/stinit.def >>>
# Global Keywords and Values
drive-buffering=1
#scsi2logical=1
no-wait=0
buffering=0
async-writes=0
read-ahead=1
two-fms=0
auto-lock=0
fast-eom=1
can-bsr=1
noblklimits=0
# can-partitions=0

# QUANTUM SDLT600
manufacturer=QUANTUM model="SDLT600" {
timeout=3600 # 1 hour timeout
long-timeout=14400 # 4 hour long timeout
can-partitions=0
mode1 blocksize=0 density=0x4A compression=1 # SDLT600 density, compression on
mode2 blocksize=0 density=0x4A compression=0 # SDLT600 density, compression off
mode3 blocksize=0 density=0x49 compression=1 # SDLT320 density, compression on
mode4 blocksize=0 density=0x48 compression=1 # SDLT220 density, compression on
}
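
In the Linux st driver configuration, blocksize=0 selects variable block mode, so the "variable block size" statement earlier can be checked mechanically against the mode lines above. This is a small sketch with a deliberately simplified parser (it is not stinit's own grammar); the function name `parse_modes` is made up for illustration.

```python
import re

# The four mode lines from the stinit.def listing above, comments removed.
STINIT_MODES = """\
mode1 blocksize=0 density=0x4A compression=1
mode2 blocksize=0 density=0x4A compression=0
mode3 blocksize=0 density=0x49 compression=1
mode4 blocksize=0 density=0x48 compression=1
"""

MODE_RE = re.compile(r"^(mode\d)\s+(.*)$")


def parse_modes(text):
    """Map each mode name to a dict of its key=value settings."""
    modes = {}
    for raw in text.splitlines():
        line = raw.split("#", 1)[0].strip()  # drop trailing comments
        match = MODE_RE.match(line)
        if match:
            pairs = (item.split("=", 1) for item in match.group(2).split())
            modes[match.group(1)] = dict(pairs)
    return modes


modes = parse_modes(STINIT_MODES)
all_variable = all(m["blocksize"] == "0" for m in modes.values())
print(sorted(modes), all_variable)
```

All four modes use blocksize=0, which is consistent with every drive being set up for variable-length blocks on the storage node.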


--
George Sinclair - NOAA/NESDIS/National Oceanographic Data Center
SSMC3 4th Floor Rm 4145       | Voice: (301) 713-3284 x210
1315 East West Highway        | Fax:   (301) 713-3301
Silver Spring, MD 20910-3282  | Web Site:  http://www.nodc.noaa.gov/
- Any opinions expressed in this message are NOT those of the US Govt. -