Large number of volumes marked unavailable - what keeps overwriting volume labels?

kolstet

Please forgive me if this has been posted before. I'm fairly new to TSM and need some help.



"I'm sorry this letter is so long, but I did not have time to make it shorter." --Mark Twain



Here's our setup:

- Single TSM server
- TSM 5.2 on IBM 346 and Red Hat AS 3.0
- Dual SCSI (Adaptec AHA-3960D / AIC-7899A U160/m) card
- SCSI-attached ADIC Scalar 100 Library with 3 LTO2 drives
- Two drives on one SCSI chain, changer and other drive on other SCSI chain



Over the past couple of months I've been getting a lot of volumes marked as unavailable. The tape will mount and then receive an ANR8355E (I/O error reading label for volume XXXX). Going back through the activity log, it seems that the previous use of the tape was successful, but the label keeps either being corrupted or misread.



I haven't matched this up with a specific drive, SCSI chain, or set of tapes. Every time this problem occurs, I recall the corresponding COPYPOOL tapes and restore the volume, eject the tape and delete it, then relabel the tape and check it back in as scratch, but the error keeps coming up on different tapes. Attempts to audit the volume or remark the tape as READWRITE fail because the volume label can't be read.
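
One way to check for a pattern is to tally the label errors by volume and by drive from an exported copy of the activity log. The following is a minimal Python sketch, not a tested tool. It assumes the log has been saved as plain text and that each ANR8355E line names the volume and, optionally, the drive in the form "... for volume XXXX in drive NAME (device)". The function name and the regular expression are illustrative and may need adjusting to the actual message text.

```python
import re
from collections import Counter

# Assumed message layout (adjust to the real activity log text):
#   ANR8355E I/O error reading label for volume <VOL> in drive <DRIVE> (<device>)
LABEL_ERROR = re.compile(
    r"ANR8355E I/O error reading label for volume (?P<volume>[A-Za-z0-9]+)"
    r"(?: in drive (?P<drive>[^\s(]+))?"
)


def tally_label_errors(log_lines):
    """Count ANR8355E label errors per volume and per drive."""
    by_volume = Counter()
    by_drive = Counter()
    for line in log_lines:
        match = LABEL_ERROR.search(line)
        if match is None:
            continue  # not a label-read error
        by_volume[match.group("volume")] += 1
        # Lines that do not name a drive are counted under "unknown".
        by_drive[match.group("drive") or "unknown"] += 1
    return by_volume, by_drive
```

Passing it the lines of a saved log returns two counters. If one drive or one volume dominates either counter, that points at a specific subsystem; an even spread across drives and chains points away from hardware on a single path.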



I'm assuming that something is corrupting my volume labels. The tape paths are set to /dev/IBMtapeX (the rewind-on-close device) and not /dev/IBMtapeXn (the no-rewind device). Is it possible that the tape is being rewound when it shouldn't be and overwriting the volume label?



I don't want to call vendors in on this until I can narrow the problem down to a particular subsystem. If I call ADIC in to do a diagnostic, for example, they may charge me for the visit if they find out the problem is elsewhere.



The worst part about this is having to recall all my offsite tapes for these restores, leaving my eggs all in one basket during that time period until the next courier visit. On top of this, reclamation is unable to run on unavailable volumes, so I'm eating up tapes and running out of scratch very quickly. Finally, there have been a few tapes that I've been unable to completely restore. Am I correct in assuming that if I delete the data from those tapes, the newest versions of those objects will be stored on the next backup run, and I only lose my retention time?



I'm going to manually shut down TSM and try to access the tape drive directly with one of these tapes loaded, so I can get a low-level look at the first few KB of the tape and see if a volume label exists. I don't know in what format these volume labels are stored, so I'll have to compare with a known-good volume label; however any help you have would be GREATLY appreciated. Especially welcome is a method of repairing just the volume label without overwriting the rest so I can audit the tape and see if my data's still there...
 
I would confirm that all your devices have current microcode (or the best version), that all the drives are up to date on their preventive maintenance, and that they have a regular cleaning schedule.



If you get another error reading the label, try to re-write the label using the same drive.



-Aaron
 
Thanks for responding, Aaron.



The drives and library are at the current microcode.



The library and drives are less than a year old; how often do you recommend PM?



The library is set up to automatically run cleaning tapes without TSM intervention. I can't verify at this point that the cleanings are actually being performed, but once a tape errors out it can't be read in any drive no matter how many times I try, so it doesn't seem to be related to dirty heads.



When I eject, delete, and relabel the tape as new it works fine for a while - is there a way of relabeling without killing the data on the tape or the reference to it in the TSM database?



Thanks,

Tony
 
Did you have a look at the system logs, in particular the SCSI messages, to see whether anything was logged at the time the error reading the tape label occurred?

It could be the drive's SCSI cable...
 
Thanks, May.



I don't see unusual SCSI messages in syslog, and wouldn't expect to since this happens on different drives/cables/chains.



The actual on-tape label seems to be getting corrupted somehow. Once the label fails to read, it will never read correctly in any drive until I re-label the tape. Once the tape is re-labeled it works fine until this happens again. I'm never seeing write errors on these tapes, only the error when it tries to verify the volume label.



[Edit: added later] There is also no record of write errors for each affected volume, only read errors, which indicates to me that the hardware and TSM know nothing about the fact that the volume label is being overwritten.



I see a hardware problem as very unlikely at this point; it appears to be software or configuration that's causing the problem. I'm still suspicious that the tape is being rewound by the OS on each open/close while TSM expects to be using the no-rewind device pointer to the tape drive. Can anyone confirm what the proper setup is for this?



I'm also still looking for an answer as to whether or not I can relabel and then audit the volume without losing data.
 
Okay, I've been working with IBM and ADIC on this issue for some time now so thought I'd post up more detail for those who are interested.



This can happen to any tape: newly returned from offsite as scratch, in the scratch pool inside the library, or even one with data on it already.



The tape is loaded into the drive for a write operation. The label is verified successfully, the write operation is performed and completed. After the 5-minute mount retention expires, the tape is due to be ejected. At this time TSM attempts to reverify the volume label and finds that it doesn't match. It's at this time that the tape is marked as unavailable. Re-marking the tape as READWRITE or trying to audit the volume just produces more errors in verifying the volume label. If I remove and relabel the tape as a new one, it works fine.



The tape label is contained in the first 80 bytes of the tape. I tried going directly through the tape device (using Linux 'dd') to read the first 80 bytes of the tape, but this returned a 0-byte result and no errors.
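
A zero-byte read with no error is what a tape driver typically returns when the first thing it meets is a filemark or end-of-data rather than a data block. The same low-level check can be sketched in Python as below. This is a rough sketch under stated assumptions: TSM is stopped, the tape is loaded and positioned at the beginning, the device path is whatever the drive is known as on the system (for example /dev/IBMtape0), and the label is 80 bytes as described above. The function names are made up for illustration, and nothing here interprets the label contents.

```python
import os


def read_first_block(device_path, max_block_size=65536):
    """Read one block from the current position of the device.

    A single read() on a tape device returns at most one block, so the
    buffer is made larger than the label to avoid a too-small-buffer error.
    """
    fd = os.open(device_path, os.O_RDONLY)
    try:
        return os.read(fd, max_block_size)
    finally:
        os.close(fd)


def describe_first_block(block, label_size=80):
    """Summarize what the first read returned."""
    if len(block) == 0:
        return "empty read: filemark or end-of-data at the start of the tape"
    if len(block) < label_size:
        return (f"short block of {len(block)} bytes: "
                f"shorter than the expected {label_size}-byte label")
    return (f"{len(block)}-byte block read; "
            f"first {label_size} bytes: {block[:label_size]!r}")
```

Running the same two calls against a known-good tape and against a failed one gives a direct comparison of what sits at the start of each.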



It seems that an end-of-data marker is being written at the very beginning of the tape during a TSM write operation. No error is being logged, and this happens on all drives on different SCSI chains. Diagnostic tests (IBM 'itdt' utility, drive front-panel diagnostics, and ADIC library diagnostics) all come back successfully. TSM would never knowingly write to the first 80 bytes of the tape unless as part of a label operation. Somewhere between the kernel, device driver, tape microcode, and the tape itself we are getting out of sync and writing over a part of a tape that should never be written to.



IBM is chasing this down as a possible tape drive firmware bug (they're IBM drives). We are at the latest revision of firmware so this would be a new bugfix.



I'll continue to update this thread so others caught in this situation can have this as a reference. If anyone can offer any further advice PLEASE let me know.
 
Still working with IBM on this.



We disabled reclamation completely. The write operations are always considered successful by TSM, so it expires the offsite tapes and recalls them every time it makes a new copypool tape to go out. When the tape is then marked bad 5 minutes later, we lose data. Disabling reclamation is a desperate measure to keep a good copy of our data offsite and minimize the chances of data loss. It is also causing us to eat up tapes like they're going out of style...



They asked me to make sure the "st" device driver (the Linux generic SCSI tape kernel module) is unloaded, as they fear it may be interfering with the IBMtape driver. When pressed, they could not identify an official method for disabling this driver; they kept giving me "rmmod st", which only unloads the driver until the next reboot. Since they don't have a blessed procedure for this and it's not in the official install documentation for TSM, I would assume that others are running this driver concurrently with IBMtape. And since we're supposedly the only company with this problem, the "st" driver seems an unlikely cause.



Regardless, I unloaded (rmmod) the st driver and still saw a tape go unavailable that day. I sent them an activity log excerpt showing this to be the case.



They then asked me to disable the driver completely and check all unavailable volumes out of the library so that only the good ones remain. I have done this. Disabling the driver completely was done by adding the following to modules.conf:

alias st off
alias st0 off
alias st1 off
alias st2 off
alias char-major-9 off

I know that only one of those lines is probably all that's needed, but wanted to make sure all my bases were covered :)
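
To confirm after a reboot that the module really stayed out of the kernel, the loaded-module list in /proc/modules can be checked. Below is a small Python sketch of that check. It assumes only that the first whitespace-separated field of each /proc/modules line is the module name; the helper names are illustrative.

```python
def loaded_modules(proc_modules_text):
    """Return the set of module names listed in /proc/modules-style text."""
    names = set()
    for line in proc_modules_text.splitlines():
        fields = line.split()
        if fields:
            names.add(fields[0])  # first field is the module name
    return names


def module_is_loaded(name, proc_modules_path="/proc/modules"):
    """True if the named kernel module is currently loaded."""
    with open(proc_modules_path) as handle:
        return name in loaded_modules(handle.read())
```

Calling module_is_loaded("st") just before the backup window shows whether the generic driver has crept back in alongside IBMtape.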



Current theory is that this is a bug in the firmware/ucode on the tape drive but they're taking a "shotgun approach", trying to collect as much data as they can. They've asked me to reproduce the issue with trace enabled, and immediately collect the trace as well as logs from the IBMtape driver and the results from an "egather" diagnostic utility which seems to collect kernel modules loaded, installed software, processes running, etc. I started a manual stgpool backup and watched it.



This time, instead of an 8355 error, I've gotten a different error during dismount:



8831574:16:15:37.646 [34][output.c][5964]: ANR8950W Device /dev/IBMtape1, volume 000005L2 has issued the following Warning TapeAlert: The tape directory on the tape cartridge just unloaded has been corrupted. File search performance will be degraded. The tape directory can be rebuilt by reading all the data.~



This error is mentioned as one that was coming up a lot for people on a much older version of this drive firmware, but the problem was supposedly corrected by an upgrade that's already been applied in our environment.



IBM seems to think that this *may* be indicative of the same problem, and the difference in error messages is due to the "st" driver being unloaded. But I saw an 8355 with the "st" module unloaded before, so this doesn't seem logical. I duly sent them the trace and logs, let's see what they come up with...



Our VAR is loaning us a completely different library with completely different drives as an interim solution. Our plan is to leave the historical data on the current library, install the new one with primary and copy storage pools, and use the new one for all our backups. In the meantime the old library will be idle so that we can run whatever tests IBM wants on it.



Can anyone shed some light on the "st"/"IBMtape" conflict issue?



As promised I will continue to update this thread until the problem is resolved. If it saves one other person from running without reliable backup for 2-3 weeks like I am now, it's worth it...
 
About 4 or 5 posts down, you asked how often PM should be performed on a drive/library. I would think it depends on the type of device, how often it's used, and the environment. A 4mm DAT drive, which is not designed to be used 24/7/365, and a 3592 drive that is used once a week will have different PM schedules. If the drive/library is newish (a year or two old) I would expect to see PM about once a year. As a device gets older (or is used more often), parts wear faster. The IBM CE should be able to tell you exactly what the PM schedule for an IBM device is.



As for labeling a volume without destroying data... try to check out the libvol with checklabel=no (so it doesn't try to verify the volume label), then perform:



label libvol {library} {volume} overwrite=yes checkin=scratch (or private)



You can also use the search=yes option to have TSM label all non-checked-in volumes. The overwrite option will basically force TSM to re-write the label.



Good luck, I hope this gets fixed soon.
 
Thanks, Heada. Didn't know for sure if anyone is still monitoring this thread :)



Here's the latest:



To try to get back on an even keel, an IBM tech advised me to mark the 'unavailable' volumes as 'destroyed', then perform a restore stgpool for each storage pool. I did this, and of course there is some data that can't be brought back (copies in both storage pools are bad).



The good (or at least hopeful) news: I've been running reclaim again since this time, and have not seen ONE tape go unavailable since we disabled the native 'st' driver in Linux. Looks like it may have been a driver conflict after all. However, we sometimes went a month between volume problems before, so management and I settled on a compromise: if we can run for four weeks of normal operation without a volume going south, we can close the ticket. Until then we don't know if the problem's been fixed.



As far as I understand, the firmware team is trying to set up a situation in the lab where the st and IBMtape drivers are both loaded, in an attempt to reproduce the error. IBM thinks that 'rmmod' is an acceptable solution for getting rid of the st driver, even though this is a one-time command, and there's nothing about disabling the 'st' driver in the installation instructions or from the TSM-certified VAR that installed the system. Looks like they might need to add it....



I thought that our storage pool restore would help us at least get to a point where current (if not historical) copies of the data existed in the stgpools, so that we could do a restore if necessary. A quick test restore on our fileserver showed that this was not the case. We got 80-some GB back, but also lots of these two error messages:



ANR0836W No query restore processing session (session) for node (node) and (fsname) failed to retrieve file (filename) - file being skipped.

ANE4988W File (full path to file) is currently unavailable on server and has been skipped.



According to IBM, the restore stgpool is supposed to delete any volumes currently marked 'destroyed'. However, the volumes that could not be fully restored were NOT deleted, and for some reason TSM won't re-backup an object that exists in the stgpool, even if it's on a 'destroyed' tape. Maybe I'm just not getting something complex, but sounds like a really stupid set of logic in the code AFAIC.



At IBM's direction, I'm currently deleting these remaining 'destroyed' volumes with discardd=yes. After this is done I will relabel them and check them in as new. After tonight's backups we'll test again to see if everything got backed up properly this time. It'd be nice to have some trust in our backups again...



In other news, the IBMtaped daemon is dying on me several times a day ("Poll_Trace failed: Cannot Allocate Memory", but we have plenty). It got to the point where I wrote a script to bounce it because I was tired of connecting to VPN at odd hours to reset it. I was able to find out today (when my calls were FINALLY returned) that the IBMtaped process is only a monitor for trace information and not integral to the operation of the tape drives. I had turned the trace level up to 2 when we were trying to reproduce the issue with detailed information for IBM, and that may have contributed to its instability, but it did die occasionally even with trace set to 1. Anyone else seeing this?
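
For anyone who needs a similar stopgap, a watchdog along these lines can be sketched in Python. This is not the script I used, just a minimal illustration: it looks for the daemon by name in the process list and runs a start command only if it is missing. The start command is an assumption and must be replaced with whatever actually launches IBMtaped on the system; the function names are made up for the example.

```python
import subprocess


def process_names():
    """Return the command names of all running processes (ps -e -o comm=)."""
    result = subprocess.run(
        ["ps", "-e", "-o", "comm="],
        capture_output=True, text=True, check=True,
    )
    return [line.strip() for line in result.stdout.splitlines() if line.strip()]


def ensure_running(name, start_command,
                   list_names=process_names, run=subprocess.run):
    """Start the daemon if no process with that name exists.

    Returns True if a start was attempted, False if it was already running.
    start_command is an argument list supplied by the caller (an assumption
    here, not a documented IBMtaped invocation).
    """
    if name in list_names():
        return False
    run(start_command, check=True)
    return True
```

Run from cron every few minutes, a call such as ensure_running("IBMtaped", [...]) with the real start command filled in avoids the late-night VPN sessions without touching the tape drives themselves.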



As always, I will provide updates as I have them. If I can save even one person from going through this same experience with only tech support upon which to rely, it's worth my time. Lord knows I've had enough help of my own from Google and forums like this over the years.
 
Hi Guys,

I am watching this interesting topic with full attention because I am experiencing a similar problem.

I have two cross-defined TSM 5.3.2 servers, one on SuSE Linux Server 8 and the other on AIX 5.3. The Linux server has one LTO1 drive (a "manual library") and the AIX server has an LTO2 drive (also manual). I am experiencing the same problem you have described here, but on BOTH servers, Linux and AIX, so I am not sure that blaming the Linux driver is the right path. Actually, I saw the problem more frequently on AIX than on Linux.

If there is any way to relabel the tape and save the data on it with a manual library, I would like to know... :cry:
 
Mita: Sorry to hear you're having problems with this. Hope this has helped somehow.



Since disabling the Linux 'st' driver, I've not had any more problems with unavailable volumes. The problem was once infrequent, so we're waiting until mid-May before considering the issue closed.



As I said before, the 'st' driver was disabled by adding the following to modules.conf (shotgun approach - probably not all of these lines are needed)



alias st off
alias st0 off
alias st1 off
alias st2 off
alias char-major-9 off



As far as the tapes that were marked bad, I haven't been able to get the data back. I did a 'restore stgpool' and restored as much as I could, but once the tape label is gone, I don't think there's any way to get it back.
 
I know this thread is very old, but if you're still a participant on the board hopefully you'll see this.

Did this unloading of the st driver fix this issue? I have the same issue here, though the st driver for me is NOT loaded and never has been. I lose data a couple of times a week, minimum.
 