TSM / Library problem

Stephan

Hi.

I've opened a PMR with IBM for this problem, but figured someone here might be able to help debug it as well...

To recap:
We updated our TSM installation from 5.3.5 to 5.4.3. We also upgraded our ATAPE driver to the newest release, as well as the firmware on the drives and the library (TS3500).

All went well, except that now, once in a while, I get a batch of I/O errors when TSM tries to mount a scratch tape...

4/9/2008 5:44:49 AM ANR8355E I/O error reading label for volume 600006 in drive TAPE02 (/dev/rmt1).
4/9/2008 5:45:07 AM ANR8778W Scratch volume 600006 changed to Private Status to prevent re-access.
4/9/2008 5:45:27 AM ANR8355E I/O error reading label for volume 600661 in drive TAPE02 (/dev/rmt1).
4/9/2008 5:45:45 AM ANR8778W Scratch volume 600661 changed to Private Status to prevent re-access.
4/9/2008 5:46:01 AM ANR8355E I/O error reading label for volume 600673 in drive TAPE02 (/dev/rmt1).
4/9/2008 5:46:19 AM ANR8778W Scratch volume 600673 changed to Private Status to prevent re-access.
4/9/2008 5:46:34 AM ANR8355E I/O error reading label for volume 600685 in drive TAPE02 (/dev/rmt1).
4/9/2008 5:46:52 AM ANR8778W Scratch volume 600685 changed to Private Status to prevent re-access.
4/9/2008 5:47:08 AM ANR8355E I/O error reading label for volume 600688 in drive TAPE02 (/dev/rmt1).
4/9/2008 5:47:27 AM ANR8778W Scratch volume 600688 changed to Private Status to prevent re-access.
4/9/2008 5:47:43 AM ANR8355E I/O error reading label for volume 600697 in drive TAPE02 (/dev/rmt1).
4/9/2008 5:48:02 AM ANR8778W Scratch volume 600697 changed to Private Status to prevent re-access.
4/9/2008 5:48:17 AM ANR8355E I/O error reading label for volume 601015 in drive TAPE02 (/dev/rmt1).
4/9/2008 5:48:35 AM ANR8778W Scratch volume 601015 changed to Private Status to prevent re-access.
4/9/2008 5:48:51 AM ANR8355E I/O error reading label for volume 601130 in drive TAPE02 (/dev/rmt1).
4/9/2008 5:49:09 AM ANR8778W Scratch volume 601130 changed to Private Status to prevent re-access.


The first time I saw this happen, I decided to check out those tapes and relabel them... That worked for some but not all of them. I labelled new tapes and again, it worked for some but not all of them. Those that did not work, I checked out once more, checked back in, and then they worked...

The problem happens on pretty much all rmts (the above was on rmt1, but I saw it happen on 3-4 more drives).

I've run rmdev -l on all the rmts and smcs and then cfgmgr to rediscover everything, but no change...

I could downgrade back to my original ATAPE; would that help? IBM wants me to completely remove the drives, the library and their paths, and then redo my configuration...

Any thoughts?

thanks.
 
Sounds like a twister. Are you sure all the rmts reappeared in the same order (each WWID on the same rmt) as before? If not, the robot will put a tape in, say, drive 1 of frame 1, but TSM will access drive 2 because the order of the drives changed. Just compare the WWIDs (lsattr -El rmtX) with the "sh libr" output, or simply delete all path definitions and redefine them. A simple update with autodetect=yes will probably do the trick as well.
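A minimal sketch of that comparison in Python, assuming you have already collected two device-to-WWID mappings by hand: one from lsattr on AIX and one from the TSM side. The function name and the WWID values below are hypothetical and only illustrate what a swapped pair looks like; nothing here parses real command output.

```python
def find_twisted_drives(os_wwid_by_device, tsm_wwid_by_device):
    """Return the devices whose WWID differs between the OS view and the TSM view."""
    mismatches = {}
    for device, os_wwid in os_wwid_by_device.items():
        tsm_wwid = tsm_wwid_by_device.get(device)
        # Only compare devices known to both sides; ignore letter case in the hex string.
        if tsm_wwid is not None and tsm_wwid.lower() != os_wwid.lower():
            mismatches[device] = (os_wwid, tsm_wwid)
    return mismatches


# Hypothetical values: rmt1 and rmt2 have traded places, rmt3 is unchanged.
aix_view = {
    "rmt1": "0x5005076300000001",
    "rmt2": "0x5005076300000002",
    "rmt3": "0x5005076300000003",
}
tsm_view = {
    "rmt1": "0x5005076300000002",
    "rmt2": "0x5005076300000001",
    "rmt3": "0x5005076300000003",
}

for device, (os_wwid, tsm_wwid) in sorted(find_twisted_drives(aix_view, tsm_view).items()):
    print(f"{device}: AIX sees {os_wwid}, TSM expects {tsm_wwid}")
```

An empty result would suggest the device order did not change and the cause lies elsewhere.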

We have taken to simply renaming (chdev -a new_name=....) all our rmts to their WWID and have never faced such a problem since. Take care though: chdev will badly mess up the ODM if you use device names longer than 16 characters.

PJ
 
hi,

This has happened to me a few times. I found there seemed to be a problem, probably because I tried to label the tapes after checking them in.
The solution that worked for me was to first check out the tapes (with options force=yes and remove=no) and then force a new label process with the following command:
label libvol <library_name> search=yes labelsource=b checkin=scr overwrite=y

library went fine after that

hope this helps
max
 
Hi Pj.

I just checked the WWNs and they all seem to match. I might still remove and redo all the paths and drives, though... Still waiting on IBM's take on this...
 
Hi max.

I've gone through the checkout and relabelling of those tapes. It seemed to help for some tapes but not for others: tapes that had not worked before worked after the relabelling, but then again some still did not read correctly...
weeeeiiird.

thanks for the input
 
Can you narrow it down to specific drives? If it's a twister, you should have an even number of drives throwing errors, and none of them should have a successful mount showing up in the summary for the time since you upgraded ATAPE.
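One way to narrow it down is to tally the label errors per drive from the activity log. The Python sketch below assumes the ANR8355E lines look exactly like the ones quoted in this thread; the regular expression is inferred from those samples only and the function name is made up for illustration.

```python
import re
from collections import Counter

# Pattern inferred from the ANR8355E lines quoted in this thread (an assumption,
# not an official message format definition).
LABEL_ERROR = re.compile(
    r"ANR8355E I/O error reading label for volume (\S+) in drive (\S+) \(([^)]+)\)"
)


def errors_per_drive(log_lines):
    """Count ANR8355E label errors for each TSM drive name."""
    counts = Counter()
    for line in log_lines:
        match = LABEL_ERROR.search(line)
        if match:
            counts[match.group(2)] += 1
    return counts


sample_log = [
    "4/9/2008 5:44:49 AM ANR8355E I/O error reading label for volume 600006 in drive TAPE02 (/dev/rmt1).",
    "4/9/2008 5:45:07 AM ANR8778W Scratch volume 600006 changed to Private Status to prevent re-access.",
    "4/9/2008 5:45:27 AM ANR8355E I/O error reading label for volume 600661 in drive TAPE02 (/dev/rmt1).",
    "4/10/2008 7:25:36 AM ANR8355E I/O error reading label for volume 700258 in drive TAPE07 (/dev/rmt6).",
]

for drive, count in sorted(errors_per_drive(sample_log).items()):
    print(drive, count)
```

Errors confined to a pair of drives would point towards a twister; errors spread across every drive would point elsewhere.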

PJ
 
Hi PJ.

As mentioned earlier, it seems to happen on all drives. I saw it happening on rmt3, took that drive offline in TSM, and it then started to fail on rmt4...

I just spoke with IBM. They want me to delete the paths/drives and recreate them... We'll see if that helps...

steph
 
Hey Stephan,

I have the same situation, and it is due to a tape drive generation and shared media problem. For example, my environment consists of a 3584 tape library with 10 3592-E05 and 4 3592-J1A tape drives. All on-site media is written in native 3592-E05 format, while my off-site media is written in 3592-J1A format. The problem arises when E05-labeled tapes go back to scratch and a J1A drive attempts to read the label for re-use. Once the drive fails to read the label, the tape goes into private state.

What I do to clear the issue is the following:

1. Take my J1A drives off-line.
2. checkout libv libname remove=no checklabel=no vollist=(list of tapes)
3. label libv libname search=yes overwrite=yes labelsource=barcode checkin=scratch waitt=0 vollist=(list of tapes)
4. Check the count of scratch tapes.
5. Bring the J1A drives back online.
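Steps 2 and 3 can be generated for a given list of tapes with a small Python helper. This is only a sketch: it builds the two command strings exactly as written above and assumes vollist takes a comma-separated list; the function name is hypothetical and nothing is sent to the server.

```python
def relabel_commands(library, volumes):
    """Build the checkout and label commands from steps 2 and 3 for the given volumes."""
    vollist = ",".join(volumes)  # assumption: vollist is a comma-separated list
    return [
        f"checkout libv {library} remove=no checklabel=no vollist={vollist}",
        f"label libv {library} search=yes overwrite=yes labelsource=barcode "
        f"checkin=scratch waitt=0 vollist={vollist}",
    ]


# "libname" is the placeholder used in the steps above; the volumes come from the log earlier in the thread.
for command in relabel_commands("libname", ["600006", "600661"]):
    print(command)
```

Review the generated commands before pasting them into an admin session, since overwrite=yes relabels whatever is on the listed tapes.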

What does your tape library environment look like?

Hope this helps you.
 

Hi Frunkster.

Well, I have 9 LTO3 drives, but I do have 2 generations of LTO tapes (LTO2 and LTO3). Technically, an LTO3 drive should be able to read and write both LTO2 and LTO3...

My I/O errors seem to happen on both LTO2 and LTO3 tapes, though... (LTO2 are 6xxxxx and LTO3 are 7xxxxx)

4/10/2008 7:25:36 AM ANR8355E I/O error reading label for volume 700258 in drive TAPE07 (/dev/rmt6).
4/10/2008 8:02:23 AM ANR8355E I/O error reading label for volume 601006 in drive TAPE09 (/dev/rmt8).
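To check whether the failures lean towards one media generation, the errors can be grouped by volume prefix. This Python sketch relies on the numbering convention stated above (6xxxxx = LTO2, 7xxxxx = LTO3), which is specific to this site; the pattern and function name are assumptions for illustration.

```python
import re
from collections import Counter

# Pattern inferred from the ANR8355E lines quoted above (an assumption).
LABEL_ERROR_VOLUME = re.compile(
    r"ANR8355E I/O error reading label for volume (\S+) in drive"
)

# Site-specific convention from the post: 6xxxxx = LTO2, 7xxxxx = LTO3.
GENERATION_BY_PREFIX = {"6": "LTO2", "7": "LTO3"}


def errors_by_generation(log_lines):
    """Count ANR8355E label errors per media generation, keyed on the volume's first digit."""
    counts = Counter()
    for line in log_lines:
        match = LABEL_ERROR_VOLUME.search(line)
        if match:
            generation = GENERATION_BY_PREFIX.get(match.group(1)[0], "unknown")
            counts[generation] += 1
    return counts


mixed_log = [
    "4/10/2008 7:25:36 AM ANR8355E I/O error reading label for volume 700258 in drive TAPE07 (/dev/rmt6).",
    "4/10/2008 8:02:23 AM ANR8355E I/O error reading label for volume 601006 in drive TAPE09 (/dev/rmt8).",
]

for generation, count in sorted(errors_by_generation(mixed_log).items()):
    print(generation, count)
```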

Thanks

PS. This is really starting to annoy me...! :)
 
Hmmm, you know what, Frunkster, you may have something there after all...

As I mentioned, I have all LTO3 drives, but I do have a mix of LTO2 and LTO3 tapes...

I have 2 processes running this morning: a backupset (which is forced to LTO2 tapes) and a reclaim of a storage pool that, by stgpool definition, uses LTO3 tapes...

I was watching it 'live' and the reclaim asked for another scratch tape, so it tried to mount an LTO2 and failed with my typical error, then another LTO2, and another, etc... all failed.

When it was done with all the LTO2s in the library, it mounted an LTO3 and off it went, continuing its reclaim...

I am waiting to see whether the backupset reacts the same way when it asks for another tape...

Hmmm... I need to have 2 devclasses defined for my 2 LTO types, right? No other way around that... The problem, if I am right, is that TSM seems to grab the lowest tape number, which is an LTO2 (6xxxxx), and tries to mount it and read the label. If it is looking for an LTO3, it fails that tape because it is not an LTO3, and keeps mounting until it reaches an LTO3... but while doing that, it 'eats' up all my LTO2s, which other processes like backupsets need...

BTW, this was working fine prior to updating TSM and the lib firmware...

Thoughts?
 
Stephan,

The problem we have is that we are not supposed to have "mixed media generations" of tapes in the same library. By mixed media generations I mean media read by a different generation of tape drive, e.g. LTO3-labeled tapes read by LTO2 drives. The LTO2 drives will not be able to read the label of an LTO3-format tape and will register an error. Chapter 6 of the TSM 5.3 Administrator's Guide explains the issues encountered with mixed media generations. The only way I see to avoid this issue is to assign specific tape volumes to their respective tape storage pools and not have them pick up tapes from, or return tapes to, the scratch pool.

As far as to why it started showing up when you performed the TSM and firmware upgrades, I have no idea.....

Regards,
Frunskter
 
Stephan, any new info on this? After upgrade to 5.4.3 I'm seeing some of this, for example....
ANR8355E I/O error reading label for volume TSM130L3 in drive DR_6C11
(/dev/rmt/5st).

I've never had this, so I'm thinking it's more than coincidence that we're seeing this after the upgrade.
 
Hi Greg.

Well, I had quite a few emails go back and forth with IBM.

From what they are saying, since I have a mixed-generation tape environment with an all-LTO3 drive setup, the problem occurs when TSM tries to mount an LTO3 and gets an LTO2 instead, or vice versa.
In my setup, I have a few stgpools that are built specifically for LTO2 tapes or for LTO3 tapes... One of those LTO2 stgpools, for example, is an archive pool, where I figured I'd archive what I need to keep for 7 years on LTO2 tapes, thus chipping away at my LTO2s until I am left with only LTO3 tapes in my environment...

This worked fine at 5.3 but does not work at 5.4.3. My devclasses were set up as one LTO2_Devclass with FORMAT=ULTRIUM2C and one LTO3_Devclass with FORMAT=ULTRIUM3C.

My archiving was done to a specific management class (MC) associated with the LTO2_Devclass, but when I tried to archive, it requested a mount, picked up an LTO3 and, since it was not the right type, "killed" that scratch tape by marking it PRIVATE...

As per IBM's suggestion, I set up my devclasses with FORMAT=DRIVE instead of ULTRIUM2C or ULTRIUM3C, and now the drives work out for themselves which type of tape they have mounted... but this defeats my plan of getting rid of my LTO2s, since it now picks up any scratch tape...

It is a "temporary" solution if you ask me. It works, in that I don't have any errors and everything is running... but I cannot see why this is different in 5.4.3. Why did it work before the upgrade and not now? Is 5.5 the same? I don't know.

If you want, I'll send you the emails that went back and forth... Maybe this post is a bit hard to understand? It might not make sense... :)

Steph
 