Tape error and life span of tapes

Mirdas_M

Hi TSM gurus.



I manage and administer the TSM/HP tape library at my company. We use LTO-2 tapes for backup and restore.

Now the problem is that many of our tapes have read and write errors, making it impossible to restore all of our backup data!

See the queries below.





tsm: ATHENA>select count(*) as "Number of tapes with read and write errors" from volumes where read_errors!=0 or write_errors!=0



Number of tapes with read and write errors

------------------------------------------

40



and

tsm: ATHENA>select count(*) as "Total number of tapes" from volumes



Total number of tapes

---------------------

82

Now, MOVE DATA or AUDIT VOLUME <volume_name> FIX=YES works, but not on all tapes! My questions are:



1) If a tape has read and write errors, is it classified as a faulty tape? What should I do with it?

2) Is there such a thing as a "tape life span"? We have been using the same tapes for 2-3 years! What should we do?

Can anyone point me to a document on tape management, or let me know how to overcome this situation?

Thanks in advance!

DID
 
Hi,



First you need to determine whether your problem is a media issue or a drive issue. Go through your activity log and check whether the same tape still gives read/write errors when it is mounted in different drives. If so, the media is most probably bad. If most of the tapes get errors only while they are mounted in a certain drive or drives, then it is a drive issue. Once you have done this you will have a clear picture. I am using LTO1 and LTO2 tapes that are now 3 or more years old, and I have seen the frequency of media errors going up in my environment.
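
A minimal Python sketch of that check, assuming the activity log has been saved as plain text and that label-read errors show up as ANR8355E messages naming both the volume and the drive. The function names are made up for illustration, and the sketch ignores every other message type, so treat the result as a hint rather than a diagnosis.

```python
import re
from collections import Counter, defaultdict

# ANR8355E names both the volume and the drive; the optional
# "(Session: ..., Origin: ...)" prefix is skipped if present.
LABEL_ERROR = re.compile(
    r"ANR8355E(?: \([^)]*\))? I/O error reading label for volume (\S+) in drive (\S+)"
)

def tally_label_errors(actlog_text):
    """Count label-read errors per drive and record which drives each volume failed in."""
    # QUERY ACTLOG wraps long messages, so collapse all whitespace first.
    flat = " ".join(actlog_text.split())
    per_drive = Counter()
    drives_per_volume = defaultdict(set)
    for volume, drive in LABEL_ERROR.findall(flat):
        per_drive[drive] += 1
        drives_per_volume[volume].add(drive)
    return per_drive, drives_per_volume

def suspect_media(drives_per_volume):
    """Volumes that failed in more than one drive: more likely bad media than a bad drive."""
    return sorted(v for v, drives in drives_per_volume.items() if len(drives) > 1)
```

If one drive dominates `per_drive` while `suspect_media` stays empty, that points toward the drive; volumes that fail in several drives point toward the media.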
 
Thanks for your reply.

We have four tape drives! So if a tape can't be mounted properly, is it going to show errors? Can you explain what causes read or write errors? Knowing that would give me a head start!

Thanks
 
There are a few main reasons why a tape will have read/write errors.



1) dirty drive (most common)

2) faulty firmware

3) age of tape



There are others, but these are the main ones. If you clean your drives on a regular basis, make sure your microcode is current AND stable, and cycle your tapes so that old tapes are replaced, you shouldn't have very many errors.



I have over 5000 tapes and only have an error show up once every few months. Once the error does show up, the data on the tape is moved to another tape and the tape with the error is tested. If it passes the test, it is returned to the scratch pool but noted. If it fails, it is destroyed. If a tape has an error twice within a year, it is replaced and destroyed.
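
The rule described above can be written down as a small decision function. This is only a sketch in Python: the function name, the return strings, and the use of a rolling 365-day window for "within a year" are assumptions made for illustration, and it assumes the data has already been moved off the tape.

```python
from datetime import date, timedelta

def tape_action(error_dates, passed_test, today):
    """Decide what to do with a tape after its data has been moved to another tape.

    error_dates: dates on which this tape reported an error (including the latest).
    passed_test: result of testing the tape after the data was moved off.
    """
    one_year_ago = today - timedelta(days=365)
    recent_errors = [d for d in error_dates if d >= one_year_ago]
    # Two errors within a year: replace and destroy, regardless of the test.
    if len(recent_errors) >= 2:
        return "destroy"
    # A tape that fails the test is destroyed.
    if not passed_test:
        return "destroy"
    # Otherwise it goes back to scratch, but the error is noted for next time.
    return "return to scratch, noted"
```

Keeping the list of error dates per tape is what "noted" amounts to here: without it, the second-error rule cannot be applied.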



-Aaron
 
Hi Aaron.



Thanks for your reply. But I still have some questions!

It's true, I don't clean my drives that much! I do it only when I get an alert (the drive signaling that it needs to be cleaned). So how often should I clean the drives? Once a week or once a month?

What do you mean by "make sure your microcode is current AND stable"? And by "cycle your tapes", do you mean cleaning my drives? Do we clean tapes too? (Sorry for the silly question!)

What do you mean by "If it passes the test, it is returned to the scratch pool but noted"?

What I normally do with tapes that have read/write errors is first try moving the data off them, but most of the time it doesn't work! So if I understand you correctly, tapes with read/write errors that can't be returned to the scratch pool should be destroyed?

Do you have a written or formal procedure that you follow? Please share your experience with me!

I am sending you some I/O error output from the actlog; from it you may be able to tell whether these are tape or drive errors!

Thanks!



Date/Time Message

-------------------- ----------------------------------------------------------

01/11/07 12:52:09 ANR8302E I/O error on drive DRIVE2 (/dev/rmt/1mt)

(OP=WRITE, Error Number=145, CC=205, KEY=FF, ASC=FF,

ASCQ=FF, SENSE=**NONE**, Description=SCSI adapter

failure). Refer to Appendix D in the 'Messages' manual

for recommended action.

01/11/07 12:55:30 ANR8300E I/O error on library MTNHLIB (OP=C0106C03,

CC=205, KEY=FF, ASC=FF, ASCQ=FF, SENSE=**NONE**,

Description=SCSI adapter failure). Refer to Appendix D

in the 'Messages' manual for recommended action.

01/11/07 13:00:16 ANR8300E I/O error on library MTNHLIB (OP=C0106C03,

CC=304, KEY=02, ASC=04, ASCQ=01,

SENSE=70.00.02.00.00.00.00.14.00.00.00.00.04.01.00.00.00-

.00.00.00., Description=Changer failure). Refer to

Appendix D in the 'Messages' manual for recommended

action.

01/11/07 13:05:00 ANR8300E I/O error on library MTNHLIB (OP=C0106C03,

CC=304, KEY=02, ASC=04, ASCQ=01,

SENSE=70.00.02.00.00.00.00.14.00.00.00.00.04.01.00.00.00-

.00.00.00., Description=Changer failure). Refer to

Appendix D in the 'Messages' manual for recommended

action.

01/11/07 13:08:17 ANR8300E I/O error on library MTNHLIB (OP=C0106C03,

CC=301, KEY=0B, ASC=53, ASCQ=00,

SENSE=70.00.0B.00.00.00.00.14.00.00.00.00.53.00.00.00.00-

.00.00.00., Description=Cartridge load failure). Refer

to Appendix D in the 'Messages' manual for recommended

action.

01/11/07 14:24:11 ANR8355E I/O error reading label for volume EBW753L1 in

drive DRIVE4 (/dev/rmt/3mt).

01/11/07 16:56:48 ANR8355E I/O error reading label for volume EBW734L1 in

drive DRIVE3 (/dev/rmt/2mt).

01/11/07 17:09:30 ANR8353E 003: I/O error reading label of volume in drive

DRIVE4 (/dev/rmt/3mt).

01/11/07 17:15:07 ANR8355E I/O error reading label for volume EBW522L1 in

drive DRIVE4 (/dev/rmt/3mt).

01/11/07 17:47:41 ANR8302E I/O error on drive DRIVE4 (/dev/rmt/3mt)

(OP=LOCATE, Error Number=145, CC=205, KEY=FF, ASC=FF,

ASCQ=FF, SENSE=**NONE**, Description=SCSI adapter

failure). Refer to Appendix D in the 'Messages' manual

for recommended action.

01/11/07 17:59:41 ANR8300E I/O error on library MTNHLIB (OP=C0106C03,

CC=304, KEY=06, ASC=28, ASCQ=8D,

SENSE=70.00.06.00.00.00.00.14.00.00.00.00.28.8D.00.00.00-

.00.00.00., Description=Changer failure). Refer to


Appendix D in the 'Messages' manual for recommended

action.

01/11/07 17:59:41 ANR8300E I/O error on library MTNHLIB (OP=C0106C03,

CC=314, KEY=05, ASC=3B, ASCQ=0E,

SENSE=70.00.05.00.00.00.00.14.00.00.00.00.3B.0E.00.C0.00-

.04.00.00., Description=The source slot or drive was

empty in an attempt to move a volume). Refer to Appendix

D in the 'Messages' manual for recommended action.

01/11/07 23:01:45 ANR8355E (Session: 185, Origin: FLAGSRV3) I/O error

reading label for volume EBW374L1 in drive DRIVE3

(/dev/rmt/4mt).

01/11/07 23:01:45 ANR8355E I/O error reading label for volume EBW374L1 in

drive DRIVE3 (/dev/rmt/2mt).

01/11/07 23:02:20 ANR8355E I/O error reading label for volume EBW374L1 in

drive DRIVE3 (/dev/rmt/2mt).

01/11/07 23:04:11 ANR8355E (Session: 185, Origin: FLAGSRV3) I/O error

reading label for volume EBW340L1 in drive DRIVE4

(/dev/rmt/3mt).

01/11/07 23:04:14 ANR8355E I/O error reading label for volume EBW340L1 in

drive DRIVE4 (/dev/rmt/3mt).

01/11/07 23:04:22 ANR8355E I/O error reading label for volume EBW340L1 in

drive DRIVE4 (/dev/rmt/3mt).

01/11/07 23:18:25 ANR8355E I/O error reading label for volume EBW340L1 in

drive DRIVE3 (/dev/rmt/2mt).

01/11/07 23:19:58 ANR8355E I/O error reading label for volume EBW374L1 in

drive DRIVE4 (/dev/rmt/3mt).

01/11/07 23:23:00 ANR8355E I/O error reading label for volume EBW361L1 in

drive DRIVE3 (/dev/rmt/2mt).

01/11/07 23:24:44 ANR8302E I/O error on drive DRIVE4 (/dev/rmt/3mt)

(OP=READ, Error Number=110, CC=403, KEY=08, ASC=14,

ASCQ=03, SENSE=F0.00.08.00.00.00.50.0E.00.00.00.00.14.03-

.00.00.2C.7E.00.00.00.00., Description=Media failure).

Refer to Appendix D in the 'Messages' manual for

recommended action.

01/11/07 23:24:44 ANR8355E I/O error reading label for volume EBW364L1 in

drive DRIVE4 (/dev/rmt/3mt).

01/12/07 09:42:09 ANR2017I Administrator MIRDASSOU issued command: QUERY

ACTLOG begind=-1 search=I/O

01/12/07 11:42:10 ANR8302E I/O error on drive DRIVE4 (/dev/rmt/3mt)

(OP=WEOF, Error Number=201, CC=412, KEY=02, ASC=3A,

ASCQ=00, SENSE=70.00.02.00.00.00.00.0E.00.00.00.00.3A.00-

.00.00.2C.6B.00.00.00.00., Description=Media not present

in drive). Refer to Appendix D in the 'Messages' manual

for recommended action.

01/12/07 11:42:10 ANR8300E I/O error on library MTNHLIB (OP=C0106C03,

CC=314, KEY=05, ASC=3B, ASCQ=0E,

SENSE=70.00.05.00.00.00.00.14.00.00.00.00.3B.0E.00.C0.00-


.04.00.00., Description=The source slot or drive was

empty in an attempt to move a volume). Refer to Appendix

D in the 'Messages' manual for recommended action.

01/12/07 12:08:24 ANR8355E I/O error reading label for volume EBW340L1 in

drive DRIVE4 (/dev/rmt/3mt).

01/12/07 12:43:52 ANR2017I Administrator MIRDASSOU issued command: QUERY

ACTLOG search=I/O begind=-1



DID :grin:
 
There should be a recommended cleaning frequency for the drives you use. If you contact the manufacturer of the drives, they should tell you how often you need to clean them. Also, if your datacenter is dirty (it is amazing how dusty they can get) you may want to clean them more often. When I had 3590 tape drives, we cleaned them every day.



The microcode statement was about the microcode problems that a lot of people have had with the early versions of LTO drives and libraries. One problem was that if you upgraded the drive code, you HAD to upgrade the library code as well. As I haven't had problems with microcode, I can't tell you what versions are good.



What I meant by tape cycling is to take tapes that have been offsite for a long time and move the data on them to new tapes so that they come back onsite to be used again. You don't want tapes sitting offsite for years and years and you also don't want a small number of onsite tapes to be used over and over again. Basically, spread the workload over all your tapes.



When a tape has an error, I test it (normally by writing tar data to it) to see if there is a media error on the tape. If there are no errors when I test it, I return it to the scratch pool.



With those errors, you should be able to have either IBM or the drive manufacturer tell you exactly what is wrong with it. I don't know those codes so I can't tell you what is going on.
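
As a starting point before calling support, the KEY= field in those messages can at least be mapped to the generic SCSI sense key names. The Python sketch below assumes KEY/ASC/ASCQ are the standard SCSI sense fields printed as two hex digits; the function name is made up for illustration. ASC/ASCQ meanings depend on the device, so they are passed through undecoded, and a value such as KEY=FF is not a standard sense key and comes back as None.

```python
import re

# Generic SCSI sense key names (the low 4 bits of the sense key byte).
SENSE_KEYS = {
    0x0: "NO SENSE",
    0x1: "RECOVERED ERROR",
    0x2: "NOT READY",
    0x3: "MEDIUM ERROR",
    0x4: "HARDWARE ERROR",
    0x5: "ILLEGAL REQUEST",
    0x6: "UNIT ATTENTION",
    0x7: "DATA PROTECT",
    0x8: "BLANK CHECK",
    0xB: "ABORTED COMMAND",
    0xD: "VOLUME OVERFLOW",
    0xE: "MISCOMPARE",
}

# \s* tolerates the line wrapping that QUERY ACTLOG adds between fields.
SENSE_FIELDS = re.compile(
    r"KEY=([0-9A-F]{2}),\s*ASC=([0-9A-F]{2}),\s*ASCQ=([0-9A-F]{2})"
)

def decode_sense(actlog_text):
    """Return (sense key name or None, ASC, ASCQ) for each KEY/ASC/ASCQ triple found."""
    results = []
    for key, asc, ascq in SENSE_FIELDS.findall(actlog_text):
        results.append((SENSE_KEYS.get(int(key, 16)), asc, ascq))
    return results
```

The drive or library vendor's documentation is still needed to interpret the ASC/ASCQ pair for a specific device.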



-Aaron
 