Subject: Re: [ADSM-L] Tape corruption
From: "Prather, Wanda" <Wanda.Prather AT ICFI DOT COM>
To: ADSM-L AT VM.MARIST DOT EDU
Date: Wed, 9 Jul 2014 15:26:35 +0000
Hi Eric -
Was it a "real" write error like:  
        ANR8302E I/O error on drive DRIVE1 (mt0.0.0.5) with -blah- blah - 
ASC=FF, ASCQ= FF,  -glub -

Or an "unexpected" error like an ANR9999D?

TSM "believes" that VTL drive is a real tape drive.
And like you, I would expect a legitimate error to result in TSM killing the 
process, but backing out the write transaction (which in the case of a 
migration would leave the source block undamaged).  I've never seen TSM fail to 
do that properly with "real" tape drives.

So this is just speculation on my part, but my suspicion would be that the DD
is not returning a true tape write error code back to the TSM server
(possibly due to a problem with the DD emulation, or with the fibre, the tape
driver, the HBA driver, etc.?)

If it were me, I'd open a PMR with the exact error codes you see on those
write errors and get a response from support as to why.
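
To gather those exact codes, a QUERY ACTLOG along these lines should pull
them out of the activity log (just a sketch; adjust the date range and
message number to whatever you're actually seeing):

        QUERY ACTLOG BEGINDATE=TODAY-7 SEARCH=ANR8302E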

Until you get a resolution you could also turn on CRC checking for the DD
storage pool, and run AUDITs against the new volumes every day to detect
errors that occur due to issues beyond TSM's "view" of the data. It won't
solve the problem, but it will at least tell you if you have bad volumes,
hopefully before the original data is gone, so you can rerun the backup.
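
If you go that route, the commands would be something along these lines (just
a sketch, assuming a pool named DDPOOL; check HELP UPDATE STGPOOL and HELP
AUDIT VOLUME on your server level):

        UPDATE STGPOOL DDPOOL CRCDATA=YES
        AUDIT VOLUME volume_name FIX=NO

Keep in mind CRCDATA=YES only applies to data written after you turn it on,
so the daily audits are what will catch anything already on those volumes.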
 
And please post your results back - inquiring minds are curious about this one!

Wanda


 

-----Original Message-----
From: ADSM: Dist Stor Manager [mailto:ADSM-L AT VM.MARIST DOT EDU] On Behalf Of 
Loon, EJ van (SPLXM) - KLM
Sent: Wednesday, July 09, 2014 8:01 AM
To: ADSM-L AT VM.MARIST DOT EDU
Subject: [ADSM-L] Tape corruption

Hi TSM-ers!
We are using two kinds of virtual tape solutions. The old solution is based
on the DL4100 from EMC, the new solution is based on the DataDomain, also
from EMC.
Because we are using them as virtual tape servers, one is also bound to the
shortcomings of the tape protocol. If there is some kind of hiccup or
congestion during the data transfer, there is no retry; the I/O just fails. I
have seen multiple cases where migrations were temporarily saturating the
fiber adapters of the Data Domain server or one of the switches in between,
and this caused a write I/O error. This resulted in a corrupted virtual tape,
which I find strange. If TSM is moving data from the diskpool to a tapepool
and the write fails with an I/O error, why does it delete that data from the
diskpool? The data block move should be rolled back in such a case, right?
In the DL4100 setup we are using a primary pool and a copypool. Most of the
time the data which is lost was backed up earlier by the BACKUP STGPOOL of
the diskpool and can thus be restored from the copypool, but in the
DataDomain situation we use DD replication, and then the data on the
corrupted tapes is gone forever!
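For reference, in the copypool case the recovery is roughly this (with a
made-up volume name):

        RESTORE VOLUME VT0042

TSM marks the corrupted volume destroyed and rebuilds its contents from the
copypool copies. With DD replication there is nothing equivalent to fall back
on.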
Thanks for any clarification in advance!
Kind regards,
Eric van Loon
AF/KLM Storage Engineering
