Damaged Extents after upgrade to 8.1.19.000

So what did you do? Upgrade only to 8.1.19, or enable storage rule replication while already on 8.1.19?
This is on 8.1.18.009. We discovered issues when reclaiming tapes for the dedup tape copy. There were extents that could not be read, and few or no entries in the actlog. As we tried to fix bad extents, more of them appeared. It did not have the same origin as yours, but it comes down to reading blocks from the stgpool directories.
 
We saw something else, not that fun.
On one server, 5 containers with 126 bad extents.
Ran a new "full" backup of all files marked as damaged.
Checked again: fewer extents damaged. Ran a new stgrule replication: MORE extents damaged.
Ran AUDIT CONTAINER with ACTION=SCANDAMAGED: nothing fixed.
Ran AUDIT CONTAINER with ACTION=SCANALL, and then ALL extents on those containers were marked damaged.
So it's typically a code problem.
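For reference, this is roughly what those steps look like at the dsmadmc prompt (a sketch; the pool name and container path below are made-up examples, take yours from the QUERY DAMAGED output):

Code:
/* report damaged data in the pool (pool name is an example) */
query damaged STGPOOL_DIR

/* re-check only the extents already marked damaged */
audit container /tsmdata/00/0000000000000abc.dcf action=scandamaged

/* scan every extent in the container */
audit container /tsmdata/00/0000000000000abc.dcf action=scanall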

Let's see whether we get an 8.1.19.00* or an 8.1.20.100.
 
Just got this from IBM:

Good afternoon
Development are making the latest 8.1.19 code available shortly. We can advise when this is available.
 
Hi,

I just got 8.1.19.005 due to a different issue, curious if that is the same one you will get.
 
Hm... I'd like to see an 8.1.19.100 instead of a special .005 or so.
But let's see; it hasn't been communicated to us yet.
 
Wasn't your problem also related to damaged extents?
No, I haven't seen that in 8.1.19. I've had it before though; I think we were running 8.1.14.200 when we had damaged extents. In this case the issue we are having is that the stgrule hangs.
 
IBM just said they have a fix for us, 8.1.19.005, but no download link :)
So we will install it on the servers we have problems with.
The remaining part will be all the damaged extents... so sadly I guess we are facing a level-5 audit.
How long will that run for a 100 TB stgpool? Weeks? Months?
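(If it comes to that: my understanding is that instead of scripting it per container, AUDIT CONTAINER can be pointed at the whole pool. Pool name below is just an example.)

Code:
/* scan all extents in all containers of the pool - expect a long runtime */
audit container stgpool=STGPOOL_DIR action=scanall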
 
Hi,

I managed anything from 48 to 2027 containers per day. Not a lot to go on, but it may give you a clue.


Code:
Protect: TSMxxxxxxxx>select date(LASTAUDIT_DATE),count(*) from containers group by date(LASTAUDIT_DATE)                                              

 Unnamed[1]       Unnamed[2]
-----------     ------------
 2023-10-06              893
 2023-10-07             2027
 2023-10-08             1744
 2023-10-09             1213
 2023-10-10               48
 2023-10-11               29
                        7264

I have in this case a dedup pool of 139,182 G at 86.7% utilization. I estimated about one week, but that was a bit on the low side.
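If you want to track progress while it runs, a variant of the same select should work (a sketch; it assumes containers not yet audited have a NULL LASTAUDIT_DATE):

Code:
select count(*) from containers where lastaudit_date is null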
 
We have managed to get all backups running; we did a full backup of all VMs.
Yesterday we tested a restore and boom: RC=14, no data available on server.

And we found new damaged extents, but this time the backups keep running while the restore fails.

This is starting to get really scary. It feels like 8.1.19 is one big virus.

10/10/2023 18:42:37 ANS4068I Restored virtual machine 'Jane' was backed up using a "VMware Tools with file system quiescing and application quiescing disabled" snapshot.
This is equivalent to a "crash-consistent" backup.
10/10/2023 18:48:37 ANS1314E File data currently unavailable on server
10/10/2023 18:48:37 ANS0361I DIAG: vmRestoreFillWriteBufferFromApi(): error reading from api, getData: rc=14
10/10/2023 18:48:37 ANS0361I DIAG: vmRestoreCommonRestoreExtentList(): Error restoring extent data: vmRestoreCommonRestoreExtent: rc=14.
10/10/2023 18:48:42 ANS0361I DIAG: vmRestoreMBRestoreSessionCallback(): vmRestoreCommonRestoreExtentList() failed with rc 14.
10/10/2023 18:48:50 ANS0361I DIAG: vmRestoreCommonProcessAllDATFiles(): one or more restore session threads failed: highest rc=14 .
10/10/2023 18:48:50 ANS0361I DIAG: vmRestoreCommonOptRestoreDisk(): dat file processor thread of vmname=Jane failed: highest rc=14 .
10/10/2023 18:53:48 ANS1314E File data currently unavailable on server
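In case it helps narrow down which nodes/VMs are affected before trying more restores, QUERY DAMAGED has a per-node view (pool name is an example; see HELP QUERY DAMAGED for the other TYPE options):

Code:
/* count of files with damaged extents, per node */
query damaged STGPOOL_DIR type=node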
 
Did you run a stgrule audit level 5, or did you run the audit per container, one by one, by script?
 
We now have a situation where even if we run a full backup, we can't restore.
This is crap. Maybe create a new stgpool and point the mgmtclass at that one so we force full backups of everything. :(
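Roughly what that redirect would look like; all names here are placeholders, so treat it as a sketch rather than a tested procedure:

Code:
/* new directory-container pool as a clean landing zone (placeholder names) */
define stgpool NEWPOOL_DIR stgtype=directory
define stgpooldirectory NEWPOOL_DIR /tsmdata_new

/* point the backup copy group at the new pool and activate the change */
update copygroup STANDARD STANDARD VM_MC standard type=backup destination=NEWPOOL_DIR
validate policyset STANDARD STANDARD
activate policyset STANDARD STANDARD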
 
Guess it's the same one you got (8.1.19.005). But I really wonder why they didn't give it to us last week already, since it was available for you before. :(

TSM-ESEM-01> q system

**************************************************
*** ---> SHOW BANNERS
**************************************************
*********************************************************************
* This is a Limited Availability TEMPORARY fix for                  *
* IT43969 - CONVERT STGPOOL FAILS WITH ANR9999D INCONSISTENT        *
*           CONTENT FOR ALIAS.                                      *
* IT43893 - STORAGE POOL VOLUME OPERATIONS SLOW AFTER LARGE DATA    *
*           DELETION                                                *
* IT44385 - AFTER STORAGE/SPECTRUM PROTECT SERVER UPGRADE TO        *
*           8.1.17 OR HIGHER, PERFORMANCE CAN DECREASE IN CLOUD     *
*           WRITE OPERATIONS                                        *
* IT44406 - DELETE STGPOOL FAILS WITH ANR3738I FOR                  *
*           CLOUD-CONTAINER STGPOOL WITH DESTROYED NON-S3 CLOUD     *
*           CONTAINERS                                              *
* IT44302 - SERVER COMMANDS OR PROCESS HANG WHEN SCRATCH VOLUMES    *
*           DELETION ARE ONGOING IN THE BACKGROUND                  *
* IT44599 - SPECTRUM PROTECT SERVER REPLICATION STGRULE HANG        *
* IT44595 - REPLICATION CAN CRASH THE SERVER WHEN NO NODES ARE      *
*           CONFIGURED FOR REPLICATION                              *
* This cumulative efix server is based on code level                *
* available with patch level 8.1.19.000                             *
*                                                                   *
*********************************************************************
 
Yeah, not sure. What I don't understand, though, is which of these APARs is relevant to the issue you are having. Based on the descriptions alone, I don't see which one would apply.
 
I fully agree, but the one that has something to do with the stgrule hanging might also fix other related stgrule issues, I hope. If not, I will let them know. We have just installed it on 2 primary servers and one replication target. Will start a new stgrule replication in a couple of minutes; then we'll know in a day or two. But I am 100% sure that we FIRST need to fix the damaged extents. I have asked and asked and asked... no reply on that.
 
Yes, you will need to fix them, in my experience at least. Whatever transaction has those extents marked as damaged will fail (with very little information in the actlog; only in dsmffdc.log did I find what the issue was). The size of those transactions depends on what REPLBATCHSIZE / REPLSIZETHRESH are set to, as I understand it. We ended up seeing exactly the REPLBATCHSIZE value as the number of files that failed to replicate.

After those damaged extents were fixed, we didn't see any issues. What I mean to say is: suppose the damaged extents affected only 50 files, but REPLBATCHSIZE was 9216; then 9216 files would fail to replicate. When we reduced it to 4096, everything replicated except 4096 files (since some damaged extents were in that transaction, as I understand it).

We didn't have that many damaged extents though; not sure how many you have.
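For anyone looking these up: REPLBATCHSIZE and REPLSIZETHRESH are server options in dsmserv.opt. The values below are just illustrative, not a recommendation:

Code:
* dsmserv.opt (illustrative values)
* max number of files grouped into one replication batch/transaction
REPLBATCHSIZE   4096
* max amount of data, in MB, per replication batch before it is committed
REPLSIZETHRESH  4096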
 
Why can't IBM explain this, then? We have between 50 and 220 damaged extents on the servers we have problems with.
One issue we have is that we ran PROTECT/REPLICATE and then our replication target crashed beyond repair.
Then we set up a new server and used stgrule replication. BAM, damaged extents. So we have nothing here to run a storage pool restore from. So the option we have is to back up all files with a selective backup to force a new active version to be made, then remove the damage on the rest of the extents.

I will see what happens during this next stgrule replication, and after that I guess we need to remove the damaged extents.
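As I understand it, that "remove damaged" step is AUDIT CONTAINER with ACTION=REMOVEDAMAGED, run once a fresh backup of the affected files exists so nothing still depends on the bad extents (container path is an example):

Code:
/* delete extents already marked damaged from this container (example path) */
audit container /tsmdata/00/0000000000000abc.dcf action=removedamaged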
 
So we have one stgpool with no damaged extents left.
Ran a new stgrule replication after adding the .005 fix.
The replication job still fails:
10/12/23, 15:01:54 ANR1652E Replication failed. Total number unresolved extents is 1,141. (SESSION: 1540, PROCESS: 185, JOB: 1292)

Whut?!? What is failing?
 