uffeg
ADSM.ORG Member
So what did you do? Upgrade only to 8.1.19? Or enable storage rule replication while already on 8.1.19?
Hi,
Got the fun myself. Running an audit level 5 to identify bad extents.
This is on 8.1.18.009. We discovered issues when reclaiming tapes for dedup tape copy. There were extents that could not be read, and few or no entries in the actlog. As we tried to fix bad extents, more of them appeared. It was not the same origin as yours, but it comes down to reading of blocks from stgpooldire.
We saw something else not that fun.
Hi,
Just got this from IBM:
Good afternoon
Development are making the latest 8.1.19 code available shortly. We can advise when this is available.
Hm... I'd like to see an 8.1.19.100 instead of a special .005 or so.
Hi,
I just got 8.1.19.005 due to a different issue, curious if that is the same one you will get.
Wasn't your problem also related to damaged extents?
No, not seen that in 8.1.19. I've had that before though; I think we were running 8.1.14.200 when we had damaged extents. In this case the issue we are having is that the stgrule hangs.
Hi,
IBM just said they have a fix for us, 8.1.19.005, but no download link yet.
So we will install this in the servers we have problem with.
The remaining part will be all the damaged extents... so sadly I guess we are facing a level 5 audit.
Which, for a 100 TB stgpool, will run for how long? Weeks, months?
Protect: TSMxxxxxxxx>select date(LASTAUDIT_DATE),count(*) from containers group by date(LASTAUDIT_DATE)
Unnamed[1] Unnamed[2]
----------- ------------
2023-10-06 893
2023-10-07 2027
2023-10-08 1744
2023-10-09 1213
2023-10-10 48
2023-10-11 29
7264
Did you run a stgrule audit level 5, or did you run the audit per container, one by one, by script?
Hi,
I manage anything from 48 to 2027 containers per day. Not a lot to work on. It may give you a clue though.
I have in this case a dedup pool of 139,182 G 86.7% util. I estimated about one week, but that was a bit on the low side.
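To put a rough number on "how long", here is a minimal sketch of the arithmetic using the daily counts from the query output in this thread. It assumes (this is not confirmed anywhere above) that the last row with no date, 7264, is the number of containers that have not been audited yet, and that the average daily rate stays the same. The variable names are just for illustration.

```python
# Containers audited per day, copied from the select output in this thread.
audited_per_day = {
    "2023-10-06": 893,
    "2023-10-07": 2027,
    "2023-10-08": 1744,
    "2023-10-09": 1213,
    "2023-10-10": 48,
    "2023-10-11": 29,
}

# ASSUMPTION: the row without a date (7264) counts containers not yet audited.
remaining_containers = 7264

total_audited = sum(audited_per_day.values())
average_per_day = total_audited / len(audited_per_day)
estimated_days_left = remaining_containers / average_per_day

print(f"audited so far:      {total_audited}")
print(f"average per day:     {average_per_day:.0f}")
print(f"estimated days left: {estimated_days_left:.1f}")
```

The rate swings a lot from day to day (48 versus 2027), so treat the result as an order-of-magnitude guess only.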
I added a stgrule for this.
Guess the same as you got. But I really wonder why they didn't give that to us last week already, since it was available to you before then.
Yeah, not sure. What I don't understand though is which of these APARs are relevant to the issue you are having? Based on the description alone I don't see which would be applicable to the problem you are having.
TSM-ESEM-01>q system
**************************************************
*** ---> SHOW BANNERS
**************************************************
*********************************************************************
* This is a Limited Availability TEMPORARY fix for *
* IT43969 - CONVERT STGPOOL FAILS WITH ANR9999D INCONSISTENT *
* CONTENT FOR ALI AS. *
* IT43893 - STORAGE POOL VOLUME OPERATIONS SLOW AFTER LARGE DATA *
* DELETION *
* IT44385 - AFTER STORAGE/SPECTRUM PROTECT SERVER UPGRADE TO *
* 8.1.17 OR HIGHER, PERFORMANCE CAN DECREASE IN CLOUD *
* WRITE OPERATIONS *
* IT44406 - DELETE STGPOOL FAILS WITH ANR3738I FOR *
* CLOUD-CONTAINER STGPOOL WITH DESTROYED NON-S3 CLOUD *
* CONTAINERS *
* IT44302 - SERVER COMMANDS OR PROCESS HANG WHEN SCRATCH VOLUMES *
* DELETION ARE ONGOING IN THE BACKGROUND *
* IT44599 - SPECTRUM PROTECT SERVER REPLICATION STGRULE HANG *
* IT44595 - REPLICATION CAN CRASH THE SERVER WHEN NO NODES ARE *
* CONFIGURED FOR REPLICATION *
* This cumulative efix server is based on code level *
* available with patch level 8.1.19.000 *
* *
*********************************************************************
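If you want to check which APARs in a banner like this mention a given keyword, a small script can do it. This is only a sketch: it works on a pasted excerpt of the banner above, and the function name `parse_apars` and the rule "a line starting with 'This' ends the APAR list" are my own assumptions about the layout, not anything IBM documents.

```python
import re

# Excerpt of the banner text from the q system output above.
banner = """\
* IT44302 - SERVER COMMANDS OR PROCESS HANG WHEN SCRATCH VOLUMES *
* DELETION ARE ONGOING IN THE BACKGROUND *
* IT44599 - SPECTRUM PROTECT SERVER REPLICATION STGRULE HANG *
* IT44595 - REPLICATION CAN CRASH THE SERVER WHEN NO NODES ARE *
* CONFIGURED FOR REPLICATION *
* This cumulative efix server is based on code level *
* available with patch level 8.1.19.000 *
"""

def parse_apars(text):
    """Return {APAR id: description}, joining wrapped description lines."""
    apars = {}
    current = None
    for raw in text.splitlines():
        # Drop the asterisk frame and surrounding blanks.
        line = raw.strip().strip("*").strip()
        if not line:
            continue
        match = re.match(r"(IT\d+)\s*-\s*(.*)", line)
        if match:
            current = match.group(1)
            apars[current] = match.group(2)
        elif line.startswith("This "):
            # ASSUMPTION: a sentence starting with "This" is banner text, not an APAR.
            current = None
        elif current is not None:
            apars[current] += " " + line
    return apars

apars = parse_apars(banner)
stgrule_apars = [apar for apar, text in apars.items() if "STGRULE" in text]
print(stgrule_apars)
```

A keyword match only tells you what the APAR title says, not whether the fix covers your case.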
I fully agree, but the one that has something to do with the stgrule hanging might also fix other related stgrule issues, I hope. If not, I will let them know. I have just installed it on 2 primary servers and one replication target. I will start a new stgrule replication in a couple of minutes, then we'll know in a day or 2. But I am 100% sure that we FIRST need to fix the damaged extents. I have asked and asked and asked... no reply on that.
Yes, you will need to fix them, from my experience at least. Whatever transaction has those extents marked as damaged will fail (with very little information in the actlog; only in dsmffdc.log did I find what the issue was). As I understand it, the size of those transactions depends on what ReplBatchSize / ReplSizeThresh is set to. We ended up seeing exactly the value of ReplBatchSize as the number of files that failed to replicate.
Why can't IBM explain this then? We have between 50 and 220 extents on the servers we have problems with.
After those damaged extents were fixed we didn't see any issues. What I mean to say: the damaged extents affected only, let's say, 50 files, but we had a ReplBatchSize of 9216, for example, so it would fail to replicate 9216 files. When we reduced it to 4096, it would replicate everything except 4096 files (my understanding is that some damaged extents were in that transaction).
We didn't have that many damaged extents though, not sure how many damaged extents you have.
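To illustrate what I mean, here is a toy model of that behaviour. It is only a sketch of my understanding (one file with a damaged extent fails the whole batch it sits in), not how the server is actually implemented. The total file count and the positions of the damaged files are made-up numbers, and `failed_files` is a hypothetical helper.

```python
def failed_files(total_files, damaged_file_indexes, batch_size):
    """Count files that fail when one damaged file fails its whole batch."""
    # Batches that contain at least one damaged file.
    bad_batches = {index // batch_size for index in damaged_file_indexes}
    failed = 0
    for batch in bad_batches:
        start = batch * batch_size
        end = min(start + batch_size, total_files)  # last batch may be short
        failed += end - start
    return failed

# HYPOTHETICAL numbers: 100,000 files, 50 damaged ones sitting close together.
total = 100_000
damaged = range(100, 150)

for batch_size in (9216, 4096):
    print(batch_size, failed_files(total, damaged, batch_size))
```

With all 50 damaged files in the same batch, the number of failed files equals the batch size, which matches what we saw. If the damaged files were spread over several batches, more than one batch would fail.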
10/12/23, 15:01:54 | ANR1652E Replication failed. Total number unresolved extents is 1,141. (SESSION: 1540, PROCESS: 185, JOB: 1292) |
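If you want to track that number over several replication runs, it can be pulled out of the activity log line with a small script. A minimal sketch, assuming the message always has the wording shown in the line above; `unresolved_extents` is a hypothetical helper name.

```python
import re

# The activity log line from this thread.
log_line = (
    "10/12/23, 15:01:54 | ANR1652E Replication failed. "
    "Total number unresolved extents is 1,141. "
    "(SESSION: 1540, PROCESS: 185, JOB: 1292) |"
)

# ASSUMPTION: the count is written with thousands separators and ends with a period.
PATTERN = re.compile(r"ANR1652E .*?unresolved extents is ([\d,]+)\.")

def unresolved_extents(line):
    """Return the unresolved extent count from an ANR1652E line, or None."""
    match = PATTERN.search(line)
    if match is None:
        return None
    return int(match.group(1).replace(",", ""))

print(unresolved_extents(log_line))
```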