Damaged Extents after upgrade to 8.1.19.000

So we have one stgpool with no damaged extents left.
Ran a new stgrule repl after adding the 005 fix.
The repl job still fails:
10/12/23, 15:01:54 ANR1652E Replication failed. Total number unresolved extents is 1,141. (SESSION: 1540, PROCESS: 185, JOB: 1292)

Whut?!?!? What is failing?
[10-12-2023 15:01:29.206][ FFDC_REPLICATION ]: [3219](nrfs.c:8659)(PROCESS: 185, SESSION: 1540, JOB: 1292) Skipping 2 objects for nodeId 59 fsId 4 first object skipped was 3338049051.

[10-12-2023 15:01:29.213][ FFDC_REPLICATION ]: [3245](nrfs.c:8659)(PROCESS: 185, SESSION: 1540, JOB: 1292) Skipping 1 objects for nodeId 65 fsId 1 first object skipped was 3338049052.

[10-12-2023 15:01:29.276][ FFDC_REPLICATION ]: [3240](nrfs.c:8659)(PROCESS: 185, SESSION: 1540, JOB: 1292) Skipping 2 objects for nodeId 65 fsId 2 first object skipped was 3338049055.


[10-12-2023 15:18:00.555][ FFDC_GENERAL_SERVER_ERROR ]: [4471](smnqr.c:2453)(SESSION: 2255) Integrity error 3010 for object 2834546162.

[10-12-2023 15:18:04.724][ FFDC_GENERAL_SERVER_ERROR ]: [4472](smnqr.c:2453)(SESSION: 2256) Integrity error 3010 for object 2834564634.

[10-12-2023 15:18:04.739][ FFDC_GENERAL_SERVER_ERROR ]: [4472](smnqr.c:2453)(SESSION: 2256) Integrity error 1101 for object 2834564634.

[10-12-2023 15:18:05.413][ FFDC_GENERAL_SERVER_ERROR ]: [4474](smnqr.c:2453)(SESSION: 2258) Integrity error 3010 for object 2834561472.

[10-12-2023 15:18:05.424][ FFDC_GENERAL_SERVER_ERROR ]: [4474](smnqr.c:2453)(SESSION: 2258) Integrity error 1101 for object 2834561472.

[10-12-2023 15:18:05.429][ FFDC_GENERAL_SERVER_ERROR ]: [4474](smnqr.c:2453)(SESSION: 2258) Integrity error 3010 for object 2834561472.


At least now we have something to work with.....
 
You could give this a try:

SELECT node_name FROM nodes where node_id IN (59, 65)

def stgrule repl_failures <target> act=norepl
def subrule node_59 repl_failures <node_name> act=repl
def subrule node_65 repl_failures <node_name> act=repl

Then:
Start stgrule repl_failures forcerec=full

Perhaps you have already run forcereconcile though?
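
If you also want to see which filespaces those skipped fsIds (4, 1 and 2) actually are, a lookup along these lines should work (assuming the FILESPACES view on your level has the FILESPACE_ID column):

/* sketch: map the FFDC nodeId/fsId pairs to node and filespace names */
SELECT node_name, filespace_name, filespace_id FROM filespaces WHERE node_name IN (SELECT node_name FROM nodes WHERE node_id IN (59, 65)) AND filespace_id IN (1, 2, 4)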
 
Yep, already tested, still the same number of extents failing. And this is a stgpool we are migrating to a new server, which never had anything replicated before with the old prot/repl setup or anything.
 
IBM wrote late Friday afternoon:

Solution:
List damaged containers:
Q DAMAGED STGPOOL_NAME TYPE=CONTAINER
Then:
RESET CONTAINERSIZE STGPOOL_NAME (do this when the server isn't too busy!! My recommendation)

Then audit container with action=scanall one by one.
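
For the record, with placeholder pool and container names it looked roughly like this:

/* sketch with made-up names, one audit per container that Q DAMAGED reported */
q damaged DEDUPPOOL1 type=container
reset containersize DEDUPPOOL1
audit container /tsmstg/dir01/00000000000001a3.dcf action=scanall
audit container /tsmstg/dir01/00000000000002c7.dcf action=scanall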

NOW all the damaged extents got reset back!! Finally!!
So right now, we have no damaged extents any more.

But our repl jobs still miss some extents, so that problem remains unsolved. The jobs fail, but they don't hang, and there are no more damaged extents either.
 
Hi,

Good to hear that you got the extents back. It looks like a task that will take some time and consume a lot of disk I/O.

Did you run the audits one by one with wait=yes (as in a script)?
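
I mean something like this in a macro, so each audit finishes before the next one starts (container names below are just examples):

/* audit_damaged.mac - run it from the admin client with: macro audit_damaged.mac */
/* wait=yes keeps each audit in the foreground, so they run strictly one at a time */
audit container /tsmstg/dir01/00000000000001a3.dcf action=scanall wait=yes
audit container /tsmstg/dir02/00000000000004f9.dcf action=scanall wait=yes
audit container /tsmstg/dir03/0000000000000b2e.dcf action=scanall wait=yes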
 
Not that many, so I did it one by one.

Yes, it's I/O intensive. We managed to lock one server completely, all sessions hanging, and we had to kill it and restart.

So maybe this should be run with disable sessions first, and no other activity.
And no move container or audits running at the same time, of course.
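
Next time, roughly this order is what I have in mind (pool and container names are just examples, not tested):

/* quiesce the server before the reset/audit work */
disable sessions client
reset containersize DEDUPPOOL1
audit container /tsmstg/dir01/00000000000001a3.dcf action=scanall wait=yes
/* when the audits are done, open up again */
enable sessions client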
 
We still have a container salad. Tried a VM restore, and BAM, it failed. And again damaged extents show up.
So the only thing that works is the backups. No VM restores, or at least not the ones we have tried to restore.
IBM suggests that until it is all solved we run full VM backups every now and then. :(

We did that for a couple of days, and restore worked. After that, incrementals for a week, and when we test a restore: NOPE. New damaged extents.

We are giving up real soon.

So, 2 options:
1: Set up a new server.
2: Create a new stgpool and point the copygroups to the new stgpool. That will force new full backups I guess (rough sketch of what I mean below).
The problem then is that IF we need an old backup, the domains and mgmtclasses are by then pointing to the new stgpool.

I need some serious advice here: how would you solve this mess?
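
For option 2, what I have in mind is roughly this; the pool, domain and directory names are made up:

/* option 2 sketch: new directory-container pool, repoint the backup copygroups */
define stgpool NEWDEDUP stgtype=directory
define stgpooldirectory NEWDEDUP /newstg/dir01,/newstg/dir02
update copygroup VMDOMAIN STANDARD VMMGMT standard type=backup destination=NEWDEDUP
validate policyset VMDOMAIN STANDARD
activate policyset VMDOMAIN STANDARD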
 
We ran a new reset containersize when no move containers or anything else was running.
Then audit container on the 5 containers that were left with damaged extents.
Now we can both back up and restore. So I guess the case is closed.
 
So we have solved the damaged extents. But it has been weeks since the stgrule repl jobs last ran, during this period of complete mess. We have the 8.1.19.005 fix installed everywhere.
So hanging repl jobs should not be an issue.
What happens when we try to run the jobs that haven't been running for a while?
THEY HANG of course..... Up until 8.1.14, Spectrum Protect had been running with absolutely no issues whatsoever. Now we are drowning in problems. It's sickening.
 
Hi,
I have just recovered several 1000's of bad extents from a local dedup copy pool. Got them all back. It took close to two weeks to scan them all. At the end, 47 containers were in error. Then about 1h30m to repair all extents with two workers.
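
The repair step itself was basically this; the pool name is an example, and srclocation=local assumes the copy sits in a local container-copy pool:

/* pull the damaged extents back from the local container-copy (tape) pool */
repair stgpool DEDUPPOOL2 srclocation=local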
 
It's not fun at all.
We started a new migration from some old stgpooldirs to new ones using move container.
Of course we got a lot of new damaged extents.
So this morning at 05:00 I did a reset containersize on the two stgpools, then ran audit container action=scanall on the ones with damaged extents. And all is fine.
BUT we again have several VM backups with RC=14 due to the damaged extents.

So we are stuck: we can't migrate away from the old storage solution because we get damaged extents.
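
For reference, the migration itself is just move container, one container at a time into the new directories; the paths below are examples and the syntax is from memory:

/* drain an old stgpool directory into a new one */
update stgpooldirectory DEDUPPOOL1 /oldstg/dir01 access=readonly
move container /oldstg/dir01/00000000000001a3.dcf stgpooldirectory=/newstg/dir01
move container /oldstg/dir01/00000000000002c7.dcf stgpooldirectory=/newstg/dir01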
 