Damaged Extents after upgrade to 8.1.19.000

So we have one stgpool with no damaged extents left.
Ran a new stgrule repl after adding the 005 fix.
The repl job still fails:
10/12/23, 15:01:54 ANR1652E Replication failed. Total number unresolved extents is 1,141. (SESSION: 1540, PROCESS: 185, JOB: 1292)

Whut?!?!? What is failing?
[10-12-2023 15:01:29.206][ FFDC_REPLICATION ]: [3219](nrfs.c:8659)(PROCESS: 185, SESSION: 1540, JOB: 1292) Skipping 2 objects for nodeId 59 fsId 4 first object skipped was 3338049051.

[10-12-2023 15:01:29.213][ FFDC_REPLICATION ]: [3245](nrfs.c:8659)(PROCESS: 185, SESSION: 1540, JOB: 1292) Skipping 1 objects for nodeId 65 fsId 1 first object skipped was 3338049052.

[10-12-2023 15:01:29.276][ FFDC_REPLICATION ]: [3240](nrfs.c:8659)(PROCESS: 185, SESSION: 1540, JOB: 1292) Skipping 2 objects for nodeId 65 fsId 2 first object skipped was 3338049055.


[10-12-2023 15:18:00.555][ FFDC_GENERAL_SERVER_ERROR ]: [4471](smnqr.c:2453)(SESSION: 2255) Integrity error 3010 for object 2834546162.

[10-12-2023 15:18:04.724][ FFDC_GENERAL_SERVER_ERROR ]: [4472](smnqr.c:2453)(SESSION: 2256) Integrity error 3010 for object 2834564634.

[10-12-2023 15:18:04.739][ FFDC_GENERAL_SERVER_ERROR ]: [4472](smnqr.c:2453)(SESSION: 2256) Integrity error 1101 for object 2834564634.

[10-12-2023 15:18:05.413][ FFDC_GENERAL_SERVER_ERROR ]: [4474](smnqr.c:2453)(SESSION: 2258) Integrity error 3010 for object 2834561472.

[10-12-2023 15:18:05.424][ FFDC_GENERAL_SERVER_ERROR ]: [4474](smnqr.c:2453)(SESSION: 2258) Integrity error 1101 for object 2834561472.

[10-12-2023 15:18:05.429][ FFDC_GENERAL_SERVER_ERROR ]: [4474](smnqr.c:2453)(SESSION: 2258) Integrity error 3010 for object 2834561472.


At least now we have something to work with.....
 
You could give this a try:

SELECT node_name FROM nodes where node_id IN (59, 65)

def stgrule repl_failures <target> act=norepl
def subrule node_59 repl_failures <node_name> act=repl
def subrule node_65 repl_failures <node_name> act=repl

Then:
Start stgrule repl_failures forcerec=full

Perhaps you have already run forcereconcile though?
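
If you also want to see which filespaces those skipped fsIds (4, 1 and 2) actually are, a lookup along these lines should work (assuming the FILESPACES view on your level has the FILESPACE_ID column):

/* sketch: map the FFDC nodeId/fsId pairs to node and filespace names */
SELECT node_name, filespace_name, filespace_id FROM filespaces WHERE node_name IN (SELECT node_name FROM nodes WHERE node_id IN (59, 65)) AND filespace_id IN (1, 2, 4)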
 
Yep, already tested, still the same number of extents failing. And this is a stgpool we are migrating to a new server, which never had anything replicated before with the old prot/repl setup or anything.
 
IBM wrote late Friday afternoon:

Solution:
List damaged containers:
Q DAMAGED STGPOOL_NAME TYPE=CONTAINER
Then:
RESET CONTAINERSIZE STGPOOL_NAME (do this when the server isn't too busy!! My recommendation)

Then audit container with action=scanall one by one.
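
For the record, with placeholder pool and container names it looked roughly like this:

/* sketch with made-up names, one audit per container that Q DAMAGED reported */
q damaged DEDUPPOOL1 type=container
reset containersize DEDUPPOOL1
audit container /tsmstg/dir01/00000000000001a3.dcf action=scanall
audit container /tsmstg/dir01/00000000000002c7.dcf action=scanall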

NOW all the damaged extents got reset back!! Finally!!
So right now, we have no damaged extents any more.

But our repl jobs still miss some extents, so that problem remains unsolved. The jobs fail, but they don't hang, and there are no more damaged extents either.
 
Hi,

Good to hear that you got the extents back. It looks like a task that will take some time and consume a lot of disk I/O.

Did you run the audits one by one with wait=yes (as in a script)?
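
I mean something like this in a macro, so each audit finishes before the next one starts (container names below are just examples):

/* audit_damaged.mac - run it from the admin client with: macro audit_damaged.mac */
/* wait=yes keeps each audit in the foreground, so they run strictly one at a time */
audit container /tsmstg/dir01/00000000000001a3.dcf action=scanall wait=yes
audit container /tsmstg/dir02/00000000000004f9.dcf action=scanall wait=yes
audit container /tsmstg/dir03/0000000000000b2e.dcf action=scanall wait=yes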
 
Not that many, so I did it one by one.

Yes, it's I/O intensive. We managed to lock one server completely, all sessions hanging, and we had to kill it and restart.

So maybe this should be run with disable sessions first, and no other activity.
And no move container or audits running at the same time, of course.
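
Next time, roughly this order is what I have in mind (pool and container names are just examples, not tested):

/* quiesce the server before the reset/audit work */
disable sessions client
reset containersize DEDUPPOOL1
audit container /tsmstg/dir01/00000000000001a3.dcf action=scanall wait=yes
/* when the audits are done, open up again */
enable sessions client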
 
We still have a container salad. Tried a VM restore, and BAM, it failed. And again damaged extents show up.
So the only thing that works is the backups. No VM restores, or at least not the ones we have tried to restore.
IBM suggests that until it is all solved we run full VM backups every now and then. :(

We did that for a couple of days, and restore worked. After that, incrementals for a week, and when we test a restore: NOPE. New damaged extents.

We are giving up real soon.

So, 2 options:
1: Set up a new server.
2: Create a new stgpool and point the copygroups to the new stgpool. That will force new full backups I guess (rough sketch of what I mean below).
The problem then is that IF we need an old backup, the domains and mgmtclasses are by then pointing to the new stgpool.

I need some serious advice here: how would you solve this mess?
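
For option 2, what I have in mind is roughly this; the pool, domain and directory names are made up:

/* option 2 sketch: new directory-container pool, repoint the backup copygroups */
define stgpool NEWDEDUP stgtype=directory
define stgpooldirectory NEWDEDUP /newstg/dir01,/newstg/dir02
update copygroup VMDOMAIN STANDARD VMMGMT standard type=backup destination=NEWDEDUP
validate policyset VMDOMAIN STANDARD
activate policyset VMDOMAIN STANDARD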
 
We ran a new reset containersize when no move containers or anything else was running.
Then audit container on the 5 containers that were left with damaged extents.
Now we can both back up and restore. So I guess the case is closed.
 
So we have solved the damaged extents. But it has been weeks since the stgrule repl jobs last ran, during this period of complete mess. We have the 8.1.19.005 fix installed everywhere.
So hanging repl jobs should not be an issue.
What happens when we try to run the jobs that haven't been running for a while?
THEY HANG of course..... Up until 8.1.14, Spectrum Protect had been running with absolutely no issues whatsoever. Now we are drowning in problems. It's sickening.
 
Hi,
I have just recovered several 1000's of bad extents from a local dedup copy pool. Got them all back. It took close to two weeks to scan them all. At the end, 47 containers were in error. Then about 1h30m to repair all extents with two workers.
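
The repair step itself was basically this; the pool name is an example, and srclocation=local assumes the copy sits in a local container-copy pool:

/* pull the damaged extents back from the local container-copy (tape) pool */
repair stgpool DEDUPPOOL2 srclocation=local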
 
It's not fun at all.
We started a new migration from some old stgpooldirs to new ones using move container.
Of course we got a lot of new damaged extents.
So this morning at 05:00 I did a reset containersize on the two stgpools, then ran audit container action=scanall on the ones with damaged extents. And all is fine.
BUT we again have several VM backups with RC=14 due to the damaged extents.

So we are stuck: we can't migrate away from the old storage solution because we get damaged extents.
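
For reference, the migration itself is just move container, one container at a time into the new directories; the paths below are examples and the syntax is from memory:

/* drain an old stgpool directory into a new one */
update stgpooldirectory DEDUPPOOL1 /oldstg/dir01 access=readonly
move container /oldstg/dir01/00000000000001a3.dcf stgpooldirectory=/newstg/dir01
move container /oldstg/dir01/00000000000002c7.dcf stgpooldirectory=/newstg/dir01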
 