Damaged Extents after upgrade to 8.1.19.000

uffeg


We upgraded from 8.1.18 to 8.1.19.000 to be able to run storage rule replication without the jobs hanging; before, we could not cancel the hanging processes and had to restart Spectrum Protect to release the stuck jobs/sessions.
On the first storage rule replication job we found one container with damaged extents, and the job of course failed.
Another server has several damaged containers.

At the same time we have two servers where it runs fine.
All servers running protect/replicate run without issues.

Then we set up a completely new replication target server and started a new storage rule replication. Bam, damaged extents.
This server hosts several customers, so the other two are still running protect/replicate to another target server.
It has three storage pools, one per customer. There are no damaged extents in the pools still running the old protect/replicate.
Four containers are damaged in the pool that we run storage rule replication on.
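(For anyone comparing notes, this is roughly the kind of change we made, sketched with placeholder rule, server and credential names; our real definitions differ.)

# old style: per-pool protect plus node replication, driven by admin schedules
dsmadmc -id=admin -password=xxxxx "protect stgpool TSMDIR"
dsmadmc -id=admin -password=xxxxx "replicate node CUSTOMER_NODE"

# new style (8.1.13+): a replication storage rule that the server runs itself
dsmadmc -id=admin -password=xxxxx "define stgrule REPLRULE TARGETSRV actiontype=replicate"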

We have now had three support cases open for two weeks, and it seems tough to solve.

Has anyone else seen this?
Or are we the only ones running 8.1.19.000 with storage rule replication?

/Ulf @Atea Sweden
 

Hi,

Can you share some details from the actlog (and other error messages) about these issues?

No messages during the replication job.
It simply fails the replication of everything.

ANR1652E Replication failed. Total number unresolved extents is 123,641.
Files replicated: 403,749 of 1,025,307. Files updated: 279,770 of 287,955. Files deleted: 189,277 of 189,277. Amount replicated: 1,556 GB of 1,970 GB. Amount transferred: 430 GB. Elapsed time: 0 Days, 0 Hours, 58 Minutes. (SESSION: 182859, PROCESS: 3001, JOB: 77)

I have another server that simply stops replication in the middle of everything.
Here too it found damaged extents in the first replication job after the upgrade to 8.1.19.000.
On that one I have to cancel the job, and of course that does not work either.
The job stays in "terminating", even though that should have been fixed in 8.1.19.000.
We end up having to reboot both the source and the target server.
 


Are there no error entries in the actlog when you search for the job ID and/or the process number?
I guess you have already found IT42584 (which should have been fixed in your version).
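Something along these lines is what I mean (the dsmadmc wrapper is a sketch, and the process/session numbers are just the ones from your ANR1652E summary):

# look for related errors around the failing job, searching on the process and session numbers
dsmadmc -id=admin -password=xxxxx "query actlog begindate=today-1 search=3001"
dsmadmc -id=admin -password=xxxxx "query actlog begindate=today-1 search=182859"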
 

We have seen "I/O error opening file", but at the same time I can copy the file with the damaged extent to another folder at the OS level.

We started running storage rule replication on four servers and got the same issue, damaged extents, on all four of them.

On two servers we solved it by running a new FULL backup on the servers that had damaged extents.
After that the replication jobs work.
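By FULL backup I mean forcing fresh copies of the affected data with a selective backup, roughly like this (the path is a placeholder):

# force new copies of the affected filespaces so replication no longer depends on the damaged extents
dsmc selective "/data/*" -subdir=yes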

But now we have it on a very big server where it affects image backups only; we cannot back up, since the control information lies on the damaged extents. We are running a FULL VM backup of all VMs as I write this.
But I am afraid we will not be able to restore to points earlier than today.
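The VM side of it looks roughly like this (VM, node and option-file names are placeholders):

# force a full VM backup instead of the usual incremental-forever incremental
dsmc backup vm "MYVM01" -mode=IFFull -asnodename=DC_VM_NODE -optfile=dsm_dc.opt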

Has anyone here done a replicate node with repair on a datacenter VM node containing hundreds of VMs?
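What I have in mind is roughly this, but I have not dared to run it yet against a node of that size (node name and credentials are placeholders; check how the repair option behaves on your level first):

# replicate the datacenter VM node and ask the target to send back good copies
# of the extents that are marked damaged on the source
dsmadmc -id=admin -password=xxxxx "replicate node DC_VM_NODE recoverdamaged=yes wait=yes"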
 

We have been running FULL VM backups, and now we have tested incrementals with no issues and no damaged extents anymore. On Monday we will try to restore to points earlier than today's full backups and see what happens.
It will be interesting to see whether the VMs we have 10-year retention sets on are still usable or not.

I have a gut feeling that 8.1.19.000 will be withdrawn, or that we will quite quickly see an 8.1.19.100.
 

Can you run an audit?
Yes, but we get this:
ANR4891I: Audit Container has encountered an I/O error for container /PHYFILE06_NFS/TSMDIR/TSMfile00/06/0f/00000000000f06e1.dcf in container storage pool TSMDIR while attempting to read a data extent.
At the same time we can copy that file, so it exists on disk. We now have the same thing in five different servers, and as soon as we start running STGRULE replication instead of protect/replicate node we get that same error on the source server. Never any issues on the target servers.
8.1.20.000 just arrived, but it does not look like that would help.
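For reference, what we run looks roughly like this (credentials are placeholders; the container path is the one from the message above):

# show what the server thinks is damaged in the pool, then re-scan the container from the ANR4891I message
dsmadmc -id=admin -password=xxxxx "query damaged TSMDIR"
dsmadmc -id=admin -password=xxxxx "audit container /PHYFILE06_NFS/TSMDIR/TSMfile00/06/0f/00000000000f06e1.dcf action=scanall"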
 

Verify the integrity of the disk where the container resides by running disk checks and looking for any hardware issues or file system corruption on the NFS device.

1. Check file permissions:
Ensure that the appropriate file permissions and ownership are set for the container and the directory it resides in.

2. Check disk space:
Ensure that the disk where the container is located has enough free space to accommodate new data extents.

3. Check for OS or hardware issues:
Investigate whether there are any known operating system or hardware issues that could be causing the I/O errors; for example, try moving a file into the NFS directory (see the sketch below).
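On the OS side those checks could look something like this on Linux, assuming the mount point from the ANR4891I message above (adjust paths to your environment):

# free space on the NFS mount
df -h /PHYFILE06_NFS
# ownership and permissions on the container the audit complains about
ls -l /PHYFILE06_NFS/TSMDIR/TSMfile00/06/0f/00000000000f06e1.dcf
# simple write test in the pool directory
touch /PHYFILE06_NFS/TSMDIR/write_test && rm /PHYFILE06_NFS/TSMDIR/write_test
# recent kernel-level I/O or NFS errors
dmesg | grep -iE "i/o error|nfs"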
 