Damaged files in directory-container stgpool - always the same filename

alexp36

Hi,

I'm seeing an odd problem on the TSM server I look after.
The server manages around 250 Windows and Linux clients.

When I run q damaged <STGPOOLNAME> type=node it shows over 200 damaged files, across 50 nodes.
When I run q damaged <STGPOOLNAME> type=inventory it shows something interesting:
- every single one of the 200+ damaged files, across 50 different client nodes, is a file called "LSASRV.MOF"

An example file:
\WINDOWS\WINSXS\AMD64_MICROSOFT-WINDOWS-LSA-MOF_31BF3856AD364E35_10.0.14393.0_NONE_2843A530B7F9FD68\LSASRV.MOF

So - all 200+ damaged files are on Windows nodes of varying OS levels, and every one is the same file, except that the copies sit in a few different (but similar) locations.

Note: This issue appeared fairly soon (a couple of hours) after a repair storagepool command was run.
The repair command was: repair stgpool <STGPOOLNAME> srclocation=replserver, and appeared to be 100% successful.

It repaired 1 damaged file (which was the only damaged file showing in TSM at that time), and queries on the storage pool after the repair initially showed 0 damaged files.

Prior to running the repair, we had run an audit on the storage pool, which took about 4 days, so I don't really want to start another audit now if there is a better option, or some other investigation that can be tried first.

It seems it must be something about the file itself, rather than actual damage in TSM, given there is not a single other type of file being reported as damaged.

Has anyone had a similar problem with Windows files in TSM? Any suggestions on what could be causing this? Anything else I should be looking at?


TIA
 
So - all 200+ damaged files are on Windows nodes of varying OS levels, and every file is the same, with the exception that they are in a few different (but similar) locations.
That's almost normal behaviour if one or more extents are damaged. For example, suppose every Windows machine has Notepad.exe backed up. With deduplication, most of the extents for that file are shared across multiple machines. So if one extent of that file is damaged, every version of that file that references the extent, for every client, will also show as damaged.
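
To make that concrete, here is a rough Python sketch of the idea. It is only an illustration, not how the server tracks extents internally: the node names, paths and extent IDs are made up, and the `damaged_files` helper is hypothetical.

```python
# Hypothetical inventory: each backed-up file is just a list of extent IDs.
# Node names, paths and extent IDs are invented for illustration.
inventory = {
    ("NODE_A", r"\WINDOWS\NOTEPAD.EXE"): ["e1", "e2", "e3"],
    ("NODE_B", r"\WINDOWS\NOTEPAD.EXE"): ["e1", "e2", "e3"],
    ("NODE_C", r"\WINDOWS\NOTEPAD.EXE"): ["e1", "e2", "e4"],  # slightly different build
    ("NODE_A", r"\DATA\REPORT.DOC"): ["e7", "e8"],
}


def damaged_files(inventory, damaged_extents):
    """Return every (node, path) that references at least one damaged extent."""
    bad = set(damaged_extents)
    return sorted(
        key for key, extents in inventory.items() if bad.intersection(extents)
    )


# One damaged extent shows up as a damaged file on every node that shares it.
for node, path in damaged_files(inventory, ["e3"]):
    print(node, path)
```

With only extent "e3" damaged, NODE_A and NODE_B both report NOTEPAD.EXE as damaged, while NODE_C (which references "e4" instead) and the unrelated document are untouched.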
 
Thanks for the reply.

I'm not sure I really understand though - are you saying that in the case of more than one client having an identical file, TSM only keeps 1 copy of that file, and calls it a backup for every client that has a copy of that file?
So, if TSM is aware that the copy it has is damaged, why doesn't it back the file up again from a different client, which has a clean copy of the file?
 
I'm not sure I really understand though - are you saying that in the case of more than one client having an identical file, TSM only keeps 1 copy of that file, and calls it a backup for every client that has a copy of that file?
Yes, that's called deduplication. But it actually works at the extent/chunk level, not at the file level. During a backup, a file is broken into extents/chunks. Each chunk is fingerprinted and checked against what is already stored on the server. If the extent already exists, the server references the existing extent and discards the incoming one. If it doesn't exist, the server stores it and references it.
So, if TSM is aware that the copy it has is damaged, why doesn't it back the file up again from a different client, which has a clean copy of the file?
If it's an active file, normally yes. If it's an inactive file, no.
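
The chunk/fingerprint/reference steps described above can be sketched roughly like this in Python. This is a toy model under my own assumptions, not the product's implementation: the `DedupStore` class, the tiny fixed-size chunks and the choice of SHA-256 as the fingerprint are all simplifications picked for illustration.

```python
import hashlib

CHUNK_SIZE = 4  # assumption: tiny fixed-size chunks, purely to keep the demo readable


class DedupStore:
    """Toy extent store: one stored copy per unique chunk fingerprint."""

    def __init__(self):
        self.extents = {}  # fingerprint -> chunk bytes
        self.files = {}    # file name -> ordered list of fingerprints

    def backup(self, name, data):
        """Store a file and return how many new extents had to be written."""
        refs = []
        new_extents = 0
        for i in range(0, len(data), CHUNK_SIZE):
            chunk = data[i:i + CHUNK_SIZE]
            fingerprint = hashlib.sha256(chunk).hexdigest()
            if fingerprint not in self.extents:
                # Unknown extent: store it.
                self.extents[fingerprint] = chunk
                new_extents += 1
            # Known or not, the file only keeps a reference to the extent.
            refs.append(fingerprint)
        self.files[name] = refs
        return new_extents

    def restore(self, name):
        """Rebuild a file from its extent references."""
        return b"".join(self.extents[fp] for fp in self.files[name])


store = DedupStore()
first = store.backup("NODE_A:LSASRV.MOF", b"AAAABBBBCCCC")
second = store.backup("NODE_B:LSASRV.MOF", b"AAAABBBBDDDD")
print(first, second, len(store.extents))
```

The second backup only writes the one chunk the server has not seen before. Both files still restore in full, but they now depend on the same stored copies of the shared chunks.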
 
Thanks marclant, that makes sense.
So, this file is showing as active, but the status of it doesn't seem to be changing. Further development - all of the "other" damaged files have now disappeared in TSM, with no further action from myself, not even an audit.
Now there is just 1 damaged file, which I have confirmed is the exact same original file that was showing as damaged (same object ID, etc.) prior to the repair that I ran.

The message in the actlog when it re-marked the file as damaged was:
ANR3690W Invalid header found during replication of objects in pool <POOLNAME>. These objects have been marked damaged.

I've got an IBM case open about this too, btw. Their suggestion at this point has been to run an audit again on the container with "scanall", which I've done, and also to try running the repair again.
I've asked them for more info, because I already did the repair, and it hasn't achieved anything. Not a permanent fix, anyway.

Sorry about my somewhat slow replies in here. As you've probably guessed, I'm in a different timezone (southern hemisphere). And TSM is only one of many duties I have keeping me occupied :).
 