Request for understanding of the "At Risk" category for TSM nodes

channdeep

Joined
Apr 5, 2018
Dear Community,

This is my first post, and sorry to bother you with a long one! I am pretty much a newbie in TSM, so kindly forgive any obvious mistakes or misunderstandings. I mostly use the OC (Operations Center) GUI for administering and monitoring TSM (I use the CLI for some basic tasks only).

My objective is for the TSM OC (and the Daily Protection Report) to show all TSM nodes in green, with zero "At Risk" nodes. I have already worked with the respective server owners on a few nodes that needed complete decommissioning or some OPT file reconfiguration, and those have been sorted out. I am still left with a few nodes that fall into the "At Risk" category, some intermittently and some permanently.

The challenge is that TSM seems somewhat inconsistent in deciding when a node is "At Risk". I say this because, for the same error or warning message, TSM shows some nodes as green and others as "At Risk". I have watched the server logs in the OC for a few days in a row but could not reach a conclusion, so I am requesting your help.

Earlier, my understanding was that this was due to open files on the node, which TSM would treat as "At Risk" because it cannot copy them (unsaved changes etc.), but this understanding seems wrong, or rather I am still confused. Below are some of my notes and current observations:

-------
a. “At Risk” category nodes are different than “Warning” category nodes.
b. It is not correct to assume that every “Warning” puts a node in the “At Risk” category.
c. To reproduce the message “the object is in use by another process”, I intentionally kept a file open with unsaved changes on SERVER_1 (“Testing_TSM_Open_File_Copy.txt”); please find the log line below. The log still shows the file as [Sent], even though the “Open File support” option was not configured, and the node still appears green in the report.
i. 05/04/2018 21:12:32 Normal File--> 26 \\SERVER_1\e$\Backups\Testing_TSM_Open_File_Copy.txt [Sent]
d. I also noticed today that a few other Windows servers (SERVER_2 and SERVER_3) have the same warning message in the server logs, yet both appear green in the Daily Protection Report/OC. So it looks like we should not spend too much effort on this specific message, “the object is in use by another process”. What do you think?
e. On a given day, one node may have both 1) an “At Risk” error and 2) a “Warning” message; however, it falls in the “At Risk” category only because of the “At Risk” error.
f. We now believe we only need to focus on the real errors that put a node in the “At Risk” category and spoil our daily report. Over the past few days, these have generally been:
i. file not found
ii. Object changed during processing. Object skipped.
iii. file is temporarily unavailable
iv. Object name '/backup1//characterizations/20117 - PSC-1pct Pt-TiSi-H2Z2-048811, tørret prøve.pdf' contains one or more unrecognized characters and is not valid.
v. Node not communicating with TSM at all.
-------

Please find attached a screenshot showing how I see the server logs view.

I then did some more research, grouping the nodes by error as below (server names are changed, but I have kept a copy of the actual names for correlation later):

------
All nodes with errors in the server logs:

1) ANE4037E: Object changed during processing. Object skipped.
Server_C
Server_E
2) ANE4008E: file is temporarily unavailable
Server_F
3) ANE4987E: the object is in use by another process
Server_G
Server_D
4) ANE4005E: file not found
Server_A
5) ANE4042E: file name contains unrecognized characters and is not valid
Server_B

Now, the nodes shown as "At Risk":-

Server_A
Server_B
Server_C
Server_D

So I am confused again:
1) If Server_C appears as "At Risk" for error ANE4037E, why does Server_E not appear the same (or vice versa)?
2) If Server_D appears as "At Risk" for error ANE4987E, why does Server_G not appear the same (or vice versa)?
3) Why is Server_F not shown as "At Risk"?

Because of this, I am unable to settle on a rule of the form: OK channdeep, treat these error codes as "At Risk" and those error codes as NOT "At Risk".
------
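
To make the mismatch explicit, the two lists above can be cross-tabulated. The following is a minimal Python sketch, not anything TSM provides; the function name `classify` and the labels "always", "never" and "mixed" are assumptions chosen purely for illustration.

```python
# Node lists copied from the post above (anonymised server names).
errors_by_code = {
    "ANE4037E": ["Server_C", "Server_E"],
    "ANE4008E": ["Server_F"],
    "ANE4987E": ["Server_G", "Server_D"],
    "ANE4005E": ["Server_A"],
    "ANE4042E": ["Server_B"],
}
at_risk = {"Server_A", "Server_B", "Server_C", "Server_D"}


def classify(code_nodes, at_risk_nodes):
    """Label each error code by how many of its nodes are shown as At Risk."""
    result = {}
    for code, nodes in code_nodes.items():
        flagged = [n for n in nodes if n in at_risk_nodes]
        if len(flagged) == len(nodes):
            result[code] = "always"   # every node with this code is At Risk
        elif not flagged:
            result[code] = "never"    # no node with this code is At Risk
        else:
            result[code] = "mixed"    # same code, different outcomes
    return result


summary = classify(errors_by_code, at_risk)
for code, label in summary.items():
    print(code, label)
```

On this data, ANE4037E and ANE4987E come out as "mixed", which suggests that the message code alone does not decide whether a node is "At Risk".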

Many thanks in advance for any comments/ guidance.

Best regards,
channdeep.
 

Attachments

  • TSM_ADSM.jpg
Hi and welcome!

By default, I think the OC treats warnings and skipped objects as 'at risk'. You can turn this behavior off under the settings in the upper right-hand corner. It is up to you and your team to decide whether those warnings/skipped files are acceptable for the sake of going all green. With the setting off, the OC basically accepts return codes 4 and 8 as OK. You will still see them as 'Yellow' in the OC overview for client nodes, or at least I do on 7.1.7.0.
Return code 12 will still display as at risk, since it generally indicates a failure of some sort (VSS can help for Windows nodes).
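
As a rough illustration of the rule described above, here is a small Python sketch. It is only my reading of the behavior, not the OC's actual logic; the function name `is_at_risk` and the flag `tolerate_warnings` are hypothetical stand-ins for the OC setting.

```python
def is_at_risk(return_code: int, tolerate_warnings: bool) -> bool:
    """Sketch of the at-risk decision as described in this reply.

    return_code       -- client schedule return code (0, 4, 8 or 12)
    tolerate_warnings -- hypothetical flag standing in for the OC setting
                         that stops warnings/skipped files counting as at risk
    """
    if return_code == 0:
        return False                  # clean run, never at risk
    if return_code in (4, 8):
        return not tolerate_warnings  # skipped files / warnings: depends on the setting
    return True                       # 12 (or anything else): treated as a failure


for rc in (0, 4, 8, 12):
    print(rc, is_at_risk(rc, tolerate_warnings=False), is_at_risk(rc, tolerate_warnings=True))
```

Under this reading, only the treatment of return codes 4 and 8 changes with the setting; 0 is always fine and 12 is always at risk.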

a. “At Risk” category nodes are different than “Warning” category nodes.
b. It is not correct to assume that every “Warning” puts a node in the “At Risk” category.

See above about settings in the OC.

c. To reproduce the message “the object is in use by another process”, I intentionally kept a file open with unsaved changes on SERVER_1 (“Testing_TSM_Open_File_Copy.txt”); please find the log line below. The log still shows the file as [Sent], even though the “Open File support” option was not configured, and the node still appears green in the report.
i. 05/04/2018 21:12:32 Normal File--> 26 \\SERVER_1\e$\Backups\Testing_TSM_Open_File_Copy.txt [Sent]
I've never had luck testing with text files. With Word or Excel files I can generally trigger an open file warning, and an active database will trigger one as well.

Beyond that, for the various servers you posted, it really depends on what's going on with those systems at the time of backup.
Object in use - generally TSM will do a retry, so it may still kick out a warning but successfully back the file up. The default is 4 retries, I think (the CHANGINGRETRIES client option).
Changed during processing is a painful one if you don't use open file support. Log files that are written to constantly, or other files of that nature, are my pain points, but again open file support helps here.

ANE4008E: file is temporarily unavailable -- No idea on that one. I've yet to see that in my environment. It almost sounds like a disk or remotely mounted filesystem causing issues.
ANE4042E: file name contains unrecognized characters and is not valid -- This almost sounds like the language/locale variable isn't set properly. By default the client/scheduler service(?) uses the default language provided by the OS, in both Windows and *nix environments. All of my servers are en_US, so I don't have much experience in that area, sorry.

I know I skipped about and perhaps over some things, but the above should at least get you pointed in the right direction.

In addition to the server logs, the dsmsched.log on the client that is reporting warnings/failures is something else you will want to look at. While a lot of info gets reported back to the server, the client log will tell you whether it retried, plus other good info.
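
If you want a quick tally of which messages a client logged, something like the Python sketch below could be a starting point. It is a sketch under assumptions: the helper name `count_message_ids` is made up, the regular expression simply looks for tokens shaped like ANS/ANE message IDs, and the sample lines (apart from the [Sent] line quoted earlier in this thread) are invented for illustration, so the exact layout of a real dsmsched.log may differ.

```python
import re
from collections import Counter

# Tokens shaped like client/server message IDs, e.g. ANS4037E or ANE4987E.
MESSAGE_ID = re.compile(r"\bAN[SE]\d{4}[EWIS]\b")


def count_message_ids(lines):
    """Count how often each message ID appears in the given log lines."""
    counts = Counter()
    for line in lines:
        counts.update(MESSAGE_ID.findall(line))
    return counts


# Sample lines: the first is from this thread, the rest are invented examples.
sample = [
    r"05/04/2018 21:12:32 Normal File--> 26 \\SERVER_1\e$\Backups\Testing_TSM_Open_File_Copy.txt [Sent]",
    r"05/04/2018 21:12:40 ANS4037E Object 'C:\logs\app.log' changed during processing. Object skipped.",
    r"05/04/2018 21:12:41 ANS4987E Error processing 'C:\data\book.xlsx': the object is in use by another process",
    r"05/04/2018 21:12:45 ANS4037E Object 'C:\logs\web.log' changed during processing. Object skipped.",
]

counts = count_message_ids(sample)
for message_id, n in counts.most_common():
    print(message_id, n)
```

To run it against a real log, you would pass the lines of the file instead of `sample`, for example via `open(path, errors="replace")` with the path of the client's dsmsched.log.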

Hope it helps!
 