Protect and Replicate won't run

illllm

A few days ago we had emergency maintenance on our storage array for an unrelated issue, so we had to halt TSM. We stopped all running replications and issued the halt command. After 1 hour and 30 minutes the dsmserv service was still running, so we had to kill it. Now the protect stage is not moving any data. We tried the forcereconcile option, but it did not help. Any suggestions or experience with this would be a great help. IBM support has not been much help: they work 8 to 5:30 on weekdays only, and every time I upload logs they take a long time to respond. So it's one response a day while our data to replicate piles up at 100 TB a day. Has anyone had similar issues with replication?
 
Hi,
Not a lot to work with. What version are you running? Is there any output in the actlog that can give us a clue? Any entries in the dsmserv.err file?

There are a lot of bug fixes available. But if you are working with IBM, you should not muddy the waters by upgrading your system to a newer release.
 
ANR0985I Process 1600 for Replicate Node ( As Secondary ) running in the BACKGROUND completed with completion state FAILURE.

This is the only message.

TSM 8.1.1
 
What about the activity log on the source server? What do you get in the log when you run a protect command? Do you still have server to server communication? (try running a command remotely)
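If you want a quick check of the network path itself before testing from inside the server, a plain TCP connect from the source host to the target's server port is one option. This is only a rough sketch: the host name below is a placeholder, and port 1500 is an assumption (use whatever TCPPORT your target actually listens on). A successful connect only shows the port is reachable; it says nothing about server-to-server authentication.

```python
import socket


def can_connect(host, port, timeout=5.0):
    """Return True if a TCP connection to host:port can be opened."""
    try:
        # create_connection resolves the name and performs the TCP handshake
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers refused connections, timeouts, and name resolution failures
        return False


# Example call (hypothetical host; port 1500 is an assumption):
# print(can_connect("target-server.example.com", 1500))
```

If this returns False from the source but True from another machine, that points at the network between the two servers rather than at the replication process.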

Replication doesn't seem to write much useful information to the log, and I've had more success working through the Operations Center. Does that give any more information?
 
That is from the source. The destination logs have nothing in them. All other replications run fine. I suspect there are a few corrupt containers and TSM does not know how to handle them.
 
ANR8213E Socket 20 aborted due to send error; error 32

This is the error in the source logs.
 
ANR8213E Socket 20 aborted due to send error; error 32
ANR8213E (Linux): Socket <socket identifier> aborted due to send error; error <error code>.

Explanation
The session between the server and the specified client system experienced a fatal error sending data.
System action
The session with the remote system is ended.
User response
Ensure that the specified remote system is operational and is properly configured to run TCP/IP.

Error 32 means "broken pipe".

Sounds like a networking problem between the source and target.
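For anyone who wants to confirm what the numeric code in that message maps to, here is a small sketch using Python's standard errno table. It assumes you run it on the same kind of platform as the server (Linux here), since errno numbers are platform specific. The function name is just for illustration.

```python
import errno
import os


def describe_errno(code):
    """Return (symbolic name, OS description) for a numeric errno value."""
    name = errno.errorcode.get(code, "UNKNOWN")
    return name, os.strerror(code)


# On Linux this prints: ('EPIPE', 'Broken pipe')
print(describe_errno(32))
```

EPIPE on a send means the other end of the connection was already closed or reset, which fits a connection being dropped somewhere between the source and the target.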
 
Mar 25, 2018, 10:51:21 AM ANR0986I Process 629 for Replicate Node running in the FOREGROUND processed 426,531 items for a total of 2,065,542,681,043 bytes with a completion state of FAILURE at 10:51:21 AM. (SESSION: 173076, PROCESS: 629)
Mar 25, 2018, 10:51:21 AM ANR1893E Process 629 for Replicate Node completed with a completion state of FAILURE. (SESSION: 173076, PROCESS: 629)
 
What you captured is the final status of the process. The cause of the failure is somewhere above that in the activity log. Look for any errors for PROCESS: 629 prior to Mar 25, 2018, 10:51:21 AM.
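If the activity log is large, one way to narrow it down is to filter an exported copy. The sketch below assumes the log has been saved as plain text with one message per line, in the same layout as the lines quoted above (timestamp, message number, text ending in a PROCESS tag). The function name and the choice to keep only E, W and S severities are my own assumptions, not anything built into TSM.

```python
import re
from datetime import datetime

# Layout assumed from the quoted lines: "Mar 25, 2018, 10:51:21 AM ANR0986I text..."
LINE_RE = re.compile(
    r"^(?P<ts>\w{3} \d{1,2}, \d{4}, \d{1,2}:\d{2}:\d{2} [AP]M) "
    r"(?P<msg>ANR\d{4}(?P<sev>[A-Z])) (?P<text>.*)$"
)
TS_FORMAT = "%b %d, %Y, %I:%M:%S %p"


def errors_before(lines, process, cutoff):
    """Return (timestamp, message id, text) for warning/error messages that
    mention the given process and are logged strictly before the cutoff."""
    cutoff_ts = datetime.strptime(cutoff, TS_FORMAT)
    tag = re.compile(rf"PROCESS: {process}\b")
    hits = []
    for line in lines:
        m = LINE_RE.match(line.strip())
        if m is None:
            continue  # skip lines that do not follow the assumed layout
        if m["sev"] not in ("E", "W", "S"):
            continue  # skip informational messages
        if not tag.search(m["text"]):
            continue  # skip messages from other processes
        ts = datetime.strptime(m["ts"], TS_FORMAT)
        if ts < cutoff_ts:
            hits.append((ts, m["msg"], m["text"]))
    return hits


# Example call (hypothetical file name):
# with open("actlog.txt") as f:
#     for hit in errors_before(f, 629, "Mar 25, 2018, 10:51:21 AM"):
#         print(hit)
```

Note the comparison is strict, so errors logged in the same second as the final status are left out; change `<` to `<=` if you want those as well.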
 
Use the Operations Center to find the nodes it has had problems with. (I'm not sure which version you need for this; it works for me on 8.1.1.)

From the menu, go to Storage, then Replication. Select the line with the failing Source/Target pair and click Details. The details screen shows the failed jobs; click one to see the nodes that are failing.
 
It's failing on only one node. Protect works fine. Replicate just hangs and does nothing. IBM support is also stumped, as the logs do not show anything.
 
It's semantics, but if replication fails, then it doesn't hang. By definition, a hang never completes; you have to kill it.

If you already have IBM engaged, that's likely your best course of action at this point.
 
That's exactly what happens. Replication starts and then for 24 hours it just sits there doing nothing. No network throughput, no active threads, and the sessions are inactive for the same amount of time.
 