
replication failure

Discussion in 'TSM Operation' started by Lars-Owe, Apr 17, 2017.

  1. Lars-Owe

    Lars-Owe ADSM.ORG Member

    Joined:
    Jan 19, 2015
    Messages:
    18
    Likes Received:
    0
    Occupation:
    Sys.admin.
    Location:
    Uppsala
    Hi!

    A couple of our nodes are experiencing replication failures. query replication for the affected file systems shows:
    ...
    Backup Files Not Replicated Due To Errors: 1
    ...

    What is the typical action here? The client is a Windows machine; the culprit file space is c$, with 35,307 files. Both the source and target backup servers run Spectrum Protect 7.1.7.0. Running the replication process again gives consistent results, but not much help:

    2017-04-17 21.14.51 ANR0984I Process 312 for Replicate Node started in the
    FOREGROUND at 21:14:51. (SESSION: 75622, PROCESS: 312)
    2017-04-17 21.14.51 ANR2110I REPLICATE NODE started as process 312. (SESSION:
    75622, PROCESS: 312)
    2017-04-17 21.14.51 ANR0408I Session 75642 started for server TSM5 (AIX)
    (Tcp/Ip) for replication. (SESSION: 75622, PROCESS: 312)
    2017-04-17 21.14.52 ANR0408I Session 75643 started for server TSM5 (AIX)
    (Tcp/Ip) for replication. (SESSION: 75622, PROCESS: 312)
    2017-04-17 21.14.52 ANR0408I Session 75644 started for server TSM5 (AIX)
    (Tcp/Ip) for replication. (SESSION: 75622, PROCESS: 312)
    2017-04-17 21.14.52 ANR3192I Replicate Node: Proxy agent nodes replicated: 0
    of 0 identified. Associated authorized nodes replicated:
    0 of 0 identified. Client option sets replicated: 0 of 0
    identified. (SESSION: 75622, PROCESS: 312)
    2017-04-17 21.14.52 ANR0327I Replication of node SCAR008A.MEB.KI.SE
    completed. Files current: 37,184. Files replicated: 0 of
    1. Files updated: 0 of 0. Files deleted: 0 of 0. Amount
    replicated: 0 bytes of 0 bytes. Amount transferred: 0
    bytes. Elapsed time: 0 Days, 0 Hours, 1 Minutes.
    (SESSION: 75622, PROCESS: 312)
    2017-04-17 21.14.52 ANR0987I Process 312 for Replicate Node running in the
    FOREGROUND processed 37,184 items with a completion
    state of FAILURE at 21:14:52. (SESSION: 75622, PROCESS:
    312)
    2017-04-17 21.14.52 ANR1893E Process 312 for Replicate Node completed with a
    completion state of FAILURE. (SESSION: 75622, PROCESS:
    312)
    2017-04-17 21.16.09 ANR2017I Administrator LARS-OWE issued command: QUERY
    ACTLOG search='Process: 312' (SESSION: 75622)
     
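    For reference, a more detailed view of the failing file space can be pulled like this (a sketch only; the admin credentials are placeholders, the node and file space names are the ones from the log above):

    dsmadmc -id=admin -password=xxx "query replication SCAR008A.MEB.KI.SE c\$ format=detailed"
    dsmadmc -id=admin -password=xxx "query actlog begindate=today search=ANR1893E"

    The detailed output includes the per-filespace counters for files not replicated due to errors, which at least confirms whether the same object fails on every run.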
  3. inthesun

    inthesun ADSM.ORG Member

    Joined:
    Oct 15, 2014
    Messages:
    18
    Likes Received:
    2
    Location:
    Tucson
    Hi,

    From your update, you state more than one node is failing node replication. Are they being processed as a group, or one at a time? The log above looks like the process failed due to an error while a group of nodes was being replicated. There may have been a communication failure to the target for one of the other nodes, and these messages may just be reporting that this node sent no data because the overall process failed.

    If these nodes are in a container pool, are your Protect STGpool commands processing successfully before you do the Node Replication?

    If you are unable to find the original failure, then you may want to open a ticket with IBM and have them review the full Actlogs, from the source and target systems, during the full time Node Replication is running.
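    For reference, the expected ordering can be sketched like this (the pool and node group names are placeholders):

    protect stgpool contpool
    replicate node windows_group

    PROTECT STGPOOL replicates the container pool's data extents first, so the subsequent REPLICATE NODE has mostly metadata left to send.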
     
  4. Lars-Owe

    Lars-Owe ADSM.ORG Member

    Protect stgpool is running successfully. We've removed the two troublesome nodes from the node groups being replicated. The log extract above comes from replicating a single file space (c$) on one of the two affected nodes.
     
  5. marclant

    marclant ADSM.ORG Moderator

    Joined:
    Jun 16, 2006
    Messages:
    2,533
    Likes Received:
    354
    Occupation:
    Accelerated Value Specialist for Spectrum Protect
    Location:
    Canada
    Also check the activity log of the target server; the failure can originate on the target just as easily as on the source.

    The ffdc.log in the instance directory may also have additional information; again, check both the source and the target.
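    A quick sketch for checking the ffdc.log (the instance directory /home/tsminst1 is an assumption; substitute your own path):

    tail -n 200 /home/tsminst1/ffdc.log | grep FFDC

    Run the same on the target instance, around the time of the failed process.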
     
  6. Lars-Owe

    Lars-Owe ADSM.ORG Member

    There's nothing spectacular going on at the target server:

    2017-04-18 20.36.11 ANR0408I Session 580642 started for server TSM4 (AIX)
    (Tcp/Ip) for replication. (SESSION: 580642)
    2017-04-18 20.36.11 ANR0950I Session 580636 for node VM_ITS_IT-DCN01 is using
    inline server data deduplication or inline compression.
    (SESSION: 580636)
    2017-04-18 20.36.11 ANR0984I Process 1679 for Replicate Node ( As Secondary )
    started in the BACKGROUND at 20:36:11. (SESSION: 580642,
    PROCESS: 1679)
    2017-04-18 20.36.11 ANR2110I Replicate Node ( As Secondary ) started as
    process 1679. (SESSION: 580642, PROCESS: 1679)
    2017-04-18 20.36.11 ANR2071I Administrator SCAR008A.MEB.KI.SE updated.
    (SESSION: 580642, PROCESS: 1679)
    2017-04-18 20.36.11 ANR0408I Session 580643 started for server TSM4 (AIX)
    (Tcp/Ip) for replication. (SESSION: 580643)
    2017-04-18 20.36.11 ANR0408I Session 580644 started for server TSM4 (AIX)
    (Tcp/Ip) for replication. (SESSION: 580644)
    2017-04-18 20.36.12 ANR0950I Session 580638 for node VM_ITS_IT-DCN01 is using
    inline server data deduplication or inline compression.
    (SESSION: 580638)
    2017-04-18 20.36.13 ANR0409I Session 580642 ended for server TSM4 (AIX).
    (SESSION: 580642, PROCESS: 1679)
    2017-04-18 20.36.13 ANR0409I Session 580644 ended for server TSM4 (AIX).
    (SESSION: 580644)
    2017-04-18 20.36.13 ANR0409I Session 580643 ended for server TSM4 (AIX).
    (SESSION: 580643)

    I tried a protect stg contpool forcereconcile=yes, and it too ran successfully.

    The ffdc logs are primarily made up of:
    [04-18-2017 06:01:37.473][ FFDC_GENERAL_SERVER_ERROR ]: (sddelete.c:2112) Unable to delete non-dedup chunkId -5537312171585819799

    According to an APAR I found, this message is harmless and can be ignored. The ffdc.log also contained:

    [04-18-2017 08:06:55.462][ FFDC_GENERAL_SERVER_ERROR ]: (imdmgr.c:3700) Column 14 in table Archive.Objects is NULL.~
    [04-18-2017 08:07:24.744][ FFDC_GENERAL_SERVER_ERROR ]: (imdmgr.c:3700) Column 14 in table Archive.Objects is NULL.~

    The node I replicated has no archive data, only backup.
     
  7. inthesun

    inthesun ADSM.ORG Member

    The next thing you can do on the source server, to see if it reports a problem, is run AUDIT CONTAINER STGPOOL=<pool_name> ACTION=SCANALL. Here is the link to the full command documentation:
    https://www.ibm.com/support/knowledgecenter/en/SSEQVQ_8.1.0/srv.reference/r_cmd_container_audit.html

    If that finds nothing, then since the logs you have provided do not show what the actual failure is -- such as a damaged or orphaned extent -- you should get IBM support to look deeper. They can take traces of the node replication process, which fails within a minute.

    I hope this is helpful.
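    A sketch of that audit, plus a follow-up check for damaged extents (the pool name is a placeholder):

    audit container stgpool=<pool_name> action=scanall
    query damaged <pool_name> type=node

    QUERY DAMAGED should then list, per node, anything the audit marked as damaged.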
     
