
replication failure

Discussion in 'TSM Operation' started by Lars-Owe, Apr 17, 2017.

  1. Lars-Owe

    Lars-Owe ADSM.ORG Member

    Joined:
    Jan 19, 2015
    Messages:
    18
    Likes Received:
    0
    Occupation:
    Sys.admin.
    Location:
    Uppsala
    Hi!

    A couple of our nodes are experiencing replication failures. query replication for the affected file systems shows:
    ...
    Backup Files Not Replicated Due To Errors: 1
    ...

    What is the typical action here? The client is a Windows machine; the culprit file space is c$, with 35,307 files. Both the source and target backup servers run Spectrum Protect 7.1.7.0. Running the replication process again gives consistent results, but not much help:

    2017-04-17 21.14.51 ANR0984I Process 312 for Replicate Node started in the
    FOREGROUND at 21:14:51. (SESSION: 75622, PROCESS: 312)
    2017-04-17 21.14.51 ANR2110I REPLICATE NODE started as process 312. (SESSION:
    75622, PROCESS: 312)
    2017-04-17 21.14.51 ANR0408I Session 75642 started for server TSM5 (AIX)
    (Tcp/Ip) for replication. (SESSION: 75622, PROCESS: 312)
    2017-04-17 21.14.52 ANR0408I Session 75643 started for server TSM5 (AIX)
    (Tcp/Ip) for replication. (SESSION: 75622, PROCESS: 312)
    2017-04-17 21.14.52 ANR0408I Session 75644 started for server TSM5 (AIX)
    (Tcp/Ip) for replication. (SESSION: 75622, PROCESS: 312)
    2017-04-17 21.14.52 ANR3192I Replicate Node: Proxy agent nodes replicated: 0
    of 0 identified. Associated authorized nodes replicated:
    0 of 0 identified. Client option sets replicated: 0 of 0
    identified. (SESSION: 75622, PROCESS: 312)
    2017-04-17 21.14.52 ANR0327I Replication of node SCAR008A.MEB.KI.SE
    completed. Files current: 37,184. Files replicated: 0 of
    1. Files updated: 0 of 0. Files deleted: 0 of 0. Amount
    replicated: 0 bytes of 0 bytes. Amount transferred: 0
    bytes. Elapsed time: 0 Days, 0 Hours, 1 Minutes.
    (SESSION: 75622, PROCESS: 312)
    2017-04-17 21.14.52 ANR0987I Process 312 for Replicate Node running in the
    FOREGROUND processed 37,184 items with a completion
    state of FAILURE at 21:14:52. (SESSION: 75622, PROCESS:
    312)
    2017-04-17 21.14.52 ANR1893E Process 312 for Replicate Node completed with a
    completion state of FAILURE. (SESSION: 75622, PROCESS:
    312)
    2017-04-17 21.16.09 ANR2017I Administrator LARS-OWE issued command: QUERY
    ACTLOG search='Process: 312' (SESSION: 75622)
     
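    For reference, a more detailed view of the failing file space can be pulled like this (a sketch only; the admin credentials are placeholders, the node and file space names are the ones from the log above):

    dsmadmc -id=admin -password=xxx "query replication SCAR008A.MEB.KI.SE c\$ format=detailed"
    dsmadmc -id=admin -password=xxx "query actlog begindate=today search=ANR1893E"

    The detailed output includes the per-filespace counters for files not replicated due to errors, which at least confirms whether the same object fails on every run.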
  3. inthesun

    inthesun ADSM.ORG Member

    Joined:
    Oct 15, 2014
    Messages:
    18
    Likes Received:
    2
    Location:
    Tucson
    Hi,

    From your update, you state more than one node is failing node replication. Are they being processed as a group, or one at a time? The log above looks like the process failed due to an error while a group of nodes was being replicated. There may have been a communication failure to the target for one of the other nodes, and these messages may just be reporting that this node sent no data because the overall process failed.

    If these nodes are in a container pool, are your Protect STGpool commands processing successfully before you do the Node Replication?

    If you are unable to find the original failure, then you may want to open a ticket with IBM and have them review the full Actlogs, from the source and target systems, during the full time Node Replication is running.
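    For reference, the expected ordering can be sketched like this (the pool and node group names are placeholders):

    protect stgpool contpool
    replicate node windows_group

    PROTECT STGPOOL replicates the container pool's data extents first, so the subsequent REPLICATE NODE has mostly metadata left to send.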
     
  4. Lars-Owe

    Lars-Owe ADSM.ORG Member

    Protect stgpool is running successfully. We've removed the two troublesome nodes from the node groups being replicated. The log extract above comes from replicating a single file space (c$) on one of the two affected nodes.
     
  5. marclant

    marclant ADSM.ORG Moderator

    Joined:
    Jun 16, 2006
    Messages:
    2,533
    Likes Received:
    354
    Occupation:
    Accelerated Value Specialist for Spectrum Protect
    Location:
    Canada
    Also check the activity log of the target server; the failure can originate on the target just as easily as on the source.

    The ffdc.log in the instance directory may also have additional information; again, check both the source and the target.
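    A quick sketch for checking the ffdc.log (the instance directory /home/tsminst1 is an assumption; substitute your own path):

    tail -n 200 /home/tsminst1/ffdc.log | grep FFDC

    Run the same on the target instance, around the time of the failed process.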
     
  6. Lars-Owe

    Lars-Owe ADSM.ORG Member

    There's nothing spectacular going on at the target server:

    2017-04-18 20.36.11 ANR0408I Session 580642 started for server TSM4 (AIX)
    (Tcp/Ip) for replication. (SESSION: 580642)
    2017-04-18 20.36.11 ANR0950I Session 580636 for node VM_ITS_IT-DCN01 is using
    inline server data deduplication or inline compression.
    (SESSION: 580636)
    2017-04-18 20.36.11 ANR0984I Process 1679 for Replicate Node ( As Secondary )
    started in the BACKGROUND at 20:36:11. (SESSION: 580642,
    PROCESS: 1679)
    2017-04-18 20.36.11 ANR2110I Replicate Node ( As Secondary ) started as
    process 1679. (SESSION: 580642, PROCESS: 1679)
    2017-04-18 20.36.11 ANR2071I Administrator SCAR008A.MEB.KI.SE updated.
    (SESSION: 580642, PROCESS: 1679)
    2017-04-18 20.36.11 ANR0408I Session 580643 started for server TSM4 (AIX)
    (Tcp/Ip) for replication. (SESSION: 580643)
    2017-04-18 20.36.11 ANR0408I Session 580644 started for server TSM4 (AIX)
    (Tcp/Ip) for replication. (SESSION: 580644)
    2017-04-18 20.36.12 ANR0950I Session 580638 for node VM_ITS_IT-DCN01 is using
    inline server data deduplication or inline compression.
    (SESSION: 580638)
    2017-04-18 20.36.13 ANR0409I Session 580642 ended for server TSM4 (AIX).
    (SESSION: 580642, PROCESS: 1679)
    2017-04-18 20.36.13 ANR0409I Session 580644 ended for server TSM4 (AIX).
    (SESSION: 580644)
    2017-04-18 20.36.13 ANR0409I Session 580643 ended for server TSM4 (AIX).
    (SESSION: 580643)

    I tried a protect stg contpool forcereconcile=yes, and it too ran successfully.

    The ffdc logs are primarily made up of:
    [04-18-2017 06:01:37.473][ FFDC_GENERAL_SERVER_ERROR ]: (sddelete.c:2112) Unable to delete non-dedup chunkId -5537312171585819799

    According to an APAR I found, this message is harmless and can be ignored. The ffdc.log also contained:

    [04-18-2017 08:06:55.462][ FFDC_GENERAL_SERVER_ERROR ]: (imdmgr.c:3700) Column 14 in table Archive.Objects is NULL.~
    [04-18-2017 08:07:24.744][ FFDC_GENERAL_SERVER_ERROR ]: (imdmgr.c:3700) Column 14 in table Archive.Objects is NULL.~

    The node I replicated has no archive data, only backup.
     
  7. inthesun

    inthesun ADSM.ORG Member

    The next thing you can do on the source server, to see if it reports a problem, is run AUDIT CONTAINER STGPOOL=<pool_name> ACTION=SCANALL. Here is the link to the full command documentation:
    https://www.ibm.com/support/knowledgecenter/en/SSEQVQ_8.1.0/srv.reference/r_cmd_container_audit.html

    If that finds nothing, then since the logs you have provided do not show what the actual failure is -- such as a damaged or orphaned extent -- you should get IBM support to look deeper. They can take traces of the node replication process, which fails within a minute.

    I hope this is helpful.
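    A sketch of that audit, plus a follow-up check for damaged extents (the pool name is a placeholder):

    audit container stgpool=<pool_name> action=scanall
    query damaged <pool_name> type=node

    QUERY DAMAGED should then list, per node, anything the audit marked as damaged.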
     
