Reset node replication

Rigido

Hi,
we moved an old "replication only" configuration to a "protect stgpool + replication" one. The problem is that the 700TB destination directory-container pool is 98% full, while the source directory-container pool holds just 330TB.
We moved from one configuration to the other simply by preceding the "replicate node" step with a "protect stgpool ... wait=yes".
Should I somehow reset the replication state of the nodes, as I have read (remove replnode, then update the node with syncsend/syncreceive)? Or should I clean up replication completely (remove replnode, delete the filespaces, remove the node, and run replicate node from scratch)?
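To be concrete, the two sequences I have in mind look roughly like this (just a sketch from what I have read, node name HAP only as an example; please correct me if the steps are wrong):
Code:
/* Option A: reset the replication state only (as I read it) */
/* remove replnode on both servers */
remove replnode hap
/* on the source server */
update node hap replstate=enabled replmode=syncsend
/* on the target server */
update node hap replstate=enabled replmode=syncreceive
replicate node hap

/* Option B: clean up the target and replicate from scratch */
/* remove replnode on both servers */
remove replnode hap
/* on the target server */
delete filespace hap * type=any
remove node hap
/* then again from the source */
replicate node hap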
I think the biggest problem is with this node:
Code:
Node Name           Type     Filespace Name      FSID      Files on      Replication          Files on
                                                              Server     Server (1)          Server (1)
---------------     ----     ---------------     ----     ----------     ---------------     ----------
HAP                 Arch     /tdpmux                2         29,171     TSM_HAIX_DR             97,113

It is the biggest one, and its occupancy on the replication server is huge!

Thanks.
 
I wouldn't do remove replnode unless it's the last resort.

So, if you have more files on the target than on the source for that node, when was the last successful replication for that node? query replication hap
And has expiration run successfully for that node since then?
SQL:
select start_time,end_time,affected from summary where activity='EXPIRATION' and successful='YES' and entity='HAP' order by end_time desc fetch first row only

To clean up the target, you could do replicate node hap purgedata=deleted
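In other words, roughly this sequence (untested sketch, node name from your output):
Code:
query replication hap
/* check the last successful expiration with the select above, then, */
/* if the extra files on the target were deleted on the source:      */
replicate node hap purgedata=deleted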

Side note: "protect + replicate" does not use more storage than "replicate" alone unless you were not replicating all nodes. If you were replicating all nodes, it's the same amount of storage either way; the only difference is how the extents are moved, not how many.
 
*have not read the other replies*

If you use the same policies on both servers, then something is completely wrong. 5-10% is the most the target pool should be bigger.

I always do an export policy to the target server to be sure those are in sync.

You could run a generate dedupstats on both servers and, when finished, use q dedupstats to see where you have "lost" that space on the target.
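Something along these lines on each server (pool and node names are only examples):
Code:
generate dedupstats dc_pool hap
/* wait for the background process to finish before querying */
query dedupstats dc_pool hap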

I think there is no difference whether you use protect + replicate or only replicate if both are dedup container pools; you will just transfer a lot more data. (So no.)

Maybe have a look at the purgedata or forcereconcile options in protect and/or replicate.
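For example, something roughly like this (parameter names from memory, so check which of them exist at your code level; pool and node names are placeholders):
Code:
protect stgpool sourcepool forcereconcile=yes wait=yes
protect stgpool sourcepool purgedata=deleted wait=yes
replicate node hap forcereconcile=yes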

It will always take time to release the space on the containers (reuse delay + some internal delete schedules or whatever magic is behind this *gg*).

You could also go hardcore in this case with a "select * from archives where node_name = '.....'" to see what you have :)
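Or, to get just a count first (node name only as an example):
SQL:
select count(*) from archives where node_name='HAP'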
 
Thank you all for the suggestions, but I think we have a database performance issue on the replication server that slows down every operation.
The DB is now 1.6TB and it takes more than 3 hours to back up (Linux OS, CPU at more than 70% I/O wait, TCPIP commmethod). I tried changing the commmethod from TCPIP to shared memory, but I think the problem is the disk configuration, not the loopback bandwidth.
Just consider that I tried to remove node HAP from the replication server and had to interrupt the "delete filespace" process; it had deleted about 24K files in 5 hours. The "expire inventory" process didn't delete anything.

I will let the "delete filespace" process run until it finishes.

Thanks.
 
Still fighting with this one...

I removed all nodes from replication (remove replnode on both servers), then I deleted the filespaces and nodes on the replica server and ran expire inventory.
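Roughly this, per node (from memory; the node name is a placeholder):
Code:
/* on both servers */
remove replnode nodename
/* on the replica server */
delete filespace nodename * type=any
remove node nodename
expire inventory wait=yes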
Now on the replica server I should have just the "protect stgpool" data, but on the primary server I have:

Code:
Storage          Device       Storage      Estimated     Pct     Pct     High    Low     Next Storage
Pool Name        Class Name   Type          Capacity    Util    Migr    Mig     Mig     Pool
                                                                        Pct     Pct
--------------   ----------   ---------   ----------   -----   -----   -----   -----   ------------
DC_REPLICA                    DIRECTORY    368.368 G    85,5

Storage Pool Name     Directory                                         Access
-----------------     ---------------------------------------------     ------------
DC_REPLICA            /TSM/replica00                                    Read/Write
DC_REPLICA            /TSM/replica01                                    Read/Write
DC_REPLICA            /TSM/replica02                                    Read/Write
DC_REPLICA            /TSM/replica03                                    Read/Write
DC_REPLICA            /TSM/replica04                                    Read/Write
DC_REPLICA            /TSM/replica05                                    Read/Write
DC_REPLICA            /TSM/replica06                                    Read/Write
DC_REPLICA            /TSM/replica07                                    Read/Write
DC_REPLICA            /TSM/replica08                                    Read/Write
DC_REPLICA            /TSM/replica09                                    Read/Write
DC_REPLICA            /TSM/replica10                                    Read/Write
DC_REPLICA            /TSM/replica11                                    Read/Write
Each filesystem is on a 30TB LUN.

On the replica server:
Code:
Storage          Device       Storage      Estimated     Pct     Pct     High    Low     Next Storage
Pool Name        Class Name   Type          Capacity    Util    Migr    Mig     Mig     Pool
                                                                        Pct     Pct
--------------   ----------   ---------   ----------   -----   -----   -----   -----   ------------
DC_POOL                       DIRECTORY    700,386 G    85.7

Storage Pool Name     Directory                                         Access
-----------------     ---------------------------------------------     ------------
DC_POOL               /TSM/dc001                                        Read/Write
DC_POOL               /TSM/dc002                                        Read/Write
DC_POOL               /TSM/dc003                                        Read/Write
DC_POOL               /TSM/dc004                                        Read/Write
DC_POOL               /TSM/dc005                                        Read/Write
DC_POOL               /TSM/dc006                                        Read/Write
Each filesystem is on a 115TB LUN.
Now there is a "protect stgpool purge=deleted" running, and then I think I will try a "forcereconcile".

I am expecting the replica pool to end up with almost the same occupancy as the primary pool (about 315TB); am I wrong? :(
 
You probably have a ton of backlog in updating and deleting extents; this can take some time. Check query extentupdate dc_pool. I bet there are large numbers in Pending Update or Eligible for Deletion. You'll only see the space go down as those numbers go down, and that can sometimes take days depending on the number of extents and the horsepower of the server.

A better option might have been to open a Case with IBM to understand what was wrong instead of using the bazooka approach.
 
Thank you Marclant,
I know that the bazooka approach is not the best one, but I was in a hurry and in a position to reset everything; that's why I pulled the trigger :)

Small update: PROTECT with purge=deleted removed 232,687,480 extents on the primary server, the stgpool on the replica server is still at 85%, and q extentupd didn't finish; it gave an ANS8001I return code 4 error.
 
I know that the bazooka approach is not the best one, but I was in a hurry
But, it's not always quicker. There's a lot of data to process.


Small update: PROTECT with purge=deleted removed 232,687,480 extents on the primary server, the stgpool on the replica server is still at 85%
It's not instant; the reuse delay needs to elapse first.
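You can check the current setting on the target pool, for example:
Code:
query stgpool dc_pool format=detailed
Look for the container reuse delay value in the detailed output.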


q extentupd didn't finish; it gave an ANS8001I return code 4 error.
Check the activity log between the command and that message for more info.
 
It's not instant; the reuse delay needs to elapse first.

Check the activity log between the command and that message for more info.
I set reusedelay to 0 and this is the actlog:
Code:
07/21/2021 16:26:15      ANR0162W Supplemental database diagnostic information:
                          -1:22003:-802 ([IBM][CLI Driver][DB2/LINUXX8664]
                          SQL0802N  Arithmetic overflow or other arithmetic
                          exception occurred.  SQLSTATE=22003). (SESSION: 13071)
07/21/2021 16:26:15      ANR0106E sdutil.c(14391): Unexpected error 2343 fetching
                          row in table "SD.Chunk.Locations". (SESSION: 13071)
SP 8.1.12.100 on AIX 7200-05-02-2114.

Thanks.
 
Arithmetic overflow or other arithmetic exception occurred.
I wonder if the numbers are too large to process. It might be, since you deleted all the filespaces; that's a lot of extents marked for deletion that the server needs to delete. I don't know where to go from here. Maybe recycle the instance and check query extentupdate again. It may take a while to run, but if it runs to completion, it will give you an idea of the amount of work left to do. You can recheck every few hours, see by how much it goes down, and from there extrapolate how many days it will take.
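Purely as an illustration with made-up numbers: if "Eligible for Deletion" drops from about 230 million to 200 million extents in 6 hours, that is roughly 5 million extents per hour, so working off the remaining 200 million would take on the order of another two days.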
 
I have a replication target at one customer where all the data is also being kept in the storage pool for no apparent reason. I deleted a bunch of data but the space is not released. Currently I have no idea what is happening; very strange, and I have not seen it before. I deleted filespaces and node data, ran expiration, and no space is released at all... (also a replication target server)

In your case I would just destroy the complete server and reinstall: delete all the containers in the filesystems and give it a go. This should only be a few hours of work. I always do a CLI install, which in the case of a recreate is just cut-and-paste work... (and it also serves as good documentation *gg*)
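From memory the skeleton is roughly this (paths and sizes are placeholders only; double-check the documentation before doing anything destructive):
Code:
# halt the server, then drop the existing database as the instance user
dsmserv removedb tsmdb1
# clean out the container filesystems, then format a fresh instance
dsmserv format dbdir=/tsmdb activelogsize=131072 activelogdirectory=/tsmlog archlogdirectory=/tsmarchlog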

In my case I need to keep some data, so there is no way to recreate it... :(

Good luck.

br, Dietmar
 
Thank you all, here is the last update for this one.

After deleting the filespaces and nodes and setting reusedelay to 0, the stgpool was freeing up at about 0.1% per hour. It was still at 80%, so I dropped TSMDB1 and started from scratch; now it is protecting at 1TB/hour.

Just remember that there is an automatic process that puts the reuse delay back to 1, 24 hours after you set it to 0.

Ciao.
 