[DEFECT] SP 8.1.4 Server to Server Sessions

ILCattivo

ADSM.ORG Senior Member
Happy New Year all.

So I have 2 x SP 8.1.4 servers located at different DC locations. 1 x Windows 2016 & 1 x RHEL 7

Both of these servers protect each other's container storage pools via the Protect STG and Replicate Node cmds daily, with MAXSESSIONS=30 for the storage pool protection process.
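For context, a daily protection cycle like this is typically driven by admin commands along these lines (a sketch only; the pool name is a placeholder, not taken from this thread):

```
/* Protect the directory-container pool to the partner server, */
/* using up to 30 parallel sessions as described above.        */
PROTECT STGPOOL contpool MAXSESSIONS=30 WAIT=YES

/* Then replicate node metadata to the same replication target. */
REPLICATE NODE * MAXSESSIONS=30 WAIT=YES
```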

Today I have discovered literally hundreds of 'RecvW' SSL server sessions, going back 200+ hours, on one of the destination servers from the other source server. But not the other way round..?

Admittedly the first Protect STG sync was a good 8+ TB over the WAN link and took a few days to complete, but since then these orphaned 'RecvW' server sessions have been hanging around, and they simply don't clear once the daily protection of the storage pool completes successfully.
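For anyone else hitting this, the stuck sessions can be spotted (and cleared manually) from the admin CLI. This is a sketch; the exact SESSIONS-table column names may vary by server level:

```
/* List server sessions stuck in receive-wait, longest waits first. */
select SESSION_ID, STATE, WAIT_SECONDS, SESSION_TYPE, CLIENT_NAME -
  from SESSIONS where STATE='RecvW' order by WAIT_SECONDS desc

/* Kill an individual orphaned session by number... */
CANCEL SESSION 12345
```

Note that CANCEL SESSION ALL also exists, but it kills every session on the server, including healthy ones, so cancel by session number where you can.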

Am I missing something here, or is this potentially another bug?

Thanks
 
I can confirm that, having looked at an identical setup of mine with 2 x SP 8.1.0 servers, both on Windows 2012, this behaviour is not present!

By the way, the odd orphaned-session behaviour described in the OP is Source [Win2k16] > Destination [RHEL 7], both SP 8.1.4.
 
RecvW (receive wait) is normally due to something at the networking layer. A transaction is in progress, but the receiving side suddenly has to wait mid-transfer. This situation "normally" stops when either the transmission continues or when the commtimeout is reached (60 seconds by default).
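That timeout can be checked and adjusted without a server restart; a sketch using standard admin commands:

```
/* Show the current setting (default is 60 seconds). */
QUERY OPTION commtimeout

/* Raise it temporarily, e.g. for a large initial sync over a slow WAN. */
SETOPT COMMTIMEOUT 3600
```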
 
> RecvW (receive wait) is normally due to something at the networking layer. A transaction is in progress, but the receiving side suddenly has to wait mid-transfer. This situation "normally" stops when either the transmission continues or when the commtimeout is reached (60 seconds by default).

Yep, can confirm that the COMMTIMEOUT setting is set to the default '60' so as you say, in theory the sessions should have stopped.

Interestingly on my 8.1.0 setup the COMMTIMEOUT is a lot lot bigger.
 
Have you ever solved the problem?

Ah ha.. Do we have someone else with the same issue here?

I have had a PMR open with IBM on this for months now. All sorts of traces have been put in place, and the only thing they can determine is network issues causing the Protect Stgpool sessions to disconnect abruptly, in the direction of my Windows ISP server > Linux ISP server. Strange, however, that the issue is not happening in the other direction over that same VPN tunnel.

We have upped the network traffic timeouts on our current firewalls at both ends to try and counter this, but to no avail. The issue still persists: Win > Lnx.
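One thing that can help when a firewall silently drops idle replication connections is enabling TCP keepalives on the server-to-server sessions. A sketch of the relevant server options on the source server; the option names and values below are from memory, so check the documentation for your exact level:

```
* In dsmserv.opt: send TCP keepalive probes on outbound
* server-to-server (replication) sessions.
KEEPALIVE YES
* Seconds of idle time before the first keepalive probe.
KEEPALIVETIME 300
* Seconds between subsequent probes.
KEEPALIVEINTERVAL 30
```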

We are due to upgrade our firewall hardware at both ends in the coming months, while also increasing the bandwidth capacity between the two, so if that doesn't sort it then it's deffo an issue with the underlying ISP code between the two different platforms!

Watch this space...
 
I have a similar problem: orphan replication sessions that stay on the receiving side even after they are closed on the sending side.
I have two Windows 2016 TSM servers, upgraded to 8.1.5 since opening the PMR (started with 8.1.4).
So I'm not sure if it has something to do with different platforms.
I have only one-direction replication, so I can't tell if the same happens the other way.
I will post if there are any findings from my PMR.
 
> I have a similar problem: orphan replication sessions that stay on the receiving side even after they are closed on the sending side.
> I have two Windows 2016 TSM servers, upgraded to 8.1.5 since opening the PMR (started with 8.1.4).
> So I'm not sure if it has something to do with different platforms.
> I have only one-direction replication, so I can't tell if the same happens the other way.
> I will post if there are any findings from my PMR.

Ok, it's good that you have a PMR open with what looks like an identical case.
The server OS sending the ISP 8.1.4.2 Protect Stg & replication data to my RHEL 7 ISP 8.1.4.2 server is also Windows Server 2016 <--- currently a common denominator between our two cases.

Tell you what I'll do: it might be worth posting my PMR ref no. here so, if you want, you can also provide it to them as a similar reference to your case.

PMR 30728,999,866

The chap I was dealing with @ Lvl 2 Support was Dave Border. [IBM UK]

I have my suspicions this is not network related if these kinds of issues are becoming more prevalent in ISP 8.1.4/5 between 2 replicating servers where at least one is coming from a Windows 2016 Server.

Most of the time taken to identify, what they believe to be the cause, will be the numerous trace files they require. Can take weeks!!
 
Hi,

My case is 07208,707,707 and I am still with L3 (opened on May 23rd).
My guy told me he passed my servmon.pl outputs to L2 a few days ago. I will post how it goes.
 
> Hi
>
> My case is 07208,707,707 and I am still with L3 (opened on May 23rd).
> My guy told me he passed my servmon.pl outputs to L2 a few days ago. I will post how it goes.

Ah yes, been through those too..
I am willing to bet they will come back pointing the finger at your network, like they did with me.

Not sure if you are using it yet where you are, but here in the UK we no longer use the old PMR system. It's now a self-service desk dashboard with 'Cases' instead of PMRs. Much better and quicker to manage.

Yep, keep us updated with progress, because mine just hit a brick wall when they said it was a network timeout issue!
 
Yes, I am using the new interface too, but I can also see a "Legacy case number", as I opened mine by mailing IBM.
No news here, just another issue after upgrading to ISP 8.1.5: the ISP DB backup will not work any more until you upgrade the ISP client on the ISP server to 8.1.4.1 (a newer GSKit is needed). So I now have another, forked case.
 
Hi
I wonder if you have got any solutions to your replication problems yet?
 
Not for my case. We are on some hold, waiting for hw upgrade. I will post news when there are any
 
> Yep, can confirm that the COMMTIMEOUT setting is set to the default '60' so as you say, in theory the sessions should have stopped.
>
> Interestingly on my 8.1.0 setup the COMMTIMEOUT is a lot lot bigger.

Yes, the sessions should have timed out - this really points to a code defect. The fact that your 8.1.0 setup works but 8.1.4 doesn't leads me to this conclusion.
 
> Hi
> I wonder if you have got any solutions to your replication problems yet?

As others have said, not yet, no!

> Not for my case. We are on some hold, waiting for hw upgrade. I will post news when there are any

Me too. IBM started to point the blame at our network switching / firewalling and bandwidth...
The switching is soon to be upgraded to the latest and greatest in the coming month or two, so we shall see, hey!

> Yes, the sessions should have timed out - this really points to a code defect. Given that your 8.1.0 works but not 8.1.4 leads me to this conclusion.

The 8.1.0 system in my case is Windows > Windows.
I suspect, as you do, that it's a code defect in ISP 8.1.4 going from Windows > Linux, because when Protect STG & Replicate Node run from Linux > Windows... guess what... it works fine and the sessions close cleanly.
 
Any progress or fixes for this issue? I'm seeing this with 8.1.5 server-to-server replication (storage pool protection). If a connection dies, the receiving server is left with "orphaned" sessions which DO NOT time out, which is rather strange...
 
In the environment where I had the problem, it went away after an upgrade to 8.1.5, some HW improvements (the second server's DB was moved to SSD; it was on spinning disks before), and there really were some network issues on the WAN link. After resolving the network issues, the whole thing started working way faster, with no orphaned connections.
 