ADSM-L

Re: [ADSM-L] SQLLiteSpeed backups hanging when moved to TSM server on RHEL5.

2011-02-16 13:15:14
Subject: Re: [ADSM-L] SQLLiteSpeed backups hanging when moved to TSM server on RHEL5.
From: Robert Clark <robert.clark7 AT USBANK DOT COM>
To: ADSM-L AT VM.MARIST DOT EDU
Date: Wed, 16 Feb 2011 10:14:00 -0800
Hi Andy,

        Yes, we did get a pcap file for one of the failed LiteSpeed
backups. This backup was sent to a different TSM server than the text
pasted earlier in this email thread. (TCPPORT on this server is 1500.) The
details of the TSM server are the same as the profile (we have several new
Linux x86_64 servers): TSM 5.5.5.0 with efix for APAR IC71586, on RHEL 5.4
with the specific kernel mentioned previously.

At this point in the pcap, we see a "TCP ZeroWindow", and a successful
window update back to a non-zero size:

No.     Time        Source                Destination           Protocol
Info
  30778 11.755143   TSM_CLIENT          TSM_SERVER         TCP      33391
> 1500 [ACK] Seq=32851627 Ack=2076 Win=65535 Len=1460
  30779 11.755143   TSM_CLIENT          TSM_SERVER         TCP      33391
> 1500 [ACK] Seq=32853087 Ack=2076 Win=65535 Len=1460
  30780 11.794143   TSM_SERVER         TSM_CLIENT          TCP      [TCP
ZeroWindow] 1500 > 33391 [ACK] Seq=2076 Ack=32854547 Win=0 Len=0
  30781 11.805144   TSM_SERVER         TSM_CLIENT          TCP      [TCP
Window Update] 1500 > 33391 [ACK] Seq=2076 Ack=32854547 Win=8760 Len=0
  30782 11.805144   TSM_SERVER         TSM_CLIENT          TCP      [TCP
Window Update] 1500 > 33391 [ACK] Seq=2076 Ack=32854547 Win=35040 Len=0
  30783 11.806144   TSM_CLIENT          TSM_SERVER         TCP      33391
> 1500 [ACK] Seq=32854547 Ack=2076 Win=65535 Len=1460
  30784 11.806144   TSM_CLIENT          TSM_SERVER         TCP      33391
> 1500 [ACK] Seq=32856007 Ack=2076 Win=65535 Len=1460


At the end of the pcap, we see a "TCP ZeroWindow", but to the end of the
text the window stays Zero. The pcap was manually ended where the text
ends, but from talking with the folks that did the test, LiteSpeed
eventually times out at 5000 seconds, with no further backup progress.

After reviewing the wireshark output this point it appears either network
stack state machine is getting stuck with window size zero, or TSM is
waiting for some resource that isn't available. This TSM server is a
completely vanilla config, with all the same settings (as we can
determine) as the older machines running RHEL4, so I suspect the former.
The fact that nothing shows up in the actlog seems to agree.

No.     Time        Source                Destination           Protocol
Info
  30972 11.808144 <Ignored>
  30973 11.808144   TSM_CLIENT          TSM_SERVER         TCP      [TCP
Previous segment lost] 33391 > 1500 [ACK] Seq=33051647 Ack=2076 Win=65535
Len=1460
  30974 11.808144   TSM_SERVER         TSM_CLIENT          TCP      1500 >
33391 [ACK] Seq=2076 Ack=33039967 Win=64240 Len=0
  30975 11.808144   TSM_CLIENT          TSM_SERVER         TCP      33391
> 1500 [ACK] Seq=33053107 Ack=2076 Win=65535 Len=1460
  30976 11.808144   TSM_CLIENT          TSM_SERVER         TCP      33391
> 1500 [ACK] Seq=33054567 Ack=2076 Win=65535 Len=1460
  30977 11.808144   TSM_SERVER         TSM_CLIENT          TCP      1500 >
33391 [ACK] Seq=2076 Ack=33042887 Win=64240 Len=0
  30978 11.808144   TSM_CLIENT          TSM_SERVER         TCP      33391
> 1500 [ACK] Seq=33056027 Ack=2076 Win=65535 Len=1460
  30979 11.808144   TSM_CLIENT          TSM_SERVER         TCP      33391
> 1500 [ACK] Seq=33057487 Ack=2076 Win=65535 Len=1460
  30980 11.808144   TSM_SERVER         TSM_CLIENT          TCP      1500 >
33391 [ACK] Seq=2076 Ack=33045807 Win=64240 Len=0
  30981 11.808144   TSM_CLIENT          TSM_SERVER         TCP      33391
> 1500 [ACK] Seq=33058947 Ack=2076 Win=65535 Len=1460
  30982 11.808144   TSM_CLIENT          TSM_SERVER         TCP      33391
> 1500 [ACK] Seq=33060407 Ack=2076 Win=65535 Len=1460
  30983 11.808144   TSM_CLIENT          TSM_SERVER         TCP      33391
> 1500 [ACK] Seq=33061867 Ack=2076 Win=65535 Len=1460
  30984 11.808144   TSM_CLIENT          TSM_SERVER         TCP      33391
> 1500 [ACK] Seq=33063327 Ack=2076 Win=65535 Len=1460
  30985 11.808144   TSM_CLIENT          TSM_SERVER         TCP      33391
> 1500 [ACK] Seq=33064787 Ack=2076 Win=65535 Len=1460
  30986 11.808144   TSM_CLIENT          TSM_SERVER         TCP      33391
> 1500 [ACK] Seq=33066247 Ack=2076 Win=65535 Len=1460
  30987 11.808144   TSM_CLIENT          TSM_SERVER         TCP      33391
> 1500 [ACK] Seq=33067707 Ack=2076 Win=65535 Len=1460
  30988 11.808144   TSM_CLIENT          TSM_SERVER         TCP      33391
> 1500 [ACK] Seq=33069167 Ack=2076 Win=65535 Len=1460
  30989 11.808144   TSM_CLIENT          TSM_SERVER         TCP      33391
> 1500 [ACK] Seq=33070627 Ack=2076 Win=65535 Len=1460
  30990 11.808144   TSM_CLIENT          TSM_SERVER         TCP      33391
> 1500 [ACK] Seq=33072087 Ack=2076 Win=65535 Len=1460
  30991 11.808144   TSM_CLIENT          TSM_SERVER         TCP      33391
> 1500 [ACK] Seq=33073547 Ack=2076 Win=65535 Len=1460
  30992 11.808144   TSM_CLIENT          TSM_SERVER         TCP      33391
> 1500 [PSH, ACK] Seq=33075007 Ack=2076 Win=65535 Len=1460
  30993 11.808144   TSM_CLIENT          TSM_SERVER         TCP      33391
> 1500 [ACK] Seq=33076467 Ack=2076 Win=65535 Len=1460
  30994 11.808144   TSM_CLIENT          TSM_SERVER         TCP      33391
> 1500 [ACK] Seq=33077927 Ack=2076 Win=65535 Len=1460
  30995 11.808144   TSM_CLIENT          TSM_SERVER         TCP      33391
> 1500 [ACK] Seq=33079387 Ack=2076 Win=65535 Len=1460
  30996 11.808144   TSM_CLIENT          TSM_SERVER         TCP      33391
> 1500 [ACK] Seq=33080847 Ack=2076 Win=65535 Len=1460
  30997 11.808144   TSM_CLIENT          TSM_SERVER         TCP      33391
> 1500 [ACK] Seq=33082307 Ack=2076 Win=65535 Len=1460
  30998 11.808144   TSM_CLIENT          TSM_SERVER         TCP      33391
> 1500 [ACK] Seq=33083767 Ack=2076 Win=65535 Len=1460
  30999 11.808144   TSM_CLIENT          TSM_SERVER         TCP      33391
> 1500 [ACK] Seq=33085227 Ack=2076 Win=65535 Len=1460
  31000 11.808144   TSM_CLIENT          TSM_SERVER         TCP      33391
> 1500 [ACK] Seq=33086687 Ack=2076 Win=65535 Len=1460
  31001 11.808144   TSM_CLIENT          TSM_SERVER         TCP      33391
> 1500 [ACK] Seq=33088147 Ack=2076 Win=65535 Len=1460
  31002 11.808144   TSM_CLIENT          TSM_SERVER         TCP      33391
> 1500 [ACK] Seq=33089607 Ack=2076 Win=65535 Len=1460
  31003 11.808144   TSM_CLIENT          TSM_SERVER         TCP      33391
> 1500 [ACK] Seq=33091067 Ack=2076 Win=65535 Len=1460
  31004 11.808144   TSM_CLIENT          TSM_SERVER         TCP      33391
> 1500 [ACK] Seq=33092527 Ack=2076 Win=65535 Len=1460
  31005 11.808144   TSM_CLIENT          TSM_SERVER         TCP      33391
> 1500 [ACK] Seq=33093987 Ack=2076 Win=65535 Len=1460
  31006 11.808144   TSM_CLIENT          TSM_SERVER         TCP      33391
> 1500 [ACK] Seq=33095447 Ack=2076 Win=65535 Len=1460
  31007 11.808144   TSM_CLIENT          TSM_SERVER         TCP      33391
> 1500 [ACK] Seq=33096907 Ack=2076 Win=65535 Len=1460
  31008 11.808144   TSM_CLIENT          TSM_SERVER         TCP      33391
> 1500 [ACK] Seq=33098367 Ack=2076 Win=65535 Len=1460
  31009 11.808144   TSM_CLIENT          TSM_SERVER         TCP      33391
> 1500 [ACK] Seq=33099827 Ack=2076 Win=65535 Len=1460
  31010 11.808144   TSM_CLIENT          TSM_SERVER         TCP      33391
> 1500 [ACK] Seq=33101287 Ack=2076 Win=65535 Len=1460
  31011 11.808144   TSM_CLIENT          TSM_SERVER         TCP      33391
> 1500 [ACK] Seq=33102747 Ack=2076 Win=65535 Len=1460
  31012 11.808144   TSM_CLIENT          TSM_SERVER         TCP      33391
> 1500 [ACK] Seq=33104207 Ack=2076 Win=65535 Len=1460
  31013 11.808144   TSM_CLIENT          TSM_SERVER         TCP      33391
> 1500 [ACK] Seq=33105667 Ack=2076 Win=65535 Len=1460
  31014 11.808144   TSM_CLIENT          TSM_SERVER         TCP      33391
> 1500 [ACK] Seq=33107127 Ack=2076 Win=65535 Len=1460
  31015 11.808144   TSM_CLIENT          TSM_SERVER         TCP      33391
> 1500 [ACK] Seq=33108587 Ack=2076 Win=65535 Len=1460
  31016 11.841144   TSM_SERVER         TSM_CLIENT          TCP      [TCP
ZeroWindow] 1500 > 33391 [ACK] Seq=2076 Ack=33110047 Win=0 Len=0
  31017 16.501201   TSM_CLIENT          TSM_SERVER         TCP      [TCP
ZeroWindowProbe] 33391 > 1500 [ACK] Seq=33110047 Ack=2076 Win=65535 Len=1
  31018 16.501201   TSM_SERVER         TSM_CLIENT          TCP      [TCP
ZeroWindowProbeAck] [TCP ZeroWindow] 1500 > 33391 [ACK] Seq=2076
Ack=33110047 Win=0 Len=0
  31019 21.502260   TSM_CLIENT          TSM_SERVER         TCP      [TCP
ZeroWindowProbe] 33391 > 1500 [ACK] Seq=33110047 Ack=2076 Win=65535 Len=1
  31020 21.502260   TSM_SERVER         TSM_CLIENT          TCP      [TCP
ZeroWindowProbeAck] [TCP ZeroWindow] 1500 > 33391 [ACK] Seq=2076
Ack=33110047 Win=0 Len=0
  31021 26.527323   TSM_CLIENT          TSM_SERVER         TCP      [TCP
ZeroWindowProbe] 33391 > 1500 [ACK] Seq=33110047 Ack=2076 Win=65535 Len=1
  31022 26.527323   TSM_SERVER         TSM_CLIENT          TCP      [TCP
ZeroWindowProbeAck] [TCP ZeroWindow] 1500 > 33391 [ACK] Seq=2076
Ack=33110047 Win=0 Len=0


I've found one discussion around RHEL 5.4 and a Microsoft box, but the
specifics vary quite a bit:

http://stackoverflow.com/questions/4833954/the-xbox-360-tcp-stack-does-not-respond-to-tcp-zero-window-probes-with-a-0-byte-p

Thanks,
[RC]




From:
Andrew Raibeck <storman AT US.IBM DOT COM>
To:
ADSM-L AT VM.MARIST DOT EDU
Date:
02/15/2011 11:54 AM
Subject:
Re: [ADSM-L] SQLLiteSpeed backups hanging when moved to TSM server on
RHEL5.
Sent by:
"ADSM: Dist Stor Manager" <ADSM-L AT VM.MARIST DOT EDU>



Robert,

On it's face, this sounds like something in the network, with "network"
being between the TSM client side TCP stack and the TSM server side TCP
stack. Have you done any kind of packet tracing to see what's going on?

Best regards,

Andy Raibeck
IBM Software Group
Tivoli Storage Manager Client Product Development
Level 3 Team Lead
Internal Notes e-mail: Andrew Raibeck/Hartford/IBM@IBMUS
Internet e-mail: storman AT us.ibm DOT com

IBM Tivoli Storage Manager support web page:
http://www.ibm.com/support/entry/portal/Overview/Software/Tivoli/Tivoli_Storage_Manager


"ADSM: Dist Stor Manager" <ADSM-L AT vm.marist DOT edu> wrote on 2011-02-15
14:21:57:

> From: Robert Clark <robert.clark7 AT USBANK DOT COM>
> To: ADSM-L AT vm.marist DOT edu
> Date: 2011-02-15 14:24
> Subject: SQLLiteSpeed backups hanging when moved to TSM server on RHEL5.
> Sent by: "ADSM: Dist Stor Manager" <ADSM-L AT vm.marist DOT edu>
>
> We're running into a problem when trying to change SQLLiteSpeed backups
> clients to point to new TSM servers.
>
> The old TSM servers are RHEL4 (on Intel) running TSM server 5.5.5.0 with
> efix for APAR IC71586.  (kernel 2.6.9-89.0.11.ELsmp)
>
> The new TSM servers are RHEL5 (on Intel) running TSM server 5.5.5.0 with
> efix for APAR IC71586. (kernel 2.6.18-164.11.1.el5)
>
>
> We've made sure all the relevant values are set the same on the new
> servers, as on the old. (Management classes, disk storage pools,
maxnummp,
> and everything displayed in "q opt" output on the TSM server.)
>
> The two SQLLiteSpeed clients we've used for testing are:
>
> GENERICSYSTEMNAME1
> O/S: 2008
> SQL version: - 10.0.4000.0 (2008 SP2)
> SQL litespeed version: - 5.0.2.0
> TSM Client: 6.1.3.0
>
> GENERICSYSTEMNAME2
> O/S: 2003
> SQL version: - 9.00.4207.00 (2005 SP3)
> SQL litespeed version:- 5.0.2.0
> TSM Client: 6.1.3.0
>
> We have gathered client side trace, and it appears to indicate the
socket
> is being closed:
>
> 02/09/2011 15:07:51.192 : commtcp.cpp (2525): ANS1006I TCP/IP write
error
> on socket = 9300, errno = 10053, reason : An established connection was
> aborted by the software in your host machine.
>
> 02/09/2011 15:07:51.192 : apisend.cpp (1175):
> Contents of verb (0x7) Data, length: 32768:
>
> 02/09/2011 15:07:51.192 : commtcp.cpp (2525): ANS1006I TCP/IP write
error
> on socket = 4294967295, errno = 10038, reason : An operation was
attempted
>
> on something that is not a socket.
>
> We have also gathered server side trace, but nothing unusual has been
> noted there.
>
> The symptom on the TSM server is that backup session stops making
progress
> after a few minutes, and ultimately must be canceled to be cleaned up.
>
> We've opened a case with Tivoli support, and are working with the
> sysadmins of the TSM server.  We're not making much progress. My hope is
> to jog the memory of the list and see if anyone has seen window size or
> other stack weirdness with RHEL 5 that is triggered by LiteSpeed
backups.
>
> Thanks,
> [RC]


U.S. BANCORP made the following annotations
---------------------------------------------------------------------
Electronic Privacy Notice. This e-mail, and any attachments, contains 
information that is, or may be, covered by electronic communications privacy 
laws, and is also confidential and proprietary in nature. If you are not the 
intended recipient, please be advised that you are legally prohibited from 
retaining, using, copying, distributing, or otherwise disclosing this 
information in any manner. Instead, please reply to the sender that you have 
received this communication in error, and then immediately delete it. Thank you 
in advance for your cooperation.



---------------------------------------------------------------------