Hi Andy,
Yes, we did get a pcap file for one of the failed LiteSpeed
backups. This backup was sent to a different TSM server than the text
pasted earlier in this email thread. (TCPPORT on this server is 1500.) The
details of the TSM server are the same as the profile (we have several new
Linux x86_64 servers): TSM 5.5.5.0 with efix for APAR IC71586, on RHEL 5.4
with the specific kernel mentioned previously.
At this point in the pcap, we see a "TCP ZeroWindow", and a successful
window update back to a non-zero size:
No. Time Source Destination Protocol
Info
30778 11.755143 TSM_CLIENT TSM_SERVER TCP 33391
> 1500 [ACK] Seq=32851627 Ack=2076 Win=65535 Len=1460
30779 11.755143 TSM_CLIENT TSM_SERVER TCP 33391
> 1500 [ACK] Seq=32853087 Ack=2076 Win=65535 Len=1460
30780 11.794143 TSM_SERVER TSM_CLIENT TCP [TCP
ZeroWindow] 1500 > 33391 [ACK] Seq=2076 Ack=32854547 Win=0 Len=0
30781 11.805144 TSM_SERVER TSM_CLIENT TCP [TCP
Window Update] 1500 > 33391 [ACK] Seq=2076 Ack=32854547 Win=8760 Len=0
30782 11.805144 TSM_SERVER TSM_CLIENT TCP [TCP
Window Update] 1500 > 33391 [ACK] Seq=2076 Ack=32854547 Win=35040 Len=0
30783 11.806144 TSM_CLIENT TSM_SERVER TCP 33391
> 1500 [ACK] Seq=32854547 Ack=2076 Win=65535 Len=1460
30784 11.806144 TSM_CLIENT TSM_SERVER TCP 33391
> 1500 [ACK] Seq=32856007 Ack=2076 Win=65535 Len=1460
At the end of the pcap, we see a "TCP ZeroWindow", but to the end of the
text the window stays Zero. The pcap was manually ended where the text
ends, but from talking with the folks that did the test, LiteSpeed
eventually times out at 5000 seconds, with no further backup progress.
After reviewing the wireshark output this point it appears either network
stack state machine is getting stuck with window size zero, or TSM is
waiting for some resource that isn't available. This TSM server is a
completely vanilla config, with all the same settings (as we can
determine) as the older machines running RHEL4, so I suspect the former.
The fact that nothing shows up in the actlog seems to agree.
No. Time Source Destination Protocol
Info
30972 11.808144 <Ignored>
30973 11.808144 TSM_CLIENT TSM_SERVER TCP [TCP
Previous segment lost] 33391 > 1500 [ACK] Seq=33051647 Ack=2076 Win=65535
Len=1460
30974 11.808144 TSM_SERVER TSM_CLIENT TCP 1500 >
33391 [ACK] Seq=2076 Ack=33039967 Win=64240 Len=0
30975 11.808144 TSM_CLIENT TSM_SERVER TCP 33391
> 1500 [ACK] Seq=33053107 Ack=2076 Win=65535 Len=1460
30976 11.808144 TSM_CLIENT TSM_SERVER TCP 33391
> 1500 [ACK] Seq=33054567 Ack=2076 Win=65535 Len=1460
30977 11.808144 TSM_SERVER TSM_CLIENT TCP 1500 >
33391 [ACK] Seq=2076 Ack=33042887 Win=64240 Len=0
30978 11.808144 TSM_CLIENT TSM_SERVER TCP 33391
> 1500 [ACK] Seq=33056027 Ack=2076 Win=65535 Len=1460
30979 11.808144 TSM_CLIENT TSM_SERVER TCP 33391
> 1500 [ACK] Seq=33057487 Ack=2076 Win=65535 Len=1460
30980 11.808144 TSM_SERVER TSM_CLIENT TCP 1500 >
33391 [ACK] Seq=2076 Ack=33045807 Win=64240 Len=0
30981 11.808144 TSM_CLIENT TSM_SERVER TCP 33391
> 1500 [ACK] Seq=33058947 Ack=2076 Win=65535 Len=1460
30982 11.808144 TSM_CLIENT TSM_SERVER TCP 33391
> 1500 [ACK] Seq=33060407 Ack=2076 Win=65535 Len=1460
30983 11.808144 TSM_CLIENT TSM_SERVER TCP 33391
> 1500 [ACK] Seq=33061867 Ack=2076 Win=65535 Len=1460
30984 11.808144 TSM_CLIENT TSM_SERVER TCP 33391
> 1500 [ACK] Seq=33063327 Ack=2076 Win=65535 Len=1460
30985 11.808144 TSM_CLIENT TSM_SERVER TCP 33391
> 1500 [ACK] Seq=33064787 Ack=2076 Win=65535 Len=1460
30986 11.808144 TSM_CLIENT TSM_SERVER TCP 33391
> 1500 [ACK] Seq=33066247 Ack=2076 Win=65535 Len=1460
30987 11.808144 TSM_CLIENT TSM_SERVER TCP 33391
> 1500 [ACK] Seq=33067707 Ack=2076 Win=65535 Len=1460
30988 11.808144 TSM_CLIENT TSM_SERVER TCP 33391
> 1500 [ACK] Seq=33069167 Ack=2076 Win=65535 Len=1460
30989 11.808144 TSM_CLIENT TSM_SERVER TCP 33391
> 1500 [ACK] Seq=33070627 Ack=2076 Win=65535 Len=1460
30990 11.808144 TSM_CLIENT TSM_SERVER TCP 33391
> 1500 [ACK] Seq=33072087 Ack=2076 Win=65535 Len=1460
30991 11.808144 TSM_CLIENT TSM_SERVER TCP 33391
> 1500 [ACK] Seq=33073547 Ack=2076 Win=65535 Len=1460
30992 11.808144 TSM_CLIENT TSM_SERVER TCP 33391
> 1500 [PSH, ACK] Seq=33075007 Ack=2076 Win=65535 Len=1460
30993 11.808144 TSM_CLIENT TSM_SERVER TCP 33391
> 1500 [ACK] Seq=33076467 Ack=2076 Win=65535 Len=1460
30994 11.808144 TSM_CLIENT TSM_SERVER TCP 33391
> 1500 [ACK] Seq=33077927 Ack=2076 Win=65535 Len=1460
30995 11.808144 TSM_CLIENT TSM_SERVER TCP 33391
> 1500 [ACK] Seq=33079387 Ack=2076 Win=65535 Len=1460
30996 11.808144 TSM_CLIENT TSM_SERVER TCP 33391
> 1500 [ACK] Seq=33080847 Ack=2076 Win=65535 Len=1460
30997 11.808144 TSM_CLIENT TSM_SERVER TCP 33391
> 1500 [ACK] Seq=33082307 Ack=2076 Win=65535 Len=1460
30998 11.808144 TSM_CLIENT TSM_SERVER TCP 33391
> 1500 [ACK] Seq=33083767 Ack=2076 Win=65535 Len=1460
30999 11.808144 TSM_CLIENT TSM_SERVER TCP 33391
> 1500 [ACK] Seq=33085227 Ack=2076 Win=65535 Len=1460
31000 11.808144 TSM_CLIENT TSM_SERVER TCP 33391
> 1500 [ACK] Seq=33086687 Ack=2076 Win=65535 Len=1460
31001 11.808144 TSM_CLIENT TSM_SERVER TCP 33391
> 1500 [ACK] Seq=33088147 Ack=2076 Win=65535 Len=1460
31002 11.808144 TSM_CLIENT TSM_SERVER TCP 33391
> 1500 [ACK] Seq=33089607 Ack=2076 Win=65535 Len=1460
31003 11.808144 TSM_CLIENT TSM_SERVER TCP 33391
> 1500 [ACK] Seq=33091067 Ack=2076 Win=65535 Len=1460
31004 11.808144 TSM_CLIENT TSM_SERVER TCP 33391
> 1500 [ACK] Seq=33092527 Ack=2076 Win=65535 Len=1460
31005 11.808144 TSM_CLIENT TSM_SERVER TCP 33391
> 1500 [ACK] Seq=33093987 Ack=2076 Win=65535 Len=1460
31006 11.808144 TSM_CLIENT TSM_SERVER TCP 33391
> 1500 [ACK] Seq=33095447 Ack=2076 Win=65535 Len=1460
31007 11.808144 TSM_CLIENT TSM_SERVER TCP 33391
> 1500 [ACK] Seq=33096907 Ack=2076 Win=65535 Len=1460
31008 11.808144 TSM_CLIENT TSM_SERVER TCP 33391
> 1500 [ACK] Seq=33098367 Ack=2076 Win=65535 Len=1460
31009 11.808144 TSM_CLIENT TSM_SERVER TCP 33391
> 1500 [ACK] Seq=33099827 Ack=2076 Win=65535 Len=1460
31010 11.808144 TSM_CLIENT TSM_SERVER TCP 33391
> 1500 [ACK] Seq=33101287 Ack=2076 Win=65535 Len=1460
31011 11.808144 TSM_CLIENT TSM_SERVER TCP 33391
> 1500 [ACK] Seq=33102747 Ack=2076 Win=65535 Len=1460
31012 11.808144 TSM_CLIENT TSM_SERVER TCP 33391
> 1500 [ACK] Seq=33104207 Ack=2076 Win=65535 Len=1460
31013 11.808144 TSM_CLIENT TSM_SERVER TCP 33391
> 1500 [ACK] Seq=33105667 Ack=2076 Win=65535 Len=1460
31014 11.808144 TSM_CLIENT TSM_SERVER TCP 33391
> 1500 [ACK] Seq=33107127 Ack=2076 Win=65535 Len=1460
31015 11.808144 TSM_CLIENT TSM_SERVER TCP 33391
> 1500 [ACK] Seq=33108587 Ack=2076 Win=65535 Len=1460
31016 11.841144 TSM_SERVER TSM_CLIENT TCP [TCP
ZeroWindow] 1500 > 33391 [ACK] Seq=2076 Ack=33110047 Win=0 Len=0
31017 16.501201 TSM_CLIENT TSM_SERVER TCP [TCP
ZeroWindowProbe] 33391 > 1500 [ACK] Seq=33110047 Ack=2076 Win=65535 Len=1
31018 16.501201 TSM_SERVER TSM_CLIENT TCP [TCP
ZeroWindowProbeAck] [TCP ZeroWindow] 1500 > 33391 [ACK] Seq=2076
Ack=33110047 Win=0 Len=0
31019 21.502260 TSM_CLIENT TSM_SERVER TCP [TCP
ZeroWindowProbe] 33391 > 1500 [ACK] Seq=33110047 Ack=2076 Win=65535 Len=1
31020 21.502260 TSM_SERVER TSM_CLIENT TCP [TCP
ZeroWindowProbeAck] [TCP ZeroWindow] 1500 > 33391 [ACK] Seq=2076
Ack=33110047 Win=0 Len=0
31021 26.527323 TSM_CLIENT TSM_SERVER TCP [TCP
ZeroWindowProbe] 33391 > 1500 [ACK] Seq=33110047 Ack=2076 Win=65535 Len=1
31022 26.527323 TSM_SERVER TSM_CLIENT TCP [TCP
ZeroWindowProbeAck] [TCP ZeroWindow] 1500 > 33391 [ACK] Seq=2076
Ack=33110047 Win=0 Len=0
I've found one discussion around RHEL 5.4 and a Microsoft box, but the
specifics vary quite a bit:
http://stackoverflow.com/questions/4833954/the-xbox-360-tcp-stack-does-not-respond-to-tcp-zero-window-probes-with-a-0-byte-p
Thanks,
[RC]
From:
Andrew Raibeck <storman AT US.IBM DOT COM>
To:
ADSM-L AT VM.MARIST DOT EDU
Date:
02/15/2011 11:54 AM
Subject:
Re: [ADSM-L] SQLLiteSpeed backups hanging when moved to TSM server on
RHEL5.
Sent by:
"ADSM: Dist Stor Manager" <ADSM-L AT VM.MARIST DOT EDU>
Robert,
On it's face, this sounds like something in the network, with "network"
being between the TSM client side TCP stack and the TSM server side TCP
stack. Have you done any kind of packet tracing to see what's going on?
Best regards,
Andy Raibeck
IBM Software Group
Tivoli Storage Manager Client Product Development
Level 3 Team Lead
Internal Notes e-mail: Andrew Raibeck/Hartford/IBM@IBMUS
Internet e-mail: storman AT us.ibm DOT com
IBM Tivoli Storage Manager support web page:
http://www.ibm.com/support/entry/portal/Overview/Software/Tivoli/Tivoli_Storage_Manager
"ADSM: Dist Stor Manager" <ADSM-L AT vm.marist DOT edu> wrote on 2011-02-15
14:21:57:
> From: Robert Clark <robert.clark7 AT USBANK DOT COM>
> To: ADSM-L AT vm.marist DOT edu
> Date: 2011-02-15 14:24
> Subject: SQLLiteSpeed backups hanging when moved to TSM server on RHEL5.
> Sent by: "ADSM: Dist Stor Manager" <ADSM-L AT vm.marist DOT edu>
>
> We're running into a problem when trying to change SQLLiteSpeed backups
> clients to point to new TSM servers.
>
> The old TSM servers are RHEL4 (on Intel) running TSM server 5.5.5.0 with
> efix for APAR IC71586. (kernel 2.6.9-89.0.11.ELsmp)
>
> The new TSM servers are RHEL5 (on Intel) running TSM server 5.5.5.0 with
> efix for APAR IC71586. (kernel 2.6.18-164.11.1.el5)
>
>
> We've made sure all the relevant values are set the same on the new
> servers, as on the old. (Management classes, disk storage pools,
maxnummp,
> and everything displayed in "q opt" output on the TSM server.)
>
> The two SQLLiteSpeed clients we've used for testing are:
>
> GENERICSYSTEMNAME1
> O/S: 2008
> SQL version: - 10.0.4000.0 (2008 SP2)
> SQL litespeed version: - 5.0.2.0
> TSM Client: 6.1.3.0
>
> GENERICSYSTEMNAME2
> O/S: 2003
> SQL version: - 9.00.4207.00 (2005 SP3)
> SQL litespeed version:- 5.0.2.0
> TSM Client: 6.1.3.0
>
> We have gathered client side trace, and it appears to indicate the
socket
> is being closed:
>
> 02/09/2011 15:07:51.192 : commtcp.cpp (2525): ANS1006I TCP/IP write
error
> on socket = 9300, errno = 10053, reason : An established connection was
> aborted by the software in your host machine.
>
> 02/09/2011 15:07:51.192 : apisend.cpp (1175):
> Contents of verb (0x7) Data, length: 32768:
>
> 02/09/2011 15:07:51.192 : commtcp.cpp (2525): ANS1006I TCP/IP write
error
> on socket = 4294967295, errno = 10038, reason : An operation was
attempted
>
> on something that is not a socket.
>
> We have also gathered server side trace, but nothing unusual has been
> noted there.
>
> The symptom on the TSM server is that backup session stops making
progress
> after a few minutes, and ultimately must be canceled to be cleaned up.
>
> We've opened a case with Tivoli support, and are working with the
> sysadmins of the TSM server. We're not making much progress. My hope is
> to jog the memory of the list and see if anyone has seen window size or
> other stack weirdness with RHEL 5 that is triggered by LiteSpeed
backups.
>
> Thanks,
> [RC]
U.S. BANCORP made the following annotations
---------------------------------------------------------------------
Electronic Privacy Notice. This e-mail, and any attachments, contains
information that is, or may be, covered by electronic communications privacy
laws, and is also confidential and proprietary in nature. If you are not the
intended recipient, please be advised that you are legally prohibited from
retaining, using, copying, distributing, or otherwise disclosing this
information in any manner. Instead, please reply to the sender that you have
received this communication in error, and then immediately delete it. Thank you
in advance for your cooperation.
---------------------------------------------------------------------
|