ADSM-L

Re: [ADSM-L] SQLLiteSpeed backups hanging when moved to TSM server on RHEL5.

2011-03-04 19:57:03
Subject: Re: [ADSM-L] SQLLiteSpeed backups hanging when moved to TSM server on RHEL5.
From: Robert Clark <robert.clark7 AT USBANK DOT COM>
To: ADSM-L AT VM.MARIST DOT EDU
Date: Fri, 4 Mar 2011 16:55:29 -0800
Hi Andrew,

We don't have all the details, but we appear to have a workaround.

When we went through all the TSM servers comparing settings, none of them
had TCPWINDOWSIZE specified. All servers had defaulted to 63k (64512).

Even though we expect TSM to request 63k TCP windows when opening the
socket, we think we're seeing larger (64k) windows in the failing
sessions (the ones shown earlier in this thread, for example).

During troubleshooting, we set TCPWINDOWSIZE to 65536 in dsmserv.opt
and restarted dsmserv, but 64512 still showed up in "q opt" output.

On the next test, we set TCPWINDOWSIZE to "0". This tells dsmserv to use
the OS setting, which was 64k.
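
To make the "0 means use the OS setting" behaviour concrete, here is a
minimal Python sketch of the two cases at the socket level. It assumes the
option ends up as an SO_RCVBUF request on the session socket, which nothing
in this thread confirms; effective_rcvbuf is a hypothetical helper written
for illustration, not anything shipped with TSM. Andy's earlier note gives
the value as 63, which suggests the option is expressed in kilobytes, so 63
would correspond to the 64512 shown by "q opt".

```python
import socket

KIB = 1024


def effective_rcvbuf(requested_bytes=None):
    """Return the receive buffer size the OS reports for a fresh TCP socket.

    requested_bytes=None leaves the OS default untouched (the analogue of
    TCPWINDOWSIZE 0 under the assumption above); any other value is
    requested explicitly through SO_RCVBUF before reading it back.
    """
    with socket.socket(socket.AF_INET, socket.SOCK_STREAM) as s:
        if requested_bytes is not None:
            s.setsockopt(socket.SOL_SOCKET, socket.SO_RCVBUF, requested_bytes)
        return s.getsockopt(socket.SOL_SOCKET, socket.SO_RCVBUF)


if __name__ == "__main__":
    print("63 KiB in bytes      :", 63 * KIB)
    print("OS default buffer    :", effective_rcvbuf())
    print("after asking for 63k :", effective_rcvbuf(63 * KIB))
```

The value read back will usually differ from the value requested. Linux, for
example, doubles the request to leave room for bookkeeping and caps it at
net.core.rmem_max, so compare what is reported rather than what was asked for.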

Since making this change, we have not been able to duplicate the failure.
Also, the backups are completing faster, and the amount of wait shown in
the "q ses" output has gone down.

I'll pass more details back to the list, once we've had a chance to do
some more tuning.

Thanks,
[RC]





From:
Andrew Raibeck <storman AT US.IBM DOT COM>
To:
ADSM-L AT VM.MARIST DOT EDU
Date:
02/17/2011 08:25 AM
Subject:
Re: [ADSM-L] SQLLiteSpeed backups hanging when moved to TSM server on
RHEL5.
Sent by:
"ADSM: Dist Stor Manager" <ADSM-L AT VM.MARIST DOT EDU>



Robert,

I am not an expert on reading packet traces, but a couple of thoughts:

- TSM only sets up the window size at the time the session is opened, based
on the TCPWINDOWSIZE setting. What are the current TCPWINDOWSIZE settings
for your TSM server and TSM clients? Note: for Windows, you should use the
default (don't specify a window size); or if you must code something, code
TCPWINDOWSIZE 63 (which happens to be the default value). I've seen larger
sizes cause the "shrinking window" behavior, which is controlled by the
network (once the session is established, TSM does not change the window
size). On the TSM server, make sure not to use a TCPWINDOWSIZE greater than
63 unless you are certain that the environment is configured to support RFC
1323.

- The socket is not being closed by TSM, but by "the network", probably due
to the communications problems. There is probably some kind of timing
problem going on, with the data coming in too quickly.

- On Windows 2003, check if you could be experiencing the issue described
in http://www.ibm.com/support/docview.wss?uid=swg21460285, and take the
corrective actions described therein. I'm not confident this is the
problem, but still worth checking.
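
The "no more than 63 without RFC 1323" advice in the first point comes down
to simple arithmetic: the window field in the TCP header is 16 bits wide, so
an unscaled window tops out at 65535 bytes, and anything larger needs the
RFC 1323 window scale option. A minimal Python sketch of that check follows;
required_scale_shift is a hypothetical helper name, and the sizes tried are
only examples.

```python
MAX_UNSCALED_WINDOW = 0xFFFF  # 16-bit TCP window field: 65535 bytes
MAX_SCALE_SHIFT = 14          # largest shift count RFC 1323 allows
KIB = 1024


def required_scale_shift(window_bytes):
    """Return the smallest window scale shift that can carry window_bytes.

    A result of 0 means the window fits in the plain 16-bit field and no
    RFC 1323 support is needed on either end of the connection.
    """
    if window_bytes < 0:
        raise ValueError("window size cannot be negative")
    shift = 0
    while (window_bytes >> shift) > MAX_UNSCALED_WINDOW:
        shift += 1
        if shift > MAX_SCALE_SHIFT:
            raise ValueError("window too large even with scaling")
    return shift


if __name__ == "__main__":
    for kib in (63, 64, 2048):
        shift = required_scale_shift(kib * KIB)
        print(f"{kib} KiB -> scale shift {shift}")
```

63 KiB (64512 bytes) fits without scaling, while 64 KiB (65536 bytes) is one
byte past the 16-bit limit and already depends on both ends negotiating
window scaling.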

Best regards,

Andy Raibeck
IBM Software Group
Tivoli Storage Manager Client Product Development
Level 3 Team Lead
Internal Notes e-mail: Andrew Raibeck/Hartford/IBM@IBMUS
Internet e-mail: storman AT us.ibm DOT com

IBM Tivoli Storage Manager support web page:
http://www.ibm.com/support/entry/portal/Overview/Software/Tivoli/Tivoli_Storage_Manager


"ADSM: Dist Stor Manager" <ADSM-L AT vm.marist DOT edu> wrote on 2011-02-16
13:14:00:

> From: Robert Clark <robert.clark7 AT USBANK DOT COM>
> To: ADSM-L AT vm.marist DOT edu
> Date: 2011-02-16 13:18
> Subject: Re: SQLLiteSpeed backups hanging when moved to TSM server on RHEL5.
> Sent by: "ADSM: Dist Stor Manager" <ADSM-L AT vm.marist DOT edu>
>
> Hi Andy,
>
>         Yes, we did get a pcap file for one of the failed LiteSpeed
> backups. This backup was sent to a different TSM server than the text
> pasted earlier in this email thread. (TCPPORT on this server is 1500.) The
> details of the TSM server are the same as the profile (we have several new
> Linux x86_64 servers): TSM 5.5.5.0 with efix for APAR IC71586, on RHEL 5.4
> with the specific kernel mentioned previously.
>
> At this point in the pcap, we see a "TCP ZeroWindow", and a successful
> window update back to a non-zero size:
>
> No.     Time        Source        Destination   Protocol  Info
>   30778 11.755143   TSM_CLIENT    TSM_SERVER    TCP       33391 > 1500 [ACK] Seq=32851627 Ack=2076 Win=65535 Len=1460
>   30779 11.755143   TSM_CLIENT    TSM_SERVER    TCP       33391 > 1500 [ACK] Seq=32853087 Ack=2076 Win=65535 Len=1460
>   30780 11.794143   TSM_SERVER    TSM_CLIENT    TCP       [TCP ZeroWindow] 1500 > 33391 [ACK] Seq=2076 Ack=32854547 Win=0 Len=0
>   30781 11.805144   TSM_SERVER    TSM_CLIENT    TCP       [TCP Window Update] 1500 > 33391 [ACK] Seq=2076 Ack=32854547 Win=8760 Len=0
>   30782 11.805144   TSM_SERVER    TSM_CLIENT    TCP       [TCP Window Update] 1500 > 33391 [ACK] Seq=2076 Ack=32854547 Win=35040 Len=0
>   30783 11.806144   TSM_CLIENT    TSM_SERVER    TCP       33391 > 1500 [ACK] Seq=32854547 Ack=2076 Win=65535 Len=1460
>   30784 11.806144   TSM_CLIENT    TSM_SERVER    TCP       33391 > 1500 [ACK] Seq=32856007 Ack=2076 Win=65535 Len=1460
>
>
> At the end of the pcap, we see a "TCP ZeroWindow", but to the end of the
> text the window stays Zero. The pcap was manually ended where the text
> ends, but from talking with the folks that did the test, LiteSpeed
> eventually times out at 5000 seconds, with no further backup progress.
>
> After reviewing the Wireshark output, at this point it appears that either
> the network stack state machine is getting stuck with a window size of
> zero, or TSM is waiting for some resource that isn't available. This TSM
> server is a completely vanilla config, with all the same settings (as far
> as we can determine) as the older machines running RHEL4, so I suspect the
> former. The fact that nothing shows up in the actlog seems to agree.
>
> No.     Time        Source        Destination   Protocol  Info
>   30972 11.808144   <Ignored>
>   30973 11.808144   TSM_CLIENT    TSM_SERVER    TCP       [TCP Previous segment lost] 33391 > 1500 [ACK] Seq=33051647 Ack=2076 Win=65535 Len=1460
>   30974 11.808144   TSM_SERVER    TSM_CLIENT    TCP       1500 > 33391 [ACK] Seq=2076 Ack=33039967 Win=64240 Len=0
>   30975 11.808144   TSM_CLIENT    TSM_SERVER    TCP       33391 > 1500 [ACK] Seq=33053107 Ack=2076 Win=65535 Len=1460
>   30976 11.808144   TSM_CLIENT    TSM_SERVER    TCP       33391 > 1500 [ACK] Seq=33054567 Ack=2076 Win=65535 Len=1460
>   30977 11.808144   TSM_SERVER    TSM_CLIENT    TCP       1500 > 33391 [ACK] Seq=2076 Ack=33042887 Win=64240 Len=0
>   30978 11.808144   TSM_CLIENT    TSM_SERVER    TCP       33391 > 1500 [ACK] Seq=33056027 Ack=2076 Win=65535 Len=1460
>   30979 11.808144   TSM_CLIENT    TSM_SERVER    TCP       33391 > 1500 [ACK] Seq=33057487 Ack=2076 Win=65535 Len=1460
>   30980 11.808144   TSM_SERVER    TSM_CLIENT    TCP       1500 > 33391 [ACK] Seq=2076 Ack=33045807 Win=64240 Len=0
>   30981 11.808144   TSM_CLIENT    TSM_SERVER    TCP       33391 > 1500 [ACK] Seq=33058947 Ack=2076 Win=65535 Len=1460
>   30982 11.808144   TSM_CLIENT    TSM_SERVER    TCP       33391 > 1500 [ACK] Seq=33060407 Ack=2076 Win=65535 Len=1460
>   30983 11.808144   TSM_CLIENT    TSM_SERVER    TCP       33391 > 1500 [ACK] Seq=33061867 Ack=2076 Win=65535 Len=1460
>   30984 11.808144   TSM_CLIENT    TSM_SERVER    TCP       33391 > 1500 [ACK] Seq=33063327 Ack=2076 Win=65535 Len=1460
>   30985 11.808144   TSM_CLIENT    TSM_SERVER    TCP       33391 > 1500 [ACK] Seq=33064787 Ack=2076 Win=65535 Len=1460
>   30986 11.808144   TSM_CLIENT    TSM_SERVER    TCP       33391 > 1500 [ACK] Seq=33066247 Ack=2076 Win=65535 Len=1460
>   30987 11.808144   TSM_CLIENT    TSM_SERVER    TCP       33391 > 1500 [ACK] Seq=33067707 Ack=2076 Win=65535 Len=1460
>   30988 11.808144   TSM_CLIENT    TSM_SERVER    TCP       33391 > 1500 [ACK] Seq=33069167 Ack=2076 Win=65535 Len=1460
>   30989 11.808144   TSM_CLIENT    TSM_SERVER    TCP       33391 > 1500 [ACK] Seq=33070627 Ack=2076 Win=65535 Len=1460
>   30990 11.808144   TSM_CLIENT    TSM_SERVER    TCP       33391 > 1500 [ACK] Seq=33072087 Ack=2076 Win=65535 Len=1460
>   30991 11.808144   TSM_CLIENT    TSM_SERVER    TCP       33391 > 1500 [ACK] Seq=33073547 Ack=2076 Win=65535 Len=1460
>   30992 11.808144   TSM_CLIENT    TSM_SERVER    TCP       33391 > 1500 [PSH, ACK] Seq=33075007 Ack=2076 Win=65535 Len=1460
>   30993 11.808144   TSM_CLIENT    TSM_SERVER    TCP       33391 > 1500 [ACK] Seq=33076467 Ack=2076 Win=65535 Len=1460
>   30994 11.808144   TSM_CLIENT    TSM_SERVER    TCP       33391 > 1500 [ACK] Seq=33077927 Ack=2076 Win=65535 Len=1460
>   30995 11.808144   TSM_CLIENT    TSM_SERVER    TCP       33391 > 1500 [ACK] Seq=33079387 Ack=2076 Win=65535 Len=1460
>   30996 11.808144   TSM_CLIENT    TSM_SERVER    TCP       33391 > 1500 [ACK] Seq=33080847 Ack=2076 Win=65535 Len=1460
>   30997 11.808144   TSM_CLIENT    TSM_SERVER    TCP       33391 > 1500 [ACK] Seq=33082307 Ack=2076 Win=65535 Len=1460
>   30998 11.808144   TSM_CLIENT    TSM_SERVER    TCP       33391 > 1500 [ACK] Seq=33083767 Ack=2076 Win=65535 Len=1460
>   30999 11.808144   TSM_CLIENT    TSM_SERVER    TCP       33391 > 1500 [ACK] Seq=33085227 Ack=2076 Win=65535 Len=1460
>   31000 11.808144   TSM_CLIENT    TSM_SERVER    TCP       33391 > 1500 [ACK] Seq=33086687 Ack=2076 Win=65535 Len=1460
>   31001 11.808144   TSM_CLIENT    TSM_SERVER    TCP       33391 > 1500 [ACK] Seq=33088147 Ack=2076 Win=65535 Len=1460
>   31002 11.808144   TSM_CLIENT    TSM_SERVER    TCP       33391 > 1500 [ACK] Seq=33089607 Ack=2076 Win=65535 Len=1460
>   31003 11.808144   TSM_CLIENT    TSM_SERVER    TCP       33391 > 1500 [ACK] Seq=33091067 Ack=2076 Win=65535 Len=1460
>   31004 11.808144   TSM_CLIENT    TSM_SERVER    TCP       33391 > 1500 [ACK] Seq=33092527 Ack=2076 Win=65535 Len=1460
>   31005 11.808144   TSM_CLIENT    TSM_SERVER    TCP       33391 > 1500 [ACK] Seq=33093987 Ack=2076 Win=65535 Len=1460
>   31006 11.808144   TSM_CLIENT    TSM_SERVER    TCP       33391 > 1500 [ACK] Seq=33095447 Ack=2076 Win=65535 Len=1460
>   31007 11.808144   TSM_CLIENT    TSM_SERVER    TCP       33391 > 1500 [ACK] Seq=33096907 Ack=2076 Win=65535 Len=1460
>   31008 11.808144   TSM_CLIENT    TSM_SERVER    TCP       33391 > 1500 [ACK] Seq=33098367 Ack=2076 Win=65535 Len=1460
>   31009 11.808144   TSM_CLIENT    TSM_SERVER    TCP       33391 > 1500 [ACK] Seq=33099827 Ack=2076 Win=65535 Len=1460
>   31010 11.808144   TSM_CLIENT    TSM_SERVER    TCP       33391 > 1500 [ACK] Seq=33101287 Ack=2076 Win=65535 Len=1460
>   31011 11.808144   TSM_CLIENT    TSM_SERVER    TCP       33391 > 1500 [ACK] Seq=33102747 Ack=2076 Win=65535 Len=1460
>   31012 11.808144   TSM_CLIENT    TSM_SERVER    TCP       33391 > 1500 [ACK] Seq=33104207 Ack=2076 Win=65535 Len=1460
>   31013 11.808144   TSM_CLIENT    TSM_SERVER    TCP       33391 > 1500 [ACK] Seq=33105667 Ack=2076 Win=65535 Len=1460
>   31014 11.808144   TSM_CLIENT    TSM_SERVER    TCP       33391 > 1500 [ACK] Seq=33107127 Ack=2076 Win=65535 Len=1460
>   31015 11.808144   TSM_CLIENT    TSM_SERVER    TCP       33391 > 1500 [ACK] Seq=33108587 Ack=2076 Win=65535 Len=1460
>   31016 11.841144   TSM_SERVER    TSM_CLIENT    TCP       [TCP ZeroWindow] 1500 > 33391 [ACK] Seq=2076 Ack=33110047 Win=0 Len=0
>   31017 16.501201   TSM_CLIENT    TSM_SERVER    TCP       [TCP ZeroWindowProbe] 33391 > 1500 [ACK] Seq=33110047 Ack=2076 Win=65535 Len=1
>   31018 16.501201   TSM_SERVER    TSM_CLIENT    TCP       [TCP ZeroWindowProbeAck] [TCP ZeroWindow] 1500 > 33391 [ACK] Seq=2076 Ack=33110047 Win=0 Len=0
>   31019 21.502260   TSM_CLIENT    TSM_SERVER    TCP       [TCP ZeroWindowProbe] 33391 > 1500 [ACK] Seq=33110047 Ack=2076 Win=65535 Len=1
>   31020 21.502260   TSM_SERVER    TSM_CLIENT    TCP       [TCP ZeroWindowProbeAck] [TCP ZeroWindow] 1500 > 33391 [ACK] Seq=2076 Ack=33110047 Win=0 Len=0
>   31021 26.527323   TSM_CLIENT    TSM_SERVER    TCP       [TCP ZeroWindowProbe] 33391 > 1500 [ACK] Seq=33110047 Ack=2076 Win=65535 Len=1
>   31022 26.527323   TSM_SERVER    TSM_CLIENT    TCP       [TCP ZeroWindowProbeAck] [TCP ZeroWindow] 1500 > 33391 [ACK] Seq=2076 Ack=33110047 Win=0 Len=0
>
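
The two excerpts above show one zero-window episode that recovers and one
that never does. For anyone wanting to pull the same picture out of a longer
Wireshark text export, here is a rough Python sketch. It assumes each packet
summary sits on a single line in the column layout shown in this thread;
zero_window_stalls and SAMPLE are names made up for this example, and SAMPLE
holds a handful of packets copied from the excerpts.

```python
import re

# Packet number, time, source, destination, then the advertised window.
PACKET_RE = re.compile(
    r"^\s*(?P<no>\d+)\s+(?P<time>\d+\.\d+)\s+(?P<src>\S+)\s+(?P<dst>\S+)"
    r"\s+TCP\b.*?\bWin=(?P<win>\d+)"
)


def zero_window_stalls(lines):
    """Return (source, start_time, end_time) for each zero-window episode.

    An episode starts when a host first advertises Win=0 and ends when the
    same host next advertises a non-zero window. end_time is None when the
    capture ends with the window still closed.
    """
    open_stalls = {}
    stalls = []
    for line in lines:
        match = PACKET_RE.match(line.lstrip("> "))  # tolerate email quoting
        if match is None:
            continue  # headers, <Ignored> rows, blank lines
        src = match["src"]
        when = float(match["time"])
        if int(match["win"]) == 0:
            open_stalls.setdefault(src, when)
        elif src in open_stalls:
            stalls.append((src, open_stalls.pop(src), when))
    for src, start in open_stalls.items():
        stalls.append((src, start, None))
    return stalls


SAMPLE = [
    "30779 11.755143 TSM_CLIENT TSM_SERVER TCP 33391 > 1500 [ACK] Seq=32853087 Ack=2076 Win=65535 Len=1460",
    "30780 11.794143 TSM_SERVER TSM_CLIENT TCP [TCP ZeroWindow] 1500 > 33391 [ACK] Seq=2076 Ack=32854547 Win=0 Len=0",
    "30781 11.805144 TSM_SERVER TSM_CLIENT TCP [TCP Window Update] 1500 > 33391 [ACK] Seq=2076 Ack=32854547 Win=8760 Len=0",
    "31016 11.841144 TSM_SERVER TSM_CLIENT TCP [TCP ZeroWindow] 1500 > 33391 [ACK] Seq=2076 Ack=33110047 Win=0 Len=0",
    "31017 16.501201 TSM_CLIENT TSM_SERVER TCP [TCP ZeroWindowProbe] 33391 > 1500 [ACK] Seq=33110047 Ack=2076 Win=65535 Len=1",
    "31022 26.527323 TSM_SERVER TSM_CLIENT TCP [TCP ZeroWindowProbeAck] [TCP ZeroWindow] 1500 > 33391 [ACK] Seq=2076 Ack=33110047 Win=0 Len=0",
]

if __name__ == "__main__":
    for src, start, end in zero_window_stalls(SAMPLE):
        status = "still closed at end of capture" if end is None else f"reopened at {end}"
        print(f"{src}: window closed at {start}, {status}")
```

An episode with no end time is the interesting one: it marks the point where
the server stopped draining its receive buffer for the rest of the capture.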
>
> I've found one discussion around RHEL 5.4 and a Microsoft box, but the
> specifics vary quite a bit:
>
> http://stackoverflow.com/questions/4833954/the-xbox-360-tcp-stack-does-not-respond-to-tcp-zero-window-probes-with-a-0-byte-p
>
> Thanks,
> [RC]
>
>
>
>
> From:
> Andrew Raibeck <storman AT US.IBM DOT COM>
> To:
> ADSM-L AT VM.MARIST DOT EDU
> Date:
> 02/15/2011 11:54 AM
> Subject:
> Re: [ADSM-L] SQLLiteSpeed backups hanging when moved to TSM server on
> RHEL5.
> Sent by:
> "ADSM: Dist Stor Manager" <ADSM-L AT VM.MARIST DOT EDU>
>
>
>
> Robert,
>
> On its face, this sounds like something in the network, with "network"
> being between the TSM client side TCP stack and the TSM server side TCP
> stack. Have you done any kind of packet tracing to see what's going on?
>
> Best regards,
>
> Andy Raibeck
> IBM Software Group
> Tivoli Storage Manager Client Product Development
> Level 3 Team Lead
> Internal Notes e-mail: Andrew Raibeck/Hartford/IBM@IBMUS
> Internet e-mail: storman AT us.ibm DOT com
>
> IBM Tivoli Storage Manager support web page:
> http://www.ibm.com/support/entry/portal/Overview/Software/Tivoli/Tivoli_Storage_Manager
>
>
> "ADSM: Dist Stor Manager" <ADSM-L AT vm.marist DOT edu> wrote on 2011-02-15
> 14:21:57:
>
> > From: Robert Clark <robert.clark7 AT USBANK DOT COM>
> > To: ADSM-L AT vm.marist DOT edu
> > Date: 2011-02-15 14:24
> > Subject: SQLLiteSpeed backups hanging when moved to TSM server on RHEL5.
> > Sent by: "ADSM: Dist Stor Manager" <ADSM-L AT vm.marist DOT edu>
> >
> > We're running into a problem when trying to change SQLLiteSpeed backups
> > clients to point to new TSM servers.
> >
> > The old TSM servers are RHEL4 (on Intel) running TSM server 5.5.5.0 with
> > efix for APAR IC71586.  (kernel 2.6.9-89.0.11.ELsmp)
> >
> > The new TSM servers are RHEL5 (on Intel) running TSM server 5.5.5.0 with
> > efix for APAR IC71586. (kernel 2.6.18-164.11.1.el5)
> >
> >
> > We've made sure all the relevant values are set the same on the new
> > servers, as on the old. (Management classes, disk storage pools, maxnummp,
> > and everything displayed in "q opt" output on the TSM server.)
> >
> > The two SQLLiteSpeed clients we've used for testing are:
> >
> > GENERICSYSTEMNAME1
> > O/S: 2008
> > SQL version: - 10.0.4000.0 (2008 SP2)
> > SQL litespeed version: - 5.0.2.0
> > TSM Client: 6.1.3.0
> >
> > GENERICSYSTEMNAME2
> > O/S: 2003
> > SQL version: - 9.00.4207.00 (2005 SP3)
> > SQL litespeed version:- 5.0.2.0
> > TSM Client: 6.1.3.0
> >
> > We have gathered client side trace, and it appears to indicate the socket
> > is being closed:
> >
> > 02/09/2011 15:07:51.192 : commtcp.cpp (2525): ANS1006I TCP/IP write error on socket = 9300, errno = 10053, reason : An established connection was aborted by the software in your host machine.
> >
> > 02/09/2011 15:07:51.192 : apisend.cpp (1175):
> > Contents of verb (0x7) Data, length: 32768:
> >
> > 02/09/2011 15:07:51.192 : commtcp.cpp (2525): ANS1006I TCP/IP write error on socket = 4294967295, errno = 10038, reason : An operation was attempted on something that is not a socket.
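
The two error lines in the client trace above can be decoded a little
further. 10053 and 10038 are the Winsock codes WSAECONNABORTED and
WSAENOTSOCK, and 4294967295 is what -1 looks like when printed as an
unsigned 32-bit number. Reading that as an already-invalidated socket handle
is an inference on my part, not something the trace states. The Python
sketch below only illustrates the arithmetic; describe and as_signed32 are
hypothetical helper names.

```python
import ctypes

# Winsock error codes that appear in the client trace.
WINSOCK_ERRORS = {
    10038: "WSAENOTSOCK",
    10053: "WSAECONNABORTED",
}


def as_signed32(value):
    """Reinterpret an unsigned 32-bit number as a signed 32-bit integer."""
    return ctypes.c_int32(value).value


def describe(sock, err):
    """Summarise one 'write error on socket = X, errno = Y' trace entry."""
    name = WINSOCK_ERRORS.get(err, "unknown")
    if as_signed32(sock) == -1:
        handle = "invalid handle (-1)"
    else:
        handle = f"handle {sock}"
    return f"{name} on {handle}"


if __name__ == "__main__":
    print(describe(9300, 10053))
    print(describe(4294967295, 10038))
```

Read this way, the first entry is the real failure (the local stack aborted
an established connection) and the second looks like a follow-on write
attempted after the handle had already been discarded.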
> >
> > We have also gathered server side trace, but nothing unusual has been
> > noted there.
> >
> > The symptom on the TSM server is that backup session stops making progress
> > after a few minutes, and ultimately must be canceled to be cleaned up.
> >
> > We've opened a case with Tivoli support, and are working with the
> > sysadmins of the TSM server.  We're not making much progress. My hope is
> > to jog the memory of the list and see if anyone has seen window size or
> > other stack weirdness with RHEL 5 that is triggered by LiteSpeed backups.
> >
> > Thanks,
> > [RC]
>
>
> U.S. BANCORP made the following annotations
> ---------------------------------------------------------------------
> Electronic Privacy Notice. This e-mail, and any attachments,
> contains information that is, or may be, covered by electronic
> communications privacy laws, and is also confidential and
> proprietary in nature. If you are not the intended recipient, please
> be advised that you are legally prohibited from retaining, using,
> copying, distributing, or otherwise disclosing this information in
> any manner. Instead, please reply to the sender that you have
> received this communication in error, and then immediately delete
> it. Thank you in advance for your cooperation.
>
>
>
> ---------------------------------------------------------------------

