[ADSM-L] An update to the continuing saga of TSM 5.5.5.0 on RedHat 5.4 on x86_64.

2011-09-21 11:46:38
Subject: [ADSM-L] An update to the continuing saga of TSM 5.5.5.0 on RedHat 5.4 on x86_64.
From: Robert Clark <robert.clark7 AT USBANK DOT COM>
To: ADSM-L AT VM.MARIST DOT EDU
Date: Wed, 21 Sep 2011 08:42:53 -0700
Hi List,

Disabling the buggy TCP window scaling in RHEL 5.4 (on the TSM
server) appears to be an effective workaround for this part of the
problem. (The change can be made on the fly, and can be added to
sysctl.conf so that it persists across reboots.)

(net.ipv4.tcp_window_scaling=0 is the flag.)
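
For reference, a minimal sketch of the two steps (plain sysctl usage on the
RHEL box, nothing TSM-specific):

    # turn window scaling off on the running kernel, effective immediately
    sysctl -w net.ipv4.tcp_window_scaling=0

    # make it persistent across reboots, then reload the settings
    echo "net.ipv4.tcp_window_scaling = 0" >> /etc/sysctl.conf
    sysctl -p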

The bug has been fixed in at least one level of RHEL 5.6, and we will be
upgrading to that level when the logistics can be worked out.

Thanks,
[RC]

----- Forwarded by Robert A Clark/OR/USB on 09/21/2011 08:36 AM -----

From:   Robert A Clark/OR/USB
To:     "ADSM: Dist Stor Manager" <ADSM-L AT VM.MARIST DOT EDU>
Date:   09/15/2011 01:50 PM
Subject:        Re: [ADSM-L] The continuing saga of TSM 5.5.5.0 on RedHat
5.4 on x86_64.


Hi Andrew,

The start of this problem is obscure. We did have serious problems when
Linux was upgraded from RHEL 5.3 => 5.4 on the TSM servers, which required
us to set "TCPWINDOWsize 0" in dsmserv.opt on the Linux servers. Log
pinning has been a problem for us for at least two or three months, and it
has been getting worse lately.
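
For clarity, the server-side change amounted to one line in dsmserv.opt
(0 tells the server to take the operating system's default TCP window
size; lines starting with an asterisk are comments):

    * dsmserv.opt on the Linux TSM servers
    * 0 = use the operating system default TCP window size
    TCPWINDOWSIZE 0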

Starting at 02:25:53.536 this morning, this particular client (which
demonstrates the problem pretty reliably) went into "Producer is waiting
on consumer". It stayed that way until we ran 'dsmcutil stop /name:"TSM
Scheduler"' once we arrived at work.

Somewhere in all the semantics between the TSM client and the TSM server,
the little FSA state machine has entered the "I've fallen and I can't get
up!" state. The client and server don't seem to do anything about this.

A network capture during this same timeframe shows the client sending 39
bytes of data every 5 seconds, which the server ACKs.
This behavior repeats until the end of the capture at ~04:00.
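
(The capture itself was nothing exotic; something along these lines is
enough to see the 39-byte sends and the ACKs, assuming tcpdump is on the
box and the server is listening on the default TCP port 1500. The
interface name and CLIENT_IP below are placeholders:)

    # capture traffic between one problem client and the TSM server port
    tcpdump -i eth0 -s 0 -w /tmp/tsm-client.pcap host CLIENT_IP and port 1500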

Personally, I'm looking at
https://bugzilla.redhat.com/show_bug.cgi?id=627496, although it's not
clear whether this patch for kernel-2.6.18-194 would apply to RHEL 5.4 or
not. (We're running 2.6.18-164.11.1.el5 at the moment.)
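
(Quick way to compare what we're actually running against what the erratum
patches:)

    # running kernel vs. installed kernel packages
    uname -r
    rpm -q kernel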

Tonight, I'm changing from "TCPWINDOWSize 63" to "TCPWINDOWSize 0" in the
dsm.opt on some of the nodes, to see if it helps the client recover from
whatever is happening.
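
The client-side change is just as small (dsm.opt on the affected nodes;
the old value is left in as a comment so it's easy to back out):

    * was: TCPWindowsize 63
    TCPWindowsize 0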

09/15/2011 02:25:52.942 [000736] [1880] : DccTaskStatus.cpp   ( 818):
DccTaskletMsg.msgID ==> 1
09/15/2011 02:25:52.942 [000736] [1880] : dsfifo.cpp          ( 641):
fifoQgetNextWaitNoTS(1794b24): cbCanProceed=True.
09/15/2011 02:25:52.942 [000736] [1880] : dsfifo.cpp          ( 690):
fifoQgetNextWaitNoTS(1794b24): Next object is available.
09/15/2011 02:25:52.942 [000736] [1880] : dsfifo.cpp          ( 342):
fifoQchangeWaitTime(1794b24): wait time 1644, change 0
09/15/2011 02:25:52.942 [000736] [1880] : circq.cpp           ( 248): Q
17a3eb0, poped 3d9eef0, count 1, top index 434 - 1773b78, bot index 435 -
3e66328
09/15/2011 02:25:52.942 [000736] [1880] : dsfifo.cpp          ( 745):
fifoQgetNext(1794b24): Giving entry 3d9eef0, rc 0.
09/15/2011 02:25:52.942 [000736] [1880] : DccTaskStatus.cpp   ( 818):
DccTaskletMsg.msgID ==> 2
09/15/2011 02:25:52.942 [000736] [1880] : dsfifo.cpp          ( 641):
fifoQgetNextWaitNoTS(1794b24): cbCanProceed=True.
09/15/2011 02:25:52.942 [000736] [1880] : dsfifo.cpp          ( 690):
fifoQgetNextWaitNoTS(1794b24): Next object is available.
09/15/2011 02:25:52.942 [000736] [1880] : dsfifo.cpp          ( 342):
fifoQchangeWaitTime(1794b24): wait time 1644, change 0
09/15/2011 02:25:52.942 [000736] [1880] : circq.cpp           ( 248): Q
17a3eb0, poped 1773b78, count 0, top index 435 - 3e66328, bot index 435 -
3e66328
09/15/2011 02:25:52.942 [000736] [1880] : dsfifo.cpp          ( 735):
fifoQgetNext(1794b24): Giving entry 1773b78, but next entry is NULL.
09/15/2011 02:25:52.942 [000736] [1880] : dsfifo.cpp          ( 745):
fifoQgetNext(1794b24): Giving entry 1773b78, rc 0.
09/15/2011 02:25:52.942 [000736] [1880] : DccTaskStatus.cpp   ( 818):
DccTaskletMsg.msgID ==> 1

                 B/A Performance thread         4================>
09/15/2011 02:25:53.536 [000736] [16312] : pkthread.cpp        ( 678):
After TimedWait rc = 964, milliSecondsLeft = 0, cbCanProceed = 0.

                B/A Txn Producer thread         2================>
09/15/2011 02:25:53.536 [000736] [14136] : dsfifo.cpp          ( 431):
fifoQinsert(3d531b0): Producer is done waiting, rc is 0
09/15/2011 02:25:53.536 [000736] [14136] : dsfifo.cpp          ( 433):
fifoQinsert() Thread 14136 returned from wait on queue 3D531B0 .
09/15/2011 02:25:53.536 [000736] [14136] : dsfifo.cpp          ( 342):
fifoQchangeWaitTime(3d531b0): wait time -19640, change -1000
09/15/2011 02:25:53.536 [000736] [14136] : dsfifo.cpp          ( 420):
fifoQinsert(3d531b0): Producer is waiting on consumer
09/15/2011 02:25:53.536 [000736] [14136] : dsfifo.cpp          ( 422):
fifoQinsert() Thread 14136 waiting on queue 3D531B0 .

                            main thread         0================>
09/15/2011 02:25:53.942 [000736] [1880] : pkthread.cpp        ( 678):
After TimedWait rc = 964, milliSecondsLeft = 0, cbCanProceed = 0.

                 B/A Performance thread         4================>
09/15/2011 02:25:54.536 [000736] [16312] : pkthread.cpp        ( 678):
After TimedWait rc = 964, milliSecondsLeft = 0, cbCanProceed = 0.

                B/A Txn Producer thread         2================>
09/15/2011 02:25:54.536 [000736] [14136] : dsfifo.cpp          ( 431):
fifoQinsert(3d531b0): Producer is done waiting, rc is 0
09/15/2011 02:25:54.536 [000736] [14136] : dsfifo.cpp          ( 433):
fifoQinsert() Thread 14136 returned from wait on queue 3D531B0 .
09/15/2011 02:25:54.536 [000736] [14136] : dsfifo.cpp          ( 342):
fifoQchangeWaitTime(3d531b0): wait time -20640, change -1000
09/15/2011 02:25:54.536 [000736] [14136] : dsfifo.cpp          ( 420):
fifoQinsert(3d531b0): Producer is waiting on consumer
09/15/2011 02:25:54.536 [000736] [14136] : dsfifo.cpp          ( 422):
fifoQinsert() Thread 14136 waiting on queue 3D531B0 .

Thanks,
[RC]



From:   Andrew Raibeck <storman AT US.IBM DOT COM>
To:     ADSM-L AT VM.MARIST DOT EDU
Date:   09/14/2011 04:05 PM
Subject:        Re: [ADSM-L] The continuing saga of TSM 5.5.5.0 on RedHat
5.4 on x86_64.
Sent by:        "ADSM: Dist Stor Manager" <ADSM-L AT VM.MARIST DOT EDU>



Hi Robert,

The RecvW state suggests that the TSM server is waiting on data from the
TSM client, and isn't receiving it. Either the server sent something to the
client and is awaiting a response but the client never got the information;
or the client sent something back, but the server didn't get it. Off-hand,
this sounds like an underlying networking-related issue.

I know you are a long-time TSM user, so I imagine that it hasn't been like
this for years. :-) Do you have any history on when this problem started to
manifest? And if so, what environmental changes might have coincided with
that time?

The first thing that comes to mind is an issue with the Windows Scalable
Networking Pack (SNP) that is installed and activated when you install
Windows 2003 SP2. See this document:

https://www-304.ibm.com/support/docview.wss?uid=swg21460285

In addition, do you specify the TCPWINDOWSIZE option in dsm.opt? And if so,
do you deviate from the default value of 63? If so, I recommend commenting
it out, restarting the scheduler service, and seeing how that works.
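
For example, a minimal sketch of that change (the service name below is
whatever your scheduler service is actually registered as):

    * dsm.opt: comment out the explicit setting so the default applies
    * TCPWindowsize 63

then restart the scheduler so it rereads dsm.opt:

    dsmcutil stop /name:"TSM Scheduler"
    dsmcutil start /name:"TSM Scheduler"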

Best regards,

Andy Raibeck
IBM Software Group
Tivoli Storage Manager Client Product Development
Level 3 Team Lead
Internal Notes e-mail: Andrew Raibeck/Hartford/IBM@IBMUS
Internet e-mail: storman AT us.ibm DOT com

IBM Tivoli Storage Manager support web page:
http://www.ibm.com/support/entry/portal/Overview/Software/Tivoli/Tivoli_Storage_Manager


"ADSM: Dist Stor Manager" <ADSM-L AT vm.marist DOT edu> wrote on 2011-09-14
17:05:35:

> From: Robert Clark <robert.clark7 AT USBANK DOT COM>
> To: ADSM-L AT vm.marist DOT edu
> Date: 2011-09-14 17:07
> Subject: The continuing saga of TSM 5.5.5.0 on RedHat 5.4 on x86_64.
> Sent by: "ADSM: Dist Stor Manager" <ADSM-L AT vm.marist DOT edu>
>
> Hi List,
>
> We have a number of TSM clients (5.5.2.0 on mostly Windows 2003 SP2) that
> stop making forward progress in the busy part of the night, and are just
> sitting in RecvW when we check in the morning.
>
> These clients that aren't making progress appear to be involved, as cause
> or effect, in the log pinning and the log reaching 80% full, at which
> point dsmserv starts killing the oldest session. (The cynic in me wonders
> if this pain was added to get people to move from 5.5 to 6.x.)
>
> For the problem sessions below, there is only one session per node, and
> based on the received bytes count, these appear to correspond to the data
> session and not the meta-data session.
>
> Stopping the backup on the client side 'dsmcutil q /name:"TSM Scheduler"',
> and then running a manual "dsmc i" finishes a manual incremental
> (typically) in under 5 minutes.
>
> So far, I've upped the MAXNUMMP value for all the affected nodes from 2 to
> 5, and have removed "queryschedperiod 12" and "quiet yes" from the client
> option set.
>
> I have a PMR open with Tivoli support, but wanted to check with the list
> to see if anyone has seen this particular weirdness before.
>
>
> tsm: TSMXX15>q ses
>
>   Sess     Comm.      Sess        Wait       Bytes      Bytes     Sess     Platform     Client Name
> Number     Method     State       Time        Sent      Recvd     Type
> ------     ------     ------     ------     -------    -------    -----    --------     --------------------
>
> THESE SYSTEMS WERE EXPERIENCING THE PROBLEM ON THE MORNING THIS OUTPUT WAS
> GATHERED:
>
> 56,511     Tcp/Ip     RecvW        0 S       7.4 K    135.8 G     Node     WinNT        FRED
> 56,844     Tcp/Ip     RecvW      40.5 M        760     640.1 M     Node     WinNT        BARNEY
> 58,047     Tcp/Ip     RecvW      46.3 M      1.5 K     150.7 M     Node     WinNT        WILMA
> 58,907     Tcp/Ip     RecvW       9.1 M      4.7 K      56.0 G     Node     WinNT        DINO
> 59,677     Tcp/Ip     RecvW      33.1 M        736     946.8 M     Node     WinNT        BAMBAM
>
> THESE SYSTEMS WERE NOT EXPERIENCING THE PROBLEM ON THE MORNING THIS OUTPUT
> WAS GATHERED:
>
> 64,606     Tcp/Ip     RecvW      37.7 M      1.8 K     156.3 M     Node     SQLLite-     TRURL_SQL
>                                                                             Speed
> 64,725     Tcp/Ip     RecvW      10.0 M      2.5 K       2.2 G     Node     SQLLite-     TRURL_SQL
>                                                                             Speed
> 64,759     Tcp/Ip     IdleW       21 S       8.1 M        530      Node     AIX          KLAPAUCIUS
> 64,760     Tcp/Ip     RecvW        0 S         352       1.1 G     Node     AIX          KLAPAUCIUS
>
> THE NODE NAMES IN THIS EMAIL WERE CHANGED TO CONFORM TO POLICY.
>
> Thanks,
> [RC]

