ADSM-L

Re: Possibly OT: How to diagnose 3494 ATL "communications" failures

2004-06-15 11:07:13
Subject: Re: Possibly OT: How to diagnose 3494 ATL "communications" failures
From: Zoltan Forray/AC/VCU <zforray AT VCU DOT EDU>
To: ADSM-L AT VM.MARIST DOT EDU
Date: Tue, 15 Jun 2004 10:30:16 -0400
No, that is not the real issue. This is what I kept trying to explain to
IBM.

There is no communication failure, as far as I can see, other than the AIX
system saying the LM is not responding in a timely fashion.  When this
occurs, AIX says the box is offline.  The LM and AIX can ping each other
from both ends, but the LM functionality is lost. TSM/AIX no longer knows
what tapes are mounted, what their status is/was, etc.  When
I go to the LM and force a reinit of the port (it says the port is
initialized and functioning), the problems usually clear up, for a while.
The LM is not what you would consider "hung up". Everything continues to
function from the MVS/zOS side that is sharing the library.
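
Since the symptom is the LM "not responding in a timely fashion" rather
than a dead link, one way to pin down when it starts is to time a library
query at regular intervals and log the result. The following is only a
minimal sketch: timed_probe is a hypothetical helper, and the command it
runs is whatever query you choose to pass in.

```python
import subprocess
import time


def timed_probe(cmd, timeout_s=30.0):
    """Run cmd once and return (status, elapsed_seconds).

    status is "ok" (exit code 0), "error" (non-zero exit or the command
    could not be started) or "timeout" (no answer within timeout_s).
    """
    start = time.monotonic()
    try:
        result = subprocess.run(cmd, capture_output=True, timeout=timeout_s)
        status = "ok" if result.returncode == 0 else "error"
    except subprocess.TimeoutExpired:
        # subprocess.run kills the child before raising this.
        status = "timeout"
    except OSError:
        # For example, the command is not installed on this host.
        status = "error"
    return status, time.monotonic() - start


def log_line(status, elapsed):
    """Format one timestamped log entry for a probe result."""
    stamp = time.strftime("%Y-%m-%d %H:%M:%S")
    return f"{stamp} {status} {elapsed:.2f}s"
```

For example, print(log_line(*timed_probe(["mtlib", "-l", "/dev/lmcp0",
"-qL"]))) run once a minute from cron would leave a record of when
responses slow down or stop. The device name and query flag in that
command line are assumptions; substitute the mtlib invocation that works
on your host.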

The LM is not on a different subnet. It is on the same GIG-E private
subnet. No, the LM cannot be configured to anything faster than 10-Half
(per IBM), even though it is a simple 3C509 16-bit card and should be able
to handle it.

When the subnet traffic was analyzed (at the times the "timeouts" were
occurring), the main thing that popped up was the flood of broadcast
traffic.  I don't know what else was analyzed. However, when they put the
sniffer on the "line", they told me the broadcast traffic from the TSM
server was 6X any other traffic.
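
A figure like "6X" is easier to check if the capture can be reduced to
per-source broadcast counts. The sketch below assumes the sniffer output
has been exported as (source MAC, destination MAC) pairs; that export
format and both function names are hypothetical, not something the
networking folks necessarily used.

```python
from collections import Counter

BROADCAST_MAC = "ff:ff:ff:ff:ff:ff"


def broadcast_frames_by_source(frames):
    """Count broadcast frames per source address.

    frames: iterable of (src_mac, dst_mac) string pairs.
    """
    counts = Counter()
    for src, dst in frames:
        if dst.lower() == BROADCAST_MAC:
            counts[src.lower()] += 1
    return counts


def ratio_to_runner_up(counts):
    """Return (top_source, its count divided by the runner-up's count).

    Returns None if there were no broadcast frames at all, and an
    infinite ratio if only one source sent broadcasts.
    """
    ranked = counts.most_common(2)
    if not ranked:
        return None
    top_src, top_n = ranked[0]
    if len(ranked) == 1:
        return top_src, float("inf")
    return top_src, top_n / ranked[1][1]
```

Counting frames is only one reading of "6X"; counting bytes instead would
need the frame length in each record.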

As for your questions:

1.  No.   No.  No.
2.  Yes.
3.  Yes, mostly. I recently pushed ATAPE to the latest level, 8.4.8.0 -
More on this, later.
4.   Not sure/don't know.
5.   Yes. There is no firewall process on this subnet. Its main/only
purpose is for TSM backup traffic.
6.   Not that we can tell.  Sometimes for hours during the day, other
times in the middle of the night.
7.  Yes. Tested end-to-end. That was the first thing we checked.
8.  Yes. This is an old, non-VTS box. There hasn't been a
micro-code/firmware change/update for a long time.
9.  Haven't tried this.
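
For question 6, a quick way to test for a time-of-day pattern is to bucket
the failure timestamps by hour. This is only a sketch: it assumes the
times of the timeouts have already been pulled out of the logs as strings
in one known format, and the function names are made up for illustration.

```python
from collections import Counter
from datetime import datetime


def failures_by_hour(timestamps, fmt="%Y-%m-%d %H:%M:%S"):
    """Return a Counter mapping hour of day (0-23) to number of failures."""
    return Counter(datetime.strptime(ts, fmt).hour for ts in timestamps)


def format_histogram(by_hour):
    """Render one text line per hour, with one '#' per failure."""
    lines = []
    for hour in range(24):
        n = by_hour.get(hour, 0)
        lines.append(f"{hour:02d}:00 {'#' * n} {n}")
    return "\n".join(lines)
```

A flat histogram would support "not that we can tell"; a spike in the
backup window would point back at load on the subnet.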

To add to the confusion, since we updated the ATAPE driver to 8.4.8.0 (and
had to reconfigure the PATHs for ALL TSM servers that share the
library.........great fun !), I have not had any more problems.

I had my AIX guy look over the history of changes to ATAPE since the level
we were at, and he could not see anything obvious that would have
addressed our problem.  FWIW, we were at 8.3.6.0.

Still holding my breath, hoping this fixes it and stops causing
interrupted sleep !




"Thorson, Paul" <Paul.Thorson AT MCKESSON DOT COM>
Sent by: "ADSM: Dist Stor Manager" <ADSM-L AT VM.MARIST DOT EDU>
06/13/2004 10:47 PM
Please respond to
"ADSM: Dist Stor Manager" <ADSM-L AT VM.MARIST DOT EDU>


To
ADSM-L AT VM.MARIST DOT EDU
cc

Subject
Re: Possibly OT: How to diagnose 3494 ATL "communications" failures






Hello Zoltan,

I'm a bit confused.  I thought the problem you're having is the 3494 LM
"hangs" up, and ceases to communicate with the TSM server until the port
is re-initialized.  If the network folks have put a sniffer on the GIG-E
backup network, I would not be surprised the TSM server is the "mass
communicator", unless you're talking about "abnormal broadcast" traffic.
It should be, since it returns client query info during the backups.  But
is this causing your LM to hang, especially considering the LM might be on
a different subnet with a 10Mbit half-duplex connection?  I would guess
that perhaps the client which is outside that private network might also
have a high number of files, or perhaps is backing up at the same time
there are other clients with lots of files, thus increasing the TSM
management thread traffic.  Did the network folks correlate the packet
sizes along with the number of packets?

As for the original problem, you're using an O/S, application software,
and a hardware combination that is very common, so it seems that software
is off the hook.  I think we'd need to know more information about your
environment to provide the best course of trouble-shooting.  For instance,

1.  Is there any other server, application, or hardware device having
problems with a TCP port getting hung up?  Is the network traffic to the
clients across the same NIC as the communication with the 3494?  Have
there been any recent changes to the network that correlate with the
problem?

2.  Is the AIX TSM server the only LAN host accessing the 3494?  Can you
access the LM web interface (3494 Specialist) or use mtlib when it's hung
up?

3.  Does your AIX server have recent maintenance levels applied?

4.  Has it been verified that no other device on the network is attempting
to use the same IP address as the 3494 LM?  I believe Wanda P. suggested
looking into that.  The network folks should be able to find it.

5.  Per Richard's suggestion, have you eliminated firewall changes as the
source of the problem?  Port 3494 needs to be open.  A simple ping will
not verify that, since ping uses ICMP rather than a TCP port.

6.  Is there any correlation between time of day and when the LM port
seems to hang?  Is there any correlation with the TSM activity log error
messages when communications are broken?  (e.g., always after a tape
mount, always after a dismount, etc.)  Does the /etc/ibmatl.conf file use
a hard-coded TCP/IP address or a DNS entry for the LM?

7.  Has the network cable from the 3494 to the switch been
examined/swapped?  Is that switch logging any errors?  Has another port
been tried?

8.  Has the IBM CE verified the level of LM code is recent?  In our
environment, we're running 527.21 and 528.09 with no problems, but there
are problems with other levels.

9.  Can you reproduce the hang using mtlib queries, or is it only when TSM
accesses the 3494 that the problem occurs?
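
On question 5: because a ping says nothing about port 3494, a TCP-level
check is a more direct test of whether anything between the host and the
LM is blocking it. A minimal sketch, assuming the LM accepts ordinary TCP
connections on that port; tcp_port_open is a hypothetical helper name.

```python
import socket


def tcp_port_open(host, port=3494, timeout_s=5.0):
    """Return True if a TCP connection to host:port can be established."""
    try:
        with socket.create_connection((host, port), timeout=timeout_s):
            return True
    except OSError:
        # Covers refused connections, timeouts and unreachable hosts.
        return False
```

A True result only shows that the port accepts connections. It does not
show that the LM is answering library requests, which is the failure
described in this thread.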

Finally, if the 3494's LAN port needs to be re-initialized to re-establish
communications, and that's the only problem you're having, I would attempt
to escalate the issue with IBM hardware support.  The 3494 is supposed to
be high availability, and they should provide you with detailed
trouble-shooting procedures.  I've never heard of anyone running trace
files on the LM before, but I would guess it could be done.

Good luck and regards,

- Paul


-----Original Message-----
From: Zoltan Forray/AC/VCU [mailto:zforray AT VCU DOT EDU]
Sent: Friday, June 11, 2004 10:41 AM
To: ADSM-L AT VM.MARIST DOT EDU
Subject: Re: Possibly OT: How to diagnose 3494 ATL "communications"
failures


Well, after further analysis, this topic has had some strange twists.

Per some suggestions, I had IBM replace the NIC in the LM. This had
absolutely no effect on the problem.

I have been discussing this issue with our networking folks. Their initial
review showed massive amounts of BROADCAST traffic on this subnet, which
is a private, GIG-E internal network, with connections to the outside
world, primarily for TSM backup traffic.

I updated ATAPE, which was a bit behind. I could not see anything in the
history of changes, that would address this kind of situation.

Now, my networking folks have put a sniffer on this private subnet.

Imagine my surprise when the biggest source of broadcast traffic turned
out to be the TSM AIX server itself !!!!!

They also said that the peak in broadcast traffic correlated to backup
traffic from a box that is outside the private subnet (i.e. does not have
a direct connection to the same switch).

Anyone have any suggestions on why the TSM server would be doing this ?
This is an AIX 5.1 TSM 5.2.1.3 system, that is *EXCLUSIVELY* used for TSM
backups.

Could this have anything to do with the HLADDRESS parms on the node
definitions ?   Possibly a bug that has been fixed in a later release of
the TSM server ?

*NOTHING* has changed on the AIX system for the past 6 months, when it
comes to the AIX and TSM server software itself !  The last upgrade was
from AIX 4.3.3 to 5.1, with the TSM server upgraded at the same time.




Richard Sims <rbs AT BU DOT EDU>
Sent by: "ADSM: Dist Stor Manager" <ADSM-L AT VM.MARIST DOT EDU>
05/24/2004 07:41 PM
Please respond to
"ADSM: Dist Stor Manager" <ADSM-L AT VM.MARIST DOT EDU>


To
ADSM-L AT VM.MARIST DOT EDU
cc

Subject
Re: Possibly OT: How to diagnose 3494 ATL "communications" failures






>This hardware/system has been in place for years, without change.
>
>My bets are on a problem with the LM itself.
>
>This weekend, the connection died, again. No non-disruptive attempts to
>reestablish the connection with the LM worked. Yes, both boxes could PING
>each other.  As I told IBM, this is *NOT* a connectivity issue, in the
>lan/network sense. This is a "the LM is not responding as an LM".
>
>The only way we got it to work was to reinitialize the LAN ports from the
>LM/ATL.

Well, there has been change: it's gotten older!  (I've become an expert on
the subject.)  You could be seeing the effects of a deteriorating network
card or the like...which could be aggravated if the library is not on UPS
or power that is otherwise conditioned.

Watch out also if the library is not behind a firewall.  I worry about
these "embedded system" computers in that they rarely see any updates, and
yet we know that "holes" in operating systems are periodically found.  In
the right network circumstances, odd behavior from such a system may be
the result of someone trying to hack the box.

    me again,  Richard Sims
