ADSM-L

Re: Possibly OT: How to diagnose 3494 ATL "communications" failures

2004-06-13 22:48:40
Subject: Re: Possibly OT: How to diagnose 3494 ATL "communications" failures
From: "Thorson, Paul" <Paul.Thorson AT MCKESSON DOT COM>
To: ADSM-L AT VM.MARIST DOT EDU
Date: Sun, 13 Jun 2004 22:47:59 -0400
Hello Zoltan,

I'm a bit confused.  I thought the problem you're having is that the 3494 LM
"hangs" and stops communicating with the TSM server until the port is
re-initialized.  If the network folks have put a sniffer on the GIG-E backup
network, I would not be surprised that the TSM server is the "mass
communicator"; it should be, since it returns client query info during the
backups.  That is, unless you're talking about "abnormal broadcast" traffic.
But is this causing your LM to hang, especially considering the LM might be
on a different subnet with a 10Mbit half-duplex connection?  I would guess
that the client outside that private network might also have a high number
of files, or is backing up at the same time as other clients with lots of
files, thus increasing the TSM management thread traffic.  Did the network
folks correlate the packet sizes with the number of packets?
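
One way to frame the packet-size question for the network folks is to tally
packets and bytes per source, split into broadcast and unicast.  What follows
is only a minimal sketch: it assumes the sniffer capture can be exported as
(source, destination, size) records, and the record layout, the function name
summarize, and the sample labels are all hypothetical rather than the output
of any particular sniffer.

```python
from collections import defaultdict

# Ethernet broadcast destination address.
BROADCAST_MAC = "ff:ff:ff:ff:ff:ff"


def summarize(packets):
    """Tally packet count and bytes per source, split broadcast/unicast.

    packets: iterable of (src, dst, size_bytes) tuples.  This layout is an
    assumed export format, not something a specific sniffer produces.
    """
    totals = defaultdict(lambda: {
        "bcast_pkts": 0, "bcast_bytes": 0,
        "ucast_pkts": 0, "ucast_bytes": 0,
    })
    for src, dst, size in packets:
        kind = "bcast" if dst.lower() == BROADCAST_MAC else "ucast"
        totals[src][kind + "_pkts"] += 1
        totals[src][kind + "_bytes"] += size
    return dict(totals)
```

A source with a high broadcast packet count but a small broadcast byte total
is sending many small frames rather than bulk data, which is a different
situation from heavy backup payload traffic.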

As for the original problem, you're using an O/S, application software, and
a hardware combination that is very common, so it seems that software is off
the hook.  I think we'd need to know more information about your environment
to provide the best course of trouble-shooting.  For instance,

1.  Is there any other server, application, or hardware device having
problems with a TCP port getting hung up?  Is the network traffic to the
clients across the same NIC as the communication with the 3494?  Have there
been any recent changes to the network that correlate with the problem?

2.  Is the AIX TSM server the only LAN host accessing the 3494?  Can you
access the LM web interface (3494 Specialist) or use mtlib when it's hung
up?

3.  Does your AIX server have recent maintenance levels applied?

4.  Has it been verified that no other device on the network is attempting
to use the same IP address as the 3494 LM?  I believe Wanda P. suggested
looking into that.  The network folks should be able to find it.

5.  Per Richard's suggestion, have you eliminated firewall changes as the
source of the problem?  Port 3494 needs to be open.  A simple ping will not
verify that, since ping uses ICMP and does not exercise any TCP port.

6.  Is there any correlation between time of day and when the LM port seems
to hang?  Is there any correlation with the TSM activity log error messages
when communications are broken?  (E.g., always after a tape mount, always
after a dismount, etc.)  Does the /etc/ibmatl.conf file use a hard-coded
TCP/IP address or a DNS entry for the LM?

7.  Has the network cable from the 3494 to the switch been examined/swapped?
Is that switch logging any errors?  Has another port been tried?

8.  Has the IBM CE verified the level of LM code is recent?  In our
environment, we're running 527.21 and 528.09 with no problems, but there are
problems with other levels.

9.  Can you reproduce the hang using mtlib queries, or is it only when TSM
accesses the 3494 that the problem occurs?
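
For item 5, a TCP connection attempt tests what ping cannot.  The following
is a minimal sketch using only the Python standard library; the function name
tcp_port_open and the host name in the usage note are placeholders, not
anything from your environment.

```python
import socket


def tcp_port_open(host, port=3494, timeout=5.0):
    """Return True if a TCP connection to host:port can be established."""
    try:
        # create_connection resolves the host and completes the TCP handshake.
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers refused connections, timeouts, and name-resolution failures.
        return False
```

Calling tcp_port_open("lm.example.com") from the TSM server (substituting the
real LM address) shows whether the path to port 3494 is open.  A successful
connect only proves the port accepts connections; it does not prove the LM is
answering library commands, so mtlib queries remain the better test of that.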

Finally, if the 3494's LAN port needs to be re-initialized to re-establish
communications, and that's the only problem you're having, I would attempt
to escalate the issue with IBM hardware support.  The 3494 is supposed to be
high availability, and they should provide you with detailed
trouble-shooting procedures.  I've never heard of anyone running trace files
on the LM before, but I would guess it could be done.

Good luck and regards,

- Paul


-----Original Message-----
From: Zoltan Forray/AC/VCU [mailto:zforray AT VCU DOT EDU]
Sent: Friday, June 11, 2004 10:41 AM
To: ADSM-L AT VM.MARIST DOT EDU
Subject: Re: Possibly OT: How to diagnose 3494 ATL "communications" failures


Well, after further analysis, this topic has had some strange twists.

Per some suggestions, I had IBM replace the NIC in the LM.  This had
absolutely no effect on the problem.

I have been discussing this issue with our networking folks. Their initial
review showed massive amounts of BROADCAST traffic on this subnet, which
is a private, GIG-E internal network, with connections to the outside
world, primarily for TSM backups traffic.

I updated ATAPE, which was a bit behind. I could not see anything in its
change history that would address this kind of situation.

Now, my networking folks have put a sniffer on this private subnet.

Imagine my surprise when the biggest source of broadcast traffic turned out
to be the TSM AIX server itself!

They also said that the peak in broadcast traffic correlated to backup
traffic from a box that is outside the private subnet (i.e. does not have
a direct connection to the same switch).

Anyone have any suggestions on why the TSM server would be doing this ?
This is an AIX 5.1 TSM 5.2.1.3 system, that is *EXCLUSIVELY* used for TSM
backups.

Could this have anything to do with the HLADDRESS parms on the node
definitions?  Possibly a bug that has been fixed in a later release of
the TSM server?

*NOTHING* has changed on the AIX system for the past 6 months when it
comes to the AIX and TSM server software itself!  The last upgrade was
from AIX 4.3.3 to 5.1, with the TSM server upgraded at the same time.




Richard Sims <rbs AT BU DOT EDU>
Sent by: "ADSM: Dist Stor Manager" <ADSM-L AT VM.MARIST DOT EDU>
05/24/2004 07:41 PM
Please respond to "ADSM: Dist Stor Manager" <ADSM-L AT VM.MARIST DOT EDU>
To: ADSM-L AT VM.MARIST DOT EDU
cc:
Subject: Re: Possibly OT: How to diagnose 3494 ATL "communications" failures

>This hardware/system has been in place for years, without change.
>
>My bets are on a problem with the LM itself.
>
>This weekend, the connection died again. No non-disruptive attempts to
>reestablish the connection with the LM worked. Yes, both boxes could PING
>each other.  As I told IBM, this is *NOT* a connectivity issue in the
>lan/network sense. This is a "the LM is not responding as an LM".
>
>The only way we got it to work was to reinitialize the LAN ports from the
>LM/ATL.

Well, there has been change: it's gotten older!  (I've become an expert on
the subject.)  You could be seeing the effects of a deteriorating network
card or the like...which could be aggravated if the library is not on UPS
or power that is otherwise conditioned.

Watch out also if the library is not behind a firewall.  I worry about these
"embedded system" computers in that they rarely see any updates, and yet we
know that "holes" in operating systems are periodically found.  In the right
network circumstances, odd behavior from such a system may be the result of
someone trying to hack the box.

    me again,  Richard Sims
