[Veritas-bu] LTO drives not visible to a SSO host after a tape drive power cycle

Brocade told us that the defect does not occur in the 2.X.X code.  

We are currently on v2.6.0e.  The situation is also dependent on the
complexity of the SAN.

Our HBAs for tape backup are all JNI 6460.

We have had 4-5 custom patches applied to JNI's Solaris 5.1.1 driver (taken
from the upcoming 5.1.3.b.2 driver that is now in QA, which may end up being
released as 5.2; I am not sure).

1. (Adding lun on the fly feature)
2. (bad mutex panic)
3. (bad trap panic)
4. (FC Tape related panic on command timeout)
5. (Fixed continuous panics on reboot)


JNI did not have a multiport 2 Gb analyzer to look at the HBA, drive and ISL
at the same time, so JNI could not verify Brocade's claims.  All we could see
is that the HBA sends out the ADISC.  The drive responds with the LOGO, which
the HBA never sees.
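
To make the sequence concrete, here is a toy Python model of what we think is
going on.  It is only a sketch: the function name, the retry cap and the
assumption that the host follows a LOGO with a fresh PLOGI are mine, not
taken from Brocade's or JNI's traces.

```python
# Toy model only; a real FC exchange carries far more state than this.
def discover(switch_forwards_logo, max_adisc=5):
    """Return the frames the host sends while trying to reach the drive.

    max_adisc is a hypothetical cap so the example terminates; in the
    real failure the host keeps retrying ADISC indefinitely.
    """
    sent = []
    for _ in range(max_adisc):
        sent.append("ADISC")
        # The power-cycled drive has lost the old login and answers with
        # LOGO. None models the switch dropping that frame.
        reply = "LOGO" if switch_forwards_logo else None
        if reply == "LOGO":
            # Assumption: on seeing the LOGO the host logs in again.
            sent.append("PLOGI")
            return sent
        # No reply seen: the host times out and retries the ADISC.
    return sent


print(discover(True))   # ['ADISC', 'PLOGI']
print(discover(False))  # ['ADISC', 'ADISC', 'ADISC', 'ADISC', 'ADISC']
```

With the LOGO forwarded the host recovers after one round trip; with it
dropped the host never gets past ADISC, which matches what we see.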

One thing that you may find interesting about these older Brocade switches
is this note from IBM.

Brocade provided a fix in their 2.4.1e code, known as the "ACK" fix (in case
you want to look at the release notes.)  Since the switch is no longer
prematurely killing off communication to the processor (i960) there is
nothing preventing the processor from being overloaded.  In the cases that I
have seen, the "overloading" seems to come from spurious traffic on the
ethernet segment.  The reasons the ethernet port is an issue are the
following:
1.)  The ethernet port has the same priority or weight as the FC ports.
2.)  There was no shut-off valve for the ethernet port in times of stress.
3.)  The changes placed into 2.6, such as having a threshold for ethernet
traffic and then shutting off for 5min at a time, were not set low enough.
4.)  The ethernet port/chip runs in promiscuous mode, meaning every frame /
packet that hits the ethernet port has to be completely decoded.  The F16 or
2 Gig switches, not directors, use the same ethernet port / chip.  However,
in those switches the i960 is a 100 MHz processor as opposed to the 25 MHz
processor in the 1 Gig version.  Additionally, the VxWorks code has
streamlined the TCP stack to help mitigate / alleviate the issue.  I just
confirmed that the ethernet port / chip in the M12 / 12000 does NOT operate
in promiscuous mode.  Also, on the M12 they do NOT use the i960 and they have
switched from VxWorks to Linux.
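
As a rough illustration of point 3, the following Python sketch models a
threshold-plus-shut-off valve.  Only the 5 minute shut-off comes from the
note above; the class, the function and the frame rates are hypothetical and
do not reflect how the 2.6 firmware actually implements it.

```python
SHUTOFF_SECONDS = 5 * 60  # "shutting off for 5min at a time"


class EthernetValve:
    """Hypothetical shut-off valve for the management ethernet port."""

    def __init__(self, threshold_fps):
        self.threshold_fps = threshold_fps  # made-up value, not Brocade's
        self.closed_until = 0.0

    def accept(self, now, frames_last_second):
        """Return True if the port should hand traffic to the CPU at `now`."""
        if now < self.closed_until:
            return False  # still inside a shut-off window
        if frames_last_second > self.threshold_fps:
            self.closed_until = now + SHUTOFF_SECONDS
            return False  # threshold tripped, start a shut-off window
        return True


def frames_decoded(valve, load):
    """Sum the frames that reach the CPU; load is [(second, frames), ...]."""
    return sum(frames for t, frames in load if valve.accept(t, frames))


# Ten seconds of 800 frames/s of spurious LAN traffic.
load = [(t, 800) for t in range(10)]
print(frames_decoded(EthernetValve(1000), load))  # 8000
print(frames_decoded(EthernetValve(500), load))   # 0
```

With the threshold above the offered load the valve never closes and every
frame still lands on the processor, which is the "not set low enough" case.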

      We, SAN Central, have seen a direct correlation of switch hangs /
reboots to ethernet port attachment to production ethernet LANs.  In EVERY
case isolating the ethernet segment from production traffic through either a
standalone ethernet switch or firewall corrected the problem.

      There are additional "fixes" in 2.6.0X code to help prevent these
hangs, reboots, and communication issues.  However, this code level has not
had a great deal of success in actually fixing the problems.


Additional Technical Detail

Brocade Fibre Channel switches have either three (2109-S08) or five
(2109-S16) active processing components.  All but one of these are the
Brocade-developed integrated circuits (ASICs) that are responsible for the
handling of the Fibre Channel frames.  The remaining component is an Intel
processor (i960) that is responsible for the administrative functions of the
switch.
Such functions include:
      - Telnet and Web administration
      - SNMP and syslog services
      - Fibre channel services
            = FSPF routing calculations
            = Nameserver functions
            = Processing of Fabric login requests
            = Zoning (soft / WWN zoning is more intensive because it is
              handled in software, as opposed to hardware)
      - All TCP/IP processing over the ethernet port.

(Brocade's stance) This processor is referred to as the HOST processor of
the switch.  Normally this processor is essentially idle and has sufficient
processing capability to handle the needs of the switch.  However, if an
unusual load is placed on the HOST processor, then switch activities such as
those listed above can be delayed or even stopped.  Even when these
activities are impacted, the forwarding of Fibre Channel frames for
ESTABLISHED connections should not be impacted, as that function is
performed by the integrated circuits mentioned previously.

I have personally seen, on several occasions, that when problems occur with
the i960 processor, command timeouts and other traffic problems can occur.
Traffic on established connections will not ALWAYS be adversely affected;
this depends greatly on load (I/O).


Brian

-----Original Message-----
From: Robert Johannes [mailto:robert_johannes AT udlp DOT com] 
Sent: Monday, October 14, 2002 12:52
To: Brian Boone-TM
Cc: 'veritas-bu AT mailman.eng.auburn DOT edu'
Subject: Re: [Veritas-bu] LTO drives not visible to a SSO host after a tape
drive power cycle

We have the Brocade SilkWorm 2800 in our environment just for SSO stuff;
would anyone by chance know if upgrading to firmware v2.6.0c breaks
things, as in Brian's experience below?  We are currently at v2.4.1c.

robert

Brian Boone-TM wrote:
> 
> Not sure if anyone has seen this one, but it has been plaguing us for
> quite some time.
> 
> In our SAN dedicated to tape backups we have our SSO host HBAs on a
> Brocade 12000 domain.  This is ISLd to a Brocade 6400 where we have our
> FC-AL IBM 3580 LTO drives.
> 
> What we have observed, is that after a drive is power cycled (during
> maintenance, firmware upgrades, library reboot) Solaris can no longer see
> the LTO drive(s).  If NetBackup touches these devices the Media Management
> processes hang until the device comes back.
> 
> Brocade has identified this as their problem.
> 
> After looking at the log files, trace files and the details, Brocade has
> determined that there is a defect within the v4.0.0x code stream which
> matches a newly discovered known issue.
> 
> This issue is related to the way in which frames are handled as a result
> of the Brocade Frame Filtering technology.  In essence, the scenario is as
> follows:
> 
> A host performs ADISC (discovery) to the target before doing a PLOGI (Port
> Login). In your scenario, the target (tape drive) is responding to the
> ADISC with an unsolicited LOGO (different OXID). (This is the correct and
> expected behaviour.) The 12000 switch using version 4.0.0x is not handling
> these LOGOs appropriately. Basically, it is dropping them. This results in
> the host retrying with ADISC again. This goes on forever, causing the hosts
> to lose the devices.  The fix is to not drop these LOGOs and ensure they
> are forwarded to the host correctly.
> 
> Brocade has identified the problem and has created a fix for this defect.
> The current patch fix is found in version v4.0.2_rc1.6.  This will be
> rolled into v4.0.2a which is in the process of being released.  The date
> for v4.0.2a is forthcoming.
> 
> In the interim, they are creating a v4.0.0x code stream fix.  The method
> that we have had success with in restoring visibility is to
> portdisable/portenable on the port that the HBA is using.  This USUALLY
> brings back all of the fubar'd targets.
> 
> Hope this helps somebody.
> 
> Brian Boone
> Storage Area Network Specialist
> Systems Operations
> TELUS Mobility
> Brian.Boone AT telus DOT com
> 
> _______________________________________________
> Veritas-bu maillist  -  Veritas-bu AT mailman.eng.auburn DOT edu
> http://mailman.eng.auburn.edu/mailman/listinfo/veritas-bu