Veritas-bu

[Veritas-bu] Re: SCSI reset errors and downed drives

2001-09-06 23:51:48
Subject: [Veritas-bu] Re: SCSI reset errors and downed drives
From: gyurchak AT veritas DOT com (Gregg Yurchak)
Date: Thu, 6 Sep 2001 23:51:48 -0400
I believe NT always takes the lowest WWN it sees on the switch it is
plugged into and makes it target 0, then the next highest becomes target 1,
and so on until it exhausts that switch.  It then moves on to the next
switch and keeps incrementing targets, again starting with the lowest WWN.
You shouldn't see WWNs bouncing between targets the way you often do on
Unix.  Unix, on the other hand, doesn't do that nasty reconfiguration every
time it boots.
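
As a rough illustration of the ordering described above (a sketch of the
stated belief, not verified NT behavior), the assignment can be modeled as
sorting the WWNs on each switch and handing out target numbers in sequence.
The function name and the WWN values are made up for the example.

```python
def assign_targets(switches):
    """Assign SCSI targets switch by switch, lowest WWN first.

    `switches` is a list of lists of WWN strings (hypothetical values).
    The target counter keeps incrementing across switches.
    """
    targets = {}
    next_target = 0
    for wwns in switches:
        for wwn in sorted(wwns):
            targets[wwn] = next_target
            next_target += 1
    return targets


# Made-up WWNs; the order in which they are listed should not matter.
switch_a = ["50:00:00:00:00:00:00:0b", "50:00:00:00:00:00:00:0a"]
switch_b = ["50:00:00:00:00:00:00:0c"]

first_boot = assign_targets([switch_a, switch_b])
second_boot = assign_targets([list(reversed(switch_a)), switch_b])
print(first_boot == second_boot)
```

Under this model the same set of WWNs always yields the same targets, which
is why the targets would not bounce as long as no device is missing.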

But in the scenario you describe below (a drive is missing and
reconfiguration happens), persistent binding still would not help you in the
Unix world.  Yes, all the old WWNs would still be mapped to the same target
combos, but a /dev/rmt entry still won't get created for that device if it's
not physically there, and everything above is going to shift down and screw
up your drive mappings.

If NT didn't reconfigure on reboot, the series of events necessary for this
to happen would be most unlikely.  Unless there's a really strong reason for
them to do so, the feature needs to be taken out.

Thanks,
Gregg Yurchak
VERITAS Consultant
Biloxi, MS
gregg AT veritas DOT com
Office: 1.228.822.9810
Cell:    1.228.324.6939




-----Original Message-----
From: scott.kendall AT abbott DOT com [mailto:scott.kendall AT abbott DOT com]
Sent: Thursday, September 06, 2001 3:37 PM
To: anthony.guzzi AT storability DOT com
Cc: veritas-bu AT mailman.eng.auburn DOT edu
Subject: Re: [Veritas-bu] Re: SCSI reset errors and downed drives



Great information... I have just one question.  What about NT/2000?

I too have had the "DOWN" drive problem when a server is rebooted and for
whatever reason doesn't see the drives in the correct order or is missing
some of the drives.

NT/2000 uses "tape numbers" when configuring drives.  Even if all of the
SCSI targets and LUNs for the drives are the same, if a drive is missing the
rest will shift down and still confuse things.

Example:
drive 0 is always target 1, LUN 0
drive 1 is always target 1, LUN 1
drive 2 is always target 1, LUN 2

drive 0 is Tape0 (or \\.\Tape0) within NT
drive 1 is Tape1 within NT
drive 2 is Tape2 within NT

If we reboot and for some reason drive 1 is not there, drive 2 will still be
target 1, LUN 2... but this won't help because it is now Tape1 instead of
Tape2.
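
The shift in the example above can be sketched in a few lines. This is only
a model that assumes the OS numbers tape devices sequentially in the order
it finds them; the function name is hypothetical and nothing here calls a
real NT API.

```python
def assign_tape_numbers(present_drives):
    """Map each detected (target, lun) pair to a sequential TapeN name."""
    return {addr: f"Tape{n}" for n, addr in enumerate(sorted(present_drives))}


all_drives = [(1, 0), (1, 1), (1, 2)]

# Normal boot: all three drives are found.
before = assign_tape_numbers(all_drives)

# Reboot where drive 1 (target 1, LUN 1) is not detected.
after = assign_tape_numbers([d for d in all_drives if d != (1, 1)])

print(before[(1, 2)])  # Tape2
print(after[(1, 2)])   # Tape1
```

The SCSI address of drive 2 never changes, yet its tape number does, so a
configuration keyed on the tape number now points at the wrong drive.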

The only way I see around this is for NT/2000 to be able to persistently
bind their "tape numbers" to a specific SCSI target/LUN (and I don't know
if it can do this)

OR

to have NetBackup on NT configured for the actual SCSI bus/port/target/LUN
of the drive instead of the tape number

OR

to have NetBackup look at the serial number of the drive and do everything
dynamically (the reason this comes to mind is that there is a similar
problem with Oracle on NT and raw partitions, which are required for OPS.
Since the partitions are raw you don't have drive letters and have to map
things to a disk number within NT.  What Oracle did was use a symbolic link
that maps itself to a specific disk partition.  When the disk number
changes, the symbolic link dynamically maps to the new disk number.  I
believe they are doing this by looking at the signature that NT writes on
the disk.)  This seems like the most flexible, but not available today...
maybe a future release of SSO.
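
To make the third option concrete, here is a minimal sketch of resolving a
drive by serial number at startup instead of by tape number. It is purely
illustrative: the inventory dictionaries, serial numbers, and function name
are assumptions, and this is not something NetBackup is known to do.

```python
def resolve_device(configured_serial, detected):
    """Return the device name currently reporting `configured_serial`.

    `detected` maps device names to the serial number each drive reports
    (a hypothetical inventory gathered at boot). Returns None if the
    drive is not present.
    """
    for device, serial in detected.items():
        if serial == configured_serial:
            return device
    return None


# Made-up inventories before and after a reboot that loses one drive.
before = {"Tape0": "SN-A", "Tape1": "SN-B", "Tape2": "SN-C"}
after = {"Tape0": "SN-A", "Tape1": "SN-C"}

print(resolve_device("SN-C", before))  # Tape2
print(resolve_device("SN-C", after))   # Tape1
print(resolve_device("SN-B", after))   # None
```

Because the configuration stores the serial number, the drive is still
found after its tape number changes, and only the missing drive would need
to be downed.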


Thanks,
Scott



 

From:    anthony.guzzi AT storability DOT com
Sent by: veritas-bu-admin AT mailman.eng.auburn DOT edu
Date:    09/05/2001 12:30 PM
To:      dayalsd AT lycos DOT com, <veritas-bu AT mailman.eng.auburn DOT edu>
cc:
Subject: [Veritas-bu] Re: SCSI reset errors and downed drives

I've got one phrase for you:     persistent binding

I'm wondering if your bridges are being "discovered"/recognized in a
different order than when NBU was installed.  If you are using a fabric,
then remember that for most OSes, unless you specify otherwise, the first
fibre device found by the OS will be assigned target 0 off the HBA, the
second one will get target 1, etc.  Keep in mind that there's no guarantee
the devices will be found in the same order each time.  And should a fibre
switch reboot, you run the risk (though very slim) that the targets may
change.  But with persistent binding, you'll be binding each fibre
device's world-wide name (WWN) to a specific SCSI target off the HBA.  That
way, no matter what order the system sees the devices in, they'll always
get the same SCSI target.
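
The difference between the two behaviors can be sketched as follows. This
is a simplified model, not any vendor's driver logic: the function names,
the binding table, and the WWN values are all made up for illustration.

```python
def assign_by_discovery(discovered_wwns):
    """Default behavior: targets follow the order devices are found."""
    return {wwn: target for target, wwn in enumerate(discovered_wwns)}


def assign_by_binding(discovered_wwns, binding_table):
    """Persistent binding: each WWN gets the target recorded for it."""
    return {wwn: binding_table[wwn]
            for wwn in discovered_wwns if wwn in binding_table}


# Hypothetical WWNs and a hypothetical binding table.
a, b, c = "wwn-0a", "wwn-0b", "wwn-0c"
bindings = {a: 0, b: 1, c: 2}

boot1 = [a, b, c]  # discovery order on one boot
boot2 = [c, a, b]  # a different discovery order on the next boot

print(assign_by_discovery(boot1) == assign_by_discovery(boot2))
print(assign_by_binding(boot1, bindings) == assign_by_binding(boot2, bindings))
```

With discovery-order assignment the two boots disagree about which WWN is
which target; with the binding table they agree regardless of order.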

I recently had to work on an L-700 with 12 fibre-native STK 9840 tape
drives.  Each drive was connected directly to a Brocade switch, as was the
master server.  Every time the server rebooted, we would get downed
drives.  This was a result of some of the tape drives being
'discovered'/recognized by the system in a different order than when NBU
was set up.  As such, they were being given different SCSI target numbers.
This 're-arrangement' of the tape drives really messed up NBU.  The end
result was that the master server was instructing the robot to put a tape
in one drive and then accessing [via the SCSI target] another drive.  As
would be expected, it failed to see the tape and so downed the drive.

Check your fibre HBA vendor's documentation for instructions on how to
enable persistent binding for the HBA's driver under HP-UX (the procedure
differs by vendor, driver, and OS).

-- Tony Guzzi
Sr. Solutions Engineer, AssuredRestore team
Storability, Inc.






To: veritas-bu AT mailman.eng.auburn DOT edu
Date: Wed, 05 Sep 2001 08:36:07 -0500
From: "dayal singh" <dayalsd AT lycos DOT com>
Reply-To: dayalsd AT lycos DOT com
Organization: Lycos Mail  (http://mail.lycos.com:80)
Subject: [Veritas-bu] SCSI reset errors and downed drives

NBU GURUs,

We are continuously experiencing SCSI reset errors, and some of the drives
are being downed.  Most of the time it happens on specific drives;
sometimes it affects other drives also.  I am running NBU DataCenter 3.4 on
HP-UX 11.0 on an N-class machine, and I have twenty Quantum DLT8000 drives
connected through fiber to a SureStore L700 tape library over HP fiber-SCSI
bridges.  The bridges have a firmware of 4040.

Has anyone seen these errors?  Are there any fixes, i.e. patches, etc.?

Your response is greatly appreciated.

TIA

Dayal




_______________________________________________
Veritas-bu maillist  -  Veritas-bu AT mailman.eng.auburn DOT edu
http://mailman.eng.auburn.edu/mailman/listinfo/veritas-bu




