[Networker] Question about fibre channel on a Solaris box
2005-11-02 10:45:50
We have a Sony PetaSite that's connected to a Sunfire v480 via fibre.
The tape library has 13 S-AIT tape drives, each connected to a Qlogic
fibre channel switch (or bridge). The Qlogic switch is connected to
the v480 via a Qlogic fibre card in the v480 and the Qlogic drivers.
We use this as our NetWorker server with Solaris 9 and NetWorker 7.2.1.
For the most part, this system works great! We back up a few
terabytes each night for about 220 clients with more clients being
added all the time. We also have successfully recovered several
terabytes worth of data since we deployed this hardware.
Unfortunately, I have a knack for finding obscure bugs in software
and it seems as if our NetWorker server is no exception. I found two
obscure bugs last year that both caused nsrd with 7.1.3 to crash
about once a month. Fortunately, that problem seems to have gone away
after we upgraded to 7.2.1.
Unfortunately, a combination of a bad choice in the way I initially
configured our PetaSite for tape cleaning, a misunderstanding about
how many cleanings can be done with S-AIT cleaning tapes, and
extremely heavy use of our PetaSite has uncovered a bug in the tape
drives' firmware: once in a while, a tape drive in our PetaSite goes
offline while it's trying to locate a mark on a tape. After working
with people at Sony, we now know why this problem happens. Sony is
currently testing an updated tape drive firmware version.
I mention this because each time an S-AIT drive goes offline (i.e.,
loses communication with our PetaSite's controller unit), it also
loses communication with our NetWorker server. To resolve this issue,
I shut down all the NetWorker daemons, power cycle the failed tape
drive, eject the tape that is in the tape drive, reboot our NetWorker
server, and then reset the library from within NetWorker. This process
typically requires an hour of my time, including a trip across campus
to where our tape library is located. It's a pain in the neck.
Fortunately, this situation rarely causes a significant delay in our
backup schedule, and I can sometimes wait a few days before I take
action to put the failed tape drive back online.
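For reference, the recovery procedure described above can be written
down as an ordered checklist. This is only a sketch: the Python names
(Step, recovery_steps, format_checklist) are made up for illustration,
and the commands attached to each step (/etc/init.d/networker stop,
init 6, nsrjb -H) are assumptions about how such a step is commonly
run on a Solaris NetWorker server, not something confirmed for this
site. The sketch only prints the checklist; it does not run anything.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class Step:
    description: str
    # Assumed command for the step; None marks a hands-on, manual step.
    command: Optional[str] = None


def recovery_steps(reboot_command: str = "init 6") -> List[Step]:
    """Return the drive recovery procedure in the order it is performed."""
    return [
        # Assumption: the stock init script is used to stop the daemons.
        Step("Shut down all NetWorker daemons", "/etc/init.d/networker stop"),
        Step("Power cycle the failed tape drive"),
        Step("Eject the tape left in the drive"),
        # Assumption: a plain Solaris reboot via init 6.
        Step("Reboot the NetWorker server", reboot_command),
        # Assumption: the library reset is done with nsrjb -H.
        Step("Reset the library from within NetWorker", "nsrjb -H"),
    ]


def format_checklist(steps: List[Step]) -> List[str]:
    """Render each step as a numbered line, tagging manual steps."""
    lines = []
    for number, step in enumerate(steps, start=1):
        suffix = f" [{step.command}]" if step.command else " [manual]"
        lines.append(f"{number}. {step.description}{suffix}")
    return lines


if __name__ == "__main__":
    # Dry run only: print the checklist, execute nothing.
    print("\n".join(format_checklist(recovery_steps())))
```

Two of the five steps are manual and require the trip to the library,
and the reboot is the step that takes the whole server out of service.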
Although I expect this problem to be resolved fairly soon, I am
wondering whether there is a better way to get NetWorker and Solaris
to reestablish access to a failed fibre channel tape drive after I
have power cycled it and it is again accessible to the PetaSite's
controller. The site engineer from Sony who works on our PetaSite and
the software support engineer I have been working with by phone and
email both say I should be able to get the drive back online without
rebooting our Solaris box, but I have no idea how. So far, the only
way I have found to get our v480 to talk to the device again is to
reboot, but that's often not possible because we tend to keep our
tape library busy 24x7.
Any suggestions on how to handle this better than what I do now will
be appreciated.
To sign off this list, send email to listserv AT listserv.temple DOT edu
and type "signoff networker" in the body of the email. Please write to
networker-request AT listserv.temple DOT edu if you have any problems
with this list. You can access the archives at
http://listserv.temple.edu/archives/networker.html or
via RSS at http://listserv.temple.edu/cgi-bin/wa?RSS&L=NETWORKER
- [Networker] Question about fibre channel on a Solaris box, Stan Horwitz