[Networker] Question about fibre channel on a Solaris box

2005-11-02 10:45:50
Subject: [Networker] Question about fibre channel on a Solaris box
From: Stan Horwitz <stan AT TEMPLE DOT EDU>
To: NETWORKER AT LISTSERV.TEMPLE DOT EDU
Date: Wed, 2 Nov 2005 10:42:26 -0500
We have a Sony PetaSite that's connected to a Sunfire v480 via fibre. The tape library has 13 S-AIT tape drives, each connected to a QLogic fibre channel switch (or bridge). The QLogic switch is connected to the v480 through a QLogic fibre card in the v480, using the QLogic drivers. We use this as our NetWorker server with Solaris 9 and NetWorker 7.2.1.

For the most part, this system works great! We back up a few terabytes each night for about 220 clients, with more clients being added all the time. We have also successfully recovered several terabytes' worth of data since we deployed this hardware. Unfortunately, I have a knack for finding obscure bugs in software, and our NetWorker server is no exception. Last year I found two obscure bugs that each caused nsrd in 7.1.3 to crash about once a month. Fortunately, that problem seems to have gone away after we upgraded to 7.2.1.

Unfortunately, I have also uncovered a bug in the tape drives' firmware: once in a while, a tape drive in our PetaSite will go offline while it's trying to locate a mark on a tape. Three things combined to expose it: a bad choice in the way I initially configured our PetaSite for tape cleaning, a misunderstanding about how many cleanings can be done with S-AIT cleaning tapes, and extremely heavy use of our PetaSite. After working with people at Sony, we now know why this problem happens. Sony is testing an updated tape drive firmware version now.

I mention this because each time an S-AIT drive goes offline (i.e., loses communication with our PetaSite's controller unit), it also loses communication with our NetWorker server. To resolve this issue, I shut down all the NetWorker daemons, power cycle the failed tape drive, eject the tape that is in the tape drive, reboot our NetWorker server, then reset the library from within NetWorker. This process typically requires an hour of my time, including a trip across campus to where our tape library is located. It's a pain in the neck. Fortunately, this situation rarely causes a significant delay in our backup schedule, and I can sometimes wait a few days before I take action to put the failed tape drive back online.
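For reference, the recovery procedure described above can be written down as an ordered checklist. This is only a minimal sketch: the `Step` class, the `RECOVERY_STEPS` list, and the `needs_site_visit` flag are hypothetical names made up for illustration, and which steps require the trip to the library is an assumption rather than something stated in this message. It deliberately runs no Solaris or NetWorker commands; it only prints the steps in order.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Step:
    """One manual step of the recovery procedure (hypothetical helper type)."""
    order: int
    action: str
    needs_site_visit: bool  # assumption: True if done at the tape library itself


# The five steps as listed in the message; the site-visit flags are guesses.
RECOVERY_STEPS = [
    Step(1, "Shut down all the NetWorker daemons", False),
    Step(2, "Power cycle the failed tape drive", True),
    Step(3, "Eject the tape that is in the tape drive", True),
    Step(4, "Reboot the NetWorker server", False),
    Step(5, "Reset the library from within NetWorker", False),
]


def format_checklist(steps):
    """Return the steps as printable lines, sorted by their order number."""
    lines = []
    for step in sorted(steps, key=lambda s: s.order):
        where = "at the library" if step.needs_site_visit else "from the server"
        lines.append(f"{step.order}. {step.action} ({where})")
    return lines


if __name__ == "__main__":
    print("\n".join(format_checklist(RECOVERY_STEPS)))
```

Step 4, the server reboot, is the one I would most like to drop from this list.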

Although I expect this problem to be resolved fairly soon, I am wondering whether there is a better way for me to get NetWorker and Solaris to reestablish access to a failed fibre channel tape drive after I have power cycled it and it is again accessible to the PetaSite's controller. The site engineer from Sony who works on our PetaSite and the software support engineer I have been working with by phone and email both say I should be able to get the drive back online without rebooting our Solaris box, but I have no idea how. So far, the only way I have found to get our v480 to talk to the device again is a reboot, and that's often not possible because we tend to keep our tape library busy 24x7.

Any suggestions on how to handle this better than what I do now will be appreciated.
