[Veritas-bu] drives being marked down/missing

Hi,

I note a couple of things:

1) It is almost impossible for more than one or so drives to go south at the
same time (at least according to Quantum and STL). So the more drives that
are down, the more the issue points to hardware between the drives and the
main unit. The biggest possibility is that, if you use DIFF SCSI, you have
some bent pins, or pins so badly bent that they short against each other.
That would produce the errors you report, especially if the problem were on
the robotic controller's SCSI channel. Diff uses all 68 pins, so even one
bad pin will cause errors.

2) I would check /var/adm/messages carefully for anything showing the
actual downing of the drives. In the same messages entry, or nearby, you
may find the actual physical error reported.
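
For what it's worth, here is a rough sketch of that search in Python. The
keyword list is only my guess at what might show up; the real wording depends
on your drivers and NetBackup version, so adjust it. It returns each matching
line plus a couple of lines either side, since the physical error is often
logged just before or after the drive is downed.

```python
import re

# Hypothetical keyword list: an assumption, not an official set of messages.
KEYWORDS = ["down", "scsi", "transport", "timeout", "reset"]


def find_drive_errors(lines, keywords=KEYWORDS, context=2):
    """Return (line_number, text) pairs for matching lines plus nearby lines."""
    pattern = re.compile("|".join(re.escape(k) for k in keywords), re.IGNORECASE)
    keep = set()
    for i, line in enumerate(lines):
        if pattern.search(line):
            first = max(0, i - context)
            last = min(len(lines), i + context + 1)
            keep.update(range(first, last))
    # Line numbers are 1-based so they match what an editor shows.
    return [(i + 1, lines[i].rstrip("\n")) for i in sorted(keep)]


# Usage, with the path from the advice above:
#   with open("/var/adm/messages", errors="replace") as fh:
#       for number, text in find_drive_errors(fh.readlines()):
#           print(f"{number}: {text}")
```

Widening `context` is the cheap way to catch an error that was logged a few
entries away from the "down" message.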

3) Can robtest move tapes and load and unload the drives? Can you tar or dd
to a drive?
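
If you would rather script the write test than run dd by hand, something
like the sketch below does roughly the same job: it writes a known pattern,
closes the device, reopens it and reads the pattern back. It assumes a
rewind-on-close device node (on Solaris, something like /dev/rmt/0; that name
is my assumption, use whatever your box shows) and a scratch tape already
loaded, because it overwrites whatever is on the tape. The block size and
block count are arbitrary.

```python
import os


def write_read_test(path, block_size=64 * 1024, blocks=16):
    """Write a known pattern to path, read it back, return True if it matches."""
    # block_size is assumed to be a multiple of 256.
    pattern = bytes(range(256)) * (block_size // 256)

    fd = os.open(path, os.O_WRONLY)
    try:
        for _ in range(blocks):
            if os.write(fd, pattern) != block_size:
                return False  # short write: the drive or path is unhappy
    finally:
        os.close(fd)  # on a rewind-on-close tape device this also rewinds

    fd = os.open(path, os.O_RDONLY)
    try:
        for _ in range(blocks):
            if os.read(fd, block_size) != pattern:
                return False  # data came back short or corrupted
    finally:
        os.close(fd)
    return True


# Usage (hypothetical device name):
#   print(write_read_test("/dev/rmt/0"))
```

An OSError raised from the open, write or read is just as interesting as a
False result, since the errno usually says something about where the I/O
failed.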

4) vmd/ltid will report errors in the separate logs under the
/usr/openv/netbackup/logs/???? dirs for your stacker/tld, etc. What do those
show? bptm in particular would be useful to look at.


I hope this helps you get started.

Yours,

Chris



----- Original Message -----
From: "danix" <danix AT cloud9 DOT net>
To: <veritas-bu AT mailman.eng.auburn DOT edu>
Sent: Monday, May 20, 2002 7:30 AM
Subject: [Veritas-bu] drives being marked down/missing


> I sent this to the sun-managers list before finding this list.
> Sorry if you see this twice.
>
> (I'm not the backup admin, just attacking this from a solaris/hardware
> perspective, but I am working closely with the backup admin on this.)
>
> We have a running backup system using NetBackup 3.4.1 on an E450, with a
> small StorageTek jukebox.  About a month ago we added a larger jukebox to
> the system, and it was going so well that our backup admin set all jobs to
> go to the large device and removed the smaller one from NetBackup, though
> leaving it plugged in and powered up.
>
> That night, all backups failed, as the tape drives failed and were marked
> "down" in succession.
>
> The next day, we thought there was a problem with the large device, so we
> had all jobs go to the smaller unit.  The same thing happened: all drives
> got marked down.
>
> We deleted all tape units from NetBackup and let it rediscover the drives.
> It found the 4 in the smaller unit OK, but only 4 of 6 in the larger unit.
> All those drives again failed during testing, and were marked down.
>
> Over the weekend we went to disk, which worked OK.
>
> We have:
> - checked syslog files, nothing obvious
> - 4 distinct HBA cards, so there is no common controller
> - cleaned the heads in all drives.
> - deleted the contents of /dev/rmt and run devfsadm (this is a Solaris 8
>   box, by the way)
>
> Any suggestions or experience with this problem are appreciated. I will of
> course summarize.