[Networker] SCSI bus RESETS finally resolved???

From: George Sinclair <George.Sinclair AT NOAA DOT GOV>
To: NETWORKER AT LISTSERV.TEMPLE DOT EDU
Date: Tue, 17 Jan 2006 16:46:02 -0500
Some time back, I posted a message regarding SCSI bus RESET messages we'd been seeing on one of our storage node servers. This had been an ongoing nightmare. I've provided some sample messages below. The messages occurred almost every day and always indicated the SCSI ID of the picker itself. After fiddling around with various cables, host HBAs, drivers, host HBA settings in the card utility (e.g. increasing the timeout from 10 seconds to 255 for the picker device), the SCSI card in the library itself, terminators, settings on the library, and isolating the robotic picker onto its own SCSI channel on the host's HBA instead of daisy-chaining it to one of the drives -- basically trying everything to resolve this issue -- it now appears that the RESET occurs soon after a clone operation is started. Hmm ...

We're running Solaris on the primary server and Red Hat Linux on the storage node. The library is a StorageTek L80 (with 4 LTO-1 drives), but we also have an older Quantum P1000 with 2 SDLT drives, and I've seen the same problem there when cloning on that one. We're running NetWorker 6.1.1, but we plan to upgrade to 7.2.1 once we have a newer server in place.

What I've noticed is that if I run a clone from the command line (nsrclone), or if I have cloning turned on for the given group, then once the cloning starts, the error will occur *if* one or more of the required tapes (original or clone volume) are not currently loaded in the drive(s). What I see in the GUI is a 'device or resource busy' message on one of the drives, and then NetWorker retries every 30 seconds. The RESET message appears in the system log on the storage node soon after. This problem doesn't pick on any specific drive; I've seen it on all of them. Sometimes it never succeeds, while other times it might. I should note that when I've tested this, it fails at least 50% of the time, and in my tests no backups were running! If, however, I manually load the required tapes first and then run the clone, everything works just dandy, and I don't see these messages.
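For what it's worth, the failure case above can be triggered on demand with something like the following rough sketch (the device path, pool name, and save-set ID here are placeholders for illustration, not our actual values):

```shell
# Deliberately leave the needed volume unmounted, then start a clone and
# watch the storage node's syslog for the reset messages.
nsrjb -u -f /dev/nst0                     # unload whatever is in the drive (placeholder device)
nsrclone -b "Clone Pool" -S 1234567890 &  # placeholder pool name and save-set ID
tail -f /var/log/messages | grep -i scsi  # the resets show up here shortly after
```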

Furthermore, I turned automatic cloning off for the one and only group that was using it. This group is set to run every morning, and no other backups are running at that time. Since turning cloning off on this group a week ago, the RESET messages have disappeared from the storage node server's system log, and I no longer see these ugly errors. Instead, I run the clone manually from the command line after I ensure that the required original tape(s) and clone volume are mounted. Works like a champ. Now, I've only been playing Sherlock Holmes for about a week, and it's possible the error could still occur when a clone is not running, but it looks more and more suspicious when turning off automatic cloning quiets things, and I can replicate the error by cloning without one or more of the required tapes mounted.
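For the record, the manual workaround looks roughly like this (volume names, device paths, and the pool name are placeholders, not our actual configuration):

```shell
# Mount the original and clone volumes first, then run the clone.
nsrjb -l -f /dev/nst0 FULL.001        # load/mount the original volume (placeholder name)
nsrjb -l -f /dev/nst1 CLONE.001       # load/mount the clone destination (placeholder name)

# Look up the save-set IDs on the original volume, then clone them to the pool.
mminfo -q "volume=FULL.001" -r ssid
nsrclone -b "Clone Pool" -S 1234567890  # placeholder ssid taken from the mminfo output
```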

I should note that I don't dedicate any specific drives in the jukebox to cloning; I allow all the devices to be used for cloning or backups. As such, it's quite likely that the drives will have backup volumes in them when the group with automatic cloning runs, since that group runs in the morning and the backup groups run at night. Also, I have the Max Parallelism for the jukebox set to 4 because I like to be able to write to all 4 drives simultaneously during backups, and all of the drives are write enabled. I'm thinking something must cause NetWorker to choke or time out while waiting for a response from the picker, and this does something to the storage node server that generates those RESETS on the affected SCSI bus.
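In case it helps anyone compare their own setup, the jukebox settings I mention can be inspected non-interactively with nsradmin, something like this (the server name is a placeholder, and I'm not certain the attribute names are identical across releases):

```shell
# Query the NSR jukebox resource and show the attributes of interest.
nsradmin -s backupserver <<'EOF'
. type: NSR jukebox
show name; devices; max parallelism
print
EOF
```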

What should we make of all this???? Is this a known issue with 6.1.1 or older releases? Can we expect better things from 7.2.1 in regards to this problem? Is there anything we can tweak or adjust that could fix this? Any settings in the jukebox or devices that should be adjusted?

Thanks,

George

<<< system log messages >>>

Jan 17 16:28:45 snode kernel: scsi : aborting command due to timeout : pid 38588788, scsi2, channel 0, id 0, lun 0 Move medium/play audio(12) 00 00 00 01 f4 03 f7 00 00 00 00
Jan 17 16:28:45 snode kernel: mptscsih: ioc0: id=0 OldAbort: scheduling ABORT SCSI IO (sc=f5ed3a00)
Jan 17 16:28:45 snode kernel: SCSI host 2 abort (pid 38588788) timed out - resetting
Jan 17 16:28:45 snode kernel: SCSI bus is being reset for host 2 channel 0.
Jan 17 16:28:45 snode kernel: mptscsih: ioc0: id=0 OldReset: scheduling BUS_RESET SCSI IO (sc=f5ed3a00)
Jan 17 16:28:45 snode kernel: mptbase: ioc0: WARNING - IOCStatus(0x0048): SCSI Task Terminated

To sign off this list, send email to listserv AT listserv.temple DOT edu and type "signoff networker" in the body of the email. Please write to networker-request AT listserv.temple DOT edu if you have any problems with this list. You can access the archives at http://listserv.temple.edu/archives/networker.html or via RSS at http://listserv.temple.edu/cgi-bin/wa?RSS&L=NETWORKER
