[Networker] SCSI bus RESETS finally resolved???

From: George Sinclair <George.Sinclair AT NOAA DOT GOV>
To: NETWORKER AT LISTSERV.TEMPLE DOT EDU
Date: Tue, 17 Jan 2006 16:46:02 -0500
Some time back, I posted a message regarding SCSI bus RESET messages we'd been seeing on one of our storage node servers. This had been an ongoing nightmare. I've provided some sample messages below. The messages occurred almost every day and always indicated the SCSI ID of the picker itself. After fiddling around with various cables, host HBAs, drivers, host HBA settings in the card utility (e.g. increasing the timeout from 10 seconds to 255 for the picker device), the SCSI card in the library itself, terminators, settings on the library, and isolating the robotic picker onto its own SCSI channel on the host's HBA instead of daisy-chaining it to one of the drives -- basically trying everything to resolve this issue -- it now appears that the RESET occurs soon after a clone operation is started. Hmm ...

We're running Solaris on the primary server and Red Hat Linux on the storage node. The library is a StorageTek L80 (with 4 LTO-1 drives), but we also have an older Quantum P1000 with 2 SDLT drives, and I've seen the same problem there when cloning on that one. We're running NetWorker 6.1.1, but we plan to upgrade to 7.2.1 once we have a newer server in place.

What I've noticed is that if I run a clone from the command line (nsrclone), or if I have cloning turned on for the given group, then once the cloning starts, the error will occur *if* one or more of the required tapes (original or clone volume) are not currently loaded in the drive(s). What I see in the GUI is a 'device or resource busy' message on one of the drives, and then NetWorker retries every 30 seconds. The RESET message appears in the system log on the storage node soon after. This problem doesn't pick on any specific drive; I've seen it on all of them. Sometimes it never succeeds, while other times it might. I should note that when I've tested this, it fails at least 50% of the time, and in my tests no backups were running! If, however, I manually load the required tapes first and then run the clone, everything works just dandy, and I don't see these messages.
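For what it's worth, the failure case above can be triggered on demand with something like the following rough sketch (the device path, pool name, and save-set ID here are placeholders for illustration, not our actual values):

```shell
# Deliberately leave the needed volume unmounted, then start a clone and
# watch the storage node's syslog for the reset messages.
nsrjb -u -f /dev/nst0                     # unload whatever is in the drive (placeholder device)
nsrclone -b "Clone Pool" -S 1234567890 &  # placeholder pool name and save-set ID
tail -f /var/log/messages | grep -i scsi  # the resets show up here shortly after
```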

Furthermore, I turned automatic cloning off for the one and only group that was using it. This group is set to run every morning, and no other backups are running at that time. Since turning cloning off on this group a week ago, the RESET messages have disappeared from the storage node server's system log, and I no longer see these ugly errors. Instead, I run the clone manually from the command line after I ensure that the required original tape(s) and clone volume are mounted. Works like a champ. Now, I've only been playing Sherlock Holmes for about a week, and it's possible the error could still occur when a clone is not running, but it looks more and more suspicious when turning off automatic cloning quiets things, and I can replicate the error by cloning without one or more of the required tapes mounted.
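For the record, the manual workaround looks roughly like this (volume names, device paths, and the pool name are placeholders, not our actual configuration):

```shell
# Mount the original and clone volumes first, then run the clone.
nsrjb -l -f /dev/nst0 FULL.001        # load/mount the original volume (placeholder name)
nsrjb -l -f /dev/nst1 CLONE.001       # load/mount the clone destination (placeholder name)

# Look up the save-set IDs on the original volume, then clone them to the pool.
mminfo -q "volume=FULL.001" -r ssid
nsrclone -b "Clone Pool" -S 1234567890  # placeholder ssid taken from the mminfo output
```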

I should note that I don't dedicate any specific drives in the jukebox to cloning; I allow all the devices to be used for cloning or backups. As such, it's quite likely that the drives will have backup volumes in them when the group with automatic cloning runs, since that group runs in the morning and the backup groups run at night. Also, I have the Max Parallelism for the jukebox set to 4 because I like to be able to write to all 4 drives simultaneously during backups, and all of the drives are write enabled. I'm thinking something must cause NetWorker to choke or time out while waiting for a response from the picker, and this does something to the storage node server that generates those RESETS on the affected SCSI bus.
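In case it helps anyone compare their own setup, the jukebox settings I mention can be inspected non-interactively with nsradmin, something like this (the server name is a placeholder, and I'm not certain the attribute names are identical across releases):

```shell
# Query the NSR jukebox resource and show the attributes of interest.
nsradmin -s backupserver <<'EOF'
. type: NSR jukebox
show name; devices; max parallelism
print
EOF
```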

What should we make of all this???? Is this a known issue with 6.1.1 or older releases? Can we expect better things from 7.2.1 in regards to this problem? Is there anything we can tweak or adjust that could fix this? Any settings in the jukebox or devices that should be adjusted?

Thanks,

George

<<< system log messages >>>

Jan 17 16:28:45 snode kernel: scsi : aborting command due to timeout : pid 38588788, scsi2, channel 0, id 0, lun 0 Move medium/play audio(12) 00 00 00 01 f4 03 f7 00 00 00 00
Jan 17 16:28:45 snode kernel: mptscsih: ioc0: id=0 OldAbort: scheduling ABORT SCSI IO (sc=f5ed3a00)
Jan 17 16:28:45 snode kernel: SCSI host 2 abort (pid 38588788) timed out - resetting
Jan 17 16:28:45 snode kernel: SCSI bus is being reset for host 2 channel 0.
Jan 17 16:28:45 snode kernel: mptscsih: ioc0: id=0 OldReset: scheduling BUS_RESET SCSI IO (sc=f5ed3a00)
Jan 17 16:28:45 snode kernel: mptbase: ioc0: WARNING - IOCStatus(0x0048): SCSI Task Terminated

To sign off this list, send email to listserv AT listserv.temple DOT edu and type "signoff networker" in the body of the email. Please write to networker-request AT listserv.temple DOT edu if you have any problems with this list. You can access the archives at http://listserv.temple.edu/archives/networker.html or via RSS at http://listserv.temple.edu/cgi-bin/wa?RSS&L=NETWORKER
