Re: [Networker] Adaptec aic7xxx ABORT -- need help!!!

2003-07-18 04:48:54
From: Christian Drexler <CDrexler AT TEE.TOSHIBA DOT DE>
To: NETWORKER AT LISTMAIL.TEMPLE DOT EDU
Date: Fri, 18 Jul 2003 10:48:44 +0200

Hi George,
we had a similar problem with two different Adaptec SCSI controllers
under RH 7.3 and 8.0 when writing to an MOD. The workaround was to use the
aic7xxx_old module. If I remember correctly, there was a timing problem
with the aic7xxx module when writing and verifying the written data.
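
To judge whether a driver change actually helps, it can be useful to
tally the abort messages per device before and after the switch. The
following is only a minimal sketch in Python, assuming the syslog line
format shown in the quoted /var/log/messages excerpt below and assuming
each kernel message sits on a single line in the real log file. The
function name count_aborts is made up for illustration, and the device
string (for example scsi0:0:3:0) is taken as-is from the log; it
appears to be host:channel:target:LUN.

```python
import re
from collections import Counter

# Matches the line logged when an abort is started, e.g.
# "Jul 7 04:53:30 hostname kernel: scsi0:0:3:0: Attempting to queue an ABORT message"
# (format assumed from the quoted log excerpt, not from driver documentation).
ABORT_RE = re.compile(
    r"kernel: (scsi\d+:\d+:\d+:\d+): Attempting to queue an ABORT"
)


def count_aborts(lines):
    """Return a Counter mapping each device string to its number of aborts."""
    counts = Counter()
    for line in lines:
        match = ABORT_RE.search(line)
        if match:
            counts[match.group(1)] += 1
    return counts
```

It could be fed an open log file, for instance
count_aborts(open("/var/log/messages")), and the resulting counts
compared for the periods before and after changing the module.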

Regards

~christian

On Wed, 16 Jul 2003 16:38:28 -0400
George Sinclair <George.Sinclair AT noaa DOT gov> wrote:

> Hello,
>
> I think we have an issue with the drivers or BIOS on our two Adaptec
> 9160 cards (please see PROBLEM and SETUP below) and our Dell Red Hat
> storage node server that is causing massive problems during backups.
> In looking around through various Google searches this appears to be a
> problem many others have experienced with the Adaptec driver, but I've
> been unable to get any clear answers on exactly what to do to fix
> this. Most of the stuff I read seemed to point to older versions of
> Linux, where an older version of the Adaptec driver was the culprit. I
> posted to this site some time ago, and I think the recommendation at
> the time was to patch the driver, or upgrade Red Hat. We were waiting
> to take it to 7.3 anyway, and I would have thought the new kernel
> would have included the latest version of the driver. We now have Red
> Hat 7.3 on the storage node server --  it's been on there for a while
> -- but we're still seeing the same problem.
>
> I did not install Linux on this host, but I will try to provide as
> much information as possible. I do not know if the current driver is
> built into the kernel or is a separate module. I suspect it's built
> in. Anyway, I was hoping someone on this list might be able to help me
> to fix this very annoying problem. I need advice on what to do here so
> I don't screw things up. I'm not very knowledgeable about Linux.
>
> PROBLEM: We regularly receive ABORT messages from the aic7xxx driver
> on our Linux box. We're running NetWorker 6.1.1 on a Dell storage node
> with two attached tape libraries that use these cards. I've provided
> some sample output from the /var/log/messages file below. These errors
> seem to occur once every few days or so, and as near as I can tell,
> only when the cards are in use. When these errors happen, there is a
> reasonable likelihood that one or more nsrjb processes (these perform
> mounting, loading, unloading, etc. of tapes) will hang, resulting in
> frozen backup operations. Sometimes, the affected tape will be
> prematurely marked full by the software -- no doubt, a nasty side
> effect of this phenomenon. Communication from the
> host to the affected devices (tape drives) will often be terminated,
> but other times, the communication is unaffected. In either case, the
> syntax of the messages does not seem to differ. Sometimes the host
> itself will lock up and must be cold booted, but normally the machine
> is fine, and the worst case scenario is that no further communication
> with the attached devices is possible until the machine is rebooted.
>
> SETUP: The host in question is a Dell PowerEdge 2550, BIOS Revision
> A06. 'uname -a' shows: 2.4.20-13.7smp #1 SMP Mon May 12 12:31:27 EDT
> 2003 i686 unknown
>
> On bootup, I see:
>
> Adaptec SCSI Card 39160 BIOS, (c) 2000 Adaptec, Inc.
> v2.57.2S2
>
> for the card that manages our P1000 tape library and
>
> v 3.10.0 (c) 2001
>
> for the card managing the StorageTek library.
>
> We're running one StorageTek L80 LTO tape library on one of the cards.
> In this case both channels are being used. Specifically, the picker
> and two drives on the library are daisy chained and attach to channel
> A on the SCSI card via one LVD cable, and the other two drives are
> daisy chained and attach to channel B on the SCSI card. The other
> library is an ATL P1000 tape library with two SDLT drives and connects
> to the other Adaptec card, using only one channel. Both libraries are
> terminated properly, and all cables have been checked. As I said,
> communication is restored once the host is rebooted.
>
> I did check Adaptec's page, and it appears that we have the latest
> firmware release for the second card, but we're behind on the other
> one. I could download that for the other card and flash its BIOS, but
> this seems to be a one shot "better know you really want to do this"
> deal. Not sure if I should do this, but I was thinking that maybe the
> current driver won't work reliably with the older BIOS, so maybe this
> is part of the problem. I'd read that many changes were included in
> the new Adaptec driver. I don't know how to determine the version
> we're using. I'm wondering if we can patch the driver and if so how? I
> thought that Linux just uses a built-in driver, and the only way to
> get the next version is to upgrade the OS? Anyway, I'm thinking we
> have all the right settings in the config for the cards, but maybe
> we're using the wrong version of the driver, or that older BIOS is the
> problem. Maybe it's kludging up the other card. When the communication
> is terminated, it appears to affect both libraries, but I've not tried
> running just one library. Could do that, but I've seen so many things
> out there about these cards that seem to match our symptoms that I
> thought I'd start with the driver as the culprit.
>
> I can't imagine the primary server having anything to do with this
> mischief, but just for information, it's running 6.1.1 on Sun Solaris
> 8.
>
> I'm thinking we should flash the firmware on the one card to bring it
> up to the latest version and then download the latest version of the
> Adaptec driver, install it as a module and re-compile the kernel, but
> again, not sure how the current driver is integrated into the kernel
> or even whether or how I would go about doing this.
>
> Would appreciate any help.
>
> George
> George.Sinclair AT noaa DOT gov
>
> <<< /var/log/messages >>>
> Jul 7 04:53:30 hostname kernel: scsi0:0:3:0: Attempting to queue an ABORT message
> Jul 7 04:53:30 hostname kernel: scsi0: Dumping Card State in Command phase, at SEQADDR 0x168
> Jul 7 04:53:30 hostname kernel: ACCUM = 0x80, SINDEX = 0xa0, DINDEX = 0xe4, ARG_2 = 0x0
> Jul 7 04:53:30 hostname kernel: HCNT = 0x0 SCBPTR = 0x0
> Jul 7 04:53:30 hostname kernel: SCSISEQ = 0x12, SBLKCTL = 0xa
> Jul 7 04:53:30 hostname kernel: DFCNTRL = 0x4, DFSTATUS = 0x89
> Jul 7 04:53:30 hostname kernel: LASTPHASE = 0x80, SCSISIGI = 0x44, SXFRCTL0 = 0x88
> Jul 7 04:53:30 hostname kernel: SSTAT0 = 0x7, SSTAT1 = 0x11
> Jul 7 04:53:30 hostname kernel: SCSIPHASE = 0x2
> Jul 7 04:53:30 hostname kernel: STACK == 0x175, 0x160, 0xe7, 0x34
> Jul 7 04:53:30 hostname kernel: SCB count = 4
> Jul 7 04:53:30 hostname kernel: Kernel NEXTQSCB = 0
> Jul 7 04:53:30 hostname kernel: Card NEXTQSCB = 0
> Jul 7 04:53:30 hostname kernel: QINFIFO entries:
> Jul 7 04:53:30 hostname kernel: Waiting Queue entries:
> Jul 7 04:53:30 hostname kernel: Disconnected Queue entries:
> Jul 7 04:53:30 hostname kernel: QOUTFIFO entries:
> Jul 7 04:53:30 hostname kernel: Sequencer Free SCB List: 2 1 3 4 5 6 7 8 9
> 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31
> Jul 7 04:53:30 hostname kernel: Sequencer SCB Info:
> 0(c 0x40, s 0x37, l 0, t 0x1) 1(c 0x40, s 0x37, l 0, t 0xff)
> 2(c 0x40, s 0x27, l 0, t 0xff) 3(c 0x0, s 0xff, l 255, t 0xff)
> 4(c 0x0, s 0xff, l 255, t 0xff) 5(c 0x0, s 0xff, l 255, t 0xff)
> 6(c 0x0, s 0xff, l 255, t 0xff) 7(c 0x0, s 0xff, l 255, t 0xff)
> 8(c 0x0, s 0xff, l 255, t 0xff) 9(c 0x0, s 0xff, l 255, t 0xff)
> 10(c 0x0, s 0xff, l 255, t 0xff) 11(c 0x0, s 0xff, l 255, t 0xff)
> 12(c 0x0, s 0xff, l 255, t 0xff) 13(c 0x0, s 0xff, l 255, t 0xff)
> 14(c 0x0, s 0xff, l 255, t 0xff) 15(c 0x0, s 0xff, l 255, t 0xff)
> 16(c 0x0, s 0xff, l 255, t 0xff) 17(c 0x0, s 0xff, l 255, t 0xff)
> 18(c 0x0, s 0xff, l 255, t 0xff) 19(c 0x0, s 0xff, l 255, t 0xff)
> 20(c 0x0, s 0xff, l 255, t 0xff) 21(c 0x0, s 0xff, l 255, t 0xff)
> 22(c 0x0, s 0xff, l 255, t 0xff) 23(c 0x0, s 0xff, l 255, t 0xff)
> 24(c 0x0, s 0xff, l 255, t 0xff) 25(c 0x0, s 0xff, l 255, t 0xff)
> 26(c 0x0, s 0xff, l 255, t 0xff) 27(c 0x0, s 0xff, l 255, t 0xff)
> 28(c 0x0, s 0xff, l 255, t 0xff) 29(c 0x0, s 0xff, l 255, t 0xff)
> Jul 7 04:53:30 hostname kernel: 30(c 0x0, s 0xff, l 255, t 0xff) 31(c 0x0, s 0xff, l 255, t 0xff)
> Jul 7 04:53:30 hostname kernel: Pending list: 1(c 0x40, s 0x37, l 0)
> Jul 7 04:53:30 hostname kernel: Kernel Free SCB list: 2 3
> Jul 7 04:53:30 hostname kernel: Untagged Q(3): 1
> Jul 7 04:53:30 hostname kernel: DevQ(0:0:0): 0 waiting
> Jul 7 04:53:30 hostname kernel: DevQ(0:2:0): 0 waiting
> Jul 7 04:53:30 hostname kernel: DevQ(0:3:0): 0 waiting
> Jul 7 04:53:30 hostname kernel: scsi0:0:3:0: Device is active, asserting ATN
> Jul 7 04:53:30 hostname kernel: Recovery code sleeping
> Jul 7 04:53:30 hostname kernel: (scsi0:A:3:0): Abort Message Sent
> Jul 7 04:53:30 hostname kernel: (scsi0:A:3:0): SCB 1 - Abort Completed.
> Jul 7 04:53:30 hostname kernel: Recovery SCB completes
> Jul 7 04:53:30 hostname kernel: Recovery code awake
> Jul 7 04:53:30 hostname kernel: aic7xxx_abort returns 0x2002
>
> --
> Note: To sign off this list, send a "signoff networker" command via
> email to listserv AT listmail.temple DOT edu or visit the list's Web site at
> http://listmail.temple.edu/archives/networker.html where you can
> also view and post messages to the list.
> =*=*=*=*=*=*=*=*=*=*=*=*=*=*=*=*=*=*=*=*=*=*=*=*=*=*=*=*=*=*=*=


--
Christian Drexler               mail: cdrexler AT tee.toshiba DOT de
Systems Administrator           phone: +49 211 5296 322
Toshiba Electronics Europe      fax: +49 211 5296 9322

