[Networker] Adaptec aic7xxx ABORT -- need help!!!

2003-07-16 16:38:31
Subject: [Networker] Adaptec aic7xxx ABORT -- need help!!!
From: George Sinclair <George.Sinclair AT NOAA DOT GOV>
To: NETWORKER AT LISTMAIL.TEMPLE DOT EDU
Date: Wed, 16 Jul 2003 16:38:28 -0400
Hello,

I think we have an issue with the drivers or BIOS on our two Adaptec
9160 cards (please see PROBLEM and SETUP below) and our Dell Red Hat
storage node server that is causing massive problems during backups. In
looking around through various Google searches, this appears to be a
problem many others have experienced with the Adaptec driver, but I've
been unable to get any clear answers on exactly what to do to fix this.
Most of the stuff I read seemed to point to older versions of Linux,
where an older version of the Adaptec driver was the culprit. I posted
to this site some time ago, and I think the recommendation at the time
was to patch the driver, or upgrade Red Hat. We were waiting to take it
to 7.3 anyway, and I would have thought the new kernel would have
included the latest version of the driver. We now have Red Hat 7.3 on
the storage node server --  it's been on there for a while -- but we're
still seeing the same problem.

I did not install Linux on this host, but I will try to provide as much
information as possible. I do not know if the current driver is built
into the kernel or is a separate module. I suspect it's built in.
Anyway, I was hoping someone on this list might be able to help me
fix this very annoying problem. I need advice on what to do here so I
don't screw things up. I'm not very knowledgeable about Linux.
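
One rough way to tell a loadable module from a built-in driver is to
look for it in /proc/modules, which lists only the modules currently
loaded. The Python sketch below takes that approach. The helper name
is_loaded_module is purely illustrative, and a missing entry only
suggests (it does not prove) that the driver is compiled into the
kernel, since the module could also simply not be loaded.

```python
def is_loaded_module(proc_modules_text, name="aic7xxx"):
    """Return True if `name` is listed in /proc/modules-style text."""
    for line in proc_modules_text.splitlines():
        fields = line.split()
        # The first whitespace-separated field on each line is the module name.
        if fields and fields[0] == name:
            return True
    return False


if __name__ == "__main__":
    try:
        with open("/proc/modules") as f:
            text = f.read()
    except OSError:
        # Not on Linux, or /proc is unavailable: treat as "nothing loaded".
        text = ""
    print("aic7xxx loaded as a module:", is_loaded_module(text))
```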

PROBLEM: We regularly receive ABORT messages from the aic7xxx driver on
our Linux box. We're running NetWorker 6.1.1 on a Dell storage node with
two attached tape libraries that use these cards. I've provided some
sample output from the /var/log/messages file below. These errors seem
to occur once every few days or so, and as near as I can tell, only when
the cards are in use. When these errors happen, there is a reasonable
likelihood that one or more nsrjb processes (these perform mounting,
loading, unloading, etc. of tapes) will hang, resulting in frozen backup
operations. Sometimes, the affected tape will be prematurely marked full
by the software -- no doubt, a nasty side effect of this phenomenon.
Communication from the
host to the affected devices (tape drives) will often be terminated, but
other times, the communication is unaffected. In either case, the syntax
of the messages does not seem to differ. Sometimes the host itself will
lock up and must be cold booted, but normally the machine is fine, and
the worst case scenario is that no further communication with the
attached devices is possible until the machine is rebooted.
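
To put a number on "once every few days", the ABORT events could be
tallied per device straight from the log. This is a minimal Python
sketch, assuming each event appears on a single unwrapped line in
/var/log/messages in the form shown in the sample output at the end of
this message, and that the device field reads host:channel:target:lun.
The names ABORT_RE and count_aborts are illustrative only.

```python
import re
from collections import Counter

# Matches lines such as:
#   Jul 7 04:53:30 hostname kernel: scsi0:0:3:0: Attempting to queue an ABORT message
ABORT_RE = re.compile(
    r"kernel: (scsi\d+:\d+:\d+:\d+): Attempting to queue an ABORT message"
)


def count_aborts(lines):
    """Count ABORT attempts per device identifier found in the log lines."""
    counts = Counter()
    for line in lines:
        match = ABORT_RE.search(line)
        if match:
            counts[match.group(1)] += 1
    return counts


if __name__ == "__main__":
    try:
        with open("/var/log/messages", errors="replace") as f:
            totals = count_aborts(f)
    except OSError as exc:
        print("could not read log:", exc)
    else:
        for device, n in sorted(totals.items()):
            print(device, n)
```

Running this against each rotated log file would show whether the
aborts cluster on one target or are spread across both cards.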

SETUP: The host in question is a Dell PowerEdge 2550, BIOS Revision A06.
'uname -a' shows: 2.4.20-13.7smp #1 SMP Mon May 12 12:31:27 EDT 2003
i686 unknown

On bootup, I see:

Adaptec SCSI Card 39160 BIOS, (c) 2000 Adaptec, Inc.
v2.57.2S2

for the card that manages our P1000 tape library and

v 3.10.0 (c) 2001

for the card managing the Storagetek library.

We're running one Storagetek L80 LTO tape library on one of the cards.
In this case both channels are being used. Specifically, the picker and
two drives on the library are daisy chained and attach to channel A on
the SCSI card via one LVD cable, and the other two drives are daisy
chained and attach to channel B on the SCSI card. The other library is
an ATL P1000 tape library with two SDLT drives and connects to the other
Adaptec card, using only one channel. Both libraries are terminated
properly, and all cables have been checked. As I said, communication is
restored once the host is rebooted.

I did check Adaptec's page, and it appears that we have the latest
firmware release for the second card, but we're behind on the other one.
I could download that for the other card and flash its BIOS, but this
seems to be a one shot "better know you really want to do this" deal.
Not sure if I should do this, but I was thinking that maybe the current
driver won't work reliably with the older BIOS, so maybe this is part of
the problem. I'd read that many changes were included in the new Adaptec
driver. I don't know how to determine the version we're using. I'm
wondering if we can patch the driver and if so how? I thought that Linux
just uses a built-in driver, and the only way to get the next version is
to upgrade the OS? Anyway, I'm thinking we have all the right settings
in the config for the cards, but maybe we're using the wrong version of
the driver, or that older BIOS is the problem. Maybe it's kludging up
the other card. When the communication is terminated, it appears to
affect both libraries, but I've not tried running just one library.
Could do that, but I've seen so many things out there about these cards
that seem to match our symptoms that I thought I'd start with the driver
as the culprit.
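
On the question of which driver version is in use: on 2.4 kernels the
aic7xxx driver normally provides one status file per controller under
/proc/scsi/aic7xxx/, and the first line usually reads "Adaptec AIC7xxx
driver version: ..." followed by the version. The Python sketch below
assumes that location and wording, both of which would need checking on
the actual host. The names VERSION_RE and driver_version are
illustrative only.

```python
import glob
import re

# Assumed wording of the version line in /proc/scsi/aic7xxx/<n>.
VERSION_RE = re.compile(r"Adaptec AIC7xxx driver version:\s*(\S+)")


def driver_version(proc_text):
    """Return the version string found in the text, or None if absent."""
    match = VERSION_RE.search(proc_text)
    return match.group(1) if match else None


if __name__ == "__main__":
    # One file per SCSI host handled by the driver (assumed layout).
    for path in sorted(glob.glob("/proc/scsi/aic7xxx/*")):
        try:
            with open(path) as f:
                print(path, driver_version(f.read()))
        except OSError as exc:
            print(path, "unreadable:", exc)
```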

I can't imagine the primary server having anything to do with this
mischief, but just for information, it's running 6.1.1 on Sun Solaris 8.

I'm thinking we should flash the firmware on the one card to bring it up
to the latest version and then download the latest version of the
Adaptec driver, install it as a module and re-compile the kernel, but
again, I'm not sure how the current driver is integrated into the kernel,
or even whether or how I would go about doing this.

Would appreciate any help.

George
George.Sinclair AT noaa DOT gov

<<< /var/log/messages >>>
Jul 7 04:53:30 hostname kernel: scsi0:0:3:0: Attempting to queue an ABORT message
Jul 7 04:53:30 hostname kernel: scsi0: Dumping Card State in Command phase, at SEQADDR 0x168
Jul 7 04:53:30 hostname kernel: ACCUM = 0x80, SINDEX = 0xa0, DINDEX = 0xe4, ARG_2 = 0x0
Jul 7 04:53:30 hostname kernel: HCNT = 0x0 SCBPTR = 0x0
Jul 7 04:53:30 hostname kernel: SCSISEQ = 0x12, SBLKCTL = 0xa
Jul 7 04:53:30 hostname kernel: DFCNTRL = 0x4, DFSTATUS = 0x89
Jul 7 04:53:30 hostname kernel: LASTPHASE = 0x80, SCSISIGI = 0x44, SXFRCTL0 = 0x88
Jul 7 04:53:30 hostname kernel: SSTAT0 = 0x7, SSTAT1 = 0x11
Jul 7 04:53:30 hostname kernel: SCSIPHASE = 0x2
Jul 7 04:53:30 hostname kernel: STACK == 0x175, 0x160, 0xe7, 0x34
Jul 7 04:53:30 hostname kernel: SCB count = 4
Jul 7 04:53:30 hostname kernel: Kernel NEXTQSCB = 0
Jul 7 04:53:30 hostname kernel: Card NEXTQSCB = 0
Jul 7 04:53:30 hostname kernel: QINFIFO entries:
Jul 7 04:53:30 hostname kernel: Waiting Queue entries:
Jul 7 04:53:30 hostname kernel: Disconnected Queue entries:
Jul 7 04:53:30 hostname kernel: QOUTFIFO entries:
Jul 7 04:53:30 hostname kernel: Sequencer Free SCB List: 2 1 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31
Jul 7 04:53:30 hostname kernel: Sequencer SCB Info: 0(c 0x40, s 0x37, l 0,
t 0x1) 1(c 0x40, s 0x37, l 0, t 0xff) 2(c 0x40, s 0x27, l 0, t 0xff) 3(c
0x0, s 0xff, l 255, t 0xff) 4(c 0x0, s 0xff, l 255, t 0xff) 5(c 0x0, s
0xff, l 255, t 0xff) 6(c 0x0, s 0xff, l 255, t 0xff) 7(c 0x0, s 0xff, l
255, t 0xff) 8(c 0x0, s 0xff, l 255, t 0xff) 9(c 0x0, s 0xff, l 255, t
0xff) 10(c 0x0, s 0xff, l 255, t 0xff) 11(c 0x0, s 0xff, l 255, t 0xff)
12(c 0x0, s 0xff, l 255, t 0xff) 13(c 0x0, s 0xff, l 255, t 0xff) 14(c
0x0, s 0xff, l 255, t 0xff) 15(c 0x0, s 0xff, l 255, t 0xff) 16(c 0x0, s
0xff, l 255, t 0xff) 17(c 0x0, s 0xff, l 255, t 0xff) 18(c 0x0, s 0xff,
l 255, t 0xff) 19(c 0x0, s 0xff, l 255, t 0xff) 20(c 0x0, s 0xff, l 255,
t 0xff) 21(c 0x0, s 0xff, l 255, t 0xff) 22(c 0x0, s 0xff, l 255, t
0xff) 23(c 0x0, s 0xff, l 255, t 0xff) 24(c 0x0, s 0xff, l 255, t 0xff)
25(c 0x0, s 0xff, l 255, t 0xff) 26(c 0x0, s 0xff, l 255, t 0xff) 27(c
0x0, s 0xff, l 255, t 0xff) 28(c 0x0, s 0xff, l 255, t 0xff) 29(c 0x0, s
0xff, l 255, t 0xff)
Jul 7 04:53:30 hostname kernel: 30(c 0x0, s 0xff, l 255, t 0xff) 31(c 0x0, s 0xff, l 255, t 0xff)
Jul 7 04:53:30 hostname kernel: Pending list: 1(c 0x40, s 0x37, l 0)
Jul 7 04:53:30 hostname kernel: Kernel Free SCB list: 2 3
Jul 7 04:53:30 hostname kernel: Untagged Q(3): 1
Jul 7 04:53:30 hostname kernel: DevQ(0:0:0): 0 waiting
Jul 7 04:53:30 hostname kernel: DevQ(0:2:0): 0 waiting
Jul 7 04:53:30 hostname kernel: DevQ(0:3:0): 0 waiting
Jul 7 04:53:30 hostname kernel: scsi0:0:3:0: Device is active, asserting ATN
Jul 7 04:53:30 hostname kernel: Recovery code sleeping
Jul 7 04:53:30 hostname kernel: (scsi0:A:3:0): Abort Message Sent
Jul 7 04:53:30 hostname kernel: (scsi0:A:3:0): SCB 1 - Abort Completed.
Jul 7 04:53:30 hostname kernel: Recovery SCB completes
Jul 7 04:53:30 hostname kernel: Recovery code awake
Jul 7 04:53:30 hostname kernel: aic7xxx_abort returns 0x2002

--
Note: To sign off this list, send a "signoff networker" command via email
to listserv AT listmail.temple DOT edu or visit the list's Web site at
http://listmail.temple.edu/archives/networker.html where you can
also view and post messages to the list.
=*=*=*=*=*=*=*=*=*=*=*=*=*=*=*=*=*=*=*=*=*=*=*=*=*=*=*=*=*=*=*=
