Re: [Networker] Adaptec aic7xxx ABORT -- need help!!!

2003-07-22 10:07:43
Subject: Re: [Networker] Adaptec aic7xxx ABORT -- need help!!!
From: Matthew Temple <mht AT RESEARCH.DFCI.HARVARD DOT EDU>
To: NETWORKER AT LISTMAIL.TEMPLE DOT EDU
Date: Tue, 22 Jul 2003 09:57:39 -0400
Did you ever get anywhere with this?  I've seen similar entries in
my log file (see below), but they're not necessarily failures; that is,
the backup can go on.

Which card do you have?

                                                Matt Temple

/var/log/messages:Jul 20 22:00:18 tapes kernel: scsi0:0:2:0: Attempting to queue an ABORT message
/var/log/messages:Jul 20 22:00:18 tapes kernel: (scsi0:A:2:0): Abort Message Sent
/var/log/messages:Jul 20 22:00:18 tapes kernel: (scsi0:A:2:0): SCB 3 - Abort Completed.
/var/log/messages:Jul 20 22:00:19 tapes kernel: aic7xxx_abort returns 0x2002
/var/log/messages.1:Jul 14 15:58:13 tapes kernel: scsi0:0:2:0: Attempting to queue an ABORT message
/var/log/messages.1:Jul 14 15:58:14 tapes kernel: (scsi0:A:2:0): Abort Message Sent
/var/log/messages.1:Jul 14 15:58:14 tapes kernel: (scsi0:A:2:0): SCB 2 - Abort Completed.
/var/log/messages.1:Jul 14 15:58:14 tapes kernel: aic7xxx_abort returns 0x2002
/var/log/messages.1:Jul 14 15:59:17 tapes kernel: scsi0:0:0:0: Attempting to queue an ABORT message
/var/log/messages.1:Jul 14 15:59:18 tapes kernel: aic7xxx_abort returns 0x2002
/var/log/messages.1:Jul 14 15:59:18 tapes kernel: scsi0:0:1:0: Attempting to queue an ABORT message
/var/log/messages.1:Jul 14 15:59:18 tapes kernel: aic7xxx_abort returns 0x2002
/var/log/messages.1:Jul 15 10:06:23 tapes kernel: scsi0:0:1:0: Attempting to queue an ABORT message
/var/log/messages.1:Jul 15 10:06:23 tapes kernel: (scsi0:A:1:0): Abort Message Sent
/var/log/messages.1:Jul 15 10:06:23 tapes kernel: (scsi0:A:1:0): SCB 1 - Abort Completed.
/var/log/messages.1:Jul 15 10:06:23 tapes kernel: aic7xxx_abort returns 0x2002
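
For anyone comparing notes on which targets are affected, the aborts can
be tallied per device with a short script. This is only a sketch in
Python, assuming unwrapped syslog lines in the format shown above (as
they appear in the log file itself, not as wrapped by a mail client).
The names count_aborts and SAMPLE are made up for illustration and are
not part of the driver or of NetWorker.

```python
import re
from collections import Counter

# Matches lines such as:
#   "... kernel: scsi0:0:2:0: Attempting to queue an ABORT message"
# The four numbers are read here as host:channel:target:lun (an assumption
# based on how the driver labels the same device as "(scsi0:A:2:0)").
ABORT_RE = re.compile(
    r"kernel: scsi(\d+):(\d+):(\d+):(\d+): Attempting to queue an ABORT message"
)


def count_aborts(lines):
    """Count abort attempts, keyed by (host, channel, target, lun)."""
    counts = Counter()
    for line in lines:
        match = ABORT_RE.search(line)
        if match:
            counts[tuple(int(g) for g in match.groups())] += 1
    return counts


# A few lines copied from the log excerpt above, used as sample input.
SAMPLE = [
    "/var/log/messages:Jul 20 22:00:18 tapes kernel: scsi0:0:2:0: Attempting to queue an ABORT message",
    "/var/log/messages:Jul 20 22:00:18 tapes kernel: (scsi0:A:2:0): Abort Message Sent",
    "/var/log/messages.1:Jul 14 15:59:17 tapes kernel: scsi0:0:0:0: Attempting to queue an ABORT message",
    "/var/log/messages.1:Jul 14 15:59:18 tapes kernel: scsi0:0:1:0: Attempting to queue an ABORT message",
    "/var/log/messages.1:Jul 15 10:06:23 tapes kernel: scsi0:0:1:0: Attempting to queue an ABORT message",
]

for (host, chan, target, lun), n in sorted(count_aborts(SAMPLE).items()):
    print(f"scsi{host} channel {chan} target {target} lun {lun}: {n} abort(s)")
```

To run it against a real log, pass an open file object to count_aborts
instead of the sample list. If one target dominates the tally, that
drive or its cable segment is a reasonable place to start looking.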



On Thu, 17 Jul 2003, Jose Quinteiro wrote:

> Hello George,
>
> The Adaptec driver is a module on both of my Redhat 7.3 systems:
>
> [root@centaur root]# /sbin/lsmod
> Module                  Size  Used by    Not tainted
> ...
> aic7xxx               124768   0  (unused)
> sd_mod                 12864   0  (unused)
> scsi_mod              108576   2  [aic7xxx sd_mod]
>
> [root@cow root]# /sbin/lsmod
> Module                  Size  Used by    Not tainted
> ...
> st                     29108   0  (unused)
> ext3                   67136   1
> jbd                    49400   1  [ext3]
> aic7xxx               124768   0  (unused)
> sd_mod                 12864   0  (unused)
> scsi_mod              108576   3  [st aic7xxx sd_mod]
>
> Compiling and installing Linux kernel modules is not exactly simple,
> however.
>
> FWIW, whenever I've had a similar problem, it's turned out to be the
> cable. I recently moved a hard drive and cable that were working
> beautifully on an Adaptec 2940 to an Adaptec 29160N.  It refused to work
> until I changed to a shorter cable on the latter system. It gave me
> errors that are roughly equivalent to what you're seeing.
>
> Saludos,
> Jose.
>
>
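
The lsmod listings above answer the module-or-built-in question by eye.
The same check can be sketched in Python, assuming lsmod-style text as
input. The helper name module_loaded and the SAMPLE text are
hypothetical, for illustration only. A missing entry does not prove the
driver is compiled into the kernel; it only shows that no module of that
name is currently loaded.

```python
def module_loaded(lsmod_text, name):
    """Return True if `name` appears in the first column of lsmod-style text."""
    for line in lsmod_text.splitlines():
        fields = line.split()
        if fields and fields[0] == name:
            return True
    return False


# Sample input copied from the first lsmod listing above.
SAMPLE = """\
Module                  Size  Used by    Not tainted
aic7xxx               124768   0  (unused)
sd_mod                 12864   0  (unused)
scsi_mod              108576   2  [aic7xxx sd_mod]
"""

print("aic7xxx loaded as a module:", module_loaded(SAMPLE, "aic7xxx"))
```

On a live system the input text could come from the output of
/sbin/lsmod or from /proc/modules, whose first column is also the module
name.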
> George Sinclair wrote:
> > Hello,
> >
> > I think we have an issue with the drivers or BIOS on our two Adaptec
> > 9160 cards (please see PROBLEM and SETUP below) and our Dell Red Hat
> > storage node server that is causing massive problems during backups. In
> > looking around through various google searches this appears to be a
> > problem many others have experienced with the Adaptec driver, but I've
> > been unable to get any clear answers on exactly what to do to fix this.
> > Most of the stuff I read seemed to point to older versions of Linux,
> > where an older version of the Adaptec driver was the culprit. I posted
> > to this site some time ago, and I think the recommendation at the time
> > was to patch the driver, or upgrade Red Hat. We were waiting to take it
> > to 7.3 anyway, and I would have thought the new kernel would have
> > included the latest version of the driver. We now have Red Hat 7.3 on
> > the storage node server --  it's been on there for a while -- but we're
> > still seeing the same problem.
> >
> > I did not install Linux on this host, but I will try to provide as much
> > information as possible. I do not know if the current driver is built
> > into the kernel or is a separate module. I suspect it's built in.
> > Anyway, I was hoping someone on this list might be able to help me
> > fix this very annoying problem. I need advice on what to do here so I
> > don't screw things up. I'm not very knowledgeable about Linux.
> >
> > PROBLEM: We regularly receive ABORT messages from the aic7xxx driver on
> > our Linux box. We're running NetWorker 6.1.1 on a Dell storage node with
> > two attached tape libraries that use these cards. I've provided some
> > sample output from the /var/log/messages file below. These errors seem
> > to occur once every few days or so, and as near as I can tell, only when
> > the cards are in use. When these errors happen, there is a reasonable
> > likelihood that one or more nsrjb processes (these perform mounting,
> > loading, unloading, etc. of tapes) will hang, resulting in frozen backup
> > operations. Sometimes, the affected tape will be prematurely marked full
> > by the software -- no doubt, a nasty side effect of this phenomenon.
> > Communication from the
> > host to the affected devices (tape drives) will often be terminated, but
> > other times, the communication is unaffected. In either case, the syntax
> > of the messages does not seem to differ. Sometimes the host itself will
> > lock up and must be cold booted, but normally the machine is fine, and
> > the worst case scenario is that no further communication with the
> > attached devices is possible until the machine is rebooted.
> >
> > SETUP: The host in question is a Dell PowerEdge 2550, BIOS Revision A06.
> > 'uname -a' shows: 2.4.20-13.7smp #1 SMP Mon May 12 12:31:27 EDT 2003
> > i686 unknown
> >
> > On bootup, I see:
> >
> > Adaptec SCSI Card 39160 BIOS, (c) 2000 Adaptec, Inc.
> > v2.57.2S2
> >
> > for the card that manages our P1000 tape library and
> >
> > v 3.10.0 (c) 2001
> >
> > for the card managing the Storagetek library.
> >
> > We're running one Storagetek L80 LTO tape library on one of the cards.
> > In this case both channels are being used. Specifically, the picker and
> > two drives on the library are daisy chained and attach to channel A on
> > the SCSI card via one LVD cable, and the other two drives are daisy
> > chained and attach to channel B on the SCSI card. The other library is
> > an ATL P1000 tape library with two SDLT drives and connects to the other
> > Adaptec card, using only one channel. Both libraries are terminated
> > properly, and all cables have been checked. As I said, communication is
> > restored once the host is rebooted.
> >
> > I did check Adaptec's page, and it appears that we have the latest
> > firmware release for the second card, but we're behind on the other one.
> > I could download that for the other card and flash its BIOS, but this
> > seems to be a one shot "better know you really want to do this" deal.
> > Not sure if I should do this, but I was thinking that maybe the current
> > driver won't work reliably with the older BIOS, so maybe this is part of
> > the problem. I'd read that many changes were included in the new Adaptec
> > driver. I don't know how to determine the version we're using. I'm
> > wondering if we can patch the driver and if so how? I thought that Linux
> > just uses a built-in driver, and the only way to get the next version is
> > to upgrade the OS? Anyway, I'm thinking we have all the right settings
> > in the config for the cards, but maybe we're using the wrong version of
> > the driver, or that older BIOS is the problem. Maybe it's kludging up
> > the other card. When the communication is terminated, it appears to
> > affect both libraries, but I've not tried running just one library.
> > Could do that, but I've seen so many things out there about these cards
> > that seem to match our symptoms that I thought I'd start with the driver
> > as the culprit.
> >
> > I can't imagine the primary server having anything to do with this
> > mischief, but just for information, it's running 6.1.1 on Sun Solaris 8.
> >
> > I'm thinking we should flash the firmware on the one card to bring it up
> > to the latest version and then download the latest version of the
> > Adaptec driver, install it as a module and re-compile the kernel, but
> > again, I'm not sure how the current driver is integrated into the kernel,
> > or whether or how I would go about doing this.
> >
> > Would appreciate any help.
> >
> > George
> > George.Sinclair AT noaa DOT gov
> >
> > <<< /var/log/messages >>>
> > Jul 7 04:53:30 hostname kernel: scsi0:0:3:0: Attempting to queue an ABORT message
> > Jul 7 04:53:30 hostname kernel: scsi0: Dumping Card State in Command phase, at SEQADDR 0x168
> > Jul 7 04:53:30 hostname kernel: ACCUM = 0x80, SINDEX = 0xa0, DINDEX = 0xe4, ARG_2 = 0x0
> > Jul 7 04:53:30 hostname kernel: HCNT = 0x0 SCBPTR = 0x0
> > Jul 7 04:53:30 hostname kernel: SCSISEQ = 0x12, SBLKCTL = 0xa
> > Jul 7 04:53:30 hostname kernel: DFCNTRL = 0x4, DFSTATUS = 0x89
> > Jul 7 04:53:30 hostname kernel: LASTPHASE = 0x80, SCSISIGI = 0x44, SXFRCTL0 = 0x88
> > Jul 7 04:53:30 hostname kernel: SSTAT0 = 0x7, SSTAT1 = 0x11
> > Jul 7 04:53:30 hostname kernel: SCSIPHASE = 0x2
> > Jul 7 04:53:30 hostname kernel: STACK == 0x175, 0x160, 0xe7, 0x34
> > Jul 7 04:53:30 hostname kernel: SCB count = 4
> > Jul 7 04:53:30 hostname kernel: Kernel NEXTQSCB = 0
> > Jul 7 04:53:30 hostname kernel: Card NEXTQSCB = 0
> > Jul 7 04:53:30 hostname kernel: QINFIFO entries:
> > Jul 7 04:53:30 hostname kernel: Waiting Queue entries:
> > Jul 7 04:53:30 hostname kernel: Disconnected Queue entries:
> > Jul 7 04:53:30 hostname kernel: QOUTFIFO entries:
> > Jul 7 04:53:30 hostname kernel: Sequencer Free SCB List: 2 1 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31
> > Jul 7 04:53:30 hostname kernel: Sequencer SCB Info: 0(c 0x40, s 0x37, l 0,
> > t 0x1) 1(c 0x40, s 0x37, l 0, t 0xff) 2(c 0x40, s 0x27, l 0, t 0xff) 3(c
> > 0x0, s 0xff, l 255, t 0xff) 4(c 0x0, s 0xff, l 255, t 0xff) 5(c 0x0, s
> > 0xff, l 255, t 0xff) 6(c 0x0, s 0xff, l 255, t 0xff) 7(c 0x0, s 0xff, l
> > 255, t 0xff) 8(c 0x0, s 0xff, l 255, t 0xff) 9(c 0x0, s 0xff, l 255, t
> > 0xff) 10(c 0x0, s 0xff, l 255, t 0xff) 11(c 0x0, s 0xff, l 255, t 0xff)
> > 12(c 0x0, s 0xff, l 255, t 0xff) 13(c 0x0, s 0xff, l 255, t 0xff) 14(c
> > 0x0, s 0xff, l 255, t 0xff) 15(c 0x0, s 0xff, l 255, t 0xff) 16(c 0x0, s
> > 0xff, l 255, t 0xff) 17(c 0x0, s 0xff, l 255, t 0xff) 18(c 0x0, s 0xff,
> > l 255, t 0xff) 19(c 0x0, s 0xff, l 255, t 0xff) 20(c 0x0, s 0xff, l 255,
> > t 0xff) 21(c 0x0, s 0xff, l 255, t 0xff) 22(c 0x0, s 0xff, l 255, t
> > 0xff) 23(c 0x0, s 0xff, l 255, t 0xff) 24(c 0x0, s 0xff, l 255, t 0xff)
> > 25(c 0x0, s 0xff, l 255, t 0xff) 26(c 0x0, s 0xff, l 255, t 0xff) 27(c
> > 0x0, s 0xff, l 255, t 0xff) 28(c 0x0, s 0xff, l 255, t 0xff) 29(c 0x0, s
> > 0xff, l 255, t 0xff)
> > Jul 7 04:53:30 hostname kernel: 30(c 0x0, s 0xff, l 255, t 0xff) 31(c
> > 0x0, s 0xff, l 255, t 0xff)
> > Jul 7 04:53:30 hostname kernel: Pending list: 1(c 0x40, s 0x37, l 0)
> > Jul 7 04:53:30 hostname kernel: Kernel Free SCB list: 2 3
> > Jul 7 04:53:30 hostname kernel: Untagged Q(3): 1
> > Jul 7 04:53:30 hostname kernel: DevQ(0:0:0): 0 waiting
> > Jul 7 04:53:30 hostname kernel: DevQ(0:2:0): 0 waiting
> > Jul 7 04:53:30 hostname kernel: DevQ(0:3:0): 0 waiting
> > Jul 7 04:53:30 hostname kernel: scsi0:0:3:0: Device is active, asserting ATN
> > Jul 7 04:53:30 hostname kernel: Recovery code sleeping
> > Jul 7 04:53:30 hostname kernel: (scsi0:A:3:0): Abort Message Sent
> > Jul 7 04:53:30 hostname kernel: (scsi0:A:3:0): SCB 1 - Abort Completed.
> > Jul 7 04:53:30 hostname kernel: Recovery SCB completes
> > Jul 7 04:53:30 hostname kernel: Recovery code awake
> > Jul 7 04:53:30 hostname kernel: aic7xxx_abort returns 0x2002
> >
> > --
> > Note: To sign off this list, send a "signoff networker" command via email
> > to listserv AT listmail.temple DOT edu or visit the list's Web site at
> > http://listmail.temple.edu/archives/networker.html where you can
> > also view and post messages to the list.
> > =*=*=*=*=*=*=*=*=*=*=*=*=*=*=*=*=*=*=*=*=*=*=*=*=*=*=*=*=*=*=*=
>
>

--
=============================================================
Matthew Temple                Tel:    617/632-2597
Director, Research Computing  Fax:    617/582-7820
Dana-Farber Cancer Institute  mht AT research.dfci.harvard DOT edu
44 Binney Street,  ML105      http://research.dfci.harvard.edu
Boston, MA 02115              Choice is the Choice!
