Re: strange behavior when tape drive needs cleaning

Geert, I believe you have described the symptoms precisely.  Yes, I am
running Linux.  I have inserted some kernel log excerpts below to show
the symptoms in more detail.  I suppose this is another bug in this
particular driver (aic79xx SCSI) that needs reporting.

Note that SCSI id=6 is the tape drive, and it is the first device to
show an error.  This happened again last night, so perhaps it is more
than just a dirty tape drive.  On Friday afternoon (the last backup),
an entire backup completed without an error.

Apr 16 01:24:08 fea8 kernel: (scsi0:0:6:0) Data overrun detected in Data-Out phase, tag 0;
Apr 16 01:24:08 fea8 kernel:   Have seen Data Phase. Length=0, NumSGs=0.
Apr 16 01:24:08 fea8 kernel:   Raw SCSI Command: 0x0a 01 00 00 20 00
Apr 16 01:24:08 fea8 kernel: st0: Error 70000 (sugg. bt 0x0, driver bt 0x0, host bt 0x7).
Apr 16 01:24:09 fea8 kernel: PCI-DMA: Out of IOMMU space for 36864 bytes at device 0000:03:05.0
Apr 16 01:24:09 fea8 kernel: ata2.00: exception Emask 0x0 SAct 0x0 SErr 0x0 action 0x0
Apr 16 01:24:09 fea8 kernel: ata2.00: cmd 35/00:00:05:85:10/00:04:1b:00:00/e0 tag 0 cdb 0x0 data 524288 out
Apr 16 01:24:09 fea8 kernel:          res 50/00:00:86:0b:70/00:00:00:00:00/e2 Emask 0x40 (internal error)
Apr 16 01:24:09 fea8 kernel: ata2.00: configured for UDMA/100
Apr 16 01:24:09 fea8 kernel: ata2: EH complete
Apr 16 01:24:09 fea8 kernel: PCI-DMA: Out of IOMMU space for 36864 bytes at device 0000:03:05.0
Apr 16 01:24:09 fea8 kernel: ata2.00: exception Emask 0x0 SAct 0x0 SErr 0x0 action 0x0
Apr 16 01:24:09 fea8 kernel: ata2.00: cmd 35/00:00:05:85:10/00:04:1b:00:00/e0 tag 0 cdb 0x0 data 524288 out
Apr 16 01:24:09 fea8 kernel:          res 50/00:44:00:01:80/00:00:00:00:00/a0 Emask 0x40 (internal error)
Apr 16 01:24:09 fea8 kernel: ata2.00: configured for UDMA/100
Apr 16 01:24:09 fea8 kernel: ata2: EH complete
Apr 16 01:24:09 fea8 kernel: PCI-DMA: Out of IOMMU space for 36864 bytes at device 0000:03:05.0
Apr 16 01:24:09 fea8 kernel: ata2.00: exception Emask 0x0 SAct 0x0 SErr 0x0 action 0x0
Apr 16 01:24:09 fea8 kernel: ata2.00: cmd 35/00:00:05:85:10/00:04:1b:00:00/e0 tag 0 cdb 0x0 data 524288 out
Apr 16 01:24:09 fea8 kernel:          res 50/00:44:00:01:80/03:00:00:00:00/a0 Emask 0x40 (internal error)
Apr 16 01:24:09 fea8 kernel: ata2.00: configured for UDMA/100
Apr 16 01:24:09 fea8 kernel: ata2: EH complete
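
For anyone reading the excerpt above, here is a rough sketch of how the
two tape-related values in it can be picked apart.  It is only an
illustration, not anything taken from the driver: the function names
are my own, I am assuming the usual Linux layout of the SCSI result
word (status, message, host and driver bytes, from low to high), and I
am assuming the 6-byte command with opcode 0x0a is the WRITE(6) form
used for tape devices.

```python
# Illustrative sketch only; function names are hypothetical.
# Assumption: the st "Error 70000" value is a hexadecimal SCSI result
# word laid out as driver<<24 | host<<16 | msg<<8 | status.

def decode_scsi_result(result):
    """Split a 32-bit SCSI result word into its four bytes."""
    return {
        "status": result & 0xFF,
        "msg": (result >> 8) & 0xFF,
        "host": (result >> 16) & 0xFF,
        "driver": (result >> 24) & 0xFF,
    }

def decode_write6_tape(cdb):
    """Decode a 6-byte WRITE(6) command for a sequential-access device."""
    if len(cdb) != 6 or cdb[0] != 0x0A:
        raise ValueError("not a 6-byte WRITE(6) CDB")
    return {
        # Bit 0 of byte 1 is the FIXED bit (length counted in blocks).
        "fixed": bool(cdb[1] & 0x01),
        # Bytes 2..4 hold the 24-bit transfer length, big-endian.
        "transfer_length": (cdb[2] << 16) | (cdb[3] << 8) | cdb[4],
        "control": cdb[5],
    }

# Values copied from the log lines above.
result = decode_scsi_result(0x70000)
cdb = decode_write6_tape(bytes.fromhex("0a 01 00 00 20 00".replace(" ", "")))

print(result)
print(cdb)
```

Under those assumptions the host byte comes out as 0x07, which matches
the "host bt 0x7" in the st0 message, and the command is a fixed-block
write with a transfer length of 32 blocks.  The block size is not in
the log, so the byte count cannot be derived from this alone.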
-
On Sat, 2007-04-14 at 16:28 +0200, Geert Uytterhoeven wrote:
> On Fri, 13 Apr 2007, Freels, James D. wrote:
> > I discovered today that my tape drive was dirty and did not realize it.
> > What happened was the drive sent a scsi error to the kernel/OS and
> > somehow the root (/), home (/home), and holding disk area recognized by
> > AMANDA was changed from rw access to ro access.  I believe this change
> > was made by AMANDA itself.  This happened while AMANDA was backing up.
> > 
> > I had to reboot the machine to place the ro filesystems back to rw as
> > they should be.  I then repeated the attempted backup and the failure
> > occurred again exactly the same way (so it was repeatable).  This is
> > when I suspected the dirty tape drive.  I cleaned the drive and the
> > problem went away.  The backups now work like they should (and have for
> > years).
> > 
> > This is the first time I have seen this.  The drive got dirty due to a
> > different person changing the tapes did not realize they should also
> > clean the drive occasionally.  I also have new higher-capacity tapes so
> > that the same number of tapes will get the driver dirtier quicker.
> > 
> > Does the switch from rw to ro by AMANDA make sense ??  Is this a
> > "feature" ?  This is first I have heard of this.
> 
> The Linux kernel (I assume you run Linux?) automatically remounts a file
> system ro if it notices a medium error on the underlying disk.
> 
> Since there are no actual medium errors on the disk, but problems on the
> tape drive (both are on the same SCSI host adapter?), it looks like the SCSI
> driver has a bug and incorrectly told the upper layer about an error on
> the disk. Is there some more info about the actual error(s) in the kernel 
> logs?
> 
> Gr{oetje,eeting}s,
> 
>                                               Geert
> 
> --
> Geert Uytterhoeven -- There's lots of Linux beyond ia32 -- geert AT 
> linux-m68k DOT org
> 
> In personal conversations with technical people, I call myself a hacker. But
> when I'm talking to journalists I just say "programmer" or something like 
> that.
>                                                           -- Linus Torvalds
-- 
James D. Freels, Ph.D.
Oak Ridge National Laboratory
freelsjd AT ornl DOT gov

"Windmills can't even produce enough energy to manufacture a windmill."
-Ann Coulter, 03/01/07.