Re: Diagnosing an elusive fault on a critical system [long]

Dear Amanda users,

This follow-up summarizes the responses to, and results from, my first
post on this list.  You all helped me to navigate a very troublesome
issue -- thanks, and may this discussion point other forlorn users
toward happy resolutions.

I will provide only the outline of my original message, the entirety of
which is presumably available in the archives.

On Mon, 19 Aug 2002 13:15:39 -0500, I wrote:

 > <snip>
 > 
 > I am in a bad spot.  <snip>
 > 
 > The REAL problem is that this machine has been crashing periodically.
 > It does not always crash in the same way.  It does consistently crash
 > on Saturday mornings, toward the end of a lengthy Amanda amdump run.
 > 
 > The system was up and running since the installation in early May.  A
 > 2.4.9-31mppe kernel has been in use since the third week of May.
 > Amanda backups of local drives began at the end of May, with the
 > addition of NT server shares in early June.  There was a lengthy power
 > outage June 14th - 15th, but this system was powered down before the
 > UPS gave out.  The RH 6.0 network server and firewall have more
 > recently been added as Amanda client systems.
 > 
 > Since the first two anomalies were under heavy load and completely
 > different, I guessed there was a heat issue (see system specs below for
 > the logic of this).  There was a silent, hard crash the first time
 > (June 29, a little after 1:30 am), and hard drive errors the second
 > time (July 20).
 > 
 > Logs from hard drive errors:
 > 
 >   <snip>
 > 
 > After I removed and added /dev/hda7, I ran a CVS update of /etc (like
 > the author of the recent Linux Journal article, I keep my life in a CVS
 > archive).  More disk errors:
 > 
 >   <snip>
 > 
 > I removed and added /dev/hda5 and all was well.
 > 
 > These drive errors were completely transient; I had no more disk errors
 > afterward although we continued to run in this state through the end of
 > July, when I rebooted after updating the openssl RPMs.  Weird, isn't
 > it?  Surely something was overheating, right?  We changed the office
 > thermostat to leave the fans running 24/7, though the air conditioners
 > are still at 78F except between 6am and 10pm weekdays, when it cools
 > down to 74F.
 > 
 > After a third crash under the same circumstances (Aug 10), involving a
 > long run of "kernel: Oops" messages this time, I ordered additional
 > fans and pulled the cover off the case to let it breathe freely until I
 > could take it down and install the fans.
 > 
 > Guess what -- it crashed again last Saturday morning.  More "kernel:
 > Oops" messages.  I guess it probably isn't a heat dissipation
 > problem...  <:-(
 > 
 > I won't include all the "kernel: Oops" dumps, but here are the initial
 > ones from the August 10 and 17 crashes:
 > 
 >   <snip>
 > 
 > <snip>
 > 
 > Before I drone on with more data, some thoughts I have had:
 > 
 >   - Could the power supply be inadequate?

The consensus:  no.

 >   - Does the custom kernel have a problem (there _are_ newer kernels
 >     out there, but I've avoided building my own up to this point and we
 >     need the MPPE patches)?

Somewhat suspicious -- see below.

 >   - What's the problem with Amanda runs?  Sure the CPU, disk and
 >     network are busy, and there's lots of activity on the SCSI tape,
 >     but that's life, buddy!

Details of system activity logging below.

 > HARDWARE:
 > 
 >   Motherboard:          Tyan Trinity K7 (S2380)
 >   CPU:                  AMD Athlon Slot A 750 MHz
 >   Case/PS:              InWin ATX Full Tower Case Q500 w/300w PS and
 >                         added front intake fan
 >   Memory:               128 Mb

This was unsupported ECC RAM -- see questions below.

 >   Storage:              Promise (PDC20267) PCI IDE controller
 >                         Tekram SCSI controller (sym53c8xx: 53c875
 >                           detected with Tekram NVRAM)
 >                         4 IBM-DTLA-307030 (30 Gb) drives (hd[aceg])
 >                         Pioneer DVD-ROM ATAPIModel DVD-106S 012 (hdb)
 >                         Sony SDX-300C AIT SCSI tape
 >                         Exabyte EXB-8200 (tried, unsuccessfully, to
 >                           reuse 8mm dump tapes from the Sun server)
 >   Networking:           SMC1211TX EZCard 10/100 (RealTek RTL8139)
 > 
 > SOFTWARE:
 > 
 > This is a Red Hat 7.2 system, with all RPMS directly from install or
 > Red Hat updates, with the exception of MPPE RPMS from
 > ftp://ftp.planetmirror.com/pub/mppe:
 > 
 >   kernel-2.4.9-31mppe.i386.rpm
 >   kernel-doc-2.4.9-31mppe.i386.rpm
 >   kernel-headers-2.4.9-31mppe.i386.rpm
 >   kernel-source-2.4.9-31mppe.i386.rpm
 >   ppp-2.4.1-3mppe.i386.rpm
 >   pptpd-1.1.3-1.i386.rpm
 > 
 >   Kernel:               Linux version 2.4.9-31mppe (root@richard) (gcc
 >                         version 2.96 20000731 (Red Hat Linux 7.1
 >                         2.96-98)) #1 Tue Mar 5 18:47:37 CET 2002
 >   Filesystems:
 >     <snip>
 > 
 > SERVICES:
 > 
 > SysVInit at runlevel 5:
 >   anacron apmd atd autofs crond gpm ipchains iptables isdn keytable
 >   kudzu lpd netfs network nfs nfslock ntpd p4d portmap pptpd random
 >   rawdevices sendmail smb sshd syslog wine xfs xinetd
 > 
 > Via xinetd:
 >   amanda amanda amandaidx amidxtape imap ipop3 sgi_fam talk telnet
 >   wu-ftpd (I don't know why chkconfig shows amanda twice...)
 > 
 > MISC. KERNEL INFO:
 > 
 >   <snip>
 > 
 > Thanks in advance, especially if you actually read this far!!  Only a
 > true Linux fan would have stayed awake to this, the 390th line of this
 > message.  :)

I posted my message to both redhat-list and amanda-users.  The ranked
responses were (most messages had more than one suggestion):

  6     Flaky RAM
  4     Not enough RAM
  3     Buggy version of the kernel
  2     Disk/tape controller problem
  1     BIOS settings issue
  1     Motherboard cache problem
  1     CPU problem

It was observed that one of the kernel "Oops" messages (the Linux
equivalent of a blue screen of death, except it doesn't always
die...immediately) was specifically related to allocation of a memory
page.

Other recommended resources:  http://www.bitwizard.nl/sig11/ (about
intermittent segmentation "SIG11" faults and what causes them),
news.linux-sxs.org (ask the resident expert(s)), Linux kernel mailing
list, http://www.linuxmanagers.org/.

Recommended tests:  memtest (an off-line memory tester you run from a
boot floppy, but impractical for deployed servers), measure voltages
under load (they didn't suggest _how_), kernel-compile loop (build and
rebuild the Linux kernel ad nauseam...or is that ad crasheum... to load
the system heavily and stress the memory).

Someone also pointed out that removing the system cover prevents the
fans from producing forced air flow, possibly _contributing_ to heat
problems rather than solving them.  However, pegasus was in the path of
a blower vent that (I thought) is always on.

A priceless quote from one response:

  "Very un-nice problem. Poor you :(
   I do not wish this to any sysadmin."

In light of all this input, I started some kernel building.  Ten
iterations of building went flawlessly under my casual observation, so
at about 5:45pm, before I logged out, I started a set of 50 to run over
several hours, spanning the evening Amanda run.  Never got there.  :)
At 6:08pm the 4th of 50 kernel builds was interrupted by a Segmentation
fault, the very symptom I was looking for.  Many services on pegasus
went down at the same instant.  The 10:30pm scheduled Amanda run was
forgotten.  Finally, the house of cards completely collapsed at about
11:15pm, when all contact with pegasus was lost.
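
A kernel-compile loop of this kind can be sketched in a few lines of
Python.  This is only an illustration under stated assumptions, not the
commands used above: the function name stress_loop is made up, and the
build command in the closing comment is a guess at a typical 2.4-era
invocation.  The loop stops at the first run that exits abnormally,
since an intermittent failure of an otherwise repeatable build is the
symptom being hunted.

```python
import subprocess
import sys


def stress_loop(command, iterations):
    """Run `command` up to `iterations` times.

    Returns (run_number, returncode) for the first run that fails,
    or None if every run exits with status 0.  On POSIX systems a
    process killed by a signal is reported by subprocess as a
    negative return code, so a segmentation fault (signal 11)
    shows up as -11.
    """
    for run in range(1, iterations + 1):
        result = subprocess.run(command)
        if result.returncode != 0:
            return run, result.returncode
    return None


if __name__ == "__main__":
    # Harmless stand-in command so the sketch runs anywhere.
    print(stress_loop([sys.executable, "-c", "pass"], 3))
    # Hypothetical real use, run from a kernel source tree:
    #   stress_loop(["make", "bzImage"], 50)
```

A clean tree between iterations (so that each run really recompiles
everything) is left out here and would need to be added for a real
test.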

Replacing the existing 128 Mb SDRAM with two 256 Mb SDRAM modules was a
perfect fix.  I compiled the kernel 100 times without encountering any
problems, while running other processes to elevate I/O and CPU loads.
Two Amanda runs have completed flawlessly with full dumps of all disks.
It's also fun to have almost a gigabyte of virtual memory to play with,
and all the RAM is allowing us to cache a few hundred megabytes of
data, making most repetitive disk operations (like dump estimates) fly
like the wind.  Even so, last Tuesday night we used some swap space...
:)

BTW, the original 128 Mb SDRAM was registered ECC memory, but the Tyan
Trinity K7 motherboard does NOT support ECC RAM.  I assume that this
means that the ECC feature will not be used, but is it possible that
this unsupported RAM flavor was actually _causing_ part of the problem?
Should I assume that this RAM is bad, or start using it in a system
that is designed to use ECC RAM?

A review of /var/log/sa/sar* files shows that the critical moment in
the Amanda runs brings the highest sustained levels of context switches
(> 5000 cswch/s), CPU activity (< .5% idle over a 30 minute period) and
paging activity (> 5000 combined pgpgin/s and pgpgout/s) seen on this
system.  So that's the problem with Amanda runs -- they stress the
system as much as, if not more than, building kernels.
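
The three thresholds above can be written down as a small check.  This
is a minimal sketch, assuming the sar figures have already been
extracted into a dictionary; the key names and the sample values are
hypothetical (they are not measurements from pegasus), and no attempt
is made to parse sar output itself.

```python
# Thresholds quoted in the paragraph above.
CSWCH_PER_S = 5000.0    # context switches per second
PAGING_PER_S = 5000.0   # combined pgpgin/s + pgpgout/s
IDLE_PCT = 0.5          # percent CPU idle


def stressed_metrics(sample):
    """Return the names of the metrics that cross their threshold."""
    exceeded = []
    if sample["cswch_per_s"] > CSWCH_PER_S:
        exceeded.append("context switches")
    paging = sample["pgpgin_per_s"] + sample["pgpgout_per_s"]
    if paging > PAGING_PER_S:
        exceeded.append("paging")
    if sample["idle_pct"] < IDLE_PCT:
        exceeded.append("idle")
    return exceeded


# Illustrative values only.
quiet = {"cswch_per_s": 300.0, "pgpgin_per_s": 40.0,
         "pgpgout_per_s": 25.0, "idle_pct": 92.0}
busy = {"cswch_per_s": 5400.0, "pgpgin_per_s": 3100.0,
        "pgpgout_per_s": 2300.0, "idle_pct": 0.3}

if __name__ == "__main__":
    print(stressed_metrics(quiet))
    print(stressed_metrics(busy))
```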

Adding two more case fans had the desired effect of
lowering the system's running temperature.  The CPU now stays at about
98 F during business hours, 101 F evenings and weekends (see the
thermostat settings description earlier).  Significantly, though, at
the most intensive time of archiving/compressing/taping, the CPU temp
does top 104 F briefly.  Without the fans I'd guess that would have
been closer to 110 F or more.
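
For readers who think in Celsius, the temperatures above convert as
follows.  The helper name f_to_c is just for illustration; the formula
is the standard C = (F - 32) * 5 / 9.

```python
def f_to_c(fahrenheit):
    """Convert degrees Fahrenheit to degrees Celsius."""
    return (fahrenheit - 32) * 5 / 9


if __name__ == "__main__":
    # CPU temperatures mentioned above: business hours, evenings and
    # weekends, the peak during an Amanda run, and the guessed peak
    # without the extra fans.
    for temp_f in (98, 101, 104, 110):
        print(f"{temp_f} F = {f_to_c(temp_f):.1f} C")
```

So the observed peak of 104 F is 40 C, and the guessed 110 F would have
been a little over 43 C.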

By the way, I considered making one new fan (less than halfway up from
the bottom) an intake fan, and the other (above the power supply) an
exhaust fan.  More than one "Build the perfect Linux box" article had
warned of creating negative pressure in the case, implying that air
flow in the power supply might decrease and put the power supply at
risk.  After pondering this for a bit, I decided that since the case
has LARGE HOLES in the front and sides, designed to allow air to flow
in and cool internal drives, this was probably a non-issue.  Now I
can easily feel the air being drawn into the vents and I am more
confident that no devices will overheat with the triple exhaust in the
back.

Finally, I did an rpm -Va to test the possibility that the flaky memory
might have produced corrupted files during installation and upgrades.
No signs of this, though.
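
On a system with many packages the output of rpm -Va can be long, so a
small filter helps.  The sketch below assumes the usual verify output
layout: a flags string, an optional one-letter attribute marker such as
"c" for config files, then the path.  It keeps only non-config files
whose flags contain "5" (digest mismatch), since those are the ones
that would suggest corrupted contents.  The function name and the
sample lines are made up for illustration, and paths containing spaces
are not handled.

```python
def changed_content(verify_output):
    """Return paths of non-config files whose digest check failed."""
    suspects = []
    for line in verify_output.splitlines():
        parts = line.split()
        # Skip blank lines and "missing <path>" entries.
        if len(parts) < 2 or parts[0] == "missing":
            continue
        flags, path = parts[0], parts[-1]
        # Config files are expected to change after installation.
        is_config = len(parts) == 3 and parts[1] == "c"
        if "5" in flags and not is_config:
            suspects.append(path)
    return suspects


# Illustrative input only, not output from the system discussed here.
SAMPLE = """\
S.5....T c /etc/hosts
.M...... /usr/bin/example
S.5....T /usr/lib/libexample.so
missing    /var/tmp/example
"""

if __name__ == "__main__":
    print(changed_content(SAMPLE))
```

An empty result corresponds to the "no signs of this" outcome reported
above.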

Again, thanks for your help.  Another Linux support success story for
the archives.  :)

Truly,

  Jonathan

-- 
 /       Jonathan R. Johnson       | "Every word of God is flawless." \
 |    Minnetonka Software, Inc.    |                 -- Proverbs 30:5 |
 \ johnsonj AT MinnetonkaSoftware DOT com |  My own words only speak for me. /