Dear Amanda users,
Life with Amanda is swell. The RH 7.2 localhost, an NT server and two
RH 6.0 systems (one with strict ipchains firewalling) are all being
backed up, and restores have been successfully carried out on each
system. I love a backup system that allows me to do all these things,
it works (the first time, usually), and I don't even have to darken the
door of the home office to pull it off!! The documentation and mail
list archives are always good sources of help, and the FAQ-o-matic is
occasionally helpful, too.
[ BTW, little Amanda Ruth will be two months old this Saturday. She's
doing great, too, but requires much more hands-on maintenance. :) ]
Now I'm back on the list. Sorry to re-join the list like this, but...
I am a bad spot. Our company has taken the step to replace an ancient
Sun Sparc II and a recently-compromised RH 6.0 network server with a
new RH 7.2 omni-server with software RAID, backup tape, VPN, updated
network services and increased security. So far so good.
Unfortunately, it's taken longer than expected to configure and migrate
all the services, so the costs are running up and the management is
less than enthusiastic with things at this point. But this isn't the
real problem.
The REAL problem is that this machine has been crashing periodically.
It does not always crash in the same way. It does consistently crash
on Saturday mornings, toward the end of a lengthy Amanda amdump run.
The system was up and running since the installation in early May. A
2.4.9-31mppe kernel has been in use since the third week of May.
Amanda backups of local drives began at the end of May, with the
addition of NT server shares in early June. There was a lengthy power
outage June 14th - 15th, but this system was powered down before the
UPS gave out. The RH 6.0 network server and firewall have more
recently been added as Amanda client systems.
Since the first two anomalies were under heavy load and completely
different, I guessed there was a heat issue (see system specs below for
the logic of this). There was a silent, hard crash the first time
(June 29, a little after 1:30 am), and hard drive errors the second
time (July 20).
Logs from hard drive errors:
Jul 20 06:17:51 pegasus kernel: hda: dma_intr: status=0x51 { DriveReady
SeekComplete Error }
Jul 20 06:17:51 pegasus kernel: hda: dma_intr: error=0x40 {
UncorrectableError }, LBAsect=30879791, sector=548864
Jul 20 06:17:51 pegasus kernel: end_request: I/O error, dev 03:07 (hda),
sector 548864
Jul 20 06:17:51 pegasus kernel: raid5: Disk failure on hda7, disabling
device. Operation continuing on 3 devices
After I removed and added /dev/hda7, I ran a CVS update of /etc (like
the author of the recent Linux Journal article, I keep my life in a CVS
archive). More disk errors:
Jul 20 14:15:05 pegasus kernel: hda: dma_intr: status=0x51 { DriveReady
SeekComplete Error }
Jul 20 14:15:05 pegasus kernel: hda: dma_intr: error=0x40 {
UncorrectableError }, LBAsect=343674, sector=22272
Jul 20 14:15:05 pegasus kernel: end_request: I/O error, dev 03:05 (hda),
sector 22272
Jul 20 14:15:05 pegasus kernel: raid5: Disk failure on hda5, disabling
device. Operation continuing on 3 devices
I removed and added /hda5 and all was well.
These drive errors were completely transient; I had no more disk errors
afterward although we continued to run in this state through the end of
July, when I rebooted after updating the openssl RPMs. Weird, isn't
it? Surely something was overheating, right? We changed the office
thermostat to leave the fans running 24/7, though the air conditioners
are still at 78F except between 6am and 10pm weekdays, when it cools
down to 74F.
After a third crash under the same circumstances (Aug 10), involving a
long run of "kernel: Oops" messages this time, I ordered additional
fans and pulled the cover off the case to let it breathe freely until I
could take it down and install the fans.
Guess what -- it crashed again last Saturday morning. More "kernel:
Oops" messages. I guess it probably isn't a heat dissipation
problem... <:-(
I won't include all the "kernel: Oops" dumps, but here are the initial
ones from the August 10 and 17 crashes:
Aug 10 05:04:19 pegasus sendbackup[9944]: error [/bin/tar got signal 11,
index got signal 11, compress got signal 11]
Aug 10 05:04:19 pegasus kernel: Unable to handle kernel paging request at
virtual address 56aabf94
Aug 10 05:04:19 pegasus kernel: printing eip:
Aug 10 05:04:19 pegasus kernel: c0129885
Aug 10 05:04:19 pegasus kernel: *pde = 00000000
Aug 10 05:04:19 pegasus kernel: Oops: 0000
Aug 10 05:04:19 pegasus kernel: Kernel 2.4.9-31mppe
Aug 10 05:04:19 pegasus kernel: CPU: 0
Aug 10 05:04:19 pegasus kernel: EIP: 0010:[kmalloc+105/256] Tainted: P
Aug 10 05:04:19 pegasus kernel: EIP: 0010:[<c0129885>] Tainted: P
Aug 10 05:04:19 pegasus kernel: EFLAGS: 00010887
Aug 10 05:04:19 pegasus kernel: EIP is at kmalloc [kernel] 0x69
Aug 10 05:04:19 pegasus kernel: eax: a53507df ebx: c14c00b0 ecx: c1d6a000
edx: 00000001
Aug 10 05:04:19 pegasus kernel: esi: 00000246 edi: c14c00b0 ebp: a6a0fbe0
esp: c4bafe60
Aug 10 05:04:19 pegasus kernel: ds: 0018 es: 0018 ss: 0018
Aug 10 05:04:19 pegasus kernel: Process tar (pid: 9948, stackpage=c4baf000)
Aug 10 05:04:19 pegasus kernel: Stack: 000de860 00000082 c033ca00 c033c9c0
c4bae000 00000001 00000070 00000014
Aug 10 05:04:19 pegasus kernel: c8842f1b 00000014 00000070 00000000
c4bae000 c56f5760 c6c0f200 c883d334
Aug 10 05:04:19 pegasus kernel: c88457d0 00000014 00000070 00000001
00000000 00000000 ffffffe2 c56f5760
Aug 10 05:04:19 pegasus kernel: Call Trace:
[8139too:__insmod_8139too_O/lib/modules/2.4.9-31mppe/kernel/drivers/+-479461/96]
journal_blocks_per_page_R338b3e60 [jbd] 0x6b
Aug 10 05:04:19 pegasus kernel: Call Trace: [<c8842f1b>]
journal_blocks_per_page_R338b3e60 [jbd] 0x6b
Aug 10 05:04:19 pegasus kernel:
[8139too:__insmod_8139too_O/lib/modules/2.4.9-31mppe/kernel/drivers/+-502988/96]
journal_start_R819d0567 [jbd] 0x80
Aug 10 05:04:19 pegasus kernel: [<c883d334>] journal_start_R819d0567 [jbd]
0x80
Aug 10 05:04:19 pegasus kernel:
[8139too:__insmod_8139too_O/lib/modules/2.4.9-31mppe/kernel/drivers/+-469040/96]
__insmod_jbd_S.rodata_L96 [jbd] 0x2150
Aug 10 05:04:19 pegasus kernel: [<c88457d0>] __insmod_jbd_S.rodata_L96 [jbd]
0x2150
Aug 10 05:04:19 pegasus kernel:
[8139too:__insmod_8139too_O/lib/modules/2.4.9-31mppe/kernel/drivers/+-435991/96]
__insmod_ext3_S.text_L40936 [ext3] 0x5889
Aug 10 05:04:19 pegasus kernel: [<c884d8e9>] __insmod_ext3_S.text_L40936
[ext3] 0x5889
Aug 10 05:04:19 pegasus kernel: [___wait_on_page+141/148] ___wait_on_page
[kernel] 0x8d
Aug 10 05:04:19 pegasus kernel: [<c0124f11>] ___wait_on_page [kernel] 0x8d
Aug 10 05:04:19 pegasus kernel: [__mark_inode_dirty+42/128]
__mark_inode_dirty [kernel] 0x2a
Aug 10 05:04:19 pegasus kernel: [<c01435e6>] __mark_inode_dirty [kernel] 0x2a
Aug 10 05:04:19 pegasus kernel: [update_atime+74/80] update_atime [kernel]
0x4a
Aug 10 05:04:19 pegasus kernel: [<c0144c5e>] update_atime [kernel] 0x4a
Aug 10 05:04:19 pegasus kernel: [do_generic_file_read+1258/1268]
do_generic_file_read [kernel] 0x4ea
Aug 10 05:04:19 pegasus kernel: [<c0125826>] do_generic_file_read [kernel]
0x4ea
Aug 10 05:04:19 pegasus kernel: [generic_file_read+124/148] generic_file_read
[kernel] 0x7c
Aug 10 05:04:19 pegasus kernel: [<c0125900>] generic_file_read [kernel] 0x7c
Aug 10 05:04:19 pegasus kernel: [file_read_actor+0/84] file_read_actor
[kernel] 0x0
Aug 10 05:04:19 pegasus kernel: [<c0125830>] file_read_actor [kernel] 0x0
Aug 10 05:04:19 pegasus kernel: [sys_read+149/244] sys_read [kernel] 0x95
Aug 10 05:04:19 pegasus kernel: [<c0131de1>] sys_read [kernel] 0x95
Aug 10 05:04:19 pegasus kernel: [sys_open+121/164] sys_open [kernel] 0x79
Aug 10 05:04:19 pegasus kernel: [<c01318f9>] sys_open [kernel] 0x79
Aug 10 05:04:19 pegasus kernel: [system_call+51/56] system_call [kernel] 0x33
Aug 10 05:04:19 pegasus kernel: [<c0106e0f>] system_call [kernel] 0x33
Aug 10 05:04:19 pegasus kernel:
Aug 10 05:04:19 pegasus kernel:
Aug 10 05:04:19 pegasus kernel: Code: 8b 44 81 18 03 69 0c 89 41 14 40 75 26
8b 41 04 8b 11 89 42
Aug 17 06:55:50 pegasus kernel: Unable to handle kernel NULL pointer
dereference at virtual address 00000000
Aug 17 06:55:50 pegasus kernel: printing eip:
Aug 17 06:55:50 pegasus kernel: c012aa6b
Aug 17 06:55:50 pegasus kernel: *pde = 00000000
Aug 17 06:55:50 pegasus kernel: Oops: 0002
Aug 17 06:55:50 pegasus kernel: Kernel 2.4.9-31mppe
Aug 17 06:55:50 pegasus kernel: CPU: 0
Aug 17 06:55:50 pegasus kernel: EIP: 0010:[reclaim_page+147/936]
Tainted: P
Aug 17 06:55:50 pegasus kernel: EIP: 0010:[<c012aa6b>] Tainted: P
Aug 17 06:55:50 pegasus kernel: EFLAGS: 00010282
Aug 17 06:55:50 pegasus kernel: EIP is at reclaim_page [kernel] 0x93
Aug 17 06:55:50 pegasus kernel: eax: c02b5c9c ebx: c10ae9c0 ecx: c10ae9c0
edx: 00000000
Aug 17 06:55:50 pegasus kernel: esi: c10ae9a4 edi: 000000cd ebp: c02b5c70
esp: c3761ebc
Aug 17 06:55:50 pegasus kernel: ds: 0018 es: 0018 ss: 0018
Aug 17 06:55:50 pegasus kernel: Process tar (pid: 590, stackpage=c3761000)
Aug 17 06:55:50 pegasus sendbackup[586]: error [/bin/tar got signal 11]
Aug 17 06:55:50 pegasus kernel: Stack: c02b5c70 c02b5e34 00000001 00000001
c012c529 c02b5c70 00000000 c02b5e3c
Aug 17 06:55:50 pegasus kernel: 00000000 000000d2 c012c629 c02b5e30
00000000 00000001 00000001 00000001
Aug 17 06:55:50 pegasus kernel: c02b5e30 c7f2b804 00000000 c555d158
00000000 c0125712 0000019d 00000001
Aug 17 06:55:50 pegasus kernel: Call Trace: [__alloc_pages_limit+113/152]
__alloc_pages_limit [kernel] 0x71
Aug 17 06:55:50 pegasus kernel: Call Trace: [<c012c529>] __alloc_pages_limit
[kernel] 0x71
Aug 17 06:55:51 pegasus kernel: [__alloc_pages+197/600] __alloc_pages
[kernel] 0xc5
Aug 17 06:55:51 pegasus kernel: [<c012c629>] __alloc_pages [kernel] 0xc5
Aug 17 06:55:51 pegasus kernel: [do_generic_file_read+982/1268]
do_generic_file_read [kernel] 0x3d6
Aug 17 06:55:51 pegasus kernel: [<c0125712>] do_generic_file_read [kernel]
0x3d6
Aug 17 06:55:51 pegasus kernel: [generic_file_read+124/148] generic_file_read
[kernel] 0x7c
Aug 17 06:55:51 pegasus kernel: [<c0125900>] generic_file_read [kernel] 0x7c
Aug 17 06:55:51 pegasus kernel: [file_read_actor+0/84] file_read_actor
[kernel] 0x0
Aug 17 06:55:51 pegasus kernel: [<c0125830>] file_read_actor [kernel] 0x0
Aug 17 06:55:51 pegasus kernel: [sys_read+149/244] sys_read [kernel] 0x95
Aug 17 06:55:51 pegasus kernel: [<c0131de1>] sys_read [kernel] 0x95
Aug 17 06:55:51 pegasus kernel: [sys_open+121/164] sys_open [kernel] 0x79
Aug 17 06:55:51 pegasus kernel: [<c01318f9>] sys_open [kernel] 0x79
Aug 17 06:55:51 pegasus kernel: [system_call+51/56] system_call [kernel] 0x33
Aug 17 06:55:51 pegasus kernel: [<c0106e0f>] system_call [kernel] 0x33
Aug 17 06:55:51 pegasus kernel:
Aug 17 06:55:51 pegasus kernel:
Aug 17 06:55:51 pegasus kernel: Code: 89 02 c7 41 04 00 00 00 00 b8 0b 00 00
00 c7 46 1c 00 00 00
What then? A precise diagnosis of the problem is critical. A solution
is overdue. We can't afford to kill this project, it needs to be
finished, but this system just isn't reliable enough to be the
cornerstone of our corporate network. We've had almost 100%
reliability from our old RH6.0 network server and its firewalling
companion. (Good thing, too, or we'd probably not touch Linux again
for quite a while based on our current situation alone.)
I've run the lm_sensors setup and tried making sensors measurements
every 10 minutes (like /usr/lib/sa/sa1), but after a day or so the
temperatures stay fixed and I start getting syslog messages about bus
collisions from i2c-viapro.o... Thus, no sensor data from actual
crashes.
Here's some basic information, and I'd love to know what additional
information would be most effective in pinpointing the big, bad bug in
this valuable system.
Before I drone on with more data, some thoughts I have had:
- Could the power supply be inadequate?
- Does the custom kernel have a problem (there _are_ newer kernels
out there, but I've avoided building my own up to this point and we
need the MPPE patches)?
- What's the problem with Amanda runs? Sure the CPU, disk and
network are busy, and there's lots of activity on the SCSI tape,
but that's life, buddy!
HARDWARE:
Motherboard: Tyan Trinity K7 (S2380)
CPU: AMD Athlon Slot A 750 MHz
Case/PS: InWin ATX Full Tower Case Q500 w/300w PS and
added front intake fan
Memory: 128 Mb
Storage: Promise (PDC20267) PCI IDE controller
Tekram SCSI controller (sym53c8xx: 53c875
detected with Tekram NVRAM)
4 IBM-DTLA-307030 (30 Gb) drives (hd[aceg])
Pioneer DVD-ROM ATAPIModel DVD-106S 012 (hdb)
Sony SDX-300C AIT SCSI tape
Exabyte EXB-8200 (tried, unsuccessfully, to
reuse 8mm dump tapes from the Sun server)
Networking: SMC1211TX EZCard 10/100 (RealTek RTL8139)
SOFTWARE:
This is a Red Hat 7.2 system, with all RPMS directly from install or
Red Hat updates, with the exception of MPPE RPMS from
ftp://ftp.planetmirror.com/pub/mppe:
kernel-2.4.9-31mppe.i386.rpm
kernel-doc-2.4.9-31mppe.i386.rpm
kernel-headers-2.4.9-31mppe.i386.rpm
kernel-source-2.4.9-31mppe.i386.rpm
ppp-2.4.1-3mppe.i386.rpm
pptpd-1.1.3-1.i386.rpm
Kernel: Linux version 2.4.9-31mppe (root@richard) (gcc
version 2.96 20000731 (Red Hat Linux 7.1
2.96-98)) #1 Tue Mar 5 18:47:37 CET 2002
Filesystems:
$ mount
/dev/md1 on / type ext3 (rw)
none on /proc type proc (rw)
/dev/md2 on /boot type ext3 (rw)
none on /dev/pts type devpts (rw,gid=5,mode=620)
/dev/md3 on /hold type ext3 (rw)
/dev/md0 on /home type ext3 (rw)
none on /dev/shm type tmpfs (rw)
none on /proc/sys/fs/binfmt_misc type binfmt_misc (rw)
$ cat /proc/mdstat
Personalities : [raid1] [raid5]
read_ahead 1024 sectors
md3 : active raid5 hda7[0] hdc7[1] hde7[2] hdg7[3]
44555904 blocks level 5, 64k chunk, algorithm 0 [4/4] [UUUU]
md0 : active raid5 hda6[0] hdc6[1] hde6[2] hdg6[3]
42604032 blocks level 5, 64k chunk, algorithm 0 [4/4] [UUUU]
md1 : active raid5 hda5[0] hdc5[1] hde5[2] hdg5[3]
2409408 blocks level 5, 64k chunk, algorithm 0 [4/4] [UUUU]
md2 : active raid1 hda1[0] hdc1[1]
56128 blocks [2/2] [UU]
unused devices: <none>
$ cat /proc/swaps
Filename Type Size Used Priority
/dev/hde1 partition 160608 16852 -1
/dev/hdg1 partition 160608 0 -2
/dev/hda2 partition 104412 0 -3
/dev/hdc2 partition 104412 0 -4
SERVICES:
SysVInit at runlevel 5:
anacron apmd atd autofs crond gpm ipchains iptables isdn keytable
kudzu lpd netfs network nfs nfslock ntpd p4d portmap pptpd random
rawdevices sendmail smb sshd syslog wine xfs xinetd
Via xinetd:
amanda amanda amandaidx amidxtape imap ipop3 sgi_fam talk telnet
wu-ftpd (I don't know why chkconfig shows amanda twice...)
MISC. KERNEL INFO:
$ cat /proc/interrupts
CPU0
0: 1471213 XT-PIC timer
1: 6 XT-PIC keyboard
2: 0 XT-PIC cascade
4: 16 XT-PIC serial
8: 1 XT-PIC rtc
10: 13562417 XT-PIC ide2, ide3
11: 30 XT-PIC sym53c8xx
12: 34156 XT-PIC eth0
14: 7376543 XT-PIC ide0
15: 7382515 XT-PIC ide1
NMI: 0
ERR: 0
$ cat /proc/iomem
00000000-0009ffff : System RAM
000a0000-000bffff : Video RAM area
000c0000-000c7fff : Video ROM
000c8000-000c9fff : Extension ROM
000ca000-000ca1ff : Extension ROM
000f0000-000fffff : System ROM
00100000-07feffff : System RAM
00100000-002b270a : Kernel code
002b270b-002c92eb : Kernel data
07ff0000-07ff2fff : ACPI Non-volatile Storage
07ff3000-07ffffff : ACPI Tables
d0000000-d3ffffff : VIA Technologies, Inc. VT8371 [KX133]
d4000000-d7ffffff : PCI Bus #01
d4000000-d4ffffff : ATI Technologies Inc 3D Rage Pro AGP 1X/2X
d6000000-d6000fff : ATI Technologies Inc 3D Rage Pro AGP 1X/2X
d9000000-d901ffff : Promise Technology, Inc. 20267
d9020000-d90200ff : Accton Technology Corporation SMC2-1211TX
d9020000-d90200ff : 8139too
d9021000-d90210ff : Symbios Logic Inc. (formerly NCR) 53c875
d9022000-d9022fff : Symbios Logic Inc. (formerly NCR) 53c875
ffff0000-ffffffff : reserved
$ cat /proc/ioports
0000-001f : dma1
0020-003f : pic1
0040-005f : timer
0060-006f : keyboard
0070-007f : rtc
0080-008f : dma page reg
00a0-00bf : pic2
00c0-00df : dma2
00f0-00ff : fpu
0170-0177 : ide1
01f0-01f7 : ide0
02f8-02ff : serial(auto)
0376-0376 : ide1
03c0-03df : vga+
03f6-03f6 : ide0
03f8-03ff : serial(auto)
0cf8-0cff : PCI conf1
4000-40ff : VIA Technologies, Inc. VT82C686 [Apollo Super ACPI]
5000-500f : VIA Technologies, Inc. VT82C686 [Apollo Super ACPI]
5000-5007 : via2-smbus
6000-607f : VIA Technologies, Inc. VT82C686 [Apollo Super ACPI]
6000-607f : via686a-sensors
9000-9fff : PCI Bus #01
9000-90ff : ATI Technologies Inc 3D Rage Pro AGP 1X/2X
a000-a00f : VIA Technologies, Inc. Bus Master IDE
a000-a007 : ide0
a008-a00f : ide1
ac00-ac07 : Promise Technology, Inc. 20267
ac00-ac07 : ide2
b000-b003 : Promise Technology, Inc. 20267
b002-b002 : ide2
b400-b407 : Promise Technology, Inc. 20267
b400-b407 : ide3
b800-b803 : Promise Technology, Inc. 20267
b802-b802 : ide3
bc00-bc3f : Promise Technology, Inc. 20267
bc00-bc07 : ide2
bc08-bc0f : ide3
bc10-bc3f : PDC20267
c000-c0ff : Accton Technology Corporation SMC2-1211TX
c000-c0ff : 8139too
c400-c4ff : Symbios Logic Inc. (formerly NCR) 53c875
c400-c47f : sym53c8xx
$ cat /proc/modules
ppp_deflate 39008 0 (autoclean)
ppp_mppe 23232 2 (autoclean)
bsd_comp 4128 0 (autoclean)
ppp_async 6720 1 (autoclean)
ppp_generic 19240 3 (autoclean) [ppp_deflate ppp_mppe bsd_comp
ppp_async]
slhc 4896 1 (autoclean) [ppp_generic]
via686a 8548 0 (unused)
eeprom 3040 0 (unused)
i2c-proc 6368 0 [via686a eeprom]
i2c-isa 1156 0 (unused)
i2c-viapro 3848 0 (unused)
i2c-core 12864 0 [via686a eeprom i2c-proc i2c-isa i2c-viapro]
binfmt_misc 5924 1
nfsd 68512 8 (autoclean)
lockd 50720 1 (autoclean) [nfsd]
sunrpc 61520 1 (autoclean) [nfsd lockd]
autofs 10564 0 (autoclean) (unused)
8139too 12672 1
ipchains 34568 1
st 25844 0 (unused)
ext3 58912 4
jbd 38500 4 [ext3]
raid5 16864 3
xor 5912 0 [raid5]
raid1 12324 1
sym53c8xx 55300 0 (unused)
sd_mod 11836 0 (unused)
scsi_mod 92824 3 [st sym53c8xx sd_mod]
Thanks in advance, especially if you actually read this far!! Only a
true Linux fan would have stayed awake to this, the 390th line of this
message. :)
Regards,
Jonathan
--
/ Jonathan R. Johnson | "Every word of God is flawless." \
| Minnetonka Software, Inc. | -- Proverbs 30:5 |
\ johnsonj AT MinnetonkaSoftware DOT com | My own words only speak for me. /
|