Hi all,
I have recently seen a couple of occurrences of a "data timeout" on a
particular DLE (disklist entry) that I cannot figure out. The host is
running RH7.2, serving GNU Mailman lists. It has 512MB of RAM and two
20GB IDE disks in a software RAID1 mirror.
The device giving me trouble is /dev/md6, an ext2 fs which contains
mailing list archives. Its total size is 6.2GB with about 3.8GB used.
This fs doesn't change much, only the occasional twiddle when a mailing
list post comes in and the archive is updated. We don't have all that
many lists, and most of them have little volume. So I don't suspect an
actively changing filesystem to be the culprit.
In my report the failure shows up as a data timeout. Here's the relevant
detail:
/-- quimby md6 lev 3 FAILED [data timeout]
sendbackup: start [quimby:md6 level 3]
sendbackup: info BACKUP=/sbin/dump
sendbackup: info RECOVER_CMD=/sbin/restore -f... -
sendbackup: info end
| DUMP: Date of this level 3 dump: Fri Jun 13 06:13:55 2003
| DUMP: Date of last level 2 dump: Sun Jun 8 08:43:00 2003
| DUMP: Dumping /dev/md6 (/var/mmarchive) to standard output
| DUMP: Label: /var/spare
| DUMP: mapping (Pass I) [regular files]
| DUMP: mapping (Pass II) [directories]
| DUMP: estimated 915103 tape blocks.
| DUMP: Volume 1 started with block 1 at: Fri Jun 13 06:14:04 2003
| DUMP: dumping (Pass III) [directories]
| DUMP: dumping (Pass IV) [regular files]
\--------
On the host, I now see two sendbackup processes and five dump processes,
all apparently sleeping, since none of them show up in 'top':
operator 2114 1 0 06:13 ? 00:00:00 /usr/local/libexec/sendbackup
operator 2116 2114 0 06:13 ? 00:00:14 /usr/local/libexec/sendbackup
operator 2117 2114 0 06:13 ? 00:00:02 dump 3usf 1048576 - /dev/md6
operator 2118 2116 0 06:13 ? 00:00:00 [sh <defunct>]
operator 2123 2117 0 06:14 ? 00:00:03 dump 3usf 1048576 - /dev/md6
operator 2124 2123 0 06:14 ? 00:00:17 dump 3usf 1048576 - /dev/md6
operator 2125 2123 0 06:14 ? 00:00:17 dump 3usf 1048576 - /dev/md6
operator 2126 2123 0 06:14 ? 00:00:17 dump 3usf 1048576 - /dev/md6
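The parent/child relationships in that listing are easier to see as a tree. Here is a minimal Python sketch that rebuilds it, assuming the columns follow the usual `ps -ef` layout (UID, PID, PPID, C, STIME, TTY, TIME, CMD); the helper names are made up for illustration:

```python
from collections import defaultdict

# Copied verbatim from the listing above.
# Assumed column order: UID PID PPID C STIME TTY TIME CMD
PS_LISTING = """\
operator 2114 1 0 06:13 ? 00:00:00 /usr/local/libexec/sendbackup
operator 2116 2114 0 06:13 ? 00:00:14 /usr/local/libexec/sendbackup
operator 2117 2114 0 06:13 ? 00:00:02 dump 3usf 1048576 - /dev/md6
operator 2118 2116 0 06:13 ? 00:00:00 [sh <defunct>]
operator 2123 2117 0 06:14 ? 00:00:03 dump 3usf 1048576 - /dev/md6
operator 2124 2123 0 06:14 ? 00:00:17 dump 3usf 1048576 - /dev/md6
operator 2125 2123 0 06:14 ? 00:00:17 dump 3usf 1048576 - /dev/md6
operator 2126 2123 0 06:14 ? 00:00:17 dump 3usf 1048576 - /dev/md6
"""


def parse_ps(text):
    """Return {pid: (ppid, command)} for each well-formed line."""
    procs = {}
    for line in text.splitlines():
        fields = line.split(None, 7)  # the 8th field keeps the whole command
        if len(fields) < 8:
            continue
        procs[int(fields[1])] = (int(fields[2]), fields[7])
    return procs


def children_of(procs):
    """Return {ppid: [child pids]}."""
    kids = defaultdict(list)
    for pid, (ppid, _cmd) in procs.items():
        kids[ppid].append(pid)
    return kids


def render_tree(procs, kids, pid, depth=0):
    """Return indented lines for pid and all of its descendants."""
    lines = ["  " * depth + f"{pid} {procs[pid][1]}"]
    for child in sorted(kids.get(pid, [])):
        lines.extend(render_tree(procs, kids, child, depth + 1))
    return lines


procs = parse_ps(PS_LISTING)
kids = children_of(procs)
# A root is any process whose parent is not itself in the listing.
roots = sorted(pid for pid, (ppid, _cmd) in procs.items() if ppid not in procs)

tree_lines = []
for root in roots:
    tree_lines.extend(render_tree(procs, kids, root))
print("\n".join(tree_lines))
```

Read this way, the defunct sh (2118) is a child of the second sendbackup (2116), and the three dump processes with the most CPU time (2124-2126) all hang off dump 2123.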
I'm not sure what that defunct shell process is all about. It looks to
have been called by one of the sendbackups. It is 08:50 EDT as I write
this, so these processes have been running for over 2.5 hours doing
nothing. My report came in at 06:52. I'm not sure what else I can
discern from the host, as the above snippet from the sendbackup debug is
the only indicator of anything amiss. The amandad debug looked normal,
having completed in 0.022 secs. There is nothing in /var/log/messages
or in dmesg about any lower-level hardware or memory problem.
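To put the times above on one line, here is a rough Python sketch of the timeline. The three timestamps come from the dump log and this message; the 1800-second `dtimeout` is only an assumption (it is the documented amanda.conf default, and the value actually configured on the server is not quoted in this post):

```python
from datetime import datetime, timedelta

FMT = "%Y-%m-%d %H:%M:%S"

# Times taken from the post (all on Fri Jun 13 2003).
dump_started = datetime.strptime("2003-06-13 06:14:04", FMT)    # "Volume 1 started"
report_arrived = datetime.strptime("2003-06-13 06:52:00", FMT)  # report came in
written_at = datetime.strptime("2003-06-13 08:50:00", FMT)      # time of writing

# ASSUMPTION: dtimeout left at the amanda.conf default of 1800 seconds.
ASSUMED_DTIMEOUT = timedelta(seconds=1800)

start_to_report = report_arrived - dump_started
running_for = written_at - dump_started

# The timeout must have fired at or before the report, so under the
# assumption above the last data arrived no later than this:
latest_last_data = report_arrived - ASSUMED_DTIMEOUT

print("dump start -> report:   ", start_to_report)
print("dump start -> now:      ", running_for)
print("last data no later than:", latest_last_data.strftime("%H:%M:%S"))
```

If that assumption holds, the dump would have stopped sending data within roughly the first eight minutes after it started, and the processes have been idle ever since.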
As for the network, the host is in our datacenter, and the AMANDA server
is at a different location, but connected via 100Mb fiber/ethernet, so
it is effectively on a LAN with the host. The other 69 DLEs in my
config finished just fine, so I don't think the network had any problems
(it is mostly idle during backup hours).
Going back through an amoverview, it appears that I've only seen two
other errors for this DLE in my current tapecycle. Between 27-Apr and
today, I saw errors on 15-May and 30-May, but on all other runs it was
fine, including several level-0s. And checking back through my reports,
I see that the 15-May error was an unrelated DNS problem, so 30-May is
the other data timeout failure. That one was on a level-2 attempt, and
this morning's was a level-3.
Can anyone point me in a direction I haven't thought to look yet?
Perhaps I'm overreacting, seeing as there have only been two data
timeouts in the past 47 days, but it bugs me that the dump just quits,
and that it's happened more than once on the same DLE.
Thanks,
Eric