Amanda-Users

data timeout head-scratcher

2003-06-13 09:15:10
Subject: data timeout head-scratcher
From: Eric Sproul <esproul AT ntelos DOT net>
To: Amanda Users <amanda-users AT amanda DOT org>
Date: 13 Jun 2003 09:11:06 -0400
Hi all,
I have recently seen a couple of occurrences of a "data timeout" on a
particular DLE that I cannot figure out.  The host is running RH7.2,
serving GNU Mailman lists.  It has 512MB of RAM and two 20GB IDE disks
in a software RAID1 mirror.

The device giving me trouble is /dev/md6, an ext2 fs which contains
mailing list archives.  Its total size is 6.2GB with about 3.8GB used. 
This fs doesn't change much, only the occasional twiddle when a mailing
list post comes in and the archive is updated.  We don't have all that
many lists, and most of them have little volume.  So I don't suspect an
actively changing filesystem to be the culprit.

In my report I see the failure due to data timeout.  Here's the snippet
of detail:

/-- quimby     md6 lev 3 FAILED [data timeout]
sendbackup: start [quimby:md6 level 3]
sendbackup: info BACKUP=/sbin/dump
sendbackup: info RECOVER_CMD=/sbin/restore -f... -
sendbackup: info end
|   DUMP: Date of this level 3 dump: Fri Jun 13 06:13:55 2003
|   DUMP: Date of last level 2 dump: Sun Jun  8 08:43:00 2003
|   DUMP: Dumping /dev/md6 (/var/mmarchive) to standard output
|   DUMP: Label: /var/spare
|   DUMP: mapping (Pass I) [regular files]
|   DUMP: mapping (Pass II) [directories]
|   DUMP: estimated 915103 tape blocks.
|   DUMP: Volume 1 started with block 1 at: Fri Jun 13 06:14:04 2003
|   DUMP: dumping (Pass III) [directories]
|   DUMP: dumping (Pass IV) [regular files]
\--------

On the host, I now see two sendbackup processes and five dump processes,
all seemingly asleep, since none of them shows up in 'top'.

operator  2114     1  0 06:13 ?        00:00:00 /usr/local/libexec/sendbackup
operator  2116  2114  0 06:13 ?        00:00:14 /usr/local/libexec/sendbackup
operator  2117  2114  0 06:13 ?        00:00:02 dump 3usf 1048576 - /dev/md6
operator  2118  2116  0 06:13 ?        00:00:00 [sh <defunct>]
operator  2123  2117  0 06:14 ?        00:00:03 dump 3usf 1048576 - /dev/md6
operator  2124  2123  0 06:14 ?        00:00:17 dump 3usf 1048576 - /dev/md6
operator  2125  2123  0 06:14 ?        00:00:17 dump 3usf 1048576 - /dev/md6
operator  2126  2123  0 06:14 ?        00:00:17 dump 3usf 1048576 - /dev/md6

I'm not sure what that defunct shell process is all about.  Its parent
is the second sendbackup (PID 2116).  It is 08:50 EDT as I write this,
so these processes have been sitting there for over 2.5 hours doing
nothing.  My report came in at 06:52.  I'm not sure what else I can
discern from the host, as the above snippet from the sendbackup debug is
the only indicator of anything amiss.  The amandad debug looked normal,
having completed in 0.022 secs.  There is nothing in /var/log/messages
or in dmesg about any lower-level hardware or memory problem.
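
The timestamps above put a rough bound on when the data stopped flowing.
This is only a sketch under two assumptions that are not stated in this
post: that dtimeout is at 1800 seconds (believed to be the stock value,
but check amanda.conf), and that the report arrived at 06:52:00 exactly,
since only the minute is known.  The function name stall_window is made
up for illustration.

```python
from datetime import datetime, timedelta


def at(hms):
    """Build a timestamp on the day of the failed run."""
    return datetime.strptime("2003-06-13 " + hms, "%Y-%m-%d %H:%M:%S")


DUMP_STARTED = at("06:14:04")    # "Volume 1 started" line in the report
REPORT_ARRIVED = at("06:52:00")  # seconds assumed; post only says 06:52
WRITING_NOW = at("08:50:00")     # time the post was written

# ASSUMPTION: dtimeout value is not given in the post.
ASSUMED_DTIMEOUT = timedelta(seconds=1800)


def stall_window(start, report, dtimeout):
    """Latest time data can have arrived, and how long dump ran until then.

    The timeout fires after dtimeout of silence and the report is mailed
    no earlier than that, so the last data arrived at or before
    report - dtimeout.
    """
    latest_last_data = report - dtimeout
    return latest_last_data, latest_last_data - start


latest, window = stall_window(DUMP_STARTED, REPORT_ARRIVED, ASSUMED_DTIMEOUT)
print("dump start to report:   ", REPORT_ARRIVED - DUMP_STARTED)
print("last data no later than:", latest.time())
print("dump was productive for at most:", window)
print("processes idle on host for about:", WRITING_NOW - latest)
```

If those assumptions hold, the stream went quiet within roughly eight
minutes of the dump starting.  Because the report is only mailed once
the whole run is done, this is an upper bound and the stall may have
begun earlier.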

As for the network, the host is in our datacenter, and the AMANDA server
is at a different location, but connected via 100Mb/s fiber/Ethernet, so
it is effectively on a LAN with the host.  The other 69 DLEs in my
config finished just fine, so I don't think the network had any problems
(it is mostly idle during backup hours).

Going back through an amoverview, it appears that I've only seen two
other errors for this DLE in my current tapecycle.  Between 27-Apr and
today, I saw errors on 15-May and 30-May, but on all other runs it was
fine, including several level-0s.  And checking back through my reports,
I see that the 15-May error was an unrelated DNS problem, so 30-May is
the other data timeout failure.  That one was on a level-2 attempt, and
this morning's was a level-3.

Can anyone point me in a direction I haven't thought to look yet?
Perhaps I'm overreacting, seeing as there have only been two data
timeouts in the past 47 days, but it bugs me that it just quits, and
that it has happened more than once on the same DLE.

Thanks,
Eric

