Amanda-Users

Re: dumper abort()ing occasionally

2008-06-10 10:27:32
From: "Dustin J. Mitchell" <dustin AT zmanda DOT com>
To: "Douglas K. Rand" <rand AT meridian-enviro DOT com>
Date: Tue, 10 Jun 2008 10:17:33 -0400
On Mon, Jun 9, 2008 at 5:11 PM, Douglas K. Rand
<rand AT meridian-enviro DOT com> wrote:
> Once or twice a week my amanda backups are failing when a dumper exits
> on signal 6, SIGABRT:
>
>   Jun  5 23:27:38 scotch kernel: pid 82566 (dumper), uid 0: exited on signal 6
>   Jun  8 19:53:43 scotch kernel: pid 96672 (dumper), uid 0: exited on signal 6
>
> In looking at the source there clearly are calls to abort() in several
> places. I'm assuming that there is an overflow problem with file
> descriptors, that 4294967295 isn't a valid FD?
>
>  driver: event_register: Invalid file descriptor 4294967295

That large integer is -1 printed as an unsigned 32-bit value.  I'm
guessing that when the dumper exits unexpectedly, the driver gets an
EOF from its file descriptor and sets that fd to -1, but then
incorrectly tries to re-register it with the event system.  The
pre-2.6.0 event system was a careful balancing act, but in this case
it seems to have handled the error correctly by rejecting the invalid
descriptor.
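
To illustrate where the 4294967295 in the driver message comes from,
here is a minimal C sketch.  It assumes the descriptor ends up being
treated as a 32-bit unsigned value somewhere along the way; the helper
names are hypothetical and are not taken from the Amanda source.

```c
#include <inttypes.h>
#include <stdint.h>
#include <stdio.h>

/* Hypothetical helper, not Amanda code: reinterpret a signed file
 * descriptor as an unsigned 32-bit value.  Conversion to an unsigned
 * type is modulo 2^32, so -1 becomes 2^32 - 1. */
uint32_t fd_as_unsigned(int fd)
{
    return (uint32_t)fd;
}

/* Hypothetical helper: format the descriptor the way an unsigned
 * format specifier would show it. Returns the snprintf() result. */
int format_fd_unsigned(char *buf, size_t len, int fd)
{
    return snprintf(buf, len, "%" PRIu32, fd_as_unsigned(fd));
}
```

Calling format_fd_unsigned() with fd set to -1 fills the buffer with
"4294967295", the same value the driver complained about.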

The real problem is figuring out why dumper aborted.  Most (all?)
abort calls in Amanda go through the error() macro, which should also
log a message to the debug log.  But looking at the debug logs you
sent, I see no such thing.  I also can't look at the logs at that
URL -- the Apache user doesn't have read permission on the files
themselves.
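
For readers unfamiliar with the log-then-abort pattern mentioned
above, a rough C sketch follows.  This is not Amanda's error() macro:
the names log_fatal and FATAL, the message format, and the FILE*
argument are all assumptions made for illustration.

```c
#include <stdio.h>
#include <stdlib.h>

/* Hypothetical stand-in for a debug-log writer; not Amanda's API.
 * Writes one line to the log and returns the fprintf() result. */
int log_fatal(FILE *debug_log, const char *msg)
{
    int n = fprintf(debug_log, "fatal: %s\n", msg);
    /* Flush now: abort() is not required to flush stdio buffers. */
    fflush(debug_log);
    return n;
}

/* Hypothetical macro: log first, then abort(), which raises SIGABRT
 * (signal 6 on most systems). */
#define FATAL(debug_log, msg) \
    do { log_fatal((debug_log), (msg)); abort(); } while (0)
```

With a pattern like this, a process that dies on signal 6 would
normally leave a "fatal:" line behind, which is why a SIGABRT with
nothing in the debug log is suspicious.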

The debug logs do show the client connection timing out, though.  It's
likely that this condition is what is tickling the dumper bug, and
since 2.5.1 is no longer maintained, the solution is to stop tickling
the bug :).  See if you can figure out why that connection is timing
out -- busy network?  Downed client?  Network partition?

Dustin

-- 
Storage Software Engineer
http://www.zmanda.com
