> To rule out the tar version, use 1.13.25, just like anybody else.
> Installing 1.13.92 is just as experimental as is 1.13.90, and besides
> there are problems with how the index is generated (without the leading
> "./") making the index unusable for amrecover. You could just as well
> disable the indexing and see if the problem goes away.
Yes, I agree. Until I get 1.13.25 to build for me, I've disabled
indexing on the larger DLEs.
--------
Although I'm not going to be digging any further into these client
hangs, I'll do you the courtesy of answering your questions, in case
there is something here in your questions and/or my answers that might
be of help to someone else on the mailing list:
> I forgot what OS you were running under?
(Sorry, I don't think I ever mentioned it.) Server is HP-UX 11.00;
clients are HP-UX 10.20, 11.00, and 11.11 (aka, 11i).
> If the program hangs, is there any amanda process using CPU?
Nope.
> Is there network traffic on the amanda ports? (verify with tcpdump or
> with snoop)
I don't think so, although I did not check with "snoop". (Good
suggestion!) But if there were ongoing traffic, I'd expect it to be
incrementing the size/offset column in 'lsof', and those numbers were
not changing.
> What is the system call where the program(s) are hanging? (verify
> with strace or truss, or similar)
Dunno... Another good suggestion!
Offhand, I think that 'tar' is waiting for more input. I suspect that
at some point -- possibly relating to the duplicated input stream? --
it's losing track of its input filehandle. Pure speculation, of course,
but that's my gut feel.
> If it is really gnutar that's hanging, which files has it
> open? (verify with lsof)
I don't have the 'lsof' output handy any more, but IIRC, it was just the
pipes & sockets, as well as the /tmp/amanda/sendbackup.*.debug file.
> Maybe gnutar crosses a mount point from some dead NFS server
> which is mounted with options "hard, nointr"? (It happened to me!)
Not in this case -- all local disk.
> And, of course, the contents of the /tmp/amanda/*debug files are
> interesting too. Maybe even the syslog or messages?
Nothing stood out in any of those places as indicative of why it might
be hanging up. From reading the amanda-users list for the
last couple of weeks, I've been learning more and more about looking at
those /tmp/amanda/ files, so that was one of the first places I checked.
> Is it always the same DLE? If part of it is dumped to the holdingdisk,
> this can give an estimate on where it fails.
Not always the same DLE, but always a fairly large one, and usually (or
always?) the first of the larger ones. It might even be limited to
those that exceed 2 GB, but I haven't gone back to look at past failures
to determine that. Holdingdisk not in use for this particular DLE.
> Try running the command you find in /tmp/amanda/sendbackup.*.debug for
> that filesystem by hand and piping the output to "... | cat >
> /dev/null".
Worked fine. Ran that cmd and redirected the output to a file (which I
then successfully dd'd to tape), as well as piping it to 'tar -tf -',
and both times, they worked fine.
> Maybe set "maxdumps 2" or more for that machine, to avoid it blocking
> on only one FS, and doing the other DLE's in parallel.
maxdumps is already 2. Despite that, it still locked up. Maybe it
finished the other side, and rather than maxdumps being a threshold of
running dump processes, it may be a counter of how many concurrent dumps
are spawned, such that both must finish before it goes on to the next
pair of DLEs? That's just a guess, though.
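To make the two readings of "maxdumps" concrete, here is a small Python sketch. It is only a toy model under my own assumptions, not Amanda's scheduler: the function names and the durations are made up, and a hung dump is modelled as one that never finishes.

```python
import heapq

def pool_finish_times(durations, slots):
    """Reading 1: a cap on running dumps. A new dump starts whenever a slot frees up."""
    free_at = [0.0] * slots
    heapq.heapify(free_at)
    finish = []
    for d in durations:
        start = heapq.heappop(free_at)   # earliest free slot
        end = start + d
        finish.append(end)
        heapq.heappush(free_at, end)
    return finish

def batch_finish_times(durations, slots):
    """Reading 2: dumps start in groups; the next group waits for the whole group."""
    finish = []
    group_start = 0.0
    for i in range(0, len(durations), slots):
        ends = [group_start + d for d in durations[i:i + slots]]
        finish.extend(ends)
        group_start = max(ends)          # everyone in the group must be done
    return finish

HUNG = float("inf")                      # a dump that never completes
durations = [HUNG, 1.0, 1.0, 1.0]        # hypothetical: first DLE hangs

print(pool_finish_times(durations, 2))   # [inf, 1.0, 2.0, 3.0]
print(batch_finish_times(durations, 2))  # [inf, 1.0, inf, inf]
```

Under the first reading the remaining DLEs still get dumped through the second slot; under the second, one hung dump stalls everything after its group. Which of the two Amanda actually does, I can't say from here.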
> You kill sendbackup, but that one only starts runtar, which starts
> gnutar. The pipe is still there.
Actually, I killed the gnutar and its parent process, whose parent, in
turn, was the sendbackup. The result was a socket stuck in FIN_WAIT_2
status. I've seen this happen before with other applications, and I
expected that once I cleared the hung socket, the processes at the
server end would continue, but they all just sat there, staring at each
other, not doing a darned thing. *sigh*
> There is no indexing phase. The index is built at the same
> time as the
> backup is made by duplicating the output stream, once for the backup
> in to gzip, another for the index into "tar -tf - | sed ...".
Yes, Jon LeBadie explained that to me. (Quite clever!) My response to
him, however, was that given this, I would expect both 'tar' commands to
finish at the same time, both having reached the end of the input
stream. In return, he suggested that this problem with the stranded index
processes must never have been an issue before, or they would have
included something to deal with it.
So I think I'm pretty safe to tentatively put the blame on this
uncertified version of gnutar.
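For anyone following along, the duplicated-stream idea can be sketched in a few lines of Python. This is not Amanda's code, just an illustration under my own assumptions: `tee_stream`, `count_bytes` and `count_lines` are hypothetical names, and the two consumers stand in for the backup side and the index side.

```python
import os
import threading

def write_all(fd, data):
    """Write every byte of data to fd, coping with short writes."""
    view = memoryview(data)
    while view:
        written = os.write(fd, view)
        view = view[written:]

def tee_stream(chunks, consumers):
    """Feed each chunk to every consumer through its own pipe, then close them all."""
    results = [None] * len(consumers)
    write_ends, threads = [], []
    for i, consume in enumerate(consumers):
        r, w = os.pipe()
        write_ends.append(w)

        def run(i=i, r=r, consume=consume):
            with os.fdopen(r, "rb") as reader:
                results[i] = consume(reader)

        t = threading.Thread(target=run)
        t.start()
        threads.append(t)
    for chunk in chunks:
        for w in write_ends:
            write_all(w, chunk)   # blocks if this consumer stops reading
    for w in write_ends:
        os.close(w)               # both consumers now see end of input
    for t in threads:
        t.join()
    return results

def count_bytes(reader):          # stand-in for the "backup" consumer
    return len(reader.read())

def count_lines(reader):          # stand-in for the "index" consumer
    return sum(1 for _ in reader)

print(tee_stream([b"a\n", b"bb\n"], [count_bytes, count_lines]))  # [5, 2]
```

In this sketch both consumers reach end of input together, because the writer closes both pipes at the same point. It also shows the flip side: if either consumer stops reading, the writer blocks once that pipe fills up, and everything waits.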
> What is the input of the "tar -tf -" connected to?
It's talking to one of the "dumper" processes back on the tape server.
-----------
Well, I've spent much more time on this than I planned to, and as a
result, I'm falling behind on other work, so I'm just going to fall back
to 1.13.25 and get on with life. I'm afraid I'll have to let someone
else more proficient and ambitious than I continue wrestling with the
new version of 'tar', while I go back to lurking.
Thanks to Paul, Jon, and the rest of you for all the advice & info with
this.
-- Mark
PS -- If some of you veterans feel that there might be value in
providing some means to have Amanda drop the current DLE and move on to
the next one, would one of you be so kind as to forward this mail
message to the 'amanda-hackers' mailing list? Tnx!
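To show roughly what I mean by "drop the current DLE and move on", here is a minimal Python sketch. It is purely hypothetical and has nothing to do with how Amanda is implemented: `run_jobs_with_deadline` is a made-up name, the "jobs" are dummy commands, and a plain wall-clock limit is a crude stand-in for whatever hang detection would really be needed.

```python
import subprocess
import sys

def run_jobs_with_deadline(jobs, timeout_s):
    """Run each (name, argv) in turn; kill and skip any job exceeding timeout_s."""
    outcome = {}
    for name, argv in jobs:
        try:
            done = subprocess.run(argv, timeout=timeout_s,
                                  stdout=subprocess.DEVNULL)
            outcome[name] = "ok" if done.returncode == 0 else "failed"
        except subprocess.TimeoutExpired:
            # subprocess.run has already killed the child at this point
            outcome[name] = "timed out"
    return outcome

# Dummy stand-ins for two DLEs: one hangs, one completes.
jobs = [
    ("hung", [sys.executable, "-c", "import time; time.sleep(60)"]),
    ("fine", [sys.executable, "-c", "pass"]),
]
print(run_jobs_with_deadline(jobs, timeout_s=2.0))
```

A real version would more likely watch for lack of progress than for total elapsed time, since a large DLE can legitimately run for hours.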