Amanda-Users

RE: Any way to advance to the next DLE? [Was: RE: gnutar version -- exactly 1.13.25 or just 1.13.25 and above?]

2003-12-17 17:45:54
Subject: RE: Any way to advance to the next DLE? [Was: RE: gnutar version -- exactly 1.13.25 or just 1.13.25 and above?]
From: Mark_Conty AT cargill DOT com
To: paul.bijnens AT xplanation DOT com
Date: Wed, 17 Dec 2003 16:39:45 -0600
> To rule out the tar version, use 1.13.25, just like anybody else.
> Installing 1.13.92 is just as experimental as is 1.13.90, and besides
> there are problems with how the index is generated (without the leading
> "./") making the index unusable for amrecover.  You could just as well
> disable the indexing and see if the problem goes away.

Yes, I agree.  Until I get 1.13.25 to build for me, I've disabled 
indexing on the larger DLEs.

--------

Although I'm not going to be digging any further into these client 
hangs, I'll do you the courtesy of answering your questions, in case 
there is something here in your questions and/or my answers that might 
be of help to someone else on the mailing list:

> I forgot what OS you were running under?

(Sorry, I don't think I ever mentioned it.)  Server is HP-UX 11.00; 
clients are HP-UX 10.20, 11.00, and 11.11 (aka, 11i).

> If the program hangs, is there any amanda process using CPU?

Nope.

> Is there network traffic on the amanda ports? (verify with tcpdump or 
> with snoop)

I don't think so, although I did not check with "snoop".  (Good 
suggestion!)  But if there were ongoing traffic, I'd expect it to be 
incrementing the size/offset column in 'lsof', and those numbers were 
not changing.

> What is the system call where the program(s) are hanging? (verify
> with strace or truss, or similar)

Dunno...  Another good suggestion!

Offhand, I think that 'tar' is waiting for more input.  I suspect that 
at some point -- possibly relating to the duplicated input stream? -- 
it's losing track of its input filehandle.  Pure speculation, of course, 
but that's my gut feel.

> If it is really gnutar that's hanging, which files has it 
> open? (verify with lsof)

I don't have the 'lsof' output handy any more, but IIRC, it was just the 
pipes & sockets, as well as the /tmp/amanda/sendbackup.*.debug file.

> Maybe gnutar crosses a mount point from some dead NFS server
> which is mounted with options "hard, nointr"?  (It happened to me!)

Not in this case -- all local disk.

> And, of course, the contents of the /tmp/amanda/*debug files are
> interesting too.  Maybe even the syslog or messages?

Nothing stood out in any of those places as indicative of why it might 
be hanging up.  Having read the amanda-users list for the last couple 
of weeks, I've been learning more and more about looking at those 
/tmp/amanda/ files, so that was one of the first places I checked.

> Is it always the same DLE? If part of it is dumped to the holdingdisk,
> this can give an estimate on where it fails.

Not always the same DLE, but always a fairly large one, and usually (or 
always?) the first of the larger ones.  It might even be limited to 
those that exceed 2 GB, but I haven't gone back to look at past failures 
to determine that.  The holding disk is not in use for this particular 
DLE.

> Try running the command you find in /tmp/amanda/sendbackup.*.debug for
> that filesystem by hand and piping the output to
> "... | cat > /dev/null".

Worked fine.  Ran that cmd and redirected the output to a file (which I 
then successfully dd'd to tape), as well as piping it to 'tar -tf -', 
and both times, they worked fine.

> Maybe set "maxdumps 2" or more for that machine, to avoid it blocking
> on only one FS, and doing the other DLE's in parallel.

maxdumps is already 2.  Despite that, it still locked up.  Maybe the 
other dump had already finished, and rather than being a limit on the 
number of dump processes running at once, maxdumps is a count of how 
many dumps are spawned together, such that both must finish before it 
goes on to the next pair of DLEs?  That's just a guess, though.
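
To make the two readings of maxdumps concrete, here is a toy simulation of each. It is only a sketch: the function names, the DLE names, and the durations are invented for illustration, and neither function is taken from Amanda's scheduler. A hung dump is modelled as one with infinite duration.

```python
import heapq

HUNG = float("inf")  # duration of a dump that never finishes (assumption)


def run_threshold(dles, maxdumps):
    """Reading 1: maxdumps is a limit; a new dump starts when a slot frees."""
    running = []   # min-heap of (finish_time, name)
    finished = []
    clock = 0.0
    for name, duration in dles:
        if len(running) == maxdumps:
            finish, done = heapq.heappop(running)
            if finish == HUNG:
                # Every slot is occupied by a hung dump: nothing else starts.
                heapq.heappush(running, (finish, done))
                break
            clock = finish
            finished.append(done)
        heapq.heappush(running, (clock + duration, name))
    # Let the dumps that are still running complete, if they ever do.
    while running:
        finish, done = heapq.heappop(running)
        if finish != HUNG:
            finished.append(done)
    return finished


def run_batched(dles, maxdumps):
    """Reading 2: dumps are spawned in groups; a group must finish first."""
    finished = []
    for start in range(0, len(dles), maxdumps):
        batch = dles[start:start + maxdumps]
        by_duration = sorted(batch, key=lambda dle: dle[1])
        finished += [name for name, duration in by_duration if duration != HUNG]
        if any(duration == HUNG for _, duration in batch):
            break  # the next group never starts
    return finished


dles = [("small1", 1), ("big", HUNG), ("small2", 1), ("small3", 2)]
print(run_threshold(dles, 2))  # ['small1', 'small2', 'small3']
print(run_batched(dles, 2))    # ['small1']
```

Under the first reading a single hung dump only costs one slot, so the remaining DLEs on that client still complete. Under the second, everything after the hung pair is stranded, which is the behaviour guessed at above.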

> You kill sendbackup, but that one only starts runtar, which starts 
> gnutar.  The pipe is still there.

Actually, I killed the gnutar and its parent process, whose parent, in 
turn, was the sendbackup.  The result was a socket stuck in FIN_WAIT_2 
status.  I've seen this happen before with other applications, and I 
expected that once I cleared the hung socket, the processes at the 
server end would continue, but they all just sat there, staring at each 
other, not doing a darned thing.  *sigh*

> There is no indexing phase.  The index is built at the same time as
> the backup is made by duplicating the output stream, once for the
> backup into gzip, another for the index into "tar -tf - | sed ...".

Yes, Jon LaBadie explained that to me.  (Quite clever!)  My response to 
him, however, was that given this, I would expect both 'tar' commands to 
finish at the same time, both having reached the end of the input 
stream.  In return, he suggested that this problem with the stranded index 
processes must never have been an issue before, or they would have 
included something to deal with it.
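
The stream duplication described here can be sketched in a few lines. This is an illustration only, not Amanda's sendbackup code: the helper names are made up, Python's gzip and tarfile modules stand in for the real gzip and "tar -tf -" processes, and the archive is a tiny in-memory one. The point it shows is that both consumers see end-of-input only when the writer closes its end of each pipe.

```python
import gzip
import io
import os
import tarfile
import threading


def make_archive(files):
    """Build a small uncompressed tar archive in memory."""
    buf = io.BytesIO()
    with tarfile.open(fileobj=buf, mode="w") as tar:
        for name, data in files:
            info = tarfile.TarInfo(name)
            info.size = len(data)
            tar.addfile(info, io.BytesIO(data))
    return buf.getvalue()


def compress(reader, result):
    """Stand-in for the gzip side: read until EOF, then compress."""
    with reader:
        result["gz"] = gzip.compress(reader.read())


def build_index(reader, result):
    """Stand-in for 'tar -tf -': list member names from the stream."""
    with reader:
        with tarfile.open(fileobj=reader, mode="r|") as tar:
            result["index"] = [member.name for member in tar]
        # Drain trailing padding so the writer never hits a closed pipe.
        while reader.read(65536):
            pass


def duplicate(archive, chunk_size=512):
    """Feed every chunk of the archive to both consumers, then signal EOF."""
    result = {}
    writers, threads = [], []
    for consumer in (compress, build_index):
        read_fd, write_fd = os.pipe()
        writers.append(os.fdopen(write_fd, "wb"))
        thread = threading.Thread(
            target=consumer, args=(os.fdopen(read_fd, "rb"), result))
        thread.start()
        threads.append(thread)
    source = io.BytesIO(archive)
    while True:
        chunk = source.read(chunk_size)
        if not chunk:
            break
        for writer in writers:
            writer.write(chunk)
    for writer in writers:
        writer.close()  # closing the write end is what delivers EOF
    for thread in threads:
        thread.join()
    return result


archive = make_archive([("a.txt", b"hello"), ("b.txt", b"world" * 1000)])
result = duplicate(archive)
print(result["index"])
```

If the final close of a write end never happens, the consumer on that pipe blocks in read() indefinitely, which would look much like a 'tar' that is "waiting for more input".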

So I think I'm pretty safe to tentatively put the blame on this 
uncertified version of gnutar.

> What is the input of the "tar -tf -" connected to?

It's talking to one of the "dumper" processes back on the tape server.

-----------

Well, I've spent much more time on this than I planned to, and as a 
result, I'm falling behind on other work, so I'm just going to fall back 
to 1.13.25 and get on with life.  I'm afraid I'll have to let someone 
else more proficient and ambitious than I continue wrestling with the 
new version of 'tar', while I go back to lurking.

Thanks to Paul, Jon, and the rest of you for all the advice & info with 
this.
-- Mark

PS -- If some of you veterans feel that there might be value in 
providing some means to have Amanda drop the current DLE and move on to 
the next one, would one of you be so kind as to forward this mail 
message to the 'amanda-hackers' mailing list?  Tnx!
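
One way such a "drop it and move on" mechanism could work is an inactivity timeout on the data stream. The sketch below is a generic illustration and makes no claim about how Amanda does or should implement it: the function name and the timeout values are hypothetical, and select() on a pipe as used here works on POSIX systems only.

```python
import os
import select
import subprocess
import sys


def run_with_idle_timeout(cmd, idle_timeout):
    """Run cmd and discard its stdout; give up if no output arrives for
    idle_timeout seconds.  Returns (status, bytes_read)."""
    proc = subprocess.Popen(cmd, stdout=subprocess.PIPE)
    fd = proc.stdout.fileno()
    total = 0
    try:
        while True:
            ready, _, _ = select.select([fd], [], [], idle_timeout)
            if not ready:
                proc.kill()          # no data for too long: abandon this job
                return "timeout", total
            chunk = os.read(fd, 65536)
            if not chunk:            # EOF: the job finished on its own
                return "done", total
            total += len(chunk)
    finally:
        proc.stdout.close()
        proc.wait()


# Two stand-in "dumps": one that hangs silently, one that completes.
jobs = [
    ("hung", [sys.executable, "-c", "import time; time.sleep(60)"]),
    ("ok", [sys.executable, "-c", "print('data')"]),
]
for name, cmd in jobs:
    status, nbytes = run_with_idle_timeout(cmd, idle_timeout=2.0)
    print(name, status, nbytes)
```

The timer is reset by every chunk of data, so a slow but progressing dump is left alone and only a stream that goes completely quiet is abandoned.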

