Re: Any way to advance to the next DLE? [Was: RE: gnutar version -- exactly 1.13.25 or just 1.13.25 and above?]

Mark_Conty AT cargill DOT com wrote:

Ok, so it looks like my version of gnutar (1.13.90) might have some problems, and it keeps locking up while trying to build the index of some of the larger DLEs of one of my dumpsets.



I'm still not convinced of that "fact".
We only know that the index was still being built.  Maybe it did not
finish because the input pipe for the index was not closed (yet? why?).

To rule out the tar version, use 1.13.25, just like everybody else.
Installing 1.13.92 is just as experimental as 1.13.90, and besides,
there are problems with how the index is generated (without the leading
"./"), making the index unusable for amrecover.  You could just as well
disable the indexing and see if the problem goes away.

I forgot what OS you were running under?

If the program hangs, is there any amanda process using CPU?

Is there network traffic on the amanda ports? (verify with tcpdump or with snoop)

What is the system call where the program(s) are hanging? (verify
with strace or truss, or similar)
If it is really gnutar that's hanging, which files has it open? (verify
with lsof)

Maybe gnutar crosses a mount point from some dead NFS server
which is mounted with options "hard, nointr"?  (It happened to me!)

And, of course, the contents of the /tmp/amanda/*debug files are
interesting too.  Maybe even the syslog or messages?

So while it's sitting there, locked up, does anyone have any ideas for what I can try so that it gives up on this DLE and moves on to the next one?


Is it always the same DLE? If part of it is dumped to the holding disk,
this can give an estimate of where it fails.

Try running the command you find in /tmp/amanda/sendbackup.*.debug for
that filesystem by hand and piping the output to "... | cat > /dev/null".
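If you want to script that manual test, here is a minimal sketch, assuming you have copied the command from the debug file into an argument list yourself. The helper name `run_discarding_output` and the time limit are my own illustration, not anything Amanda provides. It throws the output away and reports whether the command finished or had to be killed.

```python
import subprocess


def run_discarding_output(cmd, timeout):
    """Run cmd with stdout sent to the null device.

    Returns the exit code, or None if the command was still running
    after `timeout` seconds (subprocess.run kills it in that case).
    """
    try:
        done = subprocess.run(cmd, stdout=subprocess.DEVNULL, timeout=timeout)
        return done.returncode
    except subprocess.TimeoutExpired:
        return None
```

A result of None for a DLE that normally dumps quickly suggests the hang is in the command itself and not in the index or network side.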

Maybe set "maxdumps 2" or more for that machine, so that it does not
block on only one FS and can do the other DLEs in parallel.

Like I said, I've tried sending SIGHUP and SIGALRM to several of the processes at different times, but to no avail. In fact, there seems to be a dearth in process communication by this point, because even after I've killed the client 'sendbackup' processes, the server processes don't seem to notice. Shouldn't the server side at least get a SIGPIPE or something? Or maybe it'd eventually time out and then move on to the next DLE.

You kill sendbackup, but that one only starts runtar, which starts gnutar. The pipe is still there.
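To illustrate why the reader never notices, here is a minimal sketch of the general pipe behaviour, not of Amanda's code. A duplicated descriptor stands in for the copy of the write end that a descendant process such as gnutar would inherit; the helper name `sees_eof` is made up for the example. The reader only gets end-of-file once every copy of the write end is closed.

```python
import os
import select


def sees_eof(fd, timeout=0.2):
    """Return True if a read on fd would report end-of-file right now."""
    ready, _, _ = select.select([fd], [], [], timeout)
    return bool(ready) and os.read(fd, 1) == b""


read_end, write_end = os.pipe()
inherited_copy = os.dup(write_end)   # stands in for gnutar's inherited copy

os.close(write_end)                  # like killing only sendbackup
eof_after_first_close = sees_eof(read_end)    # another writer still exists

os.close(inherited_copy)             # like gnutar finally exiting
eof_after_all_closed = sees_eof(read_end)     # now the reader sees EOF

os.close(read_end)
```

So as long as gnutar (or anything else holding the write end) is alive, the other side keeps waiting.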

In this particular case, the dump is in the indexing phase, as Paul B. pointed out. Doesn't anyone else think it's odd that there isn't also an "itimeout" parameter, one that would limit how long it spends trying to build the index for a DLE? Seems like it could go into


There is no indexing phase.  The index is built at the same time as the
backup is made by duplicating the output stream, once for the backup
into gzip, another for the index into "tar -tf - | sed ...".

(The missing "gzip --fast" that I talked about in my previous mail runs
on the server; I did my quick test with client and server on the same
machine.)
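As a rough illustration of "one stream, two consumers", here is a minimal sketch using only Python's standard library. It is not Amanda's implementation: the real consumers run concurrently on a pipe, while this runs them one after the other on an in-memory buffer. The function names are invented, and the assumption that the sed step turns a leading "./" into "/" is mine.

```python
import gzip
import io
import tarfile


def make_tar(files):
    """Build an uncompressed tar stream from a {name: bytes} mapping."""
    buf = io.BytesIO()
    with tarfile.open(fileobj=buf, mode="w") as tar:
        for name, data in files.items():
            info = tarfile.TarInfo(name)
            info.size = len(data)
            tar.addfile(info, io.BytesIO(data))
    return buf.getvalue()


def backup_and_index(stream):
    """Feed the same tar bytes to a compressor and to an index builder."""
    # Consumer 1: the backup image (compresslevel=1 plays the role of --fast).
    compressed = gzip.compress(stream, compresslevel=1)
    # Consumer 2: the index, like "tar -tf - | sed ..." (assumed: "./x" -> "/x").
    with tarfile.open(fileobj=io.BytesIO(stream), mode="r|") as tar:
        index = [m.name[1:] if m.name.startswith("./") else m.name for m in tar]
    return compressed, index


stream = make_tar({"./etc/hosts": b"127.0.0.1 localhost\n", "./etc/motd": b"hi\n"})
compressed, index = backup_and_index(stream)
```

The point is that the index consumer reads until the stream ends, so it cannot finish before the tar output does and its input is closed.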

client-src/sendsize.c, f'rinstance. But from the way it's behaving, my guess is that the time spent indexing is not part of the "dtimeout" value, which is why the indexing phase of one of my DLEs has been sitting there since midnight... *sigh*


What is the input of the "tar -tf -" connected to?

Anyway, if someone has some ideas for how to nicely push Amanda on to the next DLE, I'd appreciate any suggestions.



--
Paul @ Home
