Amanda-Users

Re: Any way to advance to the next DLE? [Was: RE: gnutar version -- exactly 1.13.25 or just 1.13.25 and above?]

2003-12-17 16:37:47
Subject: Re: Any way to advance to the next DLE? [Was: RE: gnutar version -- exactly 1.13.25 or just 1.13.25 and above?]
From: Paul Bijnens <paul.bijnens AT xplanation DOT com>
To: Mark_Conty AT cargill DOT com
Date: Wed, 17 Dec 2003 22:34:01 +0100
Mark_Conty AT cargill DOT com wrote:
Ok, so it looks like my version of gnutar (1.13.90) might have some problems, and it keeps locking up while trying to build the index of some of the larger DLEs of one of my dumpsets.


I'm still not convinced of that "fact".
We only know that the index was still being built.  Maybe it did not
finish because the input pipe for the index was not closed (yet? why?).

To rule out the tar version, use 1.13.25, just like everybody else.
Installing 1.13.92 is just as experimental as 1.13.90, and besides,
there are problems with how the index is generated (without the leading
"./"), making the index unusable for amrecover.  You could just as well
disable the indexing and see if the problem goes away.

I forgot what OS you were running under?

If the program hangs, is there any amanda process using CPU?
Is there network traffic on the amanda ports? (verify with tcpdump or with snoop)
What is the system call where the program(s) are hanging? (verify
with strace or truss, or similar)
If it is really gnutar that's hanging, which files has it open? (verify
with lsof)

Maybe gnutar crosses a mount point from some dead NFS server
which is mounted with options "hard, nointr"?  (It happened to me!)
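
If lsof is not at hand, a rough equivalent of the checks above can be
sketched in Python.  This is only a sketch under assumptions: it assumes
a Linux client with /proc mounted (I still don't know your OS), and
describe_process is a hypothetical helper name, not part of Amanda.

```python
import os

def describe_process(pid):
    """Return the scheduler state and open files of a process (Linux /proc only)."""
    state = "unknown"
    with open(f"/proc/{pid}/status") as status:
        for line in status:
            if line.startswith("State:"):
                # e.g. "S (sleeping)" or "D (disk sleep)"
                state = line.split(":", 1)[1].strip()
                break
    open_files = []
    fd_dir = f"/proc/{pid}/fd"
    for fd in os.listdir(fd_dir):
        try:
            open_files.append(os.readlink(os.path.join(fd_dir, fd)))
        except OSError:
            pass  # descriptor was closed between listdir and readlink
    return {"state": state, "open_files": open_files}

if __name__ == "__main__":
    # Inspect this script itself; pass the pid of gnutar instead when debugging.
    print(describe_process(os.getpid()))
```

A state of "D (disk sleep)" together with an open file below an NFS
mount point would fit the dead-NFS-server theory; reading another
user's /proc/<pid>/fd normally requires root.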

And, of course, the contents of the /tmp/amanda/*debug files are
interesting too.  Maybe even the syslog or messages?


So while it's sitting there, locked up, does anyone have any ideas for what I can try so that it gives up on this DLE and moves on to the next one?

Is it always the same DLE? If part of it is dumped to the holdingdisk,
this can give an estimate on where it fails.

Try running the command you find in /tmp/amanda/sendbackup.*.debug for
that filesystem by hand and piping the output to "... | cat > /dev/null".

Maybe set "maxdumps 2" or more for that machine, so that it does not
block on a single FS and can do the other DLEs in parallel.

Like I said, I've tried sending SIGHUP and SIGALRM to several of the processes at different times, but to no avail. In fact, there seems to be a dearth in process communication by this point, because even after I've killed the client 'sendbackup' processes, the server processes don't seem to notice. Shouldn't the server side at least get a SIGPIPE or something? Or maybe it'd eventually time out and then move on to the next DLE.

You kill sendbackup, but that one only starts runtar, which starts gnutar. The pipe is still there.
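
Why the reader sees no end-of-file can be shown with a small Python
sketch.  It is an illustration under assumptions, not Amanda code: the
two one-line scripts are hypothetical stand-ins for sendbackup (PARENT)
and gnutar (GRANDCHILD), and the 2-second sleep is arbitrary.

```python
import subprocess
import sys
import time

# Hypothetical stand-in for gnutar: announces itself, then keeps stdout open.
GRANDCHILD = "import sys, time; sys.stdout.write('r'); sys.stdout.flush(); time.sleep(2)"
# Hypothetical stand-in for sendbackup: starts the grandchild and waits for it.
PARENT = "import subprocess, sys; subprocess.Popen([sys.executable, '-c', sys.argv[1]]).wait()"

def seconds_until_eof_after_kill():
    """Kill the parent, then measure how long the reader still waits for EOF."""
    parent = subprocess.Popen(
        [sys.executable, "-c", PARENT, GRANDCHILD], stdout=subprocess.PIPE
    )
    parent.stdout.read(1)   # wait until the grandchild is running
    parent.kill()           # SIGKILL reaches the parent only
    parent.wait()
    start = time.monotonic()
    parent.stdout.read()    # blocks until the last write end is closed
    elapsed = time.monotonic() - start
    parent.stdout.close()
    return elapsed

if __name__ == "__main__":
    print("reader waited %.1f s after the parent was killed"
          % seconds_until_eof_after_kill())
```

The read only returns once the grandchild exits, because it inherited
the write end of the pipe.  With a gnutar that never exits, the server
side would wait until some timeout fires.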


In this particular case, the dump is in the indexing phase, as Paul B. pointed out. Doesn't anyone else think it's odd that there isn't also an "itimeout" parameter, one that would limit how long it spends trying to build the index for a DLE? Seems like it could go into

There is no indexing phase.  The index is built at the same time as the
backup is made by duplicating the output stream, once for the backup
into gzip, another for the index into "tar -tf - | sed ...".

(The missing "gzip --fast" that I talked about in my previous mail runs
on the server; I did my quick test with client/server on the same
machine.)
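
The duplicated stream described above can be mimicked in memory with
Python's tarfile and gzip modules.  This is a simplified sketch, not
what sendbackup does internally: make_tar and backup_and_index are
hypothetical names, the two consumers run one after the other instead
of in parallel, and stripping the leading "." is only my assumed
stand-in for the "sed ..." step.

```python
import gzip
import io
import tarfile

def make_tar(names):
    """Build a small in-memory tar archive whose members are named ./<name>."""
    buf = io.BytesIO()
    with tarfile.open(fileobj=buf, mode="w") as tar:
        for name in names:
            data = name.encode()
            info = tarfile.TarInfo("./" + name)
            info.size = len(data)
            tar.addfile(info, io.BytesIO(data))
    return buf.getvalue()

def backup_and_index(tar_bytes):
    """Feed the same tar stream to two consumers: a compressor and an indexer."""
    # Consumer 1: the backup image, compressed at the fastest level.
    backup = gzip.compress(tar_bytes, compresslevel=1)
    # Consumer 2: the index, one line per member (like "tar -tf -").
    with tarfile.open(fileobj=io.BytesIO(tar_bytes), mode="r") as tar:
        # Assumed stand-in for "sed ...": "./etc/hosts" becomes "/etc/hosts".
        index = [m.name[1:] if m.name.startswith("./") else m.name
                 for m in tar.getmembers()]
    return backup, index

if __name__ == "__main__":
    image, index = backup_and_index(make_tar(["etc/hosts", "etc/passwd"]))
    print(len(image), index)
```

Note that member names without the leading "./" pass through this
index step unchanged, which is the kind of index that amrecover
cannot use.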

client-src/sendsize.c, f'rinstance. But from the way it's behaving, my guess is that the time spent indexing is not part of the "dtimeout" value, which is why the indexing phase of one of my DLEs has been sitting there since midnight... *sigh*

What is the input of the "tar -tf -" connected to?


Anyway, if someone has some ideas for how to nicely push Amanda on to the next DLE, I'd appreciate any suggestions.


--
Paul @ Home

