gnutar version -- exactly 1.13.25 or just 1.13.25 and above?

Jon LaBadie writes:
> On Fri, Nov 14, 2003 at 07:15:07PM +0100, Zoltan Kato wrote:
> > /home is not NFS mounted and the directories are not transient (they are
> > the actual home dirs of individual users). runtar is setuid root. I tried
> > to run the gtar command from the log file manually as root, and found that
> > it ONLY works when I run it from /home:
> > 
> > root@rozi$ cd /home/
> > root@rozi$ /opt/sfw/bin/gtar --create --file /dev/null --directory /home
> 
>     This is NOT related to the problem you are seeing.
> 
> However note that the version of gnutar that Sun supplies
> to install in /opt/sfw/bin/gtar, or /usr/sfw/bin/gtar in
> later releases, is not a suitable version for use with
> amanda.  It is only 1.13 and what is needed is 1.13.25.

Question:  Is it to be understood that versions of gnutar _greater_ than 
1.13.25 are also incompatible?  Or is it implied that those are also 
valid & acceptable for use with Amanda?
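
To make the "and above" reading concrete, here is a minimal sketch of how the three versions mentioned in this thread order numerically. It only illustrates the comparison; it says nothing about whether releases after 1.13.25 actually behave correctly with Amanda, which is the open question. The helper name parse_version is made up for illustration.

```python
def parse_version(text):
    """Turn a dotted version string such as '1.13.25' into a tuple of ints."""
    return tuple(int(part) for part in text.split("."))

# 1.13.25 is the version named as "what is needed" in the quoted reply.
minimum = parse_version("1.13.25")

results = {}
for candidate in ("1.13", "1.13.25", "1.13.90"):
    # Tuples compare element by element; a shorter prefix sorts first.
    results[candidate] = parse_version(candidate) >= minimum
    print(candidate, "is at or above 1.13.25" if results[candidate] else "is older than 1.13.25")
```

Under the "and above" reading, 1.13.90 would qualify and Sun's 1.13 would not.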

I ask because I'm seeing the same problem as Mr. Kato, only on an 
HP-UX server, running gnutar 1.13.90 and Amanda 2.4.4p1.  chg-scsi is 
dumping to one of four tape drives in a 4/48 DLT library.

There is this one DLE that gets to a certain point and then just sits 
there.  I am going to go back and look through past logs to see if it 
occurs at the same point in time.  Failing that, I'll extract the 
portion of the dumpfile that is actually on the tape, and skip to the 
end of it to see if the last file in the dump might somehow be causing 
the hang.

Looking at the "amstatus" output, it's already written about 600mb to 
the tape; the DLE in question takes up about 2.9gb.

This DLE runs on the Amanda tape server; I have configured most of the 
DLEs from the tape server to be "nohold", since most of those 
filesystems live in the same disk subsystem as the holding area.  This 
DLE deals with a filesystem that could hold as much as 50gb, so I broke 
it into two DLEs (as I can only get about 30gb on each tape):

s2sme /deploy/lynx/packages /deploy/lynx/packages {
        user-tar-nohold
        exclude "./[0-9]*"
}
s2sme /deploy/lynx/packages/09 /deploy/lynx/packages {
        user-tar-nohold
        include "./[0-9]*"
}

This turns out to be a pretty even split of what could be as much as 
25gb per DLE.
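
As a rough illustration of how the "./[0-9]*" pattern divides the directory between the two DLEs, here is a small sketch using Python's fnmatch. This is only an approximation: GNU tar's own pattern matching is not guaranteed to behave identically, and the entry names below are hypothetical, not the real contents of /deploy/lynx/packages.

```python
from fnmatch import fnmatch

def split_entries(entries, pattern="./[0-9]*"):
    """Split entries into those matching the pattern and those that do not."""
    matched = [e for e in entries if fnmatch(e, pattern)]
    rest = [e for e in entries if not fnmatch(e, pattern)]
    return matched, rest

# Hypothetical top-level entries, for illustration only.
entries = ["./01", "./09", "./2003-12", "./README", "./archive"]

# The 'include' DLE takes the matches; the 'exclude' DLE takes the rest.
include_dle, exclude_dle = split_entries(entries)
print("include DLE:", include_dle)
print("exclude DLE:", exclude_dle)
```

Every entry lands in exactly one of the two lists, which is the property the pair of DLEs relies on.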

Last night, though, the first DLE only needed a level 1, so it was only 
20kb; it would have been 230mb at level 0.  The second DLE's level 0 
shows as 2.9gb in the 'amstatus' output, but the resulting dump file 
(see below) turned out to be only 1.2gb.

BTW, etimeout is set to 7200 seconds, which has long since elapsed.  
Checking the status of one of the taper processes with 'lsof', I see 
that the size/offset value is unchanged from when I checked it at 5am 
today.  Nor have the size/offset values for the associated dumper and 
sendbackup processes changed, either.
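
The check described above amounts to comparing two snapshots of per-process offsets and flagging whatever did not move. A minimal sketch of that comparison follows; the function name and the offset numbers are made up for illustration and are not values from the actual run, and collecting the snapshots (from 'lsof' or otherwise) is left out.

```python
def stalled(before, after):
    """Return the names whose offset did not advance between two snapshots."""
    return sorted(name for name, offset in before.items()
                  if after.get(name) == offset)

# Hypothetical size/offset values keyed by process name.
snap_early = {"taper": 1200, "dumper": 1200, "sendbackup": 1300}
snap_later = {"taper": 1200, "dumper": 1200, "sendbackup": 1300}

# All three offsets are identical in both snapshots, so all three are listed.
stuck = stalled(snap_early, snap_later)
print(stuck)
```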

The tail end of the amdump file (which last changed at 22:29 last night) 
has this:

driver: send-cmd time 1766.131 to taper: PORT-WRITE 00-00083 s2sme fffffeff9ffe0f /deploy/lynx/packages/09 0 20031215
taper: try_socksize: receive buffer size is 65536
taper: stream_server: waiting for connection: 0.0.0.0.63925
driver: result time 1766.136 from taper: PORT 63925
driver: send-cmd time 1766.136 to dumper0: PORT-DUMP 01-00084 63925 s2sme fffffeff9ffe0f /deploy/lynx/packages/09 /deploy/lynx/packages 0 1970:1:1:0:0:0 GNUTAR |;auth=bsd;index;include-file=./[0-9]*;
driver: state time 1766.137 free kps: 9969 space: 0 taper: writing idle-dumpers: 3 qlen tapeq: 0 runq: 5 roomq: 0 wakeup: 86400 driver-idle: not-idle
driver: interface-state time 1766.137 if : free 9969
driver: hdisk-state time 1766.137
taper: stream_accept: connection from 127.0.0.1.63926
taper: try_socksize: receive buffer size is 32768
dumper: stream_client: connected to 127.0.0.1.63925
dumper: stream_client: our side is 0.0.0.0.63926
dumper: try_socksize: send buffer size is 65536
dumper: stream_client: connected to 10.2.227.75.63927
dumper: stream_client: our side is 0.0.0.0.63930
dumper: stream_client: connected to 10.2.227.75.63928
dumper: stream_client: our side is 0.0.0.0.63931
dumper: stream_client: connected to 10.2.227.75.63929
dumper: stream_client: our side is 0.0.0.0.63932
dumper: pid 4426 receive size is 65535, low water is 32768

Unless there's something there right under my nose, I don't see anything 
foreboding or otherwise problematic there.
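
For anyone who wants to pull the fields out of the "driver: state" line rather than read it by eye, here is a small sketch. The "key: value" layout is inferred only from the single line shown above, not from any Amanda documentation, so treat the regex as an assumption; parse_driver_state is a made-up helper name.

```python
import re

def parse_driver_state(line):
    """Collect 'key: value' pairs from a 'driver: state' line into a dict."""
    pairs = re.findall(r"([\w-]+): (\S+)", line)
    return {key: value for key, value in pairs}

# The state line from the amdump excerpt, rejoined onto one line.
line = ("driver: state time 1766.137 free kps: 9969 space: 0 taper: writing "
        "idle-dumpers: 3 qlen tapeq: 0 runq: 5 roomq: 0 wakeup: 86400 "
        "driver-idle: not-idle")

state = parse_driver_state(line)
# Values stay as strings; convert with int() where a number is needed.
print(state["taper"], state["runq"], state["idle-dumpers"])
```

Read this way, the last recorded state has the taper writing, three idle dumpers, and five entries still in the run queue.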

Looking at the log file, I don't see any failures, warnings, or errors 
there, either.  The last entry is the success msg for the preceding DLE.

The corresponding /tmp/amanda/runtar.*.debug file simply has:

runtar: debug 1 pid 8515 ruid 111 euid 0: start at Mon Dec 15 22:29:32 2003
gtar: version 2.4.4p1
running: /opt/gnu/bin/tar: gtar --create --file - --directory /deploy/lynx/packages --one-file-system --listed-incremental /var/opt/amanda/gnutar-lists/s2sme_deploy_lynx_packages_09_0.new --sparse --ignore-failed-read --totals --files-from /tmp/amanda/sendbackup._deploy_lynx_packages_09.20031215222930.include

... and the output in /tmp/amanda/sendsize.*.debug shows that it ran 
fine.

I ran the runtar command by hand and piped the output to 'tar -tvf -'.  
It listed all the files I expected to see, and finished going through 
the ~3gb in under 3 minutes.  Granted, this was much faster, as it was 
through a pipe instead of going to the tape device, but it *did* finish, 
rather than just sitting there!  *sigh*

The corresponding chg-scsi.* file holds no error messages or warnings, 
either.

I ran the runtar command again and this time sent it into a file.  It 
took only about 6 minutes for 1.2gb.  If I run flat out of ideas, I 
might try dumping this file to a scratch tape, to see if maybe *that* 
hangs mysteriously.

But I don't know where else to look to find out just which process is 
hung.  Does anyone have any ideas?

Also, does anyone know if there is some way to just _nudge_ an Amanda 
process, if it's locked up, so that it gives up on the current DLE and 
moves on to the next one?  I've tried sending SIGALRM and SIGHUP to the 
sendbackup, dumper, and taper processes (at different times! :), but 
that just stopped them, rather than making them skip to the next DLE.  
Checking in the source, I find that there are no references to SIGHUP in 
client-src/ or server-src/, and the only SIGALRM reference is in 
client-src/killpgrp.c.
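
For what it's worth, a process that installs no handler for SIGHUP or SIGALRM gets the default action for those signals, which on POSIX systems is termination. That would be consistent with the processes simply stopping. Here is a minimal sketch that demonstrates the default action on a throwaway child process; it does not touch Amanda at all, and it assumes a POSIX system where the parent has not set these signals to be ignored.

```python
import signal
import subprocess
import sys

def default_effect(sig):
    """Start a sleeping child with no handlers installed, send it `sig`,
    and return its exit status as reported by subprocess."""
    child = subprocess.Popen(
        [sys.executable, "-c", "import time; time.sleep(60)"])
    child.send_signal(sig)
    # subprocess reports death by signal N as a return code of -N.
    return child.wait(timeout=30)

statuses = {}
for sig in (signal.SIGHUP, signal.SIGALRM):
    statuses[sig] = default_effect(sig)
    killed = statuses[sig] == -sig
    print(sig.name, "terminated the child" if killed else f"status {statuses[sig]}")
```

Getting a process to skip to the next DLE on a signal would need a handler written for that purpose, which the source search above suggests is not there.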

So, any ideas, folks?  Thanks!
-- Mark

PS -- I realize that I didn't include my config files, but I tried to 
provide all the necessary info above.  If not, let me know and I'll pass 
along the salient portions of the config files, too.  Tnx again!