Amanda-Users

amverify stuck, using 95% of cpu

2004-11-15 10:29:29
Subject: amverify stuck, using 95% of cpu
From: Gene Heskett <gene.heskett AT verizon DOT net>
To: amanda-users AT amanda DOT org
Date: Mon, 15 Nov 2004 10:21:46 -0500
Greetings;

In an attempt to get caught up as amanda was falling behind, I
set my runtapes for 2, and reset the tapesize to a value that
said 30 vtapes worth would be about a 99% fill of that partition.

I note a couple of things this morning.

1. amrestore seems to be stuck in a loop of some kind almost 6
hours after when it should be done with the verify phase of my script.

2. I've gotten no email from amverify yet at 9:12 am.

I'm going to kill the processes, but does anyone have an idea what
may have happened? the htop line shows that amrestore was working
on the 2nd vtape.

It took htop several tries to kill it all, and when the cpu load
was back to normal, slocate and updatedb were both still running,
and should have been done by 4:20 am.

I now have an unkillable zombie shell according to htop.

Inspecting the partition containing the vtapes, the pair that were
used last night discloses that there are no TAPE-END files on the
first (slot17) one, but that they were there for the second vtape
used, slot18.  Is this an error?

On a re-run of 'amverify Daily 17 2' its once again stuck, on vtape 17,
after having reported:

[amanda@coyote Dailys]$ amverify Daily 17 2
Tape changer is chg-disk...
2 slots...
Verify summary to root AT coyote.coyote DOT den
Defects file is /tmp/amanda/amverify.16846/defects
amverify Daily
Mon Nov 15 09:37:10 EST 2004

Loading 17 slot...
Using device file:/amandatapes/Dailys/
Waiting for device to go ready...
Rewinding...
Processing label...
Volume Dailys-17, Date 20041115
Rewinding...
Checked gene._usr_src.20041115.1
Checked gene._root.20041115.1
Checked gene._opt.20041115.1
Checked gene._usr_local.20041115.1
Checked gene._boot.20041115.1
Checked gene._lib.20041115.1
Checked gene._etc.20041115.1
Checked gene._usr_bin.20041115.1
Checked gene._home.20041115.1
Checked gene._bin.20041115.1
Checked gene._sbin.20041115.1

According to the printout, the next one is also a tiny 50kb file,
but its stuck on it just the same.  But, approximately 10 minutes
later, it continues where it stopped:

Checked coyote._usr_src.20041115.0
Checked coyote._usr_dlds-rpms.20041115.0
Checked coyote._usr_dlds-misc_FC3-i386-disk2.iso.20041115.1
Checked coyote._usr_dlds-misc_FC3_FC3-i386-disk3.iso.20041115.1
** Error detected (coyote._usr_dlds-misc_FC3_FC3-i386-disc1.iso.20041115.1)
amrestore: WARNING: not at start of tape, file numbers will be offset
amrestore:   0: restoring 
coyote._usr_dlds-misc_FC3_FC3-i386-disc1.iso.20041115.1
amrestore:   1: reached end of information
/bin/gtar: Unexpected EOF in archive
/bin/gtar: Error is not recoverable: exiting now
64+0 in
64+0 out

The printouts from 3hole.ps aren't sorted in file number order,
which makes it a bit of work to figure out what its doing. usr
cpu is about 75% and system cpu is about 25%, no nice 19 process
is getting any cpu at all.  However, a simple ctl c in the shell
I started amverify from gets an instant, total exit of that whole
process tree.  This after letting it continue for long enough to
have verified the rest of the backup, around half an hour.

I just got a bounce message from the cron daemon, the email from
the first run of the script 'backup.sh' was undeliverably after
4 hours in the queue.  And 2 more message yet wait to be read,
so the last message is from the original amverify run, and it ends
like this:

** Error detected ()
/usr/local/sbin/amverify: xrealloc: cannot reallocate 1110884480 bytes (0 bytes 
allocated)
The script /amanda/bak-indices-configs is finished.

Now, the fact that it printed the last line is out of order again, this
line is output by the script that does the configs and indexes seperately,
and before the amverify phase (whose printout is above the error
report above) is even launched.

Call me puzzled. That run of amverify STARTED on slot 18, the 2nd
vtape used.  Do I need, in my script, to save the slot and runtape numbers
it starts with and use them to run amverify with?

More clues, from an ls -l of slot17:
total 6079436
-rw-------  1 amanda disk         10 Nov 15 01:56 00000-Dailys-17
-rw-------  1 amanda disk      32768 Nov 15 01:56 00000.Dailys-17
-rw-------  1 amanda disk         12 Nov 15 01:57 00001-gene._usr_src.1
-rw-------  1 amanda disk    9404416 Nov 15 01:57 00001.gene._usr_src.1
-rw-------  1 amanda disk         11 Nov 15 01:59 00002-gene._root.1
-rw-------  1 amanda disk    1048576 Nov 15 01:59 00002.gene._root.1
-rw-------  1 amanda disk         10 Nov 15 01:59 00003-gene._opt.1
-rw-------  1 amanda disk     327680 Nov 15 01:59 00003.gene._opt.1
-rw-------  1 amanda disk         11 Nov 15 01:59 00004-gene._usr_local.1
-rw-------  1 amanda disk     589824 Nov 15 01:59 00004.gene._usr_local.1
-rw-------  1 amanda disk         10 Nov 15 02:00 00005-gene._boot.1
-rw-------  1 amanda disk      65536 Nov 15 02:00 00005.gene._boot.1
-rw-------  1 amanda disk         10 Nov 15 02:00 00006-gene._lib.1
-rw-------  1 amanda disk      65536 Nov 15 02:00 00006.gene._lib.1
-rw-------  1 amanda disk         10 Nov 15 02:00 00007-gene._etc.1
-rw-------  1 amanda disk      65536 Nov 15 02:00 00007.gene._etc.1
-rw-------  1 amanda disk         10 Nov 15 02:00 00008-gene._usr_bin.1
-rw-------  1 amanda disk      98304 Nov 15 02:00 00008.gene._usr_bin.1
-rw-------  1 amanda disk         10 Nov 15 02:01 00009-gene._home.1
-rw-------  1 amanda disk      65536 Nov 15 02:01 00009.gene._home.1
-rw-------  1 amanda disk         10 Nov 15 02:01 00010-gene._bin.1
-rw-------  1 amanda disk      65536 Nov 15 02:01 00010.gene._bin.1
-rw-------  1 amanda disk         10 Nov 15 02:59 00011-gene._sbin.1
-rw-------  1 amanda disk      65536 Nov 15 02:59 00011.gene._sbin.1
-rw-------  1 amanda disk         15 Nov 15 03:05 00012-coyote._usr_src.0
-rw-------  1 amanda disk 3655696384 Nov 15 03:05 00012.coyote._usr_src.0
-rw-------  1 amanda disk         14 Nov 15 03:07 00013-coyote._usr_dlds-rpms.0
-rw-------  1 amanda disk 1105625088 Nov 15 03:07 00013.coyote._usr_dlds-rpms.0
-rw-------  1 amanda disk         14 Nov 15 03:08 
00014-coyote._usr_dlds-misc_FC3-i386-disk2.iso.1
-rw-------  1 amanda disk  668565504 Nov 15 03:08 
00014.coyote._usr_dlds-misc_FC3-i386-disk2.iso.1
-rw-------  1 amanda disk         14 Nov 15 03:13 
00015-coyote._usr_dlds-misc_FC3_FC3-i386-disk3.iso.1
-rw-------  1 amanda disk  667549696 Nov 15 03:13 
00015.coyote._usr_dlds-misc_FC3_FC3-i386-disk3.iso.1
-rw-------  1 amanda disk         12 Nov 15 03:13 
00016-coyote._usr_dlds-misc_FC3_FC3-i386-disc1.iso.1
-rw-------  1 amanda disk   14352384 Nov 15 03:13 
00016.coyote._usr_dlds-misc_FC3_FC3-i386-disc1.iso.1
-rw-r--r--  1 amanda disk      71680 Oct 28 02:42 configuration.tar
-rw-r--r--  1 amanda disk   95354880 Oct 28 02:42 indices.tar
------------
The 'disk1.iso' is the truncated by hitting EOT file while
doing vtape 17.  It is present in its entirety on the next vtape.

Which brings up the subject of an email I sent around a week ago,
which implores that partial files, when running in a vtape environment,
should either be deleted, or written in their entirety, and then
go on to the next tape without reporting the overrun as an EOT error.

I'm going to go work on my amverify launcher.  However, I don't
believe thats the whole story here, tommorrow will tell the truth
of that I expect.  The fact that its also hung when run by hand,
using the proper syntax above is the unexpected feature/bug.

-- 
Cheers, Gene
"There are four boxes to be used in defense of liberty:
 soap, ballot, jury, and ammo. Please use in that order."
-Ed Howdershelt (Author)
99.29% setiathome rank, not too shabby for a WV hillbilly
Yahoo.com attorneys please note, additions to this message
by Gene Heskett are:
Copyright 2004 by Maurice Eugene Heskett, all rights reserved.

<Prev in Thread] Current Thread [Next in Thread>