Amanda-Users

Re: amverify stuck, using 95% of cpu

2004-11-16 09:11:42
Subject: Re: amverify stuck, using 95% of cpu
From: Gene Heskett <gene.heskett AT verizon DOT net>
To: amanda-users AT amanda DOT org
Date: Tue, 16 Nov 2004 09:05:47 -0500
On Monday 15 November 2004 10:21, Gene Heskett wrote:
>Greetings;
>
>In an attempt to get caught up as amanda was falling behind, I
>set my runtapes for 2, and reset the tapesize to a value that
>said 30 vtapes worth would be about a 99% fill of that partition.
>
>I note a couple of things this morning.
>
>1. amrestore seems to be stuck in a loop of some kind almost 6
>hours after when it should be done with the verify phase of my
> script.
>
>2. I've gotten no email from amverify yet at 9:12 am.
>
>I'm going to kill the processes, but does anyone have an idea what
>may have happened? the htop line shows that amrestore was working
>on the 2nd vtape.
>
>It took htop several tries to kill it all, and when the cpu load
>was back to normal, slocate and updatedb were both still running,
>and should have been done by 4:20 am.
>
>I now have an unkillable zombie shell according to htop.
>
>Inspecting the partition containing the vtapes, the pair that were
>used last night discloses that there are no TAPE-END files on the
>first (slot17) one, but that they were there for the second vtape
>used, slot18.  Is this an error?
>
>On a re-run of 'amverify Daily 17 2' its once again stuck, on vtape
> 17, after having reported:
>
>[amanda@coyote Dailys]$ amverify Daily 17 2
>Tape changer is chg-disk...
>2 slots...
>Verify summary to root AT coyote.coyote DOT den
>Defects file is /tmp/amanda/amverify.16846/defects
>amverify Daily
>Mon Nov 15 09:37:10 EST 2004
>
>Loading 17 slot...
>Using device file:/amandatapes/Dailys/
>Waiting for device to go ready...
>Rewinding...
>Processing label...
>Volume Dailys-17, Date 20041115
>Rewinding...
>Checked gene._usr_src.20041115.1
>Checked gene._root.20041115.1
>Checked gene._opt.20041115.1
>Checked gene._usr_local.20041115.1
>Checked gene._boot.20041115.1
>Checked gene._lib.20041115.1
>Checked gene._etc.20041115.1
>Checked gene._usr_bin.20041115.1
>Checked gene._home.20041115.1
>Checked gene._bin.20041115.1
>Checked gene._sbin.20041115.1
>
>According to the printout, the next one is also a tiny 50kb file,
>but its stuck on it just the same.  But, approximately 10 minutes
>later, it continues where it stopped:
>
>Checked coyote._usr_src.20041115.0
>Checked coyote._usr_dlds-rpms.20041115.0
>Checked coyote._usr_dlds-misc_FC3-i386-disk2.iso.20041115.1
>Checked coyote._usr_dlds-misc_FC3_FC3-i386-disk3.iso.20041115.1
>** Error detected
> (coyote._usr_dlds-misc_FC3_FC3-i386-disc1.iso.20041115.1)
> amrestore: WARNING: not at start of tape, file numbers will be
> offset amrestore:   0: restoring
> coyote._usr_dlds-misc_FC3_FC3-i386-disc1.iso.20041115.1 amrestore: 
>  1: reached end of information
>/bin/gtar: Unexpected EOF in archive
>/bin/gtar: Error is not recoverable: exiting now
>64+0 in
>64+0 out
>
>The printouts from 3hole.ps aren't sorted in file number order,
>which makes it a bit of work to figure out what its doing. usr
>cpu is about 75% and system cpu is about 25%, no nice 19 process
>is getting any cpu at all.  However, a simple ctl c in the shell
>I started amverify from gets an instant, total exit of that whole
>process tree.  This after letting it continue for long enough to
>have verified the rest of the backup, around half an hour.
>
>I just got a bounce message from the cron daemon, the email from
>the first run of the script 'backup.sh' was undeliverably after
>4 hours in the queue.  And 2 more message yet wait to be read,
>so the last message is from the original amverify run, and it ends
>like this:
>
>** Error detected ()
>/usr/local/sbin/amverify: xrealloc: cannot reallocate 1110884480
> bytes (0 bytes allocated) The script /amanda/bak-indices-configs is
> finished.
>
>Now, the fact that it printed the last line is out of order again,
> this line is output by the script that does the configs and indexes
> seperately, and before the amverify phase (whose printout is above
> the error report above) is even launched.

Found that just now, I had 2 calls to amverify, one in each script :(

>Call me puzzled. That run of amverify STARTED on slot 18, the 2nd
>vtape used.  Do I need, in my script, to save the slot and runtape
> numbers it starts with and use them to run amverify with?
>
>More clues, from an ls -l of slot17:
>total 6079436
>-rw-------  1 amanda disk         10 Nov 15 01:56 00000-Dailys-17
>-rw-------  1 amanda disk      32768 Nov 15 01:56 00000.Dailys-17
>-rw-------  1 amanda disk         12 Nov 15 01:57 00001-gene._usr_src.1
>-rw-------  1 amanda disk    9404416 Nov 15 01:57 00001.gene._usr_src.1
>-rw-------  1 amanda disk         11 Nov 15 01:59 00002-gene._root.1
>-rw-------  1 amanda disk    1048576 Nov 15 01:59 00002.gene._root.1
>-rw-------  1 amanda disk         10 Nov 15 01:59 00003-gene._opt.1
>-rw-------  1 amanda disk     327680 Nov 15 01:59 00003.gene._opt.1
>-rw-------  1 amanda disk         11 Nov 15 01:59 00004-gene._usr_local.1
>-rw-------  1 amanda disk     589824 Nov 15 01:59 00004.gene._usr_local.1
>-rw-------  1 amanda disk         10 Nov 15 02:00 00005-gene._boot.1
>-rw-------  1 amanda disk      65536 Nov 15 02:00 00005.gene._boot.1
>-rw-------  1 amanda disk         10 Nov 15 02:00 00006-gene._lib.1
>-rw-------  1 amanda disk      65536 Nov 15 02:00 00006.gene._lib.1
>-rw-------  1 amanda disk         10 Nov 15 02:00 00007-gene._etc.1
>-rw-------  1 amanda disk      65536 Nov 15 02:00 00007.gene._etc.1
>-rw-------  1 amanda disk         10 Nov 15 02:00 00008-gene._usr_bin.1
>-rw-------  1 amanda disk      98304 Nov 15 02:00 00008.gene._usr_bin.1
>-rw-------  1 amanda disk         10 Nov 15 02:01 00009-gene._home.1
>-rw-------  1 amanda disk      65536 Nov 15 02:01 00009.gene._home.1
>-rw-------  1 amanda disk         10 Nov 15 02:01 00010-gene._bin.1
>-rw-------  1 amanda disk      65536 Nov 15 02:01 00010.gene._bin.1
>-rw-------  1 amanda disk         10 Nov 15 02:59 00011-gene._sbin.1
>-rw-------  1 amanda disk      65536 Nov 15 02:59 00011.gene._sbin.1
>-rw-------  1 amanda disk         15 Nov 15 03:05 00012-coyote._usr_src.0
>-rw-------  1 amanda disk 3655696384 Nov 15 03:05 00012.coyote._usr_src.0
>-rw-------  1 amanda disk         14 Nov 15 03:07 00013-coyote._usr_dlds-rpms.0
>-rw-------  1 amanda disk 1105625088 Nov 15 03:07 00013.coyote._usr_dlds-rpms.0
>-rw-------  1 amanda disk         14 Nov 15 03:08 
>00014-coyote._usr_dlds-misc_FC3-i386-disk2.iso.1
>-rw-------  1 amanda disk  668565504 Nov 15 03:08 
>00014.coyote._usr_dlds-misc_FC3-i386-disk2.iso.1
>-rw-------  1 amanda disk         14 Nov 15 03:13 
>00015-coyote._usr_dlds-misc_FC3_FC3-i386-disk3.iso.1
>-rw-------  1 amanda disk  667549696 Nov 15 03:13 
>00015.coyote._usr_dlds-misc_FC3_FC3-i386-disk3.iso.1
>-rw-------  1 amanda disk         12 Nov 15 03:13 
>00016-coyote._usr_dlds-misc_FC3_FC3-i386-disc1.iso.1
>-rw-------  1 amanda disk   14352384 Nov 15 03:13 
>00016.coyote._usr_dlds-misc_FC3_FC3-i386-disc1.iso.1
>-rw-r--r--  1 amanda disk      71680 Oct 28 02:42 configuration.tar
>-rw-r--r--  1 amanda disk   95354880 Oct 28 02:42 indices.tar
>------------ 
>The 'disk1.iso' is the truncated by hitting EOT file while
>doing vtape 17.  It is present in its entirety on the next vtape.
>
>Which brings up the subject of an email I sent around a week ago,
>which implores that partial files, when running in a vtape
> environment, should either be deleted, or written in their
> entirety, and then go on to the next tape without reporting the
> overrun as an EOT error.
>
>I'm going to go work on my amverify launcher.  However, I don't
>believe thats the whole story here, tommorrow will tell the truth
>of that I expect.  The fact that its also hung when run by hand,
>using the proper syntax above is the unexpected feature/bug.

Now, based on last nights also hung run, I've now rewrote both of
the wrapper scripts I use a bit more, and they would work IF
I think, there was a 000xx-TAPEEND set of marker files written to the
first and subsequent vtapes of a multiple vtape (runtapes > 1) setup.
But they are not being written in my case to the first vtape used
of the evening.

In any event, I'm getting an on-screen error of sorts when I invoke
an amverify run by hand using this syntax:

[amanda@coyote amanda]$ amverify Daily 19 2

The echo's on-screen are:

Tape changer is chg-disk...
2 slots...
Verify summary to root AT coyote.coyote DOT den
Defects file is /tmp/amanda/amverify.16025/defects
amverify Daily
Tue Nov 16 08:08:06 EST 2004

Loading 19 slot...
Using device file:/amandatapes/Dailys/
Waiting for device to go ready...
Rewinding...
Processing label...
Volume Dailys-19, Date 20041116
Rewinding...
aborted!
-------------
At the second rewind command issued by amverify, the cpu went to 100%
and stayed there for many minutes, at least 5.  I eventually ctl-c'd
it, which it responded to instantly, but it did take several seconds
to return the shells prompt, time I assume was used to send me an email
indicating it had been aborted.

Aha! A new clue here folks.

The first vtape used last night does NOT have the 2 TAPEND files
that amanda (apparently) uses to keep track of things.  That was slot19
so here is an ls -l of that slot it wrote for the first
vtape last night:
----------
[amanda@coyote amanda]$ ls -l /amandatapes/Dailys/slot19
total 6048616
-rw-------  1 amanda disk        10 Nov 16 02:37 00000-Dailys-19
-rw-------  1 amanda disk     32768 Nov 16 02:37 00000.Dailys-19
-rw-------  1 amanda disk        14 Nov 16 02:48 00001-coyote._tmp.1
-rw-------  1 amanda disk 982548480 Nov 16 02:48 00001.coyote._tmp.1
-rw-------  1 amanda disk        14 Nov 16 02:58 00002-coyote._usr_lib.0
-rw-------  1 amanda disk 457113600 Nov 16 02:58 00002.coyote._usr_lib.0
-rw-------  1 amanda disk        14 Nov 16 03:02 00003-coyote._usr_share.0
-rw-------  1 amanda disk 681508864 Nov 16 03:02 00003.coyote._usr_share.0
-rw-------  1 amanda disk        13 Nov 16 03:06 00004-coyote._home.0
-rw-------  1 amanda disk 184745984 Nov 16 03:06 00004.coyote._home.0
-rw-------  1 amanda disk        14 Nov 16 03:09 00005-coyote._root.1
-rw-------  1 amanda disk 346193920 Nov 16 03:09 00005.coyote._root.1
-rw-------  1 amanda disk        13 Nov 16 03:14 00006-coyote._opt.0
-rw-------  1 amanda disk 226099200 Nov 16 03:14 00006.coyote._opt.0
-rw-------  1 amanda disk        12 Nov 16 03:14 00007-coyote._usr_src.1
-rw-------  1 amanda disk   8257536 Nov 16 03:14 00007.coyote._usr_src.1
-rw-------  1 amanda disk        14 Nov 16 03:15 
00008-coyote._usr_dlds-misc_FC3-i386-disk2.iso.1
-rw-------  1 amanda disk 668565504 Nov 16 03:15 
00008.coyote._usr_dlds-misc_FC3-i386-disk2.iso.1
-rw-------  1 amanda disk        14 Nov 16 03:16 
00009-coyote._usr_dlds-misc_FC3_FC3-i386-disk3.iso.1
-rw-------  1 amanda disk 667549696 Nov 16 03:16 
00009.coyote._usr_dlds-misc_FC3_FC3-i386-disk3.iso.1
-rw-------  1 amanda disk        14 Nov 16 03:18 
00010-coyote._usr_dlds-misc_FC3_FC3-i386-disc1.iso.1
-rw-------  1 amanda disk 647036928 Nov 16 03:18 
00010.coyote._usr_dlds-misc_FC3_FC3-i386-disc1.iso.1
-rw-------  1 amanda disk        14 Nov 16 03:19 
00011-coyote._usr_dlds-misc_FC3_FC3-i386-SRPMS-disk4.iso.1
-rw-------  1 amanda disk 589856768 Nov 16 03:19 
00011.coyote._usr_dlds-misc_FC3_FC3-i386-SRPMS-disk4.iso.1
-rw-------  1 amanda disk        14 Nov 16 03:19 
00012-coyote._usr_dlds-misc_FC3_FC3-i386-SRPMS-disk2.iso.1
-rw-------  1 amanda disk 589889536 Nov 16 03:19 
00012.coyote._usr_dlds-misc_FC3_FC3-i386-SRPMS-disk2.iso.1
-rw-------  1 amanda disk        13 Nov 16 03:20 
00013-coyote._usr_dlds-misc_FC3_FC3-i386-SRPMS-disc1.iso.1
-rw-------  1 amanda disk  74285056 Nov 16 03:20 
00013.coyote._usr_dlds-misc_FC3_FC3-i386-SRPMS-disc1.iso.1
-rw-r--r--  1 amanda disk     71680 Oct  2 04:29 configuration.tar
-rw-r--r--  1 amanda disk  63836160 Oct  2 04:29 indices.tar
----------

Does this explain why it gets hung on the second rewind?

And I see my scripts need help in the runtapes > 1 area too, the last
2 files were not deleted either as they are certainly 30 days old and
invalid now.  And its 64 megs I could use for real backups.

I may have to merge those two scripts just so I can keep track of
the vars used.  Or pass vars from one to the other on the cli,
a better idea I think.  Make use of the $1 and $2 bash vars.

If this rewind hang cannot be readily fixed, I'll have to reduce
my vtape count to allow for bigger vtapes, and that means I'll
have to either edit my tapelist, removeing the 10 or so most
recently used tapes, or simply restart it all from scratch, losing
everything I have until it gets rebuilt.  Ugly thought that.  But
it probably the path that I'll take nonetheless.

Humm, thinking out loud, the method I use in the scripts to
determine the vtape used will require an amcheck run to
properly set the starting vtape slotxx, so its correct and
coherent with what amdump will do.  But thats a minor detail. 

What about the rewind hang at 99% cpu?  Thats the real killer here.
 
-- 
Cheers, Gene
"There are four boxes to be used in defense of liberty:
 soap, ballot, jury, and ammo. Please use in that order."
-Ed Howdershelt (Author)
99.29% setiathome rank, not too shabby for a WV hillbilly
Yahoo.com attorneys please note, additions to this message
by Gene Heskett are:
Copyright 2004 by Maurice Eugene Heskett, all rights reserved.

<Prev in Thread] Current Thread [Next in Thread>