Networker

[Networker] Bad problem with big restore

2005-09-21 10:18:54
Subject: [Networker] Bad problem with big restore
From: Matt Temple <mht AT RESEARCH.DFCI.HARVARD DOT EDU>
To: NETWORKER AT LISTSERV.TEMPLE DOT EDU
Date: Wed, 21 Sep 2005 10:17:49 -0400
Dear Networker users,


I was restoring a large server (900 GB) volume whose RAID5 array had failed. (Don't think that can't happen.)
All tapes were in the jukeboxes.
All volumes had passed verification.
Last night, the messages below came up on the machine doing the recover, while reading the volume Full.0517.
Eventually, I canceled the backup.
I've restarted the process, but even "add *" in the home directory takes a VERY long time to complete.


After the read failure below, the unload failed -- tape drive was hung in some way.
This morning,  I powered down the drive and powered it back up.
I ran scanner to verify the label.
I command-mounted Full.0518, but the restore did not continue -- it just sat there for
more than 45 minutes, so
now I'm restarting this huge restore of 100s of thousands of files.

Could I have done anything to get this to continue?
Why did the request-to-mount Full.0518 messages stop?

Server and Client are Linux
Server is running Networker 7.1.3
Client is running 7.1.2
(AIT-3 tape drives)

Messages from log files are below.
Thanks for reading this.   I"ve got some unhappy users right now.
Any insight would be enormously appreciated.

                                                Matt Temple


================MESSAGES ON RECOVERING CLIENT===========================

On the computer device requesting the recover I saw the following:

recover: Unable to read checksum from save stream
recover: error recovering ./Mk822/platelets/GRANULES/Plt_LAMP2/Platelet_Lamp2/WTplt_SLPI_LAMP2.103_R3D.dv Error encountered on the following files by NSR server `tapes.dfci.harvard.edu': can not read record 2972 of file 49 on 8mm AIT-3 tape Full.0517 ./Mk822/platelets/GRANULES/Plt_LAMP2/Platelet_Lamp2/WTplt_SLPI_LAMP2.100_R3D_D3D.dv @ Sun Sep 12 10:10:28 2004 ./Mk822/platelets/GRANULES/Plt_LAMP2/Platelet_Lamp2/WTplt_SLPI_LAMP2.103_R3D.dv @ Sun Sep 12 10:10:28 2004 recover: Requesting remaining 1169907 file(s) from NSR server `tapes.dfci.harvard.edu'
NSR server `tapes.dfci.harvard.edu': busy
recover: waiting 30 seconds then retrying
NSR server `tapes.dfci.harvard.edu': Unable to send
recover: waiting 30 seconds then retrying

[ 4-hour gap with no more messages -- seems to have died; finally I just canceled ]

recover: RPC error: Cancelled
Recover errors with 2 file(s)
Recover completion time: Wed Sep 21 08:55:10 2005
recover> volumes

===================ENTRIES FROM SERVER LOG FILE================================================
On the networker server:

Sep 21 01:11:50 tapes logger: NetWorker media: (waiting) waiting for 8mm AIT-3 tape Full.0517 (A00679) on tapes.dfci.harvard.edu Sep 21 01:11:51 tapes logger: NetWorker media: (waiting) waiting for 8mm AIT-3 tape Full.0517 (A00679) on tapes.dfci.harvard.edu Sep 21 01:14:16 tapes logger: NetWorker media: (info) loading volume Full.0517 into /dev/nst0
.....
Sep 21 03:42:30 tapes kernel: st0: Error with sense data: Info fld=0x30000, Current st09:00: sense key Hardware Error Sep 21 03:42:30 tapes kernel: Additional sense indicates Internal target failure Sep 21 03:42:31 tapes logger: NetWorker media: (warning) /dev/nst0 reading: Input/output error
Sep 21 03:42:31 tapes last message repeated 5 times
Sep 21 03:42:31 tapes logger: NetWorker media: (notice) Volume "Full.0517" on device "/dev/nst0": Cannot decode block. Verify the device configuration. Tape positioning by record is disabled. *********** [ this only happens when the tape drive has magically reset itself to fixed-block size, 512 bytes ] Sep 21 03:42:31 tapes logger: NetWorker media: (info) can not read record 2972 of file 49 on 8mm AIT-3 tape Full.0517 Sep 21 03:42:31 tapes logger: NetWorker media: (warning) /dev/nst0 reading: Input/output error
Sep 21 03:42:33 tapes last message repeated 49 times
Sep 21 03:43:04 tapes logger: NetWorker media: (waiting) waiting for 8mm AIT-3 tape Full.0518 (A00680) on tapes.dfci.harvard.edu Sep 21 03:43:04 tapes logger: NetWorker media: (waiting) waiting for 8mm AIT-3 tape Full.0518 (A00680) on tapes.dfci.harvard.edu Sep 21 03:43:42 tapes logger: NetWorker media: (info) unload error for jukebox `jukebox1' detected. Retrying
-------
*********** [ these messages were interposed -- but after 4:28, I didn't see any more "waiting" requests ]

Sep 21 04:28:36 tapes logger: NetWorker media: (info) unload retry for jukebox `jukebox1' failed - will retry again. Sep 21 04:28:36 tapes logger: NetWorker media: (info) unload retry for jukebox `jukebox1': sleeping 30 seconds Sep 21 04:28:47 tapes logger: NetWorker media: (critical) waiting for 8mm AIT-3 tape Full.0518 (A00680) on tapes.dfci.harvard.edu Sep 21 04:28:47 tapes logger: NetWorker media: (critical) waiting for 8mm AIT-3 tape Full.0518 (A00680) on tapes.dfci.harvard.edu Sep 21 04:29:28 tapes logger: NetWorker media: (info) unload retry for jukebox `jukebox1' failed - will retry again. Sep 21 04:29:28 tapes logger: NetWorker media: (info) unload retry for jukebox `jukebox1': sleeping 30 seconds Sep 21 04:30:20 tapes logger: NetWorker media: (info) unload retry for jukebox `jukebox1' failed - will retry again.

                       etc
                       etc
                       etc

Sep 21 07:06:31 tapes logger: NetWorker media: (info) unload retry for jukebox `jukebox1': sleeping 30 seconds Sep 21 07:06:31 tapes logger: NetWorker media: (info) unload retry for jukebox `jukebox1' failed - will retry again. Sep 21 07:07:23 tapes logger: NetWorker media: (info) unload retry for jukebox `jukebox1' failed - time limit exceeded: aborting.


----

The unload failed -- tape drive was hung in some way
I powered down the drive and powered it back up.
I ran scanner to verify the label.
I command-mounted Full.0518, but the restore did not continue -- it just sat there.
Now I'm restarting this huge restore of 100s of thousands of files.

What did I miss?


--
=============================================================
Matthew Temple                Tel:    617/632-2597
Director, Research Computing  Fax:    617/582-7820
Dana-Farber Cancer Institute  mht AT research.dfci.harvard DOT edu
44 Binney Street, LG300/300   http://research.dfci.harvard.edu
Boston, MA 02115              Choice is the Choice!

To sign off this list, send email to listserv AT listserv.temple DOT edu and type 
"signoff networker" in the
body of the email. Please write to networker-request AT listserv.temple DOT edu 
if you have any problems
wit this list. You can access the archives at 
http://listserv.temple.edu/archives/networker.html or
via RSS at http://listserv.temple.edu/cgi-bin/wa?RSS&L=NETWORKER