Dear Networker users,
I was restoring a large server (900 GB) volume whose RAID5 array had
failed. (Don't think that can't happen.)
All tapes were in the jukeboxes.
All volumes had passed verification.
Last night, the messages below came up on the machine doing the recover,
while reading the volume Full.0517.
Eventually, I canceled the backup.
I've restarted the process, but even "add *" in the home directory
takes a VERY long time to complete.
After the read failure below, the unload failed -- tape drive was hung
in some way.
This morning, I powered down the drive and powered it back up.
I ran scanner to verify the label.
I command-mounted Full.0518, but the restore did not continue -- it just
sat there for
more than 45 minutes, so
now I'm restarting this huge restore of 100s of thousands of files.
Could I have done anything to get this to continue?
Why did the request-to-mount Full.0518 messages stop?
Server and Client are Linux
Server is running Networker 7.1.3
Client is running 7.1.2
(AIT-3 tape drives)
Messages from log files are below.
Thanks for reading this. I"ve got some unhappy users right now.
Any insight would be enormously appreciated.
Matt Temple
================MESSAGES ON RECOVERING CLIENT===========================
On the computer device requesting the recover I saw the following:
recover: Unable to read checksum from save stream
recover: error recovering
./Mk822/platelets/GRANULES/Plt_LAMP2/Platelet_Lamp2/WTplt_SLPI_LAMP2.103_R3D.dv
Error encountered on the following files by NSR server
`tapes.dfci.harvard.edu': can not read record 2972 of file 49 on 8mm
AIT-3 tape Full.0517
./Mk822/platelets/GRANULES/Plt_LAMP2/Platelet_Lamp2/WTplt_SLPI_LAMP2.100_R3D_D3D.dv
@ Sun Sep 12 10:10:28 2004
./Mk822/platelets/GRANULES/Plt_LAMP2/Platelet_Lamp2/WTplt_SLPI_LAMP2.103_R3D.dv
@ Sun Sep 12 10:10:28 2004
recover: Requesting remaining 1169907 file(s) from NSR server
`tapes.dfci.harvard.edu'
NSR server `tapes.dfci.harvard.edu': busy
recover: waiting 30 seconds then retrying
NSR server `tapes.dfci.harvard.edu': Unable to send
recover: waiting 30 seconds then retrying
[ 4-hour gap with no more messages -- seems to have died; finally I just
canceled ]
recover: RPC error: Cancelled
Recover errors with 2 file(s)
Recover completion time: Wed Sep 21 08:55:10 2005
recover> volumes
===================ENTRIES FROM SERVER LOG
FILE================================================
On the networker server:
Sep 21 01:11:50 tapes logger: NetWorker media: (waiting) waiting for 8mm
AIT-3 tape Full.0517 (A00679) on tapes.dfci.harvard.edu
Sep 21 01:11:51 tapes logger: NetWorker media: (waiting) waiting for 8mm
AIT-3 tape Full.0517 (A00679) on tapes.dfci.harvard.edu
Sep 21 01:14:16 tapes logger: NetWorker media: (info) loading volume
Full.0517 into /dev/nst0
.....
Sep 21 03:42:30 tapes kernel: st0: Error with sense data: Info
fld=0x30000, Current st09:00: sense key Hardware Error
Sep 21 03:42:30 tapes kernel: Additional sense indicates Internal target
failure
Sep 21 03:42:31 tapes logger: NetWorker media: (warning) /dev/nst0
reading: Input/output error
Sep 21 03:42:31 tapes last message repeated 5 times
Sep 21 03:42:31 tapes logger: NetWorker media: (notice) Volume
"Full.0517" on device "/dev/nst0": Cannot decode block. Verify the
device configuration. Tape positioning by record is disabled.
*********** [ this only happens when the tape drive has magically reset
itself to fixed-block size, 512 bytes ]
Sep 21 03:42:31 tapes logger: NetWorker media: (info) can not read
record 2972 of file 49 on 8mm AIT-3 tape Full.0517
Sep 21 03:42:31 tapes logger: NetWorker media: (warning) /dev/nst0
reading: Input/output error
Sep 21 03:42:33 tapes last message repeated 49 times
Sep 21 03:43:04 tapes logger: NetWorker media: (waiting) waiting for 8mm
AIT-3 tape Full.0518 (A00680) on tapes.dfci.harvard.edu
Sep 21 03:43:04 tapes logger: NetWorker media: (waiting) waiting for 8mm
AIT-3 tape Full.0518 (A00680) on tapes.dfci.harvard.edu
Sep 21 03:43:42 tapes logger: NetWorker media: (info) unload error for
jukebox `jukebox1' detected. Retrying
-------
*********** [ these messages were interposed -- but after 4:28, I didn't
see any more "waiting" requests ]
Sep 21 04:28:36 tapes logger: NetWorker media: (info) unload retry for
jukebox `jukebox1' failed - will retry again.
Sep 21 04:28:36 tapes logger: NetWorker media: (info) unload retry for
jukebox `jukebox1': sleeping 30 seconds
Sep 21 04:28:47 tapes logger: NetWorker media: (critical) waiting for
8mm AIT-3 tape Full.0518 (A00680) on tapes.dfci.harvard.edu
Sep 21 04:28:47 tapes logger: NetWorker media: (critical) waiting for
8mm AIT-3 tape Full.0518 (A00680) on tapes.dfci.harvard.edu
Sep 21 04:29:28 tapes logger: NetWorker media: (info) unload retry for
jukebox `jukebox1' failed - will retry again.
Sep 21 04:29:28 tapes logger: NetWorker media: (info) unload retry for
jukebox `jukebox1': sleeping 30 seconds
Sep 21 04:30:20 tapes logger: NetWorker media: (info) unload retry for
jukebox `jukebox1' failed - will retry again.
etc
etc
etc
Sep 21 07:06:31 tapes logger: NetWorker media: (info) unload retry for
jukebox `jukebox1': sleeping 30 seconds
Sep 21 07:06:31 tapes logger: NetWorker media: (info) unload retry for
jukebox `jukebox1' failed - will retry again.
Sep 21 07:07:23 tapes logger: NetWorker media: (info) unload retry for
jukebox `jukebox1' failed - time limit exceeded: aborting.
----
The unload failed -- tape drive was hung in some way
I powered down the drive and powered it back up.
I ran scanner to verify the label.
I command-mounted Full.0518, but the restore did not continue -- it just
sat there.
Now I'm restarting this huge restore of 100s of thousands of files.
What did I miss?
--
=============================================================
Matthew Temple Tel: 617/632-2597
Director, Research Computing Fax: 617/582-7820
Dana-Farber Cancer Institute mht AT research.dfci.harvard DOT edu
44 Binney Street, LG300/300 http://research.dfci.harvard.edu
Boston, MA 02115 Choice is the Choice!
To sign off this list, send email to listserv AT listserv.temple DOT edu and type
"signoff networker" in the
body of the email. Please write to networker-request AT listserv.temple DOT edu
if you have any problems
wit this list. You can access the archives at
http://listserv.temple.edu/archives/networker.html or
via RSS at http://listserv.temple.edu/cgi-bin/wa?RSS&L=NETWORKER
|