Networker

Re: [Networker] Bad problem with big restore

2005-09-21 10:37:11
Subject: Re: [Networker] Bad problem with big restore
From: "Maarten Boot (CWEU-USERS/CWNL)" <Maarten.Boot AT NL.COMPUWARE DOT COM>
To: NETWORKER AT LISTSERV.TEMPLE DOT EDU
Date: Wed, 21 Sep 2005 16:40:49 +0200
> All volumes had passed verification.

How do you verify,

Sadly we live (as madonna sang) in a material world.
All things material can fail, so also Raid 5's and tapes.

We verify server backups by cloning them to extra offsite tapes. This has the 
advantage that the complete backup actually gets read back from the tape and 
really verified. Still tapes can break, magnetic media can fail, and so on. 

To see if the suspected tape is really broke you can run scanner on it 
completely. This will read and process the complete tape.

I am sorry that you have felt the true nature of reality,

Maarten



On Wednesday 21 September 2005 16:17, Matt Temple wrote:
> Dear Networker users,
>
>
> I was restoring a large server (900 GB) volume whose RAID5 array had
> failed. (Don't think that can't happen.)
> All tapes were in the jukeboxes.
> All volumes had passed verification.
> Last night, the messages below came up on the machine doing the recover,
> while reading the volume Full.0517.
> Eventually, I canceled the backup.
> I've restarted the process, but even  "add *" in the home directory
> takes a VERY long time to complete.
>
>
> After the read failure below,  the unload failed -- tape drive was hung
> in some way.
> This morning,  I powered down the drive and powered it back up.
> I ran scanner to verify the label.
> I command-mounted Full.0518, but the restore did not continue -- it just
> sat there for
> more than 45 minutes, so
> now I'm restarting this huge restore of 100s of thousands of files.
>
> Could I have done anything to get this to continue?
> Why did the request-to-mount Full.0518 messages stop?
>
> Server and Client are Linux
> Server is running Networker 7.1.3
> Client is running 7.1.2
> (AIT-3 tape drives)
>
> Messages from log files are below.
> Thanks for reading this.   I"ve got some unhappy users right now.
> Any insight would be enormously appreciated.
>
>                                                  Matt Temple
>
>
> ================MESSAGES ON RECOVERING CLIENT===========================
>
> On the computer device requesting the recover I saw the following:
>
> recover: Unable to read checksum from save stream
> recover: error recovering
> ./Mk822/platelets/GRANULES/Plt_LAMP2/Platelet_Lamp2/WTplt_SLPI_LAMP2.103_R3
>D.dv Error encountered on the following files by NSR server
> `tapes.dfci.harvard.edu': can not read record 2972 of file 49 on 8mm
> AIT-3 tape Full.0517
>
> ./Mk822/platelets/GRANULES/Plt_LAMP2/Platelet_Lamp2/WTplt_SLPI_LAMP2.100_R3
>D_D3D.dv @ Sun Sep 12 10:10:28 2004
>
> ./Mk822/platelets/GRANULES/Plt_LAMP2/Platelet_Lamp2/WTplt_SLPI_LAMP2.103_R3
>D.dv @ Sun Sep 12 10:10:28 2004
> recover: Requesting remaining 1169907 file(s) from NSR server
> `tapes.dfci.harvard.edu'
> NSR server `tapes.dfci.harvard.edu': busy
> recover: waiting 30 seconds then retrying
> NSR server `tapes.dfci.harvard.edu': Unable to send
> recover: waiting 30 seconds then retrying
>
> [ 4-hour gap with no more messages -- seems to have died; finally I just
> canceled  ]
>
> recover: RPC error: Cancelled
> Recover errors with 2 file(s)
> Recover completion time: Wed Sep 21 08:55:10 2005
> recover> volumes
>
> ===================ENTRIES FROM SERVER LOG
> FILE================================================
> On the networker server:
>
> Sep 21 01:11:50 tapes logger: NetWorker media: (waiting) waiting for 8mm
> AIT-3 tape Full.0517 (A00679) on tapes.dfci.harvard.edu
> Sep 21 01:11:51 tapes logger: NetWorker media: (waiting) waiting for 8mm
> AIT-3 tape Full.0517 (A00679) on tapes.dfci.harvard.edu
> Sep 21 01:14:16 tapes logger: NetWorker media: (info) loading volume
> Full.0517 into /dev/nst0
> .....
> Sep 21 03:42:30 tapes kernel: st0: Error with sense data: Info
> fld=0x30000, Current st09:00: sense key Hardware Error
> Sep 21 03:42:30 tapes kernel: Additional sense indicates Internal target
> failure
> Sep 21 03:42:31 tapes logger: NetWorker media: (warning) /dev/nst0
> reading: Input/output error
> Sep 21 03:42:31 tapes last message repeated 5 times
> Sep 21 03:42:31 tapes logger: NetWorker media: (notice) Volume
> "Full.0517" on device "/dev/nst0": Cannot decode block. Verify the
> device configuration. Tape positioning by record is disabled.
> *********** [ this only happens when the tape drive has magically reset
> itself to fixed-block size, 512 bytes ]
> Sep 21 03:42:31 tapes logger: NetWorker media: (info) can not read
> record 2972 of file 49 on 8mm AIT-3 tape Full.0517
> Sep 21 03:42:31 tapes logger: NetWorker media: (warning) /dev/nst0
> reading: Input/output error
> Sep 21 03:42:33 tapes last message repeated 49 times
> Sep 21 03:43:04 tapes logger: NetWorker media: (waiting) waiting for 8mm
> AIT-3 tape Full.0518 (A00680) on tapes.dfci.harvard.edu
> Sep 21 03:43:04 tapes logger: NetWorker media: (waiting) waiting for 8mm
> AIT-3 tape Full.0518 (A00680) on tapes.dfci.harvard.edu
> Sep 21 03:43:42 tapes logger: NetWorker media: (info) unload error for
> jukebox `jukebox1' detected.  Retrying
> -------
> *********** [ these messages were interposed -- but after 4:28, I didn't
> see any more "waiting" requests ]
>
> Sep 21 04:28:36 tapes logger: NetWorker media: (info) unload retry for
> jukebox `jukebox1' failed - will retry again.
> Sep 21 04:28:36 tapes logger: NetWorker media: (info) unload retry for
> jukebox `jukebox1': sleeping 30 seconds
> Sep 21 04:28:47 tapes logger: NetWorker media: (critical) waiting for
> 8mm AIT-3 tape Full.0518 (A00680) on tapes.dfci.harvard.edu
> Sep 21 04:28:47 tapes logger: NetWorker media: (critical) waiting for
> 8mm AIT-3 tape Full.0518 (A00680) on tapes.dfci.harvard.edu
> Sep 21 04:29:28 tapes logger: NetWorker media: (info) unload retry for
> jukebox `jukebox1' failed - will retry again.
> Sep 21 04:29:28 tapes logger: NetWorker media: (info) unload retry for
> jukebox `jukebox1': sleeping 30 seconds
> Sep 21 04:30:20 tapes logger: NetWorker media: (info) unload retry for
> jukebox `jukebox1' failed - will retry again.
>
>                         etc
>                         etc
>                         etc
>
> Sep 21 07:06:31 tapes logger: NetWorker media: (info) unload retry for
> jukebox `jukebox1': sleeping 30 seconds
> Sep 21 07:06:31 tapes logger: NetWorker media: (info) unload retry for
> jukebox `jukebox1' failed - will retry again.
> Sep 21 07:07:23 tapes logger: NetWorker media: (info) unload retry for
> jukebox `jukebox1' failed - time limit exceeded: aborting.
>
>
> ----
>
> The unload failed -- tape drive was hung in some way
> I powered down the drive and powered it back up.
> I ran scanner to verify the label.
> I command-mounted Full.0518, but the restore did not continue -- it just
> sat there.
> Now I'm restarting this huge restore of 100s of thousands of files.
>
> What did I miss?

-- 
Maarten Boot, 
Compuware Europe B.V.
Hoogoorddreef 5
1101 BA Amsterdam
Tel: +31 20 312 6511

To sign off this list, send email to listserv AT listserv.temple DOT edu and 
type "signoff networker" in the
body of the email. Please write to networker-request AT listserv.temple DOT edu 
if you have any problems
wit this list. You can access the archives at 
http://listserv.temple.edu/archives/networker.html or
via RSS at http://listserv.temple.edu/cgi-bin/wa?RSS&L=NETWORKER