Veritas-bu

[Veritas-bu] NB4.5 Datacenter bad? tape problem + L700 cleaning

2003-05-25 09:06:51
Subject: [Veritas-bu] NB4.5 Datacenter bad? tape problem + L700 cleaning
From: Len.Boyle AT sas DOT com (Len Boyle)
Date: Sun, 25 May 2003 09:06:51 -0400
Good Morning Rafal
 
The first thing that you have to do is find the reason for the error:
08:16:21.672 [1443] <2> write_data: absolute block position prior to writing 
backup header(s) is 389156, copy 1
08:16:21.672 [1443] <2> write_data: block position check: actual 389156, 
expected 504267
08:16:21.695 [1443] <16> write_data: FREEZING media id 000140, External event 
caused rewind during write, all data on media is lost

Some other process or program that has access to the tape drive has moved 
and/or writen to the tape. 
So you see the block number showing the location on the tape with a smaller 
value then the value expected. When writting they should only increase. 
 
Are you sharing the tape drives using netbackup's sso option. A problem with 
the sso setup, is one method of seeing this problem.  Just think of a media 
server A writting on a tape and media server B issuing a rewind. After the 
rewind, the next write by media server A  to the tape will overwrite the tape 
label. 
 
Let us know what you find.
 
len 
 
-----Original Message----- 
From: rafal wiosna [mailto:rafamiga AT uucp.polbox DOT pl] 
Sent: Sun 5/25/2003 5:37 AM 
To: veritas-bu AT mailman.eng.auburn DOT edu 
Cc: 
Subject: [Veritas-bu] NB4.5 Datacenter bad? tape problem + L700 cleaning




                This morning I've discovered a third frozen tape in my L700. 
That's
        strange, the library sits in a comfortable place, nobody is touching, it
        works on NB autopilot. Bad tapes?...
        
                This FROZEN tape I got I can read the images from. When greping 
logs
        I got this:
        
        08:16:21.665 [1443] <4> write_backup: begin writing backup id 
tdcf03_1053839725, copy 1, fragment 2, to media id 000140 on drive index 2
        08:16:21.666 [1443] <2> signal_parent: sending SIGUSR1 to bpbrm (pid = 
1439)
        08:16:21.666 [1443] <2> write_data: twin_index: 0 active: 1 
dont_process: 0 wrote_backup_hdr: 0 finished_buff: 0 saved_cindex: 0
        08:16:21.672 [1443] <2> write_data: absolute block position prior to 
writing backup header(s) is 389156, copy 1
        08:16:21.672 [1443] <2> write_data: block position check: actual 
389156, expected 504267
        08:16:21.695 [1443] <16> write_data: FREEZING media id 000140, External 
event caused rewind during write, all data on media is lost
        08:16:21.731 [1443] <2> log_media_error: successfully wrote to error 
file - 05/25/03 08:16:21 000140 2 WRITE_ERROR
        08:16:21.743 [1443] <2> check_error_history: called from bptm line 
14685, EXIT_Status = 84
        08:16:21.752 [1443] <2> check_error_history: drive index = 2, media id 
= 000140, time = 05/25/03 08:16:21, both_match = 0, media_match = 0, 
drive_match = 0
        08:16:21.752 [1443] <2> io_close: closing 
/usr/openv/netbackup/db/media/tpreq/000140, from bptm.c.12914
        08:16:21.753 [1443] <2> tpunmount: tpunmount'ing 
/usr/openv/netbackup/db/media/tpreq/000140
        08:16:21.754 [1443] <2> TpUnmountWrapper: SCSI RELEASE
        
                What external event at 8am on sunday?!? Good thing I belive I 
can
        unfreeze the tape and get some contents out of it anyway. I will 
unfreeze
        the tape and see what happens.
        
                But with another tape [frozen earlier] the matter is worse. 
When I
        try to read the catalog off the media I get:
        
        10:31:34.879 [9424] <2> io_open: SCSI RESERVE
        10:31:34.936 [9424] <2> io_open: file 
/usr/openv/netbackup/db/media/tpreq/000028 successfully opened
        10:31:34.937 [9424] <2> load_media: media id 000028 mounted on drive 
index 1
        10:31:34.965 [9424] <2> io_read_media_header: drive index 1, reading 
media header, buflen = 32768, buff = 0x22aaa0, copy 1
        10:31:35.222 [9424] <2> io_ioctl: command (5)MTREW 1 from (bptm.c.6182) 
on drive index 1
        10:31:46.129 [9424] <16> io_read_media_header: tapemark or blank tape 
encountered reading media header, drive index 1
        10:31:46.474 [9424] <2> check_error_history: called from bptm line 
2724, EXIT_Status = 85
        10:31:46.483 [9424] <2> check_error_history: drive index = 1, media id 
= 000028, time = 05/25/03 10:31:46, both_match = 0, media_match = 0, 
drive_match = 0
        10:31:46.483 [9424] <2> io_close: closing 
/usr/openv/netbackup/db/media/tpreq/000028, from bptm.c.12914
        10:31:46.483 [9424] <2> tpunmount: tpunmount'ing 
/usr/openv/netbackup/db/media/tpreq/000028
        10:31:46.485 [9424] <2> TpUnmountWrapper: SCSI RELEASE
        10:31:46.544 [9424] <2> bptm: EXITING with status 85 <----------
        
                Good thing it consists of 8 GB [out of 100 GB] of images 
expiring
        June 2nd so it's not so relevant to the data consistency in my backup
        library. I can live without it. But the question is this: how can I 
remove
        this tape from the library and re-use it? With the previous tape I had
        problem with I tried to expire the tape but in the frozen state it was
        impossible [or was I doint something in wrong order]. Only when I got 
the
        replacement tape [since it was first mount that caused the troubles I
        managed to replace the tape for free] I was able to re-claim the tape 
as a
        clean one, with no contents.
        
                Any suggestions what should I do now? This tape is like 3 months
        old. I know, this is magnetic media and problems can happens so is 
there any
        way to check the tape [don't care about erasing the contents] and/or 
re-use
        it?
        
                Also for cleaning, I belive my Storagetek L700 is managing the
        cleaning process. It has autoclean feature on. Okay, but the cleaning 
tapes
        were never used for 8 months. I wish I can clean the drives once a 3-4
        months JUST IN CASE to prevent an error and not wait for the first 
error to
        cause cleaning. The errorcount on either of the 3 drives is 0 -- and 
that's
        probably why the cleaning never happened. And I wish the L700 won't 
place
        cleaning in the place NB cannot reach them [it does not see it in the
        robot].
        
        --
        
__________________________________________________________________________
        rafal wiosna * Å?wiat Internet S.A. * Polbox * In ARP we trust * AR164 
V6TB
        RAFD-RIPE * Dom som surfar pĺ nättet, fĺr leka med musen * PGPID: 
3CDCB7A9
        _______________________________________________
        Veritas-bu maillist  -  Veritas-bu AT mailman.eng.auburn DOT edu
        http://mailman.eng.auburn.edu/mailman/listinfo/veritas-bu