Veritas-bu

RES: [Veritas-bu] NetBackup DataCenter and L100 tape library

2005-01-01 11:03:57
Subject: RES: [Veritas-bu] NetBackup DataCenter and L100 tape library
From: len.boyle AT sas DOT com (len boyle)
Date: Sat, 1 Jan 2005 11:03:57 -0500
Hello Kevin

I know nothing of the ATL, and that may be where your problem hides.

I would backstep and check things out at the lowest level.
Pick one tape that you can write over. To use in netbackup after this you 
may have to use bplabel to fix it up.

Remove netbackup from the problem. Do a manual load of the tape to one tape 
drive then use unix commands to test the tape.

mt -f /dev/.... status
use tar or dd to write data to the tape
mt -f /dev/... rewind
use tar or dd to read  data from the tape.

Make sure you use a fair amount of data as you want use enough of the tape 
so that you have to switch tracks on the tape. Also use a large block size.

If this works for the drive(s) then the drive hardware is ok.

If the above works then try to use robtest to load one tape and only one 
tape at a time. Check the robot to make sure that the tape is in the correct 
drive.
If all is ok, move the tape to the next drive.

I suspect that one of the two tests will fail.

Also it would be nice to see the output of the sgscan tape command.

Good luck and let use know how things are working.
len
----- Original Message ----- 
From: "kevin haritmonds" <kharitmonds AT gmail DOT com>
To: <veritas-bu AT mailman.eng.auburn DOT edu>
Sent: Saturday, January 01, 2005 3:14 AM
Subject: Re: RES: [Veritas-bu] NetBackup DataCenter and L100 tape library


> Thank you Tim and everyone else for your input. I'm sorry if this will
> be a long message but I need to give all of the details. I hope you
> don't mind.
>
> Tim: First, were the tapes frozen? You can check the tapes with either
> available_media or bpmedialist.
> Kevin: No, the tapes were not frozen:
> # /opt/openv/netbackup/bin/goodies/available_media
> NetBackup pool
> A00000  HCART2   TLD      0       25     -       -     -        DBBACKUP
> A00001  HCART2   TLD      0       26     -       -     -        DBBACKUP
>
> None pool
>
> Test1 pool
> A00002  HCART2   TLD      0       27     -       -     -        AVAILABLE
> A00003  HCART2   TLD      0       28     -       -     -        AVAILABLE
> # /opt/openv/netbackup/bin/admincmd/bpmedialist
> #
>
>
> Tim: Second, were the drives downed? You can check the drives with
> vmoprcmd -d (GUI's device monitor equivalent).
> Kevin: The drives were all up:
> # /opt/openv/volmgr/bin/vmoprcmd -d
>                                PENDING REQUESTS
>                                     <NONE>
>
>                                  DRIVE STATUS
> Drv Type   Control  User      Label  RecMID  ExtMID  Ready   Wr.Enbl. 
> ReqId
>  0 hcart2   TLD                -                     No       -        -
>  1 hcart2   TLD                -                     No       -        -
>  2 hcart2   TLD                -                     No       -        -
>  3 hcart2   TLD                -                     No       -        -
>
>                             ADDITIONAL DRIVE STATUS
> Drv DriveName            Shared    Assigned        Comment
>  0 HPUltrium2-SCSI0      No       -
>  1 HPUltrium2-SCSI1      No       -
>  2 HPUltrium2-SCSI2      No       -
>  3 HPUltrium2-SCSI3      No       -
> #
>
>
> Tim & Len: As someone else pointed out, the bptm log and syslog will
> be most useful in the initial troubleshooting. Check bptm if the tapes
> were frozen.  Check syslog if they were downed.
> Kevin: I'm not sure if I read it right, but it seems when it wanted to
> write the backup header to A00002 tape, an EOM encountered. After
> another media mounted (tape A00003), it tried to write to it but
> failed with an I/O error. I'm not sure what cause it but we used new
> LTO-2 tapes.
> # cat /usr/openv/netbackup/logs/bptm/log.010105
> <..cut..>
> 09:20:53.584 [4558] <2> select_media: getting new media id for retention 
> level 0
> 09:20:53.690 [4558] <2> vmdb_query_scratch_bypool2: server returned:
> 1 A00002 ------ 14 -------- <..cut..>
> 09:20:53.690 [4558] <2> db_byid: search for media id A00002
> 09:20:53.690 [4558] <2> db_put: write media id A00002, offset = 0
> 09:20:53.705 [4558] <2> select_media: selected media id A00002 for
> backup[0], baksrv(rl = 0) <----------
> 09:20:53.705 [4558] <2> mount_open_media: Waiting for mount of media
> id A00002 (copy 1) on server baksrv.
> 09:21:33.732 [4558] <2> io_open: SCSI RESERVE
> 09:21:33.773 [4558] <2> io_open: file
> /usr/openv/netbackup/db/media/tpreq/A00002 successfully opened
> 09:21:33.773 [4558] <2> write_backup: media id A00002 mounted on drive
> index 0, drivepath /dev/rmt/0cbn, drivename HPUltrium2-SCSI0, copy 1
> 09:21:33.773 [4558] <2> io_read_media_header: drive index 0, reading
> media header, buflen = 65536, buff = 0x23b010, copy 1
> 09:21:33.773 [4558] <2> io_ioctl: command (5)MTREW 1 from
> (bptm.c.6330) on drive index 0
> 09:21:33.840 [4558] <2> io_write_media_header: drive index 0, writing
> media header
> 09:21:33.840 [4558] <2> io_close: closing
> /usr/openv/netbackup/db/media/tpreq/A00002, from bptm.c.7311
> 09:21:33.849 [4558] <2> io_open: SCSI RESERVE
> 09:21:33.958 [4558] <2> io_open: file
> /usr/openv/netbackup/db/media/tpreq/A00002 successfully opened
> 09:21:33.958 [4558] <2> io_ioctl: command (5)MTREW 1 from
> (bptm.c.7315) on drive index 0
> 09:21:38.942 [4558] <2> io_ioctl: command (0)MTWEOF 1 from
> (bptm.c.7344) on drive index 0
> 09:21:40.751 [4558] <4> write_backup: begin writing backup id
> baksrv_1104582049, copy 1, fragment 1, to media id A00002 on drive
> index 0
> 09:21:40.751 [4558] <2> signal_parent: sending SIGUSR1 to bpbrm (pid = 
> 4555)
> 09:21:40.751 [4558] <2> write_data: twin_index: 0 active: 1
> dont_process: 0 wrote_backup_hdr: 0 finished_buff: 0 saved_cindex: 0
> 09:21:45.882 [4558] <2> write_data: absolute block position prior to
> writing backup header(s) is 2, copy 1
> 09:21:45.882 [4558] <2> io_write_back_header: drive index 0,
> baksrv_1104582049, file num = 1, mpx_headers = 0, copy 1
> 09:21:45.959 [4558] <2> write_data: completed writing backup header,
> start writing data when first buffer is available, copy 1
> 09:21:45.959 [4558] <2> write_data: first write, twin_index: 0 cindex:
> 0 dont_process: 1 wrote_backup_hdr: 1 finished_buff: 0
> 09:21:45.959 [4558] <2> write_data: received first buffer (32768
> bytes), begin writing data
> 09:21:45.960 [4558] <2> write_backup: write_data() returned,
> exit_status = 0, CINDEX = 0, TWIN_INDEX = 0, backup_status = -3
> 09:21:45.960 [4558] <2> signal_parent: sending SIGUSR1 to bpbrm (pid = 
> 4555)
> 09:21:45.960 [4558] <2> io_close: closing
> /usr/openv/netbackup/db/media/tpreq/A00002, from bptm.c.15502
> 09:21:47.574 [4558] <2> write_backup: EOM encountered writing backup
> header, entire image will be put on a new media
> 09:21:47.585 [4558] <2> write_backup: tpunmount'ing
> /usr/openv/netbackup/db/media/tpreq/A00002 after EOM
> 09:21:47.595 [4558] <2> TpUnmountWrapper: SCSI RELEASE
> 09:21:47.695 [4558] <2> add_to_vmhost_list: added baksrv to vmhost list
> 09:21:47.753 [4558] <2> select_media: getting new media id for retention 
> level 0
> 09:21:47.861 [4558] <2> vmdb_query_scratch_bypool2: server returned:
> 1 A00003 ------ 14 -------- <..cut..>
> 09:21:47.861 [4558] <2> db_byid: search for media id A00003
> 09:21:47.861 [4558] <2> db_put: write media id A00003, offset = 1
> 09:21:47.876 [4558] <2> select_media: selected media id A00003 for
> backup[0], baksrv(rl = 0) <----------
> 09:21:47.876 [4558] <2> mount_open_media: Waiting for mount of media
> id A00003 (copy 1) on server baksrv.
> 09:22:40.239 [4558] <2> io_open: SCSI RESERVE
> 09:22:40.280 [4558] <2> io_open: file
> /usr/openv/netbackup/db/media/tpreq/A00003 successfully opened
> 09:22:40.280 [4558] <2> write_backup: media id A00003 mounted on drive
> index 1, drivepath /dev/rmt/1cbn, drivename HPUltrium2-SCSI1, copy 1
> 09:22:40.280 [4558] <2> io_read_media_header: drive index 1, reading
> media header, buflen = 65536, buff = 0x23b010, copy 1
> 09:22:40.280 [4558] <2> io_ioctl: command (5)MTREW 1 from
> (bptm.c.6330) on drive index 1
> 09:22:40.348 [4558] <2> io_write_media_header: drive index 1, writing
> media header
> 09:22:40.348 [4558] <2> io_close: closing
> /usr/openv/netbackup/db/media/tpreq/A00003, from bptm.c.7311
> 09:22:40.358 [4558] <2> io_open: SCSI RESERVE
> 09:22:40.465 [4558] <2> io_open: file
> /usr/openv/netbackup/db/media/tpreq/A00003 successfully opened
> 09:22:40.465 [4558] <2> io_ioctl: command (5)MTREW 1 from
> (bptm.c.7315) on drive index 1
> 09:22:43.479 [4558] <2> io_ioctl: command (0)MTWEOF 1 from
> (bptm.c.7344) on drive index 1
> 09:22:47.276 [4558] <4> write_backup: begin writing backup id
> baksrv_1104582049, copy 1, fragment 1, to media id A00003 on drive
> index 1
> 09:22:47.276 [4558] <2> signal_parent: sending SIGUSR1 to bpbrm (pid = 
> 4555)
> 09:22:47.276 [4558] <2> write_data: twin_index: 0 active: 1
> dont_process: 0 wrote_backup_hdr: 0 finished_buff: 0 saved_cindex: 0
> 09:22:52.446 [4558] <2> write_data: absolute block position prior to
> writing backup header(s) is 2, copy 1
> 09:22:52.446 [4558] <2> io_write_back_header: drive index 1,
> baksrv_1104582049, file num = 1, mpx_headers = 0, copy 1
> 09:22:52.491 [4558] <2> write_data: completed writing backup header,
> start writing data when first buffer is available, copy 1
> 09:22:52.491 [4558] <2> write_data: first write, twin_index: 0 cindex:
> 0 dont_process: 1 wrote_backup_hdr: 1 finished_buff: 0
> 09:22:52.491 [4558] <2> write_data: received first buffer (32768
> bytes), begin writing data
> 09:22:52.493 [4558] <2> signal_parent: sending SIGUSR1 to bpbrm (pid = 
> 4555)
> 09:22:52.493 [4558] <2> write_data: attempting write error recovery, err = 
> 5
> 09:22:52.493 [4558] <2> tape_error_rec: error recovery to block 3 
> requested
> 09:22:52.493 [4558] <2> tape_error_rec: attempting error recovery,
> delay 3 minutes before next attempt, tries left = 5
> 09:25:50.267 [4604] <2> bptm: INITIATING (VERBOSE = 0): -count -cmd
> -rt 8 -rn 0 -stunit baksrv-hcart2-robot-tld-0 -den 14 -mt 2
> -masterversion 451000
> 09:25:50.268 [4604] <2> bptm: EXITING with status 0 <----------
> 09:25:52.490 [4558] <2> io_ioctl: command (0)MTWEOF 0 from
> (overwrite.c.192) on drive index 1
> 09:25:52.490 [4558] <2> io_ioctl: MTWEOF failed during error recovery, I/O 
> error
> 09:25:52.499 [4558] <2> tape_error_rec: SCSI RESERVE
> 09:25:52.516 [4558] <2> tape_error_rec: absolute block position after 
> error is 0
> 09:25:52.517 [4558] <2> set_job_details: Done
> 09:25:52.538 [4558] <16> write_data: cannot write image to media id
> A00003, drive index 1, I/O error
> 09:25:52.562 [4558] <2> log_media_error: successfully wrote to error
> file - 01/01/05 09:25:52 A00003 1 WRITE_ERROR
> 09:25:52.563 [4558] <2> check_error_history: called from bptm line
> 15215, EXIT_Status = 84
> 09:25:52.563 [4558] <2> check_error_history: drive index = 1, media id
> = A00003, time = 01/01/05 09:25:52, both_match = 0, media_match = 0,
> drive_match = 0
> 09:25:52.563 [4558] <2> io_close: closing
> /usr/openv/netbackup/db/media/tpreq/A00003, from bptm.c.13417
> 09:25:52.563 [4558] <2> tpunmount: tpunmount'ing
> /usr/openv/netbackup/db/media/tpreq/A00003
> 09:25:52.572 [4558] <2> TpUnmountWrapper: SCSI RELEASE
> 09:25:52.731 [4558] <2> bptm: EXITING with status 84 <----------
> <..cut..>
> #
> # cat /var/adm/messages
> <..cut..>
> Jan  1 09:22:52 baksrv scsi: [ID 107833 kern.warning] WARNING:
> /pci@8,700000/pci@2/scsi@4/st@6,0 (st20):
> Jan  1 09:22:52 baksrv  Error for Command: write
> Error Level: Fatal
> Jan  1 09:22:52 baksrv scsi: [ID 107833 kern.notice]    Requested
> Block: 1                         Error Block: 1
> Jan  1 09:22:52 baksrv scsi: [ID 107833 kern.notice]    Vendor: HP
>                            Serial Number:
> Jan  1 09:22:52 baksrv scsi: [ID 107833 kern.notice]    Sense Key:
> Aborted Command
> Jan  1 09:22:52 baksrv scsi: [ID 107833 kern.notice]    ASC: 0x47
> (scsi parity error), ASCQ: 0x0, FRU: 0x0
> Jan  1 09:22:52 baksrv scsi: [ID 107833 kern.warning] WARNING:
> /pci@8,700000/pci@2/scsi@4 (qus0):
> Jan  1 09:22:52 baksrv  Target synch. rate reduced. tgt 6 lun 0
>
>
> Tim: When you ran the Wizard, was NBU able to determine which drive
> (/dev/rmt/X) was associated with each robot drive number (Drive 1-4)?
> If not and you drag-and-dropped them into place, it's possible the
> drives are out of order and NBU is mounting the tapes in the "wrong"
> drives.
> Kevin: You are correct Tim, the NBU wasn't able to determine which
> drive was associated with each robot drive number. When we ran the
> "Configure Storage Devices" Wizard, the wizard detected that there are
> 4 tape drive(s) and 1 robot. But on the next wizard's page the 4
> drives have limitations:
> Device                  Host    State       Serialized  Limitations
> ------                  ----    -----       ----------  -----------
> HP Ultrium 2-SCSI F40Z  baksrv  Configured  Yes         Yes, see 
> limitations
> HP Ultrium 2-SCSI F40Z  baksrv  Configured  Yes         Yes, see 
> limitations
> HP Ultrium 2-SCSI F40Z  baksrv  Configured  Yes         Yes, see 
> limitations
> HP Ultrium 2-SCSI F45Z  baksrv  Configured  Yes         Yes, see 
> limitations
> ATL M2500 3.07          baksrv  Configured  No          None
> Limitations detail: "Unable to determine robot type. Drive is
> standalone or in unknown robot. Robot drive number is unknown."
>
> So on the next page, we have to drag and drop them into place. The
> strange thing was the robot only have 5 drive box (but in reality the
> L100 library has 6 maximum drives which we only have 4 drives out of
> 6):
> - [X] Robot 0 baksrv [ATL M2500 3.07]
>         [ ] Drive 1 (unused drive address)
>         [ ] Drive 2 (unused drive address)
>         [ ] Drive 3 (unused drive address)
>         [ ] Drive 4 (unused drive address)
>         [ ] Drive 5 (unused drive address)
> - Standalone Drives
>         [X] baksrv HPUltrium2-SCSI3 [HP Ultrium 2-SCSI F45Z]
>         [X] baksrv HPUltrium2-SCSI2 [HP Ultrium 2-SCSI F40Z]
>         [X] baksrv HPUltrium2-SCSI1 [HP Ultrium 2-SCSI F40Z]
>         [X] baksrv HPUltrium2-SCSI0 [HP Ultrium 2-SCSI F40Z]
>
> So to drag and drop it to the correct drive number, I take the steps
> described on Media Manager System Administrator Guide that uses the
> robotic test utility (robtest) and see the drive's status on Device
> Monitor.
>
>
> David: Make sure drive type is the same as the media
> Kevin: The drive and the media have the same type: "hcart2".
>
> Alex: Make sure the tapes doesn't already had something with different
> retention on them (from previous testing) while you have configure not
> to allow different retention to be written on the same media. Also,
> check the master properties -> media and make sure it "allow media
> overwrite".
> Kevin: The "Allow media overwrite" on server's Host Properties have
> all been checked. And we used new tapes.
>
> Is there any clue on what causing the problem?
>
> Best regards,
> Kevin Haritmonds
>
>
> On Fri, 31 Dec 2004 20:37:06 -0800, Alex Fong <alex.s.fong AT gmail DOT com> 
> wrote:
>> Make sure the tapes doesn't already had something with different
>> retention on them (from previous testing) while you have configure not
>> to allow different retention to be written on the same media. Also,
>> check the master properties -> media and make sure it "allow media
>> overwrite".
>>
>> Alex
>>
>> On Fri, 31 Dec 2004 21:04:18 -0600 (CST), Tim Hoke <thoke AT northpeak DOT 
>> org> 
>> wrote:
>> > If the drive types weren't the same as the media, then the mounting 
>> > never
>> > would have occurred and the jobs would have failed with a status 96.
>> >
>> > On Fri, 31 Dec 2004, David Trostli wrote:
>> >
>> > > Make sure drive type is the same as the media
>> > >
>> > > Regards,
>> > >
>> > > David
>> > >
>> > > -----Mensagem original-----
>> > > De: veritas-bu-admin AT mailman.eng.auburn DOT edu
>> > > [mailto:veritas-bu-admin AT mailman.eng.auburn DOT edu]Em nome de Tim 
>> > > Hoke
>> > > Enviada em: sexta-feira, 31 de dezembro de 2004 13:59
>> > > Para: kevin haritmonds
>> > > Cc: veritas-bu AT mailman.eng.auburn DOT edu
>> > > Assunto: Re: [Veritas-bu] NetBackup DataCenter and L100 tape library
>> > >
>> > >
>> > > Kevin,
>> > >
>> > > I've got a couple of questions for you.
>> > >
>> > > First, were the tapes frozen?
>> > > You can check the tapes with either available_media or bpmedialist.
>> > >
>> > > Second, were the drives downed?
>> > >
>> > > You can check the drives with vmoprcmd -d (GUI's device monitor
>> > > equivalent).
>> > >
>> > > As someone else pointed out, the bptm log and syslog will be most 
>> > > useful
>> > > in the initial troubleshooting.
>> > >
>> > > Check bptm if the tapes were frozen.  Check syslog if they were 
>> > > downed.
>> > >
>> > > When you ran the Wizard, was NBU able to determine which drive
>> > > (/dev/rmt/X) was associated with each robot drive number (Drive 1-4)? 
>> > > If
>> > > not and you drag-and-dropped them into place, it's possible the 
>> > > drives are
>> > > out of order and NBU is mounting the tapes in the "wrong" drives.
>> > >
>> > > Run /usr/openv/volmgr/bin/scan and see if the robot reports the drive
>> > > serial numbers.  If so, then NBU should have been able to figure out 
>> > > which
>> > > drive goes with with number.
>> > >
>> > > HTH
>> > > -Tim
>> > >
>> > > On Fri, 31 Dec 2004, kevin haritmonds wrote:
>> > >
>> > > > Hi, I'm facing a problem in Veritas NetBackup DataCenter 4.5 
>> > > > server,
>> > > > installed on Solaris 9 platform (using SunBlade 2000 machine),
>> > > > connected to Sun StorEdge L100 Tape Library with 4 drives HP LTO-2 
>> > > > and
>> > > > 96 tape slots. This is the first time we want to use the L100 
>> > > > library.
>> > > > After we installed NetBackup DataCenter 4.5 from CD, we did the 
>> > > > setup
>> > > > using Wizards: Configure Storage Devices, Configure Volumes, 
>> > > > Configure
>> > > > Backup Catalog, and Create a simple Backup Policy "test1" which 
>> > > > only
>> > > > backup server's /etc/ directory (which is localhost) to volume pool
>> > > > "Test1". We assigned two tapes (A00002 and A00003) to volume pool
>> > > > "Test1". Every time we run the policy manually, it always failed 
>> > > > with
>> > > > following detailed status:
>> > > > <..cut..>
>> > > > It looks like the drive wants to write to the tape, but suddenly 
>> > > > the
>> > > > tape was ejected a few seconds later and another tape mounted until
>> > > > all media on the pool consumed. This happens every time. FYI the 
>> > > > /etc/
>> > > > directory's size is only 7.5 MB. We are using LTO-2 200/400GB 
>> > > > tapes.
>> > > > We haven't done any backup using the L100 library.
>> > > >
>> > > > Here's the output of "tpconfig -dl":
>> > > > # /opt/openv/volmgr/bin/tpconfig -dl
>> > > > Currently defined drives and robots are:
>> > > > <..cut..>
>> > > > I have upgraded the software from NetBackup DataCenter 4.5FP_3GA to
>> > > > 4.5FP_6, but still no luck. Can anyone help me out? Thank you, any
>> > > > help would be very much appreciated.
>> > > >
>> > > > Best regards,
>> > > > Kevin Haritmonds
> _______________________________________________
> Veritas-bu maillist  -  Veritas-bu AT mailman.eng.auburn DOT edu
> http://mailman.eng.auburn.edu/mailman/listinfo/veritas-bu
>