Veritas-bu

RES: [Veritas-bu] NetBackup DataCenter and L100 tape library

2005-01-01 03:14:18
Subject: RES: [Veritas-bu] NetBackup DataCenter and L100 tape library
From: kevin haritmonds <kharitmonds AT gmail DOT com> (kevin haritmonds)
Date: Sat, 1 Jan 2005 15:14:18 +0700
Thank you Tim and everyone else for your input. I'm sorry if this will
be a long message but I need to give all of the details. I hope you
don't mind.

Tim: First, were the tapes frozen? You can check the tapes with either
available_media or bpmedialist.
Kevin: No, the tapes were not frozen:
# /opt/openv/netbackup/bin/goodies/available_media
NetBackup pool
A00000  HCART2   TLD      0       25     -       -     -        DBBACKUP
A00001  HCART2   TLD      0       26     -       -     -        DBBACKUP

None pool

Test1 pool
A00002  HCART2   TLD      0       27     -       -     -        AVAILABLE
A00003  HCART2   TLD      0       28     -       -     -        AVAILABLE
# /opt/openv/netbackup/bin/admincmd/bpmedialist
# 


Tim: Second, were the drives downed? You can check the drives with
vmoprcmd -d (GUI's device monitor equivalent).
Kevin: The drives were all up:
# /opt/openv/volmgr/bin/vmoprcmd -d
                                PENDING REQUESTS
                                     <NONE>

                                  DRIVE STATUS
Drv Type   Control  User      Label  RecMID  ExtMID  Ready   Wr.Enbl.  ReqId
  0 hcart2   TLD                -                     No       -        -   
  1 hcart2   TLD                -                     No       -        -   
  2 hcart2   TLD                -                     No       -        -   
  3 hcart2   TLD                -                     No       -        -   

                             ADDITIONAL DRIVE STATUS
Drv DriveName            Shared    Assigned        Comment                   
  0 HPUltrium2-SCSI0      No       -                                         
  1 HPUltrium2-SCSI1      No       -                                         
  2 HPUltrium2-SCSI2      No       -                                         
  3 HPUltrium2-SCSI3      No       -                                         
# 


Tim & Len: As someone else pointed out, the bptm log and syslog will
be most useful in the initial troubleshooting. Check bptm if the tapes
were frozen.  Check syslog if they were downed.
Kevin: I'm not sure if I read it right, but it seems when it wanted to
write the backup header to A00002 tape, an EOM encountered. After
another media mounted (tape A00003), it tried to write to it but
failed with an I/O error. I'm not sure what cause it but we used new
LTO-2 tapes.
# cat /usr/openv/netbackup/logs/bptm/log.010105
<..cut..>
09:20:53.584 [4558] <2> select_media: getting new media id for retention level 0
09:20:53.690 [4558] <2> vmdb_query_scratch_bypool2: server returned: 
1 A00002 ------ 14 -------- <..cut..>
09:20:53.690 [4558] <2> db_byid: search for media id A00002
09:20:53.690 [4558] <2> db_put: write media id A00002, offset = 0
09:20:53.705 [4558] <2> select_media: selected media id A00002 for
backup[0], baksrv(rl = 0) <----------
09:20:53.705 [4558] <2> mount_open_media: Waiting for mount of media
id A00002 (copy 1) on server baksrv.
09:21:33.732 [4558] <2> io_open: SCSI RESERVE
09:21:33.773 [4558] <2> io_open: file
/usr/openv/netbackup/db/media/tpreq/A00002 successfully opened
09:21:33.773 [4558] <2> write_backup: media id A00002 mounted on drive
index 0, drivepath /dev/rmt/0cbn, drivename HPUltrium2-SCSI0, copy 1
09:21:33.773 [4558] <2> io_read_media_header: drive index 0, reading
media header, buflen = 65536, buff = 0x23b010, copy 1
09:21:33.773 [4558] <2> io_ioctl: command (5)MTREW 1 from
(bptm.c.6330) on drive index 0
09:21:33.840 [4558] <2> io_write_media_header: drive index 0, writing
media header
09:21:33.840 [4558] <2> io_close: closing
/usr/openv/netbackup/db/media/tpreq/A00002, from bptm.c.7311
09:21:33.849 [4558] <2> io_open: SCSI RESERVE
09:21:33.958 [4558] <2> io_open: file
/usr/openv/netbackup/db/media/tpreq/A00002 successfully opened
09:21:33.958 [4558] <2> io_ioctl: command (5)MTREW 1 from
(bptm.c.7315) on drive index 0
09:21:38.942 [4558] <2> io_ioctl: command (0)MTWEOF 1 from
(bptm.c.7344) on drive index 0
09:21:40.751 [4558] <4> write_backup: begin writing backup id
baksrv_1104582049, copy 1, fragment 1, to media id A00002 on drive
index 0
09:21:40.751 [4558] <2> signal_parent: sending SIGUSR1 to bpbrm (pid = 4555)
09:21:40.751 [4558] <2> write_data: twin_index: 0 active: 1
dont_process: 0 wrote_backup_hdr: 0 finished_buff: 0 saved_cindex: 0
09:21:45.882 [4558] <2> write_data: absolute block position prior to
writing backup header(s) is 2, copy 1
09:21:45.882 [4558] <2> io_write_back_header: drive index 0,
baksrv_1104582049, file num = 1, mpx_headers = 0, copy 1
09:21:45.959 [4558] <2> write_data: completed writing backup header,
start writing data when first buffer is available, copy 1
09:21:45.959 [4558] <2> write_data: first write, twin_index: 0 cindex:
0 dont_process: 1 wrote_backup_hdr: 1 finished_buff: 0
09:21:45.959 [4558] <2> write_data: received first buffer (32768
bytes), begin writing data
09:21:45.960 [4558] <2> write_backup: write_data() returned,
exit_status = 0, CINDEX = 0, TWIN_INDEX = 0, backup_status = -3
09:21:45.960 [4558] <2> signal_parent: sending SIGUSR1 to bpbrm (pid = 4555)
09:21:45.960 [4558] <2> io_close: closing
/usr/openv/netbackup/db/media/tpreq/A00002, from bptm.c.15502
09:21:47.574 [4558] <2> write_backup: EOM encountered writing backup
header, entire image will be put on a new media
09:21:47.585 [4558] <2> write_backup: tpunmount'ing
/usr/openv/netbackup/db/media/tpreq/A00002 after EOM
09:21:47.595 [4558] <2> TpUnmountWrapper: SCSI RELEASE
09:21:47.695 [4558] <2> add_to_vmhost_list: added baksrv to vmhost list
09:21:47.753 [4558] <2> select_media: getting new media id for retention level 0
09:21:47.861 [4558] <2> vmdb_query_scratch_bypool2: server returned: 
1 A00003 ------ 14 -------- <..cut..>
09:21:47.861 [4558] <2> db_byid: search for media id A00003
09:21:47.861 [4558] <2> db_put: write media id A00003, offset = 1
09:21:47.876 [4558] <2> select_media: selected media id A00003 for
backup[0], baksrv(rl = 0) <----------
09:21:47.876 [4558] <2> mount_open_media: Waiting for mount of media
id A00003 (copy 1) on server baksrv.
09:22:40.239 [4558] <2> io_open: SCSI RESERVE
09:22:40.280 [4558] <2> io_open: file
/usr/openv/netbackup/db/media/tpreq/A00003 successfully opened
09:22:40.280 [4558] <2> write_backup: media id A00003 mounted on drive
index 1, drivepath /dev/rmt/1cbn, drivename HPUltrium2-SCSI1, copy 1
09:22:40.280 [4558] <2> io_read_media_header: drive index 1, reading
media header, buflen = 65536, buff = 0x23b010, copy 1
09:22:40.280 [4558] <2> io_ioctl: command (5)MTREW 1 from
(bptm.c.6330) on drive index 1
09:22:40.348 [4558] <2> io_write_media_header: drive index 1, writing
media header
09:22:40.348 [4558] <2> io_close: closing
/usr/openv/netbackup/db/media/tpreq/A00003, from bptm.c.7311
09:22:40.358 [4558] <2> io_open: SCSI RESERVE
09:22:40.465 [4558] <2> io_open: file
/usr/openv/netbackup/db/media/tpreq/A00003 successfully opened
09:22:40.465 [4558] <2> io_ioctl: command (5)MTREW 1 from
(bptm.c.7315) on drive index 1
09:22:43.479 [4558] <2> io_ioctl: command (0)MTWEOF 1 from
(bptm.c.7344) on drive index 1
09:22:47.276 [4558] <4> write_backup: begin writing backup id
baksrv_1104582049, copy 1, fragment 1, to media id A00003 on drive
index 1
09:22:47.276 [4558] <2> signal_parent: sending SIGUSR1 to bpbrm (pid = 4555)
09:22:47.276 [4558] <2> write_data: twin_index: 0 active: 1
dont_process: 0 wrote_backup_hdr: 0 finished_buff: 0 saved_cindex: 0
09:22:52.446 [4558] <2> write_data: absolute block position prior to
writing backup header(s) is 2, copy 1
09:22:52.446 [4558] <2> io_write_back_header: drive index 1,
baksrv_1104582049, file num = 1, mpx_headers = 0, copy 1
09:22:52.491 [4558] <2> write_data: completed writing backup header,
start writing data when first buffer is available, copy 1
09:22:52.491 [4558] <2> write_data: first write, twin_index: 0 cindex:
0 dont_process: 1 wrote_backup_hdr: 1 finished_buff: 0
09:22:52.491 [4558] <2> write_data: received first buffer (32768
bytes), begin writing data
09:22:52.493 [4558] <2> signal_parent: sending SIGUSR1 to bpbrm (pid = 4555)
09:22:52.493 [4558] <2> write_data: attempting write error recovery, err = 5
09:22:52.493 [4558] <2> tape_error_rec: error recovery to block 3 requested
09:22:52.493 [4558] <2> tape_error_rec: attempting error recovery,
delay 3 minutes before next attempt, tries left = 5
09:25:50.267 [4604] <2> bptm: INITIATING (VERBOSE = 0): -count -cmd
-rt 8 -rn 0 -stunit baksrv-hcart2-robot-tld-0 -den 14 -mt 2
-masterversion 451000
09:25:50.268 [4604] <2> bptm: EXITING with status 0 <----------
09:25:52.490 [4558] <2> io_ioctl: command (0)MTWEOF 0 from
(overwrite.c.192) on drive index 1
09:25:52.490 [4558] <2> io_ioctl: MTWEOF failed during error recovery, I/O error
09:25:52.499 [4558] <2> tape_error_rec: SCSI RESERVE
09:25:52.516 [4558] <2> tape_error_rec: absolute block position after error is 0
09:25:52.517 [4558] <2> set_job_details: Done 
09:25:52.538 [4558] <16> write_data: cannot write image to media id
A00003, drive index 1, I/O error
09:25:52.562 [4558] <2> log_media_error: successfully wrote to error
file - 01/01/05 09:25:52 A00003 1 WRITE_ERROR
09:25:52.563 [4558] <2> check_error_history: called from bptm line
15215, EXIT_Status = 84
09:25:52.563 [4558] <2> check_error_history: drive index = 1, media id
= A00003, time = 01/01/05 09:25:52, both_match = 0, media_match = 0,
drive_match = 0
09:25:52.563 [4558] <2> io_close: closing
/usr/openv/netbackup/db/media/tpreq/A00003, from bptm.c.13417
09:25:52.563 [4558] <2> tpunmount: tpunmount'ing
/usr/openv/netbackup/db/media/tpreq/A00003
09:25:52.572 [4558] <2> TpUnmountWrapper: SCSI RELEASE
09:25:52.731 [4558] <2> bptm: EXITING with status 84 <----------
<..cut..>
#
# cat /var/adm/messages
<..cut..>
Jan  1 09:22:52 baksrv scsi: [ID 107833 kern.warning] WARNING:
/pci@8,700000/pci@2/scsi@4/st@6,0 (st20):
Jan  1 09:22:52 baksrv  Error for Command: write                  
Error Level: Fatal
Jan  1 09:22:52 baksrv scsi: [ID 107833 kern.notice]    Requested
Block: 1                         Error Block: 1
Jan  1 09:22:52 baksrv scsi: [ID 107833 kern.notice]    Vendor: HP    
                            Serial Number:
Jan  1 09:22:52 baksrv scsi: [ID 107833 kern.notice]    Sense Key:
Aborted Command
Jan  1 09:22:52 baksrv scsi: [ID 107833 kern.notice]    ASC: 0x47
(scsi parity error), ASCQ: 0x0, FRU: 0x0
Jan  1 09:22:52 baksrv scsi: [ID 107833 kern.warning] WARNING:
/pci@8,700000/pci@2/scsi@4 (qus0):
Jan  1 09:22:52 baksrv  Target synch. rate reduced. tgt 6 lun 0


Tim: When you ran the Wizard, was NBU able to determine which drive
(/dev/rmt/X) was associated with each robot drive number (Drive 1-4)? 
If not and you drag-and-dropped them into place, it's possible the
drives are out of order and NBU is mounting the tapes in the "wrong"
drives.
Kevin: You are correct Tim, the NBU wasn't able to determine which
drive was associated with each robot drive number. When we ran the
"Configure Storage Devices" Wizard, the wizard detected that there are
4 tape drive(s) and 1 robot. But on the next wizard's page the 4
drives have limitations:
Device                  Host    State       Serialized  Limitations
------                  ----    -----       ----------  -----------
HP Ultrium 2-SCSI F40Z  baksrv  Configured  Yes         Yes, see limitations
HP Ultrium 2-SCSI F40Z  baksrv  Configured  Yes         Yes, see limitations
HP Ultrium 2-SCSI F40Z  baksrv  Configured  Yes         Yes, see limitations
HP Ultrium 2-SCSI F45Z  baksrv  Configured  Yes         Yes, see limitations
ATL M2500 3.07          baksrv  Configured  No          None
Limitations detail: "Unable to determine robot type. Drive is
standalone or in unknown robot. Robot drive number is unknown."

So on the next page, we have to drag and drop them into place. The
strange thing was the robot only have 5 drive box (but in reality the
L100 library has 6 maximum drives which we only have 4 drives out of
6):
- [X] Robot 0 baksrv [ATL M2500 3.07]
         [ ] Drive 1 (unused drive address)
         [ ] Drive 2 (unused drive address)
         [ ] Drive 3 (unused drive address)
         [ ] Drive 4 (unused drive address)
         [ ] Drive 5 (unused drive address)
- Standalone Drives
         [X] baksrv HPUltrium2-SCSI3 [HP Ultrium 2-SCSI F45Z]
         [X] baksrv HPUltrium2-SCSI2 [HP Ultrium 2-SCSI F40Z]
         [X] baksrv HPUltrium2-SCSI1 [HP Ultrium 2-SCSI F40Z]
         [X] baksrv HPUltrium2-SCSI0 [HP Ultrium 2-SCSI F40Z]

So to drag and drop it to the correct drive number, I take the steps
described on Media Manager System Administrator Guide that uses the
robotic test utility (robtest) and see the drive's status on Device
Monitor.


David: Make sure drive type is the same as the media
Kevin: The drive and the media have the same type: "hcart2".

Alex: Make sure the tapes doesn't already had something with different
retention on them (from previous testing) while you have configure not
to allow different retention to be written on the same media. Also,
check the master properties -> media and make sure it "allow media
overwrite".
Kevin: The "Allow media overwrite" on server's Host Properties have
all been checked. And we used new tapes.

Is there any clue on what causing the problem?

Best regards,
Kevin Haritmonds


On Fri, 31 Dec 2004 20:37:06 -0800, Alex Fong <alex.s.fong AT gmail DOT com> 
wrote:
> Make sure the tapes doesn't already had something with different
> retention on them (from previous testing) while you have configure not
> to allow different retention to be written on the same media. Also,
> check the master properties -> media and make sure it "allow media
> overwrite".
> 
> Alex
> 
> On Fri, 31 Dec 2004 21:04:18 -0600 (CST), Tim Hoke <thoke AT northpeak DOT 
> org> wrote:
> > If the drive types weren't the same as the media, then the mounting never
> > would have occurred and the jobs would have failed with a status 96.
> >
> > On Fri, 31 Dec 2004, David Trostli wrote:
> >
> > > Make sure drive type is the same as the media
> > >
> > > Regards,
> > >
> > > David
> > >
> > > -----Mensagem original-----
> > > De: veritas-bu-admin AT mailman.eng.auburn DOT edu
> > > [mailto:veritas-bu-admin AT mailman.eng.auburn DOT edu]Em nome de Tim Hoke
> > > Enviada em: sexta-feira, 31 de dezembro de 2004 13:59
> > > Para: kevin haritmonds
> > > Cc: veritas-bu AT mailman.eng.auburn DOT edu
> > > Assunto: Re: [Veritas-bu] NetBackup DataCenter and L100 tape library
> > >
> > >
> > > Kevin,
> > >
> > > I've got a couple of questions for you.
> > >
> > > First, were the tapes frozen?
> > > You can check the tapes with either available_media or bpmedialist.
> > >
> > > Second, were the drives downed?
> > >
> > > You can check the drives with vmoprcmd -d (GUI's device monitor
> > > equivalent).
> > >
> > > As someone else pointed out, the bptm log and syslog will be most useful
> > > in the initial troubleshooting.
> > >
> > > Check bptm if the tapes were frozen.  Check syslog if they were downed.
> > >
> > > When you ran the Wizard, was NBU able to determine which drive
> > > (/dev/rmt/X) was associated with each robot drive number (Drive 1-4)?  If
> > > not and you drag-and-dropped them into place, it's possible the drives are
> > > out of order and NBU is mounting the tapes in the "wrong" drives.
> > >
> > > Run /usr/openv/volmgr/bin/scan and see if the robot reports the drive
> > > serial numbers.  If so, then NBU should have been able to figure out which
> > > drive goes with with number.
> > >
> > > HTH
> > > -Tim
> > >
> > > On Fri, 31 Dec 2004, kevin haritmonds wrote:
> > >
> > > > Hi, I'm facing a problem in Veritas NetBackup DataCenter 4.5 server,
> > > > installed on Solaris 9 platform (using SunBlade 2000 machine),
> > > > connected to Sun StorEdge L100 Tape Library with 4 drives HP LTO-2 and
> > > > 96 tape slots. This is the first time we want to use the L100 library.
> > > > After we installed NetBackup DataCenter 4.5 from CD, we did the setup
> > > > using Wizards: Configure Storage Devices, Configure Volumes, Configure
> > > > Backup Catalog, and Create a simple Backup Policy "test1" which only
> > > > backup server's /etc/ directory (which is localhost) to volume pool
> > > > "Test1". We assigned two tapes (A00002 and A00003) to volume pool
> > > > "Test1". Every time we run the policy manually, it always failed with
> > > > following detailed status:
> > > > <..cut..>
> > > > It looks like the drive wants to write to the tape, but suddenly the
> > > > tape was ejected a few seconds later and another tape mounted until
> > > > all media on the pool consumed. This happens every time. FYI the /etc/
> > > > directory's size is only 7.5 MB. We are using LTO-2 200/400GB tapes.
> > > > We haven't done any backup using the L100 library.
> > > >
> > > > Here's the output of "tpconfig -dl":
> > > > # /opt/openv/volmgr/bin/tpconfig -dl
> > > > Currently defined drives and robots are:
> > > > <..cut..>
> > > > I have upgraded the software from NetBackup DataCenter 4.5FP_3GA to
> > > > 4.5FP_6, but still no luck. Can anyone help me out? Thank you, any
> > > > help would be very much appreciated.
> > > >
> > > > Best regards,
> > > > Kevin Haritmonds