Veritas-bu

RES: [Veritas-bu] NetBackup DataCenter and L100 tape library

2005-01-01 09:02:03
Subject: RES: [Veritas-bu] NetBackup DataCenter and L100 tape library
From: thoke AT northpeak DOT org (Tim Hoke)
Date: Sat, 1 Jan 2005 08:02:03 -0600 (CST)
Kevin,

Ok, you've rulled out the obvious items... and the bptm/syslog help a lot.

I think you've either got bad tapes or drives.  I find it interesting that 
A00002 is mounted and get's a media header written to it, it looks like 
NBU is going to write data, then gets an "End of Media" (EOM).  It then 
mounts A00003, seems to begin working again, then gets a media write 
error... This time it correlates to the hardware error in syslog.

Better involve the hardware vendor to check out those hardware errors. 

Make sure your firmware is up-to-date and that your SCSI drivers are 
up-to-date too.

HTH
-Tim

On Sat, 1 Jan 2005, kevin haritmonds wrote:

> Thank you Tim and everyone else for your input. I'm sorry if this will
> be a long message but I need to give all of the details. I hope you
> don't mind.
> 
> Tim: First, were the tapes frozen? You can check the tapes with either
> available_media or bpmedialist.
> Kevin: No, the tapes were not frozen:
> # /opt/openv/netbackup/bin/goodies/available_media
> NetBackup pool
> A00000  HCART2   TLD      0       25     -       -     -        DBBACKUP
> A00001  HCART2   TLD      0       26     -       -     -        DBBACKUP
> 
> None pool
> 
> Test1 pool
> A00002  HCART2   TLD      0       27     -       -     -        AVAILABLE
> A00003  HCART2   TLD      0       28     -       -     -        AVAILABLE
> # /opt/openv/netbackup/bin/admincmd/bpmedialist
> # 
> 
> 
> Tim: Second, were the drives downed? You can check the drives with
> vmoprcmd -d (GUI's device monitor equivalent).
> Kevin: The drives were all up:
> # /opt/openv/volmgr/bin/vmoprcmd -d
>                                 PENDING REQUESTS
>                                      <NONE>
> 
>                                   DRIVE STATUS
> Drv Type   Control  User      Label  RecMID  ExtMID  Ready   Wr.Enbl.  ReqId
>   0 hcart2   TLD                -                     No       -        -   
>   1 hcart2   TLD                -                     No       -        -   
>   2 hcart2   TLD                -                     No       -        -   
>   3 hcart2   TLD                -                     No       -        -   
> 
>                              ADDITIONAL DRIVE STATUS
> Drv DriveName            Shared    Assigned        Comment                   
>   0 HPUltrium2-SCSI0      No       -                                         
>   1 HPUltrium2-SCSI1      No       -                                         
>   2 HPUltrium2-SCSI2      No       -                                         
>   3 HPUltrium2-SCSI3      No       -                                         
> # 
> 
> 
> Tim & Len: As someone else pointed out, the bptm log and syslog will
> be most useful in the initial troubleshooting. Check bptm if the tapes
> were frozen.  Check syslog if they were downed.
> Kevin: I'm not sure if I read it right, but it seems when it wanted to
> write the backup header to A00002 tape, an EOM encountered. After
> another media mounted (tape A00003), it tried to write to it but
> failed with an I/O error. I'm not sure what cause it but we used new
> LTO-2 tapes.
> # cat /usr/openv/netbackup/logs/bptm/log.010105
> <..cut..>
> 09:20:53.584 [4558] <2> select_media: getting new media id for retention 
> level 0
> 09:20:53.690 [4558] <2> vmdb_query_scratch_bypool2: server returned: 
> 1 A00002 ------ 14 -------- <..cut..>
> 09:20:53.690 [4558] <2> db_byid: search for media id A00002
> 09:20:53.690 [4558] <2> db_put: write media id A00002, offset = 0
> 09:20:53.705 [4558] <2> select_media: selected media id A00002 for
> backup[0], baksrv(rl = 0) <----------
> 09:20:53.705 [4558] <2> mount_open_media: Waiting for mount of media
> id A00002 (copy 1) on server baksrv.
> 09:21:33.732 [4558] <2> io_open: SCSI RESERVE
> 09:21:33.773 [4558] <2> io_open: file
> /usr/openv/netbackup/db/media/tpreq/A00002 successfully opened
> 09:21:33.773 [4558] <2> write_backup: media id A00002 mounted on drive
> index 0, drivepath /dev/rmt/0cbn, drivename HPUltrium2-SCSI0, copy 1
> 09:21:33.773 [4558] <2> io_read_media_header: drive index 0, reading
> media header, buflen = 65536, buff = 0x23b010, copy 1
> 09:21:33.773 [4558] <2> io_ioctl: command (5)MTREW 1 from
> (bptm.c.6330) on drive index 0
> 09:21:33.840 [4558] <2> io_write_media_header: drive index 0, writing
> media header
> 09:21:33.840 [4558] <2> io_close: closing
> /usr/openv/netbackup/db/media/tpreq/A00002, from bptm.c.7311
> 09:21:33.849 [4558] <2> io_open: SCSI RESERVE
> 09:21:33.958 [4558] <2> io_open: file
> /usr/openv/netbackup/db/media/tpreq/A00002 successfully opened
> 09:21:33.958 [4558] <2> io_ioctl: command (5)MTREW 1 from
> (bptm.c.7315) on drive index 0
> 09:21:38.942 [4558] <2> io_ioctl: command (0)MTWEOF 1 from
> (bptm.c.7344) on drive index 0
> 09:21:40.751 [4558] <4> write_backup: begin writing backup id
> baksrv_1104582049, copy 1, fragment 1, to media id A00002 on drive
> index 0
> 09:21:40.751 [4558] <2> signal_parent: sending SIGUSR1 to bpbrm (pid = 4555)
> 09:21:40.751 [4558] <2> write_data: twin_index: 0 active: 1
> dont_process: 0 wrote_backup_hdr: 0 finished_buff: 0 saved_cindex: 0
> 09:21:45.882 [4558] <2> write_data: absolute block position prior to
> writing backup header(s) is 2, copy 1
> 09:21:45.882 [4558] <2> io_write_back_header: drive index 0,
> baksrv_1104582049, file num = 1, mpx_headers = 0, copy 1
> 09:21:45.959 [4558] <2> write_data: completed writing backup header,
> start writing data when first buffer is available, copy 1
> 09:21:45.959 [4558] <2> write_data: first write, twin_index: 0 cindex:
> 0 dont_process: 1 wrote_backup_hdr: 1 finished_buff: 0
> 09:21:45.959 [4558] <2> write_data: received first buffer (32768
> bytes), begin writing data
> 09:21:45.960 [4558] <2> write_backup: write_data() returned,
> exit_status = 0, CINDEX = 0, TWIN_INDEX = 0, backup_status = -3
> 09:21:45.960 [4558] <2> signal_parent: sending SIGUSR1 to bpbrm (pid = 4555)
> 09:21:45.960 [4558] <2> io_close: closing
> /usr/openv/netbackup/db/media/tpreq/A00002, from bptm.c.15502
> 09:21:47.574 [4558] <2> write_backup: EOM encountered writing backup
> header, entire image will be put on a new media
> 09:21:47.585 [4558] <2> write_backup: tpunmount'ing
> /usr/openv/netbackup/db/media/tpreq/A00002 after EOM
> 09:21:47.595 [4558] <2> TpUnmountWrapper: SCSI RELEASE
> 09:21:47.695 [4558] <2> add_to_vmhost_list: added baksrv to vmhost list
> 09:21:47.753 [4558] <2> select_media: getting new media id for retention 
> level 0
> 09:21:47.861 [4558] <2> vmdb_query_scratch_bypool2: server returned: 
> 1 A00003 ------ 14 -------- <..cut..>
> 09:21:47.861 [4558] <2> db_byid: search for media id A00003
> 09:21:47.861 [4558] <2> db_put: write media id A00003, offset = 1
> 09:21:47.876 [4558] <2> select_media: selected media id A00003 for
> backup[0], baksrv(rl = 0) <----------
> 09:21:47.876 [4558] <2> mount_open_media: Waiting for mount of media
> id A00003 (copy 1) on server baksrv.
> 09:22:40.239 [4558] <2> io_open: SCSI RESERVE
> 09:22:40.280 [4558] <2> io_open: file
> /usr/openv/netbackup/db/media/tpreq/A00003 successfully opened
> 09:22:40.280 [4558] <2> write_backup: media id A00003 mounted on drive
> index 1, drivepath /dev/rmt/1cbn, drivename HPUltrium2-SCSI1, copy 1
> 09:22:40.280 [4558] <2> io_read_media_header: drive index 1, reading
> media header, buflen = 65536, buff = 0x23b010, copy 1
> 09:22:40.280 [4558] <2> io_ioctl: command (5)MTREW 1 from
> (bptm.c.6330) on drive index 1
> 09:22:40.348 [4558] <2> io_write_media_header: drive index 1, writing
> media header
> 09:22:40.348 [4558] <2> io_close: closing
> /usr/openv/netbackup/db/media/tpreq/A00003, from bptm.c.7311
> 09:22:40.358 [4558] <2> io_open: SCSI RESERVE
> 09:22:40.465 [4558] <2> io_open: file
> /usr/openv/netbackup/db/media/tpreq/A00003 successfully opened
> 09:22:40.465 [4558] <2> io_ioctl: command (5)MTREW 1 from
> (bptm.c.7315) on drive index 1
> 09:22:43.479 [4558] <2> io_ioctl: command (0)MTWEOF 1 from
> (bptm.c.7344) on drive index 1
> 09:22:47.276 [4558] <4> write_backup: begin writing backup id
> baksrv_1104582049, copy 1, fragment 1, to media id A00003 on drive
> index 1
> 09:22:47.276 [4558] <2> signal_parent: sending SIGUSR1 to bpbrm (pid = 4555)
> 09:22:47.276 [4558] <2> write_data: twin_index: 0 active: 1
> dont_process: 0 wrote_backup_hdr: 0 finished_buff: 0 saved_cindex: 0
> 09:22:52.446 [4558] <2> write_data: absolute block position prior to
> writing backup header(s) is 2, copy 1
> 09:22:52.446 [4558] <2> io_write_back_header: drive index 1,
> baksrv_1104582049, file num = 1, mpx_headers = 0, copy 1
> 09:22:52.491 [4558] <2> write_data: completed writing backup header,
> start writing data when first buffer is available, copy 1
> 09:22:52.491 [4558] <2> write_data: first write, twin_index: 0 cindex:
> 0 dont_process: 1 wrote_backup_hdr: 1 finished_buff: 0
> 09:22:52.491 [4558] <2> write_data: received first buffer (32768
> bytes), begin writing data
> 09:22:52.493 [4558] <2> signal_parent: sending SIGUSR1 to bpbrm (pid = 4555)
> 09:22:52.493 [4558] <2> write_data: attempting write error recovery, err = 5
> 09:22:52.493 [4558] <2> tape_error_rec: error recovery to block 3 requested
> 09:22:52.493 [4558] <2> tape_error_rec: attempting error recovery,
> delay 3 minutes before next attempt, tries left = 5
> 09:25:50.267 [4604] <2> bptm: INITIATING (VERBOSE = 0): -count -cmd
> -rt 8 -rn 0 -stunit baksrv-hcart2-robot-tld-0 -den 14 -mt 2
> -masterversion 451000
> 09:25:50.268 [4604] <2> bptm: EXITING with status 0 <----------
> 09:25:52.490 [4558] <2> io_ioctl: command (0)MTWEOF 0 from
> (overwrite.c.192) on drive index 1
> 09:25:52.490 [4558] <2> io_ioctl: MTWEOF failed during error recovery, I/O 
> error
> 09:25:52.499 [4558] <2> tape_error_rec: SCSI RESERVE
> 09:25:52.516 [4558] <2> tape_error_rec: absolute block position after error 
> is 0
> 09:25:52.517 [4558] <2> set_job_details: Done 
> 09:25:52.538 [4558] <16> write_data: cannot write image to media id
> A00003, drive index 1, I/O error
> 09:25:52.562 [4558] <2> log_media_error: successfully wrote to error
> file - 01/01/05 09:25:52 A00003 1 WRITE_ERROR
> 09:25:52.563 [4558] <2> check_error_history: called from bptm line
> 15215, EXIT_Status = 84
> 09:25:52.563 [4558] <2> check_error_history: drive index = 1, media id
> = A00003, time = 01/01/05 09:25:52, both_match = 0, media_match = 0,
> drive_match = 0
> 09:25:52.563 [4558] <2> io_close: closing
> /usr/openv/netbackup/db/media/tpreq/A00003, from bptm.c.13417
> 09:25:52.563 [4558] <2> tpunmount: tpunmount'ing
> /usr/openv/netbackup/db/media/tpreq/A00003
> 09:25:52.572 [4558] <2> TpUnmountWrapper: SCSI RELEASE
> 09:25:52.731 [4558] <2> bptm: EXITING with status 84 <----------
> <..cut..>
> #
> # cat /var/adm/messages
> <..cut..>
> Jan  1 09:22:52 baksrv scsi: [ID 107833 kern.warning] WARNING:
> /pci@8,700000/pci@2/scsi@4/st@6,0 (st20):
> Jan  1 09:22:52 baksrv  Error for Command: write                  
> Error Level: Fatal
> Jan  1 09:22:52 baksrv scsi: [ID 107833 kern.notice]    Requested
> Block: 1                         Error Block: 1
> Jan  1 09:22:52 baksrv scsi: [ID 107833 kern.notice]    Vendor: HP    
>                             Serial Number:
> Jan  1 09:22:52 baksrv scsi: [ID 107833 kern.notice]    Sense Key:
> Aborted Command
> Jan  1 09:22:52 baksrv scsi: [ID 107833 kern.notice]    ASC: 0x47
> (scsi parity error), ASCQ: 0x0, FRU: 0x0
> Jan  1 09:22:52 baksrv scsi: [ID 107833 kern.warning] WARNING:
> /pci@8,700000/pci@2/scsi@4 (qus0):
> Jan  1 09:22:52 baksrv  Target synch. rate reduced. tgt 6 lun 0
> 
> 
> Tim: When you ran the Wizard, was NBU able to determine which drive
> (/dev/rmt/X) was associated with each robot drive number (Drive 1-4)? 
> If not and you drag-and-dropped them into place, it's possible the
> drives are out of order and NBU is mounting the tapes in the "wrong"
> drives.
> Kevin: You are correct Tim, the NBU wasn't able to determine which
> drive was associated with each robot drive number. When we ran the
> "Configure Storage Devices" Wizard, the wizard detected that there are
> 4 tape drive(s) and 1 robot. But on the next wizard's page the 4
> drives have limitations:
> Device                  Host    State       Serialized  Limitations
> ------                  ----    -----       ----------  -----------
> HP Ultrium 2-SCSI F40Z  baksrv  Configured  Yes         Yes, see limitations
> HP Ultrium 2-SCSI F40Z  baksrv  Configured  Yes         Yes, see limitations
> HP Ultrium 2-SCSI F40Z  baksrv  Configured  Yes         Yes, see limitations
> HP Ultrium 2-SCSI F45Z  baksrv  Configured  Yes         Yes, see limitations
> ATL M2500 3.07          baksrv  Configured  No          None
> Limitations detail: "Unable to determine robot type. Drive is
> standalone or in unknown robot. Robot drive number is unknown."
> 
> So on the next page, we have to drag and drop them into place. The
> strange thing was the robot only have 5 drive box (but in reality the
> L100 library has 6 maximum drives which we only have 4 drives out of
> 6):
> - [X] Robot 0 baksrv [ATL M2500 3.07]
>          [ ] Drive 1 (unused drive address)
>          [ ] Drive 2 (unused drive address)
>          [ ] Drive 3 (unused drive address)
>          [ ] Drive 4 (unused drive address)
>          [ ] Drive 5 (unused drive address)
> - Standalone Drives
>          [X] baksrv HPUltrium2-SCSI3 [HP Ultrium 2-SCSI F45Z]
>          [X] baksrv HPUltrium2-SCSI2 [HP Ultrium 2-SCSI F40Z]
>          [X] baksrv HPUltrium2-SCSI1 [HP Ultrium 2-SCSI F40Z]
>          [X] baksrv HPUltrium2-SCSI0 [HP Ultrium 2-SCSI F40Z]
> 
> So to drag and drop it to the correct drive number, I take the steps
> described on Media Manager System Administrator Guide that uses the
> robotic test utility (robtest) and see the drive's status on Device
> Monitor.
> 
> 
> David: Make sure drive type is the same as the media
> Kevin: The drive and the media have the same type: "hcart2".
> 
> Alex: Make sure the tapes doesn't already had something with different
> retention on them (from previous testing) while you have configure not
> to allow different retention to be written on the same media. Also,
> check the master properties -> media and make sure it "allow media
> overwrite".
> Kevin: The "Allow media overwrite" on server's Host Properties have
> all been checked. And we used new tapes.
> 
> Is there any clue on what causing the problem?
> 
> Best regards,
> Kevin Haritmonds
> 
> 
> On Fri, 31 Dec 2004 20:37:06 -0800, Alex Fong <alex.s.fong AT gmail DOT com> 
> wrote:
> > Make sure the tapes doesn't already had something with different
> > retention on them (from previous testing) while you have configure not
> > to allow different retention to be written on the same media. Also,
> > check the master properties -> media and make sure it "allow media
> > overwrite".
> > 
> > Alex
> > 
> > On Fri, 31 Dec 2004 21:04:18 -0600 (CST), Tim Hoke <thoke AT northpeak DOT 
> > org> wrote:
> > > If the drive types weren't the same as the media, then the mounting never
> > > would have occurred and the jobs would have failed with a status 96.
> > >
> > > On Fri, 31 Dec 2004, David Trostli wrote:
> > >
> > > > Make sure drive type is the same as the media
> > > >
> > > > Regards,
> > > >
> > > > David
> > > >
> > > > -----Mensagem original-----
> > > > De: veritas-bu-admin AT mailman.eng.auburn DOT edu
> > > > [mailto:veritas-bu-admin AT mailman.eng.auburn DOT edu]Em nome de Tim 
> > > > Hoke
> > > > Enviada em: sexta-feira, 31 de dezembro de 2004 13:59
> > > > Para: kevin haritmonds
> > > > Cc: veritas-bu AT mailman.eng.auburn DOT edu
> > > > Assunto: Re: [Veritas-bu] NetBackup DataCenter and L100 tape library
> > > >
> > > >
> > > > Kevin,
> > > >
> > > > I've got a couple of questions for you.
> > > >
> > > > First, were the tapes frozen?
> > > > You can check the tapes with either available_media or bpmedialist.
> > > >
> > > > Second, were the drives downed?
> > > >
> > > > You can check the drives with vmoprcmd -d (GUI's device monitor
> > > > equivalent).
> > > >
> > > > As someone else pointed out, the bptm log and syslog will be most useful
> > > > in the initial troubleshooting.
> > > >
> > > > Check bptm if the tapes were frozen.  Check syslog if they were downed.
> > > >
> > > > When you ran the Wizard, was NBU able to determine which drive
> > > > (/dev/rmt/X) was associated with each robot drive number (Drive 1-4)?  
> > > > If
> > > > not and you drag-and-dropped them into place, it's possible the drives 
> > > > are
> > > > out of order and NBU is mounting the tapes in the "wrong" drives.
> > > >
> > > > Run /usr/openv/volmgr/bin/scan and see if the robot reports the drive
> > > > serial numbers.  If so, then NBU should have been able to figure out 
> > > > which
> > > > drive goes with with number.
> > > >
> > > > HTH
> > > > -Tim
> > > >
> > > > On Fri, 31 Dec 2004, kevin haritmonds wrote:
> > > >
> > > > > Hi, I'm facing a problem in Veritas NetBackup DataCenter 4.5 server,
> > > > > installed on Solaris 9 platform (using SunBlade 2000 machine),
> > > > > connected to Sun StorEdge L100 Tape Library with 4 drives HP LTO-2 and
> > > > > 96 tape slots. This is the first time we want to use the L100 library.
> > > > > After we installed NetBackup DataCenter 4.5 from CD, we did the setup
> > > > > using Wizards: Configure Storage Devices, Configure Volumes, Configure
> > > > > Backup Catalog, and Create a simple Backup Policy "test1" which only
> > > > > backup server's /etc/ directory (which is localhost) to volume pool
> > > > > "Test1". We assigned two tapes (A00002 and A00003) to volume pool
> > > > > "Test1". Every time we run the policy manually, it always failed with
> > > > > following detailed status:
> > > > > <..cut..>
> > > > > It looks like the drive wants to write to the tape, but suddenly the
> > > > > tape was ejected a few seconds later and another tape mounted until
> > > > > all media on the pool consumed. This happens every time. FYI the /etc/
> > > > > directory's size is only 7.5 MB. We are using LTO-2 200/400GB tapes.
> > > > > We haven't done any backup using the L100 library.
> > > > >
> > > > > Here's the output of "tpconfig -dl":
> > > > > # /opt/openv/volmgr/bin/tpconfig -dl
> > > > > Currently defined drives and robots are:
> > > > > <..cut..>
> > > > > I have upgraded the software from NetBackup DataCenter 4.5FP_3GA to
> > > > > 4.5FP_6, but still no luck. Can anyone help me out? Thank you, any
> > > > > help would be very much appreciated.
> > > > >
> > > > > Best regards,
> > > > > Kevin Haritmonds
> _______________________________________________
> Veritas-bu maillist  -  Veritas-bu AT mailman.eng.auburn DOT edu
> http://mailman.eng.auburn.edu/mailman/listinfo/veritas-bu
>