Veritas-bu

RES: [Veritas-bu] NetBackup DataCenter and L100 tape library

2005-01-02 06:33:02
Subject: RES: [Veritas-bu] NetBackup DataCenter and L100 tape library
From: kevin haritmonds <kharitmonds AT gmail DOT com> (kevin haritmonds)
Date: Sun, 2 Jan 2005 18:33:02 +0700
Thank you guys for all of the input, you're all very helpful!

Tim:
Kevin, OK, you've ruled out the obvious items... and the bptm/syslog
output helps a lot. I think you've got either bad tapes or bad drives.
I find it interesting that A00002 is mounted and gets a media header
written to it; it looks like NBU is going to write data, then it gets
an "End of Media" (EOM). It then mounts A00003, seems to begin working
again, then gets a media write error... This time it correlates to the
hardware error in syslog. Better involve the hardware vendor to check
out those hardware errors.
Kevin:
Your analysis is correct, Tim! One drive had failed. I followed the
steps Len described to identify the failed drive, and I have disabled
it in NBU.


Len:
I would backstep and check things out at the lowest level. Pick one
tape that you can write over. (To use it in NetBackup afterwards you
may have to run bplabel on it.) Take NetBackup out of the picture:
manually load the tape into one drive, then use Unix commands to test
the tape:
  mt -f /dev/.... status
  use tar or dd to write data to the tape
  mt -f /dev/... rewind
  use tar or dd to read data from the tape
Make sure you use a fair amount of data, as you want to use enough of
the tape that the drive has to switch tracks. Also use a large block
size. If this works for the drive(s), then the drive hardware is OK.
Kevin:
Thank you for your suggestion, Len! Now I can be sure that one drive
(/dev/rmt/1) has failed. When I ran tar against the failed drive:
  # tar cvf /dev/rmt/1 usr
  a usr/ 0 tape blocks
  a usr/openwin/ 0 tape blocks
  a usr/openwin/bin/ 0 tape blocks
  a usr/openwin/bin/ctlconvert_txt 31 tape blocks
  tar: write error: I/O error
And in /var/adm/messages:
  Jan  2 08:15:38 baksrv scsi: [ID 107833 kern.warning] WARNING: /pci@8,700000/pci@2/scsi@4 (qus0):
  Jan  2 08:15:38 baksrv  Target synch. rate reduced. tgt 6 lun 0
  Jan  2 08:15:38 baksrv scsi: [ID 107833 kern.warning] WARNING: /pci@8,700000/pci@2/scsi@4/st@6,0 (st20):
  Jan  2 08:15:38 baksrv  Error for Command: write                  Error Level: Fatal
  Jan  2 08:15:38 baksrv scsi: [ID 107833 kern.notice]    Requested Block: 0                         Error Block: 0
  Jan  2 08:15:38 baksrv scsi: [ID 107833 kern.notice]    Vendor: HP                                 Serial Number:
  Jan  2 08:15:38 baksrv scsi: [ID 107833 kern.notice]    Sense Key: Aborted Command
  Jan  2 08:15:38 baksrv scsi: [ID 107833 kern.notice]    ASC: 0x47 (scsi parity error), ASCQ: 0x0, FRU: 0x0

The other three drives were fine. I have disabled the failed drive in NBU.
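
The manual write/read-back test Len describes could be scripted roughly
as follows. This is only a sketch under stated assumptions: the function
name tape_rw_test, the 256 KB block size and the 512 MB default amount
of data are illustrative choices, not values taken from NetBackup or
from the drive vendor, and the script overwrites whatever is on the
loaded tape.

```shell
#!/bin/sh
# Sketch of a write/read-back check for a single tape drive.
# tape_rw_test, the 256 KB block size and the 512 MB default are
# illustrative assumptions. WARNING: the loaded tape is overwritten.

tape_rw_test() {
    dev=$1                  # e.g. /dev/rmt/1 (a regular file works for a dry run)
    blocks=${2:-2048}       # number of 256 KB blocks to write (2048 = 512 MB)
    src=${TMPDIR:-/tmp}/tape_rw_test.$$

    # Build a test file of the requested size and remember its checksum.
    dd if=/dev/urandom of="$src" bs=256k count="$blocks" 2>/dev/null || return 2
    want=$(cksum < "$src")

    # Only real tape devices (character devices) understand mt.
    [ -c "$dev" ] && mt -f "$dev" rewind

    # Write the test data using a large block size.
    if ! dd if="$src" of="$dev" bs=256k 2>/dev/null; then
        echo "WRITE FAILED on $dev"
        rm -f "$src"
        return 1
    fi

    [ -c "$dev" ] && mt -f "$dev" rewind

    # Read the data back and compare checksums.
    got=$(dd if="$dev" bs=256k 2>/dev/null | cksum)
    rm -f "$src"
    if [ "$want" = "$got" ]; then
        echo "OK $dev"
    else
        echo "READ MISMATCH on $dev"
        return 1
    fi
}
```

With a scratch tape loaded, "tape_rw_test /dev/rmt/1" should report
WRITE FAILED if dd hits the same I/O error that tar reported; repeating
the call for each drive in turn separates a bad drive from a bad tape.
As Len notes, the tape may need bplabel before NetBackup will use it
again.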


Len:
If the above works then try to use robtest to load one tape and only
one tape at a time. Check the robot to make sure that the tape is in
the correct drive. If all is ok, move the tape to the next drive.
Kevin:
Today I enabled the "Serialization" setting on the L100 library (it is
disabled by default). After I reinstalled NetBackup DataCenter 4.5FP6,
NBU could place all of the drives automatically, so I no longer have to
drag-and-drop them into place. I don't know why Sun ships the L100 with
serialization disabled by default.


David:
Check the value of DISALLOW_BACKUPS_SPANNING_MEDIA in bp.conf. It
should be set to NO. After that, if the problem persists, have the
hardware vendor check the drive for hardware errors and a firmware
update.
Kevin: I only backed up the /etc directory, so the backup hasn't
spanned media yet (correct me if I'm wrong).


David:
One solution to this is to make sure your bp.conf contains the
following entries:
ALLOW_MEDIA_OVERWRITE = DBR
ALLOW_MEDIA_OVERWRITE = TAR
ALLOW_MEDIA_OVERWRITE = CPIO
ALLOW_MEDIA_OVERWRITE = ANSI
ALLOW_MEDIA_OVERWRITE = MTF1
Kevin: I already did that, but thanks anyway for the input.
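
A quick way to confirm which of those entries are actually present is
to read them back from bp.conf. The helper names list_media_overwrite
and missing_media_overwrite below are hypothetical, and the sketch
assumes the "NAME = VALUE" layout shown above and the usual
/usr/openv/netbackup/bp.conf location.

```shell
#!/bin/sh
# Sketch: report ALLOW_MEDIA_OVERWRITE entries in a bp.conf file.
# Both function names are hypothetical helpers, not NetBackup commands.

# Print the value of every ALLOW_MEDIA_OVERWRITE entry, one per line.
list_media_overwrite() {
    conf=${1:-/usr/openv/netbackup/bp.conf}
    sed -n 's/^[[:space:]]*ALLOW_MEDIA_OVERWRITE[[:space:]]*=[[:space:]]*\([A-Za-z0-9_]*\).*$/\1/p' "$conf"
}

# Print each requested format that has no matching entry in the file.
missing_media_overwrite() {
    conf=$1
    shift
    have=$(list_media_overwrite "$conf")
    for fmt in "$@"; do
        echo "$have" | grep -qx "$fmt" || echo "$fmt"
    done
}
```

For example, "missing_media_overwrite /usr/openv/netbackup/bp.conf DBR
TAR CPIO ANSI MTF1" prints nothing when all five entries are present.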


Rockey:
Research the hardware side; look for a bent or loose SCSI connection.
If fibre is involved, look for a fracture. When you run the
tpautoconf -d command, make sure all drives have the correct and same
firmware. Additionally, you may want to review the drive block size to
make sure it is compatible with the media being used (e.g. don't use a
16K block on a 64K drive).
Kevin: We use SCSI cables. Hmm, this is interesting... How can I tell
the block size of the drives and the media?


Today I could run the backup successfully, but several tapes (A00002,
A00004, A00005) got rejected. Each time, the bptm log reports that an
EOM was encountered while writing the backup header; the tape is
ejected a few seconds later and another tape is mounted. Earlier,
during the hardware test, I used tar to back up and restore 500 MB of
data on the A00002 tape and it worked fine (so physically the tape
must be good, right?). I manually relabeled the tapes with the
command "bplabel -m A00002 -d hcart2 -p test1", but with no luck: NBU
still "rejected" the tape.

According to the bptm log:
09:40:38.010 [2778] <2> write_data: received first buffer (32768 bytes), begin writing data
09:40:38.010 [2778] <2> write_backup: write_data() returned, exit_status = 0, CINDEX = 0, TWIN_INDEX = 0, backup_status = -3
09:40:38.010 [2778] <2> signal_parent: sending SIGUSR1 to bpbrm (pid = 2775)
09:40:38.010 [2778] <2> io_close: closing /usr/openv/netbackup/db/media/tpreq/A00002, from bptm.c.15502
09:40:39.621 [2778] <2> write_backup: EOM encountered writing backup header, entire image will be put on a new media
09:40:39.633 [2778] <2> write_backup: tpunmount'ing /usr/openv/netbackup/db/media/tpreq/A00002 after EOM
09:40:39.643 [2778] <2> TpUnmountWrapper: SCSI RELEASE
09:40:39.745 [2778] <2> add_to_vmhost_list: added baksrv to vmhost list
09:40:39.793 [2778] <2> select_media: getting new media id for retention level 0
<..cut..>
10:13:15.841 [3537] <2> write_data: received first buffer (32768 bytes), begin writing data
10:13:15.842 [3537] <2> write_backup: write_data() returned, exit_status = 0, CINDEX = 0, TWIN_INDEX = 0, backup_status = -3
10:13:15.842 [3537] <2> signal_parent: sending SIGUSR1 to bpbrm (pid = 3536)
10:13:15.842 [3537] <2> io_close: closing /usr/openv/netbackup/db/media/tpreq/A00004, from bptm.c.15502
10:13:17.447 [3537] <2> write_backup: EOM encountered writing backup header, entire image will be put on a new media
10:13:17.461 [3537] <2> write_backup: tpunmount'ing /usr/openv/netbackup/db/media/tpreq/A00004 after EOM
10:13:17.471 [3537] <2> TpUnmountWrapper: SCSI RELEASE
10:13:17.560 [3537] <2> add_to_vmhost_list: added baksrv to vmhost list
10:13:17.630 [3537] <2> db_lock_media: unable to lock media at offset 2 (A00005)
10:13:17.630 [3537] <2> select_media: getting new media id for retention level 0

# ./bpmedialist
Server Host = baksrv
id     rl  images   allocated        last updated      density  kbytes restores
          vimages   expiration       last read         <------- STATUS ------->
--------------------------------------------------------------------------------
A00003   0      6   01/02/2005 09:39  01/02/2005 10:23  hcart2     47840       0
               6   01/09/2005 10:23        N/A       
A00006   0      2   01/02/2005 10:11  01/02/2005 10:23  hcart2     22688       0
               2   01/09/2005 10:23        N/A       
A00007   0      6   01/02/2005 10:11  01/02/2005 10:39  hcart2     40480       0
               6   01/09/2005 10:39        N/A       

Several other tapes (A00003, A00006, A00007) were fine (A00003 was
also rejected the first time, but strangely, when I ran the manual
backup a second time it worked fine). I'm lost here; is there some way
to track down this problem?
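
To see at a glance which tapes bptm gave up on, the "after EOM" unmount
lines can be counted per media ID. This is a rough sketch: the function
name eom_media is hypothetical, and it assumes the raw, unwrapped log
lines as bptm writes them to the log file (for example
/usr/openv/netbackup/logs/bptm/log.010105), not the line-wrapped copy
that mail clients produce.

```shell
#!/bin/sh
# Sketch: count, per media ID, how often bptm unmounted a tape after EOM.
# eom_media is a hypothetical helper name, not a NetBackup command.

eom_media() {
    # Looks for lines of the form:
    #   ... write_backup: tpunmount'ing <path>/tpreq/<MEDIA_ID> after EOM
    # The "." in the pattern stands in for the apostrophe.
    awk '/tpunmount.ing .*\/tpreq\/[^ ]+ after EOM/ {
             n = split($0, parts, "/tpreq/")   # text after the last /tpreq/
             split(parts[n], tail, " ")        # first word is the media ID
             count[tail[1]]++
         }
         END { for (id in count) print id, count[id] }' "$@" | sort
}
```

Running "eom_media /usr/openv/netbackup/logs/bptm/log.*" over several
days of logs shows whether the same tapes are rejected repeatedly or
whether the EOMs follow whichever drive the tape was mounted in.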


Best regards,
Kevin Haritmonds

On Sat, 1 Jan 2005 20:49:43 -0800, Rockey Reed <Rockey.Reed AT veritas DOT com> 
wrote:
> Kevin,
> 
> Research the hardware side; look for bent or loose SCSI connection.  If this
> has fibre involved look for a fracture.  When you run the tpautoconf -d
> command, make sure all drives have the correct and same firmware.
> Additionally you may want to review the drive block size to make sure it is
> compatible with the media being used (e.g. don't use a 16K block on a 64K
> drive).
> 
> Please let us all know the solution . . . this is getting interesting.
> 
> Thanks,
> 
> Rockey J. Reed
> 
> -----Original Message-----
> From: veritas-bu-admin AT mailman.eng.auburn DOT edu
> [mailto:veritas-bu-admin AT mailman.eng.auburn DOT edu] On Behalf Of kevin
> haritmonds
> Sent: Saturday, January 01, 2005 3:14 AM
> To: veritas-bu AT mailman.eng.auburn DOT edu
> Subject: Re: RES: [Veritas-bu] NetBackup DataCenter and L100 tape library
> 
> Thank you Tim and everyone else for your input. I'm sorry if this will
> be a long message but I need to give all of the details. I hope you
> don't mind.
> 
> Tim: First, were the tapes frozen? You can check the tapes with either
> available_media or bpmedialist.
> Kevin: No, the tapes were not frozen:
> # /opt/openv/netbackup/bin/goodies/available_media
> NetBackup pool
> A00000  HCART2   TLD      0       25     -       -     -        DBBACKUP
> A00001  HCART2   TLD      0       26     -       -     -        DBBACKUP
> 
> None pool
> 
> Test1 pool
> A00002  HCART2   TLD      0       27     -       -     -        AVAILABLE
> A00003  HCART2   TLD      0       28     -       -     -        AVAILABLE
> # /opt/openv/netbackup/bin/admincmd/bpmedialist
> #
> 
> Tim: Second, were the drives downed? You can check the drives with
> vmoprcmd -d (GUI's device monitor equivalent).
> Kevin: The drives were all up:
> # /opt/openv/volmgr/bin/vmoprcmd -d
> <..cut..>
> 
> Tim & Len: As someone else pointed out, the bptm log and syslog will
> be most useful in the initial troubleshooting. Check bptm if the tapes
> were frozen.  Check syslog if they were downed.
> Kevin: I'm not sure if I read it right, but it seems that when it tried
> to write the backup header to tape A00002, an EOM was encountered.
> After another tape (A00003) was mounted, it tried to write to it but
> failed with an I/O error. I'm not sure what caused it, but we used new
> LTO-2 tapes.
> # cat /usr/openv/netbackup/logs/bptm/log.010105
> <..cut..>
> # cat /var/adm/messages
> <..cut..>
> 
> Tim: When you ran the Wizard, was NBU able to determine which drive
> (/dev/rmt/X) was associated with each robot drive number (Drive 1-4)?
> If not and you drag-and-dropped them into place, it's possible the
> drives are out of order and NBU is mounting the tapes in the "wrong"
> drives.
> Kevin: You are correct, Tim; NBU wasn't able to determine which
> drive was associated with each robot drive number. When we ran the
> "Configure Storage Devices" Wizard, it detected 4 tape drive(s) and
> 1 robot. But on the next page of the wizard the 4 drives were shown
> with limitations:
> <..cut..>
> 
> So on the next page, we had to drag and drop them into place. The
> strange thing was that the robot showed only 5 drive boxes (in
> reality the L100 library holds a maximum of 6 drives, of which we
> have 4):
> <..cut..>
> To drag and drop each drive to the correct drive number, I followed
> the steps described in the Media Manager System Administrator Guide,
> which use the robotic test utility (robtest), and watched the drive
> status in Device Monitor.
> 
> David: Make sure drive type is the same as the media
> Kevin: The drive and the media have the same type: "hcart2".
> 
> Alex: Make sure the tapes don't already have something with a
> different retention on them (from previous testing) while you have
> configured NBU not to allow different retentions on the same media.
> Also, check the master properties -> media and make sure it is set
> to "allow media overwrite".
> Kevin: All of the "Allow media overwrite" options in the server's
> Host Properties are checked. And we used new tapes.
> 
> Is there any clue as to what is causing the problem?
> 
> Best regards,
> Kevin Haritmonds
> 
> On Fri, 31 Dec 2004 20:37:06 -0800, Alex Fong <alex.s.fong AT gmail DOT com> 
> wrote:
> > Make sure the tapes don't already have something with a different
> > retention on them (from previous testing) while you have configured
> > NBU not to allow different retentions on the same media. Also,
> > check the master properties -> media and make sure it is set to
> > "allow media overwrite".
> >
> > Alex
> >
> > > On Fri, 31 Dec 2004 21:04:18 -0600 (CST), Tim Hoke <thoke AT northpeak DOT org> wrote:
> > > > If the drive types weren't the same as the media, then the mounting
> > > > never would have occurred and the jobs would have failed with a
> > > > status 96.
> > >
> > > On Fri, 31 Dec 2004, David Trostli wrote:
> > >
> > > > Make sure drive type is the same as the media
> > > >
> > > > Regards,
> > > >
> > > > David
> > > >
> > > > > -----Original Message-----
> > > > > From: veritas-bu-admin AT mailman.eng.auburn DOT edu
> > > > > [mailto:veritas-bu-admin AT mailman.eng.auburn DOT edu] On Behalf Of
> > > > > Tim Hoke
> > > > > Sent: Friday, December 31, 2004 13:59
> > > > > To: kevin haritmonds
> > > > > Cc: veritas-bu AT mailman.eng.auburn DOT edu
> > > > > Subject: Re: [Veritas-bu] NetBackup DataCenter and L100 tape library
> > > >
> > > >
> > > > Kevin,
> > > >
> > > > I've got a couple of questions for you.
> > > >
> > > > First, were the tapes frozen?
> > > > You can check the tapes with either available_media or bpmedialist.
> > > >
> > > > Second, were the drives downed?
> > > >
> > > > You can check the drives with vmoprcmd -d (GUI's device monitor
> > > > equivalent).
> > > >
> > > > > As someone else pointed out, the bptm log and syslog will be most
> > > > > useful in the initial troubleshooting.
> > > > >
> > > > > Check bptm if the tapes were frozen.  Check syslog if they were
> > > > > downed.
> > > > >
> > > > > When you ran the Wizard, was NBU able to determine which drive
> > > > > (/dev/rmt/X) was associated with each robot drive number (Drive
> > > > > 1-4)? If not and you drag-and-dropped them into place, it's
> > > > > possible the drives are out of order and NBU is mounting the tapes
> > > > > in the "wrong" drives.
> > > >
> > > > > Run /usr/openv/volmgr/bin/scan and see if the robot reports the drive
> > > > > serial numbers.  If so, then NBU should have been able to figure out
> > > > > which drive goes with which number.
> > > >
> > > > HTH
> > > > -Tim
> > > >
> > > > On Fri, 31 Dec 2004, kevin haritmonds wrote:
> > > >
> > > > > > Hi, I'm facing a problem with a Veritas NetBackup DataCenter 4.5
> > > > > > server installed on Solaris 9 (on a SunBlade 2000 machine),
> > > > > > connected to a Sun StorEdge L100 Tape Library with 4 HP LTO-2
> > > > > > drives and 96 tape slots. This is the first time we have tried to
> > > > > > use the L100 library. After we installed NetBackup DataCenter 4.5
> > > > > > from CD, we did the setup using the Wizards: Configure Storage
> > > > > > Devices, Configure Volumes, Configure Backup Catalog, and Create a
> > > > > > simple Backup Policy "test1", which only backs up the server's
> > > > > > /etc/ directory (the server is localhost) to volume pool "Test1".
> > > > > > We assigned two tapes (A00002 and A00003) to volume pool "Test1".
> > > > > > Every time we run the policy manually, it fails with the
> > > > > > following detailed status:
> > > > > > <..cut..>
> > > > > > It looks like the drive wants to write to the tape, but the tape
> > > > > > is ejected a few seconds later and another tape is mounted, until
> > > > > > all media in the pool are consumed. This happens every time. FYI
> > > > > > the /etc/ directory's size is only 7.5 MB. We are using LTO-2
> > > > > > 200/400GB tapes. We haven't done any backups using the L100
> > > > > > library before.
> > > > >
> > > > > Here's the output of "tpconfig -dl":
> > > > > # /opt/openv/volmgr/bin/tpconfig -dl
> > > > > Currently defined drives and robots are:
> > > > > <..cut..>
> > > > > I have upgraded the software from NetBackup DataCenter 4.5FP_3GA to
> > > > > 4.5FP_6, but still no luck. Can anyone help me out? Thank you, any
> > > > > help would be very much appreciated.
> > > > >
> > > > > Best regards,
> > > > > Kevin Haritmonds
> _______________________________________________
> Veritas-bu maillist  -  Veritas-bu AT mailman.eng.auburn DOT edu
> http://mailman.eng.auburn.edu/mailman/listinfo/veritas-bu
>