Veritas-bu

RES: [Veritas-bu] NetBackup DataCenter and L100 tape library

2005-01-03 09:07:17
Subject: RES: [Veritas-bu] NetBackup DataCenter and L100 tape library
From: kevin haritmonds <kharitmonds AT gmail DOT com> (kevin haritmonds)
Date: Mon, 3 Jan 2005 21:07:17 +0700
The problem is solved. Today I patched the OS with the Solaris 9
Recommended Patch Cluster, uninstalled NBU DataCenter 4.5 FP6, installed
NBU Enterprise Server 5.0, applied the latest patch for NBU Enterprise
Server 5.0, and installed the latest Mappings file. It works like a
charm, no more EOM problems :).

To summarize, here are the steps I took:
1. Configure the L100 library:
      - Serialization: Enabled
      - Emulation: ATL M2500
   This lets NBU assign each drive to its robotic drive number
automatically when we run the "Configure Storage Devices" wizard.
2. Apply the Solaris 9 Recommended Patch Cluster:
      # unzip 9_Recommended.zip
      # cd 9_Recommended
      # ./install_cluster
      # init 6
3. Uninstall NBU DataCenter 4.5 FP6.
4. Install NBU Enterprise Server 5.0.
5. Install the latest NBU Enterprise Server 5.0 patch:
      # tar xvf NB_50_4_M_273716.solaris.tar
      # tar xvf NB_CLT_50_4_M_273719.tar
      # ./Vrts_pack.install
        (when prompted, enter the pack name: NB_50_4_M)
6. Install the latest Mappings file:
      # tar xvf Mappings_5_272595.tar
      # cp device_mappings.txt /usr/openv/share/device_mappings.txt
7. Run NBU:
      # /jnbSA &
8. Run the wizards.
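
For step 6, it may be worth keeping a copy of the old mappings file
before overwriting it. A minimal sketch, assuming the same paths as
above; the function name install_mappings and the .bak suffix are my own
illustrative choices, not anything NetBackup provides:

```shell
#!/bin/sh
# install_mappings SRC DEST
# Copy a new device mappings file into place, keeping the previous one
# as DEST.bak. The function name and the .bak suffix are illustrative
# assumptions, not NetBackup conventions.
install_mappings() {
    src=$1
    dest=$2
    if [ ! -r "$src" ]; then
        echo "cannot read $src" >&2
        return 1
    fi
    # Keep the old file so the change can be rolled back.
    if [ -f "$dest" ]; then
        cp -p "$dest" "$dest.bak" || return 1
    fi
    cp "$src" "$dest" || return 1
    # Succeed only if the copy matches the source byte for byte.
    cmp -s "$src" "$dest"
}
```

Usage would be something like:
      # install_mappings device_mappings.txt /usr/openv/share/device_mappings.txt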

I'm not sure what the root cause was; it could be the Solaris 9 patch
cluster, the patch for NBU Enterprise Server 5.0, or both.

I really appreciate all the help from you. Thank you very much!

Cheers,
Kevin Haritmonds

On Sun, 2 Jan 2005 18:33:02 +0700, kevin haritmonds
<kharitmonds AT gmail DOT com> wrote:
> Thank you guys for all of the input, you're all very helpful!
> 
> Tim:
> Kevin, OK, you've ruled out the obvious items... and the bptm/syslog
> output helps a lot. I think you've got either bad tapes or bad drives.
> I find it interesting that A00002 is mounted and gets a media header
> written to it; it looks like NBU is going to write data, then gets an
> "End of Media" (EOM).  It then mounts A00003, seems to begin working
> again, then gets a media write error... This time it correlates with
> the hardware error in syslog. Better involve the hardware vendor to
> check out those hardware errors.
> Kevin:
> Your analysis is correct, Tim! One drive had failed. I followed the
> steps Len described to identify the failed drive, and I have disabled
> it in NBU.
> 
> Len:
> I would step back and check things out at the lowest level. Pick one
> tape that you can overwrite. (To use it in NetBackup afterwards you may
> have to relabel it with bplabel.) Take NetBackup out of the picture:
> manually load the tape into one tape drive, then use Unix commands to
> test the tape.
> mt -f /dev/.... status
> use tar or dd to write data to the tape
> mt -f /dev/... rewind
> use tar or dd to read  data from the tape.
> Make sure you use a fair amount of data, because you want to use
> enough of the tape that the drive has to switch tracks. Also use a
> large block size. If this works for the drive(s), then the drive
> hardware is OK.
> Kevin:
> Thank you for your suggestion, Len! Now I can be sure that one drive
> (/dev/rmt/1) has failed. When I ran tar against the failed drive:
>  # tar cvf /dev/rmt/1 usr
> <..cut..>
> 
> The other three drives were fine. I have disabled the failed drive in NBU.
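
For anyone repeating Len's write / rewind / read-back test, it can be
sketched as a small shell function. This is only a sketch under
assumptions: the function name tape_roundtrip, the MT override variable
and the 64k block size are illustrative choices of mine, and the test
overwrites whatever is on the tape.

```shell
#!/bin/sh
# tape_roundtrip DEVICE FILE
# Write FILE to DEVICE with a large block size, rewind, read it back and
# compare checksums. DESTROYS any data already on the tape.
# The name tape_roundtrip, the MT variable (lets the mt command be
# overridden) and the 64k block size are illustrative assumptions.
tape_roundtrip() {
    dev=$1
    file=$2
    mt_cmd=${MT:-mt}

    # Make sure the drive responds before writing anything.
    "$mt_cmd" -f "$dev" status || return 1

    # Write the test data with a large block size.
    dd if="$file" of="$dev" bs=64k 2>/dev/null || return 1
    "$mt_cmd" -f "$dev" rewind || return 1

    # Checksum the source and what comes back from the drive.
    want=`cksum < "$file"`
    got=`dd if="$dev" bs=64k 2>/dev/null | cksum`

    if [ "$want" = "$got" ]; then
        echo "$dev: read-back matches"
    else
        echo "$dev: read-back MISMATCH" >&2
        return 1
    fi
}
```

Run it once per drive with a scratch tape loaded, e.g.
"tape_roundtrip /dev/rmt/1 /var/tmp/testdata", where /var/tmp/testdata
is a hypothetical test file large enough to exercise a fair amount of
tape.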
> 
> Len:
> If the above works then try to use robtest to load one tape and only
> one tape at a time. Check the robot to make sure that the tape is in
> the correct drive. If all is ok, move the tape to the next drive.
> Kevin:
> Today I enabled the "Serialization" setting on the L100 library (it
> is disabled by default). When I then reinstalled NetBackup DataCenter
> 4.5FP6, NBU placed all the drives automatically and I no longer had to
> drag-and-drop them into place. I don't know why Sun ships the L100
> library with serialization disabled by default.
> 
> David:
> Check the value of DISALLOW_BACKUPS_SPANNING_MEDIA in bp.conf. It
> should be set to NO. If the problem persists after that, have the
> hardware vendor check the drive for hardware errors and firmware
> updates.
> Kevin: I only backed up the /etc directory, so the backup hasn't
> spanned media yet (correct me if I'm wrong).
> 
> David:
> One solution to this is to make sure you have the following entries
> in your bp.conf:
> ALLOW_MEDIA_OVERWRITE = DBR
> ALLOW_MEDIA_OVERWRITE = TAR
> ALLOW_MEDIA_OVERWRITE = CPIO
> ALLOW_MEDIA_OVERWRITE = ANSI
> ALLOW_MEDIA_OVERWRITE = MTF1
> Kevin: I already did that, but thanks anyway for the input.
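
A quick way to double-check entries like these is to scan bp.conf for
each format. A minimal sketch; the function name
missing_overwrite_entries is illustrative, and the bp.conf path in the
usage line is an assumption about the install location:

```shell
#!/bin/sh
# missing_overwrite_entries CONF
# Print each media format from David's list that has no
# "ALLOW_MEDIA_OVERWRITE = <format>" line in CONF. Prints nothing when
# all five are present. If CONF cannot be read, every format is reported.
# The function name is an illustrative assumption.
missing_overwrite_entries() {
    conf=$1
    for fmt in DBR TAR CPIO ANSI MTF1; do
        # Tolerate extra spaces around "=" and at the end of the line.
        grep "^ALLOW_MEDIA_OVERWRITE *= *$fmt *\$" "$conf" >/dev/null 2>&1 ||
            echo "$fmt"
    done
}
```

For example: "missing_overwrite_entries /usr/openv/netbackup/bp.conf".
No output means all five entries are there.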
> 
> Rockey:
> Research the hardware side; look for a bent or loose SCSI connection.
> If this has fibre involved, look for a fracture.  When you run the
> tpautoconf -d command, make sure all drives have the correct and same
> firmware. Additionally you may want to review the drive block size to
> make sure it is compatible with the media being used (e.g. don't use a
> 16K block on a 64K drive).
> Kevin: We used SCSI cables. Hmm this is interesting... How can I tell
> the block size of the drives and the media?
> 
> Today I could run the backup successfully, but several tapes (A00002,
> A00004, A00005) got rejected. There is always an error in the bptm
> log saying that an EOM was encountered while writing the backup
> header; the tape is ejected a few seconds later and another tape is
> mounted. Earlier, during the hardware test, I used tar to back up and
> restore 500 MB of data on the A00002 tape and it worked fine (so
> physically the tape must be good, right?). I manually relabeled the
> tapes with the command "bplabel -m A00002 -d hcart2 -p test1", with
> no luck; NBU still "rejected" the tape.
> 
> According to the bptm log:
> <..cut..>
> 
> # ./bpmedialist
> <..cut..>
> 
> Several other tapes (A00003, A00006, A00007) were fine (A00003 was
> also rejected the first time, but strangely it worked when I ran the
> manual backup a second time). I'm lost here; is there some way to
> track down this problem?
> 
> 
> Best regards,
> Kevin Haritmonds
> 
> On Sat, 1 Jan 2005 20:49:43 -0800, Rockey Reed
> <Rockey.Reed AT veritas DOT com> wrote:
> > Kevin,
> >
> > Research the hardware side; look for a bent or loose SCSI connection.  If
> > this has fibre involved, look for a fracture.  When you run the
> > tpautoconf -d command, make sure all drives have the correct and same
> > firmware.  Additionally you may want to review the drive block size to
> > make sure it is compatible with the media being used (e.g. don't use a
> > 16K block on a 64K drive).
> >
> > Please let us all know the solution . . . this is getting interesting.
> >
> > Thanks,
> >
> > Rockey J. Reed
> >
> > -----Original Message-----
> > From: veritas-bu-admin AT mailman.eng.auburn DOT edu
> > [mailto:veritas-bu-admin AT mailman.eng.auburn DOT edu] On Behalf Of kevin
> > haritmonds
> > Sent: Saturday, January 01, 2005 3:14 AM
> > To: veritas-bu AT mailman.eng.auburn DOT edu
> > Subject: Re: RES: [Veritas-bu] NetBackup DataCenter and L100 tape library
> >
> > Thank you Tim and everyone else for your input. I'm sorry if this will
> > be a long message but I need to give all of the details. I hope you
> > don't mind.
> >
> > Tim: First, were the tapes frozen? You can check the tapes with either
> > available_media or bpmedialist.
> > Kevin: No, the tapes were not frozen:
> > # /opt/openv/netbackup/bin/goodies/available_media
> > NetBackup pool
> > A00000  HCART2   TLD      0       25     -       -     -        DBBACKUP
> > A00001  HCART2   TLD      0       26     -       -     -        DBBACKUP
> >
> > None pool
> >
> > Test1 pool
> > A00002  HCART2   TLD      0       27     -       -     -        AVAILABLE
> > A00003  HCART2   TLD      0       28     -       -     -        AVAILABLE
> > # /opt/openv/netbackup/bin/admincmd/bpmedialist
> > #
> >
> > Tim: Second, were the drives downed? You can check the drives with
> > vmoprcmd -d (GUI's device monitor equivalent).
> > Kevin: The drives were all up:
> > # /opt/openv/volmgr/bin/vmoprcmd -d
> > <..cut..>
> >
> > Tim & Len: As someone else pointed out, the bptm log and syslog will
> > be most useful in the initial troubleshooting. Check bptm if the tapes
> > were frozen.  Check syslog if they were downed.
> > Kevin: I'm not sure if I read it right, but it seems that when it
> > tried to write the backup header to tape A00002, an EOM was
> > encountered. After another tape (A00003) was mounted, it tried to
> > write to it but failed with an I/O error. I'm not sure what causes
> > it, but we used new LTO-2 tapes.
> > # cat /usr/openv/netbackup/logs/bptm/log.010105
> > <..cut..>
> > # cat /var/adm/messages
> > <..cut..>
> >
> > Tim: When you ran the Wizard, was NBU able to determine which drive
> > (/dev/rmt/X) was associated with each robot drive number (Drive 1-4)?
> > If not and you drag-and-dropped them into place, it's possible the
> > drives are out of order and NBU is mounting the tapes in the "wrong"
> > drives.
> > Kevin: You are correct, Tim; NBU wasn't able to determine which
> > drive was associated with each robot drive number. When we ran the
> > "Configure Storage Devices" wizard, it detected 4 tape drive(s) and
> > 1 robot. But on the wizard's next page the 4 drives have
> > limitations:
> > <..cut..>
> >
> > So on the next page, we had to drag and drop them into place. The
> > strange thing was that the robot showed only 5 drive boxes (in
> > reality the L100 library holds a maximum of 6 drives, of which we
> > have 4):
> > <..cut..>
> > To drag and drop each drive to the correct drive number, I followed
> > the steps described in the Media Manager System Administrator's
> > Guide, which use the robotic test utility (robtest) while watching
> > the drive's status in Device Monitor.
> >
> > David: Make sure drive type is the same as the media
> > Kevin: The drive and the media have the same type: "hcart2".
> >
> > Alex: Make sure the tapes don't already have something with a
> > different retention on them (from previous testing) if you have
> > configured NBU not to allow different retentions on the same media.
> > Also, check the master properties -> media and make sure "allow
> > media overwrite" is enabled.
> > Kevin: All of the "Allow media overwrite" options in the server's
> > Host Properties have been checked. And we used new tapes.
> >
> > Does anyone have a clue as to what is causing the problem?
> >
> > Best regards,
> > Kevin Haritmonds
> >
> > On Fri, 31 Dec 2004 20:37:06 -0800, Alex Fong
> > <alex.s.fong AT gmail DOT com> wrote:
> > > Make sure the tapes don't already have something with a different
> > > retention on them (from previous testing) if you have configured
> > > NBU not to allow different retentions on the same media. Also,
> > > check the master properties -> media and make sure "allow media
> > > overwrite" is enabled.
> > >
> > > Alex
> > >
> > > On Fri, 31 Dec 2004 21:04:18 -0600 (CST), Tim Hoke
> > > <thoke AT northpeak DOT org> wrote:
> > > > If the drive types weren't the same as the media, then the mounting
> > > > never would have occurred and the jobs would have failed with a
> > > > status 96.
> > > >
> > > > On Fri, 31 Dec 2004, David Trostli wrote:
> > > >
> > > > > Make sure drive type is the same as the media
> > > > >
> > > > > Regards,
> > > > >
> > > > > David
> > > > >
> > > > > -----Original Message-----
> > > > > From: veritas-bu-admin AT mailman.eng.auburn DOT edu
> > > > > [mailto:veritas-bu-admin AT mailman.eng.auburn DOT edu] On Behalf Of
> > > > > Tim Hoke
> > > > > Sent: Friday, December 31, 2004 13:59
> > > > > To: kevin haritmonds
> > > > > Cc: veritas-bu AT mailman.eng.auburn DOT edu
> > > > > Subject: Re: [Veritas-bu] NetBackup DataCenter and L100 tape library
> > > > >
> > > > >
> > > > > Kevin,
> > > > >
> > > > > I've got a couple of questions for you.
> > > > >
> > > > > First, were the tapes frozen?
> > > > > You can check the tapes with either available_media or bpmedialist.
> > > > >
> > > > > Second, were the drives downed?
> > > > >
> > > > > You can check the drives with vmoprcmd -d (GUI's device monitor
> > > > > equivalent).
> > > > >
> > > > > As someone else pointed out, the bptm log and syslog will be most
> > > > > useful in the initial troubleshooting.
> > > > >
> > > > > Check bptm if the tapes were frozen.  Check syslog if they were
> > > > > downed.
> > > > >
> > > > > When you ran the Wizard, was NBU able to determine which drive
> > > > > (/dev/rmt/X) was associated with each robot drive number (Drive 1-4)?
> > > > > If not and you drag-and-dropped them into place, it's possible the
> > > > > drives are out of order and NBU is mounting the tapes in the "wrong"
> > > > > drives.
> > > > >
> > > > > Run /usr/openv/volmgr/bin/scan and see if the robot reports the drive
> > > > > serial numbers.  If so, then NBU should have been able to figure out
> > > > > which drive goes with which number.
> > > > >
> > > > > HTH
> > > > > -Tim
> > > > >
> > > > > On Fri, 31 Dec 2004, kevin haritmonds wrote:
> > > > >
> > > > > > Hi, I'm facing a problem with a Veritas NetBackup DataCenter 4.5
> > > > > > server installed on Solaris 9 (on a Sun Blade 2000 machine) and
> > > > > > connected to a Sun StorEdge L100 Tape Library with 4 HP LTO-2
> > > > > > drives and 96 tape slots. This is the first time we have used
> > > > > > the L100 library. After we installed NetBackup DataCenter 4.5
> > > > > > from CD, we did the setup using Wizards: Configure Storage
> > > > > > Devices, Configure Volumes, Configure Backup Catalog, and Create
> > > > > > a simple Backup Policy "test1", which only backs up the server's
> > > > > > /etc/ directory (the server is localhost) to volume pool
> > > > > > "Test1". We assigned two tapes (A00002 and A00003) to volume
> > > > > > pool "Test1". Every time we run the policy manually, it fails
> > > > > > with the following detailed status:
> > > > > > <..cut..>
> > > > > > It looks like the drive wants to write to the tape, but the tape
> > > > > > is suddenly ejected a few seconds later and another tape is
> > > > > > mounted, until all media in the pool are consumed. This happens
> > > > > > every time. FYI the /etc/ directory's size is only 7.5 MB. We
> > > > > > are using LTO-2 200/400GB tapes. We haven't done any backup
> > > > > > using the L100 library before.
> > > > > >
> > > > > > Here's the output of "tpconfig -dl":
> > > > > > # /opt/openv/volmgr/bin/tpconfig -dl
> > > > > > Currently defined drives and robots are:
> > > > > > <..cut..>
> > > > > > I have upgraded the software from NetBackup DataCenter 4.5FP_3GA
> > > > > > to 4.5FP_6, but still no luck. Can anyone help me out? Thank you,
> > > > > > any help would be very much appreciated.
> > > > > >
> > > > > > Best regards,
> > > > > > Kevin Haritmonds