[Veritas-bu] FW: media write errors - 84

2001-01-10 17:36:32
Subject: [Veritas-bu] FW: media write errors - 84
From: Joshua Fielden jfielden AT excitecorp DOT com
Date: Wed, 10 Jan 2001 14:36:32 -0800
I see them with ADIC jb's with Q DLT7k's, Solaris 2.6 and 8 (one of each until 
upgrade completed) media servers (4500's). Legato software, the week before, 
didn't give these errors at all, so I know it's not the hardware. :) This has 
been happening for ~4 months, and we have isolated a few sets of factors that 
make it more likely this will happen. What really complicates the mix is we 
also get 'timeout waiting for media manager to mount volume' errors. The 
factors we notice that aggravate this are:

1) Legato tapes are not overwritten when introduced into the jukebox. Maybe you 
have tapes with a format NBU doesn't like? Because adding media to the database 
and labelling them are separate steps, there can be tapes that MM thinks are 
valid but that error out with a different label header. We're still trying to 
figure out an elegant way to label 300-500 tapes at a time, in a reasonable 
time-frame, upon introduction to the JB. (We have 14k tapes, so this will go on 
for a while.) I'm actually contemplating requiring all tapes to be 
degaussed before re-introduction to avoid this, and throwing tapes away based 
on time, not mounts. I am eagerly awaiting the day Veritas decides to make 
their backup solution one product, and not a media product and a catalog 
product that happen to ship on the same CD.

2) System load on the media server seems to make this more likely. The media 
server that runs a load of 18 through the night (on 10 procs) is more likely to 
fail than the one that runs a load of 2 on 8 procs.

3) Backup load also makes it more likely. Once we start to push 80-100 
MB/sec through a media server, it's more likely to choke. We've done the 
whole shared memory/buffers tuning, and these boxen have 5G of RAM each, so 
memory/buffers is not the overall issue. We are bringing on-line a media server 
that will have 12G, so we will test whether memory makes a difference. (I 
love re-purposing over-powered hardware ;-) )
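
To compare the two media servers in factor 2 on equal footing, the load 
average can be divided by the processor count. A minimal Python sketch using 
only the figures quoted above (the function name is illustrative):

```python
def load_per_cpu(load_average, processors):
    """Return the load average normalized by processor count."""
    if processors <= 0:
        raise ValueError("processors must be positive")
    return load_average / processors

busy = load_per_cpu(18, 10)   # the server that fails more often
quiet = load_per_cpu(2, 8)    # the server that fails less often
print(f"busy: {busy:.2f} per CPU, quiet: {quiet:.2f} per CPU")
```

That works out to 1.80 versus 0.25 runnable processes per CPU, so the busier 
box is carrying roughly seven times the per-processor load.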

The other factor that may be adding to this is the fact that we are in an SSO 
config, where the media servers are each scanhost for a JB, and the master only 
holds the voldb. This will be rectified soon, and we'll see if it alleviates 
some of these problems.
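
For anyone trying to tell a drive problem from a tape problem, tallying the 
errors by drive and by tape can help. A rough Python sketch, run against the 
log from Kathy's November post quoted below (the regular expression and the 
tally() helper are illustrative, not anything NetBackup provides):

```python
import re
from collections import Counter

# Error log entries copied from the November post quoted below.
LOG = """\
11/11 20:33 DOA763  drive index 5  cannot write image to media id...
11/12 09:33 DOA871  drive index 5  cannot write image to media id...
11/13 07:16 DOA885  drive index 5  cannot write image to media id...
11/13 20:57 DOA656  drive index 5  cannot write image to media id...
11/14 07:19 DOA885  drive index 5  cannot write image to media id...
11/14 20:35 DOA860  drive index 2  cannot write image to media id...
11/16 06:47 DOA651  drive index 2  cannot write image to media id...
11/16 07:37 DOA651  drive index 5  cannot write image to media id...
11/16 18:35 DOA651  drive index 5  cannot write image to media id...
11/16 23:22 DOA878  drive index 2  cannot write image to media id...
11/17 18:37 DOA860  drive index 2  cannot write image to media id...
11/19 19:45 DOA773  drive index 2  cannot write image to media id...
11/20 02:34 DOA773  drive index 2  cannot write image to media id...
11/20 06:58 DOA763  drive index 2  cannot write image to media id...
"""

# Assumed line layout: date, time, media id, then "drive index N".
LINE_RE = re.compile(r"^\d\d/\d\d \d\d:\d\d (\S+)\s+drive index (\d+)")

def tally(log_text):
    """Count errors per drive index and per media id."""
    by_drive = Counter()
    by_media = Counter()
    for line in log_text.splitlines():
        match = LINE_RE.match(line.strip())
        if not match:
            continue  # skip notes such as "replaced drive index 5"
        media_id, drive = match.groups()
        by_drive[int(drive)] += 1
        by_media[media_id] += 1
    return by_drive, by_media

by_drive, by_media = tally(LOG)
print(by_drive.most_common())
print(by_media.most_common())
```

On those fourteen entries the errors split evenly, seven on drive index 2 and 
seven on drive index 5, and DOA651 is the only tape that failed three times 
(on both drives), which fits "sometimes a drive problem, sometimes a tape 
problem."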

JF


On Wed, Jan 10, 2001 at 01:26:09PM -0800, KevinB AT paccessglobal DOT com spake 
unto the multitudes:
> 
> I see these periodically on a Qualstar 6430 with 2 DLT 7000's.  Twice I have
> had to replace the drive and the errors stop for a while.  Since this is
> common across vendors of Robots I would suspect that either it is a NB issue
> (though the sense key information that I am seeing in the logs is passed
> from the drive) or the drives themselves.  The messages below did not
> indicate the drive type, are they all DLT 7000s?
> 
> -----Original Message-----
> From: Keahey, Ricky L [mailto:ricky.l.keahey AT intel DOT com]
> Sent: Wednesday, January 10, 2001 8:20 AM
> To: 'Collins, Kathy'; 'veritas-bu AT mailman.eng.auburn DOT edu'
> Subject: RE: [Veritas-bu] FW: media write errors - 84
> 
> 
> Kathy,
> 
> We have an environment here that is giving us 84, 85, and 86 errors
> consistently.  We have involved ATL and Veritas and both say that it is the
> other product causing the problem.  I am very frustrated with this so if you
> get any fixes or if any one else on this list has seen this problem, I would
> appreciate you sending mail to us to help us figure this out.  I don't have
> much hair on my head left, but by the looks of things, I won't have any by
> the time this problem is fixed.
> 
> Thanks,
> Rick
> 
> -----Original Message-----
> From: Collins, Kathy [mailto:KCollins AT coral-energy DOT com]
> Sent: Wednesday, January 10, 2001 8:02 AM
> To: 'veritas-bu AT mailman.eng.auburn DOT edu'
> Subject: [Veritas-bu] FW: media write errors - 84
> 
> 
> This is a status update to a message I posted back in November.  I have been
> attempting to resolve our media write errors in NetBackup.  Our problem of
> getting these errors once or twice a night eventually only occurred on one
> drive, which we had already replaced once.  We had the drive replaced a second
> time, and still saw the errors just on that drive.  Then we switched this drive
> with another to see if the problem followed the drive or stayed with the
> location.  It followed the drive.  We had the drive replaced a third time about
> a week ago and haven't seen the write errors on the drive since.
> 
> The next day we got two write errors on two other drives, drives that have
> never had this error before.  We have also seen a few ioctl (MTWEOF) and
> (MTWFSF) errors on the drive that we replaced.  These errors freeze the tape
> immediately.  I'm not convinced that these tapes are bad.
> 
> I had a few replies from people on the list with similar problems, both
> stating that the problem has never gone away, no matter what they tried.  Both
> replaced drives several times.
> 
> We have the same version of NetBackup installed on an Ultra 2 connected to an
> L3500 with none of the above problems.
> 
> Does anyone have any other suggestions on what I can try to stop these errors
> from occurring?  Are there lots of you having the same problem?  Or is it just
> the three of us?
> 
> Thanks,
> Kathy
> 
> >  -----Original Message-----
> > From:       Collins, Kathy  
> > Sent:       Monday, November 20, 2000 4:05 PM
> > To: 'veritas-bu AT mailman.eng.auburn DOT edu'
> > Subject:    media write errors - 84
> > 
> > Hi,
> > 
> > I'm using NetBackup 3.2 with Solaris 2.6 on an E450 connected to an L11000.
> > I upgraded from NetBackup 3.2 patch 328 to patch 363 on November 3rd.  About
> > a week later, our media errors came along much more frequently.  Although the
> > errors appear the same as previous media errors in the messages log, the
> > wording is different in the Problems report of NetBackup.  Here is a log I've
> > been keeping with the stats on the errors.  Sometimes it looks like a drive
> > problem, sometimes like a tape problem.  These tapes all have only 15 to 20
> > mounts.  The actual error reads "cannot write image to media id DOA763, drive
> > index 2, I/O error", whereas previous media errors read "write error on media
> > id...".  Both show up with a status code of 84.
> > 
> > 11/11 20:33 DOA763  drive index 5  cannot write image to media id...
> > 11/12 09:33 DOA871  drive index 5  cannot write image to media id...
> > 11/13 07:16 DOA885  drive index 5  cannot write image to media id...
> > -
> > replaced drive index 5 (/dev/rmt/5)
> > -
> > 11/13 20:57 DOA656  drive index 5  cannot write image to media id...
> > 11/14 07:19 DOA885  drive index 5  cannot write image to media id...
> > 11/14 20:35 DOA860  drive index 2  cannot write image to media id...
> > 11/16 06:47 DOA651  drive index 2  cannot write image to media id...
> > 11/16 07:37 DOA651  drive index 5  cannot write image to media id...
> > 11/16 18:35 DOA651  drive index 5  cannot write image to media id...
> > 11/16 23:22 DOA878  drive index 2  cannot write image to media id...
> > 11/17 18:37 DOA860  drive index 2  cannot write image to media id...
> > 11/19 19:45 DOA773  drive index 2  cannot write image to media id...
> > 11/20 02:34 DOA773  drive index 2  cannot write image to media id...
> > 11/20 06:58 DOA763  drive index 2  cannot write image to media id...
> > 
> > I'm having drive index 2 replaced tomorrow, although it didn't stop
> > the errors in drive index 5.  Does anyone know of this problem and if
> > it could be related to patch 363?  Any recommendations on what patch
> > I should jump to assuming that it may be related to the patch?  If I
> > go directly to 3.4, will I be able to restore my data that was backed
> > up with 3.2?
> > 
> > Thanks for any feedback.
> > Regards,
> > Kathy
> > 
> > 
> > Kathy Collins
> > Coral Energy, L.L.P
> > Phone: 713.230.3426
> > kcollins AT coral-energy DOT com
> > 
> _______________________________________________
> Veritas-bu maillist  -  Veritas-bu AT mailman.eng.auburn DOT edu
> http://mailman.eng.auburn.edu/mailman/listinfo/veritas-bu
> 

-- 
"they're like an 800 pound gorilla coming toward you.  you have to respect the 
gorilla and what it does, but you don't have to give it whatever it asks for."
Joshua Fielden, Senior Systems Administrator and Backups Team Lead
eXcite@Home, Inc. jfielden AT excitecorp DOT com 650-556-3316


