Subject: [Veritas-bu] Serious master issue...
From: hampus.lind at rps.police.se (Hampus Lind)
Date: Wed, 14 Feb 2007 23:24:38 +0100
This is the help I am getting from Symantec... hang tight, next mail is soon
to arrive...


Hampus Lind
Rikspolisstyrelsen
National Police Board
Tel dir: +46 (0)8 - 401 99 43
Tel mob: +46 (0)70 - 217 92 66
E-mail: hampus.lind at rps.police.se


-----Original Message-----
 
> You haven't really answered anything, just talked about how things should
> work when everything is OK.


That's just it; NetBackup is not doing anything abnormal here (i.e., it's
operating as designed given the environment it's running in).  There's
nothing we can do at the software level to "fix" performance bottlenecks at
the filesystem level; it'll operate as quickly as the system calls allow it
to.  Every problem you brought up can be traced back to this one core issue.

Think of it this way:  if you fill your gas tank with the wrong type of
petrol and as a result the vehicle starts sputtering, when you bring it to
the mechanic, they'll make the assessment that the engine is working as well
as it can given the circumstances.  The problem is the petrol, not the
engine.

> 1. I could have a problem with my db, but if the bpdbm -consistency 2
> check won't finish, who can I tell? If the bpdbm -consistency 2 check
> hangs, again how can I tell what's wrong?

Like I said previously, bpdbm -consistency is the tool... we don't have
"alternate" tools or anything like that (what's the point in re-inventing
the wheel?).  Your only other option is to manually check each and every
image for oddities.  And even then there's no guarantee you'll spot the
corruption if it exists, because over half of your image files are going to
be in binary format, which is impossible to examine by eye.

> 2. Maybe I haven't been clear with my problem. The bpdbm processes don't
> go away, they are always there and are always working with something... So
> how can I move on?

I didn't find any evidence that bpdbm was caught in an infinite loop.  All
PIDs are making progress with their respective tasks, albeit slow progress.
I even checked to make sure the bpdbm processes weren't stepping on each
other's toes.  They're all doing separate tasks independently of each other,
and no process was performing a redundant task that another bpdbm was
processing.  All evidence points to file-read operations taking a lot of
time to complete, and that's not a problem that can be fixed by the
application.
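
If you want to see this for yourself, here's a rough sketch (default paths
assumed; the debug log only exists if the legacy log directory has been
created):

  # Accumulated CPU time should keep climbing if the PIDs are working
  # rather than spinning on one request -- compare a few minutes apart:
  ps -ef | grep '[b]pdbm'

  # The bpdbm debug log should keep growing as well:
  ls -l /usr/openv/netbackup/logs/bpdbm/

  # And the disk holding the catalog should show where the time goes; high
  # avwait/avserv on that volume points back at the filesystem/IO layer:
  sar -d 5 5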

Let's say for the sake of argument we could change NetBackup's behavior so
that it doesn't spawn so many processes at once (which isn't actually
possible, but let's just assume for a second).  Will that solve the problem?
No.  It will still have to perform the same number of operations because it
still has to go through the same data set as in the present situation.  In
fact, the change might make things *worse*, not better, because the entire
operation would take longer.

Disabling the image cleanup entirely is not possible in NetBackup without
shutting down bpdbm altogether (and it would be a bad idea anyway, as the
image cleanup process is vital for the application to function), which means
of course that just about nothing would work in NetBackup.  You will get no
backups, and definitely no restores.

So in summary, you have to wait until it finishes on its own.  If the
process takes more than 12 hours to complete, that means you're really
stuck.  Absolutely nothing can be done at the software level until something
is done about the images database or the filesystem it resides on.

> 4. Our db is about 60-65 GB, there are netbackup customers with much
> bigger nbu databases. And this should be an enterprise solution and
> therefore be able to handle this payload.

Not many customers have as many individual images.  Keep in mind here that
there's more to this than "how much data am I backing up".  If the bulk of
your backups are Oracle RMAN, then the number of inodes in your environment
increases dramatically.  I can almost always tell the difference between RMAN
backups and regular backups when looking at the images database just by
looking at the number of streams being generated at one go.  The difference
is not insignificant.
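
If you want to quantify that on your master, counting images says far more
than the raw GB figure.  A rough sketch, assuming the default catalog
location:

  # Number of files-list (.f) entries on disk in the image catalog:
  find /usr/openv/netbackup/db/images -type f -name '*.f' | wc -l

  # Images written in the last 24 hours, straight from the catalog (each
  # image is one "IMAGE" line in the raw bpimagelist output):
  /usr/openv/netbackup/bin/admincmd/bpimagelist -hoursago 24 | grep -c '^IMAGE'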

Since images databases are unique to each and every customer (no two images
databases are the same in a production environment), I can't give you the
cookie-cutter solution that I am certain you would like to have.  These
sorts of things have to be analyzed on a case-by-case basis, and even
Enterprise solutions are limited by the environment they are running in.
You could own the nicest, most expensive BMW in the world, but if you don't
have a road to drive it on, it probably won't work as well as you'd like.

> 5. I have followed HP's suggestions:
> - I have patched the OS

Recently?

> - I have run defrag on that filesystem

That's not a bad idea, but that usually has a minimal effect with modern-day
UNIX operating systems, including HP-UX, because the filesystem driver does
that on the fly during normal operation anyway.

> - I have increased scsi_queue depth

That will prevent SCSI write failures, but won't necessarily make things run
faster.  It's like standing in line at the bank: you're not going to move any
faster just because the line is longer; you're just adding more people to the
line.

I've done my best to find a problem that we can address at the software
level.  I can't find anything to negate HP's recommendation, and if I could
I'd have relayed that information.  At this point I'm not sure what else to
tell you, as the logs aren't changing their story.  The bpdbm process is
doing what it was designed to do, but something external is throttling it
back.  That's where the problem is, and that's why I'm suggesting you follow
HP's solution.

Outside of that, we'll have to look into a consulting solution to
re-architect this environment to distribute the images database somewhat.
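
One approach that often comes up (validate it with support for your release
before doing it in production) is to move the busiest clients' image
directories onto another filesystem and symlink them back into place, with
NetBackup stopped.  Client name and target filesystem below are just
placeholders:

  cd /usr/openv/netbackup/db/images
  mv some_busy_client /catalog2/images/some_busy_client
  ln -s /catalog2/images/some_busy_client some_busy_client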