Veritas-bu

[Veritas-bu] Serious master issue...

Subject: [Veritas-bu] Serious master issue...
From: dpreston at LANDAM.com (Preston, Douglas L)
Date: Wed, 14 Feb 2007 17:46:03 -0500
How much memory is in your master server?  Is the swap usage growing?


Doug Preston
Systems Engineer
Land America Tax and Flood Services
Phone 626-339-5221 Ext 104
Email  dlpreston at landam.com



-----Original Message-----
From: veritas-bu-bounces at mailman.eng.auburn.edu [mailto:veritas-bu-bounces 
at mailman.eng.auburn.edu] On Behalf Of Hampus Lind
Sent: Wednesday, February 14, 2007 2:25 PM
To: Veritas-bu at mailman.eng.auburn.edu
Subject: Re: [Veritas-bu] Serious master issue...

This is the help I am getting from Symantec... hang tight, next mail is soon to 
arrive...


Hampus Lind
Rikspolisstyrelsen
National Police Board
Tel dir: +46 (0)8 - 401 99 43
Tel mob: +46 (0)70 - 217 92 66
E-mail: hampus.lind at rps.police.se


-----Original Message-----
 
> You haven't really answered anything, just talked about how things 
> should work when everything is OK.


That's just it; NetBackup is not doing anything abnormal here (a.k.a. it's 
operating as designed given the environment it's running in).  There's nothing 
we can do at the software level to "fix" performance bottlenecks at the 
filesystem level; it'll operate as quickly as the system calls allow it to.  
Every problem you brought up can be traced back to this one core issue.

Think of it this way:  if you fill your gas tank with the wrong type of petrol 
and as a result the vehicle starts sputtering, when you bring it to the 
mechanic, they'll make the assessment that the engine is working as well as it 
can given the circumstances.  The problem is the petrol, not the engine.

> 1. I could have a problem with my db, but if the bpdbm -consistency 2 check
> won't finish, who can I tell? If the bpdbm -consistency 2 check hangs,
> again, how can I tell what's wrong?

Like I said previously, bpdbm -consistency is the tool... we don't have 
"alternate" tools or anything like that (what's the point in re-inventing the 
wheel?).  Your only other option is to manually check each and every image for 
oddities.  And even then there's no guarantee you'll spot the corruption if it 
exists, because over half of your image files are going to be in binary format, 
which is impossible to examine by eye.
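
To get a rough idea of how much of the images database can be read by eye at all, something like the following could be run against it. This is only a sketch: the "contains a NUL byte" test is a heuristic of mine and not how NetBackup distinguishes its file formats, and the function names are made up for illustration.

```python
import os

def looks_binary(path, sample_size=4096):
    """Heuristic (an assumption): a NUL byte in the first bytes means binary."""
    with open(path, "rb") as fh:
        return b"\x00" in fh.read(sample_size)

def survey(root):
    """Walk root and tally text-like, binary-like, empty and unreadable files."""
    counts = {"text": 0, "binary": 0, "empty": 0, "unreadable": 0}
    for dirpath, _dirs, files in os.walk(root):
        for name in files:
            path = os.path.join(dirpath, name)
            try:
                if os.path.getsize(path) == 0:
                    counts["empty"] += 1
                elif looks_binary(path):
                    counts["binary"] += 1
                else:
                    counts["text"] += 1
            except OSError:
                # File vanished or permission denied while walking.
                counts["unreadable"] += 1
    return counts
```

Calling survey() on the images directory (commonly /usr/openv/netbackup/db/images on UNIX masters, but check your own install) returns the four counters.  It only reads file metadata and the first few KB of each file, so it will not detect corruption; it just shows how many files are left for a manual look.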

> 2. Maybe I haven't been clear with my problem. The bpdbm processes don't go
> away, they are always there and are always working with something... So how
> can I move on?

I didn't find any evidence that bpdbm was caught in an infinite loop.  All 
PIDs are making progress with their respective tasks, albeit slow progress.
I even checked to make sure the bpdbm processes weren't stepping on each 
other's toes.  They're all doing separate tasks independently of each other, 
and no process was performing a redundant task that another bpdbm was 
processing.  All evidence points to file-read operations taking a lot of time 
to complete, and that's not a problem that can be fixed by an application.

Let's say for the sake of argument we could change NetBackup's behavior so that 
it doesn't spawn so many processes at once (which isn't actually possible, but 
let's just assume for a second).  Will that solve the problem?
No.  It will still have to perform the same number of operations because it 
still has to go through the same data set as in the present situation.  In 
fact, the process might be made *worse* not better, because the entire 
operation would in fact take longer.
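
The argument above can be put into a back-of-the-envelope model.  This is a sketch under simplifying assumptions that are mine, not Symantec's: every file read costs the same time, and the storage can serve only a fixed number of reads in parallel.  The function and parameter names are hypothetical.

```python
def cleanup_hours(total_reads, seconds_per_read, workers, max_parallel_reads):
    """Estimate wall-clock hours for a fixed number of file reads.

    Only min(workers, max_parallel_reads) reads proceed at once, so the
    amount of work never shrinks; fewer workers just stretch it out.
    """
    effective = max(1, min(workers, max_parallel_reads))
    return total_reads * seconds_per_read / effective / 3600.0

# Made-up numbers: 2 million reads at 50 ms each, storage serves 4 at a time.
for workers in (1, 4, 8):
    print(workers, "workers:", round(cleanup_hours(2_000_000, 0.05, workers, 4), 1), "hours")
```

In this model the total number of reads is the same in every case.  Dropping to one worker makes the run take longer, and going beyond what the storage can serve in parallel buys nothing.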

Disabling the image cleanup entirely is not possible under NetBackup without 
shutting down bpdbm (and it would be a bad idea anyway, as the image cleanup 
process is vital for the application to function), which of course means that 
just about nothing would work under NetBackup.  You will get no backups, and 
definitely no restores.

So in summary, you have to wait until it finishes on its own.  If the process 
takes more than 12 hours to complete, that means you're really stuck.  
Absolutely nothing can be done at the software level until something is done 
about the images database or the filesystem it resides on.

> 4. Our db is about 60-65 GB, there are netbackup customers with much bigger
> nbu databases. And this should be an enterprise solution and therefore
> be able to handle this payload.

Not many customers have as many individual images.  Keep in mind here that 
there's more to this than "how much data am I backing up".  If the bulk of your 
backups are Oracle RMAN, then the number of inodes in your environment 
increases dramatically.  I can almost always tell the difference between RMAN 
backups and regular backups when looking at the images database just by looking 
at the number of streams being generated in one go.  The difference is not 
insignificant.
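
To compare the number of files against the raw size of the database, a small script along these lines could be used.  It is a minimal sketch; tree_stats is a name I made up, and the path you pass in is whatever your images directory is (commonly /usr/openv/netbackup/db/images, which is an assumption about your install).

```python
import os

def tree_stats(root):
    """Return (entry_count, total_bytes) for all non-directory entries under root."""
    entry_count = 0
    total_bytes = 0
    for dirpath, _dirs, files in os.walk(root):
        for name in files:
            try:
                st = os.lstat(os.path.join(dirpath, name))
            except OSError:
                continue  # entry disappeared or is not accessible
            entry_count += 1
            total_bytes += st.st_size
    return entry_count, total_bytes
```

Two databases of the same size in GB can return very different entry counts, and the entry count is what drives the number of filesystem operations a cleanup or consistency check has to perform.  Note that the walk itself touches every inode once, so on a slow filesystem it will take a while too.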

Since images databases are unique to each and every customer (no two images 
databases are the same in a production environment), I can't give you the 
cookie-cutter solution that I am certain you would like to have.  These sorts 
of things have to be analyzed on a case-by-case basis, and even Enterprise 
solutions are limited by the environment they are running in.
You could own the nicest, most expensive BMW in the world, but if you don't 
have a road to drive it on, it probably won't work as well as you'd like.

> 5. I have followed HP's suggestions:
> - I have patched the OS

Recently?

> - I have run defrag on that filesystem

That's not a bad idea, but that usually has a minimal effect with modern-day 
UNIX operating systems, including HP, because the filesystem driver does that 
on the fly during normal operation anyways.

> - I have increased scsi_queue depth

That will prevent SCSI write failures, but won't necessarily make things run 
faster.  It's like standing in line at the bank: you're not going to go any 
faster if the line is longer; you're just adding more people to the line.
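
The bank-line picture matches Little's law (average wait = requests in the system / completion rate).  The following is a sketch with made-up numbers; it assumes the device is already saturated, so its completion rate stays the same whatever the queue depth.

```python
def average_wait_seconds(outstanding_requests, completions_per_second):
    """Little's law rearranged: W = L / X."""
    if completions_per_second <= 0:
        raise ValueError("completions_per_second must be positive")
    return outstanding_requests / completions_per_second

# Hypothetical device completing 200 I/Os per second at either queue depth.
for depth in (8, 32):
    print("queue depth", depth, "->", average_wait_seconds(depth, 200), "s per request")
```

With the completion rate fixed, a queue four times deeper means each request waits four times longer and the total work finishes no sooner.  A deeper queue only helps if the device can actually complete more I/Os per second when given more outstanding requests, which this model does not cover.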

I've done my best to find a problem that we can address at the software level.  
I can't find anything to negate HP's recommendation, and if I could I'd have 
relayed that information.  At this point I'm not sure what else to tell you, as 
the logs aren't changing their story.  The bpdbm process is doing what it was 
designed to do, but something external is throttling it back.  That's where the 
problem is and hence that's why I'm suggesting you follow HP's solution.

Outside of that we'll have to look into a consulting solution to re-architect 
this environment to distribute the images database somewhat.

_______________________________________________
Veritas-bu maillist  -  Veritas-bu at mailman.eng.auburn.edu 
http://mailman.eng.auburn.edu/mailman/listinfo/veritas-bu