WOW… good info. I should be a little clearer: my media servers (we have 7) have Gig fiber runs direct to the backup network core, but the majority of my clients are 100 Mb due to old switches. So the speed from media server to tape is Gig, SAN attached. We have been running in the neighborhood of 3000–7000 KB/sec, which is pretty good. Yes, we have some clients running between 900–1000 KB/sec, and VMs that share physical NICs run around 1000 KB/sec. So all in all things are working well, yet backups on some of our QA/Dev environments are running into the day.
And yes, we did need to perform a recovery exercise. It was basically a disaster recovery: our SAN crashed due to a power outage, and a faulty UPS subsequently corrupted a few TB of SAN data. The recovery effort was slow since we may have had 10–20 jobs on the same tape, about 150 servers over three 12-hour days. Yeah, painful, very painful. We are looking to move that 20 number down and still keep the backups in our window. A tedious process, but it is being worked on.
Dan
Dan
From: Rosenkoetter,
Gabriel [mailto:Gabriel.Rosenkoetter AT radian DOT biz]
Sent: Friday, November 02, 2007 10:53 AM
To: veritas-bu AT mailman.eng.auburn DOT edu
Subject: Re: [Veritas-bu] General questions for everyone
Wow,
you have a lot of problems there. I'm picking the three big ones.
First,
you don't mention how many media servers you have, but you do mention your
network interface speed as 100 Mb/s. 100 Mb/s is roughly 8 MB/s (being
generous). That means that in order to feed your 20 LTO-3s with even the
minimum 10 MB/s they need to keep from backhitching, you would need to have 25
media servers... but you can't write to the same drive with more than one media
server, so it is literally impossible for you to supply the mininum input speed
to actually spin your drives without shoe-shining. In point of fact, if you
really only have 100 Mb inputs into your media servers you can NOT drive an
LTO-3 with any one of your media servers without causing it to backhitch. You
can't get data to it fast enough. Yes, this is a huge problem. Invest in
gigabit Ethernet or start doing everything with BCVs/snapshots exported to
your media servers.
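The arithmetic above can be sanity-checked in a few lines. This is a back-of-the-envelope sketch using only the figures from the email (8 MB/s of usable throughput per 100 Mb/s link, ~10 MB/s streaming floor per LTO-3); actual drive thresholds vary by model and compression ratio.

```python
# Figures taken from the email above, not drive-spec gospel:
link_mb_per_s = 8.0          # usable MB/s per 100 Mb/s media-server input
drive_floor_mb_per_s = 10.0  # minimum feed rate to keep one LTO-3 streaming
drives = 20

# Aggregate input needed to keep every drive streaming:
needed = drives * drive_floor_mb_per_s        # 200 MB/s
servers_needed = needed / link_mb_per_s       # 25 media servers

print(f"aggregate feed required: {needed:.0f} MB/s")
print(f"media servers required at 8 MB/s each: {servers_needed:.0f}")

# The real catch: only one media server can write to a given drive, so
# each server must individually sustain >= 10 MB/s -- which an 8 MB/s
# input can never do. Every drive backhitches regardless of server count.
print("one server can stream one drive:",
      link_mb_per_s >= drive_floor_mb_per_s)  # False
```

The point of the last line is that the problem isn't fixable by adding media servers: the per-server input link is below the per-drive floor, so it's a per-link bottleneck, not an aggregate one.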
Second,
have you performed any recovery tests since you bumped your MPX up to that
astronomical 20? You should. In general, recovery becomes outrageously painful
if not impossible when you stray above 4, or that's the standard advice anyway.
It's been a while since I checked, so if you can manage to pull a restore
successfully and meet your RTO with a 20 MPX, then more power to you, but test
it.
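To see why high MPX makes restores painful, here is a deliberately crude model (not a NetBackup formula): with N streams interleaved on one tape, restoring a single client's image means the drive has to read or skip past blocks from all N streams, so the tape data passed under the head grows roughly with N.

```python
# Crude illustration of multiplexed-restore cost, assuming N streams of
# similar size interleaved evenly on one tape. Purely a model, not NBU math.

def restore_read_volume_gb(client_image_gb, mpx):
    """Approximate GB the drive must move past the head to recover one
    client's image from a tape multiplexed at the given MPX level."""
    return client_image_gb * mpx

# A 100 GB client image at MPX 1, 4, and 20:
for mpx in (1, 4, 20):
    print(f"MPX {mpx:2d}: ~{restore_read_volume_gb(100, mpx)} GB read")
```

At MPX 20 the drive churns through roughly twenty times the data it actually delivers, which matches the multi-day restore experience described in the first email.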
Third,
although the standard advice is to "just trust NetBackup" and let it
leave things in queue if it needs to (there are a variety of legit reasons it
might be doing that, like jobs per policy or number of streams available across
all drives), I've found that trusting bpsched not to have a mental breakdown
when trying to enqueue that many jobs at the same time is not really a great plan.
Spreading your start times out a bit, so that bpsched can make its way through
initiating all the streams, is my preferred method. (In your case, you'd
probably want to kick jobs off in batches of 100 clients every twenty minutes
or so starting at 17:00, modulo special-case clients. You don't really have to
care too much about balancing volume of data between those policies, provided
they're all going into the same pool with the same retention daily.) If letting
NBU take care of it is working for you, great. (No, staggering won't help the
things you describe, though it also won't hinder them, but it'll keep the
memory usage on the scheduler sane and there have definitely been scaling bugs
with bpsched in the past... I've forgotten at precisely which 5.1 MP, but it
was not a lot of fun when it happened to three different 2000-client
/ 8 media server environments I cared about at the time.)
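The staggering suggestion above (batches of 100 clients every twenty minutes starting at 17:00) is easy to sketch. This is an illustrative schedule generator only; the batch size, interval, and start time are the suggested values from the email, not anything NetBackup-specific.

```python
from datetime import datetime, timedelta

def stagger_schedule(num_clients, batch_size=100, interval_min=20,
                     first_start="17:00"):
    """Return (start_time, (lo, hi)) pairs: each batch's kickoff time and
    the half-open range of client indices it covers."""
    start = datetime.strptime(first_start, "%H:%M")
    schedule = []
    for i, lo in enumerate(range(0, num_clients, batch_size)):
        hi = min(lo + batch_size, num_clients)
        when = (start + timedelta(minutes=i * interval_min)).strftime("%H:%M")
        schedule.append((when, (lo, hi)))
    return schedule

# The ~900-client environment from the original question:
for when, (lo, hi) in stagger_schedule(900):
    print(f"{when}: clients {lo + 1}-{hi}")
```

For 900 clients this yields nine batches, the last starting at 19:40, so the whole kickoff wave finishes well inside the 17:00–08:00 window while giving the scheduler time to initiate each wave of streams.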
--
gabriel rosenkoetter
Radian Group Inc, Unix/Linux/VMware Sysadmin / Backup & Recovery
gabriel.rosenkoetter AT radian DOT biz, 215 231 1556
From: Cruice, Daniel (US - Glen Mills)
[mailto:dcruice AT deloitte DOT com]
Sent: Thursday, November 01, 2007 5:14 PM
To: veritas-bu AT mailman.eng.auburn DOT edu
Subject: [Veritas-bu] General questions for everyone
Say you have over 900 clients to back up from 5:00pm–8:00am, with 20 LTO-3
tape drives in a library. 99% of the environment is Windows, including my media
servers / master node, and I am running multiplexing (20) in some cases. Right
now 90% of all my jobs kick off at 5:00 on the dot. Many of my jobs will sit in
a Queued status for 15–20 minutes at kick-off, while the active jobs increment
every few seconds. I understand I'll have jobs queued once multiplexing hits
the threshold for the number of jobs per tape, or if all my tape drives are in
use. But I was wondering whether staggering my start times would help load the
tapes / start writing to tape any quicker, or simply let jobs go active sooner.
Unfortunately I am running on a 100 Mb network, though it is segregated from my
production network.
Suggestions?
Thanks
Dan Cruice