WOW… good info. I should be a little clearer: my media servers (we have 7) have Gig fiber runs direct to the backup network core, but the majority of my clients are 100 Mb due to old switches. So the speed from media server to tape is Gig, SAN attached. We have been running in the neighborhood of 3000–7000 KB/sec, which is pretty good. Yes, we have some clients running between 900–1000 KB/sec, and VMs that share physical NICs run around 1000 KB/sec. So all in all things are working well, yet backups on some of our QA/Dev environments are running into the day.
And yes, we did need to perform a recovery exercise. It was basically a disaster recovery: our SAN crashed due to a power outage, and a faulty UPS subsequently corrupted a few TB of SAN data. The recovery effort was slow since we may have had 10–20 jobs on the same tape, about 150 servers over three 12-hour days. Yeah, painful, very painful. We are looking to move that 20 number down and still keep the backups in our window. A tedious process, but it is being worked on.
Dan
Dan
From: Rosenkoetter,
Gabriel [mailto:Gabriel.Rosenkoetter AT radian DOT biz]
Sent: Friday, November 02, 2007 10:53 AM
To: veritas-bu AT mailman.eng.auburn DOT edu
Subject: Re: [Veritas-bu] General questions for everyone
Wow,
you have a lot of problems there. I'm picking the three big ones.
First,
you don't mention how many media servers you have, but you do mention your
network interface speed as 100 Mb/s. 100 Mb/s is roughly 8 MB/s (being
generous). That means that in order to feed your 20 LTO-3s with even the
minimum 10 MB/s they need to keep from backhitching, you would need to have 25
media servers... but you can't write to the same drive with more than one media
server, so it is literally impossible for you to supply the mininum input speed
to actually spin your drives without shoe-shining. In point of fact, if you
really only have 100 Mb inputs into your media servers you can NOT drive an
LTO-3 with any one of your media servers without causing it to backhitch. You
can't get data to it fast enough. Yes, this is a huge problem. Invest in
gigabit Ethernet or start doing everything with BCVs/snapshots exported to
your media servers.
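The arithmetic above can be sanity-checked in a few lines. This is a back-of-the-envelope sketch using only the figures from the email (8 MB/s of usable throughput per 100 Mb/s link, ~10 MB/s streaming floor per LTO-3); actual drive thresholds vary by model and compression ratio.

```python
# Figures taken from the email above, not drive-spec gospel:
link_mb_per_s = 8.0          # usable MB/s per 100 Mb/s media-server input
drive_floor_mb_per_s = 10.0  # minimum feed rate to keep one LTO-3 streaming
drives = 20

# Aggregate input needed to keep every drive streaming:
needed = drives * drive_floor_mb_per_s        # 200 MB/s
servers_needed = needed / link_mb_per_s       # 25 media servers

print(f"aggregate feed required: {needed:.0f} MB/s")
print(f"media servers required at 8 MB/s each: {servers_needed:.0f}")

# The real catch: only one media server can write to a given drive, so
# each server must individually sustain >= 10 MB/s -- which an 8 MB/s
# input can never do. Every drive backhitches regardless of server count.
print("one server can stream one drive:",
      link_mb_per_s >= drive_floor_mb_per_s)  # False
```

The point of the last line is that the problem isn't fixable by adding media servers: the per-server input link is below the per-drive floor, so it's a per-link bottleneck, not an aggregate one.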
Second,
have you performed any recovery tests since you bumped your MPX up to that
astronomical 20? You should. In general, recovery becomes outrageously painful
if not impossible when you stray above 4, or that's the standard advice anyway.
It's been a while since I checked, so if you can manage to pull a restore
successfully and meet your RTO with a 20 MPX, then more power to you, but test
it.
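To see why high MPX makes restores painful, here is a deliberately crude model (not a NetBackup formula): with N streams interleaved on one tape, restoring a single client's image means the drive has to read or skip past blocks from all N streams, so the tape data passed under the head grows roughly with N.

```python
# Crude illustration of multiplexed-restore cost, assuming N streams of
# similar size interleaved evenly on one tape. Purely a model, not NBU math.

def restore_read_volume_gb(client_image_gb, mpx):
    """Approximate GB the drive must move past the head to recover one
    client's image from a tape multiplexed at the given MPX level."""
    return client_image_gb * mpx

# A 100 GB client image at MPX 1, 4, and 20:
for mpx in (1, 4, 20):
    print(f"MPX {mpx:2d}: ~{restore_read_volume_gb(100, mpx)} GB read")
```

At MPX 20 the drive churns through roughly twenty times the data it actually delivers, which matches the multi-day restore experience described in the first email.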
Third,
although the standard advice is to "just trust NetBackup" and let it
leave things in queue if it needs to (there are a variety of legit reasons it
might be doing that, like jobs per policy or number of streams available across
all drives), I've found that trusting bpsched not to have a mental breakdown
when trying to enqueue that many jobs at the same time is not really a great plan.
Spreading your start times out a bit, so that bpsched can make its way through
initiating all the streams, is my preferred method. (In your case, you'd
probably want to kick jobs off in batches of 100 clients every twenty minutes
or so starting at 17:00, modulo special-case clients. You don't really have to
care too much about balancing volume of data between those policies, provided
they're all going into the same pool with the same retention daily.) If letting
NBU take care of it is working for you, great. (No, staggering won't help the
things you describe, though it also won't hinder them, but it'll keep the
memory usage on the scheduler sane and there have definitely been scaling bugs
with bpsched in the past... I've forgotten at precisely which 5.1 MP, but it
was not a lot of fun when it happened to three different 2000-client
/ 8 media server environments I cared about at the time.)
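The staggering suggestion above (batches of 100 clients every twenty minutes starting at 17:00) is easy to sketch. This is an illustrative schedule generator only; the batch size, interval, and start time are the suggested values from the email, not anything NetBackup-specific.

```python
from datetime import datetime, timedelta

def stagger_schedule(num_clients, batch_size=100, interval_min=20,
                     first_start="17:00"):
    """Return (start_time, (lo, hi)) pairs: each batch's kickoff time and
    the half-open range of client indices it covers."""
    start = datetime.strptime(first_start, "%H:%M")
    schedule = []
    for i, lo in enumerate(range(0, num_clients, batch_size)):
        hi = min(lo + batch_size, num_clients)
        when = (start + timedelta(minutes=i * interval_min)).strftime("%H:%M")
        schedule.append((when, (lo, hi)))
    return schedule

# The ~900-client environment from the original question:
for when, (lo, hi) in stagger_schedule(900):
    print(f"{when}: clients {lo + 1}-{hi}")
```

For 900 clients this yields nine batches, the last starting at 19:40, so the whole kickoff wave finishes well inside the 17:00–08:00 window while giving the scheduler time to initiate each wave of streams.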
--
gabriel rosenkoetter
Radian Group Inc, Unix/Linux/VMware Sysadmin / Backup & Recovery
gabriel.rosenkoetter AT radian DOT biz, 215 231 1556
From: Cruice, Daniel (US - Glen Mills)
[mailto:dcruice AT deloitte DOT com]
Sent: Thursday, November 01, 2007 5:14 PM
To: veritas-bu AT mailman.eng.auburn DOT edu
Subject: [Veritas-bu] General questions for everyone
Say you have over 900 clients to back up from 5:00pm–8:00am, with 20 LTO-3
tape drives in a library. 99% of the environment is Windows, including my media
servers / master node, and I am running multiplexing (20) in some cases. Right
now 90% of all my jobs kick off at 5:00 on the dot. Many of my jobs will sit in
a Queued status for 15–20 minutes at kick-off, while the active jobs increment
every few seconds. I understand I'll have jobs queued once multiplexing hits
the threshold for the number of jobs per tape, or if all my tape drives are in
use. But I was wondering whether staggering my start times would help load the
tapes / start writing to tape any quicker, or simply let jobs go active sooner.
Unfortunately I am running on a 100 Mb network, though it is segregated from my
production network.
Suggestions?
Thanks
Dan Cruice