I think we’re crossing wires a bit here. I agree with your
viewpoint if you are talking about production storage. All of my most important
production data is sitting on expensive, high-speed, Fibre Channel, fully redundant,
highly fault-tolerant, replicated-to-DR SAN drives. I think the maximum size
we use for these RAID groups is 8 disks. I’m not sure what our rebuild
times are on these arrays, but I certainly agree with your point about bigger
arrays taking longer to rebuild. But I was not talking about production.
My backup DSSUs are all 15-disk RAID5 (actually, the MD1220s I
put in last week are 24-disk RAID6), and my point is that they are much less critical
than production. I agree with you that there is risk in making such big RAID5
arrays, but my point is that the risk is mitigated. Think of all the things that
would have to go wrong to make such a failure critical.
1) Whatever redundancy and fault tolerance we have built into the production
storage must fail.
2) The DR copy must be corrupt / bad / down / unreadable.
3) The backup data must be so fresh, it had not been written to
tape yet.
4) The data must be so critical, that a previous full + existing incrementals
(on tape) are worthless.
5) The needed backup data must reside on the DSSU that fails.
6) Two disks must fail on that DSSU.
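To put a rough shape on the list above: if the six conditions were independent, the chance of all of them happening together would be the product of their individual chances. The sketch below only illustrates that arithmetic. Every probability in it is a made-up placeholder, not a measurement from this environment, and the independence assumption is optimistic, since things like admin error can trigger several conditions at once.

```python
from math import prod

# Hypothetical per-condition probabilities, invented purely for illustration.
# None of these figures come from this thread or from real failure data.
conditions = {
    "production redundancy fails": 0.01,
    "DR copy unusable": 0.05,
    "backup not yet written to tape": 0.10,
    "previous full + incrementals worthless": 0.10,
    "needed data lives on the failed DSSU": 0.20,
    "two disks fail on that DSSU": 0.02,
}


def joint_probability(probabilities):
    """Probability that every event occurs, assuming the events are independent."""
    return prod(probabilities)


p_critical = joint_probability(conditions.values())
print(f"Chance that all six conditions line up: {p_critical:.2e}")
```

Swap in your own estimates; the point is only that each extra condition multiplies the combined figure down, as long as the conditions really are unrelated.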
That is a lot of bad juju (or admin incompetence) that must all
happen at once to make a DSSU failure critical, and around here we like to call
that acceptable risk. In my case it comes down to $$$$. Sure, I could create
some ridiculously fast backup performance and replicate deduped data to DR. I
would be using FC or SAS disks, RAID10, and dedupe appliances all the way. It
would be awesome, but also expensive and uncalled for (from the business’s
perspective).
Your requirements may vary. But, I don’t think it’s
appropriate to say “Don’t ever do this”, because it works
great here. Perhaps, “Don’t ever do this in production” but I
hesitate even to say that. How about “carefully consider the risks,
opportunities, strengths and weaknesses of any proposed storage solution before
purchasing and implementing”?
-Jonathan
From: Lightner, Jeff
[mailto:jlightner AT water DOT com]
Sent: Tuesday, June 29, 2010 3:54 PM
To: Martin, Jonathan; veritas-bu AT mailman.eng.auburn DOT edu
Subject: RE: [Veritas-bu] Destaging going slow
“Gotten away with” I’m sure meant
“hasn’t been bitten by”, not “has evaded authorities”,
in this context.
Monitoring systems is great and certainly can help prevent the
situation many find themselves in, where they aren’t monitoring, lose a
disk without realizing it, then lose another later and go crashing to the floor.
However, many of us have run into the scenario where we
ARE monitoring and know exactly when the first drive failed, and/or have a hot
spare that automatically starts rebuilding the moment it does fail, but despite
being that proactive have had another drive fail while the rebuild was in
progress and thereby lost the entire RAID5 set. RAID5 is certainly
better than JBOD because it does provide some redundancy, but in very large
arrays it makes sense to try to use a better RAID level OR to split it into
multiple RAID5 sets to minimize how much is lost when this happens. The
more disks you have in a single RAID5 set, the more likely it is you’re
going to experience such a double disk failure at some point.
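A back-of-the-envelope way to see why set size matters: once one disk has failed, a RAID5 set is lost if any of the remaining n-1 disks fails before the rebuild completes. The sketch below assumes independent disks with exponentially distributed lifetimes. The rebuild time and MTBF values are placeholders chosen for illustration, not vendor or field figures, and the model ignores unrecoverable read errors and correlated failures (same batch, same enclosure), both of which make real arrays worse than this.

```python
import math


def p_disk_fails(window_hours, mtbf_hours):
    """Chance a single disk fails within the window (exponential lifetime model)."""
    return 1.0 - math.exp(-window_hours / mtbf_hours)


def p_raid5_loss_during_rebuild(n_disks, rebuild_hours, mtbf_hours):
    """Chance that at least one of the n-1 surviving disks fails before the rebuild ends."""
    p = p_disk_fails(rebuild_hours, mtbf_hours)
    return 1.0 - (1.0 - p) ** (n_disks - 1)


# Assumed values for illustration only.
REBUILD_HOURS = 24.0
MTBF_HOURS = 500_000.0

for n in (8, 15, 24):
    risk = p_raid5_loss_during_rebuild(n, REBUILD_HOURS, MTBF_HOURS)
    print(f"{n:2d}-disk RAID5: {risk:.6f} chance of a second failure during rebuild")
```

Holding the rebuild time fixed is generous to the big sets, since larger disks generally take longer to rebuild, which widens the window further.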
From:
veritas-bu-bounces AT mailman.eng.auburn DOT edu
[mailto:veritas-bu-bounces AT mailman.eng.auburn DOT edu] On Behalf Of Martin,
Jonathan
Sent: Tuesday, June 29, 2010 3:13 PM
To: veritas-bu AT mailman.eng.auburn DOT edu
Subject: Re: [Veritas-bu] Destaging going slow
First of all, suck it Neil Conner. I’m about to
disagree with Ed (again) and there is nothing you and your fish eating friends
can do about it.
Gotten away with it? I’m not stealing bread from the
supermarket, I’ve made a calculated decision. I run 18 MD1000s in this
configuration globally and I have yet to lose an array. The added capacity and
speed benefit of a 15 disk raid array (no hot spare) is plenty worth the risk
of the array going down. Further, this risk is mitigated with properly
configured Dell OpenManage which alerts me immediately if a disk fails so I can
have it replaced.
Sure, I may eventually lose an array, but this is backup data.
Its importance to most businesses is somewhere between Dev and QA, and
if I were (worst case scenario) to lose a Raid5 and 10TB of backup data, then
I’d inform the appropriate application groups and move on. It’s not
like most of the data isn’t probably OK (backup not needed) or on tape
(array not needed) or has incrementals available (also on tape). This
isn’t a production database or file server, it’s backups. IMO,
Ed’s “every bit counts” attitude is completely out of step
with the real world.
-Jonathan
PS: I do agree with Ed about 1TB disks, but in my case because
of the poor performance not the raid implications. 15 x 500GB SATA in a Raid-5
is the backbone of my operation.
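For anyone weighing the capacity side of that trade-off, here is a small sketch of raw usable space for a 15 x 500GB shelf under a few common layouts. It uses the textbook formulas (one disk of parity for RAID5, two for RAID6, two-way mirrors for RAID10) and ignores hot spares, filesystem overhead and controller reservations, so treat the output as approximate. The function name is made up for this example.

```python
def usable_capacity_gb(n_disks, disk_gb, level):
    """Raw usable capacity for a single RAID set, ignoring spares and overhead."""
    if level == "raid5":
        data_disks = n_disks - 1      # one disk's worth of parity
    elif level == "raid6":
        data_disks = n_disks - 2      # two disks' worth of parity
    elif level == "raid10":
        data_disks = n_disks // 2     # two-way mirrors; an odd disk is left unused
    else:
        raise ValueError(f"unsupported RAID level: {level}")
    return data_disks * disk_gb


for level in ("raid5", "raid6", "raid10"):
    print(f"15 x 500GB as {level}: {usable_capacity_gb(15, 500, level)} GB usable")
```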
From:
veritas-bu-bounces AT mailman.eng.auburn DOT edu
[mailto:veritas-bu-bounces AT mailman.eng.auburn DOT edu] On Behalf Of Ed Wilts
Sent: Tuesday, June 29, 2010 10:09 AM
To: veritas-bu AT mailman.eng.auburn DOT edu
Subject: Re: [Veritas-bu] Destaging going slow
From a storage perspective, I've got all disks in a Dell
MD1000 enclosure configured in a single 15 disk RAID-5.
Don't ever do this. Jonathan has obviously gotten away with this (so far)
but using large drives (e.g. 1TB) in a 15-member RAID-5 set is just asking to
lose the array due to a double-disk failure.
I've done several recoveries for our Windows Server Team because they've
configured large RAID-5 sets and had double-disk failures.
../Ed