Hi,
Is anyone backing up total volumes of this order? And if so, what sort of
scaling, design and hardware?
I take it that's the size of your filesystems, not the estimated size
of the backup set (i.e. all cycles in the retention period)?
Assuming it is:
Yes - about 700TB and still growing.
Keeping the individual filesets to 1TB so that no single tape run is
excessive.
Largish changer - I'm about to retire a 500-slot NEO 8000 with 7
LTO5 drives in favour of a 120-slot Scalar i500 with 6 LTO6s.
If you don't have enough slots you'll be feeding it multiple times
during long weekends (we can easily peak at 20 tapes/day if multiple
fulls get kicked off).
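For a rough sense of scale, here's a back-of-envelope sketch (Python;
the cartridge capacities are native/uncompressed assumptions, 1.5TB
for LTO5 and 2.5TB for LTO6, the other figures are from above):

    # Rough tape and slot arithmetic, assuming native (uncompressed)
    # cartridge capacities.
    TOTAL_TB = 700                 # current data volume
    LTO5_TB, LTO6_TB = 1.5, 2.5    # native capacity per cartridge

    tapes_per_full_lto5 = TOTAL_TB / LTO5_TB   # ~467 cartridges
    tapes_per_full_lto6 = TOTAL_TB / LTO6_TB   # ~280 cartridges

    # How long a freshly loaded changer lasts at the 20 tapes/day peak:
    SLOTS, PEAK_TAPES_PER_DAY = 120, 20
    days_between_feeds = SLOTS / PEAK_TAPES_PER_DAY   # ~6 days

    print(f"Full pass on LTO5: ~{tapes_per_full_lto5:.0f} tapes")
    print(f"Full pass on LTO6: ~{tapes_per_full_lto6:.0f} tapes")
    print(f"{SLOTS} slots at peak: ~{days_between_feeds:.0f} days between feeds")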
If you don't have enough drives you won't keep up, let alone cope
with the inevitable drive failures and a 2-day turnaround for a
replacement. You absolutely must have at least one more drive than you
think you need to cope with the backup load. Apart from anything
else, it means you can run urgent restores without interrupting
backups in progress.
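To put numbers on "keeping up", a sketch assuming ~160MB/s native
streaming per LTO6 drive and no compression (the 60TB weekend load is
a hypothetical figure, not from the setup above):

    LTO6_MBPS = 160          # assumed native streaming rate, MB/s
    DRIVES = 6

    # One 1TB fileset on a single drive:
    hours_per_fileset = 1e6 / LTO6_MBPS / 3600                     # ~1.7 h

    # A hypothetical weekend where 60TB of fulls all land at once:
    weekend_tb = 60
    hours_total = weekend_tb * 1e6 / (LTO6_MBPS * DRIVES) / 3600   # ~17 h

    print(f"1TB fileset on one LTO6 drive: ~{hours_per_fileset:.1f} h")
    print(f"{weekend_tb}TB of fulls across {DRIVES} drives: ~{hours_total:.0f} h")
    # ...and that assumes every drive is healthy and streaming flat out.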
Large data safes. You'll need something like a Phoenix FS1903,
probably a couple (these hold about 800 LTOs apiece) and a strong
floor for them to sit on.
The tapes, safes and changer should all sit in close proximity in a
temperature-controlled _clean_ environment, preferably in their own
room, which is accessed as infrequently as possible. Dust kills
drives, and human skin is one of the worst contaminants: it's greasy,
while most other dust types are abrasive. Consider an air
scrubber and clean-room "flypaper" sticky sheets on the door
threshold.
Large (200GB+), high-performance SSD for spool. Consumer drives
become a bottleneck.
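Rough arithmetic on why a slow spool hurts (a sketch, again assuming
~160MB/s per LTO6 drive, not a measurement):

    LTO6_MBPS = 160
    DRIVES = 6
    SPOOL_GB = 200

    drain_mbps = LTO6_MBPS * DRIVES               # ~960MB/s if all drives stream
    minutes_of_buffer = SPOOL_GB * 1000 / drain_mbps / 60   # ~3.5 minutes

    print(f"Aggregate despool rate: {drain_mbps} MB/s")
    print(f"{SPOOL_GB}GB spool drained in ~{minutes_of_buffer:.1f} min at full tilt")
    # A consumer SSD that can't sustain reads at this rate (while new jobs
    # are still spooling in) leaves the drives stopping and starting.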
Something similar (RAID1) for the database, 500GB or so.
PostgreSQL - it just works. MySQL doesn't scale this large very well -
it will work, but you'll be constantly fighting with it.
LOTS of RAM for the DB box. I have 48GB in a 5-year-old machine. It's
due for an upgrade, but just about anything less than 5 years old with
an E5 CPU or better will do the job nicely.
10Gb/s connectivity. You can fudge it with LACP on 1Gb/s but it
becomes a bottleneck. Ditto on the fileservers themselves.
A decent network switch. The Huawei 6800 series is nicely specced
(1Tb/s-class switching throughput) and runs rings around equivalently
priced Cisco/Juniper kit - most of which uses the same Broadcom
Trident2/2+/3 chipsets anyway.
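The link-speed vs drive-speed arithmetic, as a sketch (ballpark
assumptions: ~90% of link bandwidth usable as payload, ~160MB/s
native per LTO6 drive):

    def usable_mbps(link_gbps, efficiency=0.9):
        """Approximate usable payload for a link, in MB/s."""
        return link_gbps * 1000 / 8 * efficiency

    LTO6_MBPS = 160
    for link_gbps in (1, 10):
        rate = usable_mbps(link_gbps)
        print(f"{link_gbps}Gb/s link: ~{rate:.0f} MB/s usable, "
              f"enough for ~{rate / LTO6_MBPS:.1f} LTO6 drives")
    # 1Gb/s can't even keep one drive streaming; 10Gb/s feeds several.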
We run 14-month retention on the backup cycle, with a full every 3
months, nightly incrementals and 4-weekly differentials. Rapidly
changing data in smaller sets gets monthly full backups. Thankfully
this is science data; financial data may need to be retained for up
to 7 years.
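That cycle is also why the backup set ends up much bigger than the
filesystems (see the question at the top). A rough count of what the
retention window holds per fileset, as a sketch that ignores calendar
edge cases:

    RETENTION_MONTHS = 14
    FULL_EVERY_MONTHS = 3
    DIFF_EVERY_WEEKS = 4

    fulls_kept = RETENTION_MONTHS // FULL_EVERY_MONTHS               # ~4
    diffs_kept = int(RETENTION_MONTHS * 4.33 / DIFF_EVERY_WEEKS)     # ~15
    incs_kept = int(RETENTION_MONTHS * 30.4)                         # ~425

    print(f"Per fileset: ~{fulls_kept} fulls, ~{diffs_kept} differentials, "
          f"~{incs_kept} incrementals retained")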
The most common restore is for accidental deletions, but we've had to
pull a few full fileset restores over the years - usually because someone
cheaped out and didn't RAID their box on the basis that "it's easy to
rebuild".
It never is unless it's a cookie-cutter setup - which they never are
after a week of operation - and it's less disruptive to change a dead
drive in a RAID set anyway (this can be done hot on Linux systems
using mdraid).
There's only ever been one major central store restore and that was
a runaway rm -rf. Unfortunately one group has a 200TB system which
is out of warranty but isn't being replaced for budget reasons. It's
being driven hard, and sooner or later it's going to drop its bundle.
I'm not looking forward to that day.
Regarding the data safes: people say "Iron Mountain", but backups
are not archives. You're going to cycle the tapes, and retrieving
them is much easier if they're local. A good fire safe will survive
an intense fire for 60 minutes and a 10-metre drop (simulating
building collapse) with the insides not going above 50°C, but it's
best to site your safes where they're least likely to get that kind
of experience, and pipe the data to them and the tape library.
Your single biggest hurdle is getting enough budget for the job.
Management usually won't spend enough on decent storage systems, and
they'll heavily resist spending on backup systems. "RAID is not
backup" usually doesn't sink in unless they've been burned a few
times.