Subject: Re: [Bacula-users] Hardware Specs & Sizing for Bacula?
From: Alan Brown <ajb2 AT mssl.ucl.ac DOT uk>
To: "bacula-users AT lists.sourceforge DOT net" <bacula-users AT lists.sourceforge DOT net>
Date: Sun, 21 Feb 2016 17:14:01 +0000
On 20/02/16 09:46, paul.hutchings wrote:
Wow that's quite a guide - appreciate that :)

I have to say that so far I've been very impressed with Bacula. My biggest struggle has been finding the time to dedicate to it, and not trying to do everything with the product we currently use as a reference point - which is hard when you've used it every day for 10 years.

Comparing with what you're comfortable with is always the hard part. The thing is, Bacula runs rings around almost every other product out there - and it's worth buying the supported version.


I think I get the "centralise" argument, but we may see that Linux file server grow to 30-40TB over the next few years. Even though it won't all need backing up, I would be concerned if the folks using it do a job that generates a huge amount of data, as we just don't have 10GbE everywhere yet. I can't help but think that throwing a £2.5k autoloader on that server directly, just to deal with that one box, may be simpler - but I'm very open to reasons for and against.


10GbE is relatively cheap if you only apply it to the machines that actually need it. Look at the Huawei CloudEngine 6800s if you want to start thinking about TRILL and distributed routing (https://tools.ietf.org/html/draft-ietf-trill-irb-10), which is a huge win for campus and datacentre work, or a cheaper 10GbE switch such as the S6800 if you can't justify that. (Hint: list price is one thing but discounts are always available, and Huawei's kit is already half the price of Cisco's with greater capabilities built in for the price - Cisco will nickel-and-dime you at every step, including maintenance charges.)

Incremental backups mean that most daily backups are small(ish). 1TB will back up in 6-7 hours if you set things up properly, and once you start spooling you'll find that two 1TB backups run in the same time (one spooling up whilst the other despools to tape).
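
(Rough arithmetic: gigabit tops out around 110-120MB/s on the wire, and real-world reads off a busy fileserver often run at well under half that, so 1TB at ~45MB/s works out to a little over 6 hours - which is where that 6-7 hour figure comes from.)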

More importantly - if you're spending money on a tape library, you'll get more bang for the buck by putting the money into one unit than by buying extra units or standalone drives dedicated to other purposes. A tape drive attached to a single machine can only be used by that machine, whilst one in a changer can be used by a group of them. It doesn't make much sense, logistically or economically, not to centralise tape systems.

The SSD speed delta makes sense. With our current backup product there is a similar database, and their spinning-disk requirement was insane vs. a single SSD (their database is disposable worst case, though it's still backed up daily to avoid grief). We'd back up the database with Bacula too.


From an absolute point of view, Bacula's database is disposable too - but if you want to restore anything less than a full backup set, it's the difference between reading an entire backup set vs skipping straight to the parts actually needed. And recovering the database means reading every single tape in order to find what's on them.

We dump the database every day and copy off to another machine. It's regarded as that critical.
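
There's nothing clever about it - a cron entry along these lines does the job (the paths and hostname below are examples, and the copy assumes a passwordless ssh key; the stock bacula-dir.conf also ships a BackupCatalog job that does much the same thing):

    # /etc/cron.d/bacula-catalog - dump the catalog after the night's jobs,
    # then copy it off-box so losing the backup server doesn't take the
    # catalog with it
    0 6 * * * postgres pg_dump -Fc bacula > /var/backups/bacula.dump && scp /var/backups/bacula.dump backup2:/srv/catalog-dumps/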

In our case the SSD is an Intel S3500, which is "enterprise" grade by most definitions. We have one free 2.5" bay in the server should another be needed to RAID up, or I guess we have internal options - the box wasn't purchased with Bacula specifically in mind, and plans have changed a little.


Plans always do.

The point about RAIDing is that even enterprise drives fail - and "enterprise" drives tend to be designed from the outset for RAID configurations, so their ERC setup lets them mark blocks bad quickly and move on to the next request, leaving the RAID to recover what's lost - vs consumer drives, which try very hard to recover data themselves when they encounter bad blocks.

Bearing that in mind, plus the criticality of the database and the requirement for low seek latency on inserts, a RAID1 pair is the best solution.

> Your tape suggestions are very helpful too, thanks - we'd have 2x LTO6 drives for Bacula and we have physical storage covered. Your comments about cleaning and debris are interesting, as Spectra sell their own "certified" media which is supposed to deal with many issues, and I did find myself deliberating whether it's worth the premium. But as you say, backups are critical, so I won't be giving an extra £10 per tape too much consideration.


Bear in mind that even if you buy in "pre-cleaned" tapes, time, handling and normal wear and tear mean that eventually they'll need cleaning anyway - or you'll be swapping out drives with increasing frequency and/or putting cleaning tapes into them too often (cleaning tapes are abrasive).

Tape drives and libraries are extremely susceptible to dust contamination, but make no effort to prevent it getting in (and the fans pretty much guarantee that dust is pulled through the mechanism). Human skin dust is one of the worst contaminants as it's greasy. Mptapes.com's LTO cleaner brochure shows the relative sizes of the contaminants compared to the tape bands.

You should also acquire a data cartridge chip reader, as this will report on the health of the tapes themselves. Mptapes' one is OEMed by at least a dozen resellers who all sell it for a lot more than the mptapes price (e.g. Fujitsu tried to sell it to us for £2000 when it's about $900 buying direct).

Once you realise how bad things can get and then look at the location of most tape drives/libraries, you'll get a bit more paranoid about their physical environment. (One example: we had building work done in part of the server room. Despite plastic sheeting being extensively used to wall off the work, every single one of our LTO drives died within 3 weeks.)

A clean room and the return of the white-coated priesthood of computing is probably over the top, but you do want to ensure the devices are in a very low-traffic environment, and put some form of HEPA air scrubber in the room if you can (there are a number of such units available for fairly low prices).

I'm going to do some reading, but any clarity around the "spool" function where tape is concerned would be good, as I'm not entirely clear in my head whether it relates to backing up directly to tape, or whether you're doing D2D2T for the 2T part?


You can't practically back up directly to LTO tape across a network - it has to be D2D2T. LTO speeds are far higher than disk speeds, let alone gigabit networking, and if you try without spooling your throughput will be abysmal as the tape drives first drop to lower run rates and then start to shoeshine when data flows hiccup (and they always do on networks and when backing up from fileservers).

In addition, spooling allows you to run multiple concurrent backups, which is where things really start to win. I run 400 incremental backups per night on our systems in about an hour because of this, vs needing 6-8 hours without.

The caveat is that you need a _fast_ spool disk (seek latency); the S3500 would be a good choice. (Put the database on something like a pair of SM843s - you need seek speed, but not absolute speed.)
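
On the Bacula side, spooling is only a couple of directives. A minimal sketch (device names, paths and sizes here are examples):

    # bacula-sd.conf - point the tape device at the flash spool area
    Device {
      Name = LTO6-0
      Media Type = LTO-6
      Archive Device = /dev/nst0
      Spool Directory = /ssd-spool      # the fast PCIe flash
      Maximum Spool Size = 600G         # cap it so concurrent jobs all fit
    }

    # bacula-dir.conf - raise the concurrency so one job can despool to
    # tape while the next is still spooling up (the matching Storage and
    # Client resources need their Maximum Concurrent Jobs raised too)
    Director {
      Name = backup-dir
      ...
      Maximum Concurrent Jobs = 20
    }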

I feel reasonably optimistic that we're on the right track if we do go with Bacula. It sounds like there are some tweaks to the hardware we have, but it doesn't sound like we'd need to literally start over - my biggest concern is around us having the time to absorb it all.

Thanks again - appreciate the time you took there.

Correct. Your base hardware is fine. It's just the memory and storage choices that need revising.



-------- Original Message --------
Subject: Re: [Bacula-users] Hardware Specs & Sizing for Bacula?
Local Time: February 19, 2016 10:02 pm
UTC Time: February 19, 2016 10:02 PM


On 19/02/16 19:10, paul.hutchings wrote:
Alan, thanks - I omitted that we have a Spectra LTO6 library which would be SAS-attached to the server in question; I didn't mention it as my initial query was more about the hardware specs.

It all ties together.



The rough plan would be D2D2T, and we'd probably run one of our fileservers (Linux) directly to a small locally-attached LTO6/7 library, as it's not data where we need long retention and it feels dumb to be running it over the network just to send it to tape to keep for a week.


You're better off centralising it. Seriously. Even if you're only keeping the backups a few days.


The hardware we have happens to have an 800GB SSD in it by lucky coincidence, which I thought could be used for the Postgres database (I've not used Bacula enough to know how big the database may grow).

800GB is big enough, but is it fast enough? ("There are SSDs and there are SSDs.")
With the kind of use it's getting, you need to know the speed of garbage collection, as this is going to be the driving factor far more than trim commands. On top of that, you need to know the endurance of the drives so you can calculate when they'll need replacing (consumer SSDs are rated for total writes of about 1000 times their capacity; enterprise drives will generally run to 100 times more than that).
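
(Worked example with round numbers: an 800GB consumer drive at ~1000x capacity is good for roughly 800TB of total writes. At 200GB/day of database churn that's 800TB / 0.2TB ≈ 4000 days; at 2TB/day of spool traffic it's 400 days. Do the sum with your own figures.)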

By way of comparison: 2-million-file full backups were taking hours to insert attributes into the database on a 6-drive RAID6 spinning set. That came down to "5 minutes" when I moved to a RAID1 pair of 500GB Samsung 840 Pros, but crept back up to "an hour" after a while. Flushing and trimming the disks brought the time back down, but as the same blocks are being repeatedly written there's no trim sent in normal operation, and they're gradually slowing down again even though they're now on a controller which supports trim commands. Whilst 840s are fairly notorious for their GC speed, they're faster than most consumer drives - and the plan is to replace them with a pair of SM843s anyway, as 500GB isn't large enough.

RAID is a must - you really don't want to attempt a restore without an intact database. This is worst-case disaster scenario material, and you need to treat the database as business-critical - which it is when things go wrong. (The database can also function as an IDS (intrusion detection system); the Bacula manuals have details on how to use it for that.)

> which I imagine would benefit from it immensely, but I'm not clear what the "spool" is that you're referring to - a quick dig suggests it could be attribute spooling or a spool area for data that's going to tape?



Correct on both counts. If you're feeding LTO-anything you MUST spool or you'll shoeshine the tapes, and you MUST use SSDs for concurrent backups on anything faster than LTO3, as the raw speed of the tapes is higher than the sequential read speed of even 15krpm drives.
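
Both are one-line directives in the job definition - roughly this (the job name is made up):

    # bacula-dir.conf
    Job {
      Name = "fileserver-daily"
      ...
      Spool Data = yes         # stage to the spool disk, then stream to tape
      Spool Attributes = yes   # batch the attribute inserts at end of job
                               # rather than trickling them into the DB
    }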

The moment you start randomly seeking on spinning media, your throughput and IOPS will plummet, RAID or no RAID. On top of that, spinning drives used for spool will self-destruct regularly (even HGST 7k4s) due to the cumulative seek load, which translates to unnecessary downtime and hassle.

On the current backup box I'm using a 5-disk RAID0 set of 64GB Intel X25-E drives. At the time they were $800 each, and the spool area is really only about half the size it needs to be. Spending that kind of money now will get you an extremely nice, blisteringly fast PCIe SSD. You need at least 600MB/s sustained, so stay away from SAS/SATA for the spool.


Sounds like we're good on the hardware, but if necessary we'll throw in some RAM.


If you stick with the plan of spinning drives, you'll regret it very quickly.

More RAM is a must, as is proper database tuning. (Postgres is good, but you need to tell it how much RAM is available and give it optimisations for SSDs. MySQL is tuning hell.)
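
As a starting point on the Postgres side (the numbers below assume a box with ~96GB of RAM and the database on SSD - tune to your own hardware, none of this is gospel):

    # postgresql.conf
    shared_buffers = 16GB            # Postgres' own buffer cache
    effective_cache_size = 64GB      # what the OS page cache will hold for it
    random_page_cost = 1.1           # tell the planner that seeks are cheap (SSD)
    maintenance_work_mem = 2GB       # faster index builds and vacuums
    checkpoint_completion_target = 0.9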

Use separate SSDs for the database and the spool. Consider SSD for the OS too.

Use software RAID and dump the PERCs unless you can switch them from IR (initiator-RAID) mode to IT (initiator-target) mode, as they'll slow you down (PERCs are mpt2sas-based - a low-end SAS chipset with significant RAID performance limits).
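
With the controller in IT mode the drives show up as plain disks and md does the rest - e.g. for the database pair (device names and mount point are examples):

    # create a RAID1 pair and put the database filesystem on it
    mdadm --create /dev/md0 --level=1 --raid-devices=2 /dev/sdb /dev/sdc
    mkfs.ext4 /dev/md0
    mount /dev/md0 /var/lib/pgsql    # mount point is distro-dependent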

The spool device is disposable. Everything else is not. Your backup system needs to be treated as business-critical and built accordingly, along with the tape storage (a filing cabinet or shelves is nowhere near good enough). When things go bang you need it to work first time in order to be up and running as quickly as possible.

Some of the more paranoid people I know use 3- or 4-way RAID1 mirroring on the OS and database disksets, specifically so they can keep one disk from each RAID set in the data safe at all times.

With regard to safes: we use two of the large ones pictured at http://www.phoenixsafeusa.com/primary-designation/media-safes - these hold ~800 LTOs apiece. They should be positioned close to your tape library, which in turn should be in a temperature/humidity-controlled, dust-free environment _out_ of your main server areas. (The last thing you want if the server rooms catch fire is to lose your backup system too, and server rooms always end up dusty, which kills tape drives.)

If you buy a lot of tapes, consider an LTO cleaner from mptapes.com - most tape-drive-related contamination incidents we've seen have been the result of new media arriving with contamination on it, contaminating the drives, which in turn cross-contaminated a lot of other tapes. This shows up as drives requesting excessive cleaning cycles and tapes showing as "full" at significantly less than their raw capacity (in the worst cases tapes were only holding 100GB of data; the rest was taken up by rewrites).



We're so new to Bacula that I'll be blunt and admit there's lots I simply haven't got my head around yet if we do go with it, so apologies if some of this is dumb/obvious to most of you :)

Bacula installations range from home systems to major banks. There's no "one size fits all", but there are some fairly important guidelines you need to adhere to in order to ensure that your backups are there and usable when you need them. (Needing them is always a high-stress event, whether it's "I just deleted XYZ important file and I need it back NOW" or "the main fileserver caught fire and we need to rebuild it", so plan ahead.)

As a rule of thumb for LTO, try not to let individual backup sets go much over 1TB. The bigger they are, the greater the chance of something going wrong during the backup/restore procedure, and you don't want full backups going over 24 hours in any case, as this starts interfering with daily backups of the backup server itself.

If someone tells you they need a 12TB filesystem, it's quite likely they don't and they haven't thought through what happens when it needs fscking - which is another good reason for keeping backed-up filesets under 1TB. Beyond that size, fscks at startup can eat a lot of time even when parallelised; one such machine here gets rebooted every 6 months and usually spends a day in fsck before it's ready for use.
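
Splitting is just a matter of defining one FileSet (with a matching job) per chunk of the tree instead of one monster - a sketch, with made-up paths:

    # bacula-dir.conf
    FileSet {
      Name = "fs-projects-a"
      Include {
        Options { signature = MD5 }
        File = /srv/projects/groupA
      }
    }
    FileSet {
      Name = "fs-projects-b"
      Include {
        Options { signature = MD5 }
        File = /srv/projects/groupB
      }
    }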




-------- Original Message --------
Subject: Re: [Bacula-users] Hardware Specs & Sizing for Bacula?
Local Time: February 19, 2016 6:58 pm
UTC Time: February 19, 2016 6:58 PM


On 19/02/16 18:12, paul.hutchings wrote:
We're new to Bacula and are still considering if it's viable for us.

Our test environment is quite small (it is a test environment), and when I read the docs I'm not sure how current the hardware recommendations are.

For example, if I were to suggest a box with dual 8-core E5 CPUs, a hardware PERC RAID card with 1GB cache, 48TB of 7.2k SATA in RAID6 and 32GB (or more) of RAM running as an SD, would people be thinking "hmmm, may need more horsepower" or "that should handle hundreds/thousands of clients"?


It depends. For SD-only use, your CPU is overkill, and even 16GB of RAM would be overkill.

The RAM requirements are for the Director and the database. These can be on the same box, and probably should be, to avoid networking penalties. You don't need VMs - and really shouldn't play that game on backup-dedicated hardware, as VMs come with performance penalties ranging from noticeable to major.

Even with the DB and DIR on the box, your CPUs are more than adequate.

Assuming SD + DIR + Postgres (don't mess with MySQL for million-plus-file installations; it doesn't scale well), I'd add more RAM. It's cheap enough these days that you should think about running at least 96GB if you're backing up tens of TB and tens of millions of files (more if you can afford it).

The real issue if you're running backups at this scale: disk is a liability. It's too slow, and the drives will end up shaking themselves to pieces, making the backup pool your single point of failure. You _need_ tape - a decent robot and several drives, along with a suitably sized data safe.

We currently back up about 250 million files over 400TB, and I'm currently using a Quantum i500 with a 14U extension and 6 SAS LTO6 drives; previously we had an Overland Neo8000 with 7 FC LTO5 drives.

Once you bite the bullet and use tape, dump the SATA spinning disks. Use something like a RAID1 pair of 500GB SM843s for your OS, put in a second dedicated 1TB RAID1 pair for the database, and use a _fast_ 200-800GB PCIe flash drive for spool.

10GbE networking is an absolute must. Don't try to play games with 1Gb/s bonding - any given data stream will only run at 1Gb/s maximum.


On the other hand, the setup above would be an expensive waste of time for backing up 10TB of data - at that size you could keep the spinning media and the rest of the spec - but bear in mind that 48TB will only hold 3 full backups of a 15TB dataset (and any fewer than 3 full backups is asking for trouble), before taking differentials or incrementals into account.

For 20TB+ you may want to look at a single-drive tape autochanger capable of holding at least 10 tapes. The last thing you want to be doing is feeding new LTO6/7s into it every 2-3 hours while a full backup is running (yes, they will fill up that quickly).



The Director could be on the same physical box, but would ideally be a VM with a couple of CPU cores and as much RAM as is needed to handle a couple of dozen clients - though the largest two clients are around 10TB and each have millions of files.

The impression I get is that network and disk will be the bottleneck way before RAM and CPU?





_______________________________________________
Bacula-users mailing list
Bacula-users AT lists.sourceforge DOT net
https://lists.sourceforge.net/lists/listinfo/bacula-users