[HOWTO] Restore 1,000s of VMs in Hours

rowl

I am being asked how we can restore thousands of VMs in a few hours (no more details than that). This is for recovery from a patching cycle gone wrong, ransomware, or other large-scale software corruption.

My first thought is that this problem would be best solved by crash-consistent, storage-level snapshots, augmented with a VM backup solution that offers instant access for those VMs that can't be booted from the snapshots.

Curious if others have had this sort of vague requirement in their environments, and what you came up with.

Thanks,
-Rowl
 
Wow. That's a tall order.
I'm honestly not sure how that could be done without a massive overhaul of the entire infrastructure. 10 or 100 Gb Ethernet? 32G or InfiniBand SAN? SSDs for all your storage, for both TSM and the VMs?
Even then, the TSM server would likely have to be pretty beefy. Heck, would it make sense to have 5+ servers and storage so they could each process a subset of the workload?

Perhaps Spectrum Protect Plus? I know there are some new kids on the backup block, like Zerto, that claim to be able to do just that.

I'd be interested, rowl, if you do manage to come up with a way.
 
I started looking into this. 1,000 average-sized VMs would be around 100 TB. The network infrastructure required to move that much data in "hours" doesn't seem realistic, not to mention whether the source and target could support that sort of I/O load. I think we need a way to bring up the VMs live on the backup storage, then start a long process of vMotion jobs to get everything back where it belongs on production storage.
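To put rough numbers on the 100 TB estimate above, here is a minimal back-of-the-envelope sketch. The 4-hour window, the link speeds, and the function names are my own assumptions for illustration, and it assumes decimal units and perfect line-rate transfer with no protocol overhead, dedupe, or compression, so real numbers would be worse.

```python
TB = 10**12  # decimal terabyte, in bytes (assumption: vendor-style decimal units)


def required_gbit_per_s(total_bytes: float, window_hours: float) -> float:
    """Sustained rate in Gbit/s needed to move total_bytes within window_hours."""
    return total_bytes * 8 / (window_hours * 3600) / 1e9


def hours_at_line_rate(total_bytes: float, link_gbit_per_s: float) -> float:
    """Hours needed to move total_bytes over one link running at full line rate."""
    return total_bytes * 8 / (link_gbit_per_s * 1e9) / 3600


if __name__ == "__main__":
    total = 100 * TB  # ~1,000 average-sized VMs, per the estimate in this thread
    window = 4        # hypothetical reading of "a few hours"
    print(f"Need {required_gbit_per_s(total, window):.1f} Gbit/s sustained")
    for link in (10, 100):  # hypothetical link speeds in Gbit/s
        print(f"{link} GbE at line rate: {hours_at_line_rate(total, link):.1f} h")
```

Under those assumptions, a 4-hour window needs roughly 55.6 Gbit/s sustained end to end. A single 10 GbE link would take about 22 hours, and even a single 100 GbE link would need a bit over 2 hours with every component keeping up.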

I wonder what 1 PB of solid state storage for my disk pools will cost :)
 
Right.
Short of 'hot standby'? Or maybe delayed replication, or replication with versions.
Perhaps an HA cluster for every workload in a VM? But that doesn't stop ransomware, at least.
Everything should be NOW, right? :)

I feel your pain. We have 700ish VMs, and it's been stated to my face many times that we couldn't rely on TSM for a true DR scenario because of how long it takes to restore a single VM. So when that comes up, I start down the path of infrastructure requirements at the physical layer and the costs associated with that upgrade, including the server requirements for the VM farm and TSM. Then there's the fact that Corporate made us buy 7200 rpm drives for 'capacity not speed'. I kindly point out those facts and the limitations they impose.

What gets me is that these cloud vendors claim they can deploy hundreds of VMs in minutes, and executive leadership seems to think that also translates to restoring hundreds of VMs in minutes. I would love to see a product that can restore 700 VMs in minutes when each VM has over 1 TB of storage associated with it, from scratch and not just by reattaching the VMDKs that were already present on disk.
Just from the VM host CPU and I/O limitations alone, I'm not entirely sure it's feasible within normal budgetary means.
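For a feel of the host-side limit, here is a rough sketch of the same arithmetic from the VM host's point of view. The host count, the per-host ingest rate of 2 GB/s, and the 30-minute window are purely hypothetical placeholders (plug in your own), and it treats 1 TB per VM as a lower bound with perfectly even load across hosts.

```python
import math

TB = 10**12  # decimal terabyte, in bytes
GB = 10**9   # decimal gigabyte, in bytes


def restore_hours(vm_count: int, tb_per_vm: float,
                  host_count: int, host_ingest_gb_per_s: float) -> float:
    """Hours for a full restore if every host ingests at a steady rate."""
    total_bytes = vm_count * tb_per_vm * TB
    aggregate_bytes_per_s = host_count * host_ingest_gb_per_s * GB
    return total_bytes / aggregate_bytes_per_s / 3600


def hosts_needed(vm_count: int, tb_per_vm: float,
                 window_minutes: float, host_ingest_gb_per_s: float) -> int:
    """Hosts required to finish a full restore inside window_minutes."""
    total_bytes = vm_count * tb_per_vm * TB
    required_bytes_per_s = total_bytes / (window_minutes * 60)
    return math.ceil(required_bytes_per_s / (host_ingest_gb_per_s * GB))


if __name__ == "__main__":
    # 700 VMs at 1 TB each; 20 hosts and 2 GB/s per host are made-up figures.
    print(f"{restore_hours(700, 1, 20, 2):.1f} h with 20 hosts")
    print(f"{hosts_needed(700, 1, 30, 2)} hosts to finish in 30 minutes")
```

With those made-up figures, 20 hosts would need close to 5 hours, and hitting a 30-minute window would take 195 hosts each sustaining 2 GB/s of restore writes, before counting what the backup server and storage would have to deliver on the other end.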

Then again, I could be wrong and would love someone to prove me wrong.
 