Best approach for large 3+ TB DB backups

Jeff_Jeske

We have an Oracle database running on Linux that is 3.8TB in size. Our current approach to protecting this server is to stop the DB, perform a Clariion snapshot, start the DB, then present the snapshot to another server and back it up from the snapshot host. Note: the snapshot is taken from the secondary copy at the DR site.

Functionally this works well, but due to resource limitations we only give the backup job one mount point, so an incremental job runs for 28 hours.

This week a Linux admin accidentally deleted one of this server's filesystems. By design, the mirrored Clariion disk immediately propagated the damage to the DR site copy as well. In addition, you can't roll back a Clariion snap on the secondary and immediately use it at the primary. So... we were left rolling tape. As it turns out, you can't simply restore just a piece of the database and expect it to be happy. We were forced to do a full 3.8TB restore from tape.

This took us about 36 hours from launch to completion. The restore worked as designed, but our clients and managers were not happy that it took us 36 hours to recover. (EVEN THOUGH THEY WERE TOLD THIS TIME AND TIME AGAIN)

Nonetheless, we are now looking for a better way to protect this database via TSM. I'm looking to hear from those experienced in large database backups. Do you use the Oracle TDP? Do you stream the backup to multiple drives? Any and all information is welcome.
 
Regardless of what method you choose for the backup (TDP, flash, cold, etc.), your limitation is your mount points. Generally, to facilitate a quicker backup/restore turnaround for such a large database you need to allocate more mount points and drive more data streams.
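To put rough handles on that: the mount-point side lives on the TSM server, the session side on the client. A sketch (the node and device class names are made up; substitute your own):

  From a dsmadmc session on the server:
    update node ORA_DB01 maxnummp=4
    update devclass LTO3CLASS mountlimit=drives

  In dsm.sys on the client:
    * allow multiple producer/consumer sessions per backup
    RESOURCEUTILIZATION 10

Extra mount points only help if the client actually opens multiple sessions, hence the resourceutilization bump.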
 
I'm curious if you lean towards SAN backup for a client this size. The reason for our mount point limitation is not only drive availability but also TSM server NIC limitations. It doesn't take many large file backups to flood the TSM server NIC.
 
You could look at LAN-free, or if you wanted to stick to LAN-based backups, look at EtherChannels on both the DB server and your TSM server to accommodate the throughput.


Otherwise, since this is EMC, a BCV setup might work well, and in that specific situation could probably have been used for the restore.
 
Yeah, I'd definitely lean that way, especially if your TSM NIC is already burdened. I'd also isolate the large DB files to go straight to tape (LAN-free) using include statements. This reduces the burden on your TSM server NIC and maximizes the tape drives' capabilities. You should be able to do it with include statements binding files of a particular name to the LAN-free management class, while still pushing the smaller files over the LAN to a disk pool; that's the most efficient way, as you don't want those small files going LAN-free. I have heard of an undocumented include.size option to bind a management class based on file size, but I have never experimented with it directly.
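For what it's worth, the binding would look something like this (the /oradata path and the ORA_TAPE_MC / ORA_DISK_MC management class names are just examples, and ORA_TAPE_MC has to point at a tape pool the storage agent can reach). Remember the include/exclude list is read bottom-up, first match wins:

  In dsm.sys (server stanza):
    ENABLELANFREE YES

  In the inclexcl file:
    * everything else under /oradata goes over the LAN to the disk pool class
    INCLUDE /oradata/.../*       ORA_DISK_MC
    * big datafiles bind to the LAN-free class and go straight to tape
    INCLUDE /oradata/.../*.dbf   ORA_TAPE_MC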

Also, you'll want to tune the TSM client settings: maximize txnbytelimit and tcpwindowsize, turn compression off, etc., so you're pushing large transactions.
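Something along these lines in dsm.sys, with the values as a starting point rather than gospel:

    * transaction size in KB (this is the 5.x maximum, roughly 2 GB)
    TXNBYTELIMIT   2097152
    * bigger TCP window/buffer for the LAN sessions
    TCPWINDOWSIZE  512
    TCPBUFFSIZE    512
    * let the LTO drives do the compressing, not the client CPU
    COMPRESSION    NO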

Again, the main trick to maximizing throughput on a LAN-free backup is to maintain that data stream to the tape drive. Having large files to send clearly makes this easier.
 
LAN-free works great, provided you can guarantee a tape mount when you need it. From your description it would appear you will only get one backup thread out of your current setup, so this is about the only way to get a speed increase without making more changes. And you have already run into the limitations inherent in relying on snaps.

The TDP client will allow online backups of the database through RMAN, and it also lets your DBA use versioning in case the DB is corrupted or otherwise damaged. A replicated copy would just be corrupted or damaged too, but with the TDP you can go back to a known good version. And you can combine TDP and LAN-free.
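With the TDP the DBA drives it from RMAN, roughly like this (the channel count and the tdpo.opt path are examples; more channels only pay off if you have the mount points to match):

  RUN {
    ALLOCATE CHANNEL t1 TYPE 'SBT_TAPE'
      PARMS 'ENV=(TDPO_OPTFILE=/opt/tivoli/tsm/client/oracle/bin64/tdpo.opt)';
    ALLOCATE CHANNEL t2 TYPE 'SBT_TAPE'
      PARMS 'ENV=(TDPO_OPTFILE=/opt/tivoli/tsm/client/oracle/bin64/tdpo.opt)';
    # two channels = two simultaneous tape mounts
    BACKUP DATABASE PLUS ARCHIVELOG;
  }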

Or you can use RMAN to dump the backup to a file, if you have the disk space, and use a product like Quest's LiteSpeed to compress it down considerably. That, of course, adds extra steps to any DR recovery, but it should speed up the backup time.
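The dump-to-disk variant is basically just disk channels with the compressor bolted on afterwards (the staging directory below is made up; 10g RMAN can also do BACKUP AS COMPRESSED BACKUPSET natively):

  RUN {
    ALLOCATE CHANNEL d1 TYPE DISK;
    BACKUP DATABASE FORMAT '/backup/stage/%U';
  }
  # then run LiteSpeed (or gzip, etc.) against /backup/stage before TSM picks it up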
 
We've investigated EtherChannel, but at the time it didn't fit into our redundant IP network, which uses different switches for different NICs. I think it required us to keep the entire channel on one switch, thus creating a SPOF.

The BCV approach is for Symm disk, not Clariion. I agree this database should be on tier 1 storage, but at nearly twice the price for the R1, plus the additional cost for the BCV, I don't think that is realistic at this time.

I suspect the quick win would be adding a second drive and using LAN-free.

Our SQL environment is using LiteSpeed, but now that newer SQL Server versions provide native compression, we are moving away from Quest. It may still be a good fit for this server. How does the Oracle TDP compare to LiteSpeed?

We have an IBM 3584 with 8 drives, and used LTO3 drives are fairly cheap these days. I'm recommending we purchase additional drive capacity. Hopefully multistreaming the job over fibre to multiple drives will speed things along.
 

Yeah, it's surprising they're not willing to spend the extra money at this point, given their recent experience. The redundancy aspect of EtherChannel can be somewhat of a drawback; you could work around that by using DNS round robin with two EtherChannels, one on each switch, under a common DNS name. I'm not sure how that would affect performance of the setup, though.

Cost-wise, LAN-free is probably the next best option, especially if you're running 4/8Gb SAN gear.

Just a side thought: have they entertained using 10Gb LAN interfaces on the server and switch side, if you wanted to stick with a LAN-only backup?
 
I'm backing up Oracle on Solaris. I have some backups that are 4TB and some smaller ones. All are 1TB or greater.

We have mirrored disks on an EMC SAN. When a backup is not running, the mirror is broken and Oracle uses half of the disks. Each night a Perl script puts Oracle in hot backup mode, reattaches and syncs the disks, breaks the mirror again, takes Oracle out of hot backup mode, and mounts the split disks on the TSM server, where they are backed up to tape and copies are made to send to the vault. This takes 14-16 hours for 3.5-4TB.
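A shell-flavored sketch of what that script does (the EMC sync/split commands are site-specific, so they're only placeholders, and the mount point /ora_copy is made up; "alter database begin backup" is 10g+, older releases do it per tablespace):

  # put the database into hot backup mode
  echo "alter database begin backup;" | sqlplus -s "/ as sysdba"

  # ... re-establish and sync the mirror (site-specific EMC commands) ...
  # ... wait for the sync to finish, then split the mirror again ...

  # take the database back out of hot backup mode
  echo "alter database end backup;" | sqlplus -s "/ as sysdba"

  # mount the split copy on the TSM server and send it to tape
  # ... mount the clone devices, e.g. on /ora_copy ...
  # dsmc incremental /ora_copy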

We back up the Oracle archive logs every ten minutes, 24/7. The mirrored disks are synced daily but only written to tape twice a week. This saves a lot of tape and time.

If we need to roll back to the previous day we can reverse sync the disks. This is very fast when compared to a tape restore.

To do this you must keep the Oracle archive logs on a different disk and back them up with a separate process; if you let Oracle write archive logs to the same disk that is being synced, the sync may never complete. All of our applications remain available during the backup.

During disaster recovery testing restores can take 16-36 hours from tape depending on what they want to restore.

We have been doing this for several years and it works very well.
 
MikeyD - I've been trying to get 10Gb IP, but since TSM is the only app that pushes more than a trickle down the 1Gb pipe, it's a hard sell. Strangely enough, it's still hard for "them" to understand that TSM moves more data in one day than any other server-based application.

rlkeeney - Thanks for the feedback. Do you know if your database is on Symm or Clariion? The problem with Clariion is that the resync process takes a significant amount of time; resyncing the restored primary with the previously fractured secondary took almost 40 hours.

I believe taking a snapshot at the primary site may provide faster recovery, but that is only a small-scale option since the reserved LUN pool is limited in capacity. Again, this would be an excellent candidate for Symm disk.

Thanks to all that have provided input. Keep it coming ....
 
In my honest opinion, a LAN-based backup for a DB this size is not an option IF speed and downtime are critical requirements.

The setup I would go for is to take a point-in-time copy of the database SAN-to-SAN, then back up the copied DB LAN-free (or over the LAN, if you so choose).

This method, however, only gives you full DB backups. To compensate, incrementals can be done in between the fulls, which should not take long if done frequently.
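If RMAN/TDP is in the picture, those in-between incrementals are just the usual level 0 / level 1 split (a sketch, assuming the channel setup mentioned earlier in the thread):

  # weekly full as the level 0 base
  BACKUP INCREMENTAL LEVEL 0 DATABASE;
  # nightly deltas in between
  BACKUP INCREMENTAL LEVEL 1 DATABASE;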
 