ADSM-L

Strategies for DR recovery of large clients

2002-09-10 11:31:26
Subject: Strategies for DR recovery of large clients
From: Werner Kliewer <VKliewer AT MPI.MB DOT CA>
To: ADSM-L AT VM.MARIST DOT EDU
Date: Tue, 10 Sep 2002 10:29:48 -0500
I am working on our first TSM-based DR plan for our data centre. Our current
plan, which has been tested successfully several times, uses system-specific
tools such as Sysback/6000 for AIX, BRM for AS/400, and ArcServe for Novell and
WinNT. We have been asked to convert to TSM to consolidate all backups in one
tool.

We have recently installed an NSM running TSM version 4.1.3.0, soon to be 
upgraded to 4.2.2.1 because that is the newest version certified for the NSM. 
It is attached to an LTO library. There is no HSM activated, but the TSM Server 
is backing up numerous clients that have to be restored in a DR scenario. Many 
of them are NT4.0, Windows 2000 and soon Windows XP servers of various sizes. 
There are also several AIX 4.3.3.ML09 servers.

I have the TSM Server recovery down to probably the bare minimum time of 4-6 
hours, depending on the power of the machine it is being recovered on. This 
does not include creating the disk pools, which can take another 12-18 hours 
but is not part of the critical path.

The biggest Windows servers run Exchange or SQL Server, which tend to back up
large blobs of data that are relatively easy to restore.

Two of the AIX servers are p680s with 750 logical volumes, 500 filesystems and
1.5-2 terabytes of total data each, with 250-350 GB of data backed up nightly.
For my current test, the DR media pool is 107 cartridges. I have to restore the
data in stages. Neither the command line client nor the GUI will allow me to
select all the filespaces I need at once for the first pass. The command line
complains that the command is too long, and the GUI simply fails before
starting if I choose too many filesystems. Even if one of them worked properly,
I would have to pass at least 75 of those tapes to do the restore. This initial
pass includes great chunks of the /home filesystem, where things are changing
every day.
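
As a rough illustration of the staging problem above, the sketch below splits a
list of filespace names into groups whose space-joined length stays under a
limit. It is only a sketch: the limit value and the filespace names are
hypothetical, the real command-line limit is not known here, and the function
does not invoke the TSM client at all.

```python
def batch_filespaces(filespaces, max_chars):
    """Group names so each group's space-joined length is <= max_chars.

    max_chars is a hypothetical stand-in for whatever length limit the
    command line enforces; it is an assumption, not a documented value.
    """
    batches = []
    current = []
    length = 0
    for name in filespaces:
        if len(name) > max_chars:
            raise ValueError(f"{name!r} alone exceeds the limit")
        # One extra character for the separating space, except for the
        # first name in a group.
        added = len(name) + (1 if current else 0)
        if current and length + added > max_chars:
            # Current group is full: close it and start a new one.
            batches.append(current)
            current = [name]
            length = len(name)
        else:
            current.append(name)
            length += added
    if current:
        batches.append(current)
    return batches


if __name__ == "__main__":
    # Hypothetical filespace names, for illustration only.
    example = ["/home", "/data01", "/data02", "/u01"]
    for group in batch_filespaces(example, max_chars=14):
        print(" ".join(group))
```

Each group would still be a separate restore pass over the tapes, so this works
around the length limit without reducing the number of tape mounts.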

For my DR test, I am running on a p610 with a single, stand-alone LTO drive. It 
takes about 12 hours to pass those 75 tapes once. I will have to pass them 3-5 
times for each restore. For the real DR test, I will have an F50 and 3 LTO 
drives, but I will be restoring at the same time as all the other critical 
servers, so I will be lucky to get a single drive to myself, and it is the 
LOAD, UNLOAD and LOCATE parts of the process that take up the bulk of the time. 
Actual data transfer is quite well optimized, once the data is located.
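
The figures above can be turned into a rough time budget. The sketch below uses
only numbers from this post (75 tapes, about 12 hours per pass, 3-5 passes, 4-6
hours for the TSM Server recovery, and the 48 hour test window mentioned
further down). It assumes the passes run one after another on a single drive
and start only after the server is rebuilt; both are simplifying assumptions,
not measurements.

```python
TAPES_PER_PASS = 75        # tapes passed per restore pass
HOURS_PER_PASS = 12        # observed on the single stand-alone LTO drive
PASSES = (3, 5)            # expected range of passes per restore
SERVER_REBUILD_HOURS = (4, 6)  # TSM Server recovery time range
WINDOW_HOURS = 48          # DR test window


def minutes_per_tape(hours_per_pass, tapes):
    """Average time spent per tape, dominated by load, locate and unload."""
    return hours_per_pass * 60 / tapes


def total_hours(passes, hours_per_pass, rebuild_hours):
    """Server rebuild followed by sequential tape passes (an assumption)."""
    return rebuild_hours + passes * hours_per_pass


if __name__ == "__main__":
    per_tape = minutes_per_tape(HOURS_PER_PASS, TAPES_PER_PASS)
    best = total_hours(PASSES[0], HOURS_PER_PASS, SERVER_REBUILD_HOURS[0])
    worst = total_hours(PASSES[1], HOURS_PER_PASS, SERVER_REBUILD_HOURS[1])
    print(f"average per tape: {per_tape:.1f} minutes")
    print(f"restore estimate: {best}-{worst} hours of a {WINDOW_HOURS} hour window")
```

With these inputs that is about 9.6 minutes per tape and 40 to 66 hours in
total, before any database rebuild or application testing.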

I am currently looking at two possible ways to improve this. None of the servers
will have direct attached backup/recovery devices of sufficient capacity, 
throughput, or reliability to be useful. We cannot afford enough drives to 
cover all the servers. All restores must be done via the TSM server.

One possibility is to use BACKUP SETs. But I am concerned that BACKUP SETs are
oriented to local restore scenarios, and I am not sure how easy they are to
manage and restore from a central storage (NSM/TSM) point of view. There is
also some concern about the additional TSM activity that creating the BACKUP
SETs would cause on an NSM that is already fairly close to capacity.

The second option is to do full system ARCHIVES, but this would cause activity
on both the NSM and the client, neither of which has an available window for
this activity.

Because either of these possibilities would, of necessity, be occasional (at 
best once a week), there is the additional issue of how easy it is to bring the 
system up to the most current backup after the restore. Would a multi-filespace 
simple restore be intelligent enough to pass only the last 7 days of tape or 
would it pass all tapes with those filespaces on them?

A third possibility I have thought of recently is to isolate these very large 
servers in their own COPY POOLs, effectively co-locating only these servers, 
but I am not convinced this would reduce the number of tapes passed by the DR 
restores, and it would certainly increase the total number of tapes in the DR 
set and increase off-site reclamation activity, which already takes the better 
part of the day shift most days.

Sorry for the long post. And thanks for any hints more experienced TSM users 
can provide.

I have a 48 hour DR Test window in which to build the TSM Server, restore data
to clients, hand the clients to the DBAs to rebuild the databases, and test
the end user applications. In the past, we ran SYSBACK/6000 on the AIX servers
and were usually home and cooled in 36 hours. But the AIX servers have been
consolidated from 7 large servers to 2 enormous servers, and SYSBACK/6000
would also struggle to complete in the 48 hour window. The point of going with
TSM was to try to improve on this.

P.S. Where are the AS/400 clients? When we were sold the product, we were told 
it ran on AS/400. We foolishly assumed this meant the client first, server 
second ...

<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<
Werner (Vern) Kliewer
Sr. ITS Analyst
Mid-Range Support
Manitoba Public Insurance
(204)-985-7745
vkliewer AT mpi.mb DOT ca
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>