[ADSM-L] Not too risky.... ( was Re: Dear Tuscon )

>> On Fri, 20 Mar 2009 09:39:23 -0500, Nick Laflamme <dplaflamme AT GMAIL DOT COM> said:


> My heart leapt when my RSS reader presented me an article in the TSM
> updates feed from IBM with the heading, "Keeping more than one TSM
> server database backup on a tape." As I'm implementing a new server
> using 3592 drives, I haven't been happy with my options for this
> particular issue. Maybe, I thought, I was about to learn something
> of immediate use and high value!

> My heart sank when I read the actual article, which might be
> paraphrased as, "Sorry, Charlie, too risky."

I say, bunk.  Of course, your decisions have to be guided by your own
sense of paranoia, but I think a blanket "too risky" is just plain
wrong.

If you actually measure your risks, I think you'll find you can lower
them, not raise them, and get "more than one DB backup on a tape" as a
side effect.

Here's what I do:

My library manager is also the server-to-server virtual volume target
for all my infrastructure's database backups.  The DB backups are thus
primary archive data, from the perspective of the LM instance.  I then
make offsite and onsite copies of these primary stgpools.  I end up
with three different physical copies of the same backup run.

Contrast with direct backups to volumes: You can do a normal full and
a snapshot, in the interest of having something to take offsite and
something to keep onsite.  But they are -different- backups.  They
require different procedures, and only one of them can (for example)
be used as part of a full/incremental scheme.

Further, you have to re-do work.  If you want "a backup onsite, and a
backup offsite", you have to run two backups; you can't copy a DB
backup at all. More of your 24-hour clock occluded with DB-intensive
maintenance tasks.  Just what you need.

---

Media risk in the direct-backup case is the basic media failure risk
of the device in question.  Low for any modern media, astronomically
so for 3592-class volumes.  But not zero, as we all well know.

Media risk in my case is basic-media-failure _cubed_, assuming the
volumes fail independently of each other.  I'll handwave around the
procedural risks, "did I manage to make my copies", and address that
separately.  If you'll grant me the copies, you can
clearly see that I need three different pieces of media to have failed
in order to miss my restore: the primary, the onsite copy, and the
offsite copy.
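
To put rough numbers on the "cubed" claim, here is a minimal Python
sketch.  It assumes every volume fails independently with the same
probability; the 1e-4 figure is a made-up placeholder, not a measured
failure rate for 3592 or any other media.

```python
def loss_probability(p_volume: float, copies: int) -> float:
    """Chance that every copy of one backup is unreadable.

    Assumes each volume fails independently with probability p_volume,
    so all `copies` volumes must fail at once to miss the restore.
    """
    if not 0.0 <= p_volume <= 1.0:
        raise ValueError("p_volume must be a probability between 0 and 1")
    if copies < 1:
        raise ValueError("need at least one copy")
    return p_volume ** copies


if __name__ == "__main__":
    p = 1e-4  # hypothetical per-volume failure probability (assumption)
    scenarios = [
        ("direct backup, one volume", 1),
        ("primary + onsite + offsite", 3),
        ("primary + four copy pools", 5),
    ]
    for label, copies in scenarios:
        print(f"{label:<28} {loss_probability(p, copies):.1e}")
```

The independence assumption is the weak point: a site-level disaster
takes out the primary and the onsite copy together, which is the
reason one of the copies lives offsite.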

Better still, if you want more belts and suspenders, go to town.  Two
copy stgpools?  Why not four: two onsite, two offsite!  We could go for
one-googolth risk levels.  That'd be silly, but achievable.
One-molarth is probably adequate for humans.

---

I handwaved at procedural risks, but I don't intend to just ignore
them: Yes, you have to maintain the copy stgpools in order to get that
increased security.  But we do that all the time, every day.  If our
TSM administrative scheduling isn't adequate to maintain a few small
copy pools (mine total under 3 TB each) it's not adequate to manage the
DB backups in the first place.

---

Note I haven't specifically addressed 'more than one DB backup on a
tape' yet.  It's offstage, behind that 'the DB backups are primary
data, from the perspective of the LM instance' dodge.  I've managed my
servers' DB backups in a variety of ways.  Right now, I collocate them
by node, to prevent server_a from occluding a restore by server_b,
but all the fulls and incrementals for a given machine are on one
tape.

---

Finally, don't be misled by the eggs-to-basket ratio.  It's an
emotionally persuasive argument, but irrelevant to your needs.  You
don't care about the other eggs, the other DB backups: you care about
a particular one.

If you wanted Monday's full, and a tape has gone bad, this doesn't
somehow mean you want Friday's full instead.  This means you're
falling back.  What I'm suggesting is that you 'fall back' to another
copy of the full backup you wanted in the first place.
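
The eggs-to-basket point can be checked by brute force.  The sketch
below is illustrative only: the tape names, the backup labels, and
the 1% failure probability are all hypothetical, and tapes are
assumed to fail independently.  It enumerates every combination of
failed and surviving tapes and adds up the cases where no surviving
tape holds the backup you want.

```python
from itertools import product


def p_unrestorable(target, layout, p_fail):
    """Probability that no surviving tape holds `target`.

    layout maps a tape name to the set of backups stored on it.
    Every tape is assumed to fail independently with probability p_fail.
    """
    tapes = list(layout)
    total = 0.0
    # Each state marks every tape as failed (True) or healthy (False).
    for state in product([True, False], repeat=len(tapes)):
        prob = 1.0
        for failed in state:
            prob *= p_fail if failed else 1.0 - p_fail
        survivors = [t for t, failed in zip(tapes, state) if not failed]
        if not any(target in layout[t] for t in survivors):
            total += prob
    return total


P_FAIL = 0.01  # hypothetical, deliberately high so the output is readable

layouts = {
    # One backup per tape, a single copy of each.
    "separate tapes": {"tape_mon": {"mon_full"}, "tape_fri": {"fri_full"}},
    # Both backups on one tape, still a single copy.
    "shared tape": {"tape_1": {"mon_full", "fri_full"}},
    # Both backups on one tape, plus onsite and offsite copies of it.
    "shared, 3 copies": {
        "primary": {"mon_full", "fri_full"},
        "onsite": {"mon_full", "fri_full"},
        "offsite": {"mon_full", "fri_full"},
    },
}

for name, layout in layouts.items():
    print(f"{name:<18} {p_unrestorable('mon_full', layout, P_FAIL):.2e}")
```

In this model the first two layouts give the same chance of losing
Monday's full: what else shares the tape does not enter into it.
Only the number of copies of the backup you actually want moves the
result.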


- Allen S. Rout