Subject: DIRMC at one - observations
From: Tab Trepagnier <Tab.Trepagnier AT LAITRAM DOT COM>
To: ADSM-L AT VM.MARIST DOT EDU
Date: Wed, 9 Apr 2003 13:14:46 -0500
I've been using a DIRMC disk pool for a year now, and thought I should
share my observations about the use of that feature.

Environment:
TSM 4.1.5.0 on AIX 4.3.3 - ML 8 most of the time but ML 10 for the last
1-1/2 weeks.
RS/6000 2-way F50 until a 2-way 6H1 replaced it 1-1/2 weeks ago.
About 125 clients mostly Windows, but a few Unix, plus TDP Domino.
28 GB TSM DB
7 TB data online; 18 TB data total
Backups go to 3583-L72 LTO and 3575-L18 Magstar XL, four drives each
Copies go to HP 4/40 DLT 8000, four drives
All libraries are double-connected SCSI giving a max sustained throughput
of 32+ MB/s per library.

We expect to upgrade to (i)TSM 5.x within the next 1-2 months.

Suggestions based on my experiences:
1) Cap the size of the "folderdisk" (that's what I called my directory
disk pool);
2) Migrate a small percentage of the disk pool at a time - do not dump the
entire pool to one tape;
3) TSM database tuning is critical - if your DB performance is poor, this
is where you will see it (see the dsmserv.opt note after this list);
4) Alternatively, keep directory information in its own isolated data path
entirely;
5) Monitor "mount" requirements - my folderdisk is set for a maximum of 12
mounts - the most I have ever seen in service is eight.
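
On point 3, the main database knob in my case is the buffer pool, set with
BUFPOOLSIZE in dsmserv.opt.  The new server runs with a 768 MB buffer pool,
which as an option line (the value is in KB) looks like this:

   BUFPOOLSIZE 786432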

I had implemented a 2 GB folderdisk.  At first it was devclass DISK, but
it didn't take many objects on that pool to hint that copypool reclamation
would soon take weeks to accomplish.   So very early on, I recreated the
folderdisk with devclass FILE volumes.  That solved the copypool
reclamation issue.
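
For anyone building something similar, the FILE-based pool amounts to a
device class plus a sequential pool along these lines (names, paths, and the
next pool are illustrative, not my exact definitions):

   define devclass folderfile devtype=file maxcapacity=50m mountlimit=12 directory=/tsm/folderdisk
   define stgpool folderdisk folderfile maxscratch=20 nextstgpool=ltopool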

Monitor the maximum number of folderdisk volumes the system needs to mount
simultaneously.  I did that by checking the number of "filling" volumes
daily.  With my clients scattered across a 12-hour backup window (6 pm to
6 am), I never found more than 8 volumes filling, even though I originally
had 50 volumes in the pool.  Usage patterns show that most of the time the
system could get by with just two simultaneous mounts.
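
If you want to watch the same thing, a daily check along these lines is
enough (pool and device class names illustrative):

   query volume stgpool=folderdisk status=filling

The ceiling on simultaneous mounts itself is the MOUNTLIMIT on the FILE
device class, e.g. "update devclass folderfile mountlimit=12".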

Volumes don't have to be very large.  My system averages about 1 KB per
directory object, so a 10 MB volume will hold about 10,000 directories. My
current implementation uses 50 MB volumes.

At first I had migration set for HI=99, LO=0.  That is too high: with a pool
of many volumes, you have to be down to just one volume still filling before
utilization reaches 99%.  I dropped the HI point to 95% to ensure migration
would kick in while I still had more than one volume capable of accepting data.
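
The threshold change itself is just an UPDATE STGPOOL (pool name illustrative):

   update stgpool folderdisk highmig=95 lowmig=0

The LOWMIG side is what really bit me, as described next.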

I left the LO point at zero.  That was a BIG mistake.  About a week before
the cut-in of the new 6H1 server, the pool hit its HI point and began
migrating to the 3575.  That migration of about two million directory objects
took THREE DAYS.  That was on the old server - 2-way, 32-bit, 332 MHz, with
1 GB RAM and the DB carved into 38 1-GB volumes.

All two million directory objects went to the same 3570 tape.  That was
another BIG mistake.  Although that tape held two million directories, it
totalled less than 2 GB on a tape that held 7 GB uncompressed - and we've
averaged about 12 GB with compression.  So the tape also held another 10
GB or so of ordinary file data.

Much of that ordinary file data was purged during expiration one week after
the new server was placed into service.  When that tape hit its reclamation
point, TSM had to move all 2 million directory objects to another tape.  Even
on the new server with a 768 MB DB buffer pool and properly tuned VM, that
reclamation occurred at tens of KB per second, vs. the 5-10 MB/s that those
drives are capable of.  Since TSM allows only one reclamation process per storage
pool, the entire system appeared almost hung while TSM chewed on that
tape.  I finally cancelled the reclamation to force TSM to resume with the
next tape in the reclamation list.  When TSM resumed reclaiming the tape
with all the directories, I forced it to spread the directories across a
few tapes instead of putting them all on one destination tape.
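
For reference, breaking out of that state was the usual sequence - find the
reclamation process and cancel it (process number illustrative):

   query process
   cancel process 42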

So now, I have a 1 GB folderdisk ( 20 volumes x 50 MB) with migration
settings of HI=90 and LO=85.  It would be nice to think that only about
50,000 directories would migrate when the threshold was hit, but I believe
TSM starts with the node that has the largest amount of data in the pool.
But even if it migrates down to 80% as a result of node data size, that is
still only about 100,000 directories in that chunk, so future reclamations
of that output tape would only be delayed by an hour or two.

Also note that the DB will grow as you accumulate more directories.  So
I'm considering capping the number of directory versions I keep to support
point-in-time restores.
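
Since directories expire according to the backup copy group of the class they
are bound to, that cap would just be the version and retention settings on the
DIRMC class, followed by an ACTIVATE POLICYSET - a sketch only, with the
domain, policy set, class name, and numbers all illustrative:

   update copygroup standard standard dirclass type=backup verexists=60 verdeleted=5 retextra=180 retonly=365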

The DIRMC option is a very valuable tool.  It allows the TSM administrator
to control where directory information is saved, especially if they send
primary archive tapes offsite like we do.  But it is NOT a "set and
forget" feature.  You will have to keep an eye on it and try to avoid the
issues I've seen when using it.
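
For completeness, the binding is just a management class whose backup copy
group points at the directory pool, plus the one-line DIRMC option on the
clients - roughly (names illustrative):

   define mgmtclass standard standard dirclass
   define copygroup standard standard dirclass type=backup destination=folderdisk
   activate policyset standard standard

and in dsm.opt (dsm.sys on Unix):

   DIRMC dirclass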

Just an FYI...

Tab Trepagnier
TSM Administrator
Laitram LLC
