ADSM-L

Fundamental Migration Design Flaw

2003-10-03 15:01:42
Subject: Fundamental Migration Design Flaw
From: Roger Deschner <rogerd AT UIC DOT EDU>
To: ADSM-L AT VM.MARIST DOT EDU
Date: Fri, 3 Oct 2003 13:59:53 -0500
I am having problems on my ITSM V5.1 server with a disk storage pool
completely filling up. When that happens, the server attempts to mount a
tape for each client backup session. There are of course not enough tape
drives, so everything comes crashing to a halt and a large number of
client nodes don't get backed up.

What was amazing when this first happened was that there was no
migration process running, despite HIGHMIG=75. The server had made no
attempt to protect itself. I started tracking the output of Q STGPOOL,
and I reread the doc three more times just to be sure, and I think I
have found the problem: open files. That is, files which are in the
process of being transmitted across the net from client nodes to the
server. File sizes are ballooning these days. Individual 1 GB files are
commonplace.

This disk storage pool has caching turned OFF.

 Storage      Device       Estimated     Pct     Pct   High   Low   Next Stora-
 Pool Name    Class Name    Capacity    Util    Migr    Mig   Mig   ge Pool
                                (MB)                    Pct   Pct
 -----------  ----------  ----------   -----   -----   ----   ---   -----------
 DESKTOPDIS-  DISK          42,000.0   100.0    57.4     75    25   DESKTOPTAP-
  KPOOL                                                              EPOOL

At that time, incredibly, migration was not running, and plenty of tape
drives were free. It could have saved the day. The server looks at the
Pct Migr number, not Pct Util, to tell when to start and stop migration.
With Pct Migr at 57.4, below the HIGHMIG of 75, migration never started
even though the pool was 100% full. THIS IS WORKING EXACTLY HOW IT IS
DESIGNED AND DOCUMENTED. And it is also very wrong.
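
As an illustration only, the behavior described above can be sketched in
a few lines of Python. This is not ITSM code. The function name and the
">=" comparison are assumptions made for the sketch; the numbers come
from the Q STGPOOL output above.

```python
def migration_should_start(pct_migr: float, highmig: float) -> bool:
    """Hypothetical model: the trigger compares Pct Migr, not Pct Util,
    against the HIGHMIG threshold."""
    return pct_migr >= highmig


# Values taken from the Q STGPOOL output in this post.
pct_util = 100.0   # pool is completely full
pct_migr = 57.4    # only this much counts as migratable
highmig = 75

starts = migration_should_start(pct_migr, highmig)
pool_is_full = pct_util >= 100.0

# The pool is full, yet the modeled trigger stays off.
print("migration starts:", starts)
print("pool is full:", pool_is_full)
```

Under this model the trigger can stay off indefinitely, because the space
that fills the pool is never counted toward Pct Migr.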

To verify what I was seeing, I restarted the server, and Pct Util
dropped to 60%. Yup, it's open files.
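
For a rough sense of scale, the drop in Pct Util across the restart can
be turned into megabytes. The sketch below is plain arithmetic on the
figures quoted in this post; the function name is hypothetical, and the
result is only an estimate because the pool capacity itself is an
estimated value.

```python
def space_freed_mb(capacity_mb: float, util_before: float, util_after: float) -> float:
    """Estimate the space released when Pct Util drops from one value to another."""
    return capacity_mb * (util_before - util_after) / 100.0


CAPACITY_MB = 42_000.0   # Estimated Capacity from Q STGPOOL
UTIL_BEFORE = 100.0      # Pct Util while the pool was full
UTIL_AFTER = 60.0        # Pct Util after the server restart

freed = space_freed_mb(CAPACITY_MB, UTIL_BEFORE, UTIL_AFTER)
print(f"approx. {freed:,.0f} MB was held by in-progress transfers")
```

With the figures above, that is 40% of the pool, or about 16,800 MB.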

1. I know I need a lot more disks! Hardware arrives at its own pace,
dictated by budgets, Purchasing Departments, how long it takes me to
bolt it into the rack, how long dsmfmt takes (too long), etc. More disks
should be online by sometime next week, I hope.

2. I have already adjusted the settings to limit the number of sessions
and spread out the scheduled backups for the entire night, from 5PM to
8AM. I cannot spread it any further.

3. To deal with this, I am lowering the migration threshold until the
pool no longer fills up. During tonight's backup window, it will be set
at 15%, even though that is a bit extreme. Basically, it will migrate any
closed file almost as soon as it is closed. That's no way to run a
railroad. Is there any other workaround, or could this perhaps be fixed?
Migration algorithms should work to prevent fillups like the one I am
experiencing, but they don't, so the design is broken.

Roger Deschner      University of Illinois at Chicago     rogerd AT uic DOT edu
============ "In theory, theory and practice are the same, =============
========= but in practice, theory and practice are different." =========
