Re: Squishing the unused space out of my ADSM database

1999-05-24 17:08:51
Subject: Re: Squishing the unused space out of my ADSM database
From: Brian D Chase <bdc AT WORLD.STD DOT COM>
Date: Mon, 24 May 1999 14:08:51 -0700
On Mon, 24 May 1999, Richard C. Dempsey wrote:

> It would seem that if we could take a DUMPDB file and sort it on the
> keys before handing it to LOADDB, then the LOADDB process would use
> space more efficiently and it would run faster.  However, since the
> DUMPDB file contains lots of tables, a naive sort of the whole file
> could be disastrous.
> I don't suppose that there's any documentation of the DUMPDB file
> format? On the one hand, sorting a table on the way out of the DB
> would seem pretty straightforward.  OTOH, in the circumstances for
> which DUMPDB/LOADDB was designed, it might not be feasible.  Thus, a
> SORTDB utility makes more sense.
> Is there a strong consensus that SORTDB is an important feature to
> badger IBM/Tivoli about?

I'd consider anything that significantly reduces the size and increases
the performance of my database to be a worthwhile endeavour.  I'd imagine
that my situation is a fairly extreme one: my environment is one in which
file counts fluctuate greatly over long cycles.  My high water mark is
very high, and my low water mark is relatively low (say 50% of my high
mark).  I don't know if other people see such extremes.

The text you forwarded on ADSM database structure was very insightful.
Thanks for sending it.

I could use some clarification on one aspect of what David Bohm said;
one of the server developers may have to answer this one for me.  I
suspect that the percent utilization number reported by the server is
calculated not from the internal utilization of the B-trees, but from
the space available for B-tree splits.  Let me try to clarify that with
a more concrete example if I can.

Let's say we're under ADSM V1, where the B-tree structures were simple and
non-optimized, so you averaged around 50% utilization within the B-tree
structures implementing the database.  Now under this V1 scenario, what is
reported when you fill up your database to the point where adding any more
items would cause a B-tree split?  You've exhausted the available space
into which a split can be made.  Would ADSM report the percent utilization
of the database as 100% (because no more space is available for splits) or
as 50% (because that's the actual internal utilization of the B-tree data
structures)?

I guess what I'm fundamentally after is the exact nature of what ADSM's
percent utilization number reflects.
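To make the two readings concrete, here's a toy calculation with made-up
numbers (none of these figures come from a real ADSM server):

```python
# Hypothetical figures, purely for illustration.
db_pages = 1000          # total pages the data base volumes can hold
allocated_pages = 1000   # pages claimed by b-tree nodes (no room left to split)
records_per_page = 100   # capacity of a leaf page, if packed completely full
records = 50000          # records actually stored (leaves averaging 50% full)

# Reading 1: utilization = fraction of pages claimed by the b-trees.
page_utilization = allocated_pages / db_pages  # 100%

# Reading 2: utilization = how full the claimed pages actually are.
internal_utilization = records / (allocated_pages * records_per_page)  # 50%

print(f"page-level: {page_utilization:.0%}, internal: {internal_utilization:.0%}")
```

Under reading 1 this database would report 100% utilized; under reading 2,
only 50%.  Which of these ADSM reports is exactly my question.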

> ---------------
> (David Bohm, ADSM server development):
> The ADSM server data base contains different objects; most of them are
> b-tree tables.  LOADDB uses more space than was actually used in the
> data base that was dumped with the DUMPDB command because of the
> algorithm used to perform the DUMPDB/LOADDB and the characteristics of
> a b-tree object...
> When a record is to be inserted into a node in a b-tree and that record
> does not fit, a split occurs.  In a standard b-tree algorithm, 1/2 of
> the data goes in one leaf node and the other 1/2 goes into another leaf
> node.  When this happens randomly over time, you get a tree where about
> 50% of the data base is unused space.  With the V2 ADSM server we added
> a little more intelligence in the split process.  There are many tables
> in the ADSM server where a new record will always be the highest key
> value in that table.  If the insert is the highest key value then
> instead of doing a 1/2 and 1/2 split we just add a new leaf node with
> that single record.  This results in closer to 100% utilization in each
> of the leaf nodes in the ADSM server.
> This now takes us to the DUMPDB/LOADDB process.  One of the purposes of
> this process is to recover from a corrupted data base index.  What this
> means is that we ignore the index during the DUMPDB and only dump the
> b-tree leaf nodes (plus another type of data base object called a
> bitvector).  These leaf nodes are not stored physically in the data base
> in key order, which means they get dumped out of key sequence.  The
> LOADDB will take the records from each of those leaf nodes and then
> perform inserts of those records into the new data base.  This means we
> take those pages that were nearly 100% utilized because of the efficient
> b-tree split algorithm and convert them into 50% utilized pages because
> of having to use the generic b-tree page split algorithm.
> We do not "compress" records in the data base.  The data in the data
> base is encoded to reduce space requirements.  The data will always be
> written in the encoded form to the data base as it is required for us to
> properly interpret the data in the data base pages.  This encoding is
> performed with any writes of records into the ADSM data base, including
> the LOADDB since it calls the same routines to perform the writes into
> the data base as the rest of the server functions.
> APAR IC13101 also describes this.
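For anyone who wants to see the effect David describes, here's a toy model
of leaf splits in Python.  It isn't ADSM code -- the leaf capacity and key
streams are invented -- but it contrasts the "highest key" split against
the generic half/half split that dominates during a LOADDB, where leaves
arrive out of key sequence:

```python
import bisect
import random

CAP = 8  # toy leaf capacity; real data base pages hold far more records

def insert(leaves, key, optimized):
    """Insert key into a list of sorted leaves, splitting a full leaf."""
    maxes = [leaf[-1] for leaf in leaves]
    i = min(bisect.bisect_left(maxes, key), len(leaves) - 1)
    leaf = leaves[i]
    if len(leaf) < CAP:
        bisect.insort(leaf, key)
    elif optimized and key > maxes[-1]:
        # V2-style split: a new highest key gets a fresh leaf to itself,
        # leaving the old leaf 100% full.
        leaves.append([key])
    else:
        # Generic split: half the records go to each leaf (~50% full each).
        right = leaf[CAP // 2:]
        del leaf[CAP // 2:]
        leaves.insert(i + 1, right)
        bisect.insort(leaf if key <= leaf[-1] else right, key)

def utilization(keys, optimized):
    keys = list(keys)
    leaves = [[keys[0]]]
    for k in keys[1:]:
        insert(leaves, k, optimized)
    return len(keys) / (len(leaves) * CAP)

random.seed(2)
keys = list(range(10_000))

# Normal server operation on an always-ascending table: leaves come out full.
ascending = utilization(keys, optimized=True)

# DUMPDB/LOADDB effect: the same records arrive as full leaves in random
# order, so the "highest key" optimization almost never applies and the
# generic split takes over.
chunks = [keys[i:i + CAP] for i in range(0, len(keys), CAP)]
random.shuffle(chunks)
reloaded = utilization([k for chunk in chunks for k in chunk], optimized=True)

print(f"ascending build:   {ascending:.0%}")   # essentially full leaves
print(f"after dump/reload: {reloaded:.0%}")    # well below 100%
```

When I run this, the ascending build comes out essentially 100% utilized
while the reloaded tree lands well below that, which matches the
near-halving David describes for pages rebuilt by LOADDB.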