Subject: Re: The Size Question
From: Josef Wolf <jw AT raven.inka DOT de>
To: amanda-users AT amanda DOT org
Date: Sat, 15 Jul 2006 11:58:25 +0200
On Sat, Jul 08, 2006 at 01:50:09AM +0200, Peter Kunst wrote:
> Jon LaBadie wrote:
> >On Fri, Jul 07, 2006 at 05:11:50PM -0400, lois AT zmanda DOT com wrote:

> >>Dump images spanning multiple media volumes:
> >And no more simple recovery of spanned dumps using
> >standard unix tools when amanda is not available.
> >That needs to be pointed out in any revision.
> 
> Indeed. Just had another case last week, where the dd method tells me 
> "this is chunk 3 of 4"... well, on which tapes do I find the other 
> chunks, I asked myself.

Run the amandatape program after every backup, and you will always have
the answer to this question right on your tape labels.


But back to the tape chunking: I don't like this static chunking.  You
have to choose between wasting a lot of tape and having lots of chunks,
which makes manual recovery fragile.  So you trade waste against
inconvenience.

It would be much better if the chunking algorithm took into account
how much tape is left when sizing the next chunk.  Something like:

  chunkfactor 3/4  #  0<=chunkfactor<=1
  minsize     1GB

The chunkfactor specifies what fraction of the remaining tape is
allocated to the next chunk.  The minsize specifies the minimum size
of a chunk, so we don't end up with a large number of ever-shrinking
chunks.
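
A minimal sketch of the sizing rule (Python; the name next_chunk_size
is mine, this is not existing Amanda code):

  def next_chunk_size(remaining, chunkfactor, minsize):
      # Allocate chunkfactor of the tape space still left, but never
      # schedule a chunk smaller than minsize.  The caller has to
      # handle the case where the result no longer fits on the tape.
      return max(chunkfactor * remaining, minsize)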

With the above specification and a tapesize of 100GB we would get:

 first  chunk:  75.00GB==3/4*(100)
 second chunk:  18.75GB==3/4*(100-75)
 third  chunk:   4.69GB==3/4*(100-75-18.75)
 fourth chunk:   1.17GB==3/4*(100-75-18.75-4.69)
 fifth  chunk:   0.29GB==3/4*(100-75-18.75-4.69-1.17)  # forced to 1GB

 # Since the fifth chunk is less than minsize, it is forced to minsize (1GB).
 # If it doesn't fit on the tape, we start over on the next tape:

 fifth  chunk:  75.00GB==3/4*(100)    # goes to a new tape

On the new tape, the fifth chunk would again start over at 75GB because
the new tape has 100GB available again.
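
Spelled out as a loop (again just a sketch, building on next_chunk_size
above; it reproduces the numbers of the example, give or take rounding):

  def plan_chunks(tapesize, chunkfactor, minsize):
      # Chunk sizes for one tape; stop when the next chunk would
      # run over the end of the tape.
      remaining = tapesize
      chunks = []
      while remaining > 0:
          chunk = next_chunk_size(remaining, chunkfactor, minsize)
          if chunk > remaining:     # end of tape: go on with next tape
              break
          chunks.append(chunk)
          remaining -= chunk
      return chunks

  print(plan_chunks(100, 3/4, 1))   # [75.0, 18.75, 4.6875, 1.171875]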

We ended up with only four chunks on the first tape (the fifth hit the
end of the tape and is restarted on the next one) and wasted only 0.39GB
(that is 0.39%) of the tape.  We will never waste more than 1%
(minsize/tapesize) and we will never get more than four chunks of a
given dump on a single tape.

(In contrast, the current algorithm ends up with 100 chunks if you
don't want to waste more than 1%.)

With a growing chunkfactor (the limit is 1), you get fewer chunks, but
you risk wasting more tape if you get an early write error.  So now
you trade risk against waste (instead of inconvenience against waste,
as in the current algorithm).  With reliable tapes, you can use a
higher chunkfactor and end up with a low number of chunks.  With
unreliable tapes it is better to use a lower chunkfactor, because the
earlier you get a write error, the more tape you waste.  Either way,
you get bigger chunks at the start of the tape, matching the
assumption that the probability of write errors is higher toward the
end of the tape.


For vtapes (where there is no risk of write errors), chunks can be
made exactly the size needed to fill the vtape:

  chunkfactor 1
  minsize     0

So we never waste any disk space on vtapes any more!
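
Plugged into the plan_chunks sketch above, these values yield a single
chunk that exactly fills a (hypothetical) 100GB vtape, assuming the
dump is big enough to fill it:

  print(plan_chunks(100, 1, 0))     # [100] -- one chunk, zero waste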


The current chunking behavior can be achieved with:

  chunkfactor 0     # will always be smaller than minsize
  minsize     10GB  # thus all chunks will be 10GB
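
And indeed the sketch degenerates to fixed-size chunks with these
values:

  print(plan_chunks(100, 0, 10))    # ten 10GB chunks, just like today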


I have already implemented such a system in a different project (Jon,
you remember the ssh-based system I mentioned a couple of months ago?)
and I am pretty happy with this algorithm.


