Tape Spanning - 105 hour backup of 1.3 Tb

This post documents our experience with the version 2.5.0p2 tape spanning
option in the hopes that someone has suggestions for getting it working
acceptably. Even a "me too" would be useful information.

System info:
============
Solaris 8 (sparc) 64-bit
SunFire V880 (4-way) with 8Gb of memory
Amanda 2.5.0p2 compiled as a 32-bit executable using gcc 3.4.5
SDLT 600 tape robot (~400Gb per tape x 13 tapes)
4 individual 36Gb holding disks
most DLE's ~ 36Gb to ~72Gb, total ~300Gb
2 large DLE's of ~500Gb each

  NOTE: the 500Gb volumes are larger than one tape and also larger
  than the total size of our holding disks

Goal:
=====
Our normal daily backup works fine with the above configuration, but excludes
the 500Gb volumes, which we back up via other means.  With the advent of the
2.5.0 tape spanning feature, we've been attempting to implement a second weekly
Amanda configuration to back up *EVERYTHING* onto one set of tapes for off-site
disaster recovery purposes.

Conclusions:
============
1) Tape splitting does seem to (sort of) work with the default tiny
   memory buffer size, but it doesn't seem possible to speed it up by
   using a larger, more efficient buffer (disk or memory).
2) The lack of a decent buffer makes backups as much as 10 times 
   slower than a "normal" amanda backup. Our experience was 
   1.3 Tb backed up in 105 hours (4.4 days!). 
3) Using the default buffer size results in many thousands of split files
   on each tape rather than the preferred ~10 files.
4) Using a disk buffer of > 2Gb does not work as this seems to cause
   mmap problems (I assume this is a 32-bit limitation). This means
   that using a splitsize of ~10% of tape size as recommended is
   impossible.
5) Using a disk buffer of < 2Gb works for the first DLE, but after 
   that amanda always falls back to using a memory buffer of size 
   fallback_splitsize. 
6) Using a reasonable fallback_splitsize (e.g. 1Gb) causes amanda to
   run out of memory after a few DLEs. 
7) Using the default fallback_splitsize of 10Mb does seem to work,
   but it is very slow.
8) It's possible to compile amanda as a 64-bit executable, but this
   caused other problems.
9) The diskbuffer code seems to waste a lot of time recreating and
   zeroing out the buffer for each DLE.
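
To put conclusions 3 and 4 in numbers, here is a rough Python sketch that
counts how many split files land on one ~400Gb tape for a given split size.
The helper name chunks_per_tape is purely illustrative (it is not part of
Amanda), 1Gb is treated as 1024Mb, and filemarks and compression are ignored.

```python
def chunks_per_tape(tape_mb: int, split_mb: int) -> int:
    """Number of split files needed to fill one tape (ceiling division)."""
    return -(-tape_mb // split_mb)

TAPE_MB = 400 * 1024  # ~400Gb SDLT 600 tape, as listed under System info

for label, split_mb in [("10Mb default fallback", 10),
                        ("1Gb fallback", 1024),
                        ("2Gb disk buffer", 2 * 1024),
                        ("~10% of tape (40Gb)", 40 * 1024)]:
    count = chunks_per_tape(TAPE_MB, split_mb)
    print(f"{label:>22}: {count} split files per tape")
```

Under these assumptions the 10Mb default gives about 41 thousand split files
per tape, against 10 for a split size of ~10% of the tape.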

In other words, our experience is that tape spanning in 2.5.0p2
isn't currently workable for our 1.3 Tb system.

If anyone has successfully tape spanned a 1+ Tb backup at an acceptable
speed with a reasonable tape split size, please let me know.


Blow-by-blow details:
=====================
Since our largest volumes far exceed our holding disk space (and always will),
we configured the new run with:

dumpcycle 0             i.e. level 0 every time
runspercycle 0
runtapes 13             i.e. use all tapes if needed

and used a dumptype containing:

estimate server         i.e. we're dumping everything so minimize plan time
record no               i.e. don't mess up our daily set
strategy noinc          i.e. don't attempt incrementals
holdingdisk no          i.e. don't try to use a holding disk
tape_splitsize 30 Gb    i.e. write tape in 30Gb splits
split_diskbuffer "/a/volume/with/enough/space"
fallback_splitsize 2Gb  i.e. don't fall back to the tiny 10Mb default

TRY #1:
-------
With these settings, the run core dumped within a few minutes. Looking at the
code, we noted that:

  - the tape split code uses mmap to map the diskbuffer
  - our amanda is compiled 32-bit which implies a maximum address space
    of ~4Gb, far less than the 30Gb buffer we had requested
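
The address-space argument can be sketched in a few lines of Python. The
function name fits_in_address_space is hypothetical, and the test is only a
nominal upper bound: code, stack, heap and shared libraries share the same
address space, so the practical limit for a single mapping is lower.

```python
GB = 1024 ** 3

def fits_in_address_space(buffer_bytes: int, pointer_bits: int = 32) -> bool:
    """True if the buffer is smaller than the nominal 2**pointer_bits space.

    Upper bound only: a real process cannot devote its whole address
    space to one mapping.
    """
    return buffer_bytes < 2 ** pointer_bits

print(fits_in_address_space(30 * GB))                   # 30Gb split: False
print(fits_in_address_space(2 * GB))                    # 2Gb split: True
print(fits_in_address_space(30 * GB, pointer_bits=64))  # 64-bit build: True
```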

TRY #2:
-------
Based on this, we changed the following options to fit within the 32-bit
limit:

tape_splitsize 2 Gb
fallback_splitsize 1Gb

This dumped the first DLE in 2Gb chunks (hooray!) as the following
amdump records indicate:

  driver: dumping megawatt:/vol02 directly to tape
  driver: send-cmd time 2.348 to taper: PORT-WRITE 00-00001 [cont'd]
    megawatt fffffeff9ffeffff07 /vol02 0 20060703 2097152 /amdump4 1048576
  taper: r: buffering 2097152kb split chunks in mmapped file [cont'd]
    /amdump4/splitdump_buffer_CQaaeV

Note that the diskbuffer size (2097152 kb) and fallback size (1048576 kb) are
listed in the PORT-WRITE record (this is relevant to the 64-bit case below).
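
For anyone comparing their own amdump files, the record can be pulled apart
with a short Python sketch. The field names below are guesses based only on
the positions seen in this log (they are not taken from the Amanda protocol
documentation), the [cont'd] continuation lines are assumed to have been
rejoined first, and paths containing spaces are not handled.

```python
from typing import NamedTuple

class PortWrite(NamedTuple):
    # Field names are assumptions inferred from this one log excerpt.
    handle: str
    host: str
    features: str
    disk: str
    level: int
    datestamp: str
    split_kb: int
    diskbuffer_path: str
    fallback_kb: int

def parse_port_write(record: str) -> PortWrite:
    """Split a rejoined PORT-WRITE record into named fields."""
    fields = record.split()
    if len(fields) != 10 or fields[0] != "PORT-WRITE":
        raise ValueError(f"unexpected record: {record!r}")
    (handle, host, features, disk, level,
     datestamp, split_kb, path, fallback_kb) = fields[1:]
    return PortWrite(handle, host, features, disk, int(level),
                     datestamp, int(split_kb), path, int(fallback_kb))

rec = parse_port_write(
    "PORT-WRITE 00-00001 megawatt fffffeff9ffeffff07 /vol02 0 20060703 "
    "2097152 /amdump4 1048576")
KB_PER_GB = 1024 ** 2
print(rec.split_kb // KB_PER_GB, "Gb disk buffer,",
      rec.fallback_kb // KB_PER_GB, "Gb fallback")
```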

After the taper indicated that it had successfully completed taping the first
DLE (/vol02), the next DLE produced the following records in amdump:

  driver: error time 1600.885 serial gen mismatch
  driver: dumping megawatt:/vol06 directly to tape
  driver: send-cmd time 1601.080 to taper: PORT-WRITE 00-00002 [cont'd]
     megawatt fffffeff9ffeffff07 /vol06 0 20060703 2097152 /amdump4 1048576

And in the log file:

  INFO taper mmap failed (Not enough space): using fallback split [cont'd]
    size of 1048576kb to buffer megawatt:/vol06.0 in-memory

The second DLE (/vol06) *WAS* dumped to tape in 1Gb chunks as per the
log message. Similarly, DLE's 3-5 were backed up in 1Gb chunks.

DLE 6 also fell back using a 1Gb memory buffer, but then failed with the
following message in the log file:

  FATAL taper taper.c@511: memory allocation failed [cont'd]
    (1073741824 bytes requested)

TRY #3:
-------
We were a bit puzzled by the memory allocation failure above since the
machine has lots of unused memory, but decided to go ahead with another
test using:

tape_splitsize 2 Gb
fallback_splitsize 10Mb (i.e. the default value)

Again, the first volume dumped using the disk buffer of 2Gb, but subsequent
DLE's were split up into 10Mb chunks (about 150 thousand in all) over 4
tapes. The entire backup took 105 hours or about 4.4 days to complete, which
is about 10 times slower (in terms of Gb/hr not total time) than our normal
backups (we normally get ~100Gb/hr from our SDLT 600, which is fairly close
to the manufacturer's advertised transfer rate).
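
As a quick check of these figures, the effective rate can be worked out in
Python. This is a sketch only: it takes 1.3 Tb and 105 hours at face value,
treats 1Tb as 1024Gb, and uses the ~100Gb/hr normal rate quoted above.

```python
def effective_rate_gb_per_hr(total_gb: float, hours: float) -> float:
    """Average throughput over the whole run, in Gb per hour."""
    return total_gb / hours

NORMAL_GB_PER_HR = 100.0  # typical rate of our daily runs

spanned = effective_rate_gb_per_hr(1.3 * 1024, 105)
slowdown = NORMAL_GB_PER_HR / spanned

print(f"spanned run: {spanned:.1f} Gb/hr")
print(f"slowdown vs normal: {slowdown:.1f}x")
```

On these inputs the spanned run averages under 13 Gb/hr, a roughly
eight-fold slowdown and the same order as the factor quoted above.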

Note that this (poor) performance is similar to that described by
Paul Graf in his post of 29Jun2006 titled "77 hour backup of 850Gb".

TRY #4:
-------
Given the completely unacceptable performance using a 10Mb splitsize, we
decided to try compiling amanda 64-bit based on the assumptions that:

  - 32-bit limitations might have been the cause of both large
    disk and memory buffers failing
  - lack of a proper buffer was the root cause of the performance
    issue

To compile 64-bit, we used the following gcc flags:

setenv CFLAGS "-m64 -mcpu=ultrasparc3"

along with buffer sizes as follows:

tape_splitsize 2 Gb
fallback_splitsize 1Gb

Although the 64-bit code compiled and ran, it seemed to get confused over the
buffer sizes. For example, the amdump PORT-WRITE record was:

  driver: send-cmd time 3.512 to taper: PORT-WRITE 00-00001 [cont'd]
    megawatt fffffeff9ffeffff07 /vol07 0 20060713 9007199254740992 [cont'd]
    /holdvol01/MEGABAK2_DISKBUFFER 4503599627380736

This *seems* to indicate that the requested diskbuffer size is 9 exabytes
(9 * 10^18 bytes) and that the requested fallback size is 4 exabytes despite
the fact that we specified 2G/1G. I'm guessing that these numbers are because
some portions of the code aren't 64-bit clean.
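
The two values in the record are suspiciously close to powers of two, which
the following Python sketch makes explicit. It assumes the sizes are in kb,
as in the 32-bit run; what actually produces them inside Amanda is not
something the log alone can tell us.

```python
split_kb = 9007199254740992     # disk buffer size from the PORT-WRITE record
fallback_kb = 4503599627380736  # fallback size from the same record

# The disk buffer size is exactly 2**53 kb, i.e. 2**63 bytes.
print(split_kb == 2 ** 53)
print(split_kb * 1024 == 2 ** 63)

# The fallback size is 2**52 kb plus a small remainder.
remainder_kb = fallback_kb - 2 ** 52
print(remainder_kb, "kb =", remainder_kb // 1024, "Mb")

# Approximate sizes in units of 10**18 bytes (exabytes).
print(round(split_kb * 1024 / 1e18, 2), round(fallback_kb * 1024 / 1e18, 2))
```

The 10240 kb remainder happens to equal the 10Mb default fallback_splitsize,
which may or may not be a coincidence.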

Since these buffer sizes were not available, the backup didn't seem to be
splitting any of the DLE's so we terminated it.

MISC:
-----
Following these attempts, we had a look at the disk buffering code
and noted that:

  - the disk buffer seems to be re-created for every DLE
  - after creation, the entire buffer is zeroed out by writing 1024 byte
    blocks of zeroes to it
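
For illustration, here is a small Python sketch contrasting that zero-fill
pattern with simply setting the file length before mapping it. This is not
Amanda's code: the function names are made up, and the buffer is a tiny
64kb stand-in. Note that a file sized by truncation is usually sparse, so a
later write through the mapping can fail if the filesystem fills up, which
may be the reason for pre-filling the buffer.

```python
import mmap
import os
import tempfile

def create_by_zero_fill(path: str, size: int, block: int = 1024) -> None:
    """Write `size` bytes of zeroes in small blocks (the slow pattern)."""
    zeros = bytes(block)
    with open(path, "wb") as f:
        remaining = size
        while remaining > 0:
            n = min(block, remaining)
            f.write(zeros[:n])
            remaining -= n

def create_by_truncate(path: str, size: int) -> None:
    """Set the file length in a single call, without writing any data."""
    with open(path, "wb") as f:
        f.truncate(size)

def read_back_via_mmap(path: str, size: int) -> bytes:
    """Map the file read-only and copy out its contents."""
    with open(path, "rb") as f:
        with mmap.mmap(f.fileno(), size, access=mmap.ACCESS_READ) as m:
            return m[:]

SIZE = 64 * 1024  # tiny stand-in for a multi-Gb split buffer

with tempfile.TemporaryDirectory() as d:
    filled = os.path.join(d, "zero_filled")
    sparse = os.path.join(d, "truncated")
    create_by_zero_fill(filled, SIZE)
    create_by_truncate(sparse, SIZE)
    same = read_back_via_mmap(filled, SIZE) == read_back_via_mmap(sparse, SIZE)
    print("identical contents:", same)
```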

Assuming that:

  - a single SCSI disk can maintain a throughput of ~50Mbyte/s
  - a reasonable split buffer size would be ~50Gb

then just zeroing the diskbuffer would take at least ~17 minutes per DLE
(50Gb at 50Mbyte/s is about 1000 seconds), and probably longer given the
small 1024 byte writes. In our case (~40 DLEs), this means that over 11 hours
would be spent just zeroing the buffer file before even doing any real work!
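
The estimate can be reproduced with a few lines of Python. This is an
idealized sketch: it assumes the disk sustains its full sequential rate,
treats 1Gb as 1024Mb, and ignores any overhead from the small block size,
so real times would presumably be longer. The function name is illustrative.

```python
def zeroing_time(buffer_gb: float, disk_mb_per_s: float, n_dles: int):
    """Return (minutes per DLE, total hours) spent zeroing the buffer."""
    seconds_per_dle = buffer_gb * 1024 / disk_mb_per_s
    minutes_per_dle = seconds_per_dle / 60
    total_hours = seconds_per_dle * n_dles / 3600
    return minutes_per_dle, total_hours

per_dle_min, total_hr = zeroing_time(buffer_gb=50, disk_mb_per_s=50, n_dles=40)
print(f"{per_dle_min:.0f} minutes per DLE, {total_hr:.1f} hours in total")
```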


=================================================================
Sean Walmsley                 sean AT fpp.nuclearsafetysolutions DOT com
Nuclear Safety Solutions Ltd.  416-592-4608 (V)  416-592-5528 (F)
700 University Ave M/S H04 J19, Toronto, Ontario, M5G 1X6, CANADA