Here are some more details, as promised. Thanks for your patience.
Please spread the good news!
Cyndie Behrens (IBM San Jose)
======================================================================
On Monday June 23, 1997, IBM announced a new record for Oracle backup
and restore performance using ADSTAR Distributed Storage Manager
(ADSM) for AIX. Please see the press release on the ADSM web site:
http://www.storage.ibm.com/adsm. The information in the press release
and in the following Questions and Answers was accurate as of the
June 1997 benchmark completion date.
Here are some Questions and Answers to help you better understand:
o The benchmark results
o The configurations used for the benchmarks
o The factors that affected the performance results
o Why these results are record-breaking in the industry
o When more information will be available publicly to customers and
internally to IBMers
A performance whitepaper is planned by the end of July 1997 and will be
published on the ADSM web site. Additional information can be
obtained from IBM.
1) Why was this VLDB benchmark performed?
Customers are relying more and more on databases for their critical
data. Storage management solutions that address a variety of
fast backup and recovery scenarios are mandatory. IBM wanted to
demonstrate how the combination of key IBM hardware and software
solutions, along with backup and recovery solutions provided by
Oracle Corporation, solve VLDB management issues today.
2) What organizations were involved in this benchmark?
The organizations involved with this benchmark included:
IBM's RISC System/6000 Division, IBM's Storage Systems Division
(with ADSM, 7133 serial disk drives, Magstar 3590 tape drives,
and a 3494 Tape Library), IBM's Teraplex Integration Center,
and the groups at Oracle Corporation who develop and support
Oracle Parallel Server (OPS) and Oracle Enterprise Backup
Utility (EBU).
3) I understand that the benchmarks were conducted at the IBM RS/6000
Teraplex Integration Center. What is this center?
IBM is one of the first companies to provide large-scale integration
testing and verification facilities focused on data warehouse,
data mart, and data mining environments. IBM's Teraplex
Integration Centers have been designed to integrate, optimize, and
stress-test very large business intelligence systems and
applications. These centers address the market's increasing
reliance on very large databases for business critical operations.
IBM's RS/6000 and S/390 Teraplex Integration Centers are located
in Poughkeepsie, NY, and the AS/400 Teraplex Integration Center
is located in Rochester, Minnesota.
4) What were the record-breaking performance results?
We believe several aspects of the results were record-breaking:
o This was one of the largest databases used for Oracle backup and
restore measurements, ranging from 62 GB to 744 GB, depending on
the test. The results achieved, for example 736 GB backed up in
less than 1 1/2 hours and restored in less than 2 hours, were real
measurements, not extrapolated or theoretical numbers, as have been
used on occasion by our competition.
o The rates were wall clock rates, that is, total elapsed time for
the operation, not a maximum data transfer rate. The wall clock
rates were the real time it took for the operation, including
the time for mounting the tapes.
o The restore rates were very comparable to the backup rates; they
were all within 15%. And some restores were actually faster
than the backups!
5) What is the difference between extrapolated numbers and real
measurements?
If, for example, a test backed up a 100 GB database in 30 minutes,
the extrapolated rate would be 200 GB per hour. A 200 GB database
was never actually backed up in one hour; instead, an assumption
was made that is not necessarily true: that if a 100 GB database
was backed up in 30 minutes, you can simply multiply by two to get
an hourly backup rate.
Extrapolated results, in theory, can only provide the best case
estimation because they assume linear results can be achieved
as the size of the environment grows, but they ignore the reality
of the resource costs that accompany managing a larger environment.
Only by running the full length test can you determine the true
results. That's why IBM did it! These IBM published benchmarks
are real measurements. We actually took a 736 GB database and
measured how long it took to back it up and how long it took to
restore it. We know that the interactions between the various
hardware and software components all worked at a very fast rate,
even with such a large database.
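The arithmetic behind extrapolation can be sketched in a few lines of Python (illustrative only; the function name is our own, not part of any benchmark tooling):

```python
def extrapolated_rate_gb_per_hour(size_gb, minutes):
    # Naive linear extrapolation: scale a short run up to one hour,
    # assuming throughput stays constant as the workload grows.
    return size_gb * (60.0 / minutes)

# A 100 GB backup that took 30 minutes extrapolates to 200 GB/hour,
# even though a 200 GB backup was never actually run.
print(extrapolated_rate_gb_per_hour(100, 30))  # prints 200.0
```

The extrapolation assumes the rate holds as the database grows, which the resource costs of a larger environment may not allow.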
6) What is the difference between total elapsed time (wall clock time)
and maximum data transfer rates?
The results IBM published show the actual duration of the
backup and restore. If we started the backup at 10:00 am and it
finished at 11:30 am, then the wall clock rate is 1 1/2 hours.
This wall clock rate includes all activities required for the
backup or restore operation to complete successfully, including
all processing time and tape mount time.
Maximum data transfer rates, and similar terms such as "burst"
or "peak" rates, describe a rate achieved at one snapshot in
time. They are not a measure of how long an operation takes to
complete from start to finish.
Think of a marathon runner who runs over 26 miles. He/she may run
some miles in four minutes (peak rate) but what counts is how long
it takes from start to finish (total elapsed time). You would not
let the runner run 13 miles and then just multiply by two to get
his/her score!
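The wall clock calculation is simple elapsed time. A minimal Python sketch, using the illustrative 10:00 am to 11:30 am example above (the date is arbitrary):

```python
from datetime import datetime

def wall_clock_hours(start, finish):
    # Total elapsed time from start to finish, which in a real run
    # includes all processing time and tape mount time.
    return (finish - start).total_seconds() / 3600.0

start = datetime(1997, 6, 23, 10, 0)    # backup started at 10:00 am
finish = datetime(1997, 6, 23, 11, 30)  # backup finished at 11:30 am
print(wall_clock_hours(start, finish))  # prints 1.5
```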
7) What levels of software were used in these benchmarks?
These benchmarks were run with:
o AIX 4.1.5
o Parallel System Support Programs (PSSP) 2.2
o Oracle Parallel Server (OPS) 7.3.2.3
o Oracle Enterprise Backup Utility/Parallel Version (EBU/PV)
Version 2.0.12.4.1
o ADSM V2 AIX client 2.1.6
o ADSM V2 AIX server 2.1.5.12 and 2.1.5.13
o ADSMConnect Agent for Oracle on AIX
8) What hardware was used in these benchmarks?
The hardware used in these benchmarks included:
o An IBM RS/6000 Scalable POWERparallel System (RS/6000 SP)
o IBM 7133 Serial disk drives
o IBM Magstar 3590 tape drives
o A 3494 Tape Library
9) What communication protocols were used in these benchmarks?
Tests were completed with both shared memory and TCP/IP. All
communications occurred over the SP switch. In addition, the
EBU client performed all of its data read and write operations
using virtual shared disk (VSD) read/write protocols over the
SP switch.
10) What configuration was used in the benchmarks?
OPS backup and restore was tested in a variety of configurations
which included up to 16 nodes of an SP, and up to 16 3590 tape
drives. Eight of the 3590s were housed in a 3494 Tape Library.
The other eight were stand-alone 3590s with Automated Cartridge
Facilities (ACFs). All tape handling was automated. All
evaluations were conducted within the framework of the SP. A
variety of ADSM server and client node configurations were evaluated.
11) How many tape drives were used in total and per SP node?
Up to 16 tape drives were used, with one to four tape drives
per ADSM server node.
12) What processors were used in the SP?
Different SP node configurations were used for each evaluation
including:
o Eight 67 MHz Power2 thin nodes
o Sixteen 120 MHz P2SC thin nodes
o Sixteen 8-way 112 MHz PowerPC 604 high nodes
13) Did you use any special software setup or tuning parameters?
We used the latest ADSM tuning parameters, including large buffers,
the SP switch and its settings, a specific physical database
layout on which EBU read and wrote data, different file sizes, and
tape compression, all of which helped drive the 3590 tape drives
at a very high rate.
14) These measurements were for OPS. My customer doesn't have OPS, but
instead has non-OPS Oracle7 databases which are not running on
an SP. What results can I expect?
An accurate estimate of potential performance would be that
which is achievable by the ADSM client running in the same
environment processing like-sized files. Keep in mind, however,
that with EBU you can run multiple backup and restore streams,
each of which would be an ADSM client session.
15) You mention that some of the measurements used up to 16 ADSM
servers but I've seen results published by other vendors that
indicate only one server was used in their environments.
None of the published reports we have seen to date
mention the number of servers used. In any case, EBU/PV has
special support to manage multiple ADSM servers transparently.
EBU/PV can send data to multiple ADSM servers simultaneously,
and restore the data from the appropriate server.
16) When will the EBU/PV function be available for environments
other than OPS on the SP2?
EBU/PV is available today for OPS in the SP2 environment.
The EBU/PV function is planned to be incorporated into the base
EBU code with EBU 2.2, which Oracle targets for an August
availability. This would make EBU/PV functions, such as the
transparent management of multiple ADSM servers, available in
non-OPS environments as well as in OPS environments with other
hardware configurations (for example, clusters of RISC System/6000s).
17) I understand that EBU not only provides multiple parallel data
streams for backup and restore, but also multiplexes data from
multiple disks to each of the data streams, if you choose to
configure it to do so. How much of a factor was multiplexing
in your ability to drive the 3590s at such a fast average data
transfer rate of 9 MB/second?
In theory, the greatest benefit from multiplexing is when
slow multiple client disks can be read from simultaneously, and
then combined into a single data stream which can then get written
sequentially to a fast tape device on the ADSM server. That is,
multiplexing is supposed to aid in "speeding up" the slower device.
While we did get some benefit from multiplexing, the effective disk
read request rates did not scale linearly as more disks were
accessed in parallel; because the disks were not a performance
bottleneck in our environment, this was not a limiting factor. In
fact, throughput actually suffered when an inefficient multiplexing
strategy was used.
The biggest factors in achieving the high data transfer rates to the
3590s were using:
o ADSM's large buffer support
o The SP switch and setting its parameters appropriately
o A physical database layout on which EBU read/wrote the data
o Appropriate file sizes
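As a back-of-envelope check (our own arithmetic, assuming all 16 drives were active for the full run and 1 GB = 1024 MB), the quoted figures are consistent with roughly 9 MB/second sustained per drive:

```python
def avg_mb_per_sec_per_drive(total_gb, hours, drives):
    # Average sustained rate per drive over the whole wall-clock run,
    # assuming the work is spread evenly across all drives.
    return (total_gb * 1024.0) / (hours * 3600.0) / drives

# 736 GB in about 1.5 hours across 16 drives is roughly 8.7 MB/s
# sustained per 3590 drive, close to the quoted 9 MB/second average.
print(round(avg_mb_per_sec_per_drive(736, 1.5, 16), 1))  # prints 8.7
```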
18) Did you use ADSM compression?
ADSM compression is done on the ADSM client and is of most value
when you have a slower network and you want to reduce the amount
of data you send across the network. We did not use ADSM
compression because we had a fast network. We did use the 3590
tape compression; tape compression is done after the data is
sent through the network.
19) What was the average CPU utilization for the benchmarks?
The CPU utilization varied depending on the configurations of the
tests, but was as low as 20%.
20) What happened when you added more tape drives to the ADSM server
nodes?
The solution's excellent scaling characteristics allowed
flexibility in meeting customer needs. We measured linear
scalability when a second drive was added, and achieved additional
throughput by adding more tape drives. With three or four tape
drives we continued to see valuable throughput gains.
21) What if I am running Oracle on a different platform, such as
Sun or HP?
The architecture of the hardware and operating system is a key
factor in your performance results. We are looking into making
measurements on other platforms.
22) Backup rates are important, but what I really care about are restore
rates. How did your restore rates compare to your backup rates?
Our restore rates were highly comparable to our backup rates;
in fact the restore rates were consistently within 15% of our
backup rates. Some restores were even faster than the backups.
We are unaware of any competitive results that are even close to
this level of performance!
23) Did you need to use 3590s to achieve these performance results?
A key performance factor is the type of tape drives you use,
especially when you back up directly to tape. The 3590s
were key to the results we achieved. Using 3590 tape compression
also improved performance, and fewer tapes were needed to store
the backup data. It is the balance of all products working
together that made our results possible.
24) What can we expect from other VLDB environments, such as the new
IBM DB2 Universal Database Server (UDB) or Oracle in an SAP R/3
environment?
We are currently performing benchmarks with DB2 UDB and
BACKINT/ADSM and plan to publish the results when they are
complete.
25) What should we expect from Oracle 8 in terms of backup and recovery
performance?
Oracle provides a new and enhanced backup and recovery facility,
Recovery Manager (RMAN) for Oracle 8 databases. EBU will
continue to be the facility to use for Oracle 7 databases. The
interface from RMAN to ADSM is expected to be identical to the
interface from EBU to ADSM. There are some RMAN enhancements that
may improve throughput, such as its new true incremental
support. With true incremental support, both backup and recovery
may show performance improvements.
26) Were the test databases fully populated?
No, our databases were about 80% full with representative data
because we wanted to test a typical Oracle customer environment.
27) Any other advice on how to configure a real customer environment?
Make sure you are using the most recent levels of the software and
device drivers.
28) Were these benchmarks made using ADSM V2 or V3?
Do you expect differences with V3?
These benchmarks were made using ADSM V2. General V3 performance
testing is in progress. Specific V3 testing with Oracle is
under consideration.
29) Did using PTF12 or PTF13 for the ADSM V2 server make
a difference?
In our environment, we saw no performance difference between
PTF12 and PTF13, but this does not mean that this will be the
case for all environments.
30) Were all the measurements done straight to tape, or were the
7133 serial disk drives used as an ADSM disk pool first and
then later migrated to tape?
All measurements were straight to tape.
31) What can you say about the linear scalability when adding
additional ADSM server nodes?
Performance scaled nearly linearly when we added additional
ADSM server nodes. In general, adding ADSM server nodes
improves performance and distributes the CPU load across
multiple servers.