ADSM-L

Re: Performance Large Files vs. Small Files

2001-02-14 11:27:55
Subject: Re: Performance Large Files vs. Small Files
From: "Thomas A. La Porte" <tlaporte AT ANIM.DREAMWORKS DOT COM>
Date: Wed, 14 Feb 2001 08:28:21 -0800
Imagine it strictly from a database perspective.

Scenario 1: 15 files, 2GB each
Scenario 2: 15728640 files, 2KB each

In scenario one, your loop is essentially like this:

  numfiles = 15;
  for (i = 0; i < numfiles; i++) {
    insert file characteristics into database;
    request data be sent from client;
    store data in storage pool;
  }

In scenario two, the primary difference is that numfiles =
15728640:

  numfiles = 15728640;
  for (i = 0; i < numfiles; i++) {
    insert file characteristics into database;
    request data be sent from client;
    store data in storage pool;
  }


This means that, in the first scenario, there are 15 interactions
with the database, 15 system calls on the client for file
open/read operations, etc. In the second scenario, there are 15
*million* interactions with the database, 15 *million* file I/O
operations, etc.

Realistically, this is a bit of a simplification, as the
TXNGROUPMAX and the TXNBYTELIMIT parameters help to group the
files into transaction batches that can be larger than a single
file, which reduces the number of round trips to the database,
but the overall effect is still there.
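
To put rough numbers on the batching effect, here is a minimal
Python sketch. It assumes a simplified model in which a
transaction closes as soon as it holds TXNGROUPMAX files or
TXNBYTELIMIT kilobytes, whichever comes first. The function name
and the values 40 and 2048 are illustrative assumptions, not
settings taken from any particular server.

```python
import math

def estimate_transactions(num_files, file_size_kb, txn_group_max, txn_byte_limit_kb):
    """Estimate transactions for num_files equal-sized files (simplified model)."""
    # Number of files it takes to reach the byte limit. A single file larger
    # than the limit still occupies one transaction, hence the floor of 1.
    files_to_fill_bytes = max(1, math.ceil(txn_byte_limit_kb / file_size_kb))
    # The batch closes at whichever limit is hit first.
    files_per_txn = min(txn_group_max, files_to_fill_bytes)
    return math.ceil(num_files / files_per_txn)

GROUP_MAX = 40        # assumed files-per-transaction limit
BYTE_LIMIT_KB = 2048  # assumed kilobytes-per-transaction limit

large = estimate_transactions(15, 2 * 1024 * 1024, GROUP_MAX, BYTE_LIMIT_KB)
small = estimate_transactions(15728640, 2, GROUP_MAX, BYTE_LIMIT_KB)
print(large, small)  # prints: 15 393216
```

Under these assumed limits, batching cuts the small-file case
from about 15.7 million database round trips to 393216, which is
still far more than the 15 needed for the large files.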

Although you may be transferring the same amount of aggregate
data, you have to factor in the overhead of each individual
transfer. The overhead per file may be small, but if you multiply
that small number by a factor of about a million (six orders of
magnitude), you generally end up with a big number.
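
The arithmetic can be sketched in a few lines of Python. The
model is deliberately crude: total time is the raw transfer time
plus a fixed cost per file. The 10 MB/s transfer rate and the
2 ms per-file overhead are made-up figures for illustration only,
not measurements from any TSM installation.

```python
def backup_seconds(num_files, file_size_bytes, bytes_per_second, per_file_overhead_s):
    """Transfer time plus a fixed per-file cost (database insert, open/read, etc.)."""
    transfer = num_files * file_size_bytes / bytes_per_second
    overhead = num_files * per_file_overhead_s
    return transfer + overhead

KB = 1024
GB = 1024 ** 3
RATE = 10 * 1024 ** 2   # assumed 10 MB/s network transfer rate
OVERHEAD = 0.002        # assumed 2 ms of fixed cost per file

scenario_1 = backup_seconds(15, 2 * GB, RATE, OVERHEAD)
scenario_2 = backup_seconds(15728640, 2 * KB, RATE, OVERHEAD)

print(f"scenario 1: {scenario_1:.2f} s")
print(f"scenario 2: {scenario_2:.2f} s")
print(f"file count ratio: {15728640 // 15}")
```

Both scenarios move exactly the same number of bytes, so the raw
transfer term is identical (3072 seconds at the assumed rate).
The per-file term is what changes: 0.03 seconds for 15 files
against more than eight hours for 15728640 files.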

Imagine the time it would take to collect $30 million
from fifteen $2 million donors, then think of collecting the same
amount of money from fifteen million $2 donors.

I would recommend that you break up your NT server into smaller
filespaces, either physically on the NT server, or logically with
virtualfilespaces on the ADSM server. That way you can have
multiple processes working simultaneously on the backup. The
aggregate time it will take to back up the server will be the
same, but the wall clock time will be approximately divided by
the number of processes you can run simultaneously.
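
As a rough illustration of the wall clock effect, the Python
sketch below hands each filespace to whichever backup process is
least loaded. The function name and the hour figures are
hypothetical, and the model assumes the processes do not slow
each other down, which a real server may not guarantee.

```python
import heapq

def wall_clock_hours(filespace_hours, num_processes):
    """Elapsed time when filespaces are spread over parallel backup processes."""
    # One running total per process, kept in a min-heap.
    loads = [0.0] * num_processes
    heapq.heapify(loads)
    # Longest filespaces first, each going to the least-loaded process.
    for hours in sorted(filespace_hours, reverse=True):
        lightest = heapq.heappop(loads)
        heapq.heappush(loads, lightest + hours)
    # The backup is finished when the busiest process finishes.
    return max(loads)

filespaces = [3, 3, 3, 3]  # assumed hours per filespace, 12 hours in aggregate

print(wall_clock_hours(filespaces, 1))  # prints: 12.0
print(wall_clock_hours(filespaces, 4))  # prints: 3.0
```

The aggregate work stays at 12 hours in both calls; only the
elapsed time changes. Uneven filespaces divide less cleanly,
which is why the division is only approximate.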

 -- Tom

Thomas A. La Porte
DreamWorks SKG
tlaporte AT anim.dreamworks DOT com

On Wed, 14 Feb 2001, Diana J.Cline wrote:

>Using an NT Client and an AIX Server
>
>Does anyone have a TECHNICAL reason why I can back up 30GB of 2GB files that are
>stored in one directory so much faster than 30GB of 2kb files that are stored
>in a bunch of directories?
>
>I know that this is the case, I just would like to find out why.  If the amount
>of data is the same and the Network Data Transfer Rate is the same between the
>two backups, why does it take the TSM server so much longer to process the
>files being sent by the larger amount of files in multiple directories?
>
>I sure would like to have the answer to this.  We are trying to complete an
>incremental backup of an NT Server with about 3 million small objects (according
>to TSM) in many, many folders and it can't even get done in 12 hours.  The
>actual amount of data transferred is only about 7GB per night.  We have other
>backups that can complete 50GB in 5 hours but they are in one directory and the
># of files is smaller.
>
>Thanks
>
> Network data transfer rate
> --------------------------
> The average rate at which the network transfers data between
> the TSM client and the TSM server, calculated by dividing the
> total number of bytes transferred by the time to transfer the
> data over the network. The time it takes for TSM to process
> objects is not included in the network transfer rate. Therefore,
> the network transfer rate is higher than the aggregate transfer
> rate.
>
> Aggregate data transfer rate
> ----------------------------
> The average rate at which TSM and the network transfer data
> between the TSM client and the TSM server, calculated by
> dividing the total number of bytes transferred by the time
> that elapses from the beginning to the end of the process.
> Both TSM processing and network time are included in the
> aggregate transfer rate. Therefore, the aggregate transfer
> rate is lower than the network transfer rate.
>
>