Subject: Re: [ADSM-L] Slow backup
From: "Allen S. Rout" <asr AT UFL DOT EDU>
To: ADSM-L AT VM.MARIST DOT EDU
Date: Tue, 7 Aug 2012 12:02:59 -0400
On 08/07/2012 11:10 AM, Arbogast, Warren K wrote:


> By 'proxy agent' I mean they are authorized to do backups on behalf
> of the target server.
>
> We are doing possibility one, in your set of cases, with four
> agents.  I kept the example simple for readability, but perhaps some
> clarity was lost.

I'm being pedantic here, but your choices of vocabulary ("...on behalf
of the target server") still leave me concerned you may be
unintentionally telling one machine to store what another is
discarding.  If you've got four machines with e.g.

grant proxynode target=BIGFS agent=BIGFS_AF
grant proxynode target=BIGFS agent=BIGFS_GL

and BIGFS_AF is backing up "on behalf of the target server" with the
include/excludes you've mentioned, then you are in category two.

BIGFS_AF is backing up /ip/a* (among other things) and assigning it to a
filespace named '/ip' associated with a node named BIGFS.

BIGFS_GL will be throwing away that same data, and attempting to send
'/ip/g*', which BIGFS_AF (or other siblings) will attempt to slaughter.

If you could include TSM option files (.opt; and .sys if it's there;
is this a unix or windows setup?) for 'the target server' and two of
the proxies it would completely disambiguate these cases.
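
For illustration only, here is roughly what a category-two setup looks
like in a unix-style dsm.sys; the stanza name, patterns, and server
name are assumptions, not your actual config:

    servername  TSMSRV
       nodename            BIGFS_AF
       asnodename          BIGFS
       domain              /ip
       resourceutilization 10
       * include/exclude is read bottom-up: back up only /ip/a*
       exclude             /ip/.../*
       include             /ip/a*/.../*

Because every agent binds its files to the same '/ip' filespace under
node BIGFS, each agent's incremental sees its siblings' files as
excluded and expires them: that's the slaughter above.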

> You say, "I will note that this is unlikely to accelerate your wall-clock time;
> if you've got resourceutilization 10, you've probably got 5+ threads
> walking the FS, you've probably moved your bottleneck to IOPS on your
> NAS as it tries to pull the metadata to satisfy the FS walk.  20
> threads won't do that faster."
>
> Are you saying that reducing Resourceutilization would likely
> improve the throughput of the backup?  Or, that the "backup by proxy"
> plan itself is ill conceived for some other reason?

'Ill conceived' is too strong a term.  I'd only go so far as "Possibly
not helping you any".

The key point is identifying your bottleneck, and then determining
whether your contemplated measures affect the bottleneck.
Gedankenexperiment with me:

For most large filesystem installations, i.e. millions of files, the
performance bottleneck for conventional TSM guest-level incrementals
is the turnaround time reading metadata off the filesystem to ask 'Has
this file changed?' (in your case) 21 million times.  Plus a few tens
of thousands for directories.

So why is that a bottleneck?  Usually it's because large filesystems
are not exceptionally high-performance stores, and are consequently
stored on biggish RAIDs of cheapish disk.  Let's say you're on EMC
disk which IIRC suggests 8-spindle RAID groups, and let's further go
with the 'cheapish' disk: SATA with 80-100 IOPS.

So off a RAID group of 8 disks, you'll get 80 to 100 reads a second.
Say your 21M files occupy 21TB; you're using smallish 1TB drives, so
you've got 3 RAID groups.  If your ducks are totally in a row, you can
process 300 reads a second.  If you're using newer 3TB disks, then
you're down to 100.  Yow.

I'll handwave over whether an IOP is required for each file; there,
we're beyond my statistical envelope-back.  But if you do, then the
_expected_ wall clock time to simply ask the questions about all the
files is 19 hours (3xraid of 1T drives).
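
To make that arithmetic concrete, a back-of-envelope sketch in Python;
all the figures are the assumed ones above, not measurements:

    # envelope-back: metadata walk time, disk-bound
    files = 21_000_000        # files to stat, per the example above
    groups = 3                # 8-spindle RAID groups of 1TB SATA drives
    reads_per_group = 100     # ~80-100 reads/s assumed per group
    iops = groups * reads_per_group

    hours = files / iops / 3600
    print(round(hours, 1))    # ~19.4 hours

Note that the reader-thread count never appears in that expression;
once the spindles are saturated, another agent with five more walking
threads just reshuffles the same ~70,000 seconds of I/O.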

And somewhere in there, you might want to also read some data.  Also,
customers might want to use the file store for something; they're in
the way, as always. :)

OK.  So I fantasize that your slow performance is because of some
situation grossly similar to this.

Do you see how adding another backup reader with its own 5 FS walking
threads doesn't affect the problem?  You'll still take 19 hours to do
the 21M reads, and you might generate more contention.


- Allen S. Rout
