ADSM-L

Re: Long, long, long backup sessions

2001-08-24 13:34:24
Subject: Re: Long, long, long backup sessions
From: Zoltan Forray/AC/VCU <zforray AT VCU DOT EDU>
Date: Fri, 24 Aug 2001 13:40:46 -0400
First, I want to thank everyone for their feedback.

Since I am not an AIX person, I passed these responses to my AIX guru.
Here are *HIS* responses (noted with ###):

>>> I would suggest the problem is nothing but the number of files that need
to be processed.  I have one client with 3 million+ files and set that file
system to do incrbydate instead of incremental.  The results were
definitely worth it.

### Pretty much what we guessed

>>> The only loophole I can see with this solution is as follows.  If a
file is added to the file system with a previous date then incrbydate will
not pick it up.  For example if you untar files into the filesystem  the
files will maintain their original date and not be backed up.  To catch
these exceptions you would need to do a regular incremental once a week or
as you see fit.
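
The loophole described above can be sketched in a few lines of Python. This is only a simplified model under stated assumptions, not how the TSM client is implemented: the names FileInfo, select_by_date and select_by_inventory are hypothetical, and a real incremental compares more attributes than the modification time.

```python
from dataclasses import dataclass


@dataclass
class FileInfo:
    path: str
    mtime: float  # modification time, seconds since the epoch


def select_by_date(files, last_backup_time):
    """Date-only selection: keep files modified after the last backup."""
    return [f.path for f in files if f.mtime > last_backup_time]


def select_by_inventory(files, known_mtimes):
    """Full comparison: keep files that are new or whose mtime differs
    from what the (hypothetical) server inventory recorded."""
    return [f.path for f in files if known_mtimes.get(f.path) != f.mtime]


# Hypothetical sample data, for illustration only.
last_backup = 1_000_000.0
inventory = {"/mail/a": 900_000.0}
files = [
    FileInfo("/mail/a", 900_000.0),        # unchanged since last backup
    FileInfo("/mail/b", 1_000_500.0),      # created after last backup
    FileInfo("/mail/old.txt", 500_000.0),  # untarred, keeps its old date
]

by_date = select_by_date(files, last_backup)
by_inventory = select_by_inventory(files, inventory)
print("date-only selection:", by_date)
print("full comparison:    ", by_inventory)
```

The untarred file shows up only in the full comparison, which is why a periodic regular incremental is still needed.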

### Understood


>>> The only other option I have seen on this board is to tar the files up
first and then back them up.  This however might require a large amount
of disk space.

### This is not a viable option.

>>> As far as no data being transferred for the first eight hours I find
that surprising.  I would assume that the first file system to be looked
at is / and there would definitely be files there that need to be backed
up.
The only explanation I can think of is if your backup is looking at the
filesystem with the large number of files first.

###  Just telling you what the stats say.  Did not know there was a way to
control the order of filesystems searched/examined for backup processing.

>>> Consider using the 4.2 client's journal backup feature, which is intended
to eliminate the need to scan the file system for changed files.

The elapsed time represents the total time from the beginning of the
operation to the end. Elapsed time should have been in the neighborhood of
15 hours, but if you are running 4.1.2.0, then the elapsed time is
erroneous (APAR IC29212, fixed in 4.1.3 and up).

### As has been clarified, currently, the Journal feature only applies to
NT/2000 systems.  However, we will upgrade to the 4.2.0 client for AIX.

>>> With over 8 million files, getting the file list from the server
before backup data transfer is obviously a big factor. Incrbydate will
help somewhat, and is worth trying one night.

### We will try this after upgrading to the 4.2.0 client

>>> The client system may have 2 GB of real memory, but much of that may
already be tied up, as I'd expect it to be busy as a mail server. AIX
monitoring would show that.

### Of course it is. This is a mail server and is always very busy. We
recently upgraded from 2 CPUs to 4. It did not improve the backup times,
though.

>>> If you haven't already, review "Backup performance" in
http://people.bu.edu/rbs/ADSM.QuickFacts to have a look at that variety of
factors, including TSM server db caching, as it may be taking too long to
provide the file list to the client.

### (from me). Did all that we can do, under the current situation, to
improve TSM performance. Our system only has 1GB real storage, so I can't
do much. I am already running SELFTUNING for the DB and TXN stuff. Did not
help much. Server currently has 120MB region.

>>> Look also for any indications of I/O contention or retries on the
server disk due to soft I/O errors, which will slow retrieval. From my MVS
days I recall that data sets going into "extents" can impair performance,
so a listvtoc may be in order.  Also check I/O balancing on the server
disks, as anything else sharing those areas can cause contention.

### (me again) None of this applies to us. ALL DASD is RAID-5 RAMAC.
Can't do any real "load balancing".  Yes, there probably is some
contention. Can't do anything about it. ALL disk storage pools are single
extent VSAM linear d/s occupying complete 3390 disk. Current stgpool is
83GB.

>>> My network data transfer rate is 15 MB/sec while yours is 520 KB/sec. Even
if it is not the only problem, it may well become a big one sooner or later.
Also, I see you compress your data. Did you try without compression?

### Will consider addressing the network speed issue. Can't do much. It is a
100 switch.  Compression should not be an issue since it can use 4 fast
CPUs.
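
To put the 520 KB/sec figure in perspective, here is a rough calculation. It assumes that "100 switch" means 100 Mbit/s Ethernet and that KB means 1000 bytes; neither is stated in the thread, and protocol overhead is ignored.

```python
# Assumption: the "100 switch" is a 100 Mbit/s link (not confirmed in the thread).
LINK_MBIT_PER_S = 100
link_bytes_per_s = LINK_MBIT_PER_S * 1_000_000 / 8  # theoretical ceiling

# Observed rate quoted above, taking 1 KB as 1000 bytes (assumption).
observed_bytes_per_s = 520 * 1000

utilization = observed_bytes_per_s / link_bytes_per_s
print(f"link ceiling : {link_bytes_per_s / 1e6:.1f} MB/sec")
print(f"observed     : {observed_bytes_per_s / 1e6:.2f} MB/sec")
print(f"utilization  : {utilization:.1%}")
```

Under those assumptions the session uses only a few percent of the link, which suggests the wire itself is probably not the limiting factor.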

>>> Did you try multi-threading (heard about it but don't know a lot)?

### Not a viable option. We are about to move the OS390 to a new, fast,
single-processor 7060-H50 machine.

>>>  Look in the accounting log, dsmaccnt.log in your server/bin directory. The
format is documented in the Admin Guide - look up dsmaccnt.log in the
index. This will show you whether the session had a lot of Idle wait
(client slow - compression? millions of files to walk?) , a lot of media
wait(direct to tape, and out of drives?), or a lot of communication wait
(network slow? confirm with ftp.).

### (me, here) We run the ACCOUNTing on the OS390 side, collected as SMF
records.  I pulled the stats from an earlier (08/14/2001) session:

Session Duration:  7:06:49
Object Inserted:   78,979
Size of Objects:   2,480,911
Idle Wait Time:    1:03:27
Comm Wait Time:    3:39:06
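
For what it is worth, the shares of the session spent waiting can be worked out from the figures above. The sketch below uses only those numbers; the helper name hms_to_seconds is made up for illustration, and the units of "Size of Objects" are not given here, so no byte throughput is derived.

```python
def hms_to_seconds(text):
    """Convert an 'H:MM:SS' string to a number of seconds."""
    hours, minutes, seconds = (int(part) for part in text.split(":"))
    return hours * 3600 + minutes * 60 + seconds


# Figures copied from the accounting record above.
duration = hms_to_seconds("7:06:49")
idle_wait = hms_to_seconds("1:03:27")
comm_wait = hms_to_seconds("3:39:06")
objects_inserted = 78_979

idle_share = idle_wait / duration
comm_share = comm_wait / duration
objects_per_second = objects_inserted / duration

print(f"idle wait share : {idle_share:.1%}")
print(f"comm wait share : {comm_share:.1%}")
print(f"objects/second  : {objects_per_second:.2f}")
```

Communication wait comes to roughly half of the session and idle wait to roughly 15 percent, at about three objects inserted per second.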

>>> Maybe the elapsed processing time subtracts the time-of-day without
looking at the date.  If this session really started at 06:12, that would
fit. I know we've had problems trying to use this number in our monitoring
product.
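
The guess above can be illustrated with a short sketch. It only shows what a time-of-day subtraction would do when a session crosses midnight; the end time is hypothetical, and nothing here is taken from the actual client code or from the APAR text.

```python
from datetime import datetime, timedelta

SECONDS_PER_DAY = 86_400


def elapsed_ignoring_date(start, end):
    """Subtract time-of-day only; the calendar date is ignored."""
    start_s = start.hour * 3600 + start.minute * 60 + start.second
    end_s = end.hour * 3600 + end.minute * 60 + end.second
    return timedelta(seconds=(end_s - start_s) % SECONDS_PER_DAY)


def elapsed_full(start, end):
    """Subtract complete timestamps, date included."""
    return end - start


# Start time of day from the quoted guess; dates and end time are hypothetical.
start = datetime(2001, 8, 22, 6, 12, 0)
end = datetime(2001, 8, 23, 9, 12, 0)

wrong = elapsed_ignoring_date(start, end)
right = elapsed_full(start, end)
print("time-of-day only:", wrong)
print("full timestamps :", right)
```

With these made-up times, the time-of-day version reports 3 hours for a session that actually ran 27 hours, because every whole day is dropped.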

### As Andy has determined, the stats are invalid due to a bug in the
client. We will upgrade the client.

>>> The 8 million files is probably the problem, coupled with client-side
compression. Solutions: find the huge directory and use exclude.dir if
it's junk; clean out old stuff so you don't have to walk the thing;
unmount the filesystem and use image backup; cpio or tar up the big
directory into one file during the day, then just back up that one file.

### None of these are viable options. This box is dedicated to e-mail for
thousands of students and faculty. There are no "junk files" we can
eliminate.


Again, many, many thanks for all of the suggestions.

As you can tell from the responses, there isn't a whole lot I can do
right now.  In about 2 weeks, we will be moving to a new OS390 box with
double the real storage, so I will be letting TSM "stretch its wings" a
bit, by allocating more real storage. I will also have more disk so I can
expand the "landing zone" by 30-40GB, to keep automatic migration from
kicking in (of course we are also going to increase the number of
clients.........ohhhhh welllllll). Hope to have more tape drives, in the
future, to speed up backups and migration (currently can only use 3-3590).

A performance person also looked at TCPIP and said we need to do something
with the buffers it uses.

Whew !!!!!!!!!!!!!!!!