ADSM-L

Subject: Re: Help with TSM server being hammered by clients
From: "French, Michael" <Michael.French AT SAVVIS DOT NET>
To: ADSM-L AT VM.MARIST DOT EDU
Date: Wed, 5 Nov 2003 13:16:15 -0600
        If you have any more questions about my system layout, contact me:

1.  The server is a Sun 4500 with 4 400MHz Sparc III procs and 4GB of RAM.
Attached to the system is one D100 disk array used to hold the OS.  All of
the TSM volumes are held on A5200 Fibre Channel arrays (three of them,
containing 66 disk drives).
2. tsm: I02SV1000>q dbvol f=d

Volume Name         Copy      Volume Name         Copy      Volume Name    Copy         Available    Allocated
(Copy 1)            Status    (Copy 2)            Status    (Copy 3)       Status           Space        Space
                                                                                             (MB)         (MB)
----------------    ------    ----------------    ------    ------------   ---------    ---------    ---------
/adsmdb2/db1a       Sync'd    /adsmdb2m/db1am     Sync'd                   Undefined        8,352        8,352
/adsmdb3/db1a       Sync'd    /adsmdb3m/db1am     Sync'd                   Undefined        8,500        8,352
/adsmdb4/db1a       Sync'd    /adsmdb4m/db1am     Sync'd                   Undefined        8,352        8,352
/adsmdb5/db1a       Sync'd    /adsmdb5m/db1am     Sync'd                   Undefined        8,352        8,352
/adsmdb10/db1a      Sync'd    /adsmdb10m/db1am    Sync'd                   Undefined        8,352        8,352
/adsmdb9/db1a       Sync'd    /adsmdb9m/db1am     Sync'd                   Undefined        8,352        8,352
/adsmdb8/db1a       Sync'd    /adsmdb8m/db1am     Sync'd                   Undefined        8,352        8,352
/adsmdb7/db1a       Sync'd    /adsmdb7m/db1am     Sync'd                   Undefined        8,352        8,352
/adsmdb6/db1a       Sync'd    /adsmdb6m/db1am     Sync'd                   Undefined        8,352        8,352
/adsmdb1/db1a       Sync'd    /adsmdb1m/db1am     Sync'd                   Undefined        8,352        8,352

tsm: I02SV1000>q logvol f=d

Volume Name         Copy      Volume Name         Copy      Volume Name    Copy         Available    Allocated
(Copy 1)            Status    (Copy 2)            Status    (Copy 3)       Status           Space        Space
                                                                                             (MB)         (MB)
----------------    ------    ----------------    ------    ------------   ---------    ---------    ---------
/adsmlog2/log4      Sync'd    /adsmlog2m/log4m    Sync'd                   Undefined          300          300
/adsmlog1/log2      Sync'd    /adsmlog1m/log2m    Sync'd                   Undefined          100          100
/adsmlog1/log1      Sync'd    /adsmlog1m/log1m    Sync'd                   Undefined        4,096        4,096
/adsmlog1/log4      Sync'd                        Undefined                Undefined          200          200

tsm: I02SV1000>q vol

Volume Name                      Storage         Device         Estimated     Pct      Volume
                                 Pool Name       Class Name      Capacity    Util      Status
                                                                     (MB)
------------------------------   -----------     ----------     ---------   -----     --------
/dev/vx/rdsk/datadg/adsmdata1    BACKUPPOOL      DISK            36,864.0    64.1     On-Line
/dev/vx/rdsk/datadg/adsmdata2    BACKUPPOOL      DISK            36,864.0    72.0     On-Line
/dev/vx/rdsk/datadg/adsmdata3    BACKUPPOOL      DISK            36,864.0    47.3     On-Line
/dev/vx/rdsk/datadg/adsmdata4    BACKUPPOOL      DISK            36,864.0    56.4     On-Line
/dev/vx/rdsk/datadg/adsmdata5    BACKUPPOOL      DISK            36,864.0    49.7     On-Line
/dev/vx/rdsk/datadg/ads-         BACKUPPOOL      DISK            36,864.0    79.0     On-Line

3.  The file systems are UFS and VxFS (VxFS for all TSM volumes).
4.  RAID is software-based, using Veritas Volume Manager.
5.  Each DB and log volume is in its own disk group.  The unfortunate
problem is that they sit on mounted file system partitions instead of
raw volumes; in benchmark testing, this had a dramatic effect on
backup performance.  The main issue seemed to be the diskpool volumes,
which were easy to convert to raw compared to the DB volumes, so I
already converted those several weeks ago.  I plan to convert the DB
and log volumes soon.  I did not see any performance gain on backups of
many small files, only on medium and large ones.
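For anyone scripting the same conversion, here is a minimal sketch that
emits DEFINE VOLUME commands pointing a disk storage pool at raw VxVM
devices.  The pool, disk group, and volume names below are made-up
examples, not this server's actual layout; review the output before
piping it into dsmadmc.

```shell
#!/bin/sh
# Sketch: generate TSM DEFINE VOLUME commands so a disk storage pool
# uses raw VxVM logical volumes instead of file system files.
# POOL, DG, and the volume names are hypothetical examples.
POOL=BACKUPPOOL
DG=/dev/vx/rdsk/datadg
for vol in rawdata1 rawdata2 rawdata3; do
    echo "define volume $POOL $DG/$vol"
done
```

Something like `sh mkvols.sh | dsmadmc -id=admin -password=...` would
then register the volumes; raw logical volumes bypass the file system
layer, which is where the benchmark gain came from.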
6. Yes, the volumes are seperated by type.
7.  Most volumes appear to be broken out over two physical disks.  I am
not sure how many spindles each disk has.  They are mostly 9GB and 18GB
FC drives.
8.  tsm: I02SV1000>q db f=d

          Available Space (MB): 83,520
        Assigned Capacity (MB): 83,520
        Maximum Extension (MB): 0
        Maximum Reduction (MB): 14,832
             Page Size (bytes): 4,096
            Total Usable Pages: 21,381,120
                    Used Pages: 8,748,928
                      Pct Util: 40.9
                 Max. Pct Util: 41.2
              Physical Volumes: 20
             Buffer Pool Pages: 393,216
         Total Buffer Requests: 10,118,152
                Cache Hit Pct.: 96.81
               Cache Wait Pct.: 0.00

9.  Disk caching is being used as far as I can tell, though I don't know
exactly what to look for on Solaris.  The buffer pool in TSM is set to:

BufPoolSize: 1,572,864 K
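As a quick sanity check, the figures above are internally consistent:
pure arithmetic from the q db output, assuming the 4,096-byte page size
it reports.

```shell
#!/bin/sh
# Cross-check "q db f=d" against BufPoolSize: with 4 KB pages,
# BufPoolSize in KB divided by 4 gives buffer pool pages, and
# assigned capacity in MB times 256 gives total usable pages.
BUFPOOL_KB=1572864
PAGES=$((BUFPOOL_KB / 4))     # 393,216 -- matches Buffer Pool Pages
DB_MB=83520
USABLE=$((DB_MB * 256))       # 21,381,120 -- matches Total Usable Pages
echo "buffer pool pages:  $PAGES"
echo "total usable pages: $USABLE"
```

So the roughly 1.5GB buffer pool matches the 393,216 pages reported,
which is about 38% of the 4GB of system RAM.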


Michael French
Savvis Communications
IDS01 Santa Clara, CA
(408)450-7812 -- desk
(408)239-9913 -- mobile
 


-----Original Message-----
From: ADSM: Dist Stor Manager [mailto:ADSM-L AT VM.MARIST DOT EDU] On Behalf Of
Dave Canan
Sent: Tuesday, November 04, 2003 6:24 PM
To: ADSM-L AT VM.MARIST DOT EDU
Subject: Re: Help with TSM server being hammered by clients


Can you provide a little more information on the layout of the disk
subsystem for the TSM DB, recovery log, and storage pools? I recently
closed another customer PMR that was very similar to this. After redoing
the layout for optimal performance for the DB and LOG, the problem went
away. For example:

         1. What is the disk subsystem?
         2. How many DBVOLS? LOGVOLS, STGPOOL VOLS?
         3. Filesystem format - JFS/RLV/???
         4. Is it RAID? What type?
         5. Are the TSM volumes sharing space with other data?
         6. Are the TSM volumes separated by type?
         7. How many physical spindles do you have for the TSM volumes?
         8. What is your cache hit %?
         9. Are you using disk caching? (On both the client and server)
At 01:51 PM 11/4/2003 -0600, you wrote:
>         TSM Server 4.2.4.1 (Solaris)
>         TSM Client 4.2.3 (Solaris)
>
>         TSM DB 83GB (40% util)
>         TSM Log 4.6GB
>
>         I am having a serious problem with 4 Solaris clients hammering
> the server during their backups.  Each client has a lot of files held
> on TSM, about 4-5 million per node and growing, though not much data,
> only a couple of hundred GBs per node.  The server is very responsive
> when these clients are not backing up; all other backups run without
> lagging the server.  I have about 120 nodes backing up to this server
> daily.
>         What could be causing this performance problem?  Here is a
> show logpin during last night's backup:
>
>tsm: I02SV1000>show logpin
>Dirty page Lsn=4675033.188.3116, Last DB backup Lsn=4677956.167.3489, 
>Transaction table Lsn=4677883.231.3853, Running DB backup Lsn=0.0.0, 
>Log truncation Lsn=4675033.188.3116
>Lsn=4675033.188.3116, Owner=DB, Length=128
>Type=Update, Flags=C2, Action=ExtDelete, Page=6110475, Tsn=0:180594521,
>PrevLsn=4675033.180.2739,
>UndoNextLsn=0.0.0, UpdtLsn=4675033.176.827 ===> ObjName=AF.Bitfiles,
>Index=12, RootAddr=29,
>PartKeyLen=1, NonPartKeyLen=7, DataLen=20
>The recovery log is pinned by a dirty page in the data base buffer pool.
>Check the buffer pool statistics. If the associated transaction is still
>active then more information will be displayed about that transaction.
>Database buffer pool global variables:
>CkptId=25232, NumClean=269056, MinClean=393192, NumTempClean=393216,
>MinTempClean=196596,
>BufPoolSize=393216, BufDescCount=432537, BufDescMaxFree=432537,
>DpTableSize=393216, DpCount=124149, DpDirty=124149, DpCkptId=21890,
>DpCursor=92805,
>NumEmergency=0 CumEmergency=0, MaxEmergency=0.
>BuffersXlatched=0, xLatchesStopped=False, FullFlushWaiting=False.
>
>Is the large number of DpDirty pages bad?  I think so, but I don't know
>the technical details behind this value.  The log is at 0% util when
>backups start during the evening, and by midnight last night the log
>was up to 80% and climbing rapidly.  Once I cancel these 4 clients from
>backing up, the log stops filling so rapidly.  Does anyone else have
>problems with clients that have large numbers of small files?  How do
>you handle backing them up?  It seems like these nodes take 8-10 hours
>apiece, which seems very slow.
>
>Thanks in advance for any assistance that you can provide!
>
>Michael French
>Savvis Communications
>IDS01 Santa Clara, CA
>(408)450-7812 -- desk
>(408)239-9913 -- mobile
>

Dave Canan
TSM Performance
IBM Advanced Technical Support
ddcanan AT us.ibm DOT com
