Re: Performance Large Files vs. Small Files

Thanks Jeff, very nice description of an important issue for all of us!
Hoping Tivoli puts all the necessary resources into improving this.

René Lambelet
Nestec S.A. / Informatique du Centre 
55, av. Nestlé  CH-1800 Vevey (Switzerland) 
*+41'21'924'35'43  7+41'21'924'28'88  * K4-117
email rene.lambelet AT nestle DOT com
Visit our site: http://www.nestle.com

        This message is intended only for the use of the addressee and 
        may contain information that is privileged and confidential.



> -----Original Message-----
> From: Jeff Connor [SMTP:connorj AT NIAGARAMOHAWK DOT COM]
> Sent: Thursday, February 15, 2001 8:01 PM
> To:   ADSM-L AT VM.MARIST DOT EDU
> Subject:      Re: Performance Large Files vs. Small Files
> 
> Diana,
> 
> Sorry to chime in late on this but you've hit a subject I've been
> struggling with for quite some time.
> 
> We have some pretty large Windows NT file and print servers using MSCS.
> Each server has lots of small files (1.5 to 2.5 million) and total disk
> space (the D: drive) between 150GB and 200GB, on a Compaq server with two
> 400MHz Xeon processors and 400MB RAM.  We have been running TSM on the
> mainframe since ADSM version 1 and are currently at 3.7 of the TSM server
> with 3.7.2.01 and 4.1.2 on the NT clients.
> 
> Our Windows NT admins have had a concern for quite some time regarding
> TSM restore performance and how long it would take to restore that big
> old D: drive.  They don't see the value in TSM as a whole as compared to
> the competition; they just want to know, "How fast can you recover my
> entire D: drive?"  They decided they wanted to perform weekly full backups
> to direct attached DLT drives using Arcserve and would use the TSM
> incrementals to forward recover during full volume restore.  We finally
> had to recover one of those big D: drives this past September.  The
> Arcserve portion of the recovery took about 10 hours if I recall
> correctly.  The TSM forward recovery ran for 36 hours and only restored
> about 8.5GB.  They were not pleased.  It seems all that comparing took
> quite some time.  I've been trying to get to the root of the bottleneck
> since then.  I've worked with support on and off over the last few months
> performing various traces and the like.  At this point we are looking in
> the area of mainframe TCPIP and delays in acknowledgments coming out of
> the mainframe during test restores.
> 
> If you've worked with TSM for a number of years, then through sources in
> IBM/Tivoli and the valuable information from this listserv you learn over
> time about all the TSM client and server "knobs" to turn to try and get
> maximum performance:  things like Bufpoolsize, database cache hits,
> housekeeping processes running at the same time as backups/restores and
> slowing things down, network issues like auto-negotiate on NICs, MTU
> sizes, TSM server database and log disk placement, tape drive load/seek
> times, and speeds and feeds.  Basically, I think we are pretty well set
> with all those important things to consider.  This problem we are having
> may be a mainframe TCPIP issue in the end, but I am not sure that will be
> the complete picture.
> 
> We have recently installed an AIX TSM server, H80 two-way, 2GB memory,
> 380GB EMC 3430 disk, 6 Fibre Channel 3590-E1A drives in a 3494, TSM server
> at 4.1.2.  We plan to move most of the larger clients from the TSM OS/390
> server to the AIX TSM server.  A good move to realize a performance
> improvement according to many posts on this Listserv over the years.  I am
> in the process of testing my NT "problem children" as quickly as I can to
> prove this configuration will address the concerns our NT Admins have
> about restores of large NT servers.  I'm trying to prevent them from
> installing a Veritas SAN solution and asking them to stick with our
> Enterprise Backup Strategic direction, which is to utilize TSM.  As you
> probably know, the SAN enabled TSM backup/archive client for NT is not
> here and may never be from what I've heard.  My only option at this point
> is SAN tape library sharing with the TSM client and server on the same
> machine for each of our MSCS servers.
> 
> Now I'm sure many of you reading this may be thinking of things like, "why
> not break the D: drive into smaller partitions so you can collocate by
> filespace and restore all the data concurrently".  No go guys, they don't
> want to change the way they configure their servers just to accommodate
> TSM when they feel they would not have to with other products.  They feel
> that with 144GB single drives around the corner, who is to say what a
> "big" NT partition is?  NT seems to support these large drives without
> issues.  (Their words not mine).
> 
> Back to the issue.  Our initial backup tests using our new AIX TSM server
> have produced significant improvements in performance.  I am just getting
> the pieces in place to perform restore tests.  My first test a couple days
> ago was to restore part of the data from that server we had the issue with
> in September.  It took about one hour to lay down just the directories
> before restoring any files.  Probably still better than the mainframe but
> not great.  My plan for future tests is to perform backups and restores of
> the same data to and from both of my TSM servers to compare performance.
> I will share the results with you and the rest of the listserv as I
> progress.
> 
> In general I have always, like many other TSM users, achieved much better
> restore/backup rates with larger files versus lots of smaller files.
> Assuming you've done all the right tuning, the question that comes to my
> mind is, does it really come down to the architecture?  The TSM database
> makes things very easy for day to day smaller recoveries which is the type
> we perform most.  But does the architecture that makes day to day
> operations easier not lend itself well to backup/recovery of large amounts
> of data made up of small files?  I have very little experience with
> competing products.  Do they struggle with lots of small files as well?
> Veritas, Arcserve anyone?  If the bottleneck is, as some on the Listserv
> have suggested, frequent interaction with the client file system, then I
> suppose the answer would be yes, the other products have the same
> problem.  Or is the issue more on the TSM database side due to its
> design, so that other products using different architectures may not have
> this problem?  Maybe the competition's architecture is less bulletproof,
> but if you're one of our NT Admins you don't seem to care when the client
> keeps calling asking how much longer the restore will be running.  I know
> TSM development is aware of the issues with lots of small files and I
> would be curious what they plan to do about the problems Diana and I have
> experienced.
> 
> The newer client option, Resourceutilization, has helped with backing up
> clients with lots of small files more quickly.  I would love to see the
> same type of automated multi-tasking on restores.  I don't know the
> specifics of how this actually works but it seems to me that when I ask to
> restore an entire NT drive, for example, the TSM client/server must sort
> the file list in some fashion to intelligently request tape volumes to
> minimize the mounts required.  If that's the case could they take things
> one step further and add an option to the restore specifying the number of
> concurrent sessions/mountpoints to be used to perform the restore?  For
> example, if I have a node whose collocated data is spread across twenty
> tapes and I have 6 tape drives available for the recovery, how about an
> option for the restore command like:
> 
>      RES -subd=y -nummp=6 d:\*
> 
> where the -nummp option would be the number of mount points/tape drives to
> be used for the restore.  TSM could sort the file list, coming up with the
> list of tapes to be used for the restore, and perhaps spread the mounts
> across 6 sessions/mount points.  I'm sure I've probably made a complex
> task sound simple but this type of option would be very useful.  I think
> many of us have seen the benefits of running multiple sessions to reduce
> recovery elapsed time.  I find my current choices for doing so difficult
> to implement or politically undesirable.
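
A rough sketch of the scheduling idea behind the option proposed above, in
Python.  Note that -nummp is only a suggestion made in this post, not an
existing TSM client option, and nothing here reflects how TSM actually orders
its tape mounts.  The function name spread_volumes, the volume names, and the
per-volume sizes are all made up for illustration.  The sketch hands each tape
to whichever restore session currently has the least data assigned, taking the
largest tapes first:

```python
import heapq


def spread_volumes(volume_sizes, num_mount_points):
    """Assign tape volumes to restore sessions, balancing data per session.

    volume_sizes: dict mapping a (hypothetical) volume name to the amount
    of data to restore from it, in any consistent unit.
    Returns one list of volume names per session/mount point.
    """
    if num_mount_points < 1:
        raise ValueError("need at least one mount point")

    # Min-heap of (data assigned so far, session index).
    heap = [(0, i) for i in range(num_mount_points)]
    heapq.heapify(heap)
    sessions = [[] for _ in range(num_mount_points)]

    # Largest volumes first, each to the least-loaded session.
    ordered = sorted(volume_sizes.items(), key=lambda kv: (-kv[1], kv[0]))
    for name, size in ordered:
        load, idx = heapq.heappop(heap)
        sessions[idx].append(name)
        heapq.heappush(heap, (load + size, idx))
    return sessions


# Twenty tapes of equal (made-up) size, six drives, as in the example above.
tapes = {f"VOL{n:03d}": 10 for n in range(1, 21)}
plan = spread_volumes(tapes, 6)
for i, vols in enumerate(plan):
    print(f"session {i}: {', '.join(vols)}")
```

With twenty equally sized tapes and six sessions, two sessions end up with
four tapes and four sessions with three, so no drive would have to mount more
than four tapes.  A real implementation would also have to deal with files
that span volumes and with restore ordering, which this sketch ignores.
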
> 
> If others have the same issues with lots of small files, in particular
> with Windows NT clients, let's hear from you.  Maybe we can come up with
> some enhancement requests.  I'll pass on the results of my tests as stated
> above.  I'd be interested in hearing from those of you that have worked
> with other products and can tell me if they have the same performance
> problems with lots of small files.  If the performance of other products
> is impacted in the same way as TSM performance then that would be good to
> know.  If it's more about the Windows NT NTFS file system then I'd be
> satisfied with that explanation as well.  If it's that lots of
> interaction with the TSM database leads to slower performance, even when
> optimally configured, then I'd like to know what Tivoli has in the works
> to address the issue.  Because if it's the TSM database, I could probably
> install the fattest Fibre Channel/network pipe with the fastest
> peripherals and server hardware around and it might not change a thing.
> 
> Thanks
> Jeff Connor
> Niagara Mohawk Power Corp.
> 
> "Diana J.Cline" <Diana.Cline AT ROSSNUTRITION DOT COM>@VM.MARIST.EDU> on
> 02/14/2001 10:04:52 AM
> 
> Please respond to "ADSM: Dist Stor Manager" <ADSM-L AT VM.MARIST DOT EDU>
> 
> Sent by:  "ADSM: Dist Stor Manager" <ADSM-L AT VM.MARIST DOT EDU>
> 
> 
> To:   ADSM-L AT VM.MARIST DOT EDU
> cc:
> 
> Subject:  Performance Large Files vs. Small Files
> 
> 
> Using an NT Client and an AIX Server
> 
> Does anyone have a TECHNICAL reason why I can back up 30GB of 2GB files
> that are stored in one directory so much faster than 30GB of 2KB files
> that are stored in a bunch of directories?
> 
> I know that this is the case, I just would like to find out why.  If the
> amount of data is the same and the Network Data Transfer Rate is the same
> between the two backups, why does it take the TSM server so much longer to
> process the larger number of files in multiple directories?
> 
> I sure would like to have the answer to this.  We are trying to complete
> an incremental backup of an NT Server with about 3 million small objects
> (according to TSM) in many, many folders and it can't even get done in 12
> hours.  The actual amount of data transferred is only about 7GB per
> night.  We have other backups that can complete 50GB in 5 hours but they
> are in one directory and the # of files is smaller.
> 
> Thanks
> 
>  Network data transfer rate
>  --------------------------
>  The average rate at which the network transfers data between
>  the TSM client and the TSM server, calculated by dividing the
>  total number of bytes transferred by the time to transfer the
>  data over the network. The time it takes for TSM to process
>  objects is not included in the network transfer rate. Therefore,
>  the network transfer rate is higher than the aggregate transfer
>  rate.
> 
>  Aggregate data transfer rate
>  ----------------------------
>  The average rate at which TSM and the network transfer data
>  between the TSM client and the TSM server, calculated by
>  dividing the total number of bytes transferred by the time
>  that elapses from the beginning to the end of the process.
>  Both TSM processing and network time are included in the
>  aggregate transfer rate. Therefore, the aggregate transfer
>  rate is lower than the network transfer rate.
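
A small worked example of the two definitions above, in Python, using the
figures Diana quotes (about 7GB and roughly 3 million objects in 12 hours,
versus 50GB in 5 hours).  Two things are assumptions rather than facts from
the thread: the 12 hours is only a lower bound, since that backup did not
finish, and the one-hour network time is invented purely to show how the two
rates differ.  Sizes are taken as binary gigabytes.

```python
GB = 1024 ** 3  # bytes; assumes binary gigabytes


def network_rate(total_bytes, network_seconds):
    """Bytes per second, counting only time spent moving data on the wire."""
    return total_bytes / network_seconds


def aggregate_rate(total_bytes, elapsed_seconds):
    """Bytes per second over the whole run, TSM processing time included."""
    return total_bytes / elapsed_seconds


def kb_per_sec(rate):
    """Convert bytes/second to KB/second."""
    return rate / 1024


# Figures quoted in the thread.
small_bytes, small_elapsed, small_objects = 7 * GB, 12 * 3600, 3_000_000
large_bytes, large_elapsed = 50 * GB, 5 * 3600

small_aggregate = aggregate_rate(small_bytes, small_elapsed)
large_aggregate = aggregate_rate(large_bytes, large_elapsed)

# Average wall-clock time spent per object examined in the small-file backup.
ms_per_object = small_elapsed / small_objects * 1000

# ASSUMPTION: one hour of pure network transfer; not stated anywhere above.
assumed_network_seconds = 3600
small_network = network_rate(small_bytes, assumed_network_seconds)

print(f"small files, aggregate: {kb_per_sec(small_aggregate):8.1f} KB/s")
print(f"large files, aggregate: {kb_per_sec(large_aggregate):8.1f} KB/s")
print(f"small files, network (assumed): {kb_per_sec(small_network):8.1f} KB/s")
print(f"time per object: {ms_per_object:.1f} ms")
```

On these numbers the small-file backup runs at an aggregate rate of roughly
170 KB/s against roughly 2,900 KB/s for the large-file backup, and averages
about 14 ms per object.  The network rate can look healthy while the aggregate
rate is poor whenever most of the elapsed time is per-object processing rather
than data transfer.
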