Subject: Re: Performance Large Files vs. Small Files
From: "Lambelet,Rene,VEVEY,FC-SIL/INF." <Rene.Lambelet AT NESTLE DOT COM>
Date: Mon, 26 Feb 2001 09:13:15 +0100
Hello,


you might think of renaming the node every day, then doing an export
followed by a delete of the filespace (this frees up the DB).

If a restore is needed, import the node in question.
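
A rough sketch of that daily sequence from an admin command line (node,
domain, device class and volume names are placeholders; untested, so
check the syntax at your server level):

    rename node PARTDATA PARTDATA_20010226
    register node PARTDATA secret domain=MANUF
    export node PARTDATA_20010226 filedata=all devclass=3590CLASS
    delete filespace PARTDATA_20010226 *

The register puts a fresh node back under the original name so the next
nightly backup still has a target. When a restore is requested:

    import node PARTDATA_20010226 filedata=all devclass=3590CLASS volumenames=VOL001,VOL002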

René Lambelet
Nestec S.A. / Informatique du Centre 
55, av. Nestlé  CH-1800 Vevey (Switzerland) 
Tel +41 21 924 35 43  Fax +41 21 924 28 88  Office K4-117
email rene.lambelet AT nestle DOT com
Visit our site: http://www.nestle.com

        This message is intended only for the use of the addressee and 
        may contain information that is privileged and confidential.


> -----Original Message-----
> From: bbullock [SMTP:bbullock AT MICRON DOT COM]
> Sent: Tuesday, February 20, 2001 11:22 PM
> To:   ADSM-L AT VM.MARIST DOT EDU
> Subject:      Re: Performance Large Files vs. Small Files
> 
>         Jeff,
>         You hit the nail on the head of the biggest problem I face
> with TSM today. Excuse me for being long-winded, but let me explain
> the boat I'm in and how it relates to many small files.
> 
>         We have been using TSM for about 5 years at our company and
> have finally got everyone on our bandwagon and away from the variety
> of backup solutions and media we had in the past. We now have 8 TSM
> servers running on AIX hosts (S80s) attached to 4 libraries with a
> total of 44 3590E tape drives. A nice beefy environment.
> 
>         The problem that keeps me awake at night is that we now have
> manufacturing machines wanting to use TSM for their backups. In the
> past they have used small DLT libraries locally attached to the host,
> but that's labor-intensive and they want to take advantage of our
> "enterprise backup solution". A great coup for my job security and
> for TSM, as they now see the benefit of TSM.
> 
>         The problem with these hosts is that they generate many, many
> small files every day. Without going into any detail, each file is a
> test on a part that they may need to look at if the part ever fails.
> Each part gets many tests done to it through the manufacturing
> process, so many files are generated for each part.
> 
>         How many files? Well, I have one Solaris-based host that
> generates 500,000 new files a day in a deeply nested directory
> structure (about 10 levels deep with only about 5 files per
> directory). Before I am asked: "no, they are not able to change the
> directory or file structure on the host. It runs proprietary
> applications that can't be altered". They are currently keeping these
> files on the host for about 30 days and then deleting them.
> 
>         I have no problem moving the files to TSM on a nightly basis;
> we have a nice big network pipe and the files are small. The problem
> is with the TSM database growth, and with the number of files per
> filesystem (stored in TSM). Unfortunately, the directories are not
> shown when you do a 'q occ' on a node, so there is a "hidden" number
> of database entries taking up space in my TSM database that is not
> readily apparent when looking at the output of 'q node'.
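> 
>         If you want to gauge that hidden directory count, a
> server-side select ought to show it. A sketch, assuming the BACKUPS
> table exposes a TYPE column that distinguishes DIR from FILE entries
> (check the syntax at your server level; the node name is a
> placeholder):
> 
>     select count(*) from backups where node_name='BIGNODE' and type='DIR'
> 
> Fair warning: a select like that can run for a long time against a
> big DB.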
> 
>         One of my TSM databases is growing by about 1.5 GB a week,
> with no end in sight. We currently are keeping those files for 180
> days, but they are now requesting that they be kept for 5 years (in
> case a part gets returned by a customer).
> 
>         This one nightmare host now has over 20 million files (and an
> unknown number of directories) across 10 filesystems. We have found
> from experience that any more than about 500,000 files in a
> filesystem means a full filesystem restore would take many hours.
> Just restoring the directory structure seems to take a few hours at
> least. I have told the admins of this host that it is very much
> unrecoverable in its current state, and that restoring the whole box
> would take on the order of days.
> 
>         They are disappointed that an "enterprise backup solution"
> can't handle this number of files any better. They are willing to
> work with us to find a solution that covers both the daily "disaster
> recovery" backup need for the host and the long-term retention they
> desire.
> 
>         I am pushing back and telling them that their desire to keep
> it all for 5 years is unreasonable, but I thought I'd bounce it off
> you folks to see if there is some TSM solution that I am overlooking.
> 
>         There are 2 ways to control database growth: reduce the number of
> database entries, or reduce the retention time.
> 
> Here is what I've looked into so far.
> 
> 1. Cut the incremental backup retention down to 30 days and then
> generate a backup set every 30 days for long-term retention.
>         On paper it looks good: you don't have to move the data over
> the net again and there is only 1 database entry. Well, I'm not sure
> how many of you have tried this on a filesystem with many files, but
> I tried it twice on a filesystem with only 20,000 files and it took
> over 1 hour to complete. Doing the math, it would take over 100 hours
> to do each of these 2-million-file filesystems. Doesn't seem really
> feasible.
> 
> 2. Cut the incremental backup retention down to 30 days and run an
> archive every 30 days to the 5-year management class.
>         This would cut down the number of files we are tracking with
> the incrementals, so a full filesystem restore from the latest backup
> would have less garbage to sort through and hopefully run quicker.
> Yet with the archives, we would have to move the 600 GB over the net
> every 30 days and would still end up tracking the millions of
> individual files for the next 5 years.
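> 
> The monthly archive itself would presumably be a one-liner from the
> client (the management class name here is a placeholder):
> 
>     dsmc archive "/fs01/*" -subdir=yes -archmc=KEEP5YR
> 
> which is exactly where the millions-of-entries problem comes back for
> another 5 years.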
> 
> 3. Use TSM as a disaster recovery solution with a short 30-day
> retention, and use some other solution (like a local CD/DVD burner)
> to get the 5-year retention they desire. Still looking into this one,
> but they don't like it because it once again becomes a manual process
> to swap out CDs.
> 
> 4. Use TSM as a disaster recovery solution (with a short 30-day
> retention) and have a process tar up all the 30-day-old files into
> one large file, then have TSM archive and delete the .tar file. This
> would mean we only track 1 large tar file per day for the 5-year term
> (about 1,800 files). This is the option we are currently pursuing, as
> sketched below.
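> 
> A rough sketch of that nightly job (paths, names and the management
> class are placeholders; assumes GNU tar for -T, Solaris tar would use
> -I; assumes no blanks in file names; untested):
> 
>     DAY=`date +%Y%m%d`
>     # list files that have aged past 30 days
>     find /fs01 -type f -mtime +30 -print > /tmp/old.$DAY
>     # roll them into one tar file in a staging area
>     tar cf /stage/fs01.$DAY.tar -T /tmp/old.$DAY
>     # archive the tar to the 5-year class; -deletefiles removes the
>     # tar from disk once it is safely in TSM
>     dsmc archive /stage/fs01.$DAY.tar -archmc=KEEP5YR -deletefiles
>     # then clean up the originals
>     xargs rm < /tmp/old.$DAY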
> 
>         Any other options or suggestions from the group? Any other
> backup solutions you have in place for tracking many files over
> longer periods of time?
> 
>         If you made it this far through this long e-mail, thanks for
> letting me drone on.
> 
> Thanks,
> Ben Bullock
> UNIX Systems Manager
> Micron Technology
> 
> 
> > -----Original Message-----
> > From: Jeff Connor [mailto:connorj AT NIAGARAMOHAWK DOT COM]
> > Sent: Thursday, February 15, 2001 12:01 PM
> > To: ADSM-L AT VM.MARIST DOT EDU
> > Subject: Re: Performance Large Files vs. Small Files
> >
> >
> > Diana,
> >
> > Sorry to chime in late on this, but you've hit a subject I've been
> > struggling with for quite some time.
> >
> > We have some pretty large Windows NT file and print servers using
> > MSCS. Each server has lots of small files (1.5 to 2.5 million) and
> > total disk space (the D: drive) between 150GB and 200GB; Compaq
> > servers, two 400 MHz Xeons with 400 MB RAM. We have been running
> > TSM on the mainframe since ADSM version 1 and are currently at 3.7
> > of the TSM server, with 3.7.2.01 and 4.1.2 on the NT clients.
> >
> > Our Windows NT admins have had a concern for quite some time
> > regarding TSM restore performance and how long it would take to
> > restore that big old D: drive. They don't see the value in TSM as
> > a whole compared to the competition; they just want to know how
> > fast you can recover the entire D: drive. They decided they wanted
> > to perform weekly full backups to direct-attached DLT drives using
> > ARCserve, and would use the TSM incrementals to forward-recover
> > during a full volume restore. We finally had to recover one of
> > those big D: drives this past September. The ARCserve portion of
> > the recovery took about 10 hours, if I recall correctly. The TSM
> > forward recovery ran for 36 hours and only restored about 8.5GB.
> > They were not pleased. It seems all that comparing took quite some
> > time. I've been trying to get to the root of the bottleneck since
> > then. I've worked with support on and off over the last few
> > months, performing various traces and the like. At this point we
> > are looking in the area of mainframe TCPIP and delays in
> > acknowledgments coming out of the mainframe during test restores.
> >
> > If you've worked with TSM for a number of years, then through
> > sources in IBM/Tivoli and the valuable information from this
> > listserv you learn over time about all the TSM client and server
> > "knobs" to turn to try to get maximum performance: things like
> > BUFPOOLSIZE, database cache hits, housekeeping processes running
> > at the same time as backups/restores and slowing things down,
> > network issues like auto-negotiate on NICs, MTU sizes, TSM server
> > database and log disk placement, tape drive load/seek times, and
> > speeds and feeds. Basically, I think we are pretty well set on all
> > those important considerations. This problem we are having may be
> > a mainframe TCPIP issue in the end, but I am not sure that will be
> > the complete picture.
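> >
> > For anyone collecting those knobs, the usual suspects live in the
> > server and client option files. The values below are illustrative
> > only, not recommendations:
> >
> >     * dsmserv.opt (server): DB buffer pool in KB; watch the cache
> >     * hit percentage in 'q db f=d'
> >     BUFPOOLSIZE 32768
> >
> >     * dsm.opt (client): TCP window and buffer sizes in KB
> >     TCPWINDOWSIZE 63
> >     TCPBUFFSIZE 31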
> >
> > We have recently installed an AIX TSM server: H80 two-way, 2GB
> > memory, 380GB EMC 3430 disk, 6 Fibre Channel 3590-E1A drives in a
> > 3494, TSM server at 4.1.2. We plan to move most of the larger
> > clients from the TSM OS/390 server to the AIX TSM server, a good
> > move to realize a performance improvement according to many posts
> > on this listserv over the years. I am in the process of testing my
> > NT "problem children" as quickly as I can, to prove this
> > configuration will address the concerns our NT admins have about
> > restores of large NT servers. I'm trying to keep them from
> > installing a Veritas SAN solution and to persuade them to stick
> > with our enterprise backup strategic direction, which is to
> > utilize TSM. As you probably know, the SAN-enabled TSM
> > backup/archive client for NT is not here and may never be, from
> > what I've heard. My only option at this point is SAN tape library
> > sharing, with the TSM client and server on the same machine for
> > each of our MSCS servers.
> >
> > Now I'm sure many of you reading this may be thinking of things
> > like, "why not break the D: drive into smaller partitions so you
> > can collocate by filespace and restore all the data concurrently?"
> > No go, guys: they don't want to change the way they configure
> > their servers just to accommodate TSM when they feel they would
> > not have to with other products. They feel that with 144GB single
> > drives around the corner, who is to say what a "big" NT partition
> > is? NT seems to support these large drives without issues. (Their
> > words, not mine.)
> >
> > Back to the issue. Our initial backup tests using our new AIX TSM
> > server have produced significant improvements in performance. I am
> > just getting the pieces in place to perform restore tests. My
> > first test a couple of days ago was to restore part of the data
> > from the server we had the issue with in September. It took about
> > one hour to lay down just the directories before restoring any
> > files. Probably still better than the mainframe, but not great. My
> > plan for future tests is to perform backups and restores of the
> > same data to and from both of my TSM servers to compare
> > performance. I will share the results with you and the rest of the
> > listserv as I progress.
> >
> > In general I have always, like many other TSM users, achieved much
> > better restore/backup rates with larger files versus lots of
> > smaller files. Assuming you've done all the right tuning, the
> > question that comes to my mind is: does it really come down to the
> > architecture? The TSM database makes things very easy for the
> > day-to-day smaller recoveries, which are the type we perform most.
> > But does the architecture that makes day-to-day operations easier
> > not lend itself well to backup/recovery of large amounts of data
> > made up of small files? I have very little experience with
> > competing products. Do they struggle with lots of small files as
> > well? Veritas, ARCserve, anyone? If the issue is, as some on the
> > listserv have suggested, that frequent interaction with the client
> > file system is the bottleneck, then I suppose the answer would be
> > yes, the other products have the same problem. Or is the issue
> > more on the TSM database side, due to its design, and other
> > products using different architectures may not have this problem?
> > Maybe the competition's architecture is less bulletproof, but if
> > you're one of our NT admins you don't seem to care when the client
> > keeps calling asking how much longer the restore will be running.
> > I know TSM development is aware of the issues with lots of small
> > files, and I would be curious what they plan to do about the
> > problems Diana and I have experienced.
> >
> > The newer client option, RESOURCEUTILIZATION, has helped with
> > backing up clients with lots of small files more quickly. I would
> > love to see the same type of automated multi-tasking on restores.
> > I don't know the specifics of how this actually works, but it
> > seems to me that when I ask to restore an entire NT drive, for
> > example, the TSM client/server must sort the file list in some
> > fashion to intelligently request tape volumes and minimize the
> > mounts required. If that's the case, could they take things one
> > step further and add an option to the restore specifying the
> > number of concurrent sessions/mount points to be used to perform
> > the restore? For example, if I have a node whose collocated data
> > is spread across twenty tapes and I have 6 tape drives available
> > for the recovery, how about an option for the restore command
> > like:
> >
> >      RES -subd=y -nummp=6 d:\*
> >
> > where the -nummp option would be the number of mount points/tape
> > drives to be used for the restore. TSM could sort the file list,
> > come up with the list of tapes to be used for the restore, and
> > perhaps spread the mounts across 6 sessions/mount points. I'm sure
> > I've probably made a complex task sound simple, but this type of
> > option would be very useful. I think many of us have seen the
> > benefits of running multiple sessions to reduce recovery elapsed
> > time; I find my current choices for doing so difficult to
> > implement or politically undesirable.
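> >
> > For what it's worth, the backup-side multi-tasking is just a
> > client option file entry; the value below is illustrative (check
> > what your client level supports), and there is no restore-side
> > equivalent today:
> >
> >     * dsm.opt (NT client): allow the client to open extra
> >     * sessions during incremental backup
> >     RESOURCEUTILIZATION 6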
> >
> > If others have the same issues with lots of small files, in
> > particular with Windows NT clients, let's hear from you. Maybe we
> > can come up with some enhancement requests. I'll pass on the
> > results of my tests as stated above. I'd be interested in hearing
> > from those of you who have worked with other products and can tell
> > me whether they have the same performance problems with lots of
> > small files. If the performance of other products is impacted in
> > the same way as TSM performance, that would be good to know. If
> > it's more about the Windows NT NTFS file system, then I'd be
> > satisfied with that explanation as well. If it's that lots of
> > interaction with the TSM database leads to slower performance,
> > even when optimally configured, then I'd like to know what Tivoli
> > has in the works to address the issue. Because if it's the TSM
> > database, I could probably install the fattest Fibre
> > Channel/network pipe with the fastest peripherals and server
> > hardware around and it might not change a thing.
> >
> > Thanks
> > Jeff Connor
> > Niagara Mohawk Power Corp.
> >
> >
> > "Diana J.Cline" <Diana.Cline AT ROSSNUTRITION DOT COM> on
> > 02/14/2001 10:04:52 AM
> >
> > Please respond to "ADSM: Dist Stor Manager" <ADSM-L AT VM.MARIST DOT EDU>
> >
> > Sent by:  "ADSM: Dist Stor Manager" <ADSM-L AT VM.MARIST DOT EDU>
> >
> >
> > To:   ADSM-L AT VM.MARIST DOT EDU
> > cc:
> >
> > Subject:  Performance Large Files vs. Small Files
> >
> >
> > Using an NT Client and an AIX Server
> >
> > Does anyone have a TECHNICAL reason why I can back up 30GB of 2GB
> > files stored in one directory so much faster than 30GB of 2KB
> > files stored in a bunch of directories?
> >
> > I know that this is the case; I just would like to find out why.
> > If the amount of data is the same and the network data transfer
> > rate is the same between the two backups, why does it take the TSM
> > server so much longer to process the files sent by the backup with
> > the larger number of files in multiple directories?
> >
> > I sure would like to have the answer to this. We are trying to
> > complete an incremental backup of an NT server with about 3
> > million small objects (according to TSM) in many, many folders,
> > and it can't even get done in 12 hours. The actual amount of data
> > transferred is only about 7GB per night. We have other backups
> > that can complete 50GB in 5 hours, but they are in one directory
> > and the # of files is smaller.
> >
> > Thanks
> >
> >  Network data transfer rate
> >  --------------------------
> >  The average rate at which the network transfers data between
> >  the TSM client and the TSM server, calculated by dividing the
> >  total number of bytes transferred by the time to transfer the
> >  data over the network. The time it takes for TSM to process
> >  objects is not included in the network transfer rate. Therefore,
> >  the network transfer rate is higher than the aggregate transfer
> >  rate.
> >
> >  Aggregate data transfer rate
> >  ----------------------------
> >  The average rate at which TSM and the network transfer data
> >  between the TSM client and the TSM server, calculated by
> >  dividing the total number of bytes transferred by the time
> >  that elapses from the beginning to the end of the process.
> >  Both TSM processing and network time are included in the
> >  aggregate transfer rate. Therefore, the aggregate transfer
> >  rate is lower than the network transfer rate.
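> >
> >  An illustration with made-up round numbers: if 7,000 MB crosses
> >  the wire in 2,000 seconds, the network data transfer rate is
> >  7,000 / 2,000 = 3.5 MB/sec. If the same session's total elapsed
> >  time is 12 hours (43,200 seconds), the aggregate data transfer
> >  rate is only 7,000 / 43,200 = 0.16 MB/sec. Everything between
> >  those two numbers is TSM object-processing time, not network
> >  time.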
> >