ADSM-L

Re: Performance Large Files vs. Small Files

2001-02-26 10:47:31
Subject: Re: Performance Large Files vs. Small Files
From: "Lambelet,Rene,VEVEY,FC-SIL/INF." <Rene.Lambelet AT NESTLE DOT COM>
Date: Mon, 26 Feb 2001 16:47:48 +0100
You could also export only the inactive versions.

René Lambelet
Nestec S.A. / Informatique du Centre 
55, av. Nestlé  CH-1800 Vevey (Switzerland) 
Tel +41 21 924 35 43  Fax +41 21 924 28 88  Office K4-117
email rene.lambelet AT nestle DOT com
Visit our site: http://www.nestle.com

        This message is intended only for the use of the addressee and 
        may contain information that is privileged and confidential.



> -----Original Message-----
> From: bbullock [SMTP:bbullock AT MICRON DOT COM]
> Sent: Monday, February 26, 2001 4:32 PM
> To:   ADSM-L AT VM.MARIST DOT EDU
> Subject:      Re: Performance Large Files vs. Small Files
> 
>         EEK. I'm sure this is not the answer, because if I rename the
> filesystem every day, then I have to do a full backup of the filesystem
> every day. I don't think there are enough hours in the day to do a full
> backup, export it, and delete the filespace. Thanks for the suggestion
> though.       
> 
> Ben Bullock
> UNIX Systems Manager
> (208) 368-4287
> 
> > -----Original Message-----
> > From: Lambelet,Rene,VEVEY,FC-SIL/INF. 
> > [mailto:Rene.Lambelet AT NESTLE DOT COM]
> > Sent: Monday, February 26, 2001 1:13 AM
> > To: ADSM-L AT VM.MARIST DOT EDU
> > Subject: Re: Performance Large Files vs. Small Files
> > 
> > 
> > Hello,
> > 
> > 
> > you might think of renaming the node every day, then doing an export
> > followed by a delete of the filespace (this will free the DB).
> > 
> > In case of restore, import the needed node
> > 
> > René Lambelet
> > Nestec S.A. / Informatique du Centre 
> > 55, av. Nestlé  CH-1800 Vevey (Switzerland) 
> > Tel +41 21 924 35 43  Fax +41 21 924 28 88  Office K4-117
> > email rene.lambelet AT nestle DOT com
> > Visit our site: http://www.nestle.com
> > 
> >         This message is intended only for the use of the 
> > addressee and 
> >         may contain information that is privileged and confidential.
> > 
> > 
> > > -----Original Message-----
> > > From: bbullock [SMTP:bbullock AT MICRON DOT COM]
> > > Sent: Tuesday, February 20, 2001 11:22 PM
> > > To:   ADSM-L AT VM.MARIST DOT EDU
> > > Subject:      Re: Performance Large Files vs. Small Files
> > > 
> > >         Jeff,
> > >         You hit the nail on the head of what is the biggest 
> > problem I face
> > > with TSM today. Excuse me for being long winded, but let me 
> > explain the
> > > boat
> > > I'm in, and how it relates to many small files.
> > > 
> > >         We have been using TSM for about 5 years at our 
> > company and have
> > > finally got everyone on our bandwagon and away from the variety of
> > > backup
> > > solutions and media we had in the past. We now have 8 TSM 
> > servers running
> > > on
> > > AIX hosts (S80s) attached to 4 libraries with a total of 44 
> > 3590E tape
> > > drives. A nice beefy environment.
> > > 
> > >         The problem that keeps me awake at night now is 
> > that we now have
> > > manufacturing machines wanting to use TSM for their 
> > backups. In the past
> > > they have used small DLT libraries locally attached to the host, but
> > > that's
> > > labor intensive and they want to take advantage of our 
> > "enterprise backup
> > > solution". A great coup for my job security and TSM, as 
> > they now see the
> > > benefit of TSM.
> > > 
> > >         The problem with these hosts is that they generate 
> > many, many
> > > small
> > > files every day. Without going into any detail, each file 
> > is a test on a
> > > part that they may need to look at if the part ever fails. 
> > Each part gets
> > > many tests done to it through the manufacturing process, so 
> > many files are
> > > generated for each part.
> > > 
> > >         How many files? Well, I have one Solaris-based host 
> > that generates
> > > 500,000 new files a day in a deeply nested directory 
> > structure (about 10
> > > levels deep with only about 5 files per directory). Before 
> > I am asked,
> > > "no,
> > > they are not able to change the directory or file structure 
> > on the host.
> > > It
> > > runs proprietary applications that can't be altered". They 
> > are currently
> > > keeping these files on the host for about 30 days and then 
> > deleting them.
> > > 
> > >         I have no problem moving the files to TSM on a 
> > nightly basis, we
> > > have a nice big network pipe and the files are small. The 
> > problem is with
> > > the TSM database growth, and the number of files per 
> > filesystem (stored in
> > > TSM). Unfortunately, the directories are not shown when you 
> > do a 'q occ'
> > > on
> > > a node, so there is actually a "hidden" number of database 
> > entries that
> > > are
> > > taking up space in my TSM database that are not readily 
> > apparent when
> > > looking at the output of "q node".
> > > 
> > >         One of my TSM databases is growing by about 1.5 GB 
> > a week, with no
> > > end in sight. We currently are keeping those files for 180 
> > days, but they
> > > are now requesting that they be kept for 5 years (in case a 
> > part gets
> > > returned by a customer).
> > > 
> > >         This one nightmare host now has over 20 million 
> > files (and an
> > > unknown number of directories) across 10 filesystems. We 
> > have found from
> > > experience, that any more than about 500,000 files in any 
> > filesystem means
> > > a
> > > full filesystem restore would take many hours. Just to restore the
> > > directory
> > > structure seems to take a few hours at least. I have told 
> > the admins of
> > > this
> > > host that it is very much unrecoverable in its current 
> > state, and would
> > > take on the order of days to restore the whole box.
> > > 
> > >         They are disappointed that an "enterprise backup 
> > solution" can't
> > > handle this number of files any better. They are willing to 
> > work with us
> > > to
> > > get a solution that will both cover the daily "disaster 
> > recovery" backup
> > > need for the host and the long term retentions they desire.
> > > 
> > >         I am pushing back and telling them that their 
> > desire to keep it
> > > all
> > > for 5 years is unreasonable, but thought I'd bounce it off 
> > you folks to
> > > see
> > > if there was some TSM solution that I was overlooking.
> > > 
> > >         There are 2 ways to control database growth: reduce 
> > the number of
> > > database entries, or reduce the retention time.
> > > 
> > > Here is what I've looked into so far.
> > > 
> > > 1. Cut the incremental backup retention down to 30 days and 
> > then generate
> > > a
> > > backup set every 30 days for long term retention.
> > >         On paper it looks good: you don't have to move the 
> > data over the
> > > net
> > > again and there is only 1 database entry. Well, I'm not 
> > sure how many of
> > > you
> > > have tried this on a filesystem with many files, but I 
> > tried it twice on a
> > > filesystem with only 20,000 files and it took over 1 hour 
> > to complete.
> > > Doing
> > > the math it would take over 100 hours to do each of these 2 
> > million-file
> > > filesystems. Doesn't seem really feasible.
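> > > (For the record, the extrapolation works out like this, assuming
> > > backup-set generation time scales linearly with file count:)

```python
# Extrapolate backup-set generation time from the measured sample.
measured_files = 20000      # files in the test filesystem
measured_hours = 1.0        # observed time ("over 1 hour")
target_files = 2000000      # one of the large filesystems

estimated_hours = measured_hours * target_files / measured_files
print(estimated_hours)  # 100.0 hours per filesystem
```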
> > > 
> > > 2. Cut the incremental backup retention down to 30 days and run an
> > > archive
> > > every 30 days to the 5 year management class.
> > >         This would cut down the number of files we are 
> > tracking with the
> > > incrementals, so a full filesystem restore from the latest 
> > backup would
> > > have
> > > less garbage to sort through and hopefully run quicker. Yet with the
> > > archives, we would have to move the 600 GB over the net 
> > every 30 days and
> > > would still end up tracking the millions of individual 
> > files for the next
> > > 5
> > > years.
> > > 
> > > 3. Use TSM as a disaster recovery solution with a short 30 
> > day retention,
> > > and use some other solution (like a local CD/DVD burner) to 
> > get the 5 year
> > > retention they desire. Still looking into this one, but 
> > they don't like it
> > > because it once again becomes a manual process to swap out CDs.
> > > 
> > > 4. Use TSM as a disaster recovery solution (with a short 30 
> > day retention)
> > > and have a process tar up all the 30-day-old files into one 
> > large file,
> > > then
> > > have TSM archive and then delete the .tar file. This would 
> > mean we only track
> > > 1
> > > large tar file for every day for the 5 year time (about 
> > 1800 files). This
> > > is
> > > the option we are currently pursuing.
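> > > A minimal sketch of what that nightly process could look like, using
> > > Python's tarfile module; the dsmc archive-and-delete step is only
> > > indicated in a comment, since the exact client invocation and
> > > management-class name would vary by site:

```python
import os
import tarfile
import time

def tar_files_older_than(root, days, tar_path):
    """Bundle files older than `days` under `root` into one tar archive,
    so TSM tracks a single object instead of millions of small files."""
    cutoff = time.time() - days * 86400
    bundled = []
    with tarfile.open(tar_path, "w:gz") as tar:
        for dirpath, _dirnames, filenames in os.walk(root):
            for name in filenames:
                path = os.path.join(dirpath, name)
                if os.path.getmtime(path) < cutoff:
                    tar.add(path)
                    bundled.append(path)
    # Next steps, outside this sketch: archive tar_path to the long-term
    # management class (e.g. something like `dsmc archive ... -archmc=...`),
    # then delete tar_path and the bundled originals.
    return bundled
```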
> > > 
> > >         Any other options or suggestions from the group? 
> > Any other backup
> > > solutions you have in place for tracking many files over 
> > longer periods of
> > > time?
> > > 
> > >         If you made it this far through this long e-mail, thanks for
> > > letting
> > > me drone on.
> > > 
> > > Thanks,
> > > Ben Bullock
> > > UNIX Systems Manager
> > > Micron Technology
> > > 
> > > 
> > > > -----Original Message-----
> > > > From: Jeff Connor [mailto:connorj AT NIAGARAMOHAWK DOT COM]
> > > > Sent: Thursday, February 15, 2001 12:01 PM
> > > > To: ADSM-L AT VM.MARIST DOT EDU
> > > > Subject: Re: Performance Large Files vs. Small Files
> > > >
> > > >
> > > > Diana,
> > > >
> > > > Sorry to chime in late on this but you've hit a subject I've been
> > > > struggling with for quite some time.
> > > >
> > > > We have some pretty large Windows NT file and print servers
> > > > using MSCS.
> > > > Each server has lots of small files (1.5 to 2.5 million) 
> > and total disk
> > > > space (the D: drive) between 150GB and 200GB, Compaq server,
> > > > two 400mhz xeon
> > > > with 400MB RAM.  We have been running TSM on the 
> > mainframe since ADSM
> > > > version 1 and are currently at 3.7 of the TSM server with 
> > 3.7.2.01 and
> > > > 4.1.2 on the NT clients.
> > > >
> > > >  Our Windows NT admins have had a concern for quite some time
> > > > regarding TSM
> > > > restore performance and how long it would take to restore
> > > > that big old D:
> > > > drive.  They don't see the value in TSM as a whole as 
> > compared to the
> > > > competition they just want to know how fast can you recover
> > > > my entire D:
> > > > drive.  They decided they wanted to perform weekly full
> > > > backups to direct
> > > > attached DLT drives using Arcserve and would use the TSM
> > > > incrementals to
> > > > forward recover during full volume restore.   We had to
> > > > finally recover one
> > > > of those big D: drives this past September.  The Arcserve
> > > > portion of the
> > > > recovery took about 10 hours if I recall correctly.  The 
> > TSM forward
> > > > recovery ran for 36 hours and only restored about 8.5GB.
> > > > They were not
> > > > pleased.  It seems all that comparing took quite some time.
> > > > I've been
> > > > trying to get to the root of the bottleneck since then.  I've
> > > > worked with
> > > > support on and off over the last few months performing
> > > > various traces and
> > > > the like.  At this point we are looking in the area of
> > > > mainframe TCPIP and
> > > > delays in acknowledgments coming out of the mainframe during test
> > > > restores.
> > > >
> > > > If you've worked with TSM for a number of years and 
> > through sources in
> > > > IBM/Tivoli and the valuable information from this listserv,
> > > > over time you
> > > > learn about all the TSM client and server "knobs" to turn to
> > > > try and get
> > > > maximum performance.  Things like Bufpoolsize, database 
> > cache hits,
> > > > housekeeping processes running at the same time as
> > > > backups/restores slowing
> > > > things down, network issues like auto-negotiate on NIC's, MTU
> > > > sizes, TSM
> > > > server database and log disk placement, tape drive 
> > load/seek times and
> > > > speeds and feeds.  Basically, I think we are pretty well set
> > > > with all those
> > > > important things to consider.  This problem we are having may be a
> > > > mainframe TCPIP issue in the end, but I am not sure that 
> > will be the
> > > > complete picture.
> > > >
> > > >  We have recently installed an AIX TSM server, H80 two-way,
> > > > 2GB memory,
> > > > 380GB EMC 3430 disk, 6 Fibre Channel 3590-E1A drives in a
> > > > 3494, TSM server
> > > > at 4.1.2.  We plan to move most of the larger clients from
> > > > the TSM OS/390
> > > > server to the AIX TSM server.  A good move to realize a 
> > performance
> > > > improvement according to many posts on this Listserv over the
> > > > years.  I am
> > > > in the process of testing my NT "problem children" as quickly
> > > > as I can to
> > > > prove this configuration will address the concerns our NT
> > > > Admins have about
> > > > restores of large NT servers.  I'm trying to prevent them
> > > > from installing a
> > > > Veritas SAN solution and asking them to stick with our
> > > > Enterprise Backup
> > > > Strategic direction which is to utilize TSM.  As you probably
> > > > know, the SAN
> > > > enabled TSM backup/archive client for NT is not here and may
> > > > never be from
> > > > what I've heard.  My only option at this point is SAN tape
> > > > library sharing
> > > > with the TSM client and server on the same machine for each
> > > > of our MSCS
> > > > servers.
> > > >
> > > > Now I'm sure many of you reading this may be thinking of
> > > > things like, "why
> > > > not break the D: drive into smaller partitions so you can 
> > collocate by
> > > > filespace and restore all the data concurrently".  No go
> > > > guys, they don't
> > > > want to change the way they configure their servers just to
> > > > accommodate TSM
> > > > when they feel they would not have to with other products.
> > > > They feel that
> > > > with 144GB single drives around the corner who is to say what
> > > > a "big" NT
> > > > partition is?  NT seems to support these large drives 
> > without issues.
> > > > (Their words not mine).
> > > >
> > > > Back to the issue.  Our initial backup tests using our new
> > > > AIX TSM server
> > > > have produced significant improvements in performance.  I am
> > > > just getting
> > > > the pieces in place to perform restore tests.  My first test
> > > > a couple days
> > > > ago was to restore part of the data from that server we had
> > > > the issue with
> > > > in September.  It took about one hour to lay down just 
> > the directories
> > > > before restoring any files.  Probably still better than the
> > > > mainframe but
> > > > not great.  My plan for future tests is to perform backups
> > > > and restores of
> > > > the same data to and from both of my TSM servers to compare
> > > > performance.  I
> > > > will share the results with you and the rest of the listserv
> > > > as I progress.
> > > >
> > > > In general I have always, like many other TSM users, achieved
> > > > much better
> > > > restore/backup rates with larger files versus lots of 
> > smaller files.
> > > > Assuming you've done all the right tuning, the question that
> > > > comes to my
> > > > mind is, does it really come down to the architecture?  The
> > > > TSM database
> > > > makes things very easy for day to day smaller recoveries
> > > > which is the type
> > > > we perform most.  But does the architecture that makes day to day
> > > > operations easier not lend itself well to backup/recovery of
> > > > large amounts
> > > > of data made up of small files?  I have very little 
> > experience with
> > > > competing products. Do they struggle with lots of small 
> > files as well?
> > > > Veritas, Arcserve anyone?  If the issue is, as some on the
> > > > Listserv have
> > > > suggested, frequent interaction with the client file system
> > > > the bottleneck,
> > > > then I suppose the answer would be yes the other products
> > > > have the same
> > > > problem.  Or is the issue more on the TSM database side due
> > > > to its design,
> > > > and other products using different architectures may not have
> > > > this problem?
> > > > Maybe the competition's architecture is less bulletproof but
> > > > if you're one
> > > > of our NT Admins you don't seem to care when the client 
> > keeps calling
> > > > asking how much longer the restore will be running.   I know TSM
> > > > development is aware of the issues with lots of small files
> > > > and I would be
> > > > curious what they plan to do about the problems Diana and I have
> > > > experienced.
> > > >
> > > > The newer client option, Resourceutilization, has helped with
> > > > backing up
> > > > clients with lots of small files more quickly.  I would love
> > > > to see the
> > > > same type of automated multi-tasking on restores.  I 
> > don't know the
> > > > specifics of how this actually works but it seems to me that
> > > > when I ask to
> > > > restore an entire NT drive, for example, the TSM
> > > > client/server must sort
> > > > the file list in some fashion to intelligently request 
> > tape volumes to
> > > > minimize the mounts required.  If that's the case could they
> > > > take things
> > > > one step further and add an option to the restore specifying
> > > > the number of
> > > > concurrent sessions/mountpoints to be used to perform the
> > > > restore?  For
> > > > example, if I have a node whose collocated data is spread
> > > > across twenty
> > > > tapes and I have 6 tape drives available for the recovery,
> > > > how about an
> > > > option for the restore command like:
> > > >
> > > >      RES -subd=y -nummp=6 d:\*
> > > >
> > > > where the -nummp option would be the number of mount
> > > > points/tape drives to
> > > > be used for the restore.  TSM could sort the file list coming
> > > > up with the
> > > > list of tapes to be used for the restore and perhaps 
> > spread the mounts
> > > > across 6 sessions/mount points.  I'm sure I've probably made
> > > > a complex task
> > > > sound simple but this type of option would be very useful.  I
> > > > think many of
> > > > us have seen the benefits of running multiple sessions to
> > > > reduce recovery
> > > > elapsed time.  I find my current choices for doing so difficult to
> > > > implement or politically undesirable.
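> > > > One way to approximate a -nummp option by hand today: split the
> > > > top-level directories round-robin across N concurrently started
> > > > restore sessions, one per available drive. A sketch (the dsmc
> > > > command line here is illustrative, not a documented invocation):

```python
import subprocess
from concurrent.futures import ThreadPoolExecutor

def partition(items, n):
    """Round-robin a list of directories into at most n groups."""
    groups = [[] for _ in range(n)]
    for i, item in enumerate(items):
        groups[i % n].append(item)
    return [g for g in groups if g]

def parallel_restore(top_dirs, mount_points=6, dry_run=True):
    """Run one restore session per group, up to `mount_points` concurrent
    sessions (roughly one tape drive each). Command is illustrative."""
    cmds = [["dsmc", "restore", "-subdir=yes"] + group
            for group in partition(top_dirs, mount_points)]
    if dry_run:
        return cmds
    with ThreadPoolExecutor(max_workers=mount_points) as pool:
        list(pool.map(subprocess.run, cmds))
```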
> > > >
> > > > If others have the same issues with lots of small files in
> > > > particular with
> > > > Windows NT clients lets hear from you.  Maybe we can come 
> > up with some
> > > > enhancement requests.  I'll pass on the results of my 
> > tests as stated
> > > > above.  I'd be interested in hearing from those of you that
> > > > have worked
> > > > with other products and can tell me if they have the same 
> > performance
> > > > problems with lots of small files.  If the performance of
> > > > other products is
> > > > impacted in the same way as TSM performance then that 
> > would be good to
> > > > know.  If it's more about the Windows NT NTFS file system 
> > then I'd be
> > > > satisfied with that explanation as well.  If it's that lots
> > > > of interaction
> > > > with the TSM database leads to slower performance, even 
> > when optimally
> > > > configured, then I'd like to know what Tivoli has in the
> > > > works to address
> > > > the issue.  Because if it's the TSM database, I could
> > > > probably install the
> > > > fattest Fibre Channel/network pipe with the fastest
> > > > peripherals and server
> > > > hardware around and it might not change a thing.
> > > >
> > > > Thanks
> > > > Jeff Connor
> > > > Niagara Mohawk Power Corp.
> > > >
> > > >
> > > >
> > > >
> > > >
> > > >
> > > > "Diana J.Cline" <Diana.Cline AT ROSSNUTRITION DOT COM>@VM.MARIST.EDU> on
> > > > 02/14/2001 10:04:52 AM
> > > >
> > > > Please respond to "ADSM: Dist Stor Manager" <ADSM-L AT VM.MARIST DOT EDU>
> > > >
> > > > Sent by:  "ADSM: Dist Stor Manager" <ADSM-L AT VM.MARIST DOT EDU>
> > > >
> > > >
> > > > To:   ADSM-L AT VM.MARIST DOT EDU
> > > > cc:
> > > >
> > > > Subject:  Performance Large Files vs. Small Files
> > > >
> > > >
> > > > Using an NT Client and an AIX Server
> > > >
> > > > Does anyone have a TECHNICAL reason why I can backup 30GB of
> > > > 2GB files that
> > > > are
> > > > stored in one directory so much faster than 30GB of 2KB 
> > files that are
> > > > stored
> > > > in a bunch of directories?
> > > >
> > > > I know that this is the case, I just would like to find out
> > > > why.  If the
> > > > amount
> > > > of data is the same and the Network Data Transfer Rate is the
> > > > same between
> > > > the
> > > > two backups, why does it take the TSM server so much longer
> > > > to process the
> > > > files being sent by the larger amount of files in multiple
> > > > directories?
> > > >
> > > > I sure would like to have the answer to this.  We are trying
> > > > to complete an
> > > > incremental backup of an NT Server with about 3 million small objects
> > > > (according
> > > > to TSM) in many, many folders and it can't even get done in
> > > > 12 hours.  The
> > > > actual amount of data transferred is only about 7GB per
> > > > night.  We have
> > > > other
> > > > backups that can complete 50GB in 5 hours but they are in one
> > > > directory and
> > > > the
> > > > # of files is smaller.
> > > >
> > > > Thanks
> > > >
> > > >
> > > >
> > > >
> > > >
> > > >  Network data transfer rate
> > > >  --------------------------
> > > >  The average rate at which the network transfers data between
> > > >  the TSM client and the TSM server, calculated by dividing the
> > > >  total number of bytes transferred by the time to transfer the
> > > >  data over the network. The time it takes for TSM to process
> > > >  objects is not included in the network transfer rate. Therefore,
> > > >  the network transfer rate is higher than the aggregate transfer
> > > >  rate.
> > > >
> > > >  Aggregate data transfer rate
> > > >  ----------------------------
> > > >  The average rate at which TSM and the network transfer data
> > > >  between the TSM client and the TSM server, calculated by
> > > >  dividing the total number of bytes transferred by the time
> > > >  that elapses from the beginning to the end of the process.
> > > >  Both TSM processing and network time are included in the
> > > >  aggregate transfer rate. Therefore, the aggregate transfer
> > > >  rate is lower than the network transfer rate.
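> > > > In other words, taking the byte count and the two timings from the
> > > > client session statistics (numbers below are made up for illustration):

```python
def transfer_rates(total_bytes, network_seconds, elapsed_seconds):
    """Network rate excludes TSM object-processing time; aggregate rate
    divides by the whole elapsed time, so it is always the lower figure."""
    network_rate = total_bytes / network_seconds
    aggregate_rate = total_bytes / elapsed_seconds
    return network_rate, aggregate_rate

# Example: 7GB moved in 2 hours of wire time, 12 hours elapsed overall --
# the small-file processing overhead shows up only in the aggregate rate.
net, agg = transfer_rates(7 * 1024**3, 2 * 3600, 12 * 3600)
print(net > agg)  # True
```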
> > > >
> > 