Millions of small files

jthillo

Hello!



We just began backing up a server that houses millions of small files that will come out to 1.5 terabytes of data. So far we have backed up 3.9 million files in 475 gigs of space and the database has grown 345 megs.



The downside:



The number of files will continue to grow forever and we have to store the files forever.



The upside:



The files, once written, never change, and once the initial backup completes the daily intake should slow to a few hundred new files.



Right now I am doing a standard incremental backup, but I am considering moving to backup sets. I'm concerned, though, because I have to back up nightly and really don't want to generate a backup set each night.



Anybody have any thoughts on this, or any recommendations or gotchas to worry about?
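
For context, the two approaches would look roughly like this on the command line. This is only a sketch; the node, path and device class names (RECORDSNODE, D:\records, LTOCLASS) are placeholders rather than anything from our environment:

  (nightly, on the client - a scheduled incremental of the file server data)
  dsmc incremental D:\records\* -subdir=yes

  (occasionally, on the TSM server - cut a backup set from the node's active files)
  generate backupset RECORDSNODE RECSET * devclass=LTOCLASS retention=nolimit

Since a backup set is built from the active versions already stored on the server, the nightly incremental would still have to run either way; the backup set would be a periodic stand-alone copy rather than a replacement for the nightly run.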
 
I don't see how subfile backups would help in an environment where a file, once written, never changes. For a client like the one described above (I'd hazard a guess at some kind of archive server), I'd use the following setup:



1.) Create your own domain, storage pool and copy pool for the client (see the first sketch after this list).

2.) Since a file once written never changes, use incremental-by-date if backup performance becomes too slow, and throw in a standard incremental once a week or so for good measure.

3.) Since nothing will ever expire, keep an eye on the last-write dates of the volumes in your node's primary and copy pools, and issue MOVE DATA with reconstruction when the volumes reach a certain write age (maybe a year or so, depending on your tape technology) - see the second sketch after this list.

4.) When the node starts hogging your TSM DB because of the sheer number of objects, create a separate TSM server instance for this node (and, for example, define it as a library client to the existing instance). Move the node to the new instance via EXPORT NODE ... TOSERVER (third sketch after this list).

5.) If you can foresee that the node itself will become unmanageable due to the sheer number of objects, split the node right from the start (e.g. one node per top-level directory, or according to however the data is organized).

6.) Since the application handling the data will probably face the same problems with eternal growth, talk to the people who own the node and ask about their plans for what happens once the whole thing gets 'too big'. Maybe they'll want to split the entire server somehow. It helps to design the backup strategy with those plans in mind.
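
First sketch (points 1 and 2): roughly what the server-side setup and the client-side incrementals could look like. All names here (ARCHDOM, ARCHPOOL, ARCHCOPY, ARCHNODE, LTOCLASS) are made-up placeholders, and the policy set, management class and copy group definitions a real domain also needs are left out for brevity:

  define domain ARCHDOM description="dedicated domain for the archive server"
  define stgpool ARCHPOOL LTOCLASS maxscratch=100                   (primary sequential pool)
  define stgpool ARCHCOPY LTOCLASS maxscratch=100 pooltype=copy     (copy pool)
  register node ARCHNODE secretpw domain=ARCHDOM

  (on the client)
  dsmc incremental -incrbydate     (fast, date-based pass for the nightly run)
  dsmc incremental                 (full incremental once a week, to catch anything the date-based pass missed)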
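
Second sketch (point 3): checking the write age and refreshing old volumes. Volume and pool names are again placeholders:

  query volume stgpool=ARCHPOOL format=detailed     (the detailed output includes the date the volume was last written)
  move data VOL001 reconstruct=yes                  (moves the data off the aging volume onto other volumes in the pool)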
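
Third sketch (point 4): moving the node to a second instance, assuming server-to-server communication can be set up between the two. The server names, password and addresses are placeholders:

  (on the new instance, to share the existing library as a library client)
  define library SHAREDLIB libtype=shared primarylibmanager=TSMOLD

  (on the existing instance: define the target server, then push the node across)
  define server TSMNEW serverpassword=secretpw hladdress=10.0.0.2 lladdress=1500
  export node ARCHNODE filedata=all toserver=TSMNEW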



Cheers, PJ
 
Journaling works well in theory; it's just not perfect (far from it). I would use it in this case, and also perform weekly to bi-weekly image backups to make sure you are not waiting an eternity on a restore if the journal ever failed. You never stated whether the system is Windows or Unix. Let's assume Windows, so you can do journaling. If it's not Windows, then it gets more critical: Unix does not play well with lots of small files, and I got my arse handed to me by a Unix inode issue once. Also, I haven't seen an online image backup feature for Unix yet (although it's supposed to be coming). If it's Windows based, the problem is not so bad, since it allows for online image backups.
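
If it does turn out to be Windows, the journal service is driven by tsmjbbd.ini on the client; something along these lines should be close (the paths, drive letter and exclude entry are just examples):

  [JournalSettings]
  Errorlog=C:\tsmjournal\jbberror.log
  Journaldir=C:\tsmjournal

  [JournalExcludeList]
  *.tmp

  [JournaledFileSystemSettings]
  JournaledFileSystems=D:

and the periodic image backup is simply a client command such as:

  dsmc backup image D: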
 
I don't think image backups, journalling or whatever are really necessary in this case. There'll be no inactive files, so incremental-by-date will not miss anything, and a no-query restore should blast through at almost tape speed. There are no skips, no secondary queries, no nothing. Our KVS repository (same situation - no inactive files whatsoever) restores via no-query at 32 MB/s from a single LTO2 tape.
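
For what it's worth, the no-query path is taken automatically as long as the restore request is a plain, unrestricted one. A whole-drive restore like the following (drive letter is just an example) should qualify, while options such as -pick, -inactive or date filters push you back onto the classic query-based restore:

  dsmc restore D:\* -subdir=yes -replace=all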



Cheers, PJ
 
One more thing to watch is the memory usage on the box you're backing up.



We have reached the limits of 32-bit Windows, as the paged pool fills up too quickly during a backup scan and crashes the node.



MS have issued a couple of registry fixes to resolve this, but they come at a cost: performance is much worse, because the system flushes the paged pool to the page file much earlier.
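
For anyone hitting the same wall, the values in question live under the Memory Management key; the usual advice is along these lines, but check the MS article for your exact Windows level before touching them:

  HKEY_LOCAL_MACHINE\SYSTEM\CurrentControlSet\Control\Session Manager\Memory Management
    PagedPoolSize     REG_DWORD   0xFFFFFFFF   (request the maximum possible paged pool)
    PoolUsageMaximum  REG_DWORD   40           (start trimming the pool at 40% usage)

It's the PoolUsageMaximum value that causes the early flushing to the page file mentioned above.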



We're now looking at migrating a cluster with 37 million files onto 64-bit hardware and software.



We recently moved the node to a new TSM server, and it has taken 35% of a 60 GB database with just the one full backup!
 
Thanks all,



I appreciate all the information.



Good "Gotcha" on the write date for the tapes! Something I would not have thought of.



Interesting thought on the separate instance as well.



We were running two servers with the same data for a while during the migration. When I deleted the old server and removed the data, our database shrank 18 percent!



Now just to watch it grow and grow....



The server houses public domain data for a local government - things such as land deeds going back to when the county was formed in the late 1700s. The deeds and other records are scanned and stored on a server for public use. Each year, of course, new deeds, marriage licenses, and so on come in. The server data is as much a historical reference as a tool for current homebuyers.
 
JT,

Two more cents ... you might want to talk with your sysadmins and application developers about using DiskXtender or TSM HSM.

Cheers,

Neil
 