ADSM-L

Subject: Re: Duplicate Files
From: Paul Zarnowski <VKM AT CORNELLC.CIT.CORNELL DOT EDU>
Date: Tue, 9 Jan 1996 15:01:25 EST
Paul,

Since you asked for some feedback, here are my 2 cents..

On Thu, 4 Jan 1996 11:50:46 PST Paul L. Bradshaw said:
>The subject of detecting duplicate files, etc., has been discussed in ADSM
>many times.  Basically the situation comes down to the following:
>
>To perform adequate duplicate file detection, you can do this with varying
>degrees of sophistication, safety, and cost.
>
>For reasonable safety, every file backed up needs to have a CRC associated
>with it.  We are looking into that for various reasons, this being one of
>them.  This way a file with the same name, size, maybe date/time, and CRC
>can be assumed to be a duplicate (if you choose to use this feature in the
>system).  Note this also includes any extended attributes as well with the
>file.
>
>Then you could create a new table in the system that tracks duplicate files.
>You DO NOT want to look through the entire DB every time you back up a file
>to see if it is a duplicate... some sites have well over 50 million files
>already!

Agreed.  My biggest concern with such a feature is the CPU overhead it might
add to the server.  To be useful, such duplication checking would have to be
CPU-efficient.  Another point: since you are talking about having a separate
table to track duplicate files, it would probably be very useful to have a
way to scan the existing database to determine which duplicate files it
already contains.  By sorting the results, you could see which files would
be the most useful to have in this table.  Since such a command would
probably take a long time to run, it is something you would only want to do
infrequently.
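
To make that concrete, here is the kind of offline scan I have in mind,
sketched in Python against a made-up export format; the column names are my
own invention, not anything ADSM produces today:

# Hypothetical sketch only: rank candidate duplicates from an exported list
# of backed-up file entries.  The CSV format and column names (node,
# filename, size) are my own invention, not an actual ADSM export.
import csv
from collections import defaultdict

def rank_duplicates(inventory_csv):
    groups = defaultdict(list)              # (filename, size) -> owning nodes
    with open(inventory_csv, newline="") as f:
        for row in csv.DictReader(f):
            groups[(row["filename"], int(row["size"]))].append(row["node"])

    candidates = []
    for (name, size), nodes in groups.items():
        if len(nodes) > 1:
            # Storage you could reclaim by keeping one copy and pointing
            # the rest at it.
            candidates.append((size * (len(nodes) - 1), name, len(nodes)))
    return sorted(candidates, reverse=True)

if __name__ == "__main__":
    for saved, name, copies in rank_duplicates("inventory.csv")[:20]:
        print(name, copies, "copies,", saved, "bytes reclaimable")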

Rather than having one duplicate file table, I would suggest that you have
multiple tables.  First, files differ by platform, so it would be
inefficient to check Macintosh files against a table that included Windows
applications.  Second, files might also differ by groups of users.  For example,
users in the accounting department would probably have a different set of
applications than those in the research department, although there would
probably be some amount of overlap.
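
Something along these lines is what I'm picturing; the keys and names below
are purely illustrative:

# Rough illustration of the partitioning idea; all names are hypothetical.
# Each (platform, group) pair gets its own small table, so a Macintosh file
# is never checked against a table full of Windows applications.
dup_tables = {
    ("mac", "research"):   {},   # (name, size, crc) -> canonical copy location
    ("win", "accounting"): {},
}

def lookup_duplicate(platform, group, name, size, crc):
    table = dup_tables.get((platform, group), {})
    return table.get((name, size, crc))   # None means back the file up normally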

>          If a file matches on name, date, size, then you could read it
>to check the CRC.  If it is a duplicate, add another pointer to the
>duplicate file saying this user has one as well, but it is stored here on
>his system, etc..  Note this takes basically the same DB space in ADSM
>to save this info as it does to point to a non-duplicate file.  Thus you
>achieve very little DB space savings.
>
>Also note that reading the file to check the CRC is a large percentage of
>the time to actually backup the file over the network to the server.  Thus
>this savings is small as well unless you are running over a slow network
>such as phone lines, etc.

I'm not concerned about being wasteful of client resource (i.e., CPU).  If I
don't save any client resource that would be fine, so long as I am saving
server or network resource.  I am much more concerned about these, as they
are more likely to be a limiting factor in our environment.  Saving server
storage, even if it's only tape storage, is also a benefit.
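
For what it's worth, here is a rough sketch of the client-side flow as I
understand the description above; the server calls are invented names, not
an existing ADSM interface:

# Sketch of the client-side flow being described (the server calls are
# hypothetical, not an existing ADSM interface).  The point is that the
# client spends CPU reading the file only when the cheap metadata check
# already matched, and a confirmed duplicate never crosses the network.
import zlib

def back_up_file(server, path, name, size, mtime):
    if server.has_duplicate_candidate(name, size, mtime):   # metadata-only check
        crc = 0
        with open(path, "rb") as f:
            for chunk in iter(lambda: f.read(65536), b""):
                crc = zlib.crc32(chunk, crc)                 # client pays the CPU
        if server.register_duplicate(name, size, mtime, crc):
            return "skipped send"                            # only a pointer stored
    server.send_file(path)                                   # normal backup path
    return "sent"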

>ADSM's progressive incremental model also ensures that once a file is sent to
>the ADSM server, that it is not sent again unless it changes.  One can make
>an assumption that duplicate files are very static, that is why they are
>duplicate!  Thus once ADSM backs up a file, it never has to back it up again
>unless it changes.  Thus my xyz.exe file hasn't been re-sent to the ADSM
>server for over a year now since it hasn't changed and the server has a
>good copy already.  Thus I don't have the expense in ADSM of doing full
>backups on a regular basis, and that further reduces the cost savings of
>finding duplicate files.

I agree that ADSM's progressive incremental model is a very good thing.  It
is what makes ADSM so much more scalable than other backup solutions which
are dependent on periodic full backups.  While I also agree that this
reduces the potential cost savings of finding duplicate files, I still
think that finding duplicate files would be a valuable enhancement to ADSM.
Most of our desktop workstation disks are largely filled with applications,
many of which are common across multiple systems.  I would guess that for
desktop workstations, more than half of the data would qualify for duplicate
checking.  Even with tape costs as low as they are, if I could reduce them by
a factor of two, that would be well worth doing.  Indeed, I think it might be
possible to reduce them much further, since we have many files that are
duplicated potentially hundreds of times.  If I could store one copy of the
Microsoft Word application files instead of hundreds of copies, then my
storage costs could be substantially reduced.

>The other part of the equation is the cost of media to store these duplicate
>files on the system.  A $30 tape can hold 5GB of data these days, so if you
>have 15GB of duplicate data that may cost you $100 in media, or $200 if you
>duplicate it in ADSM.  Since the assumption that these duplicate files are
>largely static, they will propagate down the media hierarchy to reside on
>low cost media over a short period of time.

I don't know what other sites do, but we do not use shelf storage for any
of our tapes.  We keep all of our tapes in a robot.  In addition to the cost
of the tape, you must add in the cost of the robot.  Keeping tapes in shelf
storage (e.g., manual library) only works if you have a human around who can
satisfy the mount request when it happens.

I guess that the actual savings would be dependent on the ratio of duplicate
to non-duplicate files stored on your server.  Actually, rather than count
all duplicate files, you would only want to count those that you would be
willing to add to your "duplication-check-tables".  It seems like the storage
savings could be calculated by examining someone's current database.  It
would probably take a while to perform the analysis, but it might be very
enlightening (even without having checksum data).  Perhaps if I can find some
spare time...  Now where did I put that round tuit??
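
A back-of-the-envelope version of that calculation might look like the
following; it reuses the (filename, size) grouping from my earlier sketch,
and the five-copy "worth tracking" threshold is an arbitrary number I made up:

# Back-of-the-envelope ratio, using the same (filename, size) -> nodes
# grouping as the earlier sketch.  The "worth tracking" threshold of five
# copies is an arbitrary number I made up.
def savings_ratio(groups, min_copies=5):
    total = sum(size * len(nodes) for (_, size), nodes in groups.items())
    redundant = sum(size * (len(nodes) - 1)
                    for (_, size), nodes in groups.items()
                    if len(nodes) >= min_copies)
    return redundant / total if total else 0.0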

>Thus the actual cost savings in determining if a file is a duplicate, and thus
>not sending it to the ADSM server is actually minimal.  This is due to some
>of the features in ADSM such as its progressive backup model.
>In ADSM, you pay a 1 time up front cost to save this duplicate file.  For
>duplicate detection you pay in higher DB search costs, but lower storage
>costs.

In addition to the storage cost savings, there are possible resource savings
in not needlessly transmitting duplicate copies to the server.  I think I
largely agree with Paul's analysis that because of ADSM's progressive
incremental backup model, these savings are probably minimal, and by
themselves probably do not justify the development effort.  I can see some
circumstances, however, where we would find them useful at Cornell.  At the
beginning of the semester, we distribute new versions of applications
software to everyone on campus.  This causes spikes in ADSM backup network
traffic, which could potentially be avoided with duplicate file checking,
especially if the addition of new files to the duplicate tables could
somehow be automated.
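
Purely as a sketch, the automation I have in mind for the distribution case
might look something like this (the register call is an invented name, not a
real ADSM interface):

# Hypothetical automation for the semester distribution: walk the master
# image once and pre-register every file, so that clients backing up the
# newly installed software immediately hit duplicate entries.  The
# register_duplicate_candidate call is an assumption, not a real ADSM API.
import os
import zlib

def seed_from_image(server, image_root, platform):
    for dirpath, _, filenames in os.walk(image_root):
        for fname in filenames:
            path = os.path.join(dirpath, fname)
            size, crc = os.path.getsize(path), 0
            with open(path, "rb") as f:
                for chunk in iter(lambda: f.read(65536), b""):
                    crc = zlib.crc32(chunk, crc)
            server.register_duplicate_candidate(platform, fname, size, crc)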

There is one other situation where duplicate file checking would help, and
that is in the initial backup of a workstation.  We have received several
comments from our users about how long the initial backup took.  These
comments have mainly come from Macintosh users who have large disks with
lots of applications on them.  Part of this is probably due to the slowness
of the Mac client (or MacTCP or something), and part may come from users
comparing ADSM's speed to that of Retrospect Remote, which is faster than
ADSM.

>All of this said, duplicate file detection is still on the list of
>requirements for ADSM since there are situations where it is very useful
>(those offering services over slow networks being the main beneficiary).
>Duplicate file detection is also very valuable if you have to do a full
>backup of your system on a weekly or regular basis.  ADSM has eliminated
>the need to do this with its progressive backup model.
>
>So, IBM is listening.  Those are our thoughts on this so far.  On the
>surface it looks like a valuable, cost savings function.  But once you dig
>into it the cost savings become very minimal if any at all.
>
>Let me know if you agree or not.  We don't have all the answers to this,
>so your input is valuable in setting our future directions.
>
>Paul Bradshaw

Well, that's my feedback.  I still think there are achievable cost savings.
Remember, even though tape is inexpensive, disk keeps getting cheaper all the
time.  If you can't stay ahead of declining disk costs,
then some folks may just decide to buy duplicate disks and use them for
backup.

..Paul

Paul Zarnowski                     Phone:   607/255-4757
Cornell Information Technologies   Fax:     607/255-6523
Cornell University                 US Mail: 315 CCC, Ithaca, NY 14853-2601