ADSM-L

Subject: Duplicate Files
From: "Paul L. Bradshaw" <pbradshaw@vnet.ibm.com>
Date: Thu, 4 Jan 1996 11:50:46 PST
The subject of detecting duplicate files has been discussed on this list
many times.  Basically, the situation comes down to the following:

Duplicate file detection can be performed with varying degrees of
sophistication, safety, and cost.

For reasonable safety, every file backed up needs to have a CRC associated
with it.  We are looking into that for various reasons, this being one of
them.  A file with the same name, size, possibly date/time, and CRC could
then be assumed to be a duplicate (if you chose to use this feature in the
system).  Note that any extended attributes of the file must be included
in the comparison as well.
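
As a rough sketch of the kind of fingerprint being described (illustrative
Python, not ADSM's implementation; the CRC here covers file contents only,
where a real system would fold in extended attributes too):

  import os
  import zlib

  def file_crc32(path, chunk_size=64 * 1024):
      """CRC-32 over the file's contents, read in chunks."""
      crc = 0
      with open(path, "rb") as f:
          for chunk in iter(lambda: f.read(chunk_size), b""):
              crc = zlib.crc32(chunk, crc)
      return crc & 0xFFFFFFFF

  def fingerprint(path):
      """(name, size, mtime, crc): two files that agree on all four
      fields are assumed to be duplicates under this scheme."""
      st = os.stat(path)
      return (os.path.basename(path), st.st_size, int(st.st_mtime),
              file_crc32(path))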

Then you could create a new table in the system that tracks duplicate files.
You DO NOT want to look through the entire DB every time you back up a file
to see if it is a duplicate... some sites have well over 50 million files
already!  If a file matches on name, date, and size, then you could read it
to check the CRC.  If it is a duplicate, you add another pointer to the
duplicate file, recording that this user has a copy as well and where it is
stored on his system.  Note that it takes basically the same DB space in ADSM
to save this info as it does to point to a non-duplicate file.  Thus you
achieve very little DB space savings.
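
A minimal sketch of that lookup path, assuming an in-memory table keyed on
cheap metadata (the names, structure, and send_to_server stub are all
hypothetical, not ADSM's catalog):

  # Hypothetical duplicates table, keyed on metadata so the whole
  # inventory never has to be scanned per file.
  index = {}  # (name, size, mtime) -> {"crc": ..., "where": ..., "owners": [...]}

  def send_to_server(entry):
      # Stub standing in for the actual transfer to the ADSM server.
      return "server-volume:" + entry["name"]

  def back_up(entry, compute_crc):
      key = (entry["name"], entry["size"], entry["mtime"])
      known = index.get(key)
      crc = None
      if known is not None:
          crc = compute_crc(entry["path"])  # read the file only on a metadata hit
          if crc == known["crc"]:
              known["owners"].append(entry["owner"])  # one more pointer, no new copy
              return "duplicate"
      if crc is None:
          crc = compute_crc(entry["path"])
      # Not a duplicate: store the file and index it for future matches.
      # (A real table would keep a list of entries per key, not overwrite.)
      index[key] = {"crc": crc, "where": send_to_server(entry),
                    "owners": [entry["owner"]]}
      return "stored"

Here compute_crc could be the file_crc32 sketch above.  Note that each entry
appended to the owners list is roughly the size of an ordinary catalog
pointer, which is exactly the DB-space point made above.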

Also note that the time to read the file to check the CRC is a large
percentage of the time it takes to actually back the file up over the
network to the server.  Thus this savings is small as well, unless you are
running over a slow network such as phone lines.
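
Some back-of-envelope numbers make the point (illustrative figures, not
measurements):

  # Rough comparison of CRC read time vs. network send time.
  FILE_MB    = 10
  DISK_MB_S  = 5.0               # local sequential read
  LAN_MB_S   = 2.0               # effective backup throughput on a fast LAN
  PHONE_MB_S = 28.8e3 / 8 / 1e6  # 28.8 kbit/s modem, ~0.0036 MB/s

  read_s  = FILE_MB / DISK_MB_S  # 2 s to read the file for its CRC
  lan_s   = FILE_MB / LAN_MB_S   # 5 s to send it over the LAN
  phone_s = FILE_MB / PHONE_MB_S # ~46 minutes over the phone line

  print("LAN: save %.0f s per duplicate" % (lan_s - read_s))
  print("phone: save %.0f min per duplicate" % ((phone_s - read_s) / 60))

On the LAN the read eats most of the saving; over the phone line it is
negligible, which is exactly where duplicate detection pays off.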

ADSM's progressive incremental model also ensures that once a file is sent to
the ADSM server, it is not sent again unless it changes.  One can reasonably
assume that duplicate files are very static; that is why they are duplicates!
Thus once ADSM backs up a file, it never has to back it up again unless it
changes.  My xyz.exe file hasn't been re-sent to the ADSM server for over a
year now, since it hasn't changed and the server already has a good copy.
So I don't incur the expense in ADSM of doing full backups on a regular
basis, and that further reduces the potential cost savings of finding
duplicate files.
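
The decision at the heart of that model is simple; a sketch of it
(hypothetical names, not ADSM's client code):

  # Send a file only if it is new or changed since the last backup.
  def needs_backup(local, inventory):
      """local: {"name", "size", "mtime"}; inventory: name -> {"size", "mtime"}."""
      prev = inventory.get(local["name"])
      if prev is None:
          return True  # never backed up before
      return local["size"] != prev["size"] or local["mtime"] != prev["mtime"]

  inventory = {"xyz.exe": {"size": 104857, "mtime": 789000000}}
  local = {"name": "xyz.exe", "size": 104857, "mtime": 789000000}
  print(needs_backup(local, inventory))  # False: unchanged, so never re-sent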

The other part of the equation is the cost of media to store these duplicate
files on the system.  A $30 tape can hold 5GB of data these days, so 15GB of
duplicate data may cost you about $100 in media, or about $200 if you keep a
duplicate copy in ADSM.  Since these duplicate files are assumed to be
largely static, they will migrate down the media hierarchy to reside on
low-cost media within a short period of time.
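
Spelling out the arithmetic behind those figures: 15GB / 5GB per tape =
3 tapes, and 3 x $30 = $90, call it $100 in round numbers; keeping a second
copy doubles that to roughly $200.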

Thus the actual cost savings from determining that a file is a duplicate,
and therefore not sending it to the ADSM server, is minimal.  This is due to
some of the features in ADSM, such as its progressive backup model.
In ADSM, you pay a one-time, up-front cost to store the duplicate file.  With
duplicate detection you pay in higher DB search costs, but lower storage
costs.

All of this said, duplicate file detection is still on the list of
requirements for ADSM, since there are situations where it is very useful
(those offering services over slow networks being the main beneficiaries).
Duplicate file detection is also very valuable if you have to do a full
backup of your system on a weekly or other regular basis; ADSM has
eliminated the need for that with its progressive backup model.

So, IBM is listening.  Those are our thoughts on this so far.  On the
surface it looks like a valuable, cost-saving function.  But once you dig
into it, the cost savings become minimal, if there are any at all.

Let me know if you agree or not.  We don't have all the answers to this,
so your input is valuable in setting our future directions.

Paul Bradshaw