ADSM-L

Re: FW: Requirement for ADSM to store only one copy of a file

1996-01-04 12:22:02
Subject: Re: FW: Requirement for ADSM to store only one copy of a file
From: Chris Krusch <Chris.Krusch AT UBC DOT CA>
Date: Thu, 4 Jan 1996 09:22:02 -0800
When evaluating products, we noted that HARBOR from New Era Systems manages
to do this. (It was a close race and we particularly liked this feature of
Harbor, but in the end there were other reasons that ADSM was chosen.)

The Harbor maintainers can build AFRC (automatic file redundancy checking)
libraries in which they store common software - such as Windows system
binaries, Excel, Word, etc. These libraries always reside on disk.

When backing up, if the filename, dates, sizes, checksums, etc. match those
of a file in the AFRC library, the file is not physically backed up; instead,
the database entry on the backup server is pointed to the copy in the AFRC
library.
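
As a rough illustration of the matching step described above (this is not
Harbor's actual implementation, which I have not seen), the sketch below
uses hypothetical names throughout: FileSignature, signature() and back_up()
are invented for the example, and SHA-256 stands in for whatever checksum
the product really uses.

```python
import hashlib
from dataclasses import dataclass


@dataclass(frozen=True)
class FileSignature:
    """Attributes compared against the AFRC library (hypothetical layout)."""
    name: str
    size: int
    mtime: int
    checksum: str


def signature(name, data, mtime):
    # File names are compared case-insensitively, as on DOS/Windows clients.
    return FileSignature(name.lower(), len(data), mtime,
                         hashlib.sha256(data).hexdigest())


def back_up(sig, data, library, catalog, storage):
    """Record one file in the catalog; return True if data was physically stored.

    library maps FileSignature -> identifier of the copy held in the AFRC library.
    catalog maps file name -> ("afrc", library id) or ("pool", index in storage).
    """
    if sig in library:
        # Every attribute matched: point the catalog entry at the library copy.
        catalog[sig.name] = ("afrc", library[sig])
        return False
    storage.append(data)
    catalog[sig.name] = ("pool", len(storage) - 1)
    return True
```

A file whose name, size, date and checksum all match a library entry costs
only a catalog entry; anything else is stored in the normal way.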

The advantages of this are:

1) greatly reduced first backup times - the majority of what gets backed up
is common programs such as Windows, Excel, Word, etc.

2) greatly reduced storage requirements - It would probably be conservative
to estimate that at least half of the data on a typical end user system
could easily be contained in an AFRC library (75% of my disk is operating
systems and common programs). If you have 50 Windows users, and can reduce
each of their backup set requirements by half, it's a big win. Half the
data to move around, migrate, collocate, etc. You can support twice as many
users on the same library.
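
To put rough numbers on the 50-user case, here is a small calculation. The
400 MB per user is an assumed figure chosen only for illustration, the
function name is made up, and the model assumes every user's shared half
matches the same library content, which is stored once.

```python
def pool_storage_mb(users, per_user_mb, shared_fraction):
    """Return (storage without AFRC, storage with AFRC) in MB.

    Assumes the shared portion is identical for all users and is kept
    once in the AFRC library.
    """
    without = users * per_user_mb
    shared = per_user_mb * shared_fraction
    with_afrc = users * (per_user_mb - shared) + shared
    return without, with_afrc


without, with_afrc = pool_storage_mb(users=50, per_user_mb=400, shared_fraction=0.5)
print(without, with_afrc)
```

Under these assumptions the pool drops from 20000 MB to 10200 MB: half of
each user's data, plus one 200 MB copy of the shared files.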

Some final thoughts:

There are very good checksumming algorithms now - using a combination of 2
different checksums would leave very little chance that two different files
would produce the same match. Add to this a little extra checking (such as
certain directories that files must be contained in to be eligible for file
redundancy - e.g. all Windows binaries must be contained in a directory
called windows) and I'd feel very safe using it.
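
A minimal sketch of that idea, assuming CRC-32 and SHA-256 as the two
checksums (my choice for the example, not something any product is known to
use) and a made-up list of eligible directory names:

```python
import hashlib
import zlib
from pathlib import PureWindowsPath

# Hypothetical directory names a file must live under to be eligible.
ELIGIBLE_DIRS = {"windows", "excel", "winword"}


def dual_checksum(data: bytes):
    """Two independent checksums; both must agree for a match."""
    return zlib.crc32(data), hashlib.sha256(data).hexdigest()


def is_eligible(path: str) -> bool:
    """True if any directory component of the path is on the eligible list."""
    parts = PureWindowsPath(path).parent.parts
    return any(part.lower() in ELIGIBLE_DIRS for part in parts)


def matches_library(path: str, data: bytes, library_checksums) -> bool:
    """Directory check first (cheap), then the pair of checksums."""
    return is_eligible(path) and dual_checksum(data) in library_checksums
```

A file outside the eligible directories is never treated as redundant, even
if its contents happen to match a library entry.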

Taking checksums would be more resource-intensive, but it would only have
to be done the first time a file is backed up, or when you've decided a file
needs to be sent and it's in directories eligible for AFRC checking.
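
To show why the extra cost stays small, the sketch below checksums a file
only when it is new or its size or modification time has changed since the
last pass. The function names and the `seen` table are hypothetical, and
`read` is whatever callable returns a file's bytes.

```python
import hashlib


def needs_checksum(path, size, mtime, seen):
    """True when the file is new or its size/mtime changed since the last pass."""
    return seen.get(path) != (size, mtime)


def incremental_pass(files, seen, read):
    """Return {path: digest} for only the files that would be sent.

    files is an iterable of (path, size, mtime); seen is updated in place.
    """
    digests = {}
    for path, size, mtime in files:
        if needs_checksum(path, size, mtime, seen):
            digests[path] = hashlib.sha256(read(path)).hexdigest()
            seen[path] = (size, mtime)
    return digests
```

On a second pass over an unchanged disk no file is read at all, so the
checksum cost is paid mostly on the first backup.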

There's no effect on collocation (except having fewer files to collocate) if
the files in the common libraries always reside on quickly accessible media
such as disk or CD-ROM - non-AFRC-matched files come off the collocated
tape(s), and the remainder quickly off the disk(s).

It's not an automated method (someone has to build and maintain the
libraries) but the payback in reduced traffic and storage requirements makes
it very attractive.

>He's absolutely right - originally I said "maybe not WINWORD.EXE". But
>actually, WINWORD.EXE may in fact store my registration and license
>information.

My experience is that most programs store this sort of information in a
separate configuration file and do not modify themselves - as long as that
holds true for most programs, AFRC has many benefits.

I'd personally love to see this feature added to ADSM.

At  7:04 AM 1/4/96 -0500, Andrew Mark Raibeck wrote:
>Simon Travaglia makes an excellent point:
>
>========== Forwarded letter follows ==========
>>Anyone else out there have a need for this feature?
>>Anyone from IBM care to comment on directions?
>>Or am I missing the boat completely?
>>========== Forwarded letter ends ==========
>>
>>My own thoughts on this: I agree that this would be a desirable feature.
>>But I don't know how other products implement it. ADSM currently doesn't
>>inspect the contents of a file. I suppose it's possible that two
>>different files could have the same name, size, and modification
>>date/time. Maybe not WINWORD.EXE, but perhaps something like MYDOC.TXT.
>>This might not be as far-fetched as it sounds. I have no idea how other
>>vendors provide a solution to this.
>
>The problem with duplicate files is that a lot of the time they'll look
>like duplicates but will not in fact be that way. Take for instance an
>application that is 'customised' by the installation process to work only
>on a single machine.  It has an IP number, Ethernet address or name to
>distinguish itself.  Stashing only one copy of Photoshop, say, because
>they all have exactly the same file size, would not work.  It's possible
>that even checksumming might not work because it's conceivable that the
>checksum is 'padded' after the customisation process to match the original.
>
>The interim way around not backing up multiple copies of data is to not
>back it up at all.  If it's in multiple places and easily installable,
>why not exclude the entire directory, or certain file specs?  Another
>option is to have application areas and data areas on the client.  Never
>back up the application areas and always back up the data areas.
>========== Forwarded letter ends ==========
>
>He's absolutely right - originally I said "maybe not WINWORD.EXE". But
>actually, WINWORD.EXE may in fact store my registration and license
>information.
>
>Another potential problem if ADSM were to detect duplicate files and not back
>them up: for collocated tape pools, the effects of collocation would be
>"watered down".
>
>Andy Raibeck

Chris Krusch                             Email: Chris.Krusch AT ubc DOT ca
University Computing Services            Phone: (604)822-4215