[Veritas-bu] Backups vs archives

2006-12-08 11:33:34
Subject: [Veritas-bu] Backups vs archives
From: cpreston at glasshouse.com (Curtis Preston)
Date: Fri, 8 Dec 2006 11:33:34 -0500
I was asked for a URL of this article. It's available at storage
magazine:
http://tinyurl.com/ygjlcj

(The original URL is
http://searchstorage.techtarget.com/magItem/0,291266,sid35_gci1216875,00.html
but that is probably truncated.)


---
W. Curtis Preston, Author of O'Reilly's Backup & Recovery and Using SANs
and NAS
VP Data Protection
GlassHouse Technologies


-----Original Message-----
From: netbackup-bounces at backupcentral.com
[mailto:netbackup-bounces at backupcentral.com] On Behalf Of Curtis Preston
Sent: Thursday, December 07, 2006 11:18 PM
To: Veritas-bu at mailman.eng.auburn.edu
Subject: [Veritas-bu] Backups vs archives

Based on some previous posts, I'd like to throw out the following
thought, and see what you folks think about it.  

If you want archives, make archives.  Use NetBackup's archive feature,
or use Enterprise Vault (or some other actual archive product).  You
don't make NetBackup backups and hold on to them for 7 years.  Using
backups as archives can make an actual e-discovery process VERY
painful.  I submit for your consideration an article that I wrote a few
months ago:

A bottle of grape juice left on a shelf long enough will ferment - but
no one would call it wine.  Backups left on a shelf long enough will
allow one to restore old data - but no one should call them archives.
Like a good wine, an archive should be made for a specific purpose using
an application designed to create archives.  This article will start
with a look at the business requirements for archiving, followed by a
discussion on why backups make lousy archives, and will end with a
discussion of the types of products designed to meet archive
requirements.

Archives are for the logical retrieval of information.  That is, they
allow one to retrieve information grouped in a logical way.  The first
way that archives manifest themselves is the storing of reference data,
such as:
*       The CAD drawings, parts lists, and other manufacturing
information for a widget a company used to make
*       All of the information pertaining to a former customer
*       All pertinent information regarding a closed project, account,
law case, etc.
*       Tax returns, financial records, or other records for a
particular year

Information that can be grouped in a logical way can be archived and
stored in such a way that a company can retrieve it via that logical
grouping.  Once a case is closed, a widget is no longer produced, a tax
year has passed, etc., the information pertaining to that item is just
taking up space.  We might need to reference it again for some reason,
but we don't want it filling up our high end storage, either.  So we
archive it and delete it.  If we need it five years later, we search the
archives for "Widget XYZ."

The second way that archives manifest themselves is in the logical
storage of active data.  
Suppose, for example, that it was discovered that a critical safety part
was taken out of the design of a particular widget.  It would be
important to be able to see every version of the specification, along
with information about who changed it.  Also consider the now rather
common practice of electronic discovery of email systems.  Think about
the discovery requests that can come from someone in management being
accused of harassment or discrimination; a trader being accused of
promising financial returns; or a company being accused of collusion
with competitors.  Such accusations result in electronic discovery
requests that look like the following:
*       All emails from employee A to employees B, C, and D for the last
year.
*       All emails and instant messages from all traders to all
customers for the last three years that contain the words "promise,"
"guaranty,"  "vow,"  "assure," or "warranty."
*       All emails that left a company going to domains x, y, and z or to
these email addresses

In summary, archives can contain the only copy of inactive (or
reference) data, or a reference copy of active data.

Backups make lousy archives

The most common way that people archive data is to simply keep their
backups for a long time.  They perform a weekly or monthly full backup,
and then keep that backup for anywhere from one year to fifty years,
depending on their business requirements.  There couldn't be a worse way
to archive.

There are many difficulties with using backups as archives, depending on
which type of archive we're talking about.  The most common use of
backups as archives is for the retrieval of reference data.  Companies
take one full backup per month and hold on to it for many years -
indefinitely in some cases.  The idea is that if someone asks for the
parts for widget ABC (or some other piece of reference data), we'll just
restore the appropriate files from the system where they used to
reside.   The first challenge with that plan is simply remembering where
the files were several years ago.  Can you remember the name of the
fileserver or database server that you used three years ago - let alone
seven years ago?  The next challenge is the number of operating systems
or application versions that have come and gone during that time.  To restore
files that were backed up from apollo five years ago, the first
requirement is a system named apollo.  Someone's also going to have to
handle any authentication issues between the backup server and the new
apollo, since it isn't the same apollo that it backed up from five years
ago.  Depending on the backup software and operating system in question,
the new apollo may also need to be running the same version of the
operating system and applications the old apollo was running five years
ago.  Otherwise, there may be incompatibilities in the filesystem or
database that's being restored to.

Backups are also used to satisfy electronic discovery requests, and
doing this can be even more challenging.  Let's use the most common
electronic discovery request as an example: a request for emails that
match a particular pattern and were sent via an Exchange server.  (The
concepts below also apply to other email systems, such as Lotus Notes or
SMTP, but we'll use Exchange as an example.)  There are two very large
challenges with using backups to satisfy such a request.  The first
challenge is that it is actually impossible to retrieve all emails sent
or received by a particular person.  It's only possible to restore those
emails that were present in the Exchange server when backups were made.
If someone sent an email that the discovery request is looking for,
deleted it, then cleared their Deleted Items folder, it wouldn't be on
that night's backup, and thus would never show up when attempting to
retrieve it weeks, months, or years later.  Therefore, it's technically
impossible to meet the discovery request using backups.  This means that
even after doing your best to successfully satisfy the discovery
request, a plaintiff may claim that you have not proven your case.
(Remember that in America, the burden of proof is different in civil
suits.  They do not have to prove their case beyond a reasonable doubt.
They must only provide a preponderance of evidence.)

The second challenge with using backups to satisfy an Exchange
electronic discovery request is that it's quite difficult to retrieve
months or years of e-mails using backups.  Suppose, for example, a
company performs a full backup of their Exchange server once a week, and
for compliance reasons they hold onto these backups for seven years.  If
they received an electronic discovery request for e-mails from the last
seven years, they would need to perform many restores of their entire
Exchange server to satisfy the request.  First they would restore their
Exchange server to an alternate server using last week's backup.  (Let's
not forget that an alternate server Exchange restore is not that easy to
do.)  Then they would run a query against Exchange to look for the
e-mails in question, saving them to a PST file.  Then they would restore
their Exchange server using the backup from two weeks ago, rerun the
query, and create another PST file.  They'll end up restoring their
entire Exchange server 364 times before they're done (seven years times
52 weeks).  Of course, almost every step in this process will have to be
done manually.
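
To put a rough number on that restore loop, here is a small Python
sketch.  The function names and the eight-hours-per-restore figure are
assumptions made up for illustration; only the seven years and the
weekly full backup come from the scenario above.

```python
WEEKS_PER_YEAR = 52


def full_restores_needed(retention_years: int, fulls_per_week: int = 1) -> int:
    """Count the full backups that must be restored to search every one."""
    return retention_years * WEEKS_PER_YEAR * fulls_per_week


def total_restore_hours(restores: int, hours_per_restore: float) -> float:
    """Total hands-on time if every restore takes the same number of hours."""
    return restores * hours_per_restore


restores = full_restores_needed(7)
print(restores)

# hours_per_restore=8.0 is a hypothetical figure, not a measured one.
print(total_restore_hours(restores, 8.0))
```

Plug in your own retention period and restore time to see how quickly
the manual effort grows.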

The real challenge here is that the scenario described above is not
impossible.  It will cost that company an incredible amount of time and
money, but a plaintiff in a civil suit or the government doesn't care
how much it costs the defendant.  The only thing you need to know is
that you have a court order to produce this information - regardless of
how much it costs.

Backups are also an extremely inefficient way to store archives.  Where
an archive system will make sure that it has one or two copies of a
particular version of a file, a backup system usually has no such logic.
If a company is using weekly full backups as archives (or creating
"archives" with their backup product but not deleting the original
files), and they're storing their archives for seven years, they'll have
364 copies of many of their files on tape - even if those files have
never changed.  This leads to an incredible amount of media waste.
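
The media waste can be sketched the same way.  This is a minimal
Python example; the 0.5 GB file size and the two-copy archive policy
are assumed values chosen only to make the comparison concrete.

```python
def copies_on_tape(retention_years: int, fulls_per_year: int = 52) -> int:
    """Copies of an unchanged file kept when every weekly full is retained."""
    return retention_years * fulls_per_year


def capacity_used_gb(file_size_gb: float, copies: int) -> float:
    """Media consumed by storing the same file `copies` times."""
    return file_size_gb * copies


FILE_SIZE_GB = 0.5   # hypothetical size of one unchanged file
ARCHIVE_COPIES = 2   # assumed archive policy: one on disk, one on tape

backup_copies = copies_on_tape(7)
print(backup_copies)
print(capacity_used_gb(FILE_SIZE_GB, backup_copies))
print(capacity_used_gb(FILE_SIZE_GB, ARCHIVE_COPIES))
```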

The other thing that we don't like to talk about when discussing backups
as archives is the number of times a given company changes backup
formats and tape formats over the years.  Almost every company using
backups as archives has a number of older tape and backup formats that
they must continue to support for archive purposes.  While older tape
formats can be converted with a lot of copying, converting older backup
formats is a whole different challenge.  Most people choose to hold onto
both old tape formats and old backup formats and hope they never
actually have to read them.

True Archiving

The most important feature of an archiving system is that the archive
should contain enough metadata to be able to retrieve the information in
logical ways.  For example, metadata can include the author, or business
unit that created an item. (An item can be any piece of archived
information, such as a file, a record from a database, or an email.)
Metadata might also contain the project that the item is attached to, or
some other logical grouping.  An email archive system would also include
who sent and received an email, the subject of the email, and all other
appropriate metadata.  Finally, an archive system may also import the
full text of the item into its database, allowing for full text searches
against the archive.  This can be a very useful feature, especially if
multiple formats can be supported.  It's very nice to be able to do a
full text search against all emails, Word documents, PDF files, etc.
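
As a rough illustration of why metadata matters, here is a minimal
Python sketch of a searchable email archive.  The ArchivedEmail record,
the search() function, and the sample addresses are all hypothetical;
real archive products use their own schemas and a proper full text
index rather than the simple substring match shown here.

```python
from dataclasses import dataclass
from datetime import date
from typing import Tuple


@dataclass(frozen=True)
class ArchivedEmail:
    sender: str
    recipients: Tuple[str, ...]
    sent_on: date
    subject: str
    body: str


def search(archive, *, sender=None, recipients_any=None, since=None,
           keywords=None):
    """Return items matching every filter that was supplied."""
    results = []
    for item in archive:
        if sender is not None and item.sender != sender:
            continue
        if recipients_any is not None and not (
                set(item.recipients) & set(recipients_any)):
            continue
        if since is not None and item.sent_on < since:
            continue
        if keywords is not None:
            # Naive substring match standing in for a full text index.
            text = f"{item.subject} {item.body}".lower()
            if not any(word.lower() in text for word in keywords):
                continue
        results.append(item)
    return results


archive = [
    ArchivedEmail("a@example.com", ("b@example.com",), date(2006, 3, 1),
                  "Lunch", "See you at noon."),
    ArchivedEmail("trader@example.com", ("customer@example.net",),
                  date(2005, 6, 15), "Fund", "I promise a 20% return."),
    ArchivedEmail("a@example.com", ("z@example.com",), date(2006, 5, 2),
                  "Status", "Report attached."),
]

# "All emails from employee A to employees B, C, and D for the last year."
hits = search(archive, sender="a@example.com",
              recipients_any=["b@example.com", "c@example.com",
                              "d@example.com"],
              since=date(2005, 12, 8))

# "All emails ... that contain the words promise, guaranty, vow, ..."
keyword_hits = search(archive, keywords=["promise", "guaranty", "vow",
                                         "assure", "warranty"])
print([e.subject for e in hits])
print([e.sender for e in keyword_hits])
```

Each discovery request becomes one query against the metadata, with no
restores involved.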

Another important feature of archive systems is their ability to store a
pre-determined number of copies of a given archived item.  The number of
copies a company chooses to keep is up to them and is based on what they
want to protect from.  For example, if they're storing their archives on
a RAID-protected system, they may choose to have one copy on disk and
another on a removable medium such as optical or tape.

Archive systems manifest themselves in two ways.  The first type of
archiving system is the traditional, low-retrieval archive system
attached to your backup software package.  You can make an archive of a
selected group of files and attach limited metadata to it, such as
"Widget XYZ," and then have the archive system delete the files in
question.  The good thing is that it allows the attachment of metadata,
and can reduce multiple copies in the archive by deleting files as
they're archived.  The bad news is that if you want to be able to search
archives via different types of metadata, such as owner, time frame,
etc., you would need to create multiple archives.  The main use for this
type of archive is to save space by deleting files attached to projects
or entities that are no longer active.

Newer archive systems realize that any given archived item might need to
be retrieved for different reasons and would thus require different
metadata.  To support multiple different types of retrievals, it's
important to store the actual archived item only once, but to store all
of its metadata in a searchable database.  Such a system also realizes
that a given archived item might be put into the archive not to save
space, but to allow it to be searched for logically.  Therefore, unlike
their predecessors that stored the only copies of reference data, these
newer types of archives tend to store an extra copy of the data, leaving
the original in place.

One of the problems discussed previously with using backups as archives
is that they won't have all occurrences of a given file or message; they
will have only those items that were available when the backup was made.
One of these newer archive systems solves this problem by archiving data
automatically.  For example, every email that comes in or is sent out is
sent to the archiving system.  Every time a file is saved, a version of
the file is sent to the archive system.
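
The difference between a nightly backup and automatic archiving can be
shown with a toy Python model.  The Mailbox class and its journal list
are hypothetical stand-ins, not how Exchange or any archive product
actually works; the point is only the order of events.

```python
class Mailbox:
    """Toy mail store that can optionally journal every message sent."""

    def __init__(self, journal=None):
        self.messages = {}
        self.journal = journal  # list that receives a copy at send time

    def send(self, msg_id, text):
        self.messages[msg_id] = text
        if self.journal is not None:
            self.journal.append((msg_id, text))

    def delete(self, msg_id):
        self.messages.pop(msg_id, None)

    def nightly_backup(self):
        # A backup only sees what is still in the store right now.
        return dict(self.messages)


journal = []
mailbox = Mailbox(journal=journal)
mailbox.send("m1", "kept message")
mailbox.send("m2", "sent, then deleted before the backup ran")
mailbox.delete("m2")
backup = mailbox.nightly_backup()

print("m2" in backup)
print(any(msg_id == "m2" for msg_id, _ in journal))
```

The deleted message never reaches the backup, but the journal captured
it at the moment it was sent.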

Another advantage of modern archive systems is their use of the single
instance store and delta incremental concepts.  They store only one copy
of a given file or email, no matter where it came from or who it went
to.  (They, of course, record who it came from or who it was sent to.)
If that file or email is then changed and sent/stored again, they can
store only the changed bytes in the new version.  This allows for
incredible efficiency when storing many files or emails.
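
Here is a minimal Python sketch of the single instance idea, assuming
items are identified by a SHA-256 hash of their content.  The
SingleInstanceStore class is hypothetical, and the delta incremental
step (storing only changed bytes) is left out to keep it short.

```python
import hashlib


class SingleInstanceStore:
    """Keep one copy of each distinct item, plus who sent and received it."""

    def __init__(self):
        self.blobs = {}       # digest -> content bytes, stored once
        self.references = []  # (digest, sender, recipient) per occurrence

    def put(self, content: bytes, sender: str, recipient: str) -> str:
        digest = hashlib.sha256(content).hexdigest()
        self.blobs.setdefault(digest, content)  # no-op if already stored
        self.references.append((digest, sender, recipient))
        return digest

    def stored_bytes(self) -> int:
        return sum(len(blob) for blob in self.blobs.values())


store = SingleInstanceStore()
attachment = b"quarterly-report" * 1000  # hypothetical 16,000-byte attachment
for recipient in ("b@example.com", "c@example.com", "d@example.com"):
    store.put(attachment, "a@example.com", recipient)

print(len(store.references))
print(len(store.blobs))
print(store.stored_bytes())
```

Three deliveries are recorded, but the attachment itself is stored only
once.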

As to the format issues of backups as archives, many archive systems
still have those issues.  Many people still store their archives on
tape, and as time passes people will change their archive software.
Therefore, this problem could continue to exist even in archives.  See
the sidebar about the use of CAS disk as an archive target.

Another secondary feature of modern archiving systems is that they can
also serve as an HSM-like system, automatically deleting large, older
files and emails, and invisibly replacing them with stubs that
automatically retrieve the appropriate content when accessed.  This is
one of the big business justifications used to sell email archive
software.  In addition to being able to satisfy electronic discovery
requests, you can save a lot of space by archiving redundant and
unneeded emails and attachments.  Surveys show that over 90% of typical
email storage is consumed with attachments.  If you can store only one
copy of such an attachment across multiple email servers (and Exchange
Storage Groups), and replace it with a stub, then you can save a whole
lot of storage.  If you add delta-block incrementals to that, you can
save even more storage.  While the HSM-like features of most newer
archiving programs may seem more compelling and provide more direct
savings, they should be seen as a secondary reason for archiving.  The
primary reason for archiving should be that you've got a valid business
reason for doing so - and that an actual archiving system might actually
meet that business requirement.

If your company has more than one employee, it probably has a
business case for archiving.  And if you're using backups as archives,
you could be in for a rude awakening when you get an electronic
discovery request.  Perhaps you should look at an email archiving
product or an enterprise content management (ECM) product today.

Sidebar: Disk or tape for archiving?

This article mentioned that the archive industry may suffer the same
issues as the backup industry if customers use tape as their primary
storage, and occasionally switch archiving vendors.  Can we do better?

One idea might be to use a content addressable storage (CAS) device as
the primary storage device for your archives.  If the product supports a
standard filesystem interface, such as NFS or CIFS, and it supports
single instance storage and delta block technologies, it could solve a
number of problems.

First, a disk product using single instance storage and delta block
incremental technologies could actually be cheaper to operate than a
tape-based system.  This will also always be the case, since you really
can't apply delta block technologies to tape based systems.  Therefore,
the first problem we solve is disk systems being more expensive than
tape systems.

Second, if the CAS device supports a filesystem interface, then
migrating between storage systems should be relatively simple.  With a
tape based system, we have to copy all data from the old tape format to
the new tape format.  With a filesystem based system, you could simply
copy data from the older device to the newer device.

Finally, you could potentially solve the format issue as well.  If
archive products can support discovery of existing CAS systems, you
could theoretically switch archive products with no ill effects.  The
raw data would still be accessible via the filesystem interface, and the
metadata could be imported - or the new archive system could grab the
metadata from the CAS device.

Your mileage will definitely vary here, but solutions to this problem do
exist.

Sidebar: Turning backups into archives?

Another common question is what to do when switching away from backups
as archives.  What to do with all the old tapes in the old backup
format(s)?  The answer is the same as it is for changing backup formats.
The only thing you can do is restore the oldest versions of the data
being archived, archive it, delete it, then restore the next version.
It's not pretty, but it's reality.  The good news is that every backup
that you turn into an archive means storage savings.
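
The restore, archive, delete loop might look something like the
following Python sketch.  It assumes each restored full backup can be
represented as a mapping of path to file contents and that the archive
stores identical content only once; both are simplifications, and
convert_backups_to_archive() is a hypothetical name.

```python
import hashlib


def convert_backups_to_archive(backups_oldest_first):
    """Walk restored backups oldest first, archiving each distinct version.

    backups_oldest_first: list of dicts mapping path -> bytes, one dict
    per restored full backup.
    """
    archive = {}  # digest -> content, one copy per distinct version
    index = []    # (backup_number, path, digest) for later lookup
    for number, restored in enumerate(backups_oldest_first):
        for path, content in restored.items():
            digest = hashlib.sha256(content).hexdigest()
            archive.setdefault(digest, content)
            index.append((number, path, digest))
        # At this point the restored copy would be deleted before
        # restoring the next backup.
    return archive, index


backups = [
    {"/specs/widget.txt": b"version 1"},
    {"/specs/widget.txt": b"version 1"},  # unchanged since the prior week
    {"/specs/widget.txt": b"version 2"},
]
archive, index = convert_backups_to_archive(backups)
print(len(index))
print(len(archive))
```

Three backed-up copies collapse to two archived versions, which is
where the storage savings come from.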

---
W. Curtis Preston, Author of Backup & Recovery and Using SANs and NAS
VP Data Protection
GlassHouse Technologies



_______________________________________________
Veritas-bu maillist  -  Veritas-bu at mailman.eng.auburn.edu
http://mailman.eng.auburn.edu/mailman/listinfo/veritas-bu


