ADSM-L

Re: Your thoughts, please.

2015-10-04 18:08:07
Subject: Re: Your thoughts, please.
From: Michael Fink[SMTP:Michael.Fink AT UIBK.AC DOT AT]
To: ADSM-L AT VM.MARIST DOT EDU
This topic has been discussed previously on the list. Because of several
recent incidents with our installation, I was in fact about to raise
this issue again when your contribution arrived.

As we are deploying ADSM clients in our university, the loss of active
file information in case of a disk crash with a subsequent incremental
backup turns out to be a major and serious deficiency of the current
ADSM B/A client implementation. It is in fact so serious that we have
postponed relying on ADSM backup for our central host systems.
I have already received several reports by system managers who stumbled
into this very problem.

Let me recapitulate: backups serve the purpose of preventing loss of
data in case of disk failures or operational user errors (such as a
recursive delete).

Chances are high that an incremental backup will be run before such a
problem is discovered. If a disk does fail, the probability that the
failure becomes evident during a backup run is high because of the
increased disk activity. Consequently, no heuristic for dynamically
excluding file systems on failed disks, however sophisticated, can fully
eliminate the possibility that files are inactivated after a disk
failure. User mistakes are fundamentally undetectable.
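
To make concrete the kind of heuristic meant here: below is a minimal
sketch, in Python, of a check that could run before the scheduled backup
and report expected file systems that are not actually mounted. The
function name and the list of mount points are my own illustration, not
ADSM features, and how the result would be used to skip or abort the
backup is left open. It only catches a file system that is already
missing; a disk that fails during the backup run slips through, which is
exactly the limitation described above.

```python
import os


def missing_mounts(expected_mount_points):
    """Return the expected mount points that are not mounted right now.

    os.path.ismount() is False for a plain directory and for a path that
    does not exist, so a leftover empty directory where a file system
    used to be mounted is reported as missing.
    """
    return [p for p in expected_mount_points if not os.path.ismount(p)]
```

A wrapper around the backup job could call missing_mounts() with the
file systems the node is supposed to back up and refuse to start an
incremental backup if the returned list is not empty.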

The consequences are twofold:

1. Users / system managers must restore data using the -LATEST option,
   which will also recover all files that were intentionally deleted
   from the node during the preceding expiry period. In many
   environments (e.g. program development, large multiuser systems),
   this is unacceptable.

2. If for any reason a replacement disk cannot be put into operation
   immediately, files will be subject to expiry unless effort is taken
   to restore data onto a different system. I was surprised myself that
   this is indeed an issue with some of our users.

It should be possible to implement a solution to issue #1 without
modifying the ADSM server:

In addition to the RESTORE (and QUERY BACKUP) options -LATEST, -TODATE,
etc., I propose a pair of options, e.g. -ACTIVEDATE and -ACTIVETIME.
These options should have the effect that the most recent unexpired
copy of every file that was active at the specified point in time
is selected for restore. Note that, given any point in time
between the last successful and the first invalid backup, this set of
files is identical to the active set at the time of the last successful
incremental backup, so no extra copies are needed.
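
The selection rule can be sketched in a few lines of Python. This is
only an illustration of the proposal under my own assumptions: the
Version record, its field names and the function name are hypothetical,
the input is assumed to contain unexpired copies only, and nothing here
reflects how the ADSM server actually stores its data.

```python
from dataclasses import dataclass
from datetime import datetime
from typing import Dict, Iterable, List, Optional


@dataclass(frozen=True)
class Version:
    """One unexpired backup copy of a file (hypothetical record)."""
    path: str
    backed_up: datetime
    inactivated: Optional[datetime]  # None means the copy is still active


def select_active_at(versions: Iterable[Version],
                     when: datetime) -> Dict[str, Version]:
    """Pick, for every file that was active at `when`, its newest copy."""
    by_path: Dict[str, List[Version]] = {}
    for v in versions:
        by_path.setdefault(v.path, []).append(v)

    chosen: Dict[str, Version] = {}
    for path, copies in by_path.items():
        # The file was active at `when` if some copy had been backed up
        # by then and had not yet been marked inactive.
        was_active = any(
            v.backed_up <= when
            and (v.inactivated is None or v.inactivated > when)
            for v in copies
        )
        if was_active:
            chosen[path] = max(copies, key=lambda v: v.backed_up)
    return chosen
```

A file that was deleted on purpose before the chosen time has no copy
covering that time and is skipped, whereas a -LATEST style restore would
bring it back. A file inactivated only by the invalid backup is still
selected.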

Note also that this requirement is different from the "point in time"
requirement (which was also discussed lately, and which I also consider
very important).

Further note that the -ACTIVEDATE and -ACTIVETIME options, as described
above, will enable an accurate recovery of data in case of a disk
failure as well as after an accidental recursive deletion due to user
error. The only weakness that I am aware of is that in some
circumstances the date and time will have to be guessed; this could be
improved by displaying the date and time when a file was marked inactive
in QUERY BACKUP output.

Needless to say, the functionality of the GUI should be enhanced
accordingly.

Issue #2 could possibly be addressed by a command that permits an ADSM
administrator and/or a user to "freeze" a filespace (or directory tree?)
in order to stop expiration of client files.

I'd be very interested to read a statement by ADSM engineering as
to whether such an enhancement is under consideration and when it is
likely to be available. Oh well. I've exposed myself considerably in
recommending ADSM over other vendors' solutions as the backup solution
for our university, and now I've to confess I've simply overlooked this
particular weakness in my analysis. Not that I regret my decision, ADSM
is an excellent product, but I'd really like to see a solution to this
particular problem soon.

Sincerely,     Michael Fink


On Tue, 13 May 1997, Mark Dyer wrote:

> You are correct, when a drive fails, but the mountpoint stays, users can file to
> the mountpoint.  I have seen cases where the system even recreates missing
> directories under the failed mountpoint.
>
> If you unmount the disk, you may find files under the mountpoint.
>
> ADSM does not check the UNIX environment for failed drives.  ADSM simply knows
> that it is to look for files in certain locations.  Does your dsm_sched.log
> indicate that a number of files were expired on Sunday?
>
> We are struggling with this design feature of ADSM ourselves.  It is our hope
> that we can test for failed drives ahead of the ADSM scheduled backups and
> change what ADSM will look for when backing up the client.
>
> If you have a drive included in ADSM, then remove it from ADSM, ADSM does not
> expire the files from that drive.  If the drive fails, and ADSM is still
> looking for files from that drive, ADSM will expire all of the files making
> a full file system recover VERY, VERY difficult.
>
> Mark Dyer
>
>
> >
> > We had a problem this weekend.
> >
> > We lost a disk for one of our HP ADSM clients.  The way this client is
> > configured, an entire file system is on one disk.
> >
> > There are some questions as to what and when all this happened.  The
> > backups are normally done from 2 to 4 am every day.  Sunday morning,
> > around 10, users started to notice something was wrong.  The HP analyst
> > noticed the failed disk and configured the file system to an unused disk.
> > I then issued an ADSM RESTORE, which worked perfectly except it restored
> > to Saturday's 2 am backup and not Sunday's 2 am backup.  Looking at the
> > ADSM log I noticed when the ADSM Backup was run on Sunday, it backed-up
> > several directories (no files) that were found on the root directory that
> > had the same name as the file system's directories that were lost.  It did
> > not try to backup the file system, nor did it issue any errors that the
> > file system was not there.  Thus there was no later version of the backup
> > to restore (lost a full day, at least).  From the log information, I
> > concluded that we lost the disk/file system sometime before the Sunday
> > morning backup, but after the Saturday morning backup.  The HP analyst
> > says that the file system must have been lost Sunday morning, around 10
> > am, as users had been doing things with it, including saving files to that
> > disk/file system, earlier Sunday (around 8 am) and all day Saturday.
> >
> > I don't know HP-UX (or for that matter any UNIX) very well, but I can
> > conjecture that when HP-UX lost the disk/file system, it allowed a similar
> > directory structure to be created on its ROOT file system transparently
> > and the application continued to process normally until the application
> > wanted to retrieve previously existing files that were in the lost
> > disk/file system.  It is at that time users noticed something was wrong.
> >
> > When the HP analyst re-issued the mount for the new file system, it made
> > those directories in the ROOT no longer accessible.
> >
> > If any one of you think that this may be a valid scenario, please let me
> > know (perhaps with some ideas as to test/verify this).  Or if I am
> > completely off-base, please tell me, so in the future I can keep my
> > big-mouth shut.
> >
> > Also, should ADSM have told us something was wrong?
> >
> > Personally, I think ADSM is/was working as designed, in that it backed up
> > what HP-UX presented to it as valid file systems, but I have to justify my
> > stance.  They, the people who manage the HP system, think that ADSM should
> > have detected that the file system was not available and thus should have
> > reported it as an error (we had no error messages from our normal backup).
> >
> > Perhaps we can re-structure our backups to be more explicit as to what to
> > look for and do?  Thoughts?
> >
> > Thanks!
> >
> > Mark Mapes
> > PG&E
>

   Dr. Michael Fink +-----------------------------+------------------------
        EDV-Zentrum | Universitaet Innsbruck      | Tel: +43(512)507-2311
 Computing Services | Technikerstrasse 13         | FAX: +43(512)507-2944
--------------------+ A - 6020 Innsbruck, Austria | Michael.Fink AT uibk.ac DOT at