Re: Your thoughts, please.

You are correct: when a drive fails but the mountpoint directory remains, users
can still write files to the mountpoint.  I have seen cases where the system
even recreates missing directories under the failed mountpoint.

If you unmount the disk, you may find files under the mountpoint.

ADSM does not check the UNIX environment for failed drives.  ADSM simply knows
that it is to look for files in certain locations.  Does your dsm_sched.log
indicate that a number of files were expired on Sunday?

We are struggling with this design feature of ADSM ourselves.  It is our hope
that we can test for failed drives ahead of the ADSM scheduled backups and
change what ADSM will look for when backing up the client.
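
As a rough sketch of the kind of pre-backup test I have in mind (Python is
used here only for illustration; the function name and the example paths are
made up, and none of this is an ADSM feature), a script could confirm that
each expected file system is really mounted before the schedule runs:

```python
import os

def missing_mounts(paths, ismount=os.path.ismount):
    """Return the paths that are NOT currently mount points.

    `paths` is a hypothetical list of file systems the backup should cover.
    `ismount` is a parameter only so the check can be exercised without
    real mounts; by default it is the standard os.path.ismount.
    """
    return [p for p in paths if not ismount(p)]

def report(paths):
    """Print a warning for each unmounted path; return True if all are mounted."""
    missing = missing_mounts(paths)
    for path in missing:
        print("WARNING: %s is not mounted; do not back it up" % path)
    return not missing
```

For example, report(["/data", "/home"]) (illustrative paths) would return
False if either directory were only a plain directory on the root file
system.  This catches only the case where the mount is gone; a disk that has
failed but is still listed as mounted would need a different test.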

If you have a drive included in ADSM and then remove it from ADSM, ADSM does
not expire the files from that drive.  If the drive fails while ADSM is still
looking for files on that drive, ADSM will expire all of the files, making
a full file system recovery VERY, VERY difficult.

Mark Dyer


>
> We had a problem this weekend.
>
> We lost a disk for one of our HP ADSM clients.  The way this client is
> configured, an entire file system is on one disk.
>
> There are some questions as to what and when all this happened.  The
> backups are normally done from 2 to 4 am every day.  Sunday morning,
> around 10, users started to notice something was wrong.  The HP analyst
> noticed the failed disk and configured the file system to an unused disk.
>  I then issued an ADSM RESTORE, which worked perfectly except it restored
> to Saturday's 2 am backup and not Sunday's 2 am backup.  Looking at the
> ADSM log, I noticed that when the ADSM backup ran on Sunday, it backed up
> several directories (no files) that were found in the root directory and
> had the same names as the lost file system's directories.  It did
> not try to back up the file system, nor did it issue any errors that the
> file system was not there.  Thus there was no later version of the backup
> to restore (lost a full day, at least).  From the log information, I
> concluded that we lost the disk/file system sometime before the Sunday
> morning backup, but after the Saturday morning backup.  The HP analyst
> says that the file system must have been lost Sunday morning, around 10
> am, as users had been doing things with it, including saving files to that
> disk/file system, earlier Sunday (around 8 am) and all day Saturday.
>
> I don't know HP-UX (or for that matter any UNIX) very well, but I can
> conjecture that when HP-UX lost the disk/file system, it allowed a similar
> directory structure to be created on its ROOT file system transparently
> and the application continued to process normally until the application
> wanted to retrieve previously existing files that were in the lost
> disk/file system.  It is at that time users noticed something was wrong.
>
> When the HP analyst re-issued the mount for the new file system, it made
> those directories in the ROOT no longer accessible.
>
> If any of you think that this may be a valid scenario, please let me
> know (perhaps with some ideas on how to test/verify this).  Or if I am
> completely off-base, please tell me, so in the future I can keep my
> big mouth shut.
>
> Also, should ADSM have told us something was wrong?
>
> Personally, I think ADSM is/was working as designed, in that it backed up
> what HP-UX presented to it as valid file systems, but I have to justify my
> stance.  They, the people who manage the HP system, think that ADSM should
> have detected that the file system was not available and thus should have
> reported it as an error (we had no error messages from our normal backup).
>
> Perhaps we can re-structure our backups to be more explicit as to what to
> look for and do?  Thoughts?
>
> Thanks!
>
> Mark Mapes
> PG&E