Disaster recovery guide and plans for a specialized system.

Hello,

we would like to have a more reliable, faster and easier way to do a
disaster recovery with Amanda.

For this we have done two things, both of which you will find in the
appended document:

- written a (lightly tested) disaster recovery guide
- started planning a specialized Amanda disaster recovery system

We would like to create such a specialized system or to participate in
a similar project. It would be nice if you read through the document
and gave us your comments and ideas. We would like to know:

- what we can improve
- whether someone is already working on something similar
- whether anybody would like to participate in our project.

We plan to publish everything under the GPL, but we are open to
discussing a similar license.


Any feedback is welcome,

Bernd Harmsen
ds-DATASYSTEME



PS: If you want, I can send you a PDF, PS or LyX version by private
    mail, which is much nicer to read.



===============================================
Planning for an Amanda Disaster Recovery System
===============================================

Bernd Harmsen
bjh AT datasysteme DOT de
www.datasysteme.de

--------
Contents
--------

1 Introduction
    1.1 Why do we need a specialized Amanda Disaster Recovery System?
2 Goals
3 Disaster recovery with native tools and possible optimizations
    3.1 Provide working Hardware and Emergency System
    3.2 Restore a Linux-Backup-Client
    3.3 Restore Linux-Backup-Server
    3.4 Make the System bootable
4 Starting points for optimization
    4.1 Essential Backup Tool 
        4.1.1 Easy Amanda Database export / import 
    4.2 Specialized Amanda Recovery System on CD
        4.2.1 Remote Access
        4.2.2 Full automatic partitioning, formatting and mounting
        4.2.3 Amrestore Scripts 


--------------
1 Introduction
--------------

This document was written to provide information about how
to do a disaster recovery with Amanda and to plan a specialized
disaster recovery system for Amanda.

We (ds-DATASYSTEME) are a small company specialized in Linux
networks, and we provide Amanda backup systems to our customers.
We think that Amanda is a great backup tool: very fast,
reliable and with low hardware requirements.

But we also think that Amanda lacks some features for
recovery. Recovery is more complicated than backup. This
is normal, because during a recovery you have to deal with
an undefined, unknown situation. (E.g. a customer who wants
to get some files back normally knows only parts of the
filename.)

That much is acceptable. The real problem for us is the case
of a disaster recovery. When the hard disk of an important
server is broken (or the server is completely lost), costs
are high, time is short and customers are impatient. For this
we need a more secure, reliable and fast way to get the
system working again.

We would like to create a specialized Amanda Disaster Recovery
System, maybe together with other members of the Amanda
community, or to participate in an existing system. We would
like to publish this system under the GPL or a similar license.


1.1 Why do we need a specialized Amanda Disaster Recovery System?
-----------------------------------------------------------------

Because the disaster recovery process as described in section 3
(Disaster recovery with native tools) is too complicated (and so
less reliable, because of human errors) and too slow.

A disaster recovery consists of many different steps that
all need time and care. On the other hand there are customers
who want their server back. The following timesheet shows
what we consider the maximum time we have for a disaster
recovery:

0.0h A server fails 

0.5h The customer calls support. A member of the support
team does a diagnostic talk with the customer and packs
replacement hardware.

1.5h Now the support member is on the way to the customer.

2.0h The support member arrives at the customer, analyzes
the problem and repairs the system.

3.0h The hardware is working again. Now the support member
starts to recover the data from the Amanda backups. For
this we plan:

  1.5h Active work with the recovery tools.

  2.0h Data transport over the network.

6.5h The system is mostly working again.

8.0h The system is well tested. All the small problems that
came up are solved.

As you can see, it takes a whole working day to get the system
up and running again. This is very long and we should try
to save time wherever we can. But this timesheet is
also optimistic. We think that it is hard to meet its deadlines
without a specialized disaster recovery system. It assumes
that the support worker makes no major errors. With a less
trained worker it can even take 16 hours.


-------
2 Goals
-------

What are the goals of a specialized Amanda Disaster Recovery
System?

1. Make the disaster recovery easier and more reliable (less
  affected by human errors).

2. Make the disaster recovery faster.


----------------------------------------------------------------
3 Disaster recovery with native tools and possible optimizations
----------------------------------------------------------------

This section describes how a disaster recovery can be done
without a specialized system. It uses only the installation
media of a Debian GNU/Linux 3.0 system and the Amanda backup.
The concept is to install a separate minimal Debian system on
its own partition and use this to restore the original
partitions.

This section has two purposes:

1. Provide a step by step guide for a disaster recovery.

  You can use it as a guide. But the procedure is not well
  tested, because I wrote it down after my last disaster
  recovery. Feel free to send me corrections and suggestions.

2. Show how complicated and time-consuming a disaster recovery
  can be and find some good points to start optimization.

  This is the main goal. The described way is too time-consuming
  and too complicated for a stressful situation with an
  impatient customer behind you. So we would like to build or
  participate in a more optimized and automated disaster
  recovery system.


3.1 Provide working Hardware and Emergency System
-------------------------------------------------

1. Provide working hardware.

2. Plan the partition table.

  In addition to the partitions for the system you want to
  recover (destination system), you must provide a partition
  for the emergency system. Put this partition at the beginning
  of the table and give it e.g. 300MB.
  You need a backup of all your partition tables for that.

  Possible optimization: Full automatic partitioning,
    initialization and mounting (see section 4.2.2).

3. Install a Debian base system.

  Use your normal Debian installation method/media to install
  a base system on the additional partition. We will use
  this as the emergency system. Create the partitions as planned
  above, but only initialize and mount the partition for
  the emergency system.

  Install the following additional packages:

  Amanda: amanda-client, amanda-server, tar, dump

  Remote-Access: ssh, isdnutils-base, ipppd

  Possible optimization: Use a specialized Amanda
    Recovery System on a bootable CD (see section 4.2).

4. Boot the emergency system.

5. Configure the IP-Network manually using ifconfig and route.

6. If you need remote access, e.g. for assistance from your
  office, configure ipppd manually.

  Possible optimization: Provide good defaults for the
    ISDN config files (see section 4.2.1).

7. Initialize and mount the destination partitions.

  Possible optimization: Full automatic partitioning,
    initialization and mounting (see section 4.2.2).

  (a) Initialize the swap partition.
    mkswap /dev/<DEVICE>

  (b) Initialize the destination filesystem partitions.

    Initialize ext2 filesystems with the following command:
    mke2fs /dev/<DEVICE>

  (c) Mount the destination partitions.
    Assemble the destination partitions under the mountpoint
    /mnt. Use the following steps for that:

    i. Mount the destination "/" partition under /mnt.
      mount /dev/<DEVICE> /mnt

    ii. Create mountpoints for the other partitions in the
      destination "/" filesystem,
      e.g.: /var, /home, /groups, /usr
      mkdir /mnt/<MOUNTPOINT>

    iii. Mount all other destination partitions.
      mount /dev/<DEVICE> /mnt/<MOUNTPOINT>

8. Set correct date and time.
  date <MMDDhhmmYYYY>
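
The argument for step 8 is easy to mistype under stress. A minimal
sketch, assuming a second machine with a correct clock is at hand
(the function name date_arg is only an illustration): print the
current time there in the MMDDhhmmYYYY form and type the result on
the emergency system.

```shell
# Print the current local time as MMDDhhmmYYYY, the argument form
# that "date" expects in step 8. Run this on a machine whose clock
# is known to be right.
date_arg() {
    date +%m%d%H%M%Y
}

date_arg
```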


3.2 Restore a Linux-Backup-Client
---------------------------------

Use this procedure if you have a working Amanda-Backup-Server
and want to restore a Linux-Backup-Client.

Now we restore the data from our Backup-Server to the inactive
destination system. For each partition we first restore
the last level "0" backup and then the last backup of each
higher level.

1. Get root permissions.
  su

2. Go to the top-level directory of the selected destination
  partition.
  cd /mnt/<MOUNTPOINT>

3. Run amrecover.
  amrecover <CONFIG> -s <BACKUP-SERVER> -t <BACKUP-SERVER>

4. Set the source partition.
  sethost <NAME>
  setdisk <MOUNTPOINT>

5. Select all files and directories:
  add *

6. Verify the list of files marked for extraction. Note which
  tapes are needed.
  list

7. Note the number of the archive you need on each tape.
  history

  You will see lines like:
  201- 2002-03-06 0 ds-daily4 8
  The last column shows the number of the archive and the
  second to last the name of the tape. You need all listed
  tapes since the last level "0" backup.

8. Start the restore.
  extract

9. Verify that the destination directory shown is correct.

10. Load the tape and wind to the beginning of the archive.

  (a) Log in on the Amanda backup server.

  (b) Load the tape requested by amrecover. Wait until the tape
    drive is quiet again.

  (c) Wind to filemark X. Attention: X = archive-number - 1

    mt --file=/dev/<DEVICE> rewind
    mt --file=/dev/<DEVICE> fsf <X>

  (d) Wait until you get the next prompt.

11. Confirm to amrecover on the backup client that the correct
  tape is loaded.
  Load tape <NAME> now
  Continue? [Y/n]: Y

12. Wait until the restore finishes.

13. Confirm that the original permissions of the top-level
  directory are restored.
  set owner/mode for '.'? [yn] y

14. If amrecover wants another tape, proceed with step 10.

15. Leave amrecover.
  quit

16. Proceed with step 3 to restore the next partition.
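
The "archive-number - 1" rule in step 10 (c) is a typical source of
off-by-one errors. A minimal sketch, assuming the rule exactly as
stated above: a small helper (the name mt_position_cmds is only an
illustration) that prints the two mt commands for a given device and
archive number, so that they can be checked before they are run.

```shell
# Print the mt commands that position the tape for a given archive.
# $1: tape device name below /dev (e.g. nst0)
# $2: archive number as shown by "history"
mt_position_cmds() {
    device=$1
    archive=$2
    skip=$((archive - 1))      # X = archive-number - 1
    echo "mt --file=/dev/$device rewind"
    # Skipping zero filemarks is pointless, so the fsf line is
    # left out for the first archive.
    if [ "$skip" -gt 0 ]; then
        echo "mt --file=/dev/$device fsf $skip"
    fi
}

mt_position_cmds nst0 8
```

For the history line shown in step 7 (archive 8 on ds-daily4) this
prints a rewind followed by "fsf 7".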


3.3 Restore Linux-Backup-Server
-------------------------------

Use this procedure if your Amanda-Backup-Server itself has failed.

Because the Backup-Server has failed, there is no Amanda
database and you cannot use "amrecover".
So we restore each partition with the less comfortable tool
"amrestore". You must find out manually which
tapes and which archive numbers you need for recovery.

1. Find out the tapes and archive numbers.

  For each destination partition you need the last level
  "0" backup and the last backup of each higher backup
  level. You can find this information manually in the e-mails
  you have received from "amverify" in the past.

  Here is an example:

  Below is an extract from several "amverify"
  e-mails. Each e-mail shows the content of one tape. The
  last number shows the backup level, and the position of the
  "Checked ..." line (counted from the top) gives
  the number of the archive on the tape.

  In the example we want to restore the "/home" partition
  of our Backup-Server "amun".
  We start with the last level "0" backup, archive number 11
  on tape "ds-daily4".
  After that we have to restore the last level "1" backup,
  archive number 10 on tape "ds-daily7".
  There is no level "2" backup, so we need only two tapes.

  Date: Wed, 5 Mar 2003 12:51:21 +0100 
  Subject: ds-daily AMANDA VERIFY REPORT FOR ds-daily4
  [...]
  Using device /dev/nst0
  Volume ds-daily4, Date 20030305
  Checked upuaut.datasys._boot.20030305.0
  Checked inpu.datasys._boot.20030305.0
  Checked amun.datasys.__ra.datasys_E$.20030305.1
  Checked amun.datasys.__aset.datasys_E$.20030305.1
  Checked amun.datasys.__ra.datasys_D$.20030305.1
  Checked inpu.datasys._var_lib.20030305.0
  Checked amun.datasys._usr.20030305.0
  Checked amun.datasys.__djhuti.datasys_E$.20030305.0
  Checked amun.datasys.__djhuti.datasys_F$.20030305.1
  Checked inpu.datasys._var.20030305.3
  Checked amun.datasys._home.20030305.0
  [...]

  Date: Thu, 6 Mar 2003 12:59:49 +0100
  Subject: ds-daily AMANDA VERIFY REPORT FOR ds-daily6
  [...]
  Using device /dev/nst0 
  Volume ds-daily6, Date 20030306
  Checked amun.datasys._usr.20030306.1 
  Checked inpu.datasys._boot.20030306.1 
  Checked upuaut.datasys._.20030306.1 
  Checked upuaut.datasys._boot.20030306.1 
  Checked amun.datasys._.20030306.1 
  Checked inpu.datasys._var_lib.20030306.1 
  Checked upuaut.datasys._var.20030306.1 
  Checked amun.datasys.__aset.datasys_E$.20030306.1 
  Checked inpu.datasys._.20030306.1 
  Checked amun.datasys.__djhuti.datasys_F$.20030306.1 
  Checked amun.datasys.__ra.datasys_E$.20030306.1 
  Checked amun.datasys._var.20030306.1 
  Checked amun.datasys.__ra.datasys_C$.20030306.1 
  Checked amun.datasys.__aset.datasys_C$.20030306.1 
  Checked amun.datasys.__djhuti.datasys_C$.20030306.1 
  Checked amun.datasys.__aset.datasys_D$.20030306.1 
  Checked amun.datasys.__djhuti.datasys_E$.20030306.1 
  Checked inpu.datasys._var.20030306.0 
  Checked amun.datasys.__ra.datasys_D$.20030306.0 
  Checked amun.datasys.__djhuti.datasys_D$.20030306.0 
  Checked amun.datasys._home.20030306.1
  [...]

  Date: Fri, 7 Mar 2003 13:41:35 +0100
  Subject: ds-daily AMANDA VERIFY REPORT FOR ds-daily7
  [...]
  Using device /dev/nst0 
  Volume ds-daily7, Date 20030307
  Checked inpu.datasys._boot.20030307.1
  Checked amun.datasys._usr.20030307.1
  Checked upuaut.datasys._.20030307.1
  Checked upuaut.datasys._boot.20030307.1
  Checked amun.datasys._.20030307.1
  Checked inpu.datasys._var_lib.20030307.1
  Checked upuaut.datasys._var.20030307.2
  Checked amun.datasys.__ra.datasys_D$.20030307.1
  Checked inpu.datasys._.20030307.1
  Checked amun.datasys._home.20030307.1
  [...]

  Possible optimization: Provide an easy export/import
    mechanism for the Amanda database to use "amrecover"
    here (see section 4.1.1).

2. TAR or DUMP?

  For each partition you must find out whether the backup was
  made using "tar" or "dump".
  You find this information in your Amanda disklist file
  (e.g.: /etc/amanda/<CONFIG>/disklist), if you have a separate
  backup of it.

  Possible optimization: Provide an "Essential
    Backup" tool that stores such information
    in a separate backup (see section 4.1).

3. If you do not have root permission in the emergency system,
  get it now.
  su

4. Restore the destination partitions.

  (a) Change to the top-level directory of the destination
    partition.
    cd /mnt/<MOUNTPOINT>

  (b) Insert the correct tape.

  (c) Wind to filemark X. Attention: X = archive-number - 1

    mt --file=/dev/<DEVICE> rewind
    mt --file=/dev/<DEVICE> fsf <X>

  (d) Run "amrestore".

    For DUMP backups:
    amrestore -p /dev/<DEVICE> "<HOSTNAME>" "<MPOINT>$" |
      restore -rv -b2 -f-

    For TAR backups:
    amrestore -p /dev/<DEVICE> "<HOSTNAME>" "<MPOINT>$" |
      tar -xvpmi -f- --ignore-failed-read --same-owner

    Possible optimization: Provide simple scripts
      that run these nasty commands (see section 4.2.3).

  (e) If there are more backup levels for this partition, proceed
    with step (b).

  (f) If there are more partitions, proceed with step (a).
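
Counting the "Checked" lines by hand, as in step 1, is easy to get
wrong. A minimal sketch, assuming the complete amverify e-mail was
saved to a file (an excerpt with elided lines would give a wrong
count) and that the entries have the form host.disk.date.level as in
the example above. The function name archive_number is only an
illustration.

```shell
# Print the archive number of one dump inside a saved amverify report.
# $1: report file      $2: host (e.g. amun.datasys)
# $3: disk as written in the report (e.g. _home)
# $4: date (YYYYMMDD)  $5: backup level
archive_number() {
    awk -v want="$2.$3.$4.$5" '
        # Every "Checked" line is one archive; print the position
        # of the line that names the wanted dump.
        $1 == "Checked" { n++; if ($2 == want) print n }
    ' "$1"
}

# Small demonstration with the first lines of the ds-daily7 report.
report=$(mktemp)
cat > "$report" <<'EOF'
Using device /dev/nst0
Volume ds-daily7, Date 20030307
Checked inpu.datasys._boot.20030307.1
Checked amun.datasys._usr.20030307.1
Checked upuaut.datasys._.20030307.1
EOF
archive_number "$report" amun.datasys _usr 20030307 1
rm -f "$report"
```

The demonstration prints 2, because the wanted entry is the second
"Checked" line of the sample.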


3.4 Make the System bootable
----------------------------

1. Change "/" to the destination system.
  With this command the destination system becomes the active
  system. You can mostly use it as if you had booted it.

  chroot /mnt

2. Make sure that /proc is an empty directory.
  /proc is a virtual file system provided by the kernel.
  During the restore process it may have been restored with
  its contents, but it should only be a mountpoint.
  rm -f /proc/*

3. Check /etc/fstab.
  Does the fstab match the new partition table?

4. Check /etc/lilo.conf.
  Do the parameters "root" and "boot" match the new partition
  table?

  root = Device that contains the "/" partition
  (e.g. /dev/sda2).

  boot = Device that should contain the boot sector
  (e.g. /dev/sda).

5. Write a new boot sector.
  liloconfig

6. Exit "chroot"
  exit

7. Boot restored destination system.
  shutdown -r now

8. That's all.


----------------------------------
4 Starting points for optimization
----------------------------------

This part shows the possible targets for optimization, extracted
from section 3 (Disaster recovery with native tools). At
the moment this is more a brainstorming than a detailed
concept. We would like to read your ideas about that.


4.1 Essential Backup Tool
-------------------------

This little script should collect all the essential information
that is needed in case of a disaster recovery and store it
in one or more safe places apart from the normal backups.
It can be installed on all Linux hosts and started by (ana)cron
e.g. once a week.

The information we consider essential is:

* Configuration (/etc/*, incl. full Amanda config)

* Partition table

* Installed packages (dpkg --get-selections)

* Amanda database (only on the Backup-Server, amadmin <CONFIG>
  export)

We plan to provide ways to save this information:

* on a local floppy disk.

* by GPG encrypted e-mail.

* by sftp or ftp.
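
A minimal sketch of such a collector, to make the idea concrete. It
is not a finished tool. Assumptions that are not part of the plan
above: the partition table is saved with "sfdisk -d", /etc is packed
with tar, the output directory /var/backups/essential is a
placeholder, and the script only prints its commands unless DRY_RUN
is set to 0. The amadmin call would have to run as the Amanda user.

```shell
# Sketch of an "essential backup" collector. With DRY_RUN=1 (the
# default here) it only prints what it would run.
DRY_RUN=${DRY_RUN:-1}

run() {
    if [ "$DRY_RUN" = 1 ]; then
        echo "+ $*"
    else
        sh -c "$*"
    fi
}

# $1: output directory  $2: disk device (e.g. /dev/sda)
# $3: Amanda configuration name; empty on hosts that are not the
#     Backup-Server
essential_backup() {
    outdir=$1
    disk=$2
    config=$3
    run "mkdir -p $outdir"
    run "tar -czf $outdir/etc.tar.gz /etc"
    run "sfdisk -d $disk > $outdir/partition-table.sfdisk"
    run "dpkg --get-selections > $outdir/packages.list"
    if [ -n "$config" ]; then
        run "amadmin $config export > $outdir/amanda-db.export"
    fi
}

essential_backup /var/backups/essential /dev/sda ds-daily
```

Transport to a floppy disk, by encrypted e-mail or by sftp is left
out of the sketch; it would start from the files in the output
directory.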


4.1.1 Easy Amanda Database export / import

Provide a way to use "amrecover"
even if the Backup-Server has failed. For this we need an
easy import of the Amanda database from the last essential
backup.

If there are problems with that, we can provide a script
that extracts the information about tapes and archive numbers
from an Amanda database and optionally calls amrestore (see
section 4.2.3).


4.2 Specialized Amanda Recovery System on CD
--------------------------------------------

Provide a bootable emergency system on CD that contains:

* a base system

* all necessary tools

* some scripts to make disaster recovery easier

* an "import" function for the essential
  backups

* maybe a kind of GUI where you only select
  the name of the host you want to restore and everything
  else runs automatically. But we think this is a lot of work
  and should be postponed to a later step.


4.2.1 Remote Access

Provide good dial-in defaults for the ISDN config files device.ippp0
and ipppd.ippp0. The support worker should only have to load the
correct kernel module and change the MSN. With this feature
a less trained worker can start the disaster recovery system
and someone in the main office can take over or assist.


4.2.2 Full automatic partitioning, formatting and mounting

For this we can write a script that reads all necessary information
from the essential backup of the selected host (see section 4.1)
and automatically:

* partitions the hard disk(s).

* initializes the partitions with the correct filesystem or
  swap.

* mounts the partitions for disaster recovery.
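
A minimal sketch of the initialize and mount part, assuming the
saved /etc/fstab of the destination system is available from the
essential backup. It only prints the commands of section 3.1, step 7,
so they can be reviewed first. It handles ext2 and swap entries only,
takes the device names as they are written in the fstab, and leaves
the partitioning itself out. The function name plan_from_fstab is
only an illustration.

```shell
# Turn a saved /etc/fstab into the mkswap, mke2fs, mkdir and mount
# commands of section 3.1, step 7. Nothing is formatted or mounted;
# the commands are only printed.
# $1: saved fstab file  $2: target root (default /mnt)
plan_from_fstab() {
    fstab=$1
    root=${2:-/mnt}
    # Keep real partitions only; print the mountpoint first so that
    # sorting puts parent directories before their children.
    awk '$1 ~ /^\/dev\// && ($3 == "ext2" || $3 == "swap") {
             print $2, $1, $3
         }' "$fstab" |
    LC_ALL=C sort |
    while read -r mpoint device fstype; do
        if [ "$fstype" = swap ]; then
            echo "mkswap $device"
            continue
        fi
        echo "mke2fs $device"
        if [ "$mpoint" = / ]; then
            target=$root
        else
            target=$root$mpoint
            echo "mkdir -p $target"
        fi
        echo "mount $device $target"
    done
}

# Usage: plan_from_fstab /path/to/saved/fstab
```

Because the mountpoints are sorted, "/" is mounted before /var, and
/var before anything below it. Swap entries sort last.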


4.2.3 Amrestore Scripts

Provide little scripts (e.g. "amrestoredump"
and "amrestoretar") that run the following
nasty "amrestore" commands on the
backup server, in cases where we cannot use "amrecover".
But maybe this can/should be more automatic.

* For DUMP backups:
  amrestore -p /dev/<DEVICE> "<HOSTNAME>" "<MPOINT>$" |
    restore -rv -b2 -f-

* For TAR backups:
  amrestore -p /dev/<DEVICE> "<HOSTNAME>" "<MPOINT>$" |
    tar -xvpmi -f- --ignore-failed-read --same-owner

E.g.: amrestoretar <DEVICE> <HOSTNAME> <MPOINT>
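
A minimal sketch of how both scripts could share one helper. The
helper (the name amrestore_cmd is only an illustration) does not run
anything; it prints the pipeline for review. A real script would run
it from the top-level directory of the destination partition. The
host name amun.datasys in the example is an assumption read from the
amverify report in section 3.3.

```shell
# Build the restore pipeline from section 3.3 for one partition.
# $1: tar or dump  $2: tape device name below /dev
# $3: host name    $4: mountpoint of the partition
amrestore_cmd() {
    kind=$1
    device=$2
    host=$3
    mpoint=$4
    case $kind in
        dump) unpack='restore -rv -b2 -f-' ;;
        tar)  unpack='tar -xvpmi -f- --ignore-failed-read --same-owner' ;;
        *)    echo "usage: amrestore_cmd tar|dump DEVICE HOSTNAME MPOINT" >&2
              return 1 ;;
    esac
    # The trailing $ ends the disk pattern, as in the commands above.
    printf '%s\n' "amrestore -p /dev/$device \"$host\" \"$mpoint\$\" | $unpack"
}

amrestore_cmd tar nst0 amun.datasys /home
```

"amrestoretar" and "amrestoredump" would then be thin wrappers that
pass "tar" or "dump" as the first argument.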