Hello,
we would like to have a more reliable, faster and easier way to do disaster
recovery with Amanda.
To this end we have done two things, both of which you will find in the
appended document:
- written a (not yet well tested) disaster recovery guide
- started planning a specialized Amanda disaster recovery system
We would like to create such a specialized system or to participate in a
similar project. It would be nice if you could read through the document and
send us your comments and ideas. We would like to know:
- what we can improve
- whether someone is already working on something similar
- whether anybody would like to participate in our project.
We plan to publish everything under the GPL, but we are open to discussing
a similar license.
Any feedback is welcome,
Bernd Harmsen
ds-DATASYSTEME
PS: If you want, I can send you a PDF, PS or LyX version by private mail,
which is much nicer to read.
===============================================
Planning for an Amanda Disaster Recovery System
===============================================
Bernd Harmsen
bjh AT datasysteme DOT de
www.datasysteme.de
--------
Contents
--------
1 Introduction
1.1 Why do we need a specialized Amanda Disaster Recovery System?
2 Goals
3 Disaster recovery with native tools and possible optimizations.
3.1 Provide working Hardware and Emergency System
3.2 Restore a Linux-Backup-Client
3.3 Restore Linux-Backup-Server
3.4 Make the System bootable
4 Starting points for optimization
4.1 Essential Backup Tool
4.1.1 Easy Amanda Database export / import
4.2 Specialized Amanda Recovery System on CD
4.2.1 Remote Access
4.2.2 Fully automatic partitioning, formatting and mounting
4.2.3 Amrestore Scripts
--------------
1 Introduction
--------------
This document was written to provide information about how
to do a disaster recovery with Amanda and to plan a specialized
disaster recovery system for Amanda.
We (ds-DATASYSTEME) are a small company specialized in Linux
networks, and we provide Amanda backup systems to our customers.
We think that Amanda is a great backup tool: very fast,
reliable and with low hardware requirements.
But we also think that Amanda is lacking some features for
recovery. Recovery is more complicated than backup. This
is normal, because during a recovery you have to deal with
an undefined, unknown situation. (E.g. a customer who wants
to get some files back normally only knows parts of the
filename.)
But this is OK. The real problem for us is the case of a
disaster recovery. When the hard disk of an important
server is broken (or the server is completely lost), there
are high costs, little time and impatient customers. For this
we need a more secure, reliable and fast way to get the
system working again.
We would like to create a specialized Amanda Disaster Recovery
System, maybe together with other members of the Amanda
community, or to participate in an existing system. We would like
to publish this system under the GPL or a similar license.
1.1 Why do we need a specialized Amanda Disaster Recovery System?
-----------------------------------------------------------------
Because the disaster recovery process as described in Chapter
[Disater Recovery naitive] is too complicated (and therefore less
reliable, because of human errors) and too slow.
A disaster recovery consists of many different steps that
all need time and care. On the other hand, there are customers
who want their server back. The following timeline shows
what we consider the maximum time available for a disaster
recovery:
0.0h A server fails.
0.5h The customer calls support. A member of the support
team talks through the problem with the customer and packs some
replacement hardware.
1.5h The support member is on the way to the customer.
2.0h The support member arrives at the customer, analyzes
the problem and repairs the system.
3.0h The hardware is working again. Now the support member
starts to recover the data from the Amanda backups. For
this we plan:
1.5h Active work with the recovery tools.
2.0h Data transport over the network.
6.5h The system is mostly working again.
8.0h The system is well tested. All the small problems that
came up are solved.
As you can see, it takes a whole working day to get the system
up and running again. This is very long, and we should try
to save time wherever we can. But this timeline is
also optimistic. We think that it is hard to meet its deadlines
without a specialized disaster recovery system. It assumes
that the support worker makes no major errors. With a less
trained worker it can even take 16 hours.
-------
2 Goals
-------
What are the goals of a specialized Amanda Disaster Recovery
System?
1. Make disaster recovery easier and more reliable (less
affected by human errors).
2. Make disaster recovery faster.
-----------------------------------------------------------------
3 Disaster recovery with native tools and possible optimizations. <Disater Recovery naitive>
-----------------------------------------------------------------
This section describes how a disaster recovery can be done
without a specialized system. It uses only the installation
media of a Debian GNU/Linux 3.0 system and the Amanda backup.
The concept is to install a separate minimal Debian system on
its own partition and use this to restore the original partitions.
This section has two purposes:
1. Provide a step-by-step guide for a disaster recovery.
You can use it as a guide. But the procedure is not well
tested, because I wrote it after my last disaster recovery.
Feel free to send me corrections and suggestions.
2. Show how complicated and time-consuming a disaster recovery
can be and find some good points to start optimization.
This is the main goal. The described way is too time-consuming
and too complicated for a stressful situation with an
impatient customer behind you. So we would like to build or
participate in a more optimized and automated disaster
recovery system.
3.1 Provide working Hardware and Emergency System
-------------------------------------------------
1. Provide working hardware.
2. Plan partition table.
In addition to the partitions for the system you want to
recover (destination system), you must provide a partition
for the emergency system. Put this partition at the beginning
of the table and give it e.g. 300MB.
You need a backup of all your partition tables for that.
Possible optimization: Fully automatic partitioning,
initialization and mounting (see [Full-automatic-partitioning]).
3. Install a Debian base system.
Use your normal Debian installation method/media to install
a base system on the additional partition. We will use
this as the emergency system. Create the partitions as planned
above, but only initialize and mount the partition for
the emergency system.
Install the following additional packages:
Amanda: amanda-client, amanda-server, tar, dump
Remote access: ssh, isdnutils-base, ipppd
Possible optimization: Use a specialized Amanda
Recovery System on a bootable CD (see [Amanda Recovery System on CD]).
4. Boot the emergency system.
5. Configure the IP-Network manually using ifconfig and route.
6. If you need remote access, e.g. for assistance from your
office, configure ipppd manually.
Possible optimization: Provide good defaults for the
ISDN config files (see [Remote Access]).
7. Initialize and mount the destination partitions.
Possible optimization: Fully automatic partitioning,
initialization and mounting (see [Full-automatic-partitioning]).
(a) Initialize the Swap-Partition
mkswap /dev/<DEVICE>
(b) Initialize destination filesystem partitions
Initialize ext2-filesystems with the following command:
mke2fs /dev/<DEVICE>
(c) Mount destination partition.
Assemble the destination partitions under the mount point
/mnt with the following steps:
i. Mount the destination "/" partition under /mnt.
mount /dev/<DEVICE> /mnt
ii. Create mountpoints for other partitions in the
destination-"/"-filesystem.
e.g.: /var, /home, /groups, /usr
mkdir /mnt/<MOUNTPOINT>
iii. Mount all other destination partitions.
mount /dev/<DEVICE> /mnt/<MOUNTPOINT>
8. Set correct date and time.
date <MMDDhhmmYYYY>
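As a rough illustration of step 7(c), the following shell sketch prints the
mkdir and mount commands in an order that mounts parent directories before
their children. The function name mount_plan and its input format (one
"DEVICE MOUNTPOINT" pair per line) are assumptions made for this example;
they are not part of Amanda or Debian. The function only prints the commands,
so they can be reviewed before anything is executed.

```shell
# mount_plan: read "DEVICE MOUNTPOINT" pairs on stdin and print the
# commands that assemble the destination system under /mnt.
# (Hypothetical helper, not an existing tool.)
mount_plan() {
    # A parent path is always shorter than any path below it, so sorting
    # by the length of the mount point puts "/" first and parents before
    # their children.
    awk '{ print length($2), $1, $2 }' | sort -n |
    while read -r len dev mp; do
        if [ "$mp" = "/" ]; then
            echo "mount $dev /mnt"
        else
            echo "mkdir -p /mnt$mp"
            echo "mount $dev /mnt$mp"
        fi
    done
}
```

Example call (the device names are made up):
printf '%s\n' '/dev/sda6 /home' '/dev/sda2 /' '/dev/sda5 /var' | mount_plan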
3.2 Restore a Linux-Backup-Client
---------------------------------
Use this step if you have a working Amanda backup server
and want to restore a Linux backup client.
Now we restore the data from our backup server to the inactive
destination system. For each partition we first restore
the last level "0" backup and then the last backup of each
higher level.
1. Get root permissions.
su
2. Go to the top-level directory of the selected destination
partition.
cd /mnt/<MOUNTPOINT>
3. Run Amrecover <Disaster-Linux-Client-Amrecover-starten>
amrecover <CONFIG> -s <BACKUP-SERVER> -t <BACKUP-SERVER>
4. Set source partition.
sethost <NAME>
setdisk <MOUNTPOINT>
5. Select all files and directories:
add *
6. Verify the list of files marked for extraction. Note which
tapes are needed.
list
7. Note the number of the archive you need on each tape.
history
You will see lines like:
201- 2002-03-06 0 ds-daily4 8
The last column shows the number of the archive and the
second to last the name of the tape. You need all listed
tapes since the last level "0" backup.
8. Start the restore.
extract
9. Verify that the displayed destination directory is correct.
10. Load tape and wind to the beginning of the archive.
<disaster-Amrecover-Linux-Band-laden>
(a) Log in to the Amanda backup server.
(b) Load the tape requested by amrecover. Wait until the tape
drive is quiet again.
(c) Wind forward to filemark X. Attention: X = archive-number - 1
mt --file=/dev/<DEVICE> rewind
mt --file=/dev/<DEVICE> fsf <X>
(d) Wait until you get the next prompt.
11. Confirm to Amrecover on the backup client that the correct
tape is loaded.
Load tape <NAME> now
Continue? [Y/n]: Y
12. Wait until restoration finishes.
13. Confirm restoration of the original permissions of the top-level
directory.
set owner/mode for '.'? [yn] y
14. If amrecover wants another tape, proceed with step
[disaster-Amrecover-Linux-Band-laden].
15. Leave Amrecover.
quit
16. Proceed with step [Disaster-Linux-Client-Amrecover-starten]
to restore the next partition.
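The arithmetic in steps 7 and 10 (X = archive-number - 1) is easy to get
wrong under stress. A minimal sketch, assuming the "history" output has been
saved to a file or is pasted on stdin and that the relevant lines start with
"201-" as in the sample line above: it prints, for each line, the tape name
and the matching "fsf" count. The function name history_to_seek is
hypothetical.

```shell
# history_to_seek: read amrecover "history" lines on stdin and print
# "<TAPE> fsf <X>" for each, where X = archive-number - 1.
# Assumes the last column is the archive number and the second to last
# column is the tape name, as in the sample line in step 7.
history_to_seek() {
    awk '$1 == "201-" { printf "%s fsf %d\n", $(NF - 1), $NF - 1 }'
}
```

Example: echo '201- 2002-03-06 0 ds-daily4 8' | history_to_seek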
3.3 Restore Linux-Backup-Server
-------------------------------
Use this step if your Amanda backup server itself has failed.
Because the backup server has failed, there is no Amanda
database and you cannot use "amrecover".
So we restore each partition with the less comfortable tool
"amrestore". You must find out manually which
tapes and which archive numbers you need for recovery.
1. Find out the tapes and archive numbers.
For each destination partition you need the last level
"0" backup and the last backup of each higher backup
level. You can find this information manually in the e-mails
you have received from "amverify" in the past.
Here is an example:
Below you find extracts from different "amverify"
e-mails. Each e-mail shows the content of one tape. The
last number on each line shows the backup level, and the position
of the "Checked ..." line (counted from the top) gives
the number of the archive on the tape.
In the example we want to restore the "/home" partition
of our backup server "amun".
We start with the last level "0"
backup in archive number 11 on tape "ds-daily4".
After that we have to restore the last level "1"
backup in archive number 10 on tape "ds-daily7".
There is no level "2" backup, so we need only two tapes.
Date: Wed, 5 Mar 2003 12:51:21 +0100
Subject: ds-daily AMANDA VERIFY REPORT FOR ds-daily4
[...]
Using device /dev/nst0
Volume ds-daily4, Date 20030305
Checked upuaut.datasys._boot.20030305.0
Checked inpu.datasys._boot.20030305.0
Checked amun.datasys.__ra.datasys_E$.20030305.1
Checked amun.datasys.__aset.datasys_E$.20030305.1
Checked amun.datasys.__ra.datasys_D$.20030305.1
Checked inpu.datasys._var_lib.20030305.0
Checked amun.datasys._usr.20030305.0
Checked amun.datasys.__djhuti.datasys_E$.20030305.0
Checked amun.datasys.__djhuti.datasys_F$.20030305.1
Checked inpu.datasys._var.20030305.3
Checked amun.datasys._home.20030305.0
[...]
Date: Thu, 6 Mar 2003 12:59:49 +0100
Subject: ds-daily AMANDA VERIFY REPORT FOR ds-daily6
[...]
Using device /dev/nst0
Volume ds-daily6, Date 20030306
Checked amun.datasys._usr.20030306.1
Checked inpu.datasys._boot.20030306.1
Checked upuaut.datasys._.20030306.1
Checked upuaut.datasys._boot.20030306.1
Checked amun.datasys._.20030306.1
Checked inpu.datasys._var_lib.20030306.1
Checked upuaut.datasys._var.20030306.1
Checked amun.datasys.__aset.datasys_E$.20030306.1
Checked inpu.datasys._.20030306.1
Checked amun.datasys.__djhuti.datasys_F$.20030306.1
Checked amun.datasys.__ra.datasys_E$.20030306.1
Checked amun.datasys._var.20030306.1
Checked amun.datasys.__ra.datasys_C$.20030306.1
Checked amun.datasys.__aset.datasys_C$.20030306.1
Checked amun.datasys.__djhuti.datasys_C$.20030306.1
Checked amun.datasys.__aset.datasys_D$.20030306.1
Checked amun.datasys.__djhuti.datasys_E$.20030306.1
Checked inpu.datasys._var.20030306.0
Checked amun.datasys.__ra.datasys_D$.20030306.0
Checked amun.datasys.__djhuti.datasys_D$.20030306.0
Checked amun.datasys._home.20030306.1
[...]
Date: Fri, 7 Mar 2003 13:41:35 +0100
Subject: ds-daily AMANDA VERIFY REPORT FOR ds-daily7
[...]
Using device /dev/nst0
Volume ds-daily7, Date 20030307
Checked inpu.datasys._boot.20030307.1
Checked amun.datasys._usr.20030307.1
Checked upuaut.datasys._.20030307.1
Checked upuaut.datasys._boot.20030307.1
Checked amun.datasys._.20030307.1
Checked inpu.datasys._var_lib.20030307.1
Checked upuaut.datasys._var.20030307.2
Checked amun.datasys.__ra.datasys_D$.20030307.1
Checked inpu.datasys._.20030307.1
Checked amun.datasys._home.20030307.1
[...]
Possible optimization: Provide an easy export/import
mechanism for the Amanda database to use "amrecover"
here (see [Easy Amanda Database export / import]).
2. TAR or DUMP?
For each partition you must find out whether the backup was
made using "tar" or "dump".
You find this information in your Amanda disklist file
(e.g.: /etc/amanda/<CONFIG>/disklist), if you have a separate
backup of it. The disklist names a dumptype for each partition;
the dumptype definition in amanda.conf selects the backup program.
Possible optimization: Provide an "Essential
Backup" tool that stores such information
in a separate backup (see [Essential Backup]).
3. If you do not have root permission in the emergency system,
get it now.
su
4. Restore destination partitions
(a) Change to the top level directory of the destination
partition. <Disaster-Linux-Backup-Server-CD>
cd /mnt/<MOUNTPOINT>
(b) Insert the correct tape. <Disaster-Linux-Backup-Server-Bandwechsel>
(c) Wind forward to filemark X. Attention: X = archive-number - 1
mt --file=/dev/<DEVICE> rewind
mt --file=/dev/<DEVICE> fsf <X>
(d) Run "amrestore".
For DUMP backups:
amrestore -p /dev/<DEVICE> "<HOSTNAME>" \
"<MPOINT>$" | restore -rv -b2 -f-
For TAR backups:
amrestore -p /dev/<DEVICE> "<HOSTNAME>" \
"<MPOINT>$" | tar -xvpmi -f- \
--ignore-failed-read --same-owner
Possible optimization: Provide simple scripts
that run these nasty commands (see [Amrestore Scripts]).
(e) If there are more backup levels for this partition, proceed
with step [Disaster-Linux-Backup-Server-Bandwechsel].
(f) If there are more partitions proceed with step
[Disaster-Linux-Backup-Server-CD].
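Counting "Checked ..." lines by hand, as described in step 1, is error-prone.
The following is a minimal sketch of how the counting could be automated. It
assumes that the complete "amverify" report for one or more tapes (without
elided lines) is fed on stdin and that the lines look like the extracts
above. The function name find_archives and its output format are
hypothetical.

```shell
# find_archives HOST DISK: read amverify reports on stdin and print
# "<TAPE> <ARCHIVE-NUMBER> <DATE> <LEVEL>" for every archive of the
# given host and disk. HOST and DISK must be spelled as in the report,
# e.g. "amun.datasys" and "_home".
find_archives() {
    awk -v want="$1.$2." '
        # A "Volume" line starts a new tape: remember its name, reset count
        /^Volume / { tape = $2; sub(/,$/, "", tape); n = 0; next }
        # Every "Checked" line is one archive; its position is its number
        /^Checked / {
            n++
            if (index($2, want) == 1) {
                k = split($2, part, ".")
                print tape, n, part[k - 1], part[k]
            }
        }'
}
```

Example: find_archives amun.datasys _home < saved-amverify-mails.txt
(the file name is made up). The report must be complete, because a missing
"Checked" line shifts all following archive numbers.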
3.4 Make the System bootable
----------------------------
1. Change "/" to the destination system.
With this command the destination system becomes the active
system. You can mostly use it as if you had booted it.
chroot /mnt
2. Make sure that /proc is an empty directory.
/proc is a virtual file system provided by the kernel.
During the restore process it may have been restored with
its contents, but it should only be a mount point.
rm -f /proc/*
3. Check /etc/fstab
Does the fstab match the new partition table?
4. Check /etc/lilo.conf
Do the parameters "root" and "boot" match the new partition
table?
root = Device that contains the "/"-partition
(e.g. /dev/sda2).
boot = Device that should contain the bootsector
(e.g. /dev/sda).
5. Write a new bootsector
liloconfig
6. Exit "chroot"
exit
7. Boot restored destination system.
shutdown -r now
8. That's all.
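To make the check in step 4 quicker, the "root" and "boot" settings can be
pulled out of lilo.conf for comparison with the new partition table. This is
only a sketch: the function name lilo_devices is hypothetical, and it
assumes simple "key = value" lines without quoting.

```shell
# lilo_devices [FILE]: print the "root" and "boot" settings of a
# lilo.conf (default: /etc/lilo.conf), one "key value" pair per line.
# Comment lines are skipped.
lilo_devices() {
    awk -F= '
        /^[ \t]*#/ { next }
        { key = $1; gsub(/[ \t]/, "", key) }
        key == "root" || key == "boot" {
            val = $2; gsub(/[ \t]/, "", val)
            print key, val
        }' "${1:-/etc/lilo.conf}"
}
```

Run it inside the chroot from step 1, so that /etc/lilo.conf is the restored
file and not the one of the emergency system.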
----------------------------------
4 Starting points for optimization
----------------------------------
This part shows the possible targets for optimization, extracted
from chapter [Disater Recovery naitive]. At
the moment this is more of a brainstorming than a detailed
concept. We would like to hear your ideas about it.
4.1 Essential Backup Tool <Essential Backup>
-------------------------
This little script should collect all the essential information
that is needed in case of a disaster recovery and store it
in one or more safe places apart from the normal backups.
It can be installed on all Linux hosts and started by (ana)cron,
e.g. once a week.
The information we consider essential is:
* Configuration (/etc/*, incl. full amanda config)
* Partition table
* Installed packages (dpkg --get-selections)
* Amanda database (only on the Backup-Server, amadmin <CONFIG>
export)
We plan to provide several ways to save this information:
* on a local floppy disk.
* by GPG encrypted e-mail.
* by sftp or ftp.
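A minimal sketch of what the collecting part of such a script could look
like. The function name essential_backup, its arguments and the output file
names are assumptions for illustration only. Transport (floppy, e-mail,
sftp) is left out, and the partition table and Amanda database steps are
skipped unless a disk or config name is given.

```shell
# essential_backup OUTDIR [ETCDIR] [DISK] [AMANDA-CONFIG]
# Collect the essential information into OUTDIR. (Hypothetical helper.)
essential_backup() {
    eb_out=$1
    eb_etc=${2:-/etc}
    eb_disk=${3:-}
    eb_config=${4:-}
    mkdir -p "$eb_out" || return 1
    # Configuration, including the Amanda config if it lives below ETCDIR
    tar -czf "$eb_out/etc.tar.gz" -C "$eb_etc" . || return 1
    # Partition table in a form sfdisk can read back (usually needs root)
    if [ -n "$eb_disk" ]; then
        sfdisk -d "$eb_disk" > "$eb_out/partition-table.sfdisk" || return 1
    fi
    # Installed packages, only where dpkg is available
    if command -v dpkg >/dev/null 2>&1; then
        dpkg --get-selections > "$eb_out/packages.txt" || return 1
    fi
    # Amanda database, only meaningful on the backup server
    if [ -n "$eb_config" ]; then
        amadmin "$eb_config" export > "$eb_out/amanda-db.export" || return 1
    fi
}
```

Example on a backup server (disk and config name are made up):
essential_backup /var/backups/essential /etc /dev/sda ds-daily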
4.1.1 Easy Amanda Database export / import <Easy Amanda Database export / import>
Provide a way to use "amrecover"
even if the backup server has failed. For this we need an
easy import of the Amanda database from the last essential
backup.
If there are problems with that, we can provide a script
that extracts the information about tapes and archive numbers
from an Amanda database and optionally calls amrestore (see
[Amrestore Scripts]).
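The selection rule such a script has to implement (last level "0" backup,
then the last backup of each higher level) can be sketched as follows. The
input format, one line "DATE LEVEL TAPE ARCHIVE-NUMBER" per backup of a
single partition, is an assumption for this example and not the format of
any Amanda tool. The function name restore_chain is hypothetical as well.

```shell
# restore_chain: read "DATE LEVEL TAPE ARCHIVE-NUMBER" lines for one
# partition on stdin (any order, dates as YYYYMMDD) and print the
# backups that must be restored, in restore order.
restore_chain() {
    sort -k1,1 | awk '
        {
            lvl = $2 + 0
            # A newer backup at level lvl replaces every older backup
            # at the same or a higher level.
            for (l = lvl; l <= 9; l++) delete chain[l]
            chain[lvl] = $0
        }
        END { for (l = 0; l <= 9; l++) if (l in chain) print chain[l] }'
}
```

Fed with the three "/home" backups of "amun" from the amverify example in
section 3.3, it selects archive 11 on ds-daily4 and archive 10 on ds-daily7.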
4.2 Specialized Amanda Recovery System on CD <Amanda Recovery System on CD>
--------------------------------------------
Provide a bootable emergency system on CD that contains:
* a base system
* all necessary tools
* some scripts to make disaster recovery easier.
* an "import" function for the essential
backups.
* maybe a kind of GUI, where you only select
the name of the host you want to restore and everything
else runs automatically. But we think this is a lot of work and
should be deferred to a later step.
4.2.1 Remote Access <Remote Access>
Provide good dial-in defaults for the ISDN config files device.ippp0
and ipppd.ippp0. The support worker should only have to load the
correct kernel module and change the MSN. With this feature
a less trained worker can start the disaster recovery system
and someone in the main office can take over or assist.
4.2.2 Fully automatic partitioning, formatting and mounting <Full-automatic-partitioning>
For this we can write a script that reads all necessary information
from the essential backup of the selected host (see [Essential Backup])
and automatically:
* partitions the hard disk(s).
* initializes the partitions with the correct filesystem or
swap.
* mounts the partitions for disaster recovery.
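As a starting point for the initialization step, the following sketch
derives the mkswap and mke2fs commands from a saved copy of the host's
/etc/fstab. It assumes the standard fstab column layout (device, mount
point, type, ...) and only knows ext2 and swap, as in section 3.1. The
function name init_plan is hypothetical, and the commands are only printed,
not executed.

```shell
# init_plan [FSTAB]: print the commands that initialize the partitions
# listed in a saved fstab (default: /etc/fstab).
init_plan() {
    awk '
        # Skip comments and incomplete lines
        /^[ \t]*#/ || NF < 3 { next }
        $3 == "swap" { print "mkswap " $1; next }
        $3 == "ext2" { print "mke2fs " $1; next }
        # Other filesystem types on real devices need manual attention
        $1 ~ /^\/dev\// { print "# unsupported filesystem " $3 " on " $1 }
    ' "${1:-/etc/fstab}"
}
```

Printing instead of executing is a deliberate choice here: formatting the
wrong device destroys data, so a human should review the list first.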
4.2.3 Amrestore Scripts <Amrestore Scripts>
Provide little scripts (e.g. "amrestoredump"
and "amrestoretar") that run the following
nasty "amrestore" commands on the
backup server, in cases where we cannot use "amrecover".
But maybe this can/should be more automatic.
* For DUMP backups:
amrestore -p /dev/<DEVICE> "<HOSTNAME>" \
"<MPOINT>$" | restore -rv -b2 -f-
* For TAR backups:
amrestore -p /dev/<DEVICE> "<HOSTNAME>" \
"<MPOINT>$" | tar -xvpmi -f- --ignore-failed-read --same-owner
E.g.: amrestoretar <DEVICE> <HOSTNAME> <MPOINT>
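A minimal sketch of how both scripts could share one helper. The function
name amrestore_cmd and its argument order are assumptions for this example.
It only assembles and prints the pipeline shown above, so that the command
can be checked before it is run in the top-level directory of the
destination partition.

```shell
# amrestore_cmd <dump|tar> <DEVICE> <HOSTNAME> <MPOINT>
# Print the amrestore pipeline for one archive. (Hypothetical helper.)
amrestore_cmd() {
    ar_kind=$1
    ar_dev=$2
    ar_host=$3
    ar_mp=$4
    case $ar_kind in
        dump) ar_unpack='restore -rv -b2 -f-' ;;
        tar)  ar_unpack='tar -xvpmi -f- --ignore-failed-read --same-owner' ;;
        *)    echo "unknown backup type: $ar_kind" >&2; return 1 ;;
    esac
    # The trailing "$" anchors the disk pattern, as in the commands above
    printf 'amrestore -p /dev/%s "%s" "%s$" | %s\n' \
        "$ar_dev" "$ar_host" "$ar_mp" "$ar_unpack"
}
```

Example: amrestore_cmd tar nst0 amun /home
Once the printed command looks right, it can be piped to "sh".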