ADSM-L

Re: Seeking thoughts on Cyrus email backup/restore

2006-05-19 05:09:53
Subject: Re: Seeking thoughts on Cyrus email backup/restore
From: Rainer Wolf <rainer.wolf AT UNI-ULM DOT DE>
To: ADSM-L AT VM.MARIST DOT EDU
Date: Fri, 19 May 2006 11:09:25 +0200
Hi Richard,
tsm-server: V5.3.3.0 on solaris v440/16GB ram, 3494lib, 3592drives
tsm-client: V5.3.3.0 on solaris v440/16GB ram

Our Cyrus mail server runs on Solaris, and the whole mail data is
mirrored over a private FC link to another building as a permanently
synced copy. It's a poor man's solution that just uses the available
system services, but it runs okay.
You may also consider cloning the mail-server machine itself, so that in
a catastrophic scenario the mail service can move completely to
another location. Our mail data is always kept in a synchronized
state; others delay the sync to get a kind of restore window
from that copy.
Because of this HA config we hope we will never have
to restore the full mail data from TSM :-)

With TSM we back up with normal incrementals (no snapshot ...)
and do the 'normal' restores of user folders occasionally deleted by users.

Recently I have done a lot of TSM restore tests
of our Cyrus mail server (currently 1 filesystem, 5 million files, 280 GB)
and had these experiences:
A complete backup of the whole filesystem (with just one session)
runs in about 18 hours, but we normally don't do that, just incrementals.
The incremental, with around 80,000 files/10 GB, takes about 4 hours per night.

Currently our best restore time for that single mail-server filesystem
is 03:49:51 for 4.4 million objects/280 GB.
That's pretty good for us, and I am trying to get it down to about 3 hours
by balancing the data across more input volumes/disk cache.
That best restore time was the result of a 'fresh' full backup that ended up
mainly on two 3592 tapes, with only very little on disk cache.
These values (average objects/hour, average data/hour) are
in the end the facts that show what has been possible at least once.
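
As a rough worked example, the averages for that best run can be derived from the figures above. This is only a sketch: the function name is made up for illustration, and "GB" is taken at face value from the numbers quoted here.

```python
def average_rates(elapsed_hms: str, objects: float, gigabytes: float):
    """Return (objects per hour, GB per hour) for a finished restore."""
    h, m, s = (int(part) for part in elapsed_hms.split(":"))
    hours = (h * 3600 + m * 60 + s) / 3600.0
    return objects / hours, gigabytes / hours

# Figures from the best restore quoted above: 03:49:51, 4.4 million objects, 280 GB.
objs_per_hour, gb_per_hour = average_rates("03:49:51", 4.4e6, 280)
print(f"{objs_per_hour:,.0f} objects/hour, {gb_per_hour:.1f} GB/hour")
```

That works out to roughly 1.15 million objects and about 73 GB per hour.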

Doing a full restore from the normal backup data
(not the 'fresh' one), with all the real holes and the
aggregate holes within and so on, takes about 70-80 % more time
than the best one actually achieved.
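
To put the 70-80 % into hours, here is a small sketch that scales the best time of 03:49:51. Assumption: the percentage applies to the elapsed time of that best run; the helper name is hypothetical.

```python
from datetime import timedelta

BEST = timedelta(hours=3, minutes=49, seconds=51)  # best restore time from above

def slowed(best: timedelta, extra_fraction: float) -> timedelta:
    """Elapsed time when a restore takes `extra_fraction` longer than `best`."""
    seconds = round(best.total_seconds() * (1 + extra_fraction))
    return timedelta(seconds=seconds)

low, high = slowed(BEST, 0.70), slowed(BEST, 0.80)
print(low, high)
```

So a restore from the normal backup data lands somewhere between six and a half and seven hours.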

In case of a full disaster, with the HA solution also not working, we would
in the end do that full restore, and while the restore is running no delivery
of mail would be possible; incoming mails would be queued.
With that pause, the Cyrus reconstruct
of all folders is not necessary, which itself may take a very long time.

I think everyone has to do a full-restore test of the mail server at
some point, using TSM snapshot or the 'normal'
TSM backup data from incrementals, whichever is in use,
just to prove what is going on.

The other thing I have now come to is to determine the best
full-restore throughput that is possible in practice,
just to verify the overall status and identify possible bottlenecks.
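
One way to use such a best-case value is to compare the interval rates of a running restore against it. A minimal sketch, assuming you note (elapsed hours, restored objects) at intervals, by hand or from the client output; the sample numbers below are invented for illustration and are not measurements from our server, and the 50 % threshold is an arbitrary choice.

```python
def interval_rates(samples):
    """Objects/hour between consecutive (elapsed_hours, restored_objects) samples."""
    rates = []
    for (t0, n0), (t1, n1) in zip(samples, samples[1:]):
        rates.append((n1 - n0) / (t1 - t0))
    return rates

def slow_intervals(samples, baseline_rate, threshold=0.5):
    """Indices of intervals running below `threshold` times the baseline rate."""
    return [i for i, r in enumerate(interval_rates(samples))
            if r < threshold * baseline_rate]

# Baseline: 4.4 million objects in 03:49:51 (the best run mentioned above).
baseline = 4.4e6 / (3 + 49 / 60 + 51 / 3600)

# Hypothetical progress samples: fast start, slow end.
samples = [(0, 0), (1, 1_000_000), (2, 1_900_000), (4, 2_500_000), (6, 2_900_000)]
print(slow_intervals(samples, baseline))
```

Intervals that show up in the result are the ones worth checking against tape mounts, disk cache and session states on the server.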

........................................
Currently I have two problems with the Cyrus backup:
1) Full restore: compared with our individual
best-possible full mail restore time,
the +80% is not bad, but it seems that TSM slows down the restore
in some way. I have always measured a fast start of the restore
in the first 2-3 hours (measuring restored files and restored data),
and the restore forecast always looks like a
total elapsed restore time of about +30% (compared to the best possible).
Towards the end the restore slows down without any obvious reason,
and the restore process on the client uses more and more CPU,
so it is no wonder that the TSM server shows
more and more 'SendWait' states for the sessions.
To me the bottleneck seems to be 'inside' the
TSM software, and I currently have an open PMR on that.

2) Partial restore:
Restoring just a few hundred files/a few MB may take far too long.
TSM is doing things that are not understandable; maybe it's a
deep architectural problem.
Here we help ourselves by disabling the no-query restore (NQR),
using for example
dsmc restore "/mail/imap/j/user/juser/?*"
... running pretty fast ...
instead of  dsmc restore /mail/imap/j/user/juser/
... may take 10 times longer ...
I hope IBM is aware of that problem,
because it's a really painful and annoying one.
........................................
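
The partial-restore workaround above can be wrapped in a small helper. This is a sketch only: the function name is hypothetical, it just builds the argument list for the two dsmc restore forms quoted above, and it adds no other options. Passing the list to subprocess.run without a shell keeps the ?* from being expanded locally, which is what the quotes do on the command line.

```python
def restore_args(directory: str, classic: bool = True) -> list:
    """Build the dsmc argument list for restoring one mail folder tree.

    classic=True appends '?*' to the file spec (the form reported as fast above);
    classic=False gives the plain directory spec (the form reported as slow).
    """
    spec = directory.rstrip("/") + "/"
    if classic:
        spec += "?*"
    return ["dsmc", "restore", spec]

args = restore_args("/mail/imap/j/user/juser")
print(args)
# To actually run it (requires a configured TSM client):
#   import subprocess; subprocess.run(args, check=True)
```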


...just some thoughts
Rainer