Subject: Re: [ADSM-L] ? how to backup Cyrus email to minimize restore times ?
From: "Allen S. Rout" <asr AT UFL DOT EDU>
To: ADSM-L AT VM.MARIST DOT EDU
Date: Fri, 11 Jan 2008 14:34:41 -0500
>> On Fri, 11 Jan 2008 07:30:42 -0500, James R Owen <Jim.Owen AT YALE DOT EDU> said:


> Yale University will convert from UWash to Cyrus IMAP email service in 
> Q2-2008.

> Searching for Cyrus ref's, I know that UFla, Cornell, Buffalo,
> MPI-FKF, and Uni-Ulm responded to BostonU's 2006-05-16 request for
> Cyrus backup/restore advice.  I hope you and others w/large Cyrus
> experience will respond again!

That's us.  :)

> We plan to have 10 Cyrus email backend servers (each w/10*200GB
> FileStores) clustered in two primary datacenters, here and there.
> The five Cyrus backends here will backup to a dedicated TSM service
> there, and vice versa.  Current testing indicates that a DR restore
> of a single 200GB FS from TSM continuous incremental backups on LTO3
> tapes will probably take longer than a week to complete!  Obviously,
> we need a better backup & restore plan.

Cyrus stores mail with one file per message: this means that database
behavior will be unusually prominent in all of your operations.  Keep
that in mind.
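
If you want a feel for just how prominent, a quick walk of one spool
partition tells you how many objects the client has to examine and
the TSM database has to track on every incremental.  A rough sketch
in Python; the partition path is only a placeholder for whatever your
layout actually uses:

import os
import sys

# Hypothetical Cyrus spool partition; substitute your real path.
SPOOL = sys.argv[1] if len(sys.argv) > 1 else "/var/spool/cyrus/partition1"

files = dirs = total_bytes = 0
for root, dnames, fnames in os.walk(SPOOL):
    dirs += len(dnames)
    files += len(fnames)
    for f in fnames:
        try:
            total_bytes += os.path.getsize(os.path.join(root, f))
        except OSError:
            pass  # message expunged mid-walk; skip it

print("%d files, %d dirs, %.1f GB" % (files, dirs, total_bytes / 1e9))
print("avg file size: %.1f KB" % (total_bytes / 1024.0 / max(files, 1)))

Every one of those files is a separate object for the client to
examine and for the server DB to track, which is why the database
looms so large.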

We also have 10 back-ends, each of which houses 10 ~60G stores.  So
while we're smaller, we're in the same order of magnitude; close
enough, I trust, to see the same architectural issues.

I am currently backing up each back-end to a separate TSM instance.
Experience has yielded the opinion that this is an excess of caution,
but the previous configuration was four back-ends, and putting two of
-those- on the same TSM server was not pleasant.

I do not collocate by filespace.  The number of volumes per node (and
thus, in my scheme, per TSM instance) is sufficiently low that it's
not an issue.

We do nightly incrementals, which finish in a few hours per back-end.
We stretch them out over much of the evening to keep the peak load on
the DB spindles down.
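
The stretching isn't anything clever.  Here's the idea, sketched as
staggered cron entries that kick off one filespace's incremental
every half hour; the mount points, start time, spacing, and paths are
all placeholders, and a TSM client schedule per filespace would do
the same job:

# Stagger per-filespace "dsmc incremental" runs across the evening so
# the scans don't all hammer the DB spindles at once.  Each entry in
# "filespaces" is assumed to be its own mount point (filespace).
filespaces = ["/cyrus/spool%02d" % i for i in range(1, 11)]
start_hour, start_min, stagger_min = 19, 0, 30

for i, fs in enumerate(filespaces):
    t = start_hour * 60 + start_min + i * stagger_min
    hh, mm = (t // 60) % 24, t % 60
    cmd = "/usr/bin/dsmc incremental %s >> /var/log/dsmc.log 2>&1" % fs
    print("%02d %02d * * * %s" % (mm, hh, cmd))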

I kicked off a full restore of one of my file stores in response to
this message.  It was 53G total, and finished in 1:36.  The restore
mounted four tape drives to begin with; here's how long each stayed
mounted:

11:48 - 12:28    40 min
11:48 - 12:49    61 min
11:48 - 12:54    66 min
11:48 - 13:24    96 min

So, that's a total of 263 drive-minutes, or about 4.4 drive-hours.
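
The arithmetic, in case you want to drop in your own mount windows
(the times are the ones above):

# Per-drive tape mount windows from the restore above (24-hour hh:mm).
windows = [("11:48", "12:28"),
           ("11:48", "12:49"),
           ("11:48", "12:54"),
           ("11:48", "13:24")]

def minutes(hhmm):
    h, m = map(int, hhmm.split(":"))
    return h * 60 + m

drive_minutes = sum(minutes(end) - minutes(start) for start, end in windows)
print("%d drive-minutes, %.1f drive-hours"
      % (drive_minutes, drive_minutes / 60.0))
# -> 263 drive-minutes, 4.4 drive-hours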

One of my compatriots, who was watching the back-ends, says that we
were blowing out the IOPS on the LUN, and also blowing out the write
cache on the disk subsystem.  That corroborates my observations of
occasional multi-second sendW and recvW on the restore sessions; in
this config my bottleneck was SAN receive transactions.  (Not receive
-bandwidth-, note: transactions.)

So, if I had your filestore sizes, I'd probably be restoring one of
them in 5-6 hours.  I anticipate there are procedural or equipment
wins somewhere in your scenario.  Where do you think your bottleneck
is?
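
That 5-6 hour guess is nothing fancier than linear scaling from the
run above; drive count, seek behavior, and the disk subsystem on the
receiving end will all move the number:

# Naive linear extrapolation from our 53G restore to a 200G filestore.
observed_gb, observed_min = 53.0, 96.0   # 11:48 -> 13:24 wall clock
target_gb = 200.0

est_min = observed_min * (target_gb / observed_gb)
print("estimated wall clock: %.0f min (~%.1f hours)"
      % (est_min, est_min / 60.0))
# -> about 362 min, ~6.0 hours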


- Allen S. Rout
- whistles "LTO's not great at seeks, doo-dah, doo-dah"
