Subject: Re: Seeking thoughts on Cyrus email backup/restore
From: Steve Roder <spr AT REXX.ACSU.BUFFALO DOT EDU>
To: ADSM-L AT VM.MARIST DOT EDU
Date: Tue, 16 May 2006 11:39:57 -0400

Hi Richard,

     We use Cyrus on Solaris, but I think the concepts carry over to Linux.
...and sorry for the length of this, but it is complicated...

We currently have 12 spools of roughly equal size across 4 machines
clustered with one standby node.  Soon, we will cluster 6 machines, and
each machine will run a logical host that owns two spools.  Failover can
then occur from any node to any other.

On TSM, for file-level restores, we run a separate client for each spool.
Here is an example postschedule email backup report from last night for
one of the 12:

Subject: email4.acsu.buffalo.edu:spool05 TSM Client Backup Report: Failures: 1

05/15/06   00:14:20 ANS1228E Sending of object '/global/05/spool/42/user/bpeppers/11869.' failed
05/15/06   00:14:20 ANS4005E Error processing '/global/05/spool/42/user/bpeppers/11869.': file not found

 Summary:

05/16/06   01:30:10 --- SCHEDULEREC STATUS BEGIN
05/16/06   01:30:10 Total number of objects inspected: 2,209,678
05/16/06   01:30:10 Total number of objects backed up:   33,158
05/16/06   01:30:10 Total number of objects updated:          0
05/16/06   01:30:10 Total number of objects rebound:          0
05/16/06   01:30:10 Total number of objects deleted:          0
05/16/06   01:30:10 Total number of objects expired:     25,841
05/16/06   01:30:10 Total number of objects failed:           0
05/16/06   01:30:10 Total number of bytes transferred:     3.24 GB
05/16/06   01:30:10 Data transfer time:                  403.88 sec
05/16/06   01:30:10 Network data transfer rate:        8,421.79 KB/sec
05/16/06   01:30:10 Aggregate data transfer rate:        398.37 KB/sec
05/16/06   01:30:10 Objects compressed by:                    0%
05/16/06   01:30:10 Elapsed processing time:           02:22:18
05/16/06   01:30:10 --- SCHEDULEREC STATUS END
05/16/06   01:30:10 --- SCHEDULEREC OBJECT END EMAIL 05/15/06   23:00:00
05/16/06   01:30:10
Executing Operating System command or script:
   /opt/tivoli/tsm/client/ba/bin/spool.POSTschedulecmd spool05
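In case it helps picture the per-spool client setup, here is a rough
sketch of what one of those stanzas can look like in dsm.sys, with one
scheduler process started per stanza.  The server address, node name, and
log paths below are made up for illustration (only the POSTschedulecmd
path is the real one), so adjust to taste:

* sketch only -- one dsm.sys stanza per spool client
* (server address, node name, and log paths are made up)
SErvername        email4_spool05
   COMMMethod        TCPip
   TCPServeraddress  tsm1.example.buffalo.edu
   NODename          EMAIL4_SPOOL05
   PASSWORDAccess    generate
   DOMain            /global/05
   POSTSchedulecmd   "/opt/tivoli/tsm/client/ba/bin/spool.POSTschedulecmd spool05"
   SCHEDLOGName      /var/adm/tsm/dsmsched.spool05.log
   ERRORLOGName      /var/adm/tsm/dsmerror.spool05.log

# then one scheduler process per stanza, e.g.:
nohup dsmc schedule -servername=email4_spool05 > /dev/null 2>&1 &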

The spool:

> df -h /global/05
Filesystem             size   used  avail capacity  Mounted on
/dev/vx/dsk/spool05dg/vol03
                       305G    63G   228G    22%    /global/05

So, you can see we have lots of room for growth!  That VxVM filesystem is
on a Hitachi 9960, mirrored via TruCopy to another 9960 at a different
site.

We also have a third mirror that is zoned to a Solaris TSM server, which
shares our 3494 and 3584 (in different locations) with our two AIX TSM
servers and takes special "DR" backups of each spool using an image
backup of the raw Veritas volume.  The basic processing cycle breaks this
mirror off on the Solaris TSM server, starts the volume, fsck's it, mounts
it, umounts it, backs it up to the local server, and then resyncs the
mirror.  We run four in parallel using four drives.  This Solaris server
is a hybrid LAN-free client, in that it sees the disk on the SAN, and the
tapes, but does not use the Storage Agent.  We decided to make it a full
partner with our other two AIX TSM servers (which do the logical backups).
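
To make that cycle concrete, here is a simplified sketch of roughly what
the per-spool script does.  I am assuming here that the third mirror is
split and resynced with Hitachi CCI (pairsplit/pairresync) and that each
spool has its own VxVM disk group; the group, volume, and mount names are
illustrative, not our real ones:

#!/bin/sh
# Simplified sketch of the per-spool "DR" image backup cycle.
# Assumes Hitachi CCI manages the third mirror and that each spool
# has its own VxVM disk group; all names here are illustrative.
SPOOL=spool05
DG=${SPOOL}dg
VOL=vol03
MNT=/drcheck/$SPOOL

pairsplit -g dr_$SPOOL                 || exit 1   # break the mirror off
vxdg -C import $DG                     || exit 1   # bring the copy up locally
vxvol -g $DG start $VOL                || exit 1
fsck -F vxfs -y /dev/vx/rdsk/$DG/$VOL  || exit 1   # make sure it is clean
mkdir -p $MNT
mount -F vxfs /dev/vx/dsk/$DG/$VOL $MNT && umount $MNT   # test mount/umount
dsmc backup image /dev/vx/rdsk/$DG/$VOL            # image backup of raw volume
vxvol -g $DG stop $VOL                              # give the copy back
vxdg deport $DG
pairresync -g dr_$SPOOL                             # resync for the next cycle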

From that Solaris server:
tsm: SAN1>q ses
Session established with server SAN1: Solaris 8/9
  Server Version 5, Release 3, Level 2.0
  Server date/time: 05/16/06   11:28:12  Last access: 05/16/06   11:01:06


  Sess  Comm.   Sess     Wait    Bytes    Bytes  Sess   Platform     Client Name
Number  Method  State    Time     Sent    Recvd  Type
------  ------  ------  ------  -------  -------  -----  -----------  --------------------
 1,477  Tcp/Ip  RecvW     0 S     2.8 K   22.5 G  Node   SUN SOLARIS  EMAIL3.DISASTER
 1,480  Tcp/Ip  Run       0 S     2.8 K   23.6 G  Node   SUN SOLARIS  EMAIL3.DISASTER
 1,486  Tcp/Ip  RecvW     0 S     2.7 K   21.5 G  Node   SUN SOLARIS  EMAIL3.DISASTER
 1,488  Tcp/Ip  RecvW     0 S     2.7 K   19.0 G  Node   SUN SOLARIS  EMAIL3.DISASTER
 1,490  ShMem   Run       0 S       126      162  Admin  SUN SOLARIS  RODER

Once that completes, the next set starts.  We also allow all our
TSM servers to see all drives, and then control who uses what by which
drives are online where.  So san1 has:


Library Name     Drive Name       Device Type     On-Line
------------     ------------     -----------     -------------------
3494             100              3590            No
3494             101              3590            No
3494             102              3590            No
3494             103              3590            No
3494             104              3590            No
3494             105              3590            No
3494             200              3590            No
3494             201              3590            No
3494             202              3590            No
3494             203              3590            No
3494             204              3590            No
3494             205              3590            No
3584             D01              LTO             No
3584             D02              LTO             No
3584             D03              LTO             No
3584             D04              LTO             No
3584             D05              LTO             No
3584             D06              LTO             No
3584             D07              LTO             Yes
3584             D08              LTO             Yes
3584             D09              LTO             Yes
3584             D10              LTO             Yes

and all these offline drives are used by the other two TSM servers.  This
allows us to reallocate drive resources on the fly.  The 3494 supports
this beautifully, but on the 3584, we share without using logical
partitions, so we have to take care with cell management.
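
Moving a drive from one server to another is then just a matter of
flipping its online state on both sides, along these lines (the admin ID
and password here are placeholders):

# on the server giving the drive up:
dsmadmc -id=admin -password=xxxxx "update drive 3584 D07 online=no"

# on the server taking it over:
dsmadmc -id=admin -password=xxxxx "update drive 3584 D07 online=yes"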

We will soon have "san2", and will split this load across two Solaris
servers, using 8 drives (future plans will add Oracle DBs to this backup
scheme).

Back to email...

For our email servers, we also have a set of machines that "hold" 24 hours
of email that was delivered to the backends, so our recovery plan is (a
sketch of steps 2-4 follows the list):

1. build new filesystem
2. present it to solaris tsm server
3. restore image backup (takes about 4.5 hours)
4. integrity check filesystem
5. mount it on the proper backend Cyrus server
6. replay last 24 hours of email.
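
For steps 2 through 4, the restore side is more or less the mirror image
of the backup sketch above; again, the disk group and volume names are
just for illustration:

# Sketch of recovery steps 2-4 on the Solaris TSM server (names illustrative)
DG=spool05dg
VOL=vol03

vxvol -g $DG start $VOL                     # the new volume built in step 1
dsmc restore image /dev/vx/rdsk/$DG/$VOL    # about 4.5 hours per spool for us
fsck -F vxfs -y /dev/vx/rdsk/$DG/$VOL       # integrity check (step 4)
# then deport the disk group, import/mount it on the Cyrus backend,
# and replay the held 24 hours of mail (steps 5 and 6)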

Call me if you want more info, or to clarify anything....

Hope this helps,
Steve Roder


> We're a site who has been using IMAP with mbox style mail "folders"
> for a decade. It's obvious that as we boost mail quotas, mbox "does
> not scale". Thus, we are planning for a conversion to Cyrus, on Linux.
>
> For planning purposes, I'd like to get input from Cyrus sites as to
> issues they've encountered in performing backups and restorals,
> particularly with coherency, given that each message becomes an
> individual file. The Many Small Files issue is apparent, but there
> are probably other things that sites have learned the hard way that
> could help others. One unobvious issue I can think of is how to best
> perform mail investigations via restorals, as called upon in
> subpoenas. (A search of the ADSM-L archives turns up next to nothing
> on Cyrus and TSM.)
>
> I'd be happy to hear any other practical recommendations for
> efficiently implementing Cyrus, based upon site experiences (disk
> arrangement, file system type, directory architecture, etc.). Whereas
> that would be off the TSM topic per se, feel free to email me
> directly (rbs AT bu DOT edu).
>
>     thanks, Richard Sims
>
>

Steve Roder
University at Buffalo
(spr AT buffalo DOT edu | (716)645-3564)