Subject: Re: [ADSM-L] Improving Replication performance
From: Zoltan Forray <zforray AT VCU DOT EDU>
To: ADSM-L AT VM.MARIST DOT EDU
Date: Thu, 26 Apr 2018 16:10:20 -0400
Currently, all source servers are non-dedup since they don't have the CPU
horsepower/threads to handle it.  The target server is deduplicating.
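
(For what it's worth, a quick way to sanity-check how much the target is
actually saving is a detailed query of the target pool from dsmadmc - the
pool name below is just a placeholder:

    query stgpool DEDUPPOOL format=detailed

The detailed output includes the duplicate-data/deduplication savings
figures for the pool; the exact field names vary by pool type and server
level.)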

Currently, the source servers have internal disk, but a lot of the daily
backups flow to tape, which I know is slow - the DBs are on internal disk.

The target server DB is on SSD and we are monitoring it, since it has grown
to over 1.8TB and we are nowhere close to replicating even half of what we
need to process.
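
(For monitoring, something like the following from the admin command line
gives both numbers:

    query db format=detailed
    query replnode *

The first reports total and used database space, so the 1.8TB growth can be
trended; the second shows per-node file counts on the source versus the
target, which gives a feel for how far behind replication is.)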

We can't afford a 1-to-1 source/replica server setup like you have.

We are looking to move to 8.x in the future - no target yet, since we are in
the middle of hardware upgrades on 2 ISP servers right now.  Upgrading
5 local servers to forced TLS will be a big morass of problems due to the
mix of TSM client levels (everything from 5.4 to 8.1.2), and having to
upgrade 2 LM servers simultaneously will be an adventure.
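
(On the forced-TLS piece: as I understand it, the 8.1.2+ servers let you
hold individual nodes and admins at SESSIONSECURITY=TRANSITIONAL until the
back-level clients are upgraded, along the lines of the following - NODENAME
is just a placeholder:

    query node NODENAME format=detailed
    update node NODENAME sessionsecurity=transitional

so the old 5.4/6.x clients can keep connecting while the newer ones
negotiate TLS.)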

On Thu, Apr 26, 2018 at 3:50 PM, Sergio O. Fuentes <sfuentes AT umd DOT edu> 
wrote:

> The totals that you're mentioning, e.g. "Yet we can only replicate around
> 3TB daily when we back up around 7TB." - are these deduplicated replication
> totals or non-deduplicated, non-compressed totals?
>
> Are you replicating deduplicated data?
>
> We've been replicating about 10TB of nightly non-deduplicated data across
> our two datacenters successfully for years now, and it's only been getting
> better.  We get about 60% deduplication, so what we're really transferring
> over the wire is about 4TB of data.  It used to take 6 to 8 hours to
> replicate that much data across our network, but we changed the DB storage
> to essentially SSD (as part of a separate project), and the code base for
> TSM replication has gotten stable enough that it only takes about two to
> three hours now.
>
> One difference I notice from what you've provided is that we're doing
> 1-to-1 replication: we have 4 servers, 2 are primaries and 2 are dedicated
> replicas, one per primary.  We're now moving those replicas to the cloud
> and seeing if we can use AWS EC2 and S3 instances as our DR servers (plus
> the AWS TSM instance will be our AWS TSM target - yes, there are still
> reasons to have backups in the cloud).  That's still early in production,
> so we haven't taxed that replication network much yet.
>
> We also use dedicated disk for both DBs and stgpools (still on file pools,
> so we're not doing any PROTECT STGPOOL commands).  We don't have another
> Ethernet network to traverse for the file-pool traffic.  That dedicated
> storage is backed by old 8Gb FC switches on a pair of Dell MD arrays.
> It's nice to have that low-latency backbone while we've still got it.
> We're on TSM 8.latest, and there might have been some performance bugs at
> the 7.1 level if I remember correctly.
>
> The one thing I did notice is that the more you can spread your I/O across
> your storage arrays (whether it's DB or STGPOOL targets), the better the
> performance.  For example, the setup for our TSM server has 16 filesystems
> for the database and 16 mountpoints for the filepool directories.  For
> your Isilon backend, do you see that as a bottleneck at all?  Is the TSM
> server pushing load to as many of those Isilon nodes as possible?  Or is
> it really the enumeration of the replication data that takes a long time
> (that's possibly a DB bottleneck)?  More questions than answers from me,
> but I hope I pointed you in the right direction.
>
> Thanks and good luck!
> Sergio
>
> On Thu, Apr 26, 2018 at 2:46 PM, Zoltan Forray <zforray AT vcu DOT edu> wrote:
>
> > As we get deeper into Replication, my boss wants to use it more and more
> > as an offsite recovery platform.
> >
> > As we try to reach the "best practice" of replicating everything, we are
> > finding it difficult, if not impossible, to achieve due to the resource
> > demands.
> >
> > In total, we eventually want to replicate around 700TB from 5 source
> > servers to 1 target server, which is dedicated to replication.
> >
> > So the big question is, can this be done?
> >
> > We recently rebuilt the offsite target server to be as big as we could
> > afford ($38K).  It has 256GB of RAM and 64 CPU threads.  Storage is
> > primarily 500TB of Isilon/NFS.  Connectivity is via quad 10G (2 for IP
> > traffic from source servers and 2 for Isilon/NFS).
> >
> > Yet we can only replicate around 3TB daily when we back up around 7TB.
> >
> > Looking for suggestions/thoughts/experiences?
> >
> > All boxes are RHEL Linux running 7.1.7.300.
> >
> >
>



--
*Zoltan Forray*
Spectrum Protect (p.k.a. TSM) Software & Hardware Administrator
Xymon Monitor Administrator
VMware Administrator
Virginia Commonwealth University
UCC/Office of Technology Services
www.ucc.vcu.edu
zforray AT vcu DOT edu - 804-828-4807
Don't be a phishing victim - VCU and other reputable organizations will
never use email to request that you reply with your password, social
security number or confidential personal information. For more details
visit http://phishing.vcu.edu/

