Subject: Re: [ADSM-L] Maximum TSM nodes per server
From: Skylar Thompson <skylar2 AT U.WASHINGTON DOT EDU>
To: ADSM-L AT VM.MARIST DOT EDU
Date: Fri, 16 Aug 2013 09:57:37 -0700
If I'm understanding correctly, the problem you're running into is with
a TSM client node, not with your TSM server.

We have a similar setup, although our proxy nodes are RHEL and use NFS
rather than CIFS. We have a pool of nine 10GbE-attached nodes that back
up a variety of storage devices, either because they're too big for a
single backup schedule (GPFS) or because we don't have a good backup
client for them (Isilon, BlueARC). In aggregate these systems
inspect a bit over 250 million objects spread over ~2.5PB.

A few issues we've run into:

* Under high load, the storage servers can bog down and cause backups to
run a day behind, but it's rarely a serious problem.

* The Linux dentry cache will get pinned and cause the system to run
out of RAM. By echoing 3 into /proc/sys/vm/drop_caches occasionally we
can work around this problem (see the sketch after this list). Our
original proxy nodes also had 12GB of RAM, but we've progressively
bumped this up to 24GB and 48GB as we buy newer systems (RAM is cheap
these days).

* The Linux NFS client is pretty poor, and there are performance problems
when stat()ing lots of files, even on separate filesystems. This appears
to us to be a context-switch issue, so we try to keep the number of
simultaneous backups below the number of CPUs each proxy node has.

* The atomic unit of parallelization in the TSM world is the filespace,
not the filesystem. By working with end users before we start doing
backups, we can find ways to divvy up each filesystem (where the minimum
size is in the hundreds of TB, one ranging past a PB) into multiple
filespaces that we mount separately in /etc/fstab. With judicious use
of -domain statements in the schedules, we can assign different
filespaces to different schedules; they all end up processing the same
filesystem, but we still get decent parallelization (rough example
below).
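
To make those last two points a bit more concrete, here's roughly what
it looks like on one of our proxy nodes. The exports, mount points,
node and schedule names below are made up for illustration, and the
mount options, cron interval and start times are just examples - adjust
for your environment.

    # /etc/crontab: drop pinned dentries/inodes every few hours
    0 */6 * * *  root  sync; echo 3 > /proc/sys/vm/drop_caches

    # /etc/fstab: subtrees of one huge filesystem mounted separately,
    # so each shows up to TSM as its own filespace
    filer1:/export/lab/groupA  /backup/lab/groupA  nfs  ro,hard,vers=3  0 0
    filer1:/export/lab/groupB  /backup/lab/groupB  nfs  ro,hard,vers=3  0 0

Then on the TSM server, one client schedule per filespace with staggered
start times, using -domain to limit each schedule to one mount point:

    define schedule STANDARD LAB_GROUPA action=incremental starttime=18:00 options="-domain=/backup/lab/groupA"
    define schedule STANDARD LAB_GROUPB action=incremental starttime=20:00 options="-domain=/backup/lab/groupB"
    define association STANDARD LAB_GROUPA PROXY1
    define association STANDARD LAB_GROUPB PROXY1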

For your particular problem, I would see if you can figure out where the
bottleneck is. Is it data throughput? Metadata latency? Locking within
Windows itself? Contention on the TSM server side (network throughput,
DB, mount limits, etc.)?
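
If the server side is a suspect, a few standard admin queries will
usually rule it in or out quickly (PROXY1 below is a placeholder node
name):

    query session format=detailed    (lots of sessions stuck in MediaW/RecvW?)
    query mount                      (bumping into mount point limits?)
    query process                    (expiration/migration fighting the backup window?)
    query actlog begintime=-01:00 search=PROXY1    (errors for that node in the last hour)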

In the UNIX world, "strace -ttf" is a useful tool: it timestamps every
system call a process makes, so you can see where the time is going.
Failing that, TSM client tracing can give the same information, albeit
with much more cruft around the timing.
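
For example, something along these lines against a running dsmc process
(pgrep is just a convenience for grabbing the PID, and -T is an extra
flag that also prints the time spent inside each call):

    # -tt: microsecond timestamps, -f: follow children, -T: per-call time,
    # -e trace=file: only file/metadata-related calls
    strace -ttf -T -e trace=file -p $(pgrep -n dsmc) 2>&1 | less

If most of the time is in stat()/lstat() you have a metadata problem; if
it's mostly in read() you're more likely throughput-bound.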

On 08/16/13 08:29, Zoltan Forray wrote:
We are starting to experience performance issues on a server that acts as
the "head" for multiple (31 currently) TSM nodes. This server CIFS mounts
multiple departmental filesystems - all on various EMC SANs.  Each
filesystem is a different TSM node.

The "head" server is running Windows 2012 server with 12GB RAM and
2-quad-core processor.

Anyone out there running something like this?  What are the realistic limits?  I
have tried spreading the backup start times as much as I can.

As expected, a lot of the time is spent scanning files - one node alone
is >10M files.

Thoughts?  Comments?  Suggestions?

--
*Zoltan Forray*
TSM Software & Hardware Administrator
Virginia Commonwealth University
UCC/Office of Technology Services
zforray AT vcu DOT edu - 804-828-4807
Don't be a phishing victim - VCU and other reputable organizations will
never use email to request that you reply with your password, social
security number or confidential personal information. For more details
visit http://infosecurity.vcu.edu/phishing.html


--
-- Skylar Thompson (skylar2 AT u.washington DOT edu)
-- Genome Sciences Department, System Administrator
-- Foege Building S046, (206)-685-7354
-- University of Washington School of Medicine