Subject: Re: [Networker] Too many tape mounts
From: "Reed, Ted G II [ITS]" <ted.reed AT MAIL.SPRINT DOT COM>
To: NETWORKER AT LISTMAIL.TEMPLE DOT EDU
Date: Thu, 31 Jul 2003 17:31:25 -0500
7.0 also does a much better job of handling tape mount requests than 6.x.  The
reason your tape mounts and your interactive sessions with NetWorker are
unresponsive is most likely that the "nsrd" process is running at maximum
available CPU.  nsrd is the probable bottleneck for all of these problems, and
I'm guessing that during these slowdowns you will find "nsrd" at the top of the
list of CPU-intensive processes, eating all available cycles.
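
If you want a quick way to spot-check that, watch top or prstat on the server
during a slowdown, or use something like this rough Python sketch.  It assumes
the third-party psutil package, so treat it purely as an illustration of the
check, not a tool you'd have on the box:

    # Rough sketch: list the top CPU consumers and flag nsrd if it
    # shows up.  Assumes the third-party psutil package; in practice
    # you would just watch top or prstat on the server itself.
    import psutil

    # Prime the per-process CPU counters, then let one second elapse
    # as the sampling window.
    for p in psutil.process_iter():
        try:
            p.cpu_percent(interval=None)
        except psutil.NoSuchProcess:
            pass
    psutil.cpu_percent(interval=1.0)

    usage = []
    for p in psutil.process_iter(['name']):
        try:
            usage.append((p.cpu_percent(interval=None), p.info['name']))
        except psutil.NoSuchProcess:
            pass

    for cpu, name in sorted(usage, reverse=True)[:5]:
        flag = "  <-- likely bottleneck" if name == "nsrd" else ""
        print(f"{cpu:5.1f}%  {name}{flag}")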

We had the same basic problem while on 6.x and made the following changes (a
quick sketch of the arithmetic follows this list):
        10 drives @ 10 target sessions each (100 total target sessions)
        Set the primary group (150+ clients) to a savegroup parallelism of 68
(under 7 drives' worth)
        Set secondary groups to approximately the same value as the number of
clients (i.e., a 20-client group @ 4 sessions per client = 80 potential target
sessions, savegroup parallelism set to 20, or about 2 drives)
        We tried to keep the total number of open target sessions at any one
time less than or equal to the 100 total sessions available.
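
If it helps, here is that arithmetic as a rough Python sketch.  The drive
count, sessions per drive, and group numbers below are our values, so treat
them as placeholders for your own:

    # Back-of-the-envelope check that the sum of savegroup parallelism
    # values stays at or under the library's total target sessions.
    # All numbers are from our setup; substitute your own.
    DRIVES = 10
    SESSIONS_PER_DRIVE = 10                       # target sessions per drive
    TOTAL_SESSIONS = DRIVES * SESSIONS_PER_DRIVE  # 100

    # (group name, savegroup parallelism)
    groups = [
        ("primary",    68),  # 150+ clients, capped under 7 drives' worth
        ("secondary1", 20),  # ~20 clients @ 4 sessions per client
        ("secondary2", 12),  # hypothetical third group for illustration
    ]

    worst_case = sum(par for _, par in groups)
    print(f"total target sessions : {TOTAL_SESSIONS}")
    print(f"worst-case concurrent : {worst_case}")
    print("OK" if worst_case <= TOTAL_SESSIONS else "OVERSUBSCRIBED")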

You say you recently migrated: what CPU speed did you run at before, and what
CPU speed do you have now?  And if you are using one single-CPU server as both
master AND data mover, then you are definitely beating up that CPU and it's no
wonder you bog down.  And yes, I can confirm you can hit a point of no return,
where it can take 30 minutes to get a GUI back for a tape mount and where
things will just start timing out and failing.  And once you've hit that
point, it's very difficult to get back up to speed without recycling the
application.

So, potential solutions are:
Get a faster CPU in the server
Decrease tape mounts through newer, larger drives
Decrease real-time load at mount times by keeping total savegroup parallelism
below the total available target sessions
        - If you have 100 target sessions and you're only using 90, that
'free' headroom will allow for easier mounts/dismounts
Stack more clients into a single group that uses a small savegroup parallelism
Make sure tapes are labeled and empty before use (a rough prelabeling script
follows this list).  Having the server do the labeling on the fly is god-awful
on the system and uses MUCH more resources than having prelabeled tapes that
just mount.
OR some combination of the above
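
On the prelabeling point, here is the kind of loop I mean, as a rough Python
sketch.  The nsrjb options (-L to label, -S to pick a slot, -b to pick a pool)
are from memory, and the pool name and slot range are made up, so check them
against your NetWorker documentation before running anything like this:

    # Sketch: pre-label blank tapes in a range of jukebox slots during
    # quiet hours so nsrd never has to label on the fly mid-backup.
    # ASSUMPTIONS: nsrjb's -L/-S/-b options as recalled from the
    # NetWorker manuals; the "Default" pool and slots 1-10 are made up.
    import subprocess

    POOL = "Default"       # assumption: destination media pool
    SLOTS = range(1, 11)   # assumption: slots holding blank media

    for slot in SLOTS:
        cmd = ["nsrjb", "-L", "-S", str(slot), "-b", POOL]
        print("running:", " ".join(cmd))
        result = subprocess.run(cmd, capture_output=True, text=True)
        if result.returncode != 0:
            print(f"slot {slot} failed: {result.stderr.strip()}")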

--Ted


-----Original Message-----
From: Bob Spurzem [mailto:bobs AT gocmt DOT com]
Sent: Thursday, July 31, 2003 4:48 PM
To: NETWORKER AT LISTMAIL.TEMPLE DOT EDU
Subject: Re: [Networker] Too many tape mounts


I have seen this same problem before; the solution was to use larger tapes
(Super DLT or LTO).  Small tapes cause too many mount requests.

Bob
CMT - The Tape People
1-800-252-9268
"we trade new tape media for old used tape media"

-----Original Message-----
From: Legato NetWorker discussion
[mailto:NETWORKER AT LISTMAIL.TEMPLE DOT EDU]On Behalf Of Stan Horwitz
Sent: Thursday, July 31, 2003 2:04 PM
To: NETWORKER AT LISTMAIL.TEMPLE DOT EDU
Subject: [Networker] Too many tape mounts


Has anyone here run into a situation with NetWorker 6.1.3 (Power Edition)
on a single CPU Enterprise 450 with Solaris 9 where NSR chokes when there
SEEM to be more pending tape mount requests than available tape drives?
Our Legato server handles daily backups for 112 different clients: a mix
of Windows, Novell, Solaris, and Tru64 Unix. We also run SnapImage to back
up a pair of Mirapoint message stores with 220GB of data each to our
server's tape library.

After going back and forth with Legato tech support regarding frequent but
intermittent slowdowns of our Legato server, I think I have hit upon the
common denominator behind these slowdowns: too many tape mount requests in a
given period of time. Legato said today that my hypothesis makes sense, so
they are investigating it from that angle.

Meanwhile, I am wondering if anyone else has encountered this issue. Our
Legato server will get to a point where tape ejects and mounts can take
several hours to be processed. This causes our backup schedule to fall way
behind. When this happens, NSR becomes very unresponsive. For example,
opening nsrwatch can take 10 or more minutes, as does quitting out of it
by pressing the "q" key. Sometimes we'll get a slowdown condition that
lasts as little as half an hour, or maybe one or two hours; other times it
will last all night. When this happens, the /nsr/logs/daemon.log file
shows instances of nsrmmd restart failures.

Not realizing this correlation, I have been juggling our backup schedule
almost every day, but I believe all I was actually doing was rescheduling
the slowdown. I have now taken to dropping the savegroup parallelism
setting on the five or six savegroups that have the greatest number of
clients. I set them equal to the number of target sessions on our tape
drives, which is 12. Fortunately, I have fairly wide latitude in how our
backups are scheduled. This afternoon, I also took one of our larger (in
terms of number of clients) savegroups, broke it up into three separate
savegroups, and scheduled them to run a few hours apart, with their weekly
full backups scheduled on different days of the week. This is all at
Legato's recommendation, so maybe these steps will help.
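
For what it's worth, here is the back-of-the-envelope arithmetic I'm using
for the staggering, as a rough Python sketch; the start hours and run lengths
below are guesses rather than measurements:

    # Rough estimate of peak concurrent sessions when savegroups are
    # staggered a few hours apart.  Start hours and run lengths are
    # illustrative guesses, not measured values.
    TARGET_SESSIONS = 12   # total target sessions on our tape drives

    # (group, start hour, estimated run hours, savegroup parallelism)
    groups = [
        ("group_a", 18, 3, 12),
        ("group_b", 21, 3, 12),
        ("group_c",  0, 3, 12),
    ]

    peak = 0
    for hour in range(24):
        load = sum(par for _, start, dur, par in groups
                   if (hour - start) % 24 < dur)
        peak = max(peak, load)

    print(f"estimated peak concurrent sessions: {peak}")
    print("within budget" if peak <= TARGET_SESSIONS else "oversubscribed")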

At any rate, when our NSR server slows down, operating system commands
such as "ls" and the login process still execute quickly. System load
rarely gets above 7, and even when it sinks way down, the problem does
not always go away.

We recently migrated our Qualstar tape library and backup server from a
system running Tru64 Unix and NSR 6.1.1 to this new Sun E450 server.
Since the migration, NSR has been nothing but problems. We will likely
retain outside assistance to help us deal with this issue and verify that
our Legato server is set up with best practices in mind.  A colleague
and I have been poring over documentation from Legato, Sun, and Qualstar
in an attempt to better understand this problem.

Meanwhile, I am wondering if anyone else has encountered this type of
problem with 6.1.3 and whether you're seeing any nsrmmd restart cancellations
and/or RPC timeout errors in your server's /nsr/logs/daemon.log file.


Thanks,


Stan


--
Note: To sign off this list, send a "signoff networker" command via email
to listserv AT listmail.temple DOT edu or visit the list's Web site at
http://listmail.temple.edu/archives/networker.html where you can
also view and post messages to the list.
=*=*=*=*=*=*=*=*=*=*=*=*=*=*=*=*=*=*=*=*=*=*=*=*=*=*=*=*=*=*=*=
