Re: [Networker] Too many tape mounts

Stan,

This sounds like a very familiar scenario. I was recently working on a system
that showed the same symptoms as this. We were aware that the system had
grown a little too large, it had probably 50ish devices spread around
quite a mix of storage nodes. The problem with tape mounts queuing and
never completing plagued us. Sometimes an nsrjb would run for hours,
seemingly in a hung state waiting for the server to communicate.
Eventually it would reach a point of no return. The responses to the GUI
would sometimes take several minutes to even partly refresh. This was
NetWorker 6.1.1. Another problem was a "too many open files" error seen
in daemon.log, and it became clear that once this had appeared we were sunk.

After some very long days and a shortage of sleep for several team
members we made some progress. We managed to get one of Legato's best
critical-situation engineers involved (quite an achievement since our
support is with a third party, not Legato) and he had some clues about
the problem.

The "too many open files" error occurs when the number of open files
passes a limit (1024?) that is hard-wired into the Solaris build of
NetWorker. When there are too many open files, NetWorker starts to close
down connections, and one of the first connections to be closed is
usually to nsrindexd (or was it nsrmmdbd?), a pretty crucial connection.
Without this, NetWorker is crippled, and a restart is the only solution.
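
For anyone who wants to see how close a process is to its descriptor
limit, here is a minimal Python sketch. It only inspects the process that
runs it and says nothing about the limit NetWorker itself was built with.
The helper names (fd_limits, open_fd_count) and the max_probe cap are my
own inventions for illustration, not part of any NetWorker tool.

```python
import os
import resource


def fd_limits():
    """Return the (soft, hard) open-file limits of the current process."""
    return resource.getrlimit(resource.RLIMIT_NOFILE)


def open_fd_count(max_probe=4096):
    """Count descriptors currently open in this process.

    Probes descriptor numbers 0..N-1 with os.fstat, where N is the soft
    limit capped at max_probe (an arbitrary cap chosen for this sketch).
    Descriptors numbered at or above the cap are not counted.
    """
    soft, _hard = fd_limits()
    if soft == resource.RLIM_INFINITY or soft > max_probe:
        soft = max_probe
    count = 0
    for fd in range(soft):
        try:
            os.fstat(fd)      # raises OSError (EBADF) if fd is not open
        except OSError:
            continue
        count += 1
    return count


if __name__ == "__main__":
    soft, hard = fd_limits()
    print("soft limit:", soft, "hard limit:", hard)
    print("descriptors open now:", open_fd_count())
```

Once the open count approaches the soft limit, the next open() or
socket() call in that process fails with "too many open files".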

Somehow, the engineer managed to find us some patch versions for 6.1.1.
A new copy of nsrd materialised, origins unknown, but thought to have
been built as a special pre-release for a customer. It was the same
code, but built on an engineer's workstation rather than the official
build environment. Thus the code is compiled under Solaris 8 instead of
Solaris 2.6. This appears to get round the problem of open files. Also a
new version of nsrjb appeared, and this seemed to fix the problem of
slow or hanging tape mounts. I'm sure there was a third patch that made
nsrindexd reconnect to the server; I can check next week if need be.

The result was a slick and responsive system that helped us survive
until we were able to introduce a new server to split the load of this
overworked server.

I don't know how many of the fixes here have made it into 6.1.3, and it
worries me a little, since we are soon to be migrating to the newer
version, for a variety of good reasons. I think one of the most
important issues here is the NetWorker build environment used by Legato
engineering. The Solaris 2.6 platform is obviously placing serious
restrictions on scalability, and it's about time they either
abandoned 2.6, or started producing different versions for different
Solaris versions. It seems a little ironic that they are clinging to
Solaris 2.6 yet in NW7.0 they have quite happily dropped Linux support
for Pentium and lower altogether.

So, how big can a NetWorker environment get? Legato say that 64 devices
on a Unix server is OK. Yeah right! Maybe they can make that work on a
simple test environment, but a typical customer will have mixed
platforms, mixed devices, less than perfect networks and name
resolution, plenty of other stuff going on with their LAN, and lots of
other factors conspiring to stop it working. I know that there is steady
progress being made as the version numbers grow, and the move to faster
tape drives also helps since then fewer of them are needed, but I still
believe that we should be looking at a maximum of about 40 devices.
There will be exceptions to this in very specialised environments.

So Stan, I think Legato have all the answers for you, if they can only
find them. Perhaps you just need to get hold of the right experts within
Legato. Our particular expert was Scotland based, so he might not be
available to you, but I know that there are other people who understand
the issues. Perhaps there is a problem disseminating the information
down through the Support organisation?

As has been discussed recently, for 6.x versions of NetWorker, nsrd can
be quite a bottleneck, and faster CPUs are more useful than adding extra
CPUs. A box with 2 fast Sparc IIIs is probably better as a NetWorker
server than one with 10 slower processors.
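
The reasoning behind "fewer, faster CPUs" can be sketched with Amdahl's
law. This is only an illustration: the serial fraction below is a made-up
figure, not a measurement of nsrd, and relative_time is a hypothetical
helper name.

```python
def relative_time(serial_fraction, cpus, cpu_speed):
    """Relative run time of a workload under Amdahl's law.

    serial_fraction: share of the work that cannot be spread over CPUs
    cpus:            number of processors in the box
    cpu_speed:       per-CPU speed relative to a baseline of 1.0
    A result of 1.0 means "as long as one baseline CPU would take".
    """
    if not 0.0 <= serial_fraction <= 1.0:
        raise ValueError("serial_fraction must be between 0 and 1")
    if cpus < 1 or cpu_speed <= 0:
        raise ValueError("cpus must be >= 1 and cpu_speed must be > 0")
    parallel_fraction = 1.0 - serial_fraction
    return (serial_fraction + parallel_fraction / cpus) / cpu_speed


if __name__ == "__main__":
    # Assumption for illustration only: 80% of the work is serialised.
    two_fast = relative_time(0.8, cpus=2, cpu_speed=2.0)
    ten_slow = relative_time(0.8, cpus=10, cpu_speed=1.0)
    print("2 fast CPUs:", two_fast, " 10 slow CPUs:", ten_slow)
```

The more of the work that is serialised in a single process, the less the
extra processors help, and the more the per-CPU speed matters. With a
fully parallel workload the comparison goes the other way.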

I hope this is some help to you.




Stan Horwitz wrote:

Has anyone here run into a situation with NetWorker 6.1.3 (Power Edition)
on a single CPU Enterprise 450 with Solaris 9 where NSR chokes when there
SEEM to be more pending tape mount requests than available tape drives?
Our Legato server handles daily backups for 112 different clients; a mix
of Windows, Novell, Solaris, and Tru64 Unix. We also run SnapImage to back
up a pair of Mirapoint message stores with 220GB of data each to our
server's tape library.

After going back and forth with Legato tech support regarding frequent,
but intermittent slowdowns of our Legato server, I think I have hit upon
the common denominator between these slowdowns: too many tape mount
requests in a given period of time. Legato said today that my hypothesis
makes sense so they are investigating it from that angle.

Meanwhile, I am wondering if anyone else has encountered this issue. Our
Legato server will get to a point where tape ejects and mounts can take
several hours to be processed. This causes our backup schedule to fall way
behind. When this happens, NSR becomes very unresponsive. For example,
opening up nsrwatch can take 10 or more minutes as does quitting out of it
by pressing the "q" key. Sometimes we'll get a slowdown condition that
lasts as little as half an hour, or maybe one or two hours; other times it
will last all night. When this happens, the /nsr/logs/daemon.log file
shows instances of nsrmmd restart failures.

Not realizing this correlation, I have been juggling our backup schedule
almost every day, but all I believe I was actually doing was rescheduling
the slowdown. I have just taken to dropping the savegroup parallelism
setting on the five or six savegroups that have the greatest number of
clients. I set them equal to the number of target sessions on our tape
drives, which is 12. Fortunately, I have fairly wide latitude in how our
backups are scheduled. This afternoon, I also took one of our larger (in
terms of number of clients) savegroups and broke it up into three separate
savegroups and scheduled them to run a few hours apart with their weekly
full backup scheduled on different days of the week. This is all at
Legato's recommendation so maybe these steps will help.
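
A rough way to reason about these settings is to estimate how many drives
the running savegroups would want at once. The sketch below assumes that
all concurrently running groups write to the same pool, that every
session is active at the same moment, and that target sessions acts as a
hard per-drive cap (in practice it behaves more like a preference). Only
the figure of 12 comes from the message above; drives_wanted is a
hypothetical helper name.

```python
def drives_wanted(group_parallelisms, target_sessions):
    """Estimate drives needed by savegroups running at the same time.

    group_parallelisms: parallelism setting of each running savegroup
    target_sessions:    sessions each drive is expected to take
    """
    if target_sessions < 1:
        raise ValueError("target_sessions must be at least 1")
    if any(p < 0 for p in group_parallelisms):
        raise ValueError("parallelism cannot be negative")
    total_sessions = sum(group_parallelisms)
    # Integer ceiling division: a partly filled drive still needs a mount.
    return -(-total_sessions // target_sessions)


if __name__ == "__main__":
    target = 12  # target sessions per drive, as stated in the message
    print(drives_wanted([12], target))          # one group capped at 12
    print(drives_wanted([12, 12, 12], target))  # three such groups overlap
```

If the estimate exceeds the number of drives in the library, some
sessions have to wait for a mount, which is the situation staggering the
groups is meant to avoid.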

At any rate, when our NSR server slows down, operating system commands
such as "ls", the login process, etc. are still executed quickly. System
load rarely gets above 7 and even when it sinks way down, the problem does
not always go away.

We recently migrated our Qualstar tape library and backup server from a
system that runs Tru64 Unix and NSR 6.1.1 to this new Sun E450 server.
Since the migration, NSR has been nothing but problems. We will likely
retain outside assistance to help us deal with this issue and verify that
we have our Legato server set up with best practices in mind.  A colleague
and I have been poring over documentation from Legato, Sun, and Qualstar
in an attempt to better understand this problem.

Meanwhile, I am wondering if anyone else has encountered this type of
problem with 6.1.3 and if you're seeing any nsrmmd restart cancelations
and/or RPC timeout errors in your server's /nsr/logs/daemon.log file.
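
For anyone comparing notes, a small Python sketch for tallying such
entries in a log file follows. The search patterns are assumptions: they
match loosely on "nsrmmd", "RPC" and "too many open files" because the
exact wording of the daemon.log messages is not given in this thread.
The function names are hypothetical as well.

```python
import re
from collections import Counter

# Assumed, deliberately loose patterns; adjust them to the real wording
# found in your own daemon.log.
PATTERNS = {
    "nsrmmd": re.compile(r"nsrmmd", re.IGNORECASE),
    "rpc": re.compile(r"\bRPC\b", re.IGNORECASE),
    "open_files": re.compile(r"too many open files", re.IGNORECASE),
}


def count_matches(lines, patterns=PATTERNS):
    """Count, per pattern name, how many lines match that pattern."""
    counts = Counter({name: 0 for name in patterns})
    for line in lines:
        for name, pattern in patterns.items():
            if pattern.search(line):
                counts[name] += 1
    return counts


def scan_log(path, patterns=PATTERNS):
    """Run count_matches over a log file, tolerating odd bytes."""
    with open(path, errors="replace") as handle:
        return count_matches(handle, patterns)
```

Calling scan_log("/nsr/logs/daemon.log") on the server returns the
per-pattern line counts, which makes it easier to see whether the errors
cluster around the nights when mounts stall.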

