Networker

Re: [Networker] Too many open files

2004-07-09 15:34:09
Subject: Re: [Networker] Too many open files
From: Allan Nelson <an AT CEH.AC DOT UK>
To: NETWORKER AT LISTMAIL.TEMPLE DOT EDU
Date: Fri, 9 Jul 2004 20:33:37 +0100
> Hi,
> After upgrading Legato Networker 6.1.3 to 7.1.2 we got lots of
> "service at 0.0.0.0/5833error from accept call: Too many open files"
> messages in the daemon.log during heavy load, obviously caused by
> too many client connection requests.
> The Legato Networker Server was busy near death, nsrd occupied
> one CPU by 100% but the Server was more or less idle.
> We had similar situations several times in the past but we got no
> clue why.
> Just since we upgraded we got these detailed messages in the
> daemon.log,
> so now we know where to start to investigate.


Some (or all ;-) of this may help.
Good luck!
Allan.
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

Solution Title: Aborting client connection 'Too many open files' on
Solaris backup server
Solution ID: legato66194

Here is the solution:
The above-noted error message 'Too many open files' can occur on
busy backup servers due to overtaxed system resources. There are
several TCP/IP (and other) kernel tunable parameters that can help
resolve these issues. Although the examples listed below reference the
Solaris operating system, other UNIX flavors share similar variables.
Please refer to your relevant operating system manuals for the
appropriate syntax for modifying these variables where applicable.



File descriptors

Increase the file descriptor limit to 2048 (1024 if running
Solaris 2.5).

Some File descriptor uses in NetWorker
- every nsrexec on the server links back to nsrd
- one nsrexec process exists per client saveset (or one
for saveset "All")
- [server parallelism] or [group parallelism, if lower]
nsrexecs are launched per savegrp.
- even when server parallelism is maxed out, saves will
continue to start on the clients, one per fs up to
[client parallelism]. All of these link back to nsrd
- every nsrindexd links back to nsrd. If things are
running well, that's [server parallelism + 1]
- all nsrmmds link back to nsrd, including helper mmds
- nsrjb takes up 3 FDs each time it runs
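
As a rough illustration of how the uses listed above add up, the sketch below tallies them for a made-up server. It assumes each link back to nsrd costs exactly one descriptor and that every client runs saves up to its client parallelism. The function name estimate_nsrd_fds and all of the numbers are hypothetical; this is not a Legato formula.

```shell
# Rough tally of descriptors held by nsrd, following the list above.
# Assumption: one descriptor per link back to nsrd (not confirmed by Legato).
# Args: server_par group_par clients client_par nsrmmds nsrjb_runs
estimate_nsrd_fds() {
    server_par=$1; group_par=$2; clients=$3; client_par=$4; mmds=$5; jb=$6

    # nsrexecs per savegrp: server parallelism, or group parallelism if lower
    if [ "$group_par" -lt "$server_par" ]; then
        nsrexec=$group_par
    else
        nsrexec=$server_par
    fi

    saves=$(( clients * client_par ))   # worst case: every client at full client parallelism
    indexd=$(( server_par + 1 ))        # nsrindexd links when things are running well
    jbfds=$(( jb * 3 ))                 # nsrjb: 3 FDs each

    echo $(( nsrexec + saves + indexd + mmds + jbfds ))
}

# Hypothetical server: server parallelism 32, group parallelism 16, 20 clients
# with client parallelism 4, 8 nsrmmds, 2 nsrjb runs in flight.
estimate_nsrd_fds 32 16 20 4 8 2
```

With these invented numbers the tally is 143. Comparing such an estimate with the output of ulimit -n gives a first idea of how much headroom nsrd has.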


The number of file descriptors can be changed from the command line with:
# ulimit -n 2048
This sets the number of file descriptors to 2048. This is considered
the "soft limit", and cannot be greater than the "hard limit". The hard
limit is set in the kernel. For Solaris 8, the default value is 1024.

Note: ulimit -n may appear to accept a soft limit greater than the
hard limit, for example soft limit = 2048 while the hard limit is 1024.
However, the maximum number of file descriptors will still be the
hard limit.
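
A minimal sketch of checking both limits before raising the soft one. It uses the standard ulimit options -S (soft) and -H (hard). The helper name capped_fd_limit is hypothetical, and 2048 is simply the value suggested above.

```shell
# Hypothetical helper: the soft limit to request, capped at the hard limit.
# $1 = desired soft limit, $2 = hard limit (a number or "unlimited")
capped_fd_limit() {
    want=$1
    cap=$2
    if [ "$cap" = "unlimited" ] || [ "$want" -le "$cap" ]; then
        echo "$want"
    else
        echo "$cap"
    fi
}

soft=$(ulimit -S -n)   # current soft limit
hard=$(ulimit -H -n)   # current hard limit
echo "soft=$soft hard=$hard"

# Set the soft limit for this shell and its children, never above the hard limit.
target=$(capped_fd_limit 2048 "$hard")
ulimit -S -n "$target" 2>/dev/null || echo "could not set soft limit to $target"
```

The new limit only applies to processes started from this shell, so the NetWorker daemons would have to be restarted from it to pick the value up.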

To find the hard file descriptor limit:
1) Check the /etc/system file. If a value other than the default is used,
it must be listed in /etc/system for the value to be set after a reboot.

2) Run:
# adb -kw /dev/ksyms
physmem 5da31
Type:
rlim_fd_max/D
Result:
rlim_fd_max: rlim_fd_max: 1024

See the adb man page for details.
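
For step 1, a small sketch that looks for file descriptor settings in /etc/system. The helper name show_fd_settings is hypothetical. It matches rlim_fd_max and also rlim_fd_cur, the Solaris tunable for the default soft limit, which the solution text does not mention.

```shell
# Hypothetical helper: print any rlim_fd_* settings found in a system file.
# $1 = path to the file (defaults to /etc/system)
show_fd_settings() {
    file=${1:-/etc/system}
    if [ -r "$file" ]; then
        # Comment lines in /etc/system start with '*' and do not match.
        grep -E '^[[:space:]]*set[[:space:]]+rlim_fd_(max|cur)' "$file" \
            || echo "no rlim_fd_* entries in $file (defaults apply)"
    else
        echo "$file not readable"
    fi
}

show_fd_settings
```

If nothing is listed, the kernel is running with its built-in default, which is what the adb check in step 2 reports.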

To override this value, tune the system parameter as follows:
1) Add to the /etc/system file:
set rlim_fd_max=2048
2) Shut down and issue boot -r at the OpenBoot prompt.
NOTE: If you are not familiar with changing kernel parameters, verify
the process with your Sun documentation.


---------------------------------------------------------------------------------
tcp_time_wait_interval

Set tcp_time_wait_interval to 60000 (1 minute), down from the initial
default value of 240000 (4 minutes), with 'ndd -set /dev/tcp
tcp_time_wait_interval 60000'. You can view the current value with
'ndd -get /dev/tcp tcp_time_wait_interval'. For Solaris 2.6, check
tcp_close_wait_interval instead.
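
A cautious way to apply this is sketched below. The wrapper ndd_tcp_set and the APPLY variable are hypothetical; only the ndd -get and ndd -set calls are the real commands. By default the wrapper just prints what it would run, and it touches the system only when APPLY=yes and ndd is present, which assumes a root shell on a Solaris host.

```shell
# Hypothetical wrapper: print the ndd command for a TCP tunable, and run it
# only when APPLY=yes and ndd is installed.
ndd_tcp_set() {
    param=$1
    value_ms=$2
    echo "ndd -set /dev/tcp $param $value_ms"
    if [ "${APPLY:-no}" = "yes" ] && command -v ndd >/dev/null 2>&1; then
        echo "before: $(ndd -get /dev/tcp "$param")"
        ndd -set /dev/tcp "$param" "$value_ms"
        echo "after:  $(ndd -get /dev/tcp "$param")"
    fi
}

# 60000 ms = 1 minute, the value suggested above (dry run unless APPLY=yes).
ndd_tcp_set tcp_time_wait_interval 60000
```

Values set with ndd are not kept across a reboot, so a change that helps is usually repeated from a startup script.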


---------------------------------------------------------------------------------
tcp_keepalive_interval

Increase the tcp_keepalive_interval to 4 hours. You can view the
current value by running
'ndd -get /dev/tcp tcp_keepalive_interval'. Note that the output is in
milliseconds.

You can set tcp_keepalive_interval with the following:

ndd -set /dev/tcp tcp_keepalive_interval (value)
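
Since ndd works in milliseconds, the 4 hour figure has to be converted before it is used as (value). A small sketch of the arithmetic follows; the helper name hours_to_ms is made up for the example, and the script only prints the resulting command rather than running it.

```shell
# Convert hours to the milliseconds that ndd expects.
hours_to_ms() {
    echo $(( $1 * 60 * 60 * 1000 ))
}

KEEPALIVE_MS=$(hours_to_ms 4)   # 4 * 60 * 60 * 1000 = 14400000
echo "ndd -set /dev/tcp tcp_keepalive_interval $KEEPALIVE_MS"
```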

---------------------------------------------------------------------------------
tcp_conn_hash

Check the value of the TCP parameter tcp_conn_hash with "ndd -get
/dev/tcp tcp_conn_hash". It will give you something like

tcp_conn_hash_size = 512
TCPB dest snxt suna swnd rnxt rack rwnd rto mss w sw rw t recent
[lport,fport] state
002 300013a8818 ::ffff:127.0.0.1 0188fb78 0188fb78 0000032768 018608cd
018608cd 0000032768 03375 08192 0 00 00 0 00000000 [32801, 32802]
TCP_ESTABLISHED
002 300013a8958 ::ffff:127.0.0.1 018608cd 018608cd 0000032768 0188fb78
0188fb78 0000032768 03375 08192 0 00 00 0 00000000 [32802, 32801]
TCP_ESTABLISHED
006 300013a95d8 ::ffff:127.0.0.1 017c730d 017c730d 0000032768 017abe06
017abe06 0000032768 03375 08192 0 00 00 0 00000000 [32795, 32796]
TCP_ESTABLISHED
006 300013a9718 ::ffff:127.0.0.1 017abe06 017abe06 0000032768 017c730d
017c730d 0000032768 03375 08192 0 00 00 0 00000000 [32796, 32795]
TCP_ESTABLISHED
010 300013a9218 ::ffff:127.0.0.1 0183fc8a 0183fc8a 0000032768 01835542
01835542 0000032768 03915 08192 0 00 00 0 00000000 [32788, 32799]
TCP_ESTABLISHED
010 300013a9998 ::ffff:127.0.0.1 01835542 01835542 0000032768 0183fc8a
0183fc8a 0000032768 04047 08192 0 00 00 0 00000000 [32799, 32788]
TCP_ESTABLISHED
012 300013a9c18 ::ffff:127.0.0.1 0177e520 0177e520 0000032768 0175a848
0175a848 0000032768 04702 08192 0 00 00 0 00000000 [32788, 32793]
TCP_ESTABLISHED
.
.
255 30000faf700 ::ffff:127.0.0.1 00e1360c 00e1360c 0000032768 00e402c6
00e402c6 0000032768 04265 08192 0 00 00 0 00000000 [11471, 7938]
TCP_TIME_WAIT-

The hash size shown in the first line must be higher than the first
number in the last line. If they are identical, the hash table size is
too small and must be increased with "ndd -set /dev/tcp tcp_conn_hash
xxxx"
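
The comparison described above can be scripted. This is only a sketch: it assumes the ndd output looks like the sample shown, with a tcp_conn_hash_size line first and each entry starting with a numeric bucket index followed by an address containing ':' or '.'. The function name check_conn_hash is hypothetical.

```shell
# Hypothetical helper: reads "ndd -get /dev/tcp tcp_conn_hash" output on stdin
# and compares the hash size with the bucket index of the last entry.
check_conn_hash() {
    awk '
        /tcp_conn_hash_size/ { size = $NF + 0 }
        # Entry lines: numeric bucket index in field 1, destination address in field 3.
        $1 ~ /^[0-9]+$/ && $3 ~ /[.:]/ { last = $1 + 0 }
        END {
            if (size == 0) { print "hash size not found"; exit 1 }
            if (last >= size)
                print "tcp_conn_hash_size " size " looks too small (last bucket " last ")"
            else
                print "tcp_conn_hash_size " size " ok (last bucket " last ")"
        }'
}

# Usage on a Solaris host:
#   ndd -get /dev/tcp tcp_conn_hash | check_conn_hash
```

For the sample output above this would report the size 512 as ok with last bucket 255. On some Solaris releases the hash size may only be changeable at boot time through /etc/system rather than with ndd -set, so verify the procedure with your Sun documentation.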


Here is the problem or goal:
Error: 'Aborting client connection from (w.x.y.z)/19996 to
w.x.y.z/9405, Too many open files'

Error: 'Aborting client connection from 127.0.0.1/20000 to
127.0.0.1/9405, Too many open files'

Error: 'nsrd: log event failed: Bad file number'

too many open files

Error: 'nsrmmd [#]: Aborting client connection from 127.0.0.1/[port] to
127.0.0.1/[port]'

Identify certain relevant Solaris TCP/IP kernel tunable parameters
that impact NetWorker operations


Problem Environment:
NetWorker

Solaris

