[Veritas-bu] Problems since upgrading master server hardware (NB 4.5, MP3, Solaris8)
From: Len.Boyle AT sas DOT com (Len Boyle)
Date: Tue, 21 Jan 2003 21:55:16 -0500
Hello Suzi
You might want to issue the command netstat -a and look for a large number
of sockets in a CLOSE_WAIT or TIME_WAIT state.
If so, you might be seeing the problem described in the following
technote. It is a problem that we ran into after a hardware upgrade.
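As a rough sketch of that netstat check, the helper below counts sockets in the two wait states. The function name count_wait_sockets is made up for illustration, and it assumes the state is the last column of each socket line, as in Solaris "netstat -an" output; other platforms may lay the columns out differently.

```shell
# Count sockets in TIME_WAIT and CLOSE_WAIT from netstat-style output on stdin.
# Assumption: the connection state is the last whitespace-separated field.
count_wait_sockets() {
    awk '$NF == "TIME_WAIT" || $NF == "CLOSE_WAIT" { n[$NF]++ }
         END { printf "TIME_WAIT %d\nCLOSE_WAIT %d\n", n["TIME_WAIT"], n["CLOSE_WAIT"] }'
}

# Typical use on the master server:
#   netstat -an | count_wait_sockets
```

A TIME_WAIT count in the thousands while backups are failing would be consistent with the technote below; a handful would point elsewhere.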
len boyle
=============================================================================================
Symptom:
When performing backups, or restores, socket errors are being produced.
Exact Error Message:
Status codes 13, 14, 23, 24, 25 may occur. 13 = file read failed, 14 = file
write failed, 23 = socket read failed, 24 = socket write failed, 25 = cannot
connect on socket
Solution:
Possibly the NetBackup server is getting a lot of traffic from clients;
therefore, sockets are not becoming free soon enough to satisfy the demand.
For Solaris 2.6 or previous use the following command:
ndd -get /dev/tcp tcp_close_wait_interval
For Solaris 7 or above use the following command:
ndd -get /dev/tcp tcp_time_wait_interval
For HP-UX 11 use the following command:
ndd -get /dev/tcp tcp_time_wait_interval
(NOTE: The equivalent command on HP-UX 10 is "nettune" instead of "ndd".)
These commands will produce a large number, like the default 240000 (value is
in milliseconds, so 240 seconds or 4 minutes). This is the amount of time to
wait after a socket is closed before it can be reused. In most cases this can
be shortened to about 1 second (1000) and may alleviate the problem.
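To make the unit conversion concrete, here is a small sketch that turns an ndd value in milliseconds into seconds and minutes. The function name ms_to_human is hypothetical; it only does the arithmetic and does not call ndd itself.

```shell
# Convert a millisecond interval (as reported by ndd) to seconds and minutes.
ms_to_human() {
    ms=$1
    secs=$((ms / 1000))
    echo "${ms} ms = ${secs} s ($((secs / 60)) min $((secs % 60)) s)"
}

ms_to_human 240000   # the default quoted above
ms_to_human 1000     # the suggested shorter value
```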
The command to set it to 1000 on Solaris 2.6 and previous versions is:
ndd -set /dev/tcp tcp_close_wait_interval 1000
The command to set it to 1000 on Solaris 7 and later versions is:
ndd -set /dev/tcp tcp_time_wait_interval 1000
The command to set it to 1000 on HP-UX 11 is:
ndd -set /dev/tcp tcp_time_wait_interval 1000
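Since the parameter name depends on the OS release, a script that runs on mixed hosts could pick it as sketched below. The function name tcp_wait_param is made up for illustration. It assumes "uname -r" reports Solaris 2.6 as 5.6 and Solaris 7 and later as 5.7 and up, and HP-UX 11 as B.11.xx; it returns failure for anything else, including HP-UX 10, where the technote says "nettune" is used instead.

```shell
# Print the ndd parameter name for an OS name and release,
# as given by `uname -s` and `uname -r`. Returns 1 if unknown.
tcp_wait_param() {
    os=$1
    rel=$2
    case "$os" in
        SunOS)
            # Assumption: release looks like 5.<minor>[.<patch>].
            minor=${rel#5.}
            minor=${minor%%.*}
            if [ "$minor" -le 6 ]; then
                echo tcp_close_wait_interval   # Solaris 2.6 and earlier
            else
                echo tcp_time_wait_interval    # Solaris 7 and later
            fi
            ;;
        HP-UX)
            case "$rel" in
                B.11.*) echo tcp_time_wait_interval ;;
                *) return 1 ;;                 # e.g. HP-UX 10 uses nettune
            esac
            ;;
        *)
            return 1
            ;;
    esac
}

# Typical use:
#   ndd -get /dev/tcp "$(tcp_wait_param "$(uname -s)" "$(uname -r)")"
```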
The "ndd" command makes the change immediately, without a need for a reboot.
The setting reverts to the default after a reboot. To reapply the value at
each boot, add the command to the appropriate TCP/IP startup script. On
Solaris, this is /etc/rc2.d/S69inet; on HP-UX 11, see
/etc/rc.config.d/nddconf for examples of how to set it.
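For the HP-UX side, the sketch below prints an entry in the indexed TRANSPORT_NAME / NDD_NAME / NDD_VALUE form that nddconf is understood to use; treat that layout as an assumption and compare it with the commented examples in the file on your own system. The function name nddconf_entry is hypothetical, and index 0 assumes no other entries are already defined. On Solaris the equivalent is just the "ndd -set" line itself placed in the startup script.

```shell
# Print an nddconf-style stanza for one tcp tunable.
# Arguments: entry index, parameter name, value.
nddconf_entry() {
    idx=$1
    name=$2
    value=$3
    printf 'TRANSPORT_NAME[%s]=tcp\n' "$idx"
    printf 'NDD_NAME[%s]=%s\n' "$idx" "$name"
    printf 'NDD_VALUE[%s]=%s\n' "$idx" "$value"
}

nddconf_entry 0 tcp_time_wait_interval 1000
```

Review the printed lines before appending them to the file by hand, so that an existing entry with the same index is not overwritten.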
Acknowledgments:
Numerous prior cases
_____
-----Original Message-----
From: Suzi Archer [mailto:sarcher AT connect.com DOT au]
Sent: Tue 1/21/2003 8:44 PM
To: veritas-bu AT mailman.eng.auburn DOT edu
Cc:
Subject: [Veritas-bu] Problems since upgrading master server hardware (NB 4.5, MP3, Solaris8)
Hi,
We upgraded our hardware on our master server about a week ago, and we've had
no end of grief ever since. (Note that we upgraded from 3.4 to 4.5 at
essentially the same time.)
The errors we are getting are mostly:
Subject: Backup on fred - 134 started
Status = unable to process request because the server resources are busy.
but closely followed by:
Subject: Backup on fred - 24 started
Status = socket write failed.
Subject: Backup on fred - 40 started
Status = network connection broken.
Subject: Backup on fred - 49 started
Status = client did not start.
Subject: Backup on fred - 13 started
Status = file read failed.
Subject: Backup on fred - 228 started
Status = unable to process request.
And we are talking in the hundreds of messages/tries per day
(3500-odd since last Wednesday).
It appears to be affecting all clients (sol2.6 and Sol6).
Some backups end up completing after many tries; some simply fall out of their
window and fail.
Our hardware changed from:
Sun Microsystems 280R, 2 x UltraSparcIII 750MHz, 2GB ram, GB eth
To:
Sun Microsystems 420R, 4 X UltraSPARC-II 450MHz, 4GB ram, GB eth
IBM 6 Drive anaconda robot (LTO) attached by fibre to the same machine
Could there possibly be some settings that are uber important that I missed?
The things I can think of are (and match the previous server):
bob[/]# head -15 /etc/system
*ident "@(#)system 1.18 97/06/27 SMI" /* SVR4 1.5 */
*
* SYSTEM SPECIFICATION FILE
*
set shmsys:shminfo_shmmax=4294967295
set shmsys:shminfo_shmmin=1
set shmsys:shminfo_shmmni=512
set shmsys:shminfo_shmseg=100
set semsys:seminfo_semmns=500
set semsys:seminfo_semmni=150
set semsys:seminfo_semmap=350
set semsys:seminfo_semmnu=700
set semsys:seminfo_semume=100
set semsys:seminfo_semmsl=100
bob[/]# cat /usr/openv/netbackup/db/config/NUMBER_DATA_BUFFERS
16
bob[/]# cat /usr/openv/netbackup/db/config/SIZE_DATA_BUFFERS
131072
Load on the server is not particularly high, there is plenty of memory and swap
free, and the ethernet interface is not saturated by far.
Is there a list of the resources that NB could possibly be referring to when it
claims "server resources are busy"?
As usual, the Veritas knowledgebase was useless :(
Anyone spotting anything dead obvious here, or know where else I should look?
TIA
-suz
--
Suzi Archer, Systems Group, @connect.com.au
180-188 Burnley Street, Richmond VIC, 3121
sarcher AT connect.com DOT au
(03) 8686 2321
ASCII Ribbon Campaign Against HTML Mail
** Note Address and Phone number changed on Nov 1 **
_______________________________________________
Veritas-bu maillist - Veritas-bu AT mailman.eng.auburn DOT edu
http://mailman.eng.auburn.edu/mailman/listinfo/veritas-bu