Ah, this was my hell: I lived with this for several months, with restarts
weekly (and sometimes more frequently). We have not yet upgraded, as we found
ways to optimize our backups around the scheduler. FWIW, we are on Solaris 9
64-bit (8 GB RAM) and at 5.1 MP5. We were also instructed to upgrade to
5.1 MP6, and the issue did not resolve itself. This is an inherent design flaw
of the scheduler in 5.1, which is fixed in 6.0. However, there are things you
can do to reduce or even eliminate the restarts:
- Increase the shared memory (in your case, you are already at the maximum HP
  allows).
- Combine as many streams as possible (do you back up a lot of CIFS/NFS
  streams that could be consolidated into NDMP?). This greatly helped us, as
  it decreased the number of streams/jobs being started, so the scheduler had
  fewer to keep track of.
- Make sure all buffer settings are at the maximum supported by Symantec (it
  took 3-4 calls with support before I dragged this information out of them).
  Since we have Solaris 9 64-bit with 8 GB of RAM, here are the settings
  Symantec advised for the number of expected concurrent job starts (on
  Solaris, monitor the ipcs -a output; not sure of the HP equivalent):
400 JOBS:
set msgsys:msginfo_msgmnb=131072
set shmsys:shminfo_shmmax=33554432
set msgsys:msginfo_msgmni=512
set msgsys:msginfo_msgtql=1000
600 JOBS:
set msgsys:msginfo_msgmnb=262144
set shmsys:shminfo_shmmax=67108864
set msgsys:msginfo_msgmni=768
set msgsys:msginfo_msgtql=1500
800 JOBS:
set msgsys:msginfo_msgmnb=262144
set shmsys:shminfo_shmmax=67108864
set msgsys:msginfo_msgmni=1024
set msgsys:msginfo_msgtql=2000
set semsys:seminfo_semmni=2056
set semsys:seminfo_semmns=2056
set semsys:seminfo_semmnu=2056
set semsys:seminfo_semmsl=600
1600 JOBS (CURRENT SETTINGS):
* Message queues
set msgsys:msginfo_msgmax=8192
set msgsys:msginfo_msgmnb=524288
set msgsys:msginfo_msgmni=2048
set msgsys:msginfo_msgtql=2000
* Semaphores
set semsys:seminfo_semmni=4096
set semsys:seminfo_semmns=4096
set semsys:seminfo_semmnu=4096
set semsys:seminfo_semmsl=600
set semsys:seminfo_semopm=64
set semsys:seminfo_semume=128
* Shared memory
set shmsys:shminfo_shmmax=4294967296
set shmsys:shminfo_shmmni=230
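For anyone who wants to script the choice, here is a minimal Python sketch. The values are copied from the tiers above; the function name and the selection rule (use the smallest tier that covers your expected concurrent job starts) are assumptions for illustration, not something Symantec provided.

```python
# Tunables copied from the tiers listed above, keyed by expected concurrent
# job starts. The lookup rule below is an assumption, not Symantec guidance.
TIERS = {
    400: {
        "msgsys:msginfo_msgmnb": 131072,
        "shmsys:shminfo_shmmax": 33554432,
        "msgsys:msginfo_msgmni": 512,
        "msgsys:msginfo_msgtql": 1000,
    },
    600: {
        "msgsys:msginfo_msgmnb": 262144,
        "shmsys:shminfo_shmmax": 67108864,
        "msgsys:msginfo_msgmni": 768,
        "msgsys:msginfo_msgtql": 1500,
    },
    800: {
        "msgsys:msginfo_msgmnb": 262144,
        "shmsys:shminfo_shmmax": 67108864,
        "msgsys:msginfo_msgmni": 1024,
        "msgsys:msginfo_msgtql": 2000,
        "semsys:seminfo_semmni": 2056,
        "semsys:seminfo_semmns": 2056,
        "semsys:seminfo_semmnu": 2056,
        "semsys:seminfo_semmsl": 600,
    },
    1600: {
        "msgsys:msginfo_msgmax": 8192,
        "msgsys:msginfo_msgmnb": 524288,
        "msgsys:msginfo_msgmni": 2048,
        "msgsys:msginfo_msgtql": 2000,
        "semsys:seminfo_semmni": 4096,
        "semsys:seminfo_semmns": 4096,
        "semsys:seminfo_semmnu": 4096,
        "semsys:seminfo_semmsl": 600,
        "semsys:seminfo_semopm": 64,
        "semsys:seminfo_semume": 128,
        "shmsys:shminfo_shmmax": 4294967296,
        "shmsys:shminfo_shmmni": 230,
    },
}


def etc_system_lines(expected_jobs):
    """Return /etc/system 'set' lines for the smallest tier covering expected_jobs."""
    for limit in sorted(TIERS):
        if expected_jobs <= limit:
            return [f"set {name}={value}" for name, value in TIERS[limit].items()]
    raise ValueError(f"no tier listed for {expected_jobs} concurrent job starts")


if __name__ == "__main__":
    # Example: 500 expected concurrent job starts falls into the 600 tier.
    print("\n".join(etc_system_lines(500)))
```

The output is only text to review before it goes anywhere near /etc/system, and a reboot is still needed for changes there to take effect.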
Our issue was compounded by a CLIENT_CONNECT_TIMEOUT value of 3600 in the
bp.conf file. When clients defined in a backup policy are “suddenly” down or
retired (or should have been removed from the backup policies), this setting
greatly impacts the scheduler’s ability to process jobs. It's hell when you
have a few dozen of them wreaking havoc.
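To put a rough number on that, here is a back-of-the-envelope Python sketch. It assumes the scheduler waits out the full CLIENT_CONNECT_TIMEOUT for each unreachable client one after another, which is a worst case I have not confirmed; the function name and the 300-second comparison value are illustrative only.

```python
def worst_case_stall_hours(dead_clients, client_connect_timeout_s=3600):
    """Hours spent waiting if every dead client is waited on in turn (assumed serial)."""
    return dead_clients * client_connect_timeout_s / 3600


if __name__ == "__main__":
    # Two dozen dead clients at the 3600-second setting from our bp.conf.
    print(worst_case_stall_hours(24))        # 24.0
    # Same clients with a hypothetical 300-second timeout for comparison.
    print(worst_case_stall_hours(24, 300))   # 2.0
```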
Anyway, with the above adjustments, we
have to restart maybe once per month, but do so anyway for other reasons.
We are still hoping to upgrade to 6.5 soon.
Regards,
Doug
From: veritas-bu-bounces AT mailman.eng.auburn DOT edu
[mailto:veritas-bu-bounces AT mailman.eng.auburn DOT edu] On Behalf Of rascal
Sent: Thursday, April 03, 2008 9:13 AM
To: mikemclain AT northwesternmutual DOT com
Cc: randy.k.zimmer AT monsanto DOT com; veritas-bu AT mailman.eng.auburn DOT edu
Subject: Re: [Veritas-bu] Status Code 150's
We had a similar issue with NBU 5.1 MP5. We rolled to 6 in an attempt to fix
the problem and it only made it worse. We ended up rolling back to 5 and
setting up a call with Symantec. At the end of the day, it was a memory leak,
which they issued a fix for. I would suggest trending the memory, recording
the results, and getting a call opened with Symantec. FYI, we have not
experienced this issue since we got the fix from Symantec!
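As one way to turn recorded memory samples into a trend for the support call, here is a minimal Python sketch. It assumes you have already collected (timestamp, size in KB) pairs for the process yourself; how you collect them is left out because the ps options differ between platforms, and the function name is hypothetical.

```python
def growth_kb_per_hour(samples):
    """Average growth between the first and last sample.

    samples: list of (unix_seconds, size_kb) tuples, oldest first.
    """
    if len(samples) < 2:
        raise ValueError("need at least two samples")
    (t0, kb0), (t1, kb1) = samples[0], samples[-1]
    if t1 <= t0:
        raise ValueError("samples must span a positive time interval")
    # Multiply before dividing so whole-hour intervals stay exact.
    return (kb1 - kb0) * 3600 / (t1 - t0)
```

A steadily positive figure over several days is the sort of evidence that supports a memory-leak case; a first-to-last average like this hides spikes, so keep the raw samples as well.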
On 4/3/08, mikemclain AT northwesternmutual DOT com <mikemclain AT northwesternmutual DOT com> wrote:
Randy,
When running on NBU 5.1 MP6, we would experience this error about every 10
days, even with 16GB of memory on the master; on HP-UX 11.11, 32-bit apps are
limited to 1.75GB of shared memory. This technote describes the issue/memory
leak (http://seer.entsupport.symantec.com/docs/294251.htm), but
our only recourse was to recycle NBU once a week.
We upgraded to NBU 6.0 last fall and this issue no longer occurs, due to the
elimination of bpsched.
Mike
From: veritas-bu-bounces AT mailman.eng.auburn DOT edu
[mailto:veritas-bu-bounces AT mailman.eng.auburn DOT edu] On Behalf Of ZIMMER, RANDY K [AG/1000]
Sent: Thursday, April 03, 2008 10:10 AM
To: veritas-bu AT mailman.eng.auburn DOT edu
Subject: [Veritas-bu] Status Code 150's
All,
I have a Master Server, an RP2470 with 1024MB of memory, and we process
about 1500 backups per day through it. In the past two weeks I have
experienced Code 150's where the backups were cancelled not by an
administrator, but by the system. Here is the error we receive when it occurs:
3688660: 05:05:29.648 [10196] <16> start_backup_job: fork error: Not enough space (12)
3688661: 05:05:29.648 [10196] <16> run_any_ret_level: failure starting backup job, PID=-1
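For spotting these entries in a log without reading it by eye, here is a small Python sketch. The regular expression is written against the two lines quoted above only, so the assumption is that other entries follow the same layout; the function name is hypothetical. Errno 12 is ENOMEM, which these platforms report as "Not enough space".

```python
import re

# Matches lines shaped like the start_backup_job entry quoted above.
FORK_ERROR = re.compile(
    r"(?P<time>\d{2}:\d{2}:\d{2}\.\d{3}) \[(?P<pid>\d+)\] <(?P<sev>\d+)> "
    r"start_backup_job: fork error: (?P<reason>.+?) \((?P<errno>\d+)\)"
)


def find_fork_errors(lines):
    """Return (time, errno, reason) for every fork-error line found."""
    hits = []
    for line in lines:
        match = FORK_ERROR.search(line)
        if match:
            hits.append(
                (match.group("time"), int(match.group("errno")), match.group("reason"))
            )
    return hits
```

Called on the two lines above, only the first one matches, since the second reports the failed job start rather than the fork error itself.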
When this happens, nothing else will schedule until we either restart all the
NetBackup processes or reboot the server, and I have done both. We logged a
call on this and there is no fix as of yet, but one is planned for 5.1MP7,
which is due out sometime this month. The recommendation was to increase the
memory on the server (I realize 1GB is extremely low), and we should be
receiving it shortly. I have load balanced the schedule as much as I can.
Has anyone else experienced this issue, and if so, do you have any information
that would be helpful? The first two times this happened I rebooted the
server; for the subsequent outages all I did was recycle the application.
I'm looking for any and all opinions on this topic.
Thanks,
Randy K. Zimmer
Sr. Unix System Administrator
Office: 314-694-3109
Cell: 314-960-0500
rkzimm AT monsanto DOT com
_______________________________________________
Veritas-bu maillist - Veritas-bu AT mailman.eng.auburn DOT edu
http://mailman.eng.auburn.edu/mailman/listinfo/veritas-bu
--
Matthew MCP, MCSA, MCTS, OCA
rascal1981 AT gmail DOT com
Define Trouble:
Why did you apply THAT patch??....