Networker

Re: [Networker] nsrjobd fubar in 7.3.4 and 7.4 SP2?

2008-07-06 11:25:19
From: Yaron Zabary <yaron AT ARISTO.TAU.AC DOT IL>
To: NETWORKER AT LISTSERV.TEMPLE DOT EDU
Date: Sun, 6 Jul 2008 18:16:02 +0300
I just started seeing hung sessions in the group's 'show details' window after the upgrade to 7.3.4. It seems that I was getting these messages in daemon.log:

nsrjobd: jobsdb size at 22938674 exceeded high size watermark.

And then many of these:

nsrjobd: Jobs error: Unable to find record for job 236744 during an attempt to send message to it

After I discussed this with my support engineer, he found esg93651 (which I cannot access via Powerlink), which suggests raising the jobsdb size until the watermark messages stop (the default is 20 MB). I suspect (the ESG hints at this) that the process that trims the jobsdb does a poor job, so it is better to let records expire by time. Come to think of it, having this limit at all seems like a poor design decision, since most people can allocate even 1 GB for the jobsdb and avoid the hung save problem.
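One rough way to check whether a raised limit has actually silenced the watermark messages is to count the two messages quoted above in daemon.log. The following is only a sketch: it assumes the messages appear in exactly the form quoted in this post, and the function name `summarize_jobsdb_messages` and the command-line handling are made up for illustration. The location of daemon.log depends on the installation, so the path is passed as an argument.

```python
import re
import sys
from collections import Counter

# Patterns copied from the two daemon.log messages quoted in this post.
# Assumption: real log lines contain these strings verbatim.
WATERMARK_RE = re.compile(
    r"nsrjobd: jobsdb size at (\d+) exceeded high size watermark"
)
MISSING_JOB_RE = re.compile(
    r"nsrjobd: Jobs error: Unable to find record for job (\d+)"
)


def summarize_jobsdb_messages(lines):
    """Count watermark warnings and missing-job errors in daemon.log lines."""
    sizes = []           # jobsdb sizes reported by watermark warnings
    missing = Counter()  # job id -> number of "unable to find record" errors
    for line in lines:
        match = WATERMARK_RE.search(line)
        if match:
            sizes.append(int(match.group(1)))
            continue
        match = MISSING_JOB_RE.search(line)
        if match:
            missing[match.group(1)] += 1
    return {
        "watermark_hits": len(sizes),
        "peak_size_bytes": max(sizes, default=0),
        "missing_job_errors": sum(missing.values()),
        "distinct_missing_jobs": len(missing),
    }


if __name__ == "__main__" and len(sys.argv) > 1:
    # Usage (hypothetical script name): python jobsdb_check.py /path/to/daemon.log
    with open(sys.argv[1], errors="replace") as log_file:
        print(summarize_jobsdb_messages(log_file))
```

If `watermark_hits` stays at zero after the size is raised, the limit is no longer being reached; `peak_size_bytes` gives a lower bound for how large the limit needs to be.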

Oscar Olsson wrote:
On 2008-05-27 12:21, Peter Viertel revealed:

PV> I messed it up back when I had 733 by fiddling with the setting for
PV> maximum jobsdb size. I'd added a zero to the end thinking that
PV> allowing it to be bigger would mean fewer issues with its GC routines
PV> but EMC told us to put it back to the default and it seemed to work
PV> since then.

Per EMC support's recommendation, we lowered the data retention in the jobsdb to three days and increased its size to 100 MB. This has had no effect, at least not a positive one. :)

Another thing we have changed, also per their recommendation, is to increase the number of TCP connections that can be opened or be half-open per second. I believe that has had no effect either, especially considering that we still see the same problems. :P

PV> Have you tried moving the whole jobsdb directory aside and restarting networker?

Several times. It works OK for a day or two, but then messages start appearing in the logs indicating that things can't talk to it, some savesets get aborted due to inactivity, nsrjobd takes lots of CPU and memory, and so on, until nothing works. That process takes about a week at most.

PV> I share your pain with emc support.

Yes. Nothing has really changed in recent years when it comes to their ability to identify and solve software bugs. Although I am getting the feeling that the industry as a whole is closing in on the EMC NetWorker level of support (sadly enough).

//Oscar

To sign off this list, send email to listserv AT listserv.temple DOT edu and type 
"signoff networker" in the body of the email. Please write to networker-request 
AT listserv.temple DOT edu if you have any problems with this list. You can access the 
archives at http://listserv.temple.edu/archives/networker.html or
via RSS at http://listserv.temple.edu/cgi-bin/wa?RSS&L=NETWORKER


--

-- Yaron.

