Networker

Re: [Networker] AW: [Networker] all sessions slow at some time during backup

2012-11-27 04:06:39
Subject: Re: [Networker] AW: [Networker] all sessions slow at some time during backup
From: Yaron Zabary <yaron AT ARISTO.TAU.AC DOT IL>
To: NETWORKER AT LISTSERV.TEMPLE DOT EDU
Date: Tue, 27 Nov 2012 11:06:24 +0200
While I don't have problems with you bashing nsrjobd, I cannot see how it can cause jobs that are already running to stall. Read the original problem description.

On 11/26/2012 06:44 AM, Rainer Rethmeier wrote:
I am still sure that we have a nsrjobd problem. Rachel is right, the nsrjobd
only handles 1 job at a time.

" Before the advent of nsrjobd, I used to "stack" multiple commands on the
command line, knowing that they would eventually be honoured by the
system.   These days, jobd only seems to accept commands in a "serial"
fashion ie one at a time. "

-----Original Message-----
From: EMC NetWorker discussion [mailto:NETWORKER AT LISTSERV.TEMPLE DOT EDU] On
Behalf Of Rachel Polanskis
Sent: Monday, 26 November 2012 02:33
To: NETWORKER AT LISTSERV.TEMPLE DOT EDU
Subject: Re: [Networker] all sessions slow at some time during backup

On 26/11/2012, at 11:23 AM, jeronimo wrote:

Ok, the transfer that is stalling is caused by the process using port
8094

  5.502271  130.1.1.118 -> 130.1.1.45   TCP [TCP ZeroWindow] 8094 > 59561
[ACK] Seq=1 Ack=56809 Win=0 Len=0 TSV=163928420 TSER=311968938
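
To pick the stalled receiver out of a longer capture, a small filter along these lines might help. It is only a sketch: the function name zero_window_senders is made up for illustration, and it assumes one packet per line in the same text layout as the capture line above (source, "->", destination, then "[TCP ZeroWindow]" followed by the source port). Other tshark versions print the summary differently.

```shell
# Hypothetical helper: list the address:port pairs that advertised a zero
# TCP window. Reads tshark-style text on stdin, one packet per line, in
# the layout shown above (an assumption, not a guaranteed format).
zero_window_senders() {
  awk '/\[TCP ZeroWindow\]/ {
    src = ""; port = ""
    for (i = 2; i < NF; i++) {
      if ($i == "->") src = $(i - 1)            # address left of the arrow
      if ($i == "ZeroWindow]") port = $(i + 1)  # source port follows the tag
    }
    if (src != "" && port != "") print src ":" port
  }' | sort -u
}
```

It could be run against a saved text capture, for example `zero_window_senders < capture.txt`, where capture.txt is a placeholder file name.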

Which would be (according to lsof | grep TCP | grep 8094) PID 11561

# ps -ef | grep 11561
root     11561 10625  6 Nov18 ?        11:21:26 /opt/nsr/nsrmmd -n 2

strace reveals FD 7

01:11:02.593016 write(7, "\0\0\0\0XL0305           NETWORKER
3\0\0\0\0\0\0\0\0\0\0\0\

which finally boils down to the tape drive

# lsof -p 11561 | grep 7u
nsrmmd  11561 root    7u   CHR     9,128      0t0     6489 /dev/nst0
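
A plain "grep 7u" also matches descriptors such as 17u or 27u. As a hedged alternative, the sketch below matches the FD column exactly. The function name fd_target is hypothetical, and it assumes the default lsof column layout shown above (FD in the fourth column, name last, no spaces in the path).

```shell
# Hypothetical helper: print what a given file descriptor points at.
# $1 is the descriptor number; lsof -p output is read from stdin.
# Assumes the default lsof columns: COMMAND PID USER FD TYPE ... NAME.
fd_target() {
  awk -v fd="$1" '$4 ~ ("^" fd "[rwu]") { print $NF }'
}
```

For the process above this would be used as `lsof -p 11561 | fd_target 7`.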

I also found this.
http://www.mibus.org/2012/11/11/networker-random-stalling/
Not sure, though, whether just killing nsrmmd is a solution.


I used to call this "The Spiral of Doom". Networker would gradually stop
progressing jobs, loading tapes and doing anything useful, although on the
face of it everything appears to be working correctly. I think what happens
is that Networker somehow loses track of job control. Everything that starts
a backup job has had to run through nsrd or nsrjobd since about v7.4. I
believe that as resources become difficult to allocate on a large system
with many clients, nsrd or nsrjobd somehow becomes exhausted and goes into
some kind of hung state.

I have found in the past that killing nsrmmd only helps with jobs that have
been stuck a long time. In those cases I have found it best to kill off the
savegrp related to the process, then kill the nsrmmd process related to that
savegrp. It is "dangerous" to kill nsrmmd arbitrarily, as it may actually be
streaming the catalogue of data to be backed up, and interrupting the
connection from nsrmmd to the tape drive can lead to failed savesets or
incomplete backups. It is actually "doing something", but it is not obvious.

I have usually mitigated Spiral of Doom situations by instead running nsrjb
-HH or -IE to force the systems to do something. Sometimes it is caused by
an oversubscribed tape drive which is not coping with receiving the commands
quickly enough. In this case, clearing the drive via the reset often helps
it "wake up" again for a bit.

I am not sure that Networker always handles some physical problems very
well, hanging or timing out on a slow device, or complaining when you send
too many jobs through that tiny little aperture called nsrjobd, which only
really lets you send one or two commands through before it starts ignoring
you.

Before the advent of nsrjobd, I used to "stack" multiple commands on the
command line, knowing that they would eventually be honoured by the
system.   These days, jobd only seems to accept commands in a "serial"
fashion ie one at a time.


rachel

--
Rachel Polanskis                Systems Admin, University of Western Sydney
ADD Werrington North Campus     (+61 2) 9678 7291  <r.polanskis AT uws.edu DOT au>
    "The perversity of the Universe tends towards a maximum." - Finagle's Law



--

-- Yaron.