Re: [Networker] all sessions slow at some time during backup

On 26/11/2012, at 11:23 AM, jeronimo wrote:

> Ok, the transfer that is stalling is caused by the process using port 8094
> 
>  5.502271  130.1.1.118 -> 130.1.1.45   TCP [TCP ZeroWindow] 8094 > 59561 [ACK] Seq=1 Ack=56809 Win=0 Len=0 TSV=163928420 TSER=311968938
> 
> Which would be (according to lsof | grep TCP | grep 8094) PID 11561
> 
> # ps -ef | grep 11561
> root     11561 10625  6 Nov18 ?        11:21:26 /opt/nsr/nsrmmd -n 2
> 
> strace reveals FD 7
> 
> 01:11:02.593016 write(7, "\0\0\0\0XL0305           NETWORKER                  
>                                3\0\0\0\0\0\0\0\0\0\0\0\
> 
> which finally boils down to the tape drive
> 
> # lsof -p 11561 | grep 7u
> nsrmmd  11561 root    7u   CHR     9,128      0t0     6489 /dev/nst0
> 
> I also found this.
> http://www.mibus.org/2012/11/11/networker-random-stalling/
> Not sure though if just killing nsrmmd is a solution..
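
The lookup chain quoted above (ZeroWindow segment, then port, then PID,
then file descriptor, then device) can be sketched in Python. This is
only a minimal sketch: it assumes the text-mode tshark line and the
default lsof column layout exactly as they appear in the quote, and the
function names are hypothetical, not part of any Networker tooling.

```python
import re

# Matches text-mode tshark lines like the one quoted above, e.g.
# "5.502271  130.1.1.118 -> 130.1.1.45   TCP [TCP ZeroWindow] 8094 > 59561 [ACK] ..."
ZERO_WINDOW = re.compile(
    r"(?P<src>\d+\.\d+\.\d+\.\d+)\s+->\s+(?P<dst>\d+\.\d+\.\d+\.\d+)\s+TCP\s+"
    r"\[TCP ZeroWindow\]\s+(?P<sport>\d+)\s+>\s+(?P<dport>\d+)"
)


def zero_window_sender(line):
    """Return (source address, source port) of a ZeroWindow segment, or None."""
    m = ZERO_WINDOW.search(line)
    if m is None:
        return None
    return m.group("src"), int(m.group("sport"))


def fd_target(lsof_line, fd):
    """Return the NAME column that `lsof -p PID` shows for descriptor fd, or None."""
    fields = lsof_line.split()
    # Assumed default columns: COMMAND PID USER FD TYPE DEVICE SIZE/OFF NODE NAME
    if len(fields) < 9:
        return None
    # The FD column carries an access-mode suffix, e.g. "7u" or "3r".
    m = re.match(r"(\d+)", fields[3])
    if m is None or int(m.group(1)) != fd:
        return None
    return " ".join(fields[8:])
```

Fed the two lines from the quote, zero_window_sender() picks out port
8094 on 130.1.1.118 and fd_target() maps descriptor 7 to /dev/nst0.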


I used to call this "The Spiral of Doom".  Networker would gradually
stop progressing jobs, loading tapes and doing anything useful,
although on the face of it everything appeared to be working
correctly.  I think what happens is that Networker somehow loses track
of job control.  Since about v7.4, everything that starts a backup job
has to run through nsrd or nsrjobd.  I believe that as resources become
difficult to allocate on a large system with many clients, nsrd or
nsrjobd somehow becomes exhausted and goes into some kind of hung
state.

In the past I have found that killing nsrmmd only helps with jobs that
have been stuck for a long time.  In those cases it is best to kill the
savegrp related to the process first, then kill the nsrmmd process
related to that savegrp.  It is "dangerous" to kill nsrmmd
indiscriminately, as it may actually be streaming the catalogue of data
to be backed up, and interrupting the connection from nsrmmd to the
tape drive can lead to failed savesets or incomplete backups.  It is
actually "doing something", but it is not obvious.

I have usually mitigated Spiral of Doom situations by running
nsrjb -HH or -IE instead, to force the system to do something.
Sometimes the cause is an oversubscribed tape drive that cannot keep up
with the commands it is receiving.  In this case, clearing the drive
with a reset often helps it "wake up" again for a bit.

I am not sure that Networker always handles physical problems very
well: it hangs or times out on a slow device, or complains when you
send too many jobs through that tiny little aperture called nsrjobd,
which only really lets you send one or two commands through before it
starts ignoring you.

Before the advent of nsrjobd, I used to "stack" multiple commands on
the command line, knowing that they would eventually be honoured by the
system.  These days, nsrjobd only seems to accept commands in a
"serial" fashion, i.e. one at a time.
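
If commands really are only honoured one at a time, one way to avoid
stacking them is to submit each one only after the previous one has
finished.  A minimal sketch, assuming plain subprocess calls;
run_serially is a hypothetical helper, and the commands below are
harmless stand-ins rather than real nsrjb invocations, whose arguments
are site-specific.

```python
import subprocess
import sys


def run_serially(commands, timeout=None):
    """Run each command to completion before starting the next.

    Returns a list of (command, returncode) pairs and stops at the first
    non-zero exit, so nothing is queued behind a failed command.
    subprocess.TimeoutExpired propagates if a command exceeds `timeout`.
    """
    results = []
    for cmd in commands:
        completed = subprocess.run(cmd, timeout=timeout)
        results.append((cmd, completed.returncode))
        if completed.returncode != 0:
            break
    return results


if __name__ == "__main__":
    # Stand-in commands; substitute the real command lines for your site.
    demo = [[sys.executable, "-c", "pass"], [sys.executable, "-c", "pass"]]
    for cmd, rc in run_serially(demo, timeout=60):
        print(rc, " ".join(cmd))
```

Stopping at the first failure is a design choice here: it keeps a hung
or rejected command from being followed by more requests.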


rachel

--
Rachel Polanskis                Systems Admin, University of Western Sydney
ADD Werrington North Campus     (+61 2) 9678 7291  <r.polanskis AT uws.edu DOT au>
   "The perversity of the Universe tends towards a maximum." - Finagle's Law