On 26/11/2012, at 11:23 AM, jeronimo wrote:
> Ok, the transfer that is stalling is caused by the process using port 8094
>
> 5.502271 130.1.1.118 -> 130.1.1.45 TCP [TCP ZeroWindow] 8094 > 59561
> [ACK] Seq=1 Ack=56809 Win=0 Len=0 TSV=163928420 TSER=311968938
>
> Which would be (according to lsof | grep TCP | grep 8094) PID 11561
>
> # ps -ef | grep 11561
> root 11561 10625 6 Nov18 ? 11:21:26 /opt/nsr/nsrmmd -n 2
>
> strace reveals FD 7
>
> 01:11:02.593016 write(7, "\0\0\0\0XL0305 NETWORKER
> 3\0\0\0\0\0\0\0\0\0\0\0\
>
> which finally boils down to the tape drive
>
> # lsof -p 11561 | grep 7u
> nsrmmd 11561 root 7u CHR 9,128 0t0 6489 /dev/nst0
>
> I also found this.
> http://www.mibus.org/2012/11/11/networker-random-stalling/
> Not sure though if just killing nsrmmd is a solution..
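
The first step of the trace above (spotting which port is advertising a
zero window) can be automated. This is only a rough sketch: it assumes
the capture is available as plain tshark summary text in the same layout
as the quoted packet, and the function name zero_window_ports is made up
for illustration.

```python
import re

# Matches the "[TCP ZeroWindow] <src port> > <dst port>" part of a
# tshark summary line, as seen in the quoted capture.
ZERO_WINDOW = re.compile(r"\[TCP ZeroWindow\]\s+(\d+)\s+>\s+(\d+)")


def zero_window_ports(capture_text):
    """Return (advertising_port, peer_port) for each ZeroWindow packet."""
    return [(int(src), int(dst)) for src, dst in ZERO_WINDOW.findall(capture_text)]


# The packet quoted above, joined onto one line.
sample = (
    "5.502271 130.1.1.118 -> 130.1.1.45 TCP [TCP ZeroWindow] 8094 > 59561 "
    "[ACK] Seq=1 Ack=56809 Win=0 Len=0 TSV=163928420 TSER=311968938"
)
print(zero_window_ports(sample))  # [(8094, 59561)]
```

The first port in each pair is the side that has stopped accepting data,
which is the one to look up with lsof as in the quoted steps.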

I used to call this "The Spiral of Doom". Networker would gradually stop
progressing jobs, loading tapes, and doing anything useful, although on
the face of it everything appeared to be working correctly. I think what
happens is that Networker somehow loses track of job control. Since about
v7.4, everything that starts a backup job has to run through nsrd or
nsrjobd. I believe that as resources become difficult to allocate on a
large system with many clients, nsrd or nsrjobd becomes exhausted and
goes into some kind of hung state.

In the past I have found that killing nsrmmd only helps with jobs that
have been stuck for a long time. In those cases it works best to kill the
savegrp related to the process first, then kill the nsrmmd process
related to that savegrp. It is "dangerous" to kill nsrmmd
indiscriminately, because it may actually be streaming the catalogue of
data to be backed up, and interrupting the connection from nsrmmd to the
tape drive can lead to failed savesets or incomplete backups. It is
actually "doing something", but that is not obvious.

I have usually mitigated Spiral of Doom situations by instead running
nsrjb -HH or -IE to force the system to do something. Sometimes the
cause is an oversubscribed tape drive that is not coping with receiving
commands quickly enough. In this case, clearing the drive via the reset
often helps it "wake up" again for a bit.

I am not sure that Networker always handles physical problems very well:
it hangs or times out on a slow device, or complains when you send too
many jobs through that tiny little aperture called nsrjobd, which only
really lets you send one or two commands through before it starts
ignoring you. Before the advent of nsrjobd, I used to "stack" multiple
commands on the command line, knowing that they would eventually be
honoured by the system. These days, nsrjobd only seems to accept commands
in a "serial" fashion, i.e. one at a time.
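
If commands really are only accepted one at a time, one way to cope is
to feed them through a small wrapper that waits for each to finish
before sending the next, rather than stacking them. The following is
only a sketch of that idea, not anything Networker provides: the helper
name run_serially is made up, and the commands shown are harmless
placeholders standing in for whatever you would actually run on the
server.

```python
import subprocess
import sys


def run_serially(commands, timeout=None):
    """Run each command to completion before starting the next.

    Returns a list of (command, return_code) pairs and stops at the
    first command that exits non-zero, so later commands are not sent
    to a system that is already struggling.
    """
    results = []
    for cmd in commands:
        completed = subprocess.run(cmd, timeout=timeout)
        results.append((cmd, completed.returncode))
        if completed.returncode != 0:
            break
    return results


# Placeholder commands (hypothetical); substitute the real ones.
placeholder_ok = [sys.executable, "-c", "pass"]
placeholder_fail = [sys.executable, "-c", "import sys; sys.exit(3)"]

for cmd, rc in run_serially([placeholder_ok, placeholder_fail, placeholder_ok]):
    print(rc, cmd)
```

Passing a timeout makes subprocess.run raise TimeoutExpired if a command
hangs, which at least tells you where the queue stopped.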
rachel
--
Rachel Polanskis Systems Admin, University of Western Sydney
ADD Werrington North Campus (+61 2) 9678 7291 <r.polanskis AT uws.edu DOT au>
"The perversity of the Universe tends towards a maximum." - Finagle's Law