Re: [Networker] Networker storage node problem

Someone might be interested in this one.  Note that I'm running
NetWorker 6.x on the storage node.

 

This problem happened again last night, but this time I was able to get
on the storage node and see what was going on.  Why it's happening is
another matter.  The nsrjb process was going berserk, taking up 99% of
the CPU.

 

  1:29am  up 1 day,  2:46,  3 users,  load average: 1.03, 0.96, 0.91
72 processes: 69 sleeping, 3 running, 0 zombie, 0 stopped
CPU states: 13.0% user, 87.0% system,  0.0% nice,  0.0% idle
Mem:   515116K av,  268004K used,  247112K free,       0K shrd,       0K buff
Swap: 1048536K av,       0K used, 1048536K free                  119464K cached

  PID USER     PRI  NI  SIZE  RSS SHARE STAT %CPU %MEM   TIME COMMAND
16593 root      17   0  1548 1548  1196 R    99.5  0.3 126:38 nsrjb

 

# /usr/sbin/lsof -p 16593
COMMAND   PID USER   FD   TYPE DEVICE    SIZE     NODE NAME
nsrjb   16593 root    9u  IPv4 902915              UDP *:24336

 

The telling point was the strace of the process.  I saw the same pair
of calls looping over and over.  I traced to a file for 10 seconds and
ended up with a 10 MB file containing nothing but this:

 

select(10, [9], NULL, NULL, {0, 0})     = 0 (Timeout)
sendto(9, "B\16\363\361\0\0\0\0\0\0\0\2\0\1\206\240\0\0\0\2\0\0\0"...,
56, 0, {sin_family=AF_INET, sin_port=htons(7938),
sin_addr=inet_addr("10.2.x.x")}, 16) = 56
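For anyone who wants to see what is actually being sent, the visible
bytes of the sendto() payload can be unpacked as the start of an ONC RPC
call header.  This is only a sketch: it assumes the datagram follows the
standard RPC call layout (xid, message type, RPC version, program,
version), and it only covers the first 20 bytes, since strace truncated
the rest.

```python
import struct

# First 20 bytes of the datagram, copied from the strace line above.
# strace prints non-printable bytes as octal escapes, which Python
# bytes literals accept unchanged.
payload = b"B\16\363\361\0\0\0\0\0\0\0\2\0\1\206\240\0\0\0\2"

# Assumption: standard ONC RPC call header, five big-endian 32-bit words.
xid, msg_type, rpc_version, program, version = struct.unpack(">5I", payload)

print(f"xid         = 0x{xid:08x}")
print(f"msg_type    = {msg_type}  (0 means CALL)")
print(f"rpc_version = {rpc_version}")
print(f"program     = {program}")
print(f"version     = {version}")
```

Program 100000 is the standard portmapper program number, so under that
assumption nsrjb appears to be repeating the same portmapper call to
port 7938 on 10.2.x.x.  The xid is identical in every line I looked at,
which would fit a retransmit loop with no delay rather than a series of
new requests.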

 

So nsrjb is stuck in this select/sendto loop, flooding the network with
UDP packets and taking us out.  The question now is why.  Before last
night's outage, NetWorker was backing up and filled a tape.  About 10
minutes later it attempted to write to that same tape, which is when I
believe the infinite loop started.  Any ideas?
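A rough estimate of the packet rate can be worked out from the size of
the trace file.  The sketch below assumes "10 MB" means 10,000,000
bytes, that every iteration logs exactly the two lines shown above, and
that each datagram carries 56 bytes of payload plus plain UDP and IPv4
headers.  Since strace slows the traced process down considerably, treat
the result as a lower bound on what nsrjb sends when nobody is watching.

```python
# The two lines strace wrote per loop iteration (raw strings, so the
# backslashes count as literal characters, as they do in the trace file).
SELECT_LINE = r'select(10, [9], NULL, NULL, {0, 0})     = 0 (Timeout)'
SENDTO_LINE = (
    r'sendto(9, "B\16\363\361\0\0\0\0\0\0\0\2\0\1\206\240\0\0\0\2\0\0\0"..., '
    r'56, 0, {sin_family=AF_INET, sin_port=htons(7938), '
    r'sin_addr=inet_addr("10.2.x.x")}, 16) = 56'
)

TRACE_BYTES = 10 * 1000 * 1000   # assumption: "10 MB" is decimal megabytes
TRACE_SECONDS = 10

# Two lines plus two newline characters per iteration.
bytes_per_iteration = len(SELECT_LINE) + len(SENDTO_LINE) + 2
iterations_per_sec = TRACE_BYTES / bytes_per_iteration / TRACE_SECONDS

# Assumption: 56-byte payload + 8-byte UDP header + 20-byte IPv4 header.
ip_datagram_bytes = 56 + 8 + 20
mbit_per_sec = iterations_per_sec * ip_datagram_bytes * 8 / 1e6

print(f"about {iterations_per_sec:.0f} datagrams/s while under strace")
print(f"about {mbit_per_sec:.1f} Mbit/s at the IP layer")
```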

 

07/29/04 22:00:40 nsrd: media info: suggest mounting NY0013L1 on grendel for writing  to pool 'Exchange'
07/29/04 22:02:25 nsrd: media info: loading volume NY0013L1 into rd=grendel:/dev/nst0
07/29/04 22:04:18 nsrd: nycx:MSEXCH:IS saving to pool 'Exchange' (NY0013L1)
07/29/04 23:20:04 nsrd: media notice: LTO Ultrium-2 tape NY0013L1 on rd=grendel:/dev/nst0 is full
07/29/04 23:20:04 nsrd: media notice: LTO Ultrium-2 tape NY0013L1 used 286 GB of 200 GB capacity
07/29/04 23:21:30 nsrd: media info: verification of volume "NY0013L1", volid 1543961908 succeeded.
07/29/04 23:21:47 nsrd: write completion notice: Writing to volume NY0013L1 complete
07/29/04 23:32:43 nsrd: nycx:MSEXCH:IS saving to pool 'Exchange' (NY0013L1) 68 GB
07/29/04 23:32:43 nsrd: media notice: check storage node: grendel (nsrmon timed out)
07/29/04 23:32:43 nsrd: media notice: check storage node: grendel (nsrmmd missing from polling reply)
07/29/04 23:32:43 nsrd: media info: restarting nsrmmd #15 on grendel in 2 minute(s)
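The "10 minutes" figure can be checked against the timestamps in this
log.  This is a small sketch that only does the date arithmetic; the
event labels are my own shorthand, and it assumes the first nsrmon
timeout marks roughly when the loop had already taken the network down.

```python
from datetime import datetime

FMT = "%m/%d/%y %H:%M:%S"

# Timestamps copied from the daemon.log excerpt; labels are shorthand.
events = {
    "tape full": "07/29/04 23:20:04",
    "write complete": "07/29/04 23:21:47",
    "nsrmon timed out": "07/29/04 23:32:43",
}
times = {name: datetime.strptime(stamp, FMT) for name, stamp in events.items()}

full_to_timeout = times["nsrmon timed out"] - times["tape full"]
write_to_timeout = times["nsrmon timed out"] - times["write complete"]

print("tape full      -> nsrmon timeout:", full_to_timeout)    # 0:12:39
print("write complete -> nsrmon timeout:", write_to_timeout)   # 0:10:56
```

So the storage node went quiet about 11 minutes after the write
completion notice for NY0013L1, and about 13 minutes after the tape was
reported full.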

 

Adam 

 

________________________________

From: Adam Ardis 
Sent: Tuesday, July 27, 2004 9:14 AM
To: Legato NetWorker discussion
Subject: Networker storage node problem

 

We've been having a problem since this weekend: as soon as I kick off a
backup from my server to a remote storage node, the network gets
hammered and becomes unavailable.  It looks like millions of packets are
sent from the storage node to the master server right before it goes
down.  The only thing I see in my daemon.log is a request to mount a
tape at the storage node, and then I get the nsrmmd timeouts because the
network is down.  Stopping NetWorker on the storage node (grendel)
brings the network back up.  When I restarted NetWorker on the storage
node, the network immediately went back down, even though the save group
was no longer running.  It went to mount a tape in the pool at grendel,
and that was it.

 

One question I have is how NetWorker sends traffic from the storage node
to the master, and why it would be flooding the network if I'm not
sending the backup data across it.  At first I thought it was the index,
but it hasn't gotten that far the last couple of times it freaked out.
I've checked that the pool is set correctly and that all labeled tapes
belong to the storage node.  This process has worked for many months up
until now.  Nothing changed on the Legato side, but the router config
was changed.  When it failed, the router change was reverted, but it
still isn't working.

 

07/26/04 23:30:27 nsrd: media info: suggest mounting NY0001L1 on grendel for writing  to pool 'NT Incremental NYC'
07/26/04 23:30:27 nsrd: media waiting event: Waiting for 1 writable volumes to backup pool 'NT Incremental NYC' tape(s) or disk(s) on grendel
07/26/04 23:30:28 nsrd: media info: suggest relabeling NY0003L1 on grendel for writing  to pool 'NT Incremental NYC'
07/26/04 23:30:28 nsrd: media event cleared: Waiting for 1 writable volumes to backup pool 'NT Incremental NYC' tape(s) or disk(s) on grendel
07/26/04 23:30:28 nsrd: media waiting event: Waiting for 2 writable volumes to backup pool 'NT Incremental NYC' tape(s) or disk(s) on grendel
07/26/04 23:40:53 nsrd: media notice: check storage node: grendel (nsrmon timed out)
07/26/04 23:40:53 nsrd: media notice: check storage node: grendel (nsrmon timed out)
07/26/04 23:40:53 nsrd: media notice: check storage node: grendel (nsrmmd missing from polling reply)

 

Any advice would be appreciated.

 

Thanks,

Adam 

 

