Someone might be interested in this one. Note that I'm running 6.x on the
storage node.
This problem happened again last night, but this time I was able to get on
the storage node to see what was going on. Why it's happening is
another matter. The nsrjb process seemed to be going berserk: it was
taking up 99% of the CPU.
1:29am up 1 day, 2:46, 3 users, load average: 1.03, 0.96, 0.91
72 processes: 69 sleeping, 3 running, 0 zombie, 0 stopped
CPU states: 13.0% user, 87.0% system, 0.0% nice, 0.0% idle
Mem: 515116K av, 268004K used, 247112K free, 0K shrd, 0K buff
Swap: 1048536K av, 0K used, 1048536K free 119464K cached
PID USER PRI NI SIZE RSS SHARE STAT %CPU %MEM TIME COMMAND
16593 root 17 0 1548 1548 1196 R 99.5 0.3 126:38 nsrjb
# /usr/sbin/lsof -p 16593
COMMAND PID USER FD TYPE DEVICE SIZE NODE NAME
nsrjb 16593 root 9u IPv4 902915 UDP *:24336
The telling point was the strace of the process: I saw the same
sequence looping over and over. I traced to a file for 10 seconds and
ended up with a 10 MB file full of this:
select(10, [9], NULL, NULL, {0, 0}) = 0 (Timeout)
sendto(9, "B\16\363\361\0\0\0\0\0\0\0\2\0\1\206\240\0\0\0\2\0\0\0"...,
56, 0, {sin_family=AF_INET, sin_port=htons(7938),
sin_addr=inet_addr("10.2.x.x")}, 16) = 56
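The payload bytes that strace printed can be read as the start of an ONC RPC message (RFC 5531 field order: xid, message type, RPC version, program, version). The sketch below decodes the 23 bytes visible in the trace. Treating the datagram as an RPC message is an interpretation, not something the trace states, and `decode_rpc_call_header` is a name made up for this example. Only the first five 32-bit fields are complete, because strace truncated the rest of the 56-byte datagram.

```python
import struct

# The 23 payload bytes shown by strace (octal escapes copied as-is).
# strace cut off the remainder of the 56-byte datagram ("...").
PAYLOAD = b"B\16\363\361\0\0\0\0\0\0\0\2\0\1\206\240\0\0\0\2\0\0\0"


def decode_rpc_call_header(data):
    """Decode the first five big-endian 32-bit words as an ONC RPC header.

    Assumption: the datagram is an ONC RPC message (RFC 5531 field order).
    """
    if len(data) < 20:
        raise ValueError("need at least 20 bytes for the fixed header fields")
    xid, msg_type, rpc_version, program, version = struct.unpack(">5I", data[:20])
    if msg_type == 0:
        msg_name = "CALL"
    elif msg_type == 1:
        msg_name = "REPLY"
    else:
        msg_name = str(msg_type)
    return {
        "xid": xid,
        "msg_type": msg_name,
        "rpc_version": rpc_version,
        "program": program,
        "version": version,
    }


if __name__ == "__main__":
    for name, value in decode_rpc_call_header(PAYLOAD).items():
        print(f"{name}: {value}")
```

If that reading is right, the fields decode to a CALL, RPC version 2, program 100000, version 2. Program 100000 is the number assigned to the portmapper, so the loop would be nsrjb sending the same portmapper query to port 7938 over and over, with the zero-timeout select() before each send finding nothing to read.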
So when this happens, the UDP traffic floods the network and takes us
out. Now, the question is why is this happening? Before last night's
outage, NetWorker was backing up and filled a tape. 10 minutes later it
attempted to write to that same tape, which is when I believe this
infinite loop started. Any ideas?
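To put a number on the flood, the sendto() calls in a strace capture can be counted and divided by the capture time. This is a minimal sketch, assuming the trace was written to a text file with one syscall per line and no PID prefix. `packet_rate`, the sample lines, and the file name in the comment are illustrative and are not taken from the actual 10 MB trace.

```python
import re

# Matches a completed sendto() call and captures its return value (bytes sent).
SENDTO_RE = re.compile(r"^sendto\(.*\)\s*=\s*(\d+)\s*$")


def packet_rate(trace_lines, capture_seconds):
    """Return (packets per second, bytes per second) for sendto() calls."""
    if capture_seconds <= 0:
        raise ValueError("capture_seconds must be positive")
    packets = 0
    total_bytes = 0
    for line in trace_lines:
        match = SENDTO_RE.match(line.strip())
        if match:
            packets += 1
            total_bytes += int(match.group(1))
    return packets / capture_seconds, total_bytes / capture_seconds


if __name__ == "__main__":
    # Made-up sample: 1000 repetitions of the select/sendto pair.
    sample = [
        r"select(10, [9], NULL, NULL, {0, 0}) = 0 (Timeout)",
        r'sendto(9, "B\16\363\361"..., 56, 0, {sin_family=AF_INET, '
        r'sin_port=htons(7938), sin_addr=inet_addr("10.2.x.x")}, 16) = 56',
    ] * 1000
    print(packet_rate(sample, 10))
    # For a real capture (hypothetical file name):
    #     with open("nsrjb.trace") as f:
    #         print(packet_rate(f, 10))
```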
07/29/04 22:00:40 nsrd: media info: suggest mounting NY0013L1 on grendel for writing to pool 'Exchange'
07/29/04 22:02:25 nsrd: media info: loading volume NY0013L1 into rd=grendel:/dev/nst0
07/29/04 22:04:18 nsrd: nycx:MSEXCH:IS saving to pool 'Exchange' (NY0013L1)
07/29/04 23:20:04 nsrd: media notice: LTO Ultrium-2 tape NY0013L1 on rd=grendel:/dev/nst0 is full
07/29/04 23:20:04 nsrd: media notice: LTO Ultrium-2 tape NY0013L1 used 286 GB of 200 GB capacity
07/29/04 23:21:30 nsrd: media info: verification of volume "NY0013L1", volid 1543961908 succeeded.
07/29/04 23:21:47 nsrd: write completion notice: Writing to volume NY0013L1 complete
07/29/04 23:32:43 nsrd: nycx:MSEXCH:IS saving to pool 'Exchange' (NY0013L1) 68 GB
07/29/04 23:32:43 nsrd: media notice: check storage node: grendel (nsrmon timed out)
07/29/04 23:32:43 nsrd: media notice: check storage node: grendel (nsrmmd missing from polling reply)
07/29/04 23:32:43 nsrd: media info: restarting nsrmmd #15 on grendel in 2 minute(s)
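The timing described above (tape full, then trouble roughly ten minutes later) can be checked directly from the daemon.log timestamps. This is a minimal sketch, assuming the `MM/DD/YY HH:MM:SS` prefix seen in the excerpt. `parse_ts` and `gap` are hypothetical helper names, and the three sample lines are copied from the log above with the mail wrapping undone.

```python
from datetime import datetime

# Three entries from the excerpt above, rejoined onto single lines.
LOG_LINES = [
    "07/29/04 23:20:04 nsrd: media notice: LTO Ultrium-2 tape NY0013L1 on "
    "rd=grendel:/dev/nst0 is full",
    "07/29/04 23:21:47 nsrd: write completion notice: Writing to volume "
    "NY0013L1 complete",
    "07/29/04 23:32:43 nsrd: media notice: check storage node: grendel "
    "(nsrmon timed out)",
]


def parse_ts(line):
    """Parse the leading 'MM/DD/YY HH:MM:SS' timestamp of a daemon.log line."""
    return datetime.strptime(line[:17], "%m/%d/%y %H:%M:%S")


def gap(earlier_line, later_line):
    """Return the time between two log lines as a timedelta."""
    return parse_ts(later_line) - parse_ts(earlier_line)


if __name__ == "__main__":
    print("tape full -> nsrmon timeout:", gap(LOG_LINES[0], LOG_LINES[2]))
    print("write complete -> nsrmon timeout:", gap(LOG_LINES[1], LOG_LINES[2]))
```

On these three lines the gaps come out to 12 minutes 39 seconds from "tape full" and 10 minutes 56 seconds from "write complete" to the first nsrmon timeout.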
Adam
________________________________
From: Adam Ardis
Sent: Tuesday, July 27, 2004 9:14 AM
To: Legato NetWorker discussion
Subject: Networker storage node problem
We've been having a problem since this weekend: as soon as I kick off a
backup from my server to a remote storage node, the network gets
hammered and becomes unavailable. It looks like millions of packets are
sent from the storage node to the master server right before it goes
down. The only thing I see in my daemon.log is a request to mount a tape
at the storage node, and then I get the nsrmmd timeouts because the
network is down. Stopping NetWorker on the storage node (grendel) brings
the network back up. When I restarted NetWorker on the storage node,
the network immediately went back down, even though the save group was
no longer running. It went to mount a tape in the pool at grendel, and
that was it.
One question I have is how NetWorker sends traffic from the
storage node to the master, and why it would be flooding the network if
I'm not sending the backup data across it. At first I thought it was the
index, but it hasn't gotten that far the last couple of times it freaked
out. I've checked that the pool is set correctly and that all labeled
tapes belong to the storage node. This process has worked for many
months up until now. Nothing changed on the Legato side, but the router
config was changed. When the backups failed, that change was reverted,
but it still isn't working.
07/26/04 23:30:27 nsrd: media info: suggest mounting NY0001L1 on grendel for writing to pool 'NT Incremental NYC'
07/26/04 23:30:27 nsrd: media waiting event: Waiting for 1 writable volumes to backup pool 'NT Incremental NYC' tape(s) or disk(s) on grendel
07/26/04 23:30:28 nsrd: media info: suggest relabeling NY0003L1 on grendel for writing to pool 'NT Incremental NYC'
07/26/04 23:30:28 nsrd: media event cleared: Waiting for 1 writable volumes to backup pool 'NT Incremental NYC' tape(s) or disk(s) on grendel
07/26/04 23:30:28 nsrd: media waiting event: Waiting for 2 writable volumes to backup pool 'NT Incremental NYC' tape(s) or disk(s) on grendel
07/26/04 23:40:53 nsrd: media notice: check storage node: grendel (nsrmon timed out)
07/26/04 23:40:53 nsrd: media notice: check storage node: grendel (nsrmon timed out)
07/26/04 23:40:53 nsrd: media notice: check storage node: grendel (nsrmmd missing from polling reply)
Any advice would be appreciated.
Thanks,
Adam