We're seeing something really weird from our crew doing offsite
disaster-recovery testing. First, a little background: They're running
7.5.1 on Solaris 10, using a tar-to-tape dump of our /nsr directory to
avoid having to recover indexes via mmrecov, nsrck, etc. (since we have
limited time for testing). For the past six years, this procedure has
worked fine.
Last year it was:

Production        DR test
-----------------------------------
Solaris 8         Solaris 10
Networker 7.2     Networker 7.2
Sun Fire V880     Sun e2900

This year it's:

Production        DR test
-----------------------------------
Solaris 9         Solaris 10
Networker 7.5.1   Networker 7.5.1
Sun Fire V880     Sun e2900

Networker starts OK, but shortly thereafter, it (apparently) starts
resetting the SCSI buses, etc., to the degree where other disk ops fail,
OS functions fail, etc. For example, this is from an email one of my
coworkers sent this afternoon:
----------------------------------------------------------------------------
We moved the boot disk from the backup_serv server to another 2900 and were
able to boot successfully, but when the Networker 'secondary' jobs kicked
off we got the following errors, just like on the original server. Any
ideas?
Sep 28 15:16:25 backup_serv last message repeated 1 time
Sep 28 15:16:25 backup_serv root: Sun StorageTek(TM) Enterprise
backup_serv index: (notice) Completed checking 112 client(s)
Sep 28 15:16:25 backup_serv last message repeated 1 time
# ps -ef |grep nsr
root 2632 1119 0 15:16:57 ? 0:00 /usr/sbin/nsr/nsrmmd -n 7
root 2636 1119 0 15:17:05 ? 0:00 /usr/sbin/nsr/nsrmmd -n 11
root 1265 1119 0 15:16:10 ? 0:00 /usr/sbin/nsr/nsrindexd
root 2630 1119 0 15:16:53 ? 0:00 /usr/sbin/nsr/nsrmmd -n 5
root 2631 1119 0 15:16:55 ? 0:00 /usr/sbin/nsr/nsrmmd -n 6
root 1260 1119 0 15:16:08 ? 0:20 /usr/sbin/nsr/nsrmmdbd
root 1103 1 0 15:10:17 ? 0:03 /usr/sbin/nsr/nsrexecd
root 2627 1119 0 15:16:47 ? 0:00 /usr/sbin/nsr/nsrmmd -n 2
root 1119 1 0 15:10:22 ? 0:05 /usr/sbin/nsr/nsrd
root 2634 1119 0 15:17:01 ? 0:00 /usr/sbin/nsr/nsrmmd -n 9
root 1266 1119 0 15:16:12 ? 0:01 /usr/sbin/nsr/nsrjobd
root 2633 1119 0 15:16:59 ? 0:00 /usr/sbin/nsr/nsrmmd -n 8
root 2624 1119 0 15:16:43 ? 0:00 /usr/sbin/nsr/nsrmmgd
root 2628 1119 0 15:16:49 ? 0:00 /usr/sbin/nsr/nsrmmd -n 3
root 2635 1119 0 15:17:03 ? 0:00 /usr/sbin/nsr/nsrmmd -n 10
root 2629 1119 0 15:16:51 ? 0:00 /usr/sbin/nsr/nsrmmd -n 4
root 2626 1119 0 15:16:45 ? 0:00 /usr/sbin/nsr/nsrmmd -n 1
root 2652 1119 0 15:17:39 ? 0:00 /usr/sbin/nsr/nsrmmd -n 12
root 2653 1119 0 15:17:41 ? 0:00 /usr/sbin/nsr/nsrmmd -n 13
root 2641 1103 0 15:17:08 ? 0:01 /usr/sbin/nsr/nsrlcpd -s backup_serv -N 1 -n 2
root 2642 1103 0 15:17:08 ? 0:01 /usr/sbin/nsr/nsrlcpd -s backup_serv -N 1 -n 3
root 2643 1103 0 15:17:08 ? 0:01 /usr/sbin/nsr/nsrlcpd -s backup_serv -N 1 -n 1
root 2654 355 0 15:17:51 console 0:00 grep nsr
(here are the errors that ensued)
Sep 28 15:18:57 backup_serv Resetting scsi bus, <null string> from (0,0)
Sep 28 15:18:57 backup_serv scsi: WARNING: /ssm@0,0/pci@18,600000/scsi@2 (glm0):
Sep 28 15:18:57 backup_serv Target 0 reducing sync. transfer rate
Sep 28 15:18:57 backup_serv scsi: WARNING: /ssm@0,0/pci@18,600000/scsi@2 (glm0):
Sep 28 15:18:57 backup_serv got SCSI bus reset
Sep 28 15:19:00 backup_serv scsi: WARNING: /ssm@0,0/pci@18,600000/scsi@2 (glm0):
Sep 28 15:19:00 backup_serv Resetting scsi bus, <null string> from (0,0)
Sep 28 15:19:00 backup_serv scsi: WARNING: /ssm@0,0/pci@18,600000/scsi@2 (glm0):
Sep 28 15:19:00 backup_serv Target 0 disabled wide SCSI mode
Sep 28 15:19:00 backup_serv scsi: WARNING: /ssm@0,0/pci@18,600000/scsi@2 (glm0):
Sep 28 15:19:00 backup_serv Target 0 reverting to async. mode
Sep 28 15:19:00 backup_serv scsi: WARNING: /ssm@0,0/pci@18,600000/scsi@2 (glm0):
Sep 28 15:19:00 backup_serv got SCSI bus reset
Sep 28 15:19:03 backup_serv scsi: WARNING: /ssm@0,0/pci@18,600000/scsi@2 (glm0):
Sep 28 15:19:03 backup_serv Resetting scsi bus, got incorrect phase from (0,0)
Sep 28 15:19:03 backup_serv scsi: WARNING: /ssm@0,0/pci@18,600000/scsi@2 (glm0):
Sep 28 15:19:03 backup_serv got SCSI bus reset
...etc., forever.
----------------------------------------------------------------------------
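
In case it helps anyone compare notes, here is a rough shell sketch for
tallying messages like the ones quoted above from a saved copy of the system
log. The function name scsi_summary is made up for illustration, and the
patterns assume the message wording shown in the excerpt ("scsi: WARNING:"
followed by a device path, "Resetting scsi bus", "got SCSI bus reset"). It
is a starting point under those assumptions, not a diagnostic tool.

```shell
#!/bin/sh
# scsi_summary FILE  (hypothetical helper name, not a Solaris or Networker tool)
# Summarizes SCSI trouble in a syslog-style file, assuming the message
# wording seen in the excerpt from backup_serv.
scsi_summary() {
    log=$1

    # Count "scsi: WARNING:" lines per controller device path.
    echo "warnings per controller path:"
    sed -n 's|.*WARNING: *\(/[^ ]*\).*|\1|p' "$log" | sort | uniq -c

    # Resets the driver initiated vs. resets it reported seeing on the bus.
    echo "driver-initiated resets: $(grep -c 'Resetting scsi bus' "$log")"
    echo "bus resets observed: $(grep -c 'got SCSI bus reset' "$log")"
}

# Example (path is an assumption; adjust to wherever the log was saved):
#   scsi_summary /var/adm/messages
```

If every warning names the same controller path, that at least narrows the
question to what is attached to that one bus.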
We have 7.5.1 running here at home on Solaris 9 and 10. We have 7.5.3
running on Solaris 10 and even a not-yet-upgraded small environment still
running 7.2 on Solaris 10 (all SPARC). We've never experienced anything
like this before. As was mentioned in the email quoted above, they moved
the OS boot disk to another Sun e2900 server and got the same errors there
(until then, they had suspected a bad backplane).
They're going to try putting Networker on a Solaris 9 e2900 they have, but
I'm also going to suggest, as an alternative, removing all tape devices
from Networker (as well as from /dev/rmt), doing a "boot -r", and letting
Networker add them fresh again. But I have no idea if that will help.
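
For the OS side of that idea, something like the following dry-run sketch is
what I have in mind. The function name plan_tape_rebuild is hypothetical, and
it only prints the commands rather than running them. It assumes the tape
links live under /dev/rmt and that "touch /reconfigure" followed by a reboot
is an acceptable stand-in for "boot -r" from the OBP prompt. Removing and
re-adding the devices inside Networker itself is a separate step that is not
shown here.

```shell
#!/bin/sh
# plan_tape_rebuild [DIR]  (hypothetical helper name)
# Prints, without executing, the commands for clearing the tape device
# links under DIR (default /dev/rmt) and forcing a reconfiguration boot.
plan_tape_rebuild() {
    dir=${1:-/dev/rmt}

    for link in "$dir"/*; do
        # Skip the unexpanded glob if DIR is empty; -h also catches
        # dangling symlinks, which -e alone would miss.
        [ -e "$link" ] || [ -h "$link" ] || continue
        echo "rm $link"
    done

    # Request a reconfiguration boot, then reboot.
    echo "touch /reconfigure"
    echo "init 6"
}

# Example: review the plan before running anything by hand.
#   plan_tape_rebuild /dev/rmt
```

Printing the plan first seemed safer than scripting the removal outright,
since nothing here has been confirmed to fix the resets.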
Anyone ever seen this?
Thanks!