Networker

[Networker] Networker screws up SCSI bus ... maybe ?

2010-09-28 17:25:20
Subject: [Networker] Networker screws up SCSI bus ... maybe ?
From: Len Philpot <Len.Philpot AT CLECO DOT COM>
To: NETWORKER AT LISTSERV.TEMPLE DOT EDU
Date: Tue, 28 Sep 2010 16:24:33 -0500
We're seeing something really weird from our crew doing offsite 
disaster-recovery testing. First, a little background: They're running 
7.5.1 on Solaris 10, using a tar-to-tape dump of our /nsr directory to 
avoid having to recover indexes via mmrecov, nsrck, etc. (since we have 
limited time for testing). For the past six years, this procedure has 
worked fine.

Last year it was:

Production        DR test
---------------------------------
Solaris 8         Solaris 10
Networker 7.2     Networker 7.2
Sun Fire V880     Sun e2900

This year it's:

Production        DR test
---------------------------------
Solaris 9         Solaris 10
Networker 7.5.1   Networker 7.5.1
Sun Fire V880     Sun e2900

Networker starts OK, but shortly thereafter, it (apparently) starts 
resetting the SCSI buses, etc., to the degree where other disk ops fail, 
OS functions fail, etc. For example, this is from an email one of my 
coworkers sent this afternoon:

----------------------------------------------------------------------------
----------------------------------------------------------------------------
We moved the boot disk from the backup_serv server to another 2900 and were 
able to boot successfully, but when the Networker 'secondary' jobs kicked 
off we got the following errors just like on the original server. Any 
ideas?

Sep 28 15:16:25 backup_serv last message repeated 1 time
Sep 28 15:16:25 backup_serv root: Sun StorageTek(TM) Enterprise backup_serv index: (notice) Completed checking 112 client(s)
Sep 28 15:16:25 backup_serv last message repeated 1 time

# ps -ef |grep nsr
    root  2632  1119   0 15:16:57 ?           0:00 /usr/sbin/nsr/nsrmmd -n 7
    root  2636  1119   0 15:17:05 ?           0:00 /usr/sbin/nsr/nsrmmd -n 11
    root  1265  1119   0 15:16:10 ?           0:00 /usr/sbin/nsr/nsrindexd
    root  2630  1119   0 15:16:53 ?           0:00 /usr/sbin/nsr/nsrmmd -n 5
    root  2631  1119   0 15:16:55 ?           0:00 /usr/sbin/nsr/nsrmmd -n 6
    root  1260  1119   0 15:16:08 ?           0:20 /usr/sbin/nsr/nsrmmdbd
    root  1103     1   0 15:10:17 ?           0:03 /usr/sbin/nsr/nsrexecd
    root  2627  1119   0 15:16:47 ?           0:00 /usr/sbin/nsr/nsrmmd -n 2
    root  1119     1   0 15:10:22 ?           0:05 /usr/sbin/nsr/nsrd
    root  2634  1119   0 15:17:01 ?           0:00 /usr/sbin/nsr/nsrmmd -n 9
    root  1266  1119   0 15:16:12 ?           0:01 /usr/sbin/nsr/nsrjobd
    root  2633  1119   0 15:16:59 ?           0:00 /usr/sbin/nsr/nsrmmd -n 8
    root  2624  1119   0 15:16:43 ?           0:00 /usr/sbin/nsr/nsrmmgd
    root  2628  1119   0 15:16:49 ?           0:00 /usr/sbin/nsr/nsrmmd -n 3
    root  2635  1119   0 15:17:03 ?           0:00 /usr/sbin/nsr/nsrmmd -n 10
    root  2629  1119   0 15:16:51 ?           0:00 /usr/sbin/nsr/nsrmmd -n 4
    root  2626  1119   0 15:16:45 ?           0:00 /usr/sbin/nsr/nsrmmd -n 1
    root  2652  1119   0 15:17:39 ?           0:00 /usr/sbin/nsr/nsrmmd -n 12
    root  2653  1119   0 15:17:41 ?           0:00 /usr/sbin/nsr/nsrmmd -n 13
    root  2641  1103   0 15:17:08 ?           0:01 /usr/sbin/nsr/nsrlcpd -s backup_serv -N 1 -n 2
    root  2642  1103   0 15:17:08 ?           0:01 /usr/sbin/nsr/nsrlcpd -s backup_serv -N 1 -n 3
    root  2643  1103   0 15:17:08 ?           0:01 /usr/sbin/nsr/nsrlcpd -s backup_serv -N 1 -n 1
    root  2654   355   0 15:17:51 console     0:00 grep nsr

(here are the errors that ensued)

Sep 28 15:18:57 backup_serv  Resetting scsi bus, <null string> from (0,0)
Sep 28 15:18:57 backup_serv scsi: WARNING: /ssm@0,0/pci@18,600000/scsi@2 (glm0):
Sep 28 15:18:57 backup_serv  Target 0 reducing sync. transfer rate
Sep 28 15:18:57 backup_serv scsi: WARNING: /ssm@0,0/pci@18,600000/scsi@2 (glm0):
Sep 28 15:18:57 backup_serv  got SCSI bus reset
Sep 28 15:19:00 backup_serv scsi: WARNING: /ssm@0,0/pci@18,600000/scsi@2 (glm0):
Sep 28 15:19:00 backup_serv  Resetting scsi bus, <null string> from (0,0)
Sep 28 15:19:00 backup_serv scsi: WARNING: /ssm@0,0/pci@18,600000/scsi@2 (glm0):
Sep 28 15:19:00 backup_serv  Target 0 disabled wide SCSI mode
Sep 28 15:19:00 backup_serv scsi: WARNING: /ssm@0,0/pci@18,600000/scsi@2 (glm0):
Sep 28 15:19:00 backup_serv  Target 0 reverting to async. mode
Sep 28 15:19:00 backup_serv scsi: WARNING: /ssm@0,0/pci@18,600000/scsi@2 (glm0):
Sep 28 15:19:00 backup_serv  got SCSI bus reset
Sep 28 15:19:03 backup_serv scsi: WARNING: /ssm@0,0/pci@18,600000/scsi@2 (glm0):
Sep 28 15:19:03 backup_serv  Resetting scsi bus, got incorrect phase from (0,0)
Sep 28 15:19:03 backup_serv scsi: WARNING: /ssm@0,0/pci@18,600000/scsi@2 (glm0):
Sep 28 15:19:03 backup_serv  got SCSI bus reset

...etc., forever.
----------------------------------------------------------------------------
----------------------------------------------------------------------------
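
For anyone wanting to check whether the resets are confined to one controller, here is a rough Python sketch that tallies the warnings by driver instance and message. It assumes the syslog lines have been unwrapped so that each "scsi: WARNING:" header sits on one line and is followed by exactly one detail line, as in the excerpt above. The function name tally_scsi_warnings and the SAMPLE text are just illustrative; SAMPLE is a few lines copied from the quoted log.

```python
import re
from collections import Counter

# Header: "<Mon> <day> <time> <host> scsi: WARNING: <device path> (<instance>):"
HEADER = re.compile(
    r"^\w{3}\s+\d+\s+[\d:]+\s+\S+\s+scsi: WARNING: (\S+)\s+\((\w+)\):\s*$"
)
# Detail: "<Mon> <day> <time> <host>  <message>"
DETAIL = re.compile(r"^\w{3}\s+\d+\s+[\d:]+\s+\S+\s+(.*\S)\s*$")


def tally_scsi_warnings(lines):
    """Count (driver instance, message) pairs from header/detail line pairs."""
    counts = Counter()
    pending = None  # driver instance from the most recent header line
    for line in lines:
        header = HEADER.match(line)
        if header:
            pending = header.group(2)
            continue
        detail = DETAIL.match(line)
        if detail and pending is not None:
            counts[(pending, detail.group(1))] += 1
        pending = None  # a detail line only belongs to the header right before it
    return counts


# A few lines taken from the quoted log, rejoined onto single lines.
SAMPLE = """\
Sep 28 15:18:57 backup_serv scsi: WARNING: /ssm@0,0/pci@18,600000/scsi@2 (glm0):
Sep 28 15:18:57 backup_serv  Target 0 reducing sync. transfer rate
Sep 28 15:18:57 backup_serv scsi: WARNING: /ssm@0,0/pci@18,600000/scsi@2 (glm0):
Sep 28 15:18:57 backup_serv  got SCSI bus reset
Sep 28 15:19:00 backup_serv scsi: WARNING: /ssm@0,0/pci@18,600000/scsi@2 (glm0):
Sep 28 15:19:00 backup_serv  Resetting scsi bus, <null string> from (0,0)
Sep 28 15:19:00 backup_serv scsi: WARNING: /ssm@0,0/pci@18,600000/scsi@2 (glm0):
Sep 28 15:19:00 backup_serv  got SCSI bus reset
"""

if __name__ == "__main__":
    for (instance, message), n in sorted(tally_scsi_warnings(SAMPLE.splitlines()).items()):
        print(f"{instance}: {n} x {message}")
```

If every entry lands on the same instance (glm0 in the excerpt), that at least narrows down which bus to look at; it does not say whether Networker or the hardware is at fault.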

We have 7.5.1 running here at home on Solaris 9 and 10. We have 7.5.3 
running on Solaris 10 and even a not-yet-upgraded small environment still 
running 7.2 on Solaris 10 (all SPARC). We've never experienced anything 
like this before. As was mentioned in the email quoted above, they moved 
the OS boot disk to another Sun e2900 server and got the same errors there 
(up until which time they were suspecting a bad backplane).

They're going to try putting Networker on a Solaris 9 e2900 they have, but 
I'm also going to suggest an alternative: removing all tape devices from 
Networker (as well as from /dev/rmt), doing a "boot -r", and letting 
Networker add them fresh again. I have no idea if that will help, though.

Anyone ever seen this?

Thanks!

