Subject: Re: [Networker] Drive state never goes from idle to done. nsrmmd keeping drive open.
From: Oscar Olsson <spam1 AT QBRANCH DOT SE>
To: NETWORKER AT LISTMAIL.TEMPLE DOT EDU
Date: Thu, 11 Sep 2003 15:46:31 +0200

On Thu, 11 Sep 2003, Stan Horwitz wrote:

SH> >It seems like nsrmmd keeps a drive unavailable for no apparent reason. This
SH> >probably happens after a backup job finishes, but I can't figure out
SH> >what causes it. No jobs are running, and there are no
SH> >savesets waiting to be backed up. There are no pending media requests. The
SH> >drive still sits in the state "ready for writing, idle". An attempt to
SH> >unmount the drive makes the server claim that the drive is busy. I can't
SH> >see any rogue savegrp processes or similar.
SH> If you haven't done so yet, try looking in the /nsr/logs/daemon.log
SH> file to see if there are any error messages there.

Nope, nothing related there, as far as I can see.

I tried this:

[root@britt:/nsr/logs] ps -ef |grep 587
    root   587   245  0 18:30:35 ?        7:05 /usr/sbin/nsrmmd -n 4
    root 11628 11570  0 15:32:51 pts/0    0:00 grep 587
[root@britt:/nsr/logs] kill 587
[root@britt:/nsr/logs] ps -ef |grep 587
    root 11633 11570  0 15:32:55 pts/0    0:00 grep 587

Which gave the following output in the daemon.log:

2003-09-11 15.32.54 nsrd: media info: restarting nsrmmd #4 on britt.qbranch.se in 2 minute(s)
2003-09-11 15.32.59 nsrd: media info: restart of nsrmmd #4 on britt.qbranch.se cancelled

Then I ejected the tape.

After that, I started a group, which created a pending mount request. I
partly resolved it by mounting a volume in this drive, the one that
previously had the stale nsrmmd process attached to it. It appears that
another nsrmmd process has taken control of the drive:

[root@britt:/nsr/logs] fuser /dev/rmt/9cbn
/dev/rmt/9cbn:     1199o
[root@britt:/nsr/logs] ps -ef |grep nsrmmd
    root   583   245  0 18:30:30 ?       30:51 /usr/sbin/nsrmmd -n 1 -r britt.qbranch.se
    root 12849   245  0 15:41:12 ?        0:00 /usr/sbin/nsrmmd -n 13
    root   499   245  1 18:30:04 ?        2:29 /usr/sbin/nsrmmdbd
    root   586   245  0 18:30:33 ?       51:10 /usr/sbin/nsrmmd -n 3
    root   585   245  0 18:30:32 ?       41:16 /usr/sbin/nsrmmd -n 2
    root   588   245  0 18:30:37 ?       32:12 /usr/sbin/nsrmmd -n 5
    root   589   245  0 18:30:39 ?       11:03 /usr/sbin/nsrmmd -n 6
    root 12502   245  0 15:40:28 ?        0:00 /usr/sbin/nsrmmd -n 10
    root 13014 11570  0 15:42:03 pts/0    0:00 grep nsrmmd
    root 12098   245  0 15:39:52 ?        0:00 /usr/sbin/nsrmmd -n 8
    root 12810   245  0 15:41:03 ?        0:00 /usr/sbin/nsrmmd -n 12
    root 12511   245  0 15:40:30 ?        0:00 /usr/sbin/nsrmmd -n 11
    root  1199   245  2 02:10:32 ?        0:02 /usr/sbin/nsrmmd -n 7

I.e., nsrmmd number 7 (PID 1199) has control of the drive now, and the
group seems to run as normal... Annoying.
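
The lookup above (fuser for the PID holding the device, then ps for the
matching nsrmmd -n number) could be scripted roughly as follows. This is only
a sketch: the function names fuser_pids and nsrmmd_number are made up for
illustration and are not part of NetWorker, and it assumes the fuser and
ps -ef output formats shown in this message.

```shell
#!/bin/sh
# Sketch only: helper names are hypothetical, not NetWorker commands.
# Assumes a POSIX awk (on Solaris, nawk or /usr/xpg4/bin/awk).

# Print the bare PID(s) found in one line of fuser output, e.g.
# "/dev/rmt/9cbn:     1199o" (trailing letters are usage codes).
fuser_pids() {
    printf '%s\n' "$1" | awk '{
        for (i = 1; i <= NF; i++) {
            if ($i ~ /:$/) continue        # skip the "device:" label
            sub(/[^0-9]+$/, "", $i)        # drop usage-code letters
            if ($i ~ /^[0-9]+$/) print $i  # keep numeric PIDs only
        }
    }'
}

# Read "ps -ef" text on stdin and print the value after "-n" for the
# nsrmmd process whose PID (second column) equals $1.
nsrmmd_number() {
    awk -v pid="$1" '$2 == pid && /nsrmmd/ {
        for (i = 1; i < NF; i++)
            if ($i == "-n") { print $(i + 1); exit }
    }'
}
```

Fed the fuser line and the ps listing above, fuser_pids prints 1199 and
nsrmmd_number 1199 prints 7.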

So does anyone have any idea what could be causing this? My guess
would be a buggy nsrmmd, or a bad tape, tape drive, or tape
configuration.

I'm grateful for (almost) all input I can get on this issue. :)

//Oscar
