> Very often, networker starts a drive operation and it takes 3 to 8 hours to
> complete. The operation can be just about anything, eject, move forward,
> verify the tape etc. It's like networker sent the command to the drive, but
> the drive never got it.
More to the point, networker sends the commands to the driver, but the
driver doesn't deal with them properly.
> When this happens, the nsrmmd for that drive gets
> locked into an uninterruptible I/O state. If we stop networker, the nsrmmd
> for that drive still hangs around and no amount of killing will get rid of
> it. This problem happens at various times on all drives. The only resolution
> so far seems to be a reboot. We have a call open with legato on this and are
> in the process of opening a call with Red Hat.
Right. A userland process cannot be killed while it's blocked in an
uninterruptible system call; the signal isn't acted on until the call
returns. If the driver decides to wig out and not return (or properly
time out), too bad. There's nothing the userland process can do about
it. It also suggests that it's not really a networker issue.
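If you want to confirm that's what you're looking at, here is a rough
sketch (assuming a Linux box with /proc mounted) that lists processes
sitting in state "D", the uninterruptible sleep described above. The
function names are mine and purely illustrative; nothing here comes
from networker itself.

```python
import os


def parse_stat_line(line):
    """Return (pid, comm, state) from the contents of /proc/<pid>/stat."""
    # comm is wrapped in parentheses and may itself contain spaces or
    # parentheses, so split on the first '(' and the last ')'.
    open_paren = line.index("(")
    close_paren = line.rindex(")")
    pid = int(line[:open_paren])
    comm = line[open_paren + 1:close_paren]
    state = line[close_paren + 1:].split()[0]
    return pid, comm, state


def find_uninterruptible(proc_root="/proc"):
    """List (pid, comm) for every process currently in state D."""
    stuck = []
    for entry in os.listdir(proc_root):
        if not entry.isdigit():
            continue
        try:
            with open(os.path.join(proc_root, entry, "stat")) as f:
                pid, comm, state = parse_stat_line(f.read())
        except (OSError, ValueError):
            continue  # process exited (or was unreadable) while we looked
        if state == "D":
            stuck.append((pid, comm))
    return stuck


if __name__ == "__main__":
    for pid, comm in find_uninterruptible():
        print(pid, comm)
```

An nsrmmd that shows up here and stays there across several runs is
stuck inside the kernel, not inside its own code.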
> Is there any sort of retry timer we can set for the tape operations?
> Would an upgrade to 7.1.3 help?
If you can't do a kill -9 on nsrmmd, then the problem is in the kernel:
either in a generic tape driver, a specific HBA driver, or elsewhere.
It's possible that a different version of Networker would make the calls
differently and avoid the bug, but that's not guaranteed.
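To narrow down which part of the kernel is involved, it can help to
see where the stuck process is sleeping. A minimal sketch, assuming a
Linux kernel that exposes /proc/<pid>/wchan (not every kernel does, and
the symbol may be unavailable); read_wchan is a hypothetical helper
name of my own:

```python
import os


def read_wchan(pid, proc_root="/proc"):
    """Return the kernel symbol a process is sleeping in, or None."""
    path = os.path.join(proc_root, str(pid), "wchan")
    try:
        with open(path) as f:
            name = f.read().strip()
    except OSError:
        return None  # no such process, or this kernel has no wchan file
    # An empty file means the kernel reported nothing useful.
    return name or None
```

Call it with the pid of the hung nsrmmd. What comes back depends
entirely on your kernel, but a symbol that looks like it belongs to the
tape or SCSI layer would be worth passing along to whoever supports the
driver.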
Darren Dunham ddunham AT taos DOT com
Senior Technical Consultant TAOS http://www.taos.com/
Got some Dr Pepper? San Francisco, CA bay area
< This line left intentionally blank to confuse you. >
Note: To sign off this list, send a "signoff networker" command via email
to listserv AT listserv.temple DOT edu or visit the list's Web site at
http://listserv.temple.edu/archives/networker.html where you can
also view and post messages to the list. Questions regarding this list
should be sent to stan AT temple DOT edu