Re: [Networker] DDS woes

2003-01-21 12:30:59
Subject: Re: [Networker] DDS woes
From: Andre Beck <networker AT IBH DOT DE>
To: NETWORKER AT LISTMAIL.TEMPLE DOT EDU
Date: Tue, 21 Jan 2003 18:30:54 +0100
Re,

On Tue, Jan 21, 2003 at 06:14:29AM -0500, Davina Treiber wrote:
>
> Both of your postings this morning have intrigued me, for the reason that
> you seem to be struggling with 6.2. My question is why? 6.2 is a
> non-mainstream release of NetWorker designed to address two specific needs:
> Unix style filenames in NDMP, and XP support. If you don't need either of
> these then stick with the 6.1.x tree, it is better supported. I have heard
> several reports of buggy behaviour in 6.2 and it is best avoided IMHO. The
> problem is that it is the highest release number, so many users assume it is the
> greatest and latest. However the same happened with 5.7 which was released
> to provide early W2K support until 6.x was released, and that release was
> short-lived too.

Yes, I was partially trapped by the version numbering scheme. I'd expect
a version 6.2 to be newer than 6.1.x, or at least the base of a newer
branch of development. Of course, maintenance on both branches can
produce a 6.1.x that is released later than a 6.2.y and carries fixes
the latter doesn't have. I'd just expect such fixes to also show up in
6.2.(y+1) whenever it appears, or in the next major branch (like 7.0).
After all, I've read the 60 page software version design whitepaper
from Cisco, so nothing can really scare me anymore ;)

On the other hand, in both cases the customer expressed either the
requirement or at least the intention to support XP clients, so the
choice wasn't entirely faulty. In the latter case we're renegotiating
with the customer whether XP can wait. If so, we'll go with 6.1.3, since
both Legato and the community seem to regard it as some kind of "golden
release" compared to 6.2.

> Your issue seems odd. My first reaction would be to blame RSM, but you say
> you have disabled it.

Some of the issues are clearly RSM. But there were two issues, and if you
aren't debugging in a lab for weeks you don't easily get them separated.

> In any case, I would leave the whole service switched
> off unless you have some other unrelated need for it.

That's my thought about this as well, and I don't see a reason to give
W2k the library drivers in the first place. The problem is just that
PnP pops up unrequested "New hardware found..." dialogs, and Legato is
clear about that issue in the docs, saying "disable library, but leave
tape and medium enabled". There must be some reason for this, or so one
would think. But in the last few hours I've seen all kinds of nonsense
happening even with the library disabled. Disabled obviously doesn't mean
"don't ever try to talk to it" but rather "pretend to the upper layers
it isn't there", so the lower layers still talk to tape and library and
interfere with what Networker is doing. So I decided to disable the
service completely, and (together with SP3) it now just works.

> Second culprit would
> be a possible lack of persistent binding. Devices can move about in W2K, and
> persistent binding goes some way to sorting this out. There are 2 "buts"
> here though. Firstly, Compaq have crippled the Emulex HBA when they OEMed
> it, and their firmware doesn't include useful things like persistent
> binding. Perhaps you could install native Emulex firmware?

This might introduce some unwanted complications regarding service for
these machines. I'd rather stay with the Compaq firmware and push Compaq
(HP) to fix the issues; if that involves uncrippling this functionality,
even better. For the moment we think we can live with it by letting the
switch do zoning.

> Secondly, Windows
> 2K is really bad at handling tape devices, and even with RSM off and
> persistent binding configured there are still circumstances where tape
> devices can move to different addresses.

I don't expect this to be a problem as long as you have only one drive.
2k alias NT maps tapes to these \\.\TapeX names, and with only one
of them, this should always come out as Tape0. We are, of course, seeing
problems with changed SCSI mappings for the library, whose address is
hardcoded in Legato, but thanks to inquire this can be detected and fixed
fairly easily. At least in the simple case of one library.
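
To make the one-drive argument concrete, here is a minimal Python sketch.
It is a simplified model, not how Windows or Networker actually work
internally: ScsiAddress, tape_names and library_moved are hypothetical
names, and I assume the OS numbers drives in the order it enumerates
them. The list of scanned addresses stands in for whatever a tool like
inquire reports; its real output format is not modelled here.

```python
from typing import NamedTuple, Optional, Sequence


class ScsiAddress(NamedTuple):
    # Hypothetical bus/target/LUN triple in the spirit of the
    # HBA-Bus-Target-LUN scheme; not a Networker data structure.
    bus: int
    target: int
    lun: int


def tape_names(drives: Sequence[ScsiAddress]) -> dict:
    # Assumption: drives are numbered in enumeration order, so the
    # name depends on position in the list, not on the SCSI address.
    return {addr: rf"\\.\Tape{i}" for i, addr in enumerate(drives)}


def library_moved(configured: ScsiAddress,
                  scanned: Sequence[ScsiAddress]) -> Optional[ScsiAddress]:
    # Return the library's new address, or None if it is still where
    # the configuration says it is. Only the single-library case is
    # handled; with several candidates the answer would be ambiguous.
    if configured in scanned:
        return None
    if len(scanned) == 1:
        return scanned[0]
    raise ValueError("cannot tell which library moved")


if __name__ == "__main__":
    before = ScsiAddress(0, 3, 0)
    after = ScsiAddress(0, 5, 0)
    # A single drive gets the same name wherever it shows up.
    print(tape_names([before]), tape_names([after]))
    # The library address is stored explicitly, so a move is visible.
    print(library_moved(before, [after]))
```

The point of the model: the single drive keeps its name after a remap
because the name is positional, while the library breaks because its
address is stored explicitly and has to be compared against a fresh scan.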

> Solaris is better at this because
> it only maps devices to SCSI addresses when you tell it to. AIX is better
> still because it dispenses with the pointless SCSI type addresses and maps a
> device directly to a WWPN.

Native FC access without mapping to some good old SCSI HBA-Bus-Target-LUN
scheme could solve this, yes. But it has to be supplied by the OS
implementor first.

> Blame Microsoft not Legato.

Exactly. I'm the first to do this, believe me ;)

> >2) DDS licensing clearly states that you need one DDS license per drive to
> >   be shared. For entirely unknown reasons, I need two.
> Sounds like a bug.

Yep, it is, and it has a fix.

> FWIW I have successfully implemented DDS on Compaq SAN
> kit running W2K. This was about 18 months ago using 6.1.1. It all just
> worked. Interestingly at 6.1.1 the DDS licensing didn't seem to be enforced
> at all, it worked before I put the licences in.

So it balances out, doesn't it? ;-)

> >3) Regarding the somewhat stress sensitive tape pickup mechanism of DLT,
> >   I really don't like the fact that DDS requires unloading and instantly
> >   reloading the same cartridge to move it between the virtual drives. I
> >   don't expect a way around this, as it seems to be designed that way,
> >   but I'll ask anyway.
> Room for improvement IMO. I don't see why it can't just remap the drive
> without unmounting. I also can't see why they need to rely on a timeout to
> unmount the drive - surely the NW server can decide to unmount a drive when
> it is required for another storage node. This would be far more efficient.

Yep, but I do somehow understand that a change like this could be a
medium earthquake for the underlying infrastructure when it wasn't designed
to work like this in the first place. So instead of rewriting a lot of
proven code and probably introducing new bugs that even strike in non-DDS
setups, they implemented it the way we know it. But there's still hope
that the major redesigns which tend to happen at new major version numbers
will improve this some day. I'd vote for a feature request instantly.

> >From the points above, what clearly disturbs me most is 1), as it seems
> >that these errors even cause media to be erroneously marked as full.
> >Anyone with some insight on this?
> I would expect this. Device errors usually cause this and usually it is what
> you want, since you may have a case where a drive has been reset and thus
> rewound. Writing further data at this point could be disastrous. The answer
> is to fix the source of the problem.

Legato even has a section on that topic in their release notes, saying
they don't support any configuration where a host could SCSI-reset a
tape drive that is in use by another host. Preventing this from ever
happening is left as an exercise to the SAN deployer, and that's of
course all they can do. It's like properly terminating your SCSI bus:
if you don't do it, it's nobody's problem but yours.

Thanks,
Andre.
--
Thanks to DRM technology, this mail will destroy itself in five seconds.

-> Andre Beck    +++ ABP-RIPE +++    IBH Prof. Dr. Horn GmbH, Dresden <-
