Networker

Re: [Networker] SCSI Bus Resets : Filling Tapes Prematurley

2008-11-06 03:21:34
Subject: Re: [Networker] SCSI Bus Resets : Filling Tapes Prematurley
From: Mesut Mert <mesut.mert AT GMAIL DOT COM>
To: NETWORKER AT LISTSERV.TEMPLE DOT EDU
Date: Thu, 6 Nov 2008 12:19:42 +0400
First of all, 7.4.3 has a lock mechanism on the robotic arm. It should help
protect you against RSM or similar unwanted bus reset requests.

My understanding from your zoning is that you use dynamic drive sharing for
the 4 drives. Disable RSM on all 4 hosts. Check the system log on all 4
hosts: are there any I/O errors? Make sure these are not media or cleaning
errors. It is better if you can use new branded tapes for a while; manual
cleaning can also help.

Please also check the daemon.log on all 4 hosts. Does nsrd keep restarting
nsrmmd frequently? If nsrmmd dies in the middle of a backup, the volume
could be marked as full prematurely. Check name-to-IP and IP-to-name
resolution on all 4 servers. Please note that communication between the
backup server, storage nodes and clients is bidirectional. Make sure each
direction is working fine.
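
The name-to-IP and IP-to-name check can be scripted. Below is a minimal
sketch, assuming Python is available on the hosts; the host names in the
usage note are hypothetical placeholders, and comparing only the short host
name is my own simplification, not something NetWorker prescribes.

```python
import socket


def short_name(name):
    """Lower-cased host name without its domain part."""
    return name.split(".")[0].lower()


def check_resolution(hostname,
                     forward=socket.gethostbyname,
                     reverse=socket.gethostbyaddr):
    """Resolve name -> IP, then IP -> name, and report whether they agree.

    The resolver functions are parameters so the check can be exercised
    without touching real DNS.
    """
    try:
        ip = forward(hostname)
        reverse_name = reverse(ip)[0]  # first tuple element is the primary name
    except OSError as exc:  # socket.gaierror and socket.herror are OSErrors
        return f"{hostname}: lookup failed ({exc})"
    same = short_name(reverse_name) == short_name(hostname)
    status = "OK" if same else "MISMATCH"
    return f"{hostname}: {ip} -> {reverse_name} [{status}]"
```

Run it on each of the 4 hosts against the names of the other three, for
example print(check_resolution("snode1")), and look for MISMATCH or
"lookup failed" lines.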

Most likely this case is related to SCSI I/O errors, driver and/or device
problems. Also please check esg66930 on Powerlink; it might be very
helpful. I am posting the content of esg66930 below for you:
***********************
Fact: ASC/ASCQ
Fact: I/O Error
Symptom: Error: 'I/O Error'
Symptom: nsrd: Jukebox 'name' failed: I/O error
Symptom: Error: 'SJI Failure[0x 29]: Illegal Request, Logical Unit Not
Supported'
Symptom: Error: 'SJI Failure[0x 29]: Aborted Command, ASC 0x8f ASCQ 0x00
Symptom: Error: 'Jukebox error; illegal request, ASC 0x83 ASCQ 0x10
Symptom: Error: 'SJI Failure [0x29] Not Ready, ASC 0x80 ASCQ 0x01
Symptom: Error: '(jukebox name) failed: SJI_Error 0x29: I/O Error
Symptom: nsrd: media warning: /dev/rmt1.1 moving: eject I/O error
Symptom: nsrd: media warning: /dev/rmt1.1 reading: I/O error
Symptom: Error: 'nsrjb: SYSTEM error: I/O error'
Symptom: Error: 'lus: [ID 497811 kern.notice] NOTICE: lus_intr(b.t.l):
transport failure (reset)'
Symptom: Compaq Jukebox with windows reports: 'The description for Event ID
( 1107 ) in Source ( Storage Agents ) could not be found. It contains the
following insertion string(s): 5,2,! = ,2,1,1,0.'
Symptom: Log /var/adm/messages shows: Cannot check out flexlm license
APD_HPLTO_SCSI version 1.000, feature has expired
Symptom: NetWorker resource file out of sync
<https://solutions.emc.com/nsepn/webapps/stqv768481dmts46655278/EMCSolutionview.asp?type=V3&altid=legato10317>
Symptom: Is RSM service enabled
<https://solutions.emc.com/nsepn/webapps/stqv768481dmts46655278/EMCSolutionview.asp?type=V3&altid=legato63008>
Symptom: Jukebox firmware not up to date
<https://solutions.emc.com/nsepn/webapps/stqv768481dmts46655278/EMCSolutionview.asp?type=V3&altid=Legato31660>
Symptom: ASC/ASCQ error codes returned
<https://solutions.emc.com/nsepn/webapps/stqv768481dmts46655278/EMCSolutionview.asp?type=V3&altid=legato69902>
Symptom: Check for device compatibility in the Legato hardware
compatibility guide <http://www.legato.com/resources/compatibility/>
Cause: In troubleshooting jukebox I/O errors it is critical to determine
which process is generating the error and at what stage. For example, it
could get logged if nsrjb fails to communicate with the library to unload a
tape stuck in a drive. In this situation the error is reported to nsrd,
which logs it as such since the nsrjb execution never completed
successfully.
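
As an illustration of sorting errors by the process that reported them,
here is a minimal sketch. It assumes log lines carry a "process: message"
prefix, as the symptom lines above do; the real daemon.log or system log
layout on your hosts may differ, so treat both regular expressions as
placeholders to adapt.

```python
import re
from collections import Counter

# Assumption: the reporting process appears as "<name>:" somewhere in the line.
PROCESS_RE = re.compile(r"\b(nsrd|nsrjb|nsrmmd|lus)\b:")
# Assumption: these two phrases are what marks a line as an I/O problem.
ERROR_RE = re.compile(r"I/O error|transport failure", re.IGNORECASE)


def tally_errors_by_process(lines):
    """Count I/O-related error lines per reporting process."""
    counts = Counter()
    for line in lines:
        if not ERROR_RE.search(line):
            continue  # not an I/O error line, ignore it
        match = PROCESS_RE.search(line)
        counts[match.group(1) if match else "unknown"] += 1
    return counts


# Sample lines taken from the symptom list above, plus one harmless line.
sample = [
    "nsrd: Jukebox 'name' failed: I/O error",
    "nsrd: media warning: /dev/rmt1.1 reading: I/O error",
    "nsrjb: SYSTEM error: I/O error",
    "lus: [ID 497811 kern.notice] NOTICE: lus_intr(b.t.l): transport failure (reset)",
    "nsrd: savegroup completed",
]
print(dict(tally_errors_by_process(sample)))
```

A tally dominated by lus or nsrjb entries points toward the driver or the
robot, while nsrd entries may only be relaying a failure from another
process, as the Cause text describes.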
Fix: I/O errors can be returned by a multitude of issues with jukeboxes.
While the majority of the problems can be alleviated by getting the
hardware vendor to check over the hardware, there are several things that
can be done in the interim. On the hardware side, verify that the robot is
not locked by the library being put in Offline or LCD mode. NetWorker
requires the robot to be in random access mode and to have full control of
it. I/O errors are generic in nature and most of the time are returned by
the OS to the software application.

Things to check:

NetWorker resource file out of sync
<https://solutions.emc.com/nsepn/webapps/stqv768481dmts46655278/EMCSolutionview.asp?type=V3&altid=legato10317>

Is RSM service enabled
<https://solutions.emc.com/nsepn/webapps/stqv768481dmts46655278/EMCSolutionview.asp?type=V3&altid=legato63008>

Jukebox firmware not up to date
<https://solutions.emc.com/nsepn/webapps/stqv768481dmts46655278/EMCSolutionview.asp?type=V3&altid=Legato31660>

Firmware and drivers are usually the main issue with robotic I/O errors,
more so with ASC/ASCQ errors (depending on the error code returned). Check
the hardware vendor's site for driver and firmware updates.

The library firmware can be checked on the front panel of the jukebox
through the LCD screen. Alternatively, it can be checked with the NetWorker
inquire command. Verify that it is the latest or at the very least meets
the firmware requirements in the Legato hardware compatibility guide
<http://www.legato.com/resources/compatibility/> for that particular
jukebox.

FC and SCSI HBAs can also be the problematic piece in relation to I/O
errors and ASC/ASCQ codes. legato67857 refers to issues with firmware and
drivers for FC HBAs.
<https://solutions.emc.com/nsepn/webapps/stqv768481dmts46655278/EMCSolutionview.asp?type=V3&altid=legato67857>
The same knowledge can be applied to directly SCSI-attached devices for the
most part.

ASC/ASCQ error codes returned
<https://solutions.emc.com/nsepn/webapps/stqv768481dmts46655278/EMCSolutionview.asp?type=V3&altid=legato69902>

Check the ASC/ASCQ code table for a description of the problem. ASC/ASCQ
(Additional Sense Code / Additional Sense Code Qualifier) codes, if
returned in the daemon.log for instance, give additional information
regarding the problem. The first code (ASC) is the generic SCSI code
returned for the error, in response to the REQUEST SENSE SCSI command. The
second code (ASCQ), if available, gives more detailed information regarding
the problem. Most often these are enough to lead you in the right direction
to solving the problem.
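
To show how an ASC/ASCQ pair might be pulled out of a log line and looked
up, here is a minimal sketch. The lookup table is deliberately tiny and
only lists a few standard SCSI codes; it is not the table the article
links to. The "ASC 0x.. ASCQ 0x.." pattern is taken from the symptom lines
above and is an assumption about how your logs print the codes.

```python
import re

# A few standard SCSI (ASC, ASCQ) pairs. Deliberately incomplete; consult
# the full code table or the drive/library manual for everything else.
KNOWN_CODES = {
    (0x04, 0x00): "Logical unit not ready, cause not reportable",
    (0x25, 0x00): "Logical unit not supported",
    (0x29, 0x00): "Power on, reset, or bus device reset occurred",
    (0x3A, 0x00): "Medium not present",
}

ASC_RE = re.compile(
    r"ASC\s+0x([0-9a-fA-F]{1,2})\s+ASCQ\s+0x([0-9a-fA-F]{1,2})"
)


def describe_asc_ascq(line):
    """Return (asc, ascq, description), or None if the line has no codes."""
    match = ASC_RE.search(line)
    if match is None:
        return None
    asc, ascq = (int(group, 16) for group in match.groups())
    if asc >= 0x80:
        # ASC values 0x80-0xFF are reserved for vendor-specific use.
        text = "vendor-specific code; check the hardware vendor's documentation"
    else:
        text = KNOWN_CODES.get((asc, ascq), "not in this small table")
    return asc, ascq, text
```

The codes in the symptom list above (ASC 0x8f, 0x83, 0x80) all fall in the
vendor-specific range, so for those the library or drive vendor's
documentation is the place to look.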


Check for device compatibility in the Legato hardware compatibility guide
<http://www.legato.com/resources/compatibility/>

Check the compatibility guide to verify that your hardware is supported and
qualified to work with NetWorker. Due to the number of jukeboxes, HBAs and
drivers/firmware out there, low-level jukebox drivers, such as Legato's LUS
drivers, may not work correctly with every hardware configuration or type
of library.

A reboot of the library and/or OS, or deletion and reconfiguration of the
jukebox resource, has also resolved this error condition. Do it only after
exhausting the options listed above.


On Thu, Nov 6, 2008 at 4:21 AM, psoni <networker-forum AT backupcentral DOT com> wrote:

> Does anyone know how to block SCSI bus resets in an MS Windows 2003
> environment?
>
> I am using Networker with the following configuration
>
> [1] One N/W server 7.3.3 : Windows 2003 server
> [2] Three N/W storage nodes : Win 2003 servers in a cluster
>
> The N/W server and all the three storage nodes have 2 Qlogic 2460 HBAs
> ( STORMiniport driver V 9.1.7.16 and Microsoft Q943545 hotfix installed )
>
> Also they are connected to a CLARiiON CX3-80 storage array and an ML6000
> tape library with 4 IBM LTO-3 tape drives.
>
> In NMC I have found a few tapes being marked as "FULL" before reaching
> their total capacity.
>
> CX3-80 ( 4 FC ports / SP )
> A0,A1,A2,A3 for SP-A
> B0,B1,B2,B3 for SP-B
>
> ML6000 has 4 ports ( T1,T2,T3,T4)
>
> There are 2 Brocade FC switches with the following zoning config.
>
> FC SWITCH # 1
>
> [1] Backup_Zone1
> Members : N/W server HBA 1; A0; B1; T1; T2
>
> [2] Backup_Zone2
> Members: N/W storage node(1) HBA 1; A0; B1; T1; T2
>
> [3] Backup_Zone3
> Members: N/W storage node(2) HBA 1; A0; B1; T1; T2
>
> [4] Backup_Zone4
> Members: N/W storage node(3) HBA 1; A0; B1; T1; T2
>
>
> FC SWITCH # 2
>
> [1] Backup_Zone1
> Members : N/W server HBA 2; A1; B0; T3; T4
>
> [2] Backup_Zone2
> Members: N/W storage node(1) HBA 2; A1; B0; T3; T4
>
> [3] Backup_Zone3
> Members: N/W storage node(2) HBA 2; A1; B0; T3; T4
>
> [4] Backup_Zone4
> Members: N/W storage node(3) HBA 2; A1; B0; T3; T4
>
> I read that tape drives shared by multiple computers experience
> unpredictable bus resets.
>
> What should I do to resolve this issue?
>
> Thanks
>
> +----------------------------------------------------------------------
> |This was sent by soni.parth AT gmail DOT com via Backup Central.
> |Forward SPAM to abuse AT backupcentral DOT com.
> +----------------------------------------------------------------------
>
> To sign off this list, send email to listserv AT listserv.temple DOT edu
> and type "signoff networker" in the body of the email. Please write to
> networker-request AT listserv.temple DOT edu if you have any problems with
> this list. You can access the archives at
> http://listserv.temple.edu/archives/networker.html or
> via RSS at http://listserv.temple.edu/cgi-bin/wa?RSS&L=NETWORKER
>
