[Networker] "Operation would block" error causing failures

2009-10-02 10:20:43
Subject: [Networker] "Operation would block" error causing failures
From: Michael Leone <Michael.Leone AT PHA.PHILA DOT GOV>
To: NETWORKER AT LISTSERV.TEMPLE DOT EDU
Date: Fri, 2 Oct 2009 10:16:32 -0400
I'm experiencing problems with a save job on a dedicated storage node 
(AFTD device). This is a Win2003 cluster. The job has been running fine 
for years, and no parameters have changed. However, it's been failing 
regularly now with errors like:

* nt_san1:H:\PHAEBS 7164:save: shutdown failed
     * <ERROR> :  Error: nsr_end failed: Operation would block

("nt_san1" being the name of the cluster resource being backed up)

Additionally, I see errors like:

NetWorker media: (emergency) Cannot write to 
X:\DBO\18\63\9acd2fea-00000006-12c5b6f0-4ac5b6f0-03ef0000-0a407d46 - 
errno=22

("X:\DBO" being the path the AFTD device writes to. BTW, couldn't EMC have 
at least included a client name with this error? Sheesh ...)
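
For reference, the numeric code in that message can be decoded with a short lookup. This is only a sketch, and it assumes NetWorker is reporting a standard C runtime errno value here (the post does not confirm that); `describe_errno` is a hypothetical helper name.

```python
import errno
import os


def describe_errno(code):
    """Return (symbolic name, OS message text) for a C errno value."""
    # errno.errorcode maps numeric codes to names such as 'EINVAL'.
    name = errno.errorcode.get(code, "UNKNOWN")
    return name, os.strerror(code)


# Assumption: the "errno=22" in the media error is a plain C errno.
name, message = describe_errno(22)
print(f"errno=22: {name} ({message})")
```

If that assumption holds, 22 is EINVAL ("Invalid argument"), which is a generic code and does not by itself say what was wrong with the write.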

The weird thing is that it's always the same 2 folders that error out (we 
enumerate about a dozen folders, for speed reasons, rather than just 
saying ALL). Both are very large (one is like 260G, the other about 900G). 


At first we thought we had mis-set the anti-virus scan, which was recently 
re-installed. We did see errors in the system event log indicating that a 
scan was taking too long and was being terminated. So we increased the 
scanning time, and also told McAfee 8.5 *not* to scan files opened for 
backup. However, the job is still failing, and now there is no error in 
the system event log (so we don't think that it's McAfee blocking access 
to files). And backups of other folders/shares on this same server go off 
without a hitch, with no AV errors at all.

I turned on "verbose" on a test job that did only those 2 folders, but all 
it shows is the "Operation would block" message. The NSR log shows:

----------------
32496 10/2/2009 3:54:06 AM  2 0 0 4028 5212 0 admnman004 savegrp job 
(3264918) host: nt_san1 savepoint: H:\PHAEBS had ERROR indication(s) at 
completion. 
Unable to render the following message: savegrp:RESTART FAILED JOBS * 
nt_san1:H:\PHAEBS  See the file D:\Program Files\Legato\nsr\tmp\sg\RESTART 
FAILED JOBS\sso.000005 for output of save command.

7341 10/2/2009 3:54:07 AM  2 0 0 4028 5212 0 admnman004 savegrp 
nt_san1:H:\PHAEBS failed. 
-----------------
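
When a log like the excerpt above covers many save sets, the failed ones can be pulled out with a few lines of script. This is a minimal sketch, assuming the failure records keep the "savegrp host:path failed." shape seen in this excerpt; the regular expression and the `failed_savesets` name are illustrative, not part of NetWorker.

```python
import re

# Matches records shaped like: "... savegrp nt_san1:H:\PHAEBS failed."
# (pattern inferred from the excerpt only; other NetWorker versions may differ)
FAILED_RE = re.compile(r"savegrp (\S+?):(.+?) failed\.")


def failed_savesets(log_text):
    """Return (client, savepoint) pairs for save sets reported as failed."""
    # The archived log is hard-wrapped, so collapse all whitespace first.
    flat = " ".join(log_text.split())
    return FAILED_RE.findall(flat)


sample = r"""
32496 10/2/2009 3:54:06 AM  2 0 0 4028 5212 0 admnman004 savegrp job
(3264918) host: nt_san1 savepoint: H:\PHAEBS had ERROR indication(s) at
completion.

7341 10/2/2009 3:54:07 AM  2 0 0 4028 5212 0 admnman004 savegrp
nt_san1:H:\PHAEBS failed.
"""

for client, savepoint in failed_savesets(sample):
    print(client, savepoint)
```

This only identifies which save sets failed; the reason still has to come from the per-job output file named in the log.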

The referenced file shows each file being backed up, and ends with 
"7164:save: shutdown failed". Not a whole lot of use, IMO ...

The job is so large (the 900G folder is my main user home directories) 
that I can't just run it at a whim, as it takes multiple hours, especially 
since it has to write to disk first, then clone to tape.

I've been searching PowerLink, which has been less than helpful. Web 
searches indicate turning off AV, but the saves of other folders on this 
server show no problems.

I'm about to get EMC on the line (severity 2, as I'm not down, actually, 
but I am impacted in a major way - tonight is EOM backup ...)

Thoughts? Next steps?

-- 
Michael Leone
Network Administrator, ISM
Philadelphia Housing Authority
2500 Jackson St
Philadelphia, PA 19145
Tel:  215-684-4180
Cell: 215-252-0143
<mailto:michael.leone AT pha.phila DOT gov>

