Networker

Re: [Networker] NMDA Oracle RMAN backups fail only when nsrexecd is run

2011-08-27 03:38:18
Subject: Re: [Networker] NMDA Oracle RMAN backups fail only when nsrexecd is run
From: jee <jee AT ERESMAS DOT NET>
To: NETWORKER AT LISTSERV.TEMPLE DOT EDU
Date: Sat, 27 Aug 2011 08:32:37 +0100
Hi bingo,

the problem my customer is facing is not related to the backup of the client 
file index. The problem occurs during the backup of the RMAN savesets only.

If you mean the creation of the index entries, then I do agree. That's 
actually the idea, but RMAN seems to be doing saveset work... (I don't know 
how the module handles the creation of index entries.)

The error does include the word "index", but is this the CFI or the "media 
index" (i.e. the media database)? Too ambiguous, grrrr!

The backup makes many individual connections to the NW server using the 
module, and the function that apparently makes those connections fails:
 
+-----------------------------------
RMAN-00571: ===========================================================
RMAN-00569: =============== ERROR MESSAGE STACK FOLLOWS ===============
RMAN-00571: ===========================================================
RMAN-03009: failure of backup command on t2 channel at mm/dd/yyyy HH:MM:SS
ORA-19506: failed to create sequential file, 
name="df_DBNAME_156027303_8317_1", parms=""
ORA-27028: skgfqcre: sbtbackup returned error
ORA-19511: Error received from media manager layer, error text:
   lnm_index_cfx_connect retry failed.  See privious error messages. (3:3:111)
---------------------------------------------------------+


This happens at random, and if I retry the backup it runs OK. Allocating 4 
channels instead of 4 does help, as it gives the backup more chances.
  
The question is: why doesn't this error occur when I run the RMAN backups 
from cron? What is the difference? If the index is involved, then EMC should 
look there.
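
One way to look for "the difference" is to record the limits and environment 
that the backup process actually inherits in each case, and compare them. The 
sketch below is only an illustration and is not part of NetWorker or NMDA: the 
names snapshot() and diff_snapshots() are hypothetical, and it assumes Python 
is available on the Linux client and that the script is called from whatever 
launches RMAN in both the cron run and the scheduled run.

```python
import json
import os
import resource

# Limits that plausibly matter for a process opening many connections.
# This selection is an assumption; extend it as needed.
LIMITS = {
    "RLIMIT_NOFILE": resource.RLIMIT_NOFILE,
    "RLIMIT_NPROC": resource.RLIMIT_NPROC,
    "RLIMIT_STACK": resource.RLIMIT_STACK,
}


def snapshot():
    """Return the resource limits and environment of the current process."""
    limits = {}
    for name, res in LIMITS.items():
        soft, hard = resource.getrlimit(res)
        limits[name] = {"soft": soft, "hard": hard}
    return {
        "pid": os.getpid(),
        "uid": os.getuid(),
        "limits": limits,
        "env": dict(os.environ),
    }


def diff_snapshots(a, b):
    """Return the 'limits' and 'env' entries whose values differ between a and b."""
    changed = {}
    for section in ("limits", "env"):
        keys = set(a[section]) | set(b[section])
        for key in sorted(keys):
            va, vb = a[section].get(key), b[section].get(key)
            if va != vb:
                changed[f"{section}.{key}"] = (va, vb)
    return changed


if __name__ == "__main__":
    # Redirect this output to a file in each run, then compare the two files.
    print(json.dumps(snapshot(), indent=2, sort_keys=True))
```

The point of taking the snapshot inside the job, rather than trusting an 
interactive "ulimit -a", is that a daemon started at boot can hand its 
children different limits and variables than a login shell or cron does.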


"ulimit -a" on the client shows "unlimited" or high numbers, the OS tcp 
keepalive is set to 30 minutes to avoid connections being broken after a long 
period (default 2h) by a stateful inspection firewall. The connections that 
fail do so almost immediately.
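
To double-check the keepalive values the kernel is really using, something 
like the following could be run on the Linux client. It is a minimal sketch: 
the function names are hypothetical, and it assumes the standard Linux sysctl 
files under /proc/sys/net/ipv4. Note also that these settings only affect 
sockets on which the application enables SO_KEEPALIVE; whether the module does 
so is not something I can confirm.

```python
from pathlib import Path

KEEPALIVE_KEYS = (
    "tcp_keepalive_time",    # idle seconds before the first probe
    "tcp_keepalive_intvl",   # seconds between probes
    "tcp_keepalive_probes",  # unanswered probes before the connection is dropped
)


def read_keepalive(base="/proc/sys/net/ipv4"):
    """Read the Linux TCP keepalive sysctls as integers; unreadable files are skipped."""
    values = {}
    for key in KEEPALIVE_KEYS:
        try:
            values[key] = int((Path(base) / key).read_text().strip())
        except (OSError, ValueError):
            continue
    return values


def seconds_until_dead_peer_detected(values):
    """Idle time plus the time taken by all unanswered probes, in seconds."""
    return (
        values["tcp_keepalive_time"]
        + values["tcp_keepalive_intvl"] * values["tcp_keepalive_probes"]
    )


if __name__ == "__main__":
    current = read_keepalive()
    print(current)
    if len(current) == len(KEEPALIVE_KEYS):
        print("dead peer detected after", seconds_until_dead_peer_detected(current), "s")
```

Since the failing connections die almost immediately, keepalive is probably 
not the cause here, but this at least rules out a setting that did not 
survive a reboot.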
 
I think I will have to monitor the RMAN log waiting for this message and get 
some output from netstat etc when the error occurs.
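
A rough sketch of such a monitor, in Python, could look like this. Everything 
in it is an assumption for illustration: the function names, the log and 
output paths, and the choice of "netstat -an" as the only diagnostic command. 
It follows the RMAN log the way "tail -f" would and appends a timestamped 
netstat dump to a file each time the message appears.

```python
import subprocess
import time
from datetime import datetime

PATTERN = "lnm_index_cfx_connect retry failed"
DEFAULT_COMMANDS = (["netstat", "-an"],)


def follow(path, poll_seconds=1.0, max_idle_polls=None, from_start=False):
    """Yield complete lines appended to path; stop after max_idle_polls empty reads if set."""
    idle = 0
    buffer = ""
    with open(path, "r", errors="replace") as handle:
        if not from_start:
            handle.seek(0, 2)  # skip what is already in the log
        while True:
            chunk = handle.readline()
            if chunk:
                idle = 0
                buffer += chunk
                if buffer.endswith("\n"):  # only hand out finished lines
                    yield buffer
                    buffer = ""
            else:
                idle += 1
                if max_idle_polls is not None and idle >= max_idle_polls:
                    return  # an unfinished trailing line is dropped
                time.sleep(poll_seconds)


def capture(commands, out_path):
    """Append a timestamp and the output of each command to out_path."""
    with open(out_path, "a") as out:
        out.write(f"=== {datetime.now().isoformat()} ===\n")
        for cmd in commands:
            out.write(f"$ {' '.join(cmd)}\n")
            try:
                result = subprocess.run(cmd, capture_output=True, text=True, timeout=30)
                out.write(result.stdout + result.stderr)
            except (OSError, subprocess.TimeoutExpired) as exc:
                out.write(f"(failed: {exc})\n")


def watch(log_path, out_path, commands=DEFAULT_COMMANDS, **follow_args):
    """Capture diagnostics whenever PATTERN shows up in the log; return the hit count."""
    hits = 0
    for line in follow(log_path, **follow_args):
        if PATTERN in line:
            capture(commands, out_path)
            hits += 1
    return hits


if __name__ == "__main__":
    # Both paths are placeholders; point them at the real RMAN log and an output file.
    watch("/tmp/rman_backup.log", "/tmp/rman_netstat_capture.txt")
```

One caveat: the message only reaches the RMAN log after the connection has 
already failed, so the netstat dump may be too late to show the failing 
socket itself. It should still show how many connections to the server are 
open or in TIME_WAIT at that moment.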

But what I actually need is something better than

ORA-19511: Error received from media manager layer, error text:
   lnm_index_cfx_connect retry failed.  See privious error messages. (3:3:111)

and EMC should be able to produce more useful output in that area of the code. 
Still waiting for that day...


jee


On Thursday 25 August 2011 06:16:39 bingo wrote:
> Jee and Ronbenton mention 2 issues that are important to me:
>
>   - "We manage to get rid of the error using cron. When the RMAN script is
> executed by cron on the linux client the error doesn't occur. "
>
>   - "The debug log files they have me collecting indicate that it is timing
> out connecting to the NSR server for the client file index."
>
> Both statements are connected. A cron job never runs an index backup at
> the end. It is logical that if the problem is due to the index backup (or
> the transition phase), you may prevent it just by running
> client-initiated backups.
>
> We have a similar phenomenon in our NW 7.6.1.6 (with older clients)
> environment - I call it a 'sleeping group'. But it is obviously a 'sleeping
> client'. The effect shows as follows:
>   - A group will run until it is almost finished (99%).
>   - Then no more save sets will become active.
>   - To proceed, simply restart nsrexecd on (all) the remaining clients.
>        If you are lucky, the group ends successfully.
>        Sometimes you can even see an index backup rush through.
>   - If the group fails, you can still restart it and, almost 100% of the
> time, it will end successfully.
>
> Unfortunately, the problem occurs randomly on our 400+ client environment.
>
> The only help EMC provided so far was:
>   - Activate the NSR_KEEP_ALIVE functionality at the client
>   - Delete the "nsr peer information" on the server and the client
>
> +----------------------------------------------------------------------
>
> |This was sent by carsten_reinfeld AT avus-cr DOT de via Backup Central.
> |Forward SPAM to abuse AT backupcentral DOT com.
>
> +----------------------------------------------------------------------
>
> To sign off this list, send email to listserv AT listserv.temple DOT edu and
> type "signoff networker" in the body of the email. Please write to
> networker-request AT listserv.temple DOT edu if you have any problems with
> this list. You can access the archives at
> http://listserv.temple.edu/archives/networker.html or via RSS at
> http://listserv.temple.edu/cgi-bin/wa?RSS&L=NETWORKER
