Re: [Networker] glibc on RHEL4 x86-64 and nsrexecd core dump

> -----Original Message-----
> From: EMC NetWorker discussion 
> [mailto:NETWORKER AT LISTSERV.TEMPLE DOT EDU] On Behalf Of Preston de Guise
> Sent: Monday, February 11, 2008 6:57 PM
> To: NETWORKER AT LISTSERV.TEMPLE DOT EDU
> Subject: Re: [Networker] glibc on RHEL4 x86-64 and nsrexecd core dump
> 
> On 12/02/2008, at 10:28 AM, Preston de Guise wrote:
> 
> > Hi Patti,
> >
> >> I've looked through the archives and found a couple of references,
> >> mostly to the double free message below.  Suggestions have included
> >> disabling nsrauth for a client and modifying the environment variable
> >> MALLOC_CHECK_ to prevent nsrexecd from being killed immediately (both
> >> from 2 years ago).  I'm seeing one or the other of these messages each
> >> time nsrexecd dumps core.
> >>
> >> *** glibc detected *** double free or corruption (fasttop): 0x0000002a987ffc30 ***
> >> *** glibc detected *** corrupted double-linked list: 0x00000037e6c316b8 ***
> >>
> >> I am running 64-bit NetWorker v7.3.3 on RHEL4 ES Update 6 x86_64.  If
> >> I restart NetWorker and the group in question, everything is fine for
> >> a while (a day, a few days, a week, ...) and then it happens again.
> >> I have a separate, smaller system running the 32-bit OS and NetWorker,
> >> and it does not have this issue.
> >
> > Are you running staging/disk backup units?
> >
> > I had a customer with this problem, and it turned out their NetWorker
> > server was occasionally trying to write disk backups to the read-only
> > "portion" of the adv_file devices. When nsrclone/nsrstage/recover/etc.
> > went to read those savesets, nsrmmd would crash and respawn, which
> > caused the error you're citing above. If MALLOC_CHECK_ was set to just
> > warn rather than crash, nsrexecd would eventually consume too much
> > shared memory and NetWorker would need to be restarted on the server
> > or, worst case, the server rebooted.
> 
> 
> I forgot to add - it's relatively easy to check for this; just do a
> directory listing of any disk backup units, and if you have any
> long-ssid named files appearing under the _AF_readonly subdirectory,
> then you've got the bug^H^H^Hfeature that requires the fix I outlined
> in my previous email.
> 
> Cheers,
> 
> Preston.
> 
> --
> Preston de Guise
> 
> 

No, not using any disk backup units.  Straight to tape.
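
For anyone reading along who does have disk backup units, the directory
check described in the quoted message could be scripted roughly as follows.
This is only a sketch: the function name and the example path are made up
for illustration, and because the thread does not say exactly what a
long-ssid file name looks like, it simply lists every regular file found
under any _AF_readonly directory and leaves the judgment to the reader.

```python
import os


def files_under_af_readonly(device_root):
    """Return sorted paths of files found beneath any _AF_readonly directory.

    device_root is the top-level directory of a disk backup unit
    (hypothetical argument; pass whatever path your adv_file device uses).
    """
    hits = []
    for dirpath, _dirnames, filenames in os.walk(device_root):
        # Path components of the current directory, relative to the root.
        parts = os.path.relpath(dirpath, device_root).split(os.sep)
        # Only report files whose path passes through an _AF_readonly directory.
        if "_AF_readonly" in parts:
            for name in filenames:
                hits.append(os.path.join(dirpath, name))
    return sorted(hits)


# Example call (the path is an assumption, not taken from this thread):
# for path in files_under_af_readonly("/backup/adv_file_01"):
#     print(path)
```

An empty result would suggest nothing has been written to the read-only
side; any paths returned are worth inspecting by hand.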

Patti 
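
On the MALLOC_CHECK_ suggestion from the archives: according to the glibc
mallopt documentation, a value of 0 silently ignores detected heap
corruption, 1 prints a diagnostic on stderr and continues, and 2 aborts
immediately. A minimal sketch of launching a process with that variable set
is below. The helper names are hypothetical, and the command in the final
comment is an assumption rather than something confirmed in this thread.

```python
import os
import subprocess


def env_with_malloc_check(level=1, base_env=None):
    """Return a copy of the environment with MALLOC_CHECK_ set to `level`."""
    if level not in (0, 1, 2):
        raise ValueError("sketch only handles MALLOC_CHECK_ values 0, 1 and 2")
    # Copy so the calling process's own environment is left untouched.
    env = dict(os.environ if base_env is None else base_env)
    env["MALLOC_CHECK_"] = str(level)
    return env


def run_with_malloc_check(cmd, level=1):
    """Run `cmd` (a list of arguments) with MALLOC_CHECK_ set for it only."""
    return subprocess.run(cmd, env=env_with_malloc_check(level))


# Example (assumed command; substitute however NetWorker is started locally):
# run_with_malloc_check(["/etc/init.d/networker", "start"], level=1)
```

As the quoted message points out, this only keeps the daemon from being
killed; it does not address whatever is corrupting the heap.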

To sign off this list, send email to listserv AT listserv.temple DOT edu and 
type "signoff networker" in the body of the email. Please write to 
networker-request AT listserv.temple DOT edu if you have any problems with this 
list. You can access the archives at 
http://listserv.temple.edu/archives/networker.html or
via RSS at http://listserv.temple.edu/cgi-bin/wa?RSS&L=NETWORKER