[Networker] Networker 7.3.3 Storagenode nsrmmd cores

2008-01-03 07:46:25
From: Thorsten Linow <Thorsten.Linow AT TALKLINE DOT DE>
To: NETWORKER AT LISTSERV.TEMPLE DOT EDU
Date: Thu, 3 Jan 2008 13:44:52 +0100
Hi,
We have a NetWorker datazone based on a Solaris 9 server, some Solaris 
storage nodes and about 200 clients (Windows, Linux, Solaris...).
The server, all storage nodes and all clients were upgraded to 
NetWorker 7.3.3 about six weeks ago.
- After running into problems with hanging savegroups, we were advised to replace 
the savegrp binary with a 7.3.3 version, "build jainso-2007-274".
- After noticing lots of lost slots in our jukeboxes (VTL "DL720", EMC; ADIC 
i2k; Powderhorn), we were advised to replace the nsrd binary with a 7.3.3 version, 
"build varmas-2007-221"
  (the new nsrd binary fixes the lost-slots problem only partly...).
 
We're doing D2D2T backup (disk backup to our virtual tape libraries 
based on the DL720) and then, during the daytime, clone the virtual tapes to the 
physical tape libraries in a scripted way (a Perl-based script running lots of 
nsrclone commands). All clients are backed up by the Legato server; only some of the 
storage nodes back up their own data to their own storage-node devices. Only one 
storage node has been declared as the clone storage node and runs all the 
clone jobs.
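
For readers who want to picture the scripted cloning step, here is a minimal sketch in Python (the original script is Perl and is not shown in this post). It only builds nsrclone command lines from a list of save set IDs and runs them in small batches. The pool name, the save set IDs and the batch size are hypothetical placeholders; the options used (-s server, -b pool, -S save set IDs) are standard nsrclone options, but check the man page of your release before relying on them.

```python
import subprocess

def build_nsrclone_cmd(server, pool, ssids):
    """Return the argv list for cloning the given save set IDs."""
    if not ssids:
        raise ValueError("no save set IDs given")
    # -s: NetWorker server, -b: destination clone pool, -S: clone by save set ID
    return ["nsrclone", "-s", server, "-b", pool, "-S", *ssids]

def clone_batches(server, pool, ssids, batch_size=4, dry_run=True):
    """Clone save sets in batches; with dry_run=True nothing is executed."""
    results = []
    for i in range(0, len(ssids), batch_size):
        cmd = build_nsrclone_cmd(server, pool, ssids[i:i + batch_size])
        if dry_run:
            # Only record what would be run.
            results.append((cmd, None))
        else:
            proc = subprocess.run(cmd, capture_output=True, text=True)
            results.append((cmd, proc.returncode))
    return results

if __name__ == "__main__":
    # Hypothetical pool name and save set IDs, for illustration only.
    for cmd, rc in clone_batches("LNS.domainname.de", "Clone Pool",
                                 ["1001", "1002", "1003", "1004", "1005"]):
        print(" ".join(cmd), "->", rc)
```

Running the commands in batches keeps a single failing nsrclone call from taking the whole cloning window with it, and the recorded return codes show which batch needs a retry.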
For approximately three weeks now we have had problems with the nsrmmd processes on the 
clone storage node. The nsrclone job fails with the following errors (from 
the NetWorker server's daemon.log):
 
01/03/08 13:09:30 nsrd: media info: can not read record 0 of file 48 on 9940B tape JB10815W
01/03/08 13:09:30 nsrd: cloning session:1 of 4 save set(s) reading from JB10815W 2888 MB of 8149 MB
01/03/08 13:09:31 ansrd: ansrd_clone FAILED: errnum is 25004 and errstr is RPC receive operation failed.  A network connection could not be established with the host.
01/03/08 13:09:31 ansrd: failed to execute MODE_CLONE
01/03/08 13:09:31 nsrd: SN1.domainname.de:cloning session done saving
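
To spot such failures without reading the whole daemon.log, something like the following Python sketch could be used. It assumes that every entry starts with an "MM/DD/YY HH:MM:SS daemon:" prefix, as in the excerpt above, and treats any other non-empty line as a continuation of the previous entry (which also copes with log text that a mail client has wrapped). The function names and the two search strings are illustrative choices, not part of NetWorker.

```python
import re
from datetime import datetime

# Timestamp, daemon name, then the message text.
ENTRY_RE = re.compile(r"^(\d{2}/\d{2}/\d{2} \d{2}:\d{2}:\d{2}) (\w+): (.*)$")

def parse_daemon_log(lines):
    """Turn raw log lines into a list of {time, daemon, message} entries."""
    entries = []
    for raw in lines:
        line = raw.rstrip("\n")
        m = ENTRY_RE.match(line)
        if m:
            ts = datetime.strptime(m.group(1), "%m/%d/%y %H:%M:%S")
            entries.append({"time": ts,
                            "daemon": m.group(2),
                            "message": m.group(3).strip()})
        elif entries and line.strip():
            # Continuation of the previous entry (wrapped line).
            entries[-1]["message"] += " " + line.strip()
    return entries

def clone_failures(entries):
    """Return the entries that indicate a read error or a failed clone."""
    markers = ("FAILED", "can not read record")
    return [e for e in entries if any(m in e["message"] for m in markers)]

if __name__ == "__main__":
    with open("daemon.log") as fh:  # path is an assumption
        for e in clone_failures(parse_daemon_log(fh)):
            print(e["time"], e["daemon"], e["message"])
```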

where JB10815W is the virtual volume name
and SN1.domainname.de is the fully qualified hostname of the clone storage node.
 
Forget about the part "A network connection could not be established ..."; 
our support has already told me that this message is just misleading.
 
The nsrmmd on the clone storage node died, writing a core dump.
I used Sun's core file analysis tool (MDeBug Rev 1.0) and got the following output:
 
  ******************************************************************************
  Application core Dump Analysis Output                     MDeBug Rev 1.0
  Thu Jan  3 12:20:45 MET 2008                   Files: /usr/sbin/nsrmmd core.SN1.nsrmmd.0.1
  ******************************************************************************
 
 
 
                ** Core file status **
                ------------------------
debugging core file of nsrmmd (64-bit) from tom
executable file: /usr/sbin/nsrmmd
initial argv: /usr/sbin/nsrmmd -n 502 -s LNS.domainname.de
threading model: multi-threaded
status: process terminated by SIGSEGV (Segmentation Fault)
 

                ** Thread stack($c) **
                ----------------------
0x10005a99c(100662e80, ffffffff7fff9f90, 1005947b0, 1bc0, ffffffff7fff9cc0, ffffffff7fff9d10)
fixedrec_retrieve+0x400(100662e80, 10, 10059ef70, 0, ffffffff7fffa0d0, 1b78)
retrieve_next+0x124(10059ef70, ffffffff7fffa0e0, 0, 0, ffffffff7fffa0d0, 0)
0x100050220(0, 0, 1002af690, 1298, 1290, 1005aa750)
main+0x9a8(5, bc00, bc00, 1000, ffffffff7ffff7d0, 100592f20)
_start+0x7c(0, 0, 0, 0, 0, 0)
 

                ** Shared objects **
                ----------------------
            BASE            LIMIT             SIZE NAME
       100000000        1001a0000           1a0000 /usr/sbin/nsrmmd
ffffffff7f200000 ffffffff7f208000             8000 /usr/lib/sparcv9/libaio.so.1
ffffffff7f000000 ffffffff7f00c000             c000 /usr/lib/sparcv9/libsocket.so.1
ffffffff7ee00000 ffffffff7eea8000            a8000 /usr/lib/sparcv9/libnsl.so.1
ffffffff7eb00000 ffffffff7eb48000            48000 /usr/lib/sparcv9/libresolv.so.2
ffffffff7ea00000 ffffffff7ea02000             2000 /usr/lib/sparcv9/libdl.so.1
ffffffff7e800000 ffffffff7e808000             8000 /usr/lib/sparcv9/libgen.so.1
ffffffff7e600000 ffffffff7e604000             4000 /usr/lib/sparcv9/libkvm.so.1
ffffffff7e300000 ffffffff7e31e000            1e000 /usr/lib/sparcv9/libelf.so.1
ffffffff7e100000 ffffffff7e11a000            1a000 /usr/lib/sparcv9/libadm.so.1
ffffffff7df00000 ffffffff7df06000             6000 /usr/lib/sparcv9/libpthread.so.1
ffffffff7dd00000 ffffffff7dd1a000            1a000 /usr/lib/sparcv9/libthread.so.1
ffffffff7da00000 ffffffff7da08000             8000 /usr/lib/sparcv9/librt.so.1
ffffffff7d800000 ffffffff7d8b6000            b6000 /usr/lib/sparcv9/libc.so.1
ffffffff7d600000 ffffffff7d604000             4000 /usr/lib/sparcv9/libmp.so.2
ffffffff7d300000 ffffffff7d302000             2000 /usr/lib/sparcv9/libmd5.so.1
ffffffff7f400000 ffffffff7f402000             2000 /usr/platform/sun4u-us3/lib/sparcv9/libc_psr.so.1
ffffffff7cc00000 ffffffff7cc02000             2000 /usr/lib/iconv/sparcv9/UTF-8%8859-1.so
ffffffff7ca00000 ffffffff7ca06000             6000 /usr/lib/sparcv9/nss_files.so.1
ffffffff7c600000 ffffffff7c602000             2000 /usr/lib/iconv/sparcv9/8859-1%UTF-8.so
ffffffff7f600000 ffffffff7f630000            30000 /usr/lib/sparcv9/ld.so.1
 

                Thread stack for MT app
                ------------------------
stack pointer for thread 1: ffffffff7fff9461
[ ffffffff7fff9461 0x10005a99c() ]
  ffffffff7fff95e1 fixedrec_retrieve+0x400()
  ffffffff7fff9701 retrieve_next+0x124()
  ffffffff7fff9821 0x100050220()
  ffffffff7fffb8f1 main+0x9a8()
  ffffffff7fffefe1 _start+0x7c()
 
******************************************************************************
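
The two stack frames without symbol names (0x10005a99c and 0x100050220) can be matched against the shared objects table to see which object they belong to. The Python sketch below does this for a few rows copied from that table; it assumes that LIMIT is the first address past the mapping. Both addresses fall inside /usr/sbin/nsrmmd itself, which suggests that the crash happened in NetWorker code rather than in a system library.

```python
# (base, limit, name) rows copied from the "Shared objects" table above.
SHARED_OBJECTS = [
    (0x100000000, 0x1001a0000, "/usr/sbin/nsrmmd"),
    (0xffffffff7dd00000, 0xffffffff7dd1a000, "/usr/lib/sparcv9/libthread.so.1"),
    (0xffffffff7d800000, 0xffffffff7d8b6000, "/usr/lib/sparcv9/libc.so.1"),
]

def owning_object(addr, objects=SHARED_OBJECTS):
    """Return (object name, offset from base) for addr, or None if unmapped."""
    for base, limit, name in objects:
        # Assumption: limit is exclusive.
        if base <= addr < limit:
            return name, addr - base
    return None

if __name__ == "__main__":
    for frame in (0x10005a99c, 0x100050220):
        name, offset = owning_object(frame)
        print(hex(frame), "->", name, "+", hex(offset))
```

The offsets (0x5a99c and 0x50220) are the kind of detail support can match against an unstripped build of the binary.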

The corresponding source device (VTL) on the clone storage node hangs in an 
undefined state, "ready for reading, idle" or sometimes "moving backward 2 
files". 
Unmounting this hanging source device is not possible.
Only restarting NetWorker on the clone storage node resolves the situation.
 
I found a similar complaint from Oscar Olsson, "Rant about 7.3.3", posted some time ago 
here on the list, but no useful answer.
 
Any clue?

 