Hi,
We have a NetWorker datazone based on a Solaris 9 server, some Solaris
storage nodes and about 200 clients (Windows, Linux, Solaris...).
The server, all storage nodes and all clients were upgraded to
NetWorker 7.3.3 about 6 weeks ago.
- After noticing problems with hanging savegroups, we were advised to replace
the savegrp binary with a 7.3.3 version, "build jainso-2007-274".
- After noticing lots of lost slots in our jukeboxes (VTL DL720, EMC; ADIC
i2k; Powderhorn), we were advised to replace the nsrd binary with a 7.3.3
version, "build varmas-2007-221"
(the new nsrd binary fixes the lost-slots problem only partly...).
We are doing a D2D2T backup: disk backup to our virtual tape libraries
(based on the DL720), and then during the daytime we clone the virtual tapes
to the physical tape libraries with a script (a Perl-based script that runs
lots of nsrclone commands). All clients are backed up by the Legato server;
only some of the storage nodes back up their own data to their own storage
node devices. Only one storage node has been declared as the clone storage
node, and it does all the clone jobs.
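
For readers who want to picture the scripted cloning, here is a minimal sketch in Python (our real script is Perl, so this is not it). It assumes that nsrclone accepts -s for the server and -b for the destination clone pool, as described in the nsrclone man page; please check this against your release. The pool name "ClonePool" and the function names are hypothetical.

```python
import subprocess
from typing import Callable, Iterable, List, Sequence, Tuple


def build_clone_command(server: str, pool: str, volume: str) -> List[str]:
    """Build one nsrclone call for a single source volume.

    Assumption: -s names the NetWorker server and -b the destination
    clone pool; verify against the man page of your release.
    """
    return ["nsrclone", "-s", server, "-b", pool, volume]


def clone_volumes(
    server: str,
    pool: str,
    volumes: Iterable[str],
    run: Callable[[Sequence[str]], int] = subprocess.call,
) -> List[Tuple[str, int]]:
    """Clone each volume in turn; return (volume, exit status) for failures."""
    failures = []
    for volume in volumes:
        status = run(build_clone_command(server, pool, volume))
        if status != 0:
            # Keep going so one bad volume does not block the rest.
            failures.append((volume, status))
    return failures
```

Called as clone_volumes("LNS.domainname.de", "ClonePool", ["JB10815W"]), it returns the list of volumes whose nsrclone run ended with a non-zero exit status, so they can be retried or reported.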
For approx. 3 weeks now we have had problems with the nsrmmd processes on the
clone storage node. The nsrclone job fails with the following errors (from the
NetWorker server's daemon.log):
01/03/08 13:09:30 nsrd: media info: can not read record 0 of file 48 on 9940B tape JB10815W
01/03/08 13:09:30 nsrd: cloning session:1 of 4 save set(s) reading from JB10815W 2888 MB of 8149 MB
01/03/08 13:09:31 ansrd: ansrd_clone FAILED: errnum is 25004 and errstr is RPC receive operation failed. A network connection could not be established with the host.
01/03/08 13:09:31 ansrd: failed to execute MODE_CLONE
01/03/08 13:09:31 nsrd: SN1.domainname.de:cloning session done saving
where JB10815W is the virtual volume name
and SN1.domainname.de is the fully qualified hostname of the clone storage node.
Ignore the part "A network connection could not be established ..."; our
support has already told me that this message is just misleading.
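
To spot these failures without reading daemon.log by hand, something like the following Python sketch could be used. It assumes entries start with a "MM/DD/YY HH:MM:SS daemon:" prefix, as in the excerpt above, and treats any other non-empty line as a wrapped continuation of the previous entry. The function names are hypothetical, and the match on "FAILED" / "failed to execute" is only based on the two messages shown here.

```python
import re
from typing import Dict, Iterable, List, Optional

# Entry prefix as seen in the excerpt, e.g.
# "01/03/08 13:09:31 ansrd: ansrd_clone FAILED: errnum is 25004 ..."
ENTRY_START = re.compile(r"^(\d{2}/\d{2}/\d{2} \d{2}:\d{2}:\d{2}) (\w+): (.*)$")
ERRNUM = re.compile(r"errnum is (\d+)")


def join_entries(lines: Iterable[str]) -> List[Dict[str, str]]:
    """Merge wrapped continuation lines back into whole log entries."""
    entries: List[Dict[str, str]] = []
    for raw in lines:
        line = raw.rstrip("\n")
        match = ENTRY_START.match(line)
        if match:
            entries.append(
                {"time": match.group(1), "daemon": match.group(2), "text": match.group(3)}
            )
        elif entries and line.strip():
            # No timestamp: assume this line continues the previous entry.
            entries[-1]["text"] += " " + line.strip()
    return entries


def clone_failures(lines: Iterable[str]) -> List[Dict[str, Optional[object]]]:
    """Return entries that look like clone failures, with errnum if present."""
    failures = []
    for entry in join_entries(lines):
        text = entry["text"]
        if "FAILED" in text or "failed to execute" in text:
            found = ERRNUM.search(text)
            failures.append(dict(entry, errnum=int(found.group(1)) if found else None))
    return failures
```

Feeding it the lines of daemon.log (for example via open(path) on the server's log file) yields one dictionary per suspicious entry with its timestamp, daemon name, full text and error number.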
The nsrmmd on the clone storage node died and wrote a core dump.
I used Sun's core file analysis (MDeBug Rev 1.0) and got the following output:
******************************************************************************
Application core Dump Analysis Output MDeBug Rev 1.0
Thu Jan 3 12:20:45 MET 2008 Files: /usr/sbin/nsrmmd core.SN1.nsrmmd.0.1
******************************************************************************
** Core file status **
------------------------
debugging core file of nsrmmd (64-bit) from tom
executable file: /usr/sbin/nsrmmd
initial argv: /usr/sbin/nsrmmd -n 502 -s LNS.domainname.de
threading model: multi-threaded
status: process terminated by SIGSEGV (Segmentation Fault)
** Thread stack($c) **
----------------------
0x10005a99c(100662e80, ffffffff7fff9f90, 1005947b0, 1bc0, ffffffff7fff9cc0, ffffffff7fff9d10)
fixedrec_retrieve+0x400(100662e80, 10, 10059ef70, 0, ffffffff7fffa0d0, 1b78)
retrieve_next+0x124(10059ef70, ffffffff7fffa0e0, 0, 0, ffffffff7fffa0d0, 0)
0x100050220(0, 0, 1002af690, 1298, 1290, 1005aa750)
main+0x9a8(5, bc00, bc00, 1000, ffffffff7ffff7d0, 100592f20)
_start+0x7c(0, 0, 0, 0, 0, 0)
** Shared objects **
----------------------
BASE             LIMIT            SIZE   NAME
100000000        1001a0000        1a0000 /usr/sbin/nsrmmd
ffffffff7f200000 ffffffff7f208000 8000   /usr/lib/sparcv9/libaio.so.1
ffffffff7f000000 ffffffff7f00c000 c000   /usr/lib/sparcv9/libsocket.so.1
ffffffff7ee00000 ffffffff7eea8000 a8000  /usr/lib/sparcv9/libnsl.so.1
ffffffff7eb00000 ffffffff7eb48000 48000  /usr/lib/sparcv9/libresolv.so.2
ffffffff7ea00000 ffffffff7ea02000 2000   /usr/lib/sparcv9/libdl.so.1
ffffffff7e800000 ffffffff7e808000 8000   /usr/lib/sparcv9/libgen.so.1
ffffffff7e600000 ffffffff7e604000 4000   /usr/lib/sparcv9/libkvm.so.1
ffffffff7e300000 ffffffff7e31e000 1e000  /usr/lib/sparcv9/libelf.so.1
ffffffff7e100000 ffffffff7e11a000 1a000  /usr/lib/sparcv9/libadm.so.1
ffffffff7df00000 ffffffff7df06000 6000   /usr/lib/sparcv9/libpthread.so.1
ffffffff7dd00000 ffffffff7dd1a000 1a000  /usr/lib/sparcv9/libthread.so.1
ffffffff7da00000 ffffffff7da08000 8000   /usr/lib/sparcv9/librt.so.1
ffffffff7d800000 ffffffff7d8b6000 b6000  /usr/lib/sparcv9/libc.so.1
ffffffff7d600000 ffffffff7d604000 4000   /usr/lib/sparcv9/libmp.so.2
ffffffff7d300000 ffffffff7d302000 2000   /usr/lib/sparcv9/libmd5.so.1
ffffffff7f400000 ffffffff7f402000 2000   /usr/platform/sun4u-us3/lib/sparcv9/libc_psr.so.1
ffffffff7cc00000 ffffffff7cc02000 2000   /usr/lib/iconv/sparcv9/UTF-8%8859-1.so
ffffffff7ca00000 ffffffff7ca06000 6000   /usr/lib/sparcv9/nss_files.so.1
ffffffff7c600000 ffffffff7c602000 2000   /usr/lib/iconv/sparcv9/8859-1%UTF-8.so
ffffffff7f600000 ffffffff7f630000 30000  /usr/lib/sparcv9/ld.so.1
Thread stack for MT app
------------------------
stack pointer for thread 1: ffffffff7fff9461
[ ffffffff7fff9461 0x10005a99c() ]
ffffffff7fff95e1 fixedrec_retrieve+0x400()
ffffffff7fff9701 retrieve_next+0x124()
ffffffff7fff9821 0x100050220()
ffffffff7fffb8f1 main+0x9a8()
ffffffff7fffefe1 _start+0x7c()
******************************************************************************
The corresponding source device (VTL) on the clone storage node hangs in an
undefined state: "ready for reading, idle" or sometimes "moving backward 2
files".
Unmounting this hanging source device is not possible.
Only restarting NetWorker on the clone storage node resolves the situation.
I found a similar complaint from Oscar Olsson, "Rant about 7.3.3", some time
ago here on the forum, but no useful answer.
Any clue?