[Veritas-bu] 41's on Solaris while connecting to localhost

2002-12-28 06:53:20
Subject: [Veritas-bu] 41's on Solaris while connecting to localhost
From: rafamiga AT uucp.polbox DOT pl (rafal wiosna)
Date: Sat, 28 Dec 2002 12:53:20 +0100
        I've been getting lots of error 41s on the jobs screen lately. Funny, I
swear I didn't change a thing in the last few days. I'm getting this _ONLY_
when backing up the backup server itself, which is in fact also a client with
a 260+ GB slice of our disk array -- we store rsynced copies of some
hard-to-reach remote servers there [or of Linux servers with libc/glibc2.0 on
which the NB client does not run]. The server is a Solaris machine, and I
tried to truss the bpbkar and bptm processes -- on bpbkar I'm getting reads
returning 0 on file descriptor 1, and the bptm process seems to be hung [no
truss output]. Note that this only happens on this backup-server-client, not
on remote client jobs.

        I had this situation recently and I believe turning off move
detection helped to solve it, but that's not a remedy: move detection [the
"bug fix" feature] is rather important to me.

        Also, I noticed that for some reason one of the server directories I
back up has missing TIR info, while the others that I back up in the same
job have all of it in place. This results in the server backing up all the
data for inc-cumulative backups of this directory only. I'm not 100% sure
whether the TIR info for this directory disappeared or was never written in
the first place. It could be the result of killing bptm and bpbkar together
to get rid of an unstoppable job that had been hanging for 4-5 hours with log
entries like this:

11:43:21.950 [26234] <2> bpbrm sighandler: signal 14 caught by bpbrm
11:43:21.950 [26234] <2> bpbrm sighandler: bpbrm timeout after 300 seconds
11:43:21.950 [26234] <2> bpbrm kill_child_process: start
11:43:21.951 [26234] <2> bpbrm wait_for_child: start
11:44:52.348 [26234] <2> bpbrm wait_for_child: child exit_status = 82 signal_status = 0
11:44:52.348 [26234] <2> inform_client_of_status: INF - Server status = 41
11:46:21.354 [26234] <2> OpenMailPipe: /usr/ucb/mail ................................
11:46:21.364 [26234] <2> OpenMailPipe: Before subject string write
11:46:21.365 [26234] <2> OpenMailPipe: After subject string write
11:46:21.371 [26234] <2> bpbrm Exit: client backup EXIT STATUS 41: network connection timed out
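
        To line entries like these up and see where the time goes, here is a
rough Python sketch that parses lines in the format shown above and reports
the gap between consecutive entries. The regular expression is only inferred
from this excerpt, and the names parse_line and gaps are made up for
illustration; none of this comes from NetBackup itself.

```python
import re
from datetime import datetime

# Format inferred from the excerpt above (an assumption, not a documented spec):
#   HH:MM:SS.mmm [pid] <level> message
LINE_RE = re.compile(r"^(\d{2}:\d{2}:\d{2}\.\d{3}) \[(\d+)\] <(\d+)> (.*)$")


def parse_line(line):
    """Return (time, pid, level, message), or None if the line does not match."""
    m = LINE_RE.match(line.strip())
    if m is None:
        return None
    stamp, pid, level, message = m.groups()
    return datetime.strptime(stamp, "%H:%M:%S.%f"), int(pid), int(level), message


def gaps(lines):
    """Yield (seconds_since_previous_entry, message) for each parsed line.

    Lines without a timestamp (for example wrapped continuation lines) are
    skipped. The log carries no date, so a run that crosses midnight is not
    handled here.
    """
    previous = None
    for line in lines:
        parsed = parse_line(line)
        if parsed is None:
            continue
        when, _pid, _level, message = parsed
        delta = 0.0 if previous is None else (when - previous).total_seconds()
        previous = when
        yield delta, message
```

Feeding it the excerpt above makes the two long waits stand out: roughly 90
seconds between wait_for_child starting and the child exiting, and another
89 seconds or so before the mail pipe is opened.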

        I'm curious why bpbrm gets sig 14/timeout -- from watching another
process or by itself? Which child is this log talking about [what processes
does it fork]?
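
        For what it's worth, signal 14 on Solaris is SIGALRM, which a process
normally arms on itself with alarm(). The sketch below shows that generic
pattern in Python: a parent sets an alarm, waits for a child, and kills the
child when the alarm fires. It is only an illustration under the assumption
that bpbrm does something similar; I don't know that it does. The names
run_with_alarm and Timeout are hypothetical.

```python
import signal
import subprocess
import sys


class Timeout(Exception):
    """Raised by the SIGALRM handler when the timer expires."""


def _on_alarm(signum, frame):
    # Runs in the waiting process itself; nothing external sends the signal.
    raise Timeout(f"signal {signum} caught")


def run_with_alarm(cmd, seconds):
    """Run cmd, returning its exit status, or None if the alarm fired first.

    Assumption: this only mimics the 'timeout, kill child, wait for child'
    sequence seen in the log; it is not taken from NetBackup.
    Must be called from the main thread on a Unix-like system.
    """
    signal.signal(signal.SIGALRM, _on_alarm)
    child = subprocess.Popen(cmd)
    signal.alarm(seconds)  # the kernel delivers SIGALRM to us after `seconds`
    try:
        return child.wait()
    except Timeout:
        child.kill()  # corresponds loosely to kill_child_process in the log
        child.wait()  # reap the child, like wait_for_child
        return None
    finally:
        signal.alarm(0)  # cancel any alarm still pending


if __name__ == "__main__":
    sleeper = [sys.executable, "-c", "import time; time.sleep(30)"]
    print(run_with_alarm(sleeper, 1))
```

If bpbrm works along these lines, the 300 seconds in the log would be its own
timer expiring rather than another process signalling it, but that is a guess.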

        I _DO_ have NFS volumes on this Solaris 8+8_recommended machine, but
the 41s also happen on VxFS volumes mounted from the disk array [checked the
Faq-O-Matic first].

        It all started 2 days ago; previously I didn't get any 41s while
doing inc-diff and inc-cumm type backups.

        Does anyone have experience with error 41 and bpbkar/bpbrm/bptm
behaving strangely? Is there anything I could check?

        NB 4.5GA with NB_45_2 applied.

-- 
__________________________________________________________________________
rafal wiosna * TDC Internet Polska S.A. * Polbox * In ARP we trust * AR164
RAFD-RIPE * PGP key is available at www.se.pgp.net (ID: 3CDCB7A9)
