
[Veritas-bu] same job keeps hanging.

2007-07-09 13:25:51
Subject: [Veritas-bu] same job keeps hanging.
From: "Aaron Mills" <aaron.mills AT returnpath DOT net>
To: <veritas-bu AT mailman.eng.auburn DOT edu>
Date: Mon, 9 Jul 2007 12:57:53 -0400

Hi all,

 

I’m hoping someone’s seen this before. I’m running 5.1MP6 w/ AIT3. I’ve got a ~126GB backup that kicks off weekly but hangs within a few hours every time. The error I get is always “media manager terminated by parent process”, but the logs don’t seem to show anything odd. No other backups hang like this. This is also the only job that runs on the server itself.

 

bptm gives me:

 

03:28:45.470 [4999] <2> io_ioctl: command (1)MTFSF 1 from (bptm.c.8307) on drive index 1
03:28:45.530 [4999] <2> io_close: closing /usr/openv/netbackup/db/media/tpreq/AK6503, from bptm.c.8310
03:28:45.530 [4999] <2> catch_signal: EXITING with status 82

 

so I check bpbrm:

 

02:05:33.882 [4992] <2> bpbrm spawn_child: /usr/openv/netbackup/bin/bptm bptm -w -c foo.bar.com -den 17 -rt 6 -rn 0 -stunit Spectra2 -cl inbound -bt 1183968330 -b foo.bar.com _1183968330 -st 0 -cj 1 -p inbound -hostname foo.bar.com -ru root -rclnt foo.bar.com -rclnthostname foo.bar.com -rl 5 -rp 8035200 -sl ftpif -ct 0 -maxfrag 1048576 -tir -v -Z -mediasvr foo.bar.com -jobid 117926 -jobgrpid 117926 -masterversion 510000 -shm
02:05:33.884 [4992] <2> bpbrm write_continue_backup: wrote CONTINUE BACKUP on COMM_SOCK <4>
02:05:33.884 [4992] <2> bpbrm main: wrote /na270/pub/inbound on COMM_SOCK
02:05:33.884 [4992] <2> bpbrm main: wrote /na270/pub/ftp on COMM_SOCK
02:05:33.884 [4992] <2> bpbrm main: wrote CONTINUE on COMM_SOCK
02:05:33.885 [4992] <2> bpbrm main: ESTIMATE -1 -1 nbu0 foo.bar.com _1183968330
02:09:44.763 [4992] <2> bpbrm mm_sig: received ready signal from media manager
02:09:44.763 [4992] <2> bpbrm readline: retrying partial read from fgets ::
03:27:22.261 [4992] <2> bpbrm sighandler: signal 14 caught by bpbrm
03:27:22.272 [4992] <2> bpbrm sighandler: bpbrm timeout after 3600 seconds
03:27:22.287 [4992] <2> clear_held_signals: clearing signal mask stack, mask_stack_depth = 0
03:27:22.287 [4992] <2> bpbrm kill_child_process: start
03:27:22.287 [4992] <2> bpbrm wait_for_child: start
03:28:48.546 [4992] <2> bpbrm wait_for_child: child exit_status = 82 signal_status = 0
03:28:48.557 [4992] <2> inform_client_of_status: INF - Server status = 41

 

But I can’t seem to figure out why there was a timeout. I checked all the related logs: bpbkar just shows file writing stopping at 2:42am, as if the process simply hangs there, with no errors. Looking right now, the bpbrm and bpbkar processes for this backup are still running, but nothing is happening. The job shows as active and everything is queueing up behind it. I’ve also adjusted CLIENT_READ_TIMEOUT in /usr/openv/netbackup/bp.conf, to no avail.
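
For anyone wanting to pin down where a log like this goes quiet, one option is to scan it for large gaps between consecutive timestamps. Below is a minimal Python sketch, assuming only the "HH:MM:SS.mmm [pid]" line prefix visible in the bptm and bpbrm excerpts in this message. The function name find_gaps and the 600-second threshold are made up for illustration and are not part of NetBackup.

```python
import re
from datetime import datetime

# Matches the "HH:MM:SS.mmm [pid]" prefix seen in the log excerpts.
# Assumption: every log line of interest starts this way.
LINE_RE = re.compile(r"^(\d{2}:\d{2}:\d{2}\.\d{3}) \[(\d+)\]")


def find_gaps(lines, threshold_seconds=600):
    """Return (previous_line, next_line, gap_seconds) for each quiet period.

    A quiet period is any pair of consecutive timestamped lines that are
    at least threshold_seconds apart. Lines without a timestamp prefix
    are skipped.
    """
    gaps = []
    prev_time = None
    prev_line = None
    for line in lines:
        match = LINE_RE.match(line)
        if not match:
            continue
        current = datetime.strptime(match.group(1), "%H:%M:%S.%f")
        if prev_time is not None:
            delta = (current - prev_time).total_seconds()
            if delta < 0:
                # Assumption: a negative difference means the log
                # crossed midnight, so add one day.
                delta += 86400
            if delta >= threshold_seconds:
                gaps.append((prev_line, line, delta))
        prev_time, prev_line = current, line
    return gaps


if __name__ == "__main__":
    import sys

    # Usage: python find_gaps.py <logfile>
    with open(sys.argv[1], encoding="utf-8", errors="replace") as handle:
        for before, after, seconds in find_gaps(handle):
            print(f"{seconds:.0f}s of silence between:")
            print(f"  {before.rstrip()}")
            print(f"  {after.rstrip()}")
```

Fed the bpbrm excerpt from this message, the sketch reports a single quiet stretch, between the 02:09:44.763 readline entry and the 03:27:22.261 signal 14 entry. Running the same scan over the bpbkar log would help line up when each process stopped logging.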

 

Can anyone point me in the right direction as to what I’m missing? I’m guessing there’s something I’m not seeing in one of the logs.

 

            -Aaron

 

Aaron Mills

Systems Administrator

Return Path, Inc.

http://www.returnpath.net

aaron.mills AT returnpath DOT net

 

 
