Hi all,
I’m hoping someone has seen this before. I’m running NetBackup 5.1 MP6
with AIT-3. A weekly backup of about 126GB hangs within a few hours of
starting, every time. The error is always “media manager terminated by
parent process”, but the logs don’t seem to show anything odd. No other
backups hang like this. This is also the only job that runs on the
server itself.
bptm gives me:
03:28:45.470 [4999] <2> io_ioctl: command (1)MTFSF 1 from (bptm.c.8307) on drive index 1
03:28:45.530 [4999] <2> io_close: closing /usr/openv/netbackup/db/media/tpreq/AK6503, from bptm.c.8310
03:28:45.530 [4999] <2> catch_signal: EXITING with status 82
so I check bpbrm:
02:05:33.882 [4992] <2> bpbrm spawn_child: /usr/openv/netbackup/bin/bptm bptm -w -c foo.bar.com -den 17 -rt 6 -rn 0 -stunit Spectra2 -cl inbound -bt 1183968330 -b foo.bar.com _1183968330 -st 0 -cj 1 -p inbound -hostname foo.bar.com -ru root -rclnt foo.bar.com -rclnthostname foo.bar.com -rl 5 -rp 8035200 -sl ftpif -ct 0 -maxfrag 1048576 -tir -v -Z -mediasvr foo.bar.com -jobid 117926 -jobgrpid 117926 -masterversion 510000 -shm
02:05:33.884 [4992] <2> bpbrm write_continue_backup: wrote CONTINUE BACKUP on COMM_SOCK <4>
02:05:33.884 [4992] <2> bpbrm main: wrote /na270/pub/inbound on COMM_SOCK
02:05:33.884 [4992] <2> bpbrm main: wrote /na270/pub/ftp on COMM_SOCK
02:05:33.884 [4992] <2> bpbrm main: wrote CONTINUE on COMM_SOCK
02:05:33.885 [4992] <2> bpbrm main: ESTIMATE -1 -1 nbu0 foo.bar.com _1183968330
02:09:44.763 [4992] <2> bpbrm mm_sig: received ready signal from media manager
02:09:44.763 [4992] <2> bpbrm readline: retrying partial read from fgets ::
03:27:22.261 [4992] <2> bpbrm sighandler: signal 14 caught by bpbrm
03:27:22.272 [4992] <2> bpbrm sighandler: bpbrm timeout after 3600 seconds
03:27:22.287 [4992] <2> clear_held_signals: clearing signal mask stack, mask_stack_depth = 0
03:27:22.287 [4992] <2> bpbrm kill_child_process: start
03:27:22.287 [4992] <2> bpbrm wait_for_child: start
03:28:48.546 [4992] <2> bpbrm wait_for_child: child exit_status = 82 signal_status = 0
03:28:48.557 [4992] <2> inform_client_of_status: INF - Server status = 41
But I can’t figure out why there was a timeout. I checked all the
related logs: bpbkar just shows file writing stopping at 2:42am, as if
the process simply hangs there, with no errors. Looking right now, the
bpbrm and bpbkar processes for this backup are still running, but nothing
is happening. The job shows as active and everything is queueing up
behind it. I’ve also adjusted CLIENT_READ_TIMEOUT in
/usr/openv/netbackup/bp.conf, to no avail.
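
One way to narrow down where things stall is to scan a debug log for
the longest silences between consecutive entries. Below is a minimal
sketch in Python, assuming each entry starts with an HH:MM:SS.mmm
timestamp followed by a bracketed PID, as in the excerpts above. The
function name find_gaps and the 300-second threshold are hypothetical
choices for illustration, not anything NetBackup provides.

```python
import re
from datetime import datetime, timedelta

# Leading timestamp and PID of a debug log entry,
# e.g. "03:27:22.261 [4992] <2> bpbrm sighandler: ..."
TS_RE = re.compile(r"^(\d{2}:\d{2}:\d{2}\.\d{3})\s+\[(\d+)\]")


def find_gaps(lines, min_gap_seconds=300):
    """Return (gap_seconds, previous_entry, next_entry) for every pair of
    consecutive timestamped lines separated by at least min_gap_seconds."""
    gaps = []
    prev_ts = None
    prev_line = None
    for line in lines:
        match = TS_RE.match(line)
        if not match:
            # No timestamp: treat as the wrapped tail of the previous entry.
            continue
        ts = datetime.strptime(match.group(1), "%H:%M:%S.%f")
        if prev_ts is not None:
            delta = ts - prev_ts
            if delta < timedelta(0):
                # Timestamps carry no date; assume the log crossed midnight.
                delta += timedelta(days=1)
            if delta.total_seconds() >= min_gap_seconds:
                gaps.append((delta.total_seconds(), prev_line, line.rstrip()))
        prev_ts, prev_line = ts, line.rstrip()
    return gaps
```

Calling find_gaps(open(path)) on the bpbrm, bptm and bpbkar logs for the
same night, where path is whichever log file is being checked, and then
comparing where each one goes quiet should show which process stopped
making progress first.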
Can anyone point me in the right direction as to what I’m missing? I’m
guessing there’s something I’m not seeing in one of the logs.
-Aaron
Aaron Mills
Systems Administrator
Return Path, Inc.
http://www.returnpath.net
aaron.mills AT returnpath DOT net