Veritas-bu

Re: [Veritas-bu] same job keeps hanging.

2007-07-10 08:39:27
Subject: Re: [Veritas-bu] same job keeps hanging.
From: ckstehman AT pepco DOT com
To: "Aaron Mills" <aaron.mills AT returnpath DOT net>
Date: Tue, 10 Jul 2007 08:21:54 -0400

If you are backing up files on a UNIX system, check whether there is a hung NFS mount.  I have had backups hang
because the "ls -l" command hangs on the mount point.  Just a thought.
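
One way to test for a hung mount without tying up a shell is to run the listing under a timeout. The following is only a sketch: the function names, the 5-second default, and the idea of passing the backup paths on the command line are illustrative assumptions, not anything NetBackup provides.

```python
import subprocess
import sys


def probe_command(argv, timeout_s=5.0):
    """Run argv and classify it as 'ok', 'error', or 'hung'."""
    proc = subprocess.Popen(
        argv, stdout=subprocess.DEVNULL, stderr=subprocess.DEVNULL
    )
    try:
        returncode = proc.wait(timeout=timeout_s)
    except subprocess.TimeoutExpired:
        # Best effort only: a process blocked in uninterruptible NFS I/O
        # may ignore the kill, so we deliberately do not wait for it again.
        proc.kill()
        return "hung"
    return "ok" if returncode == 0 else "error"


def probe_path(path, timeout_s=5.0):
    """Probe one path with the same `ls -l` that hangs on a dead mount."""
    return probe_command(["ls", "-l", path], timeout_s)


if __name__ == "__main__":
    # Paths come from the command line, e.g. the backup selections.
    for path in sys.argv[1:]:
        print(path, probe_path(path))
```

Running it against each directory in the backup selection (and any NFS mount points beneath them) should report "hung" for a path where the listing never returns within the timeout.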
=============================
Carl Stehman
IT Distributed Services Team
Pepco Holdings, Inc.
202-331-6619
Pager 301-765-2703
ckstehman AT pepco DOT com



"Aaron Mills" <aaron.mills AT returnpath DOT net>
Sent by: veritas-bu-bounces AT mailman.eng.auburn DOT edu

07/09/2007 04:39 PM

To: "Liddle, Stuart" <liddles AT amgen DOT com>, <veritas-bu AT mailman.eng.auburn DOT edu>
cc:
Subject: Re: [Veritas-bu] same job keeps hanging.





Actually, I’ve done that before with the same results. We upped the timeouts to around 10,000 seconds to no avail. It’s as though at some point the backups just hang for no good reason.
 
A quick find shows that I’m backing up roughly 29,000 files – that shouldn’t take too long to enumerate, should it?
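
For a rough feel of how long a plain directory walk takes, the count can be timed. This is a sketch under assumptions: `count_files` and `timed_count` are hypothetical helper names, and `os.walk` only approximates a `find`; it says nothing definite about how NetBackup itself enumerates files.

```python
import os
import time


def count_files(root):
    """Count non-directory entries under root (symlinked dirs not followed)."""
    total = 0
    for _dirpath, _dirnames, filenames in os.walk(root):
        total += len(filenames)
    return total


def timed_count(root):
    """Return (file_count, elapsed_seconds) for one walk of root."""
    start = time.monotonic()
    count = count_files(root)
    return count, time.monotonic() - start
```

If a walk of the backup paths finishes in seconds, slow enumeration alone is an unlikely explanation for a timeout measured in thousands of seconds.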
 
 
Aaron Mills
Systems Administrator
Return Path, Inc.
http://www.returnpath.net
aaron.mills AT returnpath DOT net
 



From: Liddle, Stuart [mailto:liddles AT amgen DOT com]
Sent: Monday, July 09, 2007 11:14 AM
To: Aaron Mills; veritas-bu AT mailman.eng.auburn DOT edu
Subject: RE: [Veritas-bu] same job keeps hanging.

 
So, are you trying to back up a filesystem with lots and lots of small files?  If so, remember that NetBackup will try to enumerate all of the files that you are trying to back up.  We had a similar situation where we were trying to back up a filesystem with 3.5 million files in 50,000 directories.  It took hours to build a file list of all of that, and consequently the job timed out.
 
 
Symantec told us the best solution for that particular directory was NDMP (since the timeouts are much longer).
 
 
Or, I suppose, you could raise the timeout value above 3600 seconds and see what happens.
 



From: veritas-bu-bounces AT mailman.eng.auburn DOT edu [mailto:veritas-bu-bounces AT mailman.eng.auburn DOT edu] On Behalf Of Aaron Mills
Sent: Monday, July 09, 2007 9:58 AM
To: veritas-bu AT mailman.eng.auburn DOT edu
Subject: [Veritas-bu] same job keeps hanging.

 
Hi all,
 
I’m hoping someone’s seen this before. I’m running 5.1MP6 w/ AIT3 – I’ve got a ~126GB backup that kicks off weekly, but hangs within a few hours every time – the error I get is always “media manager terminated by parent process” but the logs don’t seem to show anything odd. No other backups hang like this. This is also the only job that runs on the server itself.
 
bptm gives me:
 
03:28:45.470 [4999] <2> io_ioctl: command (1)MTFSF 1 from (bptm.c.8307) on drive index 1
03:28:45.530 [4999] <2> io_close: closing /usr/openv/netbackup/db/media/tpreq/AK6503, from bptm.c.8310
03:28:45.530 [4999] <2> catch_signal: EXITING with status 82
 
so I check bpbrm:
 
02:05:33.882 [4992] <2> bpbrm spawn_child: /usr/openv/netbackup/bin/bptm bptm -w -c foo.bar.com -den 17 -rt 6 -rn 0 -stunit Spectra2 -cl inbound -bt 1183968330 -b foo.bar.com _1183968330 -st 0 -cj 1 -p inbound -hostname foo.bar.com -ru root -rclnt foo.bar.com -rclnthostname foo.bar.com -rl 5 -rp 8035200 -sl ftpif -ct 0 -maxfrag 1048576 -tir -v -Z -mediasvr foo.bar.com -jobid 117926 -jobgrpid 117926 -masterversion 510000 -shm
02:05:33.884 [4992] <2> bpbrm write_continue_backup: wrote CONTINUE BACKUP on COMM_SOCK <4>
02:05:33.884 [4992] <2> bpbrm main: wrote /na270/pub/inbound on COMM_SOCK
02:05:33.884 [4992] <2> bpbrm main: wrote /na270/pub/ftp on COMM_SOCK
02:05:33.884 [4992] <2> bpbrm main: wrote CONTINUE on COMM_SOCK
02:05:33.885 [4992] <2> bpbrm main: ESTIMATE -1 -1 nbu0 foo.bar.com _1183968330
02:09:44.763 [4992] <2> bpbrm mm_sig: received ready signal from media manager
02:09:44.763 [4992] <2> bpbrm readline: retrying partial read from fgets ::
03:27:22.261 [4992] <2> bpbrm sighandler: signal 14 caught by bpbrm
03:27:22.272 [4992] <2> bpbrm sighandler: bpbrm timeout after 3600 seconds
03:27:22.287 [4992] <2> clear_held_signals: clearing signal mask stack, mask_stack_depth = 0
03:27:22.287 [4992] <2> bpbrm kill_child_process: start
03:27:22.287 [4992] <2> bpbrm wait_for_child: start
03:28:48.546 [4992] <2> bpbrm wait_for_child: child exit_status = 82 signal_status = 0
03:28:48.557 [4992] <2> inform_client_of_status: INF - Server status = 41
 
but I can’t seem to figure out why there was a timeout. I checked all the related logs – bpbkar just shows file writing stopping at 2:42am – like the process just hangs there, no errors though. Looking right now, the bpbrm and bpbkar processes for this backup are still running, but nothing is happening. The job shows as active and everything is queueing up behind it.  I’ve also adjusted the CLIENT_READ_TIMEOUT in /usr/openv/netbackup/bp.conf to no avail.
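
When eyeballing logs like these, it can help to have a script point out where the timestamps jump. The sketch below assumes every line follows the `HH:MM:SS.mmm [pid] <level> message` layout seen in the excerpts above and that all lines fall on the same day; `parse_line`, `find_gaps`, and the 60-second threshold are illustrative choices, not part of NetBackup.

```python
import re

# Matches lines such as:
# 03:27:22.261 [4992] <2> bpbrm sighandler: signal 14 caught by bpbrm
LINE_RE = re.compile(
    r"^(\d{2}):(\d{2}):(\d{2})\.(\d{3}) \[(\d+)\] <(\d+)> (.*)$"
)


def parse_line(line):
    """Return (seconds_since_midnight, message), or None if no match."""
    match = LINE_RE.match(line.strip())
    if match is None:
        return None
    hours, minutes, seconds, millis = (int(g) for g in match.group(1, 2, 3, 4))
    stamp = hours * 3600 + minutes * 60 + seconds + millis / 1000.0
    return stamp, match.group(7)


def find_gaps(lines, min_gap_s=60.0):
    """List (gap_seconds, message_before, message_after) for quiet periods."""
    gaps = []
    previous = None
    for line in lines:
        parsed = parse_line(line)
        if parsed is None:
            continue  # skip blank or unrecognized lines
        if previous is not None and parsed[0] - previous[0] >= min_gap_s:
            gaps.append((parsed[0] - previous[0], previous[1], parsed[1]))
        previous = parsed
    return gaps
```

Applied to the bpbrm excerpt above, the largest gap falls between the 02:09:44 "retrying partial read" line and the 03:27:22 "signal 14 caught" line, roughly 78 minutes in which bpbrm logged nothing. Running the same check on the bpbkar log may help narrow down when the client side went quiet.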
 
Can anyone point me in the right direction as to what I’m missing? I’m guessing there’s something I’m not seeing in one of the logs.
 
            -Aaron
 
Aaron Mills
Systems Administrator
Return Path, Inc.
http://www.returnpath.net
aaron.mills AT returnpath DOT net
 
 _______________________________________________
Veritas-bu maillist  -  Veritas-bu AT mailman.eng.auburn DOT edu
http://mailman.eng.auburn.edu/mailman/listinfo/veritas-bu


