Veritas-bu

Re: [Veritas-bu] same job keeps hanging.

2007-07-09 18:24:50
Subject: Re: [Veritas-bu] same job keeps hanging.
From: "Aaron Mills" <aaron.mills AT returnpath DOT net>
To: "David Rock" <dave-bu AT graniteweb DOT com>, <veritas-bu AT mailman.eng.auburn DOT edu>
Date: Mon, 9 Jul 2007 18:13:01 -0400
Anecdotally - it doesn't always die at the same time, but roughly an
hour or two into the job. I never actually checked whether the failures
land within a few minutes of each other, but the symptom is always the
same: daemon terminated by parent process, "bpbrm timeout after 3600
seconds".

Something seems to be causing the client process to get stuck, for lack
of a better word.

As to the server - the job runs on the NBU server itself. I have an NFS
mount hanging off it that I'm backing up. I've checked /var/adm/messages
and I don't see anything weird happening at the time the backup fails
(mount going stale, etc.), either. 
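
For anyone repeating that check, here is a rough sketch of pulling out
only the log lines stamped near the failure time. It assumes the classic
syslog line format ("Jul  9 18:13:01 host ..."); the function name
lines_near, the 10-minute window, and the sample lines are all made up
for illustration.

```python
from datetime import datetime, timedelta


def lines_near(log_lines, failure_time, window_minutes=10):
    """Return syslog-style lines stamped within the window around failure_time."""
    window = timedelta(minutes=window_minutes)
    hits = []
    for line in log_lines:
        try:
            # Classic syslog stamps ("Jul  9 18:13:01") carry no year,
            # so borrow the year from the failure time.
            stamp = datetime.strptime(
                f"{failure_time.year} {line[:15]}", "%Y %b %d %H:%M:%S"
            )
        except ValueError:
            continue  # not a timestamped line, skip it
        if abs(stamp - failure_time) <= window:
            hits.append(line.rstrip("\n"))
    return hits


# Made-up sample lines, for illustration only (not from the real server).
sample = [
    "Jul  9 16:02:11 nbuhost example: unrelated message",
    "Jul  9 17:38:40 nbuhost example: message close to the failure",
]
print(lines_near(sample, datetime(2007, 7, 9, 17, 40, 0)))
```

Against a real file, the same function could be fed the open file object
for /var/adm/messages in place of the sample list.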


Aaron Mills
Systems Administrator
Return Path, Inc.
http://www.returnpath.net
aaron.mills AT returnpath DOT net
 

-----Original Message-----
From: David Rock [mailto:dave-bu AT graniteweb DOT com] 
Sent: Monday, July 09, 2007 3:12 PM
To: veritas-bu AT mailman.eng.auburn DOT edu
Subject: Re: [Veritas-bu] same job keeps hanging.

* Aaron Mills <aaron.mills AT returnpath DOT net> [2007-07-09 16:39]:
> Hi all,
> 
> I'm hoping someone's seen this before. I'm running 5.1MP6 w/ AIT3 - I've
> got a ~126GB backup that kicks off weekly, but hangs within a few hours
> every time - the error I get is always "media manager terminated by
> parent process" but the logs don't seem to show anything odd. No other
> backups hang like this. This is also the only job that runs on the
> server itself.

When you say "runs on the server itself", what do you actually mean?  We
saw an odd timeout that always happened at the same time into the
backup, but the specific circumstances were:

1. a bpbackup command running on a client system
2. client on the other side of a firewall

What was happening in our case was this: the backup would start and, one
hour in, the firewall would decide that, since it hadn't seen any traffic
from the client to the master server, it should drop the connection's
entry from its state table.  Then, one hour later, the client would try
to send a keepalive packet through the now-defunct connection, fail,
retry several times, and then finally give up and die, taking the backup
with it.
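
To make the timing concrete, here is a toy model of that behavior. It
is a simplification, not how any particular firewall works: it assumes
the firewall expires a state entry once a connection has been idle for
a fixed timeout. The 3600 s and 7200 s figures come from the "one hour"
and "one hour later" above; the function name connection_survives is
made up.

```python
def connection_survives(idle_timeout_s, traffic_times_s):
    """Check a connection against a firewall's idle timeout.

    Returns (True, None) if every gap between packets stays under the
    timeout, else (False, time_of_first_packet_that_arrives_too_late).
    """
    last_seen = 0  # connection is established at t=0
    for t in sorted(traffic_times_s):
        if t - last_seen >= idle_timeout_s:
            # The state entry expired before this packet arrived.
            return False, t
        last_seen = t
    return True, None


# Firewall forgets idle connections after 3600 s; the client's first
# keepalive only goes out at 7200 s, so it hits a dropped entry.
print(connection_survives(3600, [7200]))

# Same firewall, but traffic every 1800 s keeps the entry alive.
print(connection_survives(3600, [1800, 3600, 5400, 7200]))
```

The second call is the usual shape of the fix: either the keepalive
interval is brought below the firewall's idle timeout, or the timeout is
raised above the keepalive interval.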

This may not be anything like what you are dealing with, but it is a
pretty good example of how things other than NBU can cause weird things
to happen and make it look like NBU is the cause.  Does your job always
die at the same time, or does it vary from attempt to attempt?

-- 
David Rock
david AT graniteweb DOT com


_______________________________________________
Veritas-bu maillist  -  Veritas-bu AT mailman.eng.auburn DOT edu
http://mailman.eng.auburn.edu/mailman/listinfo/veritas-bu