Veritas-bu

[Veritas-bu] hanging bpbkar process

2006-05-10 12:34:08
Subject: [Veritas-bu] hanging bpbkar process
From: steve_cashman AT symantec DOT com (Steven Cashman)
Date: Wed, 10 May 2006 11:34:08 -0500

Just because they are the same box does not rule out network issues.
If you are using an NFS mount, the traffic is probably going over the
network. It could be a problem with the NFS mount/share going away
momentarily, but it's hard to tell from this. Can you post more of the
bpbrm log?

o Does this only happen on the NFS backup?

o Is this client able to perform other types of backups (local drives,
System_state)?

o Is it possible to back up the slice on the local computer instead of
NFS mounting it to the NBU server?
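
One way to see whether the NFS mount goes away momentarily is to probe
it periodically while the backup runs. The following is only a rough
sketch in Python, not anything NetBackup provides: the function name is
made up for illustration, and the stat is done in a child process so
that a hung mount cannot block the probe itself.

```python
import subprocess
import sys


def mount_responds(path, timeout_s=10.0):
    """Return True if a stat of `path` completes within timeout_s seconds.

    Returns False if the stat fails or does not return in time, which is
    what a stalled NFS mount typically looks like from the client side.
    """
    # Run the stat in a separate interpreter so a blocked system call
    # only ties up the child, never the caller.
    probe = subprocess.Popen(
        [sys.executable, "-c", "import os, sys; os.stat(sys.argv[1])", path],
        stdout=subprocess.DEVNULL,
        stderr=subprocess.DEVNULL,
    )
    try:
        return probe.wait(timeout=timeout_s) == 0
    except subprocess.TimeoutExpired:
        # Best effort only: a process stuck in NFS I/O may not exit
        # until the server answers again, so do not wait for it here.
        probe.kill()
        return False
```

Calling mount_responds() on the mount point (for example a hypothetical
/mnt/nfs_slice) once a minute and logging the result with a timestamp
would show whether the mount stalled around the time bpbkar went quiet.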

Steve



  _____

From: veritas-bu-admin AT mailman.eng.auburn DOT edu
[mailto:veritas-bu-admin AT mailman.eng.auburn DOT edu] On Behalf Of Aaron
Mills
Sent: Tuesday, May 09, 2006 4:59 PM
To: Steve Cashman; Justin Piszcz
Cc: veritas-bu AT mailman.eng.auburn DOT edu
Subject: RE: [Veritas-bu] hanging bpbkar process


Well, the server and client are the same box, so network issues are out.
I'm NFS mounting a slice to the NBU server and then backing it up. After
about 3 hours, the client process just stops logging. I've checked the
bpbrm, bptm, and bpbkar logs to no avail.

Could an NFS hiccup cause this to happen?

bpbrm.log shows:

06:57:36.696 [13465] <2> bpbrm sighandler: signal 14 caught by bpbrm
06:57:36.696 [13465] <2> bpbrm sighandler: bpbrm timeout after 10800 seconds
06:57:36.696 [13465] <2> clear_held_signals: clearing signal mask stack, mask_stack_depth = 0
06:57:36.696 [13465] <2> bpbrm kill_child_process: start
06:57:36.697 [13465] <2> bpbrm wait_for_child: start
06:59:10.955 [13465] <2> bpbrm wait_for_child: child exit_status = 82 signal_status = 0
06:59:10.955 [13465] <2> inform_client_of_status: INF - Server status = 41

some three hours earlier, the last log from bpbkar looks like:

...snip...
03:58:24.319 [13472] <2> bpbkar process_file: INF - /path/to/some/file is sparse: stat.st_size = 12, stat.st_blocks * 512 = 0
03:58:24.320 [13472] <2> bpbkar process_file: INF - /path/to/some/file is now size 12
03:58:24.320 [13472] <4> bpbkar PrintFile: /path/to/some/file
03:58:24.320 [13472] <2> bpbkar process_file: INF - /path/to/some/file is sparse: stat.st_size = 12, stat.st_blocks * 512 = 0
03:58:24.321 [13472] <2> bpbkar process_file: INF - /path/to/some/file is now size 12
03:58:24.322 [13472] <4> bpbkar PrintFile: /path/to/some/file
03:58:24.322 [13472] <2> bpbkar process_file: INF - /path/to/some/file is sparse: stat.st_size = 12, stat.st_blocks * 512 = 0
03:58:24.323 [13472] <2> bpbkar process_file: INF - /path/to/some/file is now size 12

(Is this "is sparse" message what I should be worried about?)
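
For reference, the comparison that the "is sparse" message prints can be
reproduced with a plain stat call. This is a minimal sketch in Python,
not NetBackup's code; the function names are made up for illustration,
and it only mirrors the two numbers shown in the log line (st_size
against st_blocks * 512).

```python
import os


def looks_sparse(st_size, st_blocks):
    """True when the reported size exceeds the bytes held by allocated blocks."""
    # st_blocks is counted in 512-byte units, as in the log message.
    return st_size > st_blocks * 512


def path_looks_sparse(path):
    """Apply the same comparison to a real file (POSIX only)."""
    st = os.lstat(path)  # lstat so a symlink itself is examined, not its target
    return looks_sparse(st.st_size, st.st_blocks)


# Values from the log line: st_size = 12, st_blocks * 512 = 0
print(looks_sparse(12, 0))  # True
print(looks_sparse(12, 8))  # False: 8 blocks * 512 = 4096 bytes allocated
```

A 12-byte file with zero allocated blocks trips this check. That can
happen for very small files or symlinks on filesystems that store such
data inline, so running path_looks_sparse() on one of the reported files
shows whether the message simply reflects how the filesystem reports
allocation.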

then nothing 'till I killed the process some 27 hours later:

11:33:35.407 [13472] <16> bpbkar sighandler: ERR - bpbkar killed by signal 15
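
Lining up the timestamps quoted above is a quick sanity check:
subtracting the 10800-second timeout from the moment bpbrm logged it
lands within a minute of bpbkar's last entry. A minimal Python sketch of
that arithmetic, using only the two timestamps from the logs and
assuming both fall on the same day:

```python
from datetime import datetime, timedelta

FMT = "%H:%M:%S.%f"

timeout_logged = datetime.strptime("06:57:36.696", FMT)  # bpbrm: timeout after 10800 seconds
last_bpbkar = datetime.strptime("03:58:24.323", FMT)     # last bpbkar log entry
timeout_s = 10800

# Work backwards from the timeout to the point 10800 seconds earlier.
timer_start = timeout_logged - timedelta(seconds=timeout_s)
gap = last_bpbkar - timer_start

print(timer_start.strftime(FMT)[:-3])  # 03:57:36.696
print(gap.total_seconds())             # 47.627
```

So the three-hour window and the point where bpbkar stopped logging
agree to within about 48 seconds.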

  _____

From: Steve Cashman [mailto:nbu.admin AT gmail DOT com]
Sent: Tuesday, May 09, 2006 2:11 PM
To: Justin Piszcz
Cc: Aaron Mills; veritas-bu AT mailman.eng.auburn DOT edu
Subject: Re: [Veritas-bu] hanging bpbkar process



Oftentimes the Media/Master server encounters an error and exits but
can't notify the client (it doesn't even attempt to). It sounds like you
have at least some logs enabled on the Media server, since you mention
that you reviewed the bpbrm log. So review the Media server logs to see
if you can get more information. What I have often seen is a network
disconnect between the client and the Media server. This stops the
backup as far as bpbrm is concerned, but bpbkar does not know about it;
he just keeps churning through data until he is done.

If this only happens on a single client, then it may be something like a
bad port, NIC, driver, etc. If it only happens on one policy but other
policies for that same client fail, then you have a mystery on your
hands. If you can, post the relevant logs so we can browse them a bit
(bpbrm, bpbkar, and bptm to start, I would think).

Steve
Hope that helps


On 5/9/06, Justin Piszcz <jpiszcz.backup AT gmail DOT com> wrote:

        mkdir /usr/openv/netbackup/logs/bpbkar on the client, add
        VERBOSE = 5 to the bp.conf and watch the logs. You can also
        create logging directories on the server and tail them while
        the problematic client is backing up, etc.
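
The two steps above could be scripted roughly as follows. This is a
sketch in Python under stated assumptions: the function name is made up,
the default root /usr/openv/netbackup is the usual UNIX install location
with bp.conf directly beneath it, and any existing VERBOSE entry is
replaced rather than duplicated.

```python
from pathlib import Path


def enable_bpbkar_logging(nbu_root="/usr/openv/netbackup", level=5):
    """Create the bpbkar log directory and set VERBOSE in bp.conf."""
    root = Path(nbu_root)

    # Step 1: the log directory has to exist for bpbkar logs to appear.
    (root / "logs" / "bpbkar").mkdir(parents=True, exist_ok=True)

    # Step 2: make sure bp.conf carries exactly one VERBOSE entry.
    conf = root / "bp.conf"
    lines = conf.read_text().splitlines() if conf.exists() else []
    kept = [ln for ln in lines if ln.split("=")[0].strip() != "VERBOSE"]
    kept.append(f"VERBOSE = {level}")
    conf.write_text("\n".join(kept) + "\n")
    return conf
```

Run as root, enable_bpbkar_logging() touches the real installation;
passing a scratch directory as nbu_root is a safe way to try it first.
VERBOSE = 5 produces large logs, so it is worth lowering it again once
the problem job has been captured.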


        On 5/9/06, Aaron Mills <aaron.mills AT returnpath DOT net> wrote:
        >
        >
        > I had a backup timeout yesterday. bpbrm timed out after the
        > configured interval (3 hours), but when I checked to see what
        > happened to the client process, bpbkar was still running
        > (client/server on the same box) - it just hasn't done anything
        > since three hours before the job timed out. The bpbkar log
        > doesn't show anything useful. The process hums along and then
        > just stops logging all of a sudden. This always seems to
        > happen on the same job, though - never any others.
        >
        > Any ideas on where else I should look here?
        >
        >         -Aaron
        >
        > Aaron Mills
        > System Administrator
        > Return Path, Inc.
        > 303.642.4111
        > aaron.mills AT returnpath DOT net
        > http://www.returnpath.biz
        >

        _______________________________________________
        Veritas-bu maillist  -  Veritas-bu AT mailman.eng.auburn DOT edu
        http://mailman.eng.auburn.edu/mailman/listinfo/veritas-bu




