Veritas-bu

Re: [Veritas-bu] (41) network connection timed out

2008-05-06 09:58:30
Subject: Re: [Veritas-bu] (41) network connection timed out
From: "Michael Schmarck" <michael.schmarck AT habmalnefrage DOT de>
To: "WEAVER, Simon (external)" <simon.weaver AT astrium.eads DOT net>, "Jeff Lightner" <jlightner AT water DOT com>
Date: Tue, 6 May 2008 15:39:18 +0200
Simon,

same issue.

I did some more tests, though, and found that I just had to increase
the CLIENT_READ_TIMEOUT value. It needs to be >= BPSTART_TIMEOUT.
(Although I don't think that this is actually true - I rather think, that
CLIENT_READ_TIMEOUT needs to be >= the actual start delay;
but I think it's a good rule of thumb, that it should be >= BPSTART_TIMEOUT).

Jeff, was that error really so extremely obvious? Or what dead horse
did you mean?

On Tue, May 6, 2008 at 3:00 PM, WEAVER, Simon (external)
<simon.weaver AT astrium.eads DOT net> wrote:
>
>  michael
>  as a test, can you freeze this tape, and see if it will mount another
>  tape and re-try ?
>
>
>  -----Original Message-----
>  From: a.skwar AT gmail DOT com [mailto:a.skwar AT gmail DOT com] On Behalf 
> Of Michael
>  Schmarck
>
> Sent: Tuesday, May 06, 2008 10:09 AM
>  To: WEAVER, Simon (external)
>
> Cc: ckstehman AT pepco DOT com; veritas-bu AT mailman.eng.auburn DOT edu;
>  veritas-bu-bounces AT mailman.eng.auburn DOT edu
>
>
> Subject: Re: [Veritas-bu] (41) network connection timed out
>
>  Hello.
>
>  On Tue, May 6, 2008 at 8:08 AM, WEAVER, Simon (external)
>  <simon.weaver AT astrium.eads DOT net> wrote:
>  >
>  >  Michael
>  >  Have you enabled full verbose logging on the client then ?
>
>  I now have, yes. (VERBOSE=5 and ran mklogdir and restarted netbackup
>  services)
>
>  On the client, there's:
>
>  ==> vnetd/log.050608 <==
>  09:14:17.056 [23198] <2> vnet_generate_hashes: vnet_vnetd.c.1733:
>  hash_str1: 98760e9bccc1e15dcce295fe8a2e2988
>  09:14:17.057 [23198] <2> vnet_generate_hashes: vnet_vnetd.c.1734:
>  hash_str2: ecffe5e6e4e09cfd2429495d72ced5cd
>  09:14:17.057 [23198] <2> process_connect_forward_socket: vnetd.c.1901:
>  ipc_string: /tmp/vnet-23198210058057056385000000000-GvaWtT
>  09:14:17.057 [23198] <2> process_connect_forward_socket: vnetd.c.1914:
>  hash_str1: 98760e9bccc1e15dcce295fe8a2e2988
>  09:14:17.093 [23198] <2> vnet_send_network_socket: vnet_vnetd.c.1529:
>  hash_str2: ecffe5e6e4e09cfd2429495d72ced5cd
>  09:14:17.095 [23198] <2> ProcessRequests: vnetd.c.367: status: 0
>  0x00000000
>  09:14:17.095 [23198] <4> main: Terminating with status 0 09:14:17.130
>  [23196] <2> vnet_send_network_socket: vnet_vnetd.c.1529:
>  hash_str2: 7698a562cfb22c368336dc7a63ddcaab
>  09:14:17.132 [23196] <2> ProcessRequests: vnetd.c.367: status: 0
>  0x00000000
>  09:14:17.132 [23196] <4> main: Terminating with status 0
>
>  ==> bpbkar/050608_00001.log <==
>  09:14:17.434 [23201] <4> bpbkar main: INF - setenv BPEND_TIMEOUT=300
>  09:14:17.434 [23201] <4> bpbkar main: INF - setenv RESTARTED=0
>  09:14:17.434 [23201] <4> bpbkar main: INF - setenv
>  BACKUPID=winds01_1210058054
>  09:14:17.434 [23201] <4> bpbkar main: INF - setenv
>  UNIXBACKUPTIME=1210058054
>  09:14:17.434 [23201] <4> bpbkar main: INF - setenv BACKUPTIME=Tue May
>  6 09:14:14 2008
>
>  09:14:17.434 [23201] <4> bpbkar main: INF - Inform when done
>  09:14:17.434 [23201] <4> bpbkar main: INF - Echo keepalives
>  09:14:17.435 [23201] <4> bpbkar notify: INF - START bpstart_notify
>  09:14:17.435 [23201] <4> bpbkar notify:
>  /usr/openv/netbackup/bin/bpstart_notify.RACE winds01 RACE Daily-Online
>  INCR
>
>  ==> bpcd/050608_00001.log <==
>  09:14:17.260 [23201] <2> bpcd main: child_args[32] = -jobgrpid
>  09:14:17.260 [23201] <2> bpcd main: child_args[33] = 4779 09:14:17.260
>  [23201] <2> bpcd main: child_args[34] = -tir 09:14:17.260 [23201] <2>
>  bpcd main: child_args[35] = -tir_plus 09:14:17.260 [23201] <2> bpcd
>  main: child_args[36] = -use_otm 09:14:17.260 [23201] <2> bpcd main:
>  child_args[37] = -use_ofb 09:14:17.260 [23201] <2> bpcd main:
>  child_args[38] = -b 09:14:17.260 [23201] <2> bpcd main: child_args[39] =
>  winds01_1210058054 09:14:17.260 [23201] <2> bpcd main: child_args[40] =
>  -nfsok 09:14:17.260 [23201] <2> bpcd main: Before execvp of command
>
>  On the server, there's of course a lot more.
>
>  ==> ./bptm/050608_00001.log <==
>  10:32:21.109 [26311] <2> parse_resource_strings: Parsed message type 18,
>  version 0, 4 parameters
>  10:32:21.109 [26311] <2> nbjm_media_request: Job control returned to
>  BPTM
>  10:32:21.111 [26311] <2> drivename_open: Called with Create 1, file
>  SEAGATE.ULTRIUM06242-XXX.000
>  10:32:21.117 [26311] <2> drivename_lock: lock established
>  10:32:21.117 [26311] <2> drivename_write: Called with mode 0
>  10:32:21.127 [26311] <2> mount_open_media: Waiting for mount of media id
>  W041L1 (copy 1) on server winds06.
>  10:32:21.128 [26311] <4> create_tpreq_file: symlink to path
>  /dev/rmt/0cbn
>  10:32:21.134 [26311] <2> manage_scsi_reserve: SCSI RESERVE
>  10:32:21.136 [26311] <4> expandpath:
>  /usr/openv/netbackup/db/media/tpreq/drive_SEAGATE.ULTRIUM06242-XXX.000
>  10:32:21.194 [26311] <2> process_brm_msg: no pending message from bpbrm
>
>  ==> ./vnetd/log.050608 <==
>  10:32:21.590 [26317] <2> logparams: /usr/openv/bin/vnetd
>  10:32:21.591 [26317] <2> ProcessRequests: vnetd.c.284: msg: VNETD ACCEPT
>  FROM 10.0.1.26.48478 TO 10.0.1.26.13724 fd = 5
>  10:32:21.592 [26317] <2> vnet_version_accept: vnetd.c.1141:
>  *actual_version: 4 0x00000004
>  10:32:21.592 [26317] <2> ProcessRequests: vnetd.c.344: msg: Request
>  VN_REQUEST_SERVICE_SOCKET(6)
>  10:32:21.592 [26317] <2> process_service_socket: vnetd.c.1523:
>  service_name: tldcd
>  10:32:21.600 [26317] <2> vnet_send_network_socket: vnet_vnetd.c.1529:
>  hash_str2:
>  10:32:21.601 [26317] <2> ProcessRequests: vnetd.c.346: status: 0
>  0x00000000
>  10:32:21.601 [26317] <4> main: Terminating with status 0
>
>  ==> ./bpjobd/log.050608 <==
>  10:32:30.049 [18402] <2> process_active_job: Frame type is 2 (jobid
>  4782) Sock (14)
>  10:32:30.049 [18402] <2> parse_packet: Index 55 Field m_nOperation Value
>  <0>
>  10:32:30.049 [18402] <2> push_monitor_data: Written 24 for Monitor 0
>  10:32:30.049 [18402] <2> process_active_job: Frame type is 3 (jobid
>  4782) Sock (14)
>  10:32:30.049 [18402] <2> process_active_job: Begin to append
>  (JOBTRYFILE) job (4782)
>  10:32:30.049 [18402] <2> process_active_job: MOUNTING 1210062741 W041L1
>
>  Then the tape W041L1 gets mounted.
>
>  10:33:30.054 [18402] <2> process_active_job: MOUNTED 1210062802 W041L1
>  POSITIONING 1210062802 W041L1 25
>
>
>
>  ==> ./bptm/050608_00001.log <==
>  10:40:11.477 [26311] <2> write_backup: write_data() returned,
>  exit_status = 0, CINDEX = 0, TWIN_INDEX = 0, backup_status = -7
>  10:40:11.487 [26311] <2> write_backup: tp.tv_sec = 1210063211,
>  stp.tv_sec = 1210062911, tp.tv_usec = 477263, stp.tv_usec = 75365, et =
>  300401, mpx_total_kbytes[TWIN_INDEX = 0] = 0
>  10:40:11.488 [26311] <2> write_backup: no data written for fragment,
>  backspace before writing tapemark
>  10:40:11.488 [26311] <2> io_ioctl: command (2)MTBSF 1 from
>  (bptm.c.17872) on drive index 0
>
>  ==> ./bpjobd/log.050608 <==
>  10:40:20.129 [18402] <2> process_active_job: Frame type is 1 (jobid
>  4782) Sock (14)
>  10:40:20.149 [18402] <2> parse_packet: Index 33 Field m_nKilobytes Value
>  <0>
>  10:40:20.149 [18402] <2> parse_packet: Index 36 Field m_nKbPerSec Value
>  <0>
>  10:40:20.149 [18402] <2> push_monitor_data: Written 40 for Monitor 0
>  10:40:20.150 [18402] <2> process_active_job: Frame type is 3 (jobid
>  4782) Sock (14)
>  10:40:20.200 [18402] <2> process_active_job: Begin to append
>  (JOBTRYFILE) job (4782)
>  10:40:20.200 [18402] <2> process_active_job: KBW 1210063212 0 0,000
>
>
>  ==> ./bptm/050608_00001.log <==
>  10:40:26.277 [26311] <2> io_ioctl: command (0)MTWEOF 1 from
>  (bptm.c.18281) on drive index 0
>  10:40:47.469 [26311] <2> read_brm_msg: STOP BACKUP winds01_1210062735
>  10:40:47.469 [26311] <2> io_terminate_tape: writing empty backup header,
>  drive index 0, copy 1 10:40:47.470 [26311] <2> io_terminate_tape:
>  reposition to previous tapemark and rewrite header 10:40:47.470 [26311]
>  <2> io_ioctl: command (2)MTBSF 1 from
>  (bptm.c.8164) on drive index 0
>  10:40:51.095 [26311] <2> io_ioctl: command (1)MTFSF 1 from
>  (bptm.c.8167) on drive index 0
>  10:40:51.115 [26311] <2> io_read_back_header: drive index 0, reading
>  backup header
>  10:40:51.286 [26311] <2> io_read_block: read error on media id W041L1,
>  drive index 0 reading header block, No space left on device
>  10:40:51.287 [26311] <2> io_read_back_header: drive index 0, tapemark or
>  ENOSPC encountered
>  10:40:51.287 [26311] <2> io_ioctl: command (2)MTBSF 1 from
>  (bptm.c.8185) on drive index 0
>  10:40:51.307 [26311] <2> io_ioctl: command (0)MTWEOF 1 from
>  (bptm.c.8229) on drive index 0
>
>
>  10:40:51.286: "No space left on device"? What's that about? It's *NOT*
>  about a filesystem on the server or on the client: None of my fs are
>  100% full. Not even close. Or is the tape full? If so, why was it
>  mounted in the first place? And why is it not full, when I backup with a
>  shorter start delay (I verified this to be true)? What does "No space
>  left on device"
>  mean?
>
>  In ./bpbrm/050608_00001.log, there's:
>
>  10:40:11.353 [26315] <2> bpbrm sighandler: signal 14 caught by bpbrm
>  10:40:11.362 [26315] <2> bpbrm sighandler: bpbrm timeout after 300
>  seconds
>  10:40:11.362 [26315] <2> clear_held_signals: clearing signal mask stack,
>  mask_stack_depth = 0
>  10:40:11.362 [26315] <2> inform_client_of_status: INF - Server status =
>  41
>  10:40:11.475 [26301] <2> bpbrm brm_sigcld: SIGCLD caught by bpbrm
>  10:40:11.476 [26301] <2> bpbrm brm_sigcld: bpbrm child 26315 exit_status
>  = 41, signal_status = 0
>  10:40:11.476 [26301] <2> bpbrm brm_sigcld: child 26315 exited with
>  status 41: network connection timed out
>  10:40:11.476 [26301] <2> bpbrm send_status_to_parent: bpbrm child is
>  done, but the media manager child is not.
>  10:40:11.476 [26301] <2> bpbrm tell_mm: sending media manager msg:
>  STOP BACKUP winds01_1210062735
>
>  Hm. Is that "bpbrm timeout after 300 seconds" the root cause? How do I
>  make that timeout be larger? Is it CLIENT_READ_TIMEOUT or
>  CLIENT_CONNECT_TIMEOUT?
>
>
>
>
>  Reg. network settings: I WILL verify them. But what's confusing: Why do
>  I only have problems, when the start delay is large? Why do other
>  policies/schedules work fine (using the same client and same server)?
>  Further, I get the same symptoms, when I backup a different client,
>  which is in the same subnet (ie. no backup to tape with large start
>  delay; backup to tape works with small start delay; backup to disk works
>  with both large & small start delay). Actually, another reason why I
>  doubt network problems: It always used to work fine (and still is).
>  client & server are connected to a 1 Gbit switch. That switch is not
>  overloaded. No changes on the cabling or something like that has been
>  done. ping (~0.4 ms) & traceroute (1 hop) work just fine.
>  The network is not heavily loaded. According to netstat, there are 0
>  Ierrs and 0 Oerrs and also 0 Collis. I don't think that there are
>  network problems - what makes you think, that there are network
>  problems, considering that the fact, that networking works just fine?
>  What other parameters would you think that I should check to verify,
>  that there's indeed no network problem?
>
>  As far as isolating goes: I think I pretty much do have the problem
>  isolated: Backup fails, when: start_delay = large & backup -> tape.
>  It does not depend on the client. So the question is: Why does backup to
>  tape fail with a (41), when I have a large start delay in place?
>
>  You know, what's blowing my mind is, that I have these problems only
>  when I backup to tape with a large delay up front. Why is that so?
>
>  Michael
>
>
>
> This email (including any attachments) may contain confidential and/or
>  privileged information or information otherwise protected from disclosure.
>  If you are not the intended recipient, please notify the sender
>  immediately, do not copy this message or any attachments and do not use it
>  for any purpose or disclose its content to any person, but delete this
>  message and any attachments from your system. Astrium disclaims any and all
>  liability if this email transmission was virus corrupted, altered or
>  falsified.
>  ---------------------------------------------------------------------
>  Astrium Limited, Registered in England and Wales No. 2449259
>  REGISTERED OFFICE:-
>  Gunnels Wood Road, Stevenage, Hertfordshire, SG1 2AS, England
>



-- 
Alexander
_______________________________________________
Veritas-bu maillist  -  Veritas-bu AT mailman.eng.auburn DOT edu
http://mailman.eng.auburn.edu/mailman/listinfo/veritas-bu