Veritas-bu

Re: [Veritas-bu] (41) network connection timed out

2008-05-06 10:09:23
Subject: Re: [Veritas-bu] (41) network connection timed out
From: "WEAVER, Simon \(external\)" <simon.weaver AT astrium.eads DOT net>
To: "Michael Schmarck" <michael.schmarck AT habmalnefrage DOT de>, "Jeff Lightner" <jlightner AT water DOT com>
Date: Tue, 6 May 2008 14:46:06 +0100
At least the logging verbose settings has helped. Strange you were
getting an odd message about "No space left on device" - I can only
guess it may have been a contributing factor for the status 41 perhaps ?

Simon

-----Original Message-----
From: a.skwar AT gmail DOT com [mailto:a.skwar AT gmail DOT com] On Behalf Of 
Michael
Schmarck
Sent: Tuesday, May 06, 2008 2:39 PM
To: WEAVER, Simon (external); Jeff Lightner
Cc: ckstehman AT pepco DOT com; Veritas NetBackup Mailinglist
Subject: Re: [Veritas-bu] (41) network connection timed out

Simon,

same issue.

I did some more tests, though, and found that I just had to increase the
CLIENT_READ_TIMEOUT value. It needs to be >= BPSTART_TIMEOUT.
(Although I don't think that this is actually true - I rather think,
that CLIENT_READ_TIMEOUT needs to be >= the actual start delay; but I
think it's a good rule of thumb, that it should be >= BPSTART_TIMEOUT).

Jeff, was that error really so extremely obvious? Or what dead horse did
you mean?

On Tue, May 6, 2008 at 3:00 PM, WEAVER, Simon (external)
<simon.weaver AT astrium.eads DOT net> wrote:
>
>  michael
>  as a test, can you freeze this tape, and see if it will mount another

> tape and re-try ?
>
>
>  -----Original Message-----
>  From: a.skwar AT gmail DOT com [mailto:a.skwar AT gmail DOT com] On Behalf 
> Of 
> Michael  Schmarck
>
> Sent: Tuesday, May 06, 2008 10:09 AM
>  To: WEAVER, Simon (external)
>
> Cc: ckstehman AT pepco DOT com; veritas-bu AT mailman.eng.auburn DOT edu;
>  veritas-bu-bounces AT mailman.eng.auburn DOT edu
>
>
> Subject: Re: [Veritas-bu] (41) network connection timed out
>
>  Hello.
>
>  On Tue, May 6, 2008 at 8:08 AM, WEAVER, Simon (external)  
> <simon.weaver AT astrium.eads DOT net> wrote:
>  >
>  >  Michael
>  >  Have you enabled full verbose logging on the client then ?
>
>  I now have, yes. (VERBOSE=5 and ran mklogdir and restarted netbackup
>  services)
>
>  On the client, there's:
>
>  ==> vnetd/log.050608 <==
>  09:14:17.056 [23198] <2> vnet_generate_hashes: vnet_vnetd.c.1733:
>  hash_str1: 98760e9bccc1e15dcce295fe8a2e2988
>  09:14:17.057 [23198] <2> vnet_generate_hashes: vnet_vnetd.c.1734:
>  hash_str2: ecffe5e6e4e09cfd2429495d72ced5cd
>  09:14:17.057 [23198] <2> process_connect_forward_socket:
vnetd.c.1901:
>  ipc_string: /tmp/vnet-23198210058057056385000000000-GvaWtT
>  09:14:17.057 [23198] <2> process_connect_forward_socket:
vnetd.c.1914:
>  hash_str1: 98760e9bccc1e15dcce295fe8a2e2988
>  09:14:17.093 [23198] <2> vnet_send_network_socket: vnet_vnetd.c.1529:
>  hash_str2: ecffe5e6e4e09cfd2429495d72ced5cd
>  09:14:17.095 [23198] <2> ProcessRequests: vnetd.c.367: status: 0  
> 0x00000000
>  09:14:17.095 [23198] <4> main: Terminating with status 0 09:14:17.130

> [23196] <2> vnet_send_network_socket: vnet_vnetd.c.1529:
>  hash_str2: 7698a562cfb22c368336dc7a63ddcaab
>  09:14:17.132 [23196] <2> ProcessRequests: vnetd.c.367: status: 0  
> 0x00000000
>  09:14:17.132 [23196] <4> main: Terminating with status 0
>
>  ==> bpbkar/050608_00001.log <==
>  09:14:17.434 [23201] <4> bpbkar main: INF - setenv BPEND_TIMEOUT=300
>  09:14:17.434 [23201] <4> bpbkar main: INF - setenv RESTARTED=0
>  09:14:17.434 [23201] <4> bpbkar main: INF - setenv
>  BACKUPID=winds01_1210058054
>  09:14:17.434 [23201] <4> bpbkar main: INF - setenv
>  UNIXBACKUPTIME=1210058054
>  09:14:17.434 [23201] <4> bpbkar main: INF - setenv BACKUPTIME=Tue May
>  6 09:14:14 2008
>
>  09:14:17.434 [23201] <4> bpbkar main: INF - Inform when done
>  09:14:17.434 [23201] <4> bpbkar main: INF - Echo keepalives
>  09:14:17.435 [23201] <4> bpbkar notify: INF - START bpstart_notify
>  09:14:17.435 [23201] <4> bpbkar notify:
>  /usr/openv/netbackup/bin/bpstart_notify.RACE winds01 RACE 
> Daily-Online  INCR
>
>  ==> bpcd/050608_00001.log <==
>  09:14:17.260 [23201] <2> bpcd main: child_args[32] = -jobgrpid  
> 09:14:17.260 [23201] <2> bpcd main: child_args[33] = 4779 09:14:17.260

> [23201] <2> bpcd main: child_args[34] = -tir 09:14:17.260 [23201] <2>

> bpcd main: child_args[35] = -tir_plus 09:14:17.260 [23201] <2> bpcd
>  main: child_args[36] = -use_otm 09:14:17.260 [23201] <2> bpcd main:
>  child_args[37] = -use_ofb 09:14:17.260 [23201] <2> bpcd main:
>  child_args[38] = -b 09:14:17.260 [23201] <2> bpcd main: 
> child_args[39] =
>  winds01_1210058054 09:14:17.260 [23201] <2> bpcd main: child_args[40]

> =  -nfsok 09:14:17.260 [23201] <2> bpcd main: Before execvp of command
>
>  On the server, there's of course a lot more.
>
>  ==> ./bptm/050608_00001.log <==
>  10:32:21.109 [26311] <2> parse_resource_strings: Parsed message type 
> 18,  version 0, 4 parameters
>  10:32:21.109 [26311] <2> nbjm_media_request: Job control returned to

> BPTM
>  10:32:21.111 [26311] <2> drivename_open: Called with Create 1, file  
> SEAGATE.ULTRIUM06242-XXX.000
>  10:32:21.117 [26311] <2> drivename_lock: lock established
>  10:32:21.117 [26311] <2> drivename_write: Called with mode 0
>  10:32:21.127 [26311] <2> mount_open_media: Waiting for mount of media

> id
>  W041L1 (copy 1) on server winds06.
>  10:32:21.128 [26311] <4> create_tpreq_file: symlink to path  
> /dev/rmt/0cbn
>  10:32:21.134 [26311] <2> manage_scsi_reserve: SCSI RESERVE
>  10:32:21.136 [26311] <4> expandpath:
>  
> /usr/openv/netbackup/db/media/tpreq/drive_SEAGATE.ULTRIUM06242-XXX.000
>  10:32:21.194 [26311] <2> process_brm_msg: no pending message from 
> bpbrm
>
>  ==> ./vnetd/log.050608 <==
>  10:32:21.590 [26317] <2> logparams: /usr/openv/bin/vnetd
>  10:32:21.591 [26317] <2> ProcessRequests: vnetd.c.284: msg: VNETD 
> ACCEPT  FROM 10.0.1.26.48478 TO 10.0.1.26.13724 fd = 5
>  10:32:21.592 [26317] <2> vnet_version_accept: vnetd.c.1141:
>  *actual_version: 4 0x00000004
>  10:32:21.592 [26317] <2> ProcessRequests: vnetd.c.344: msg: Request
>  VN_REQUEST_SERVICE_SOCKET(6)
>  10:32:21.592 [26317] <2> process_service_socket: vnetd.c.1523:
>  service_name: tldcd
>  10:32:21.600 [26317] <2> vnet_send_network_socket: vnet_vnetd.c.1529:
>  hash_str2:
>  10:32:21.601 [26317] <2> ProcessRequests: vnetd.c.346: status: 0  
> 0x00000000
>  10:32:21.601 [26317] <4> main: Terminating with status 0
>
>  ==> ./bpjobd/log.050608 <==
>  10:32:30.049 [18402] <2> process_active_job: Frame type is 2 (jobid
>  4782) Sock (14)
>  10:32:30.049 [18402] <2> parse_packet: Index 55 Field m_nOperation 
> Value  <0>
>  10:32:30.049 [18402] <2> push_monitor_data: Written 24 for Monitor 0
>  10:32:30.049 [18402] <2> process_active_job: Frame type is 3 (jobid
>  4782) Sock (14)
>  10:32:30.049 [18402] <2> process_active_job: Begin to append
>  (JOBTRYFILE) job (4782)
>  10:32:30.049 [18402] <2> process_active_job: MOUNTING 1210062741 
> W041L1
>
>  Then the tape W041L1 gets mounted.
>
>  10:33:30.054 [18402] <2> process_active_job: MOUNTED 1210062802 
> W041L1  POSITIONING 1210062802 W041L1 25
>
>
>
>  ==> ./bptm/050608_00001.log <==
>  10:40:11.477 [26311] <2> write_backup: write_data() returned,  
> exit_status = 0, CINDEX = 0, TWIN_INDEX = 0, backup_status = -7
>  10:40:11.487 [26311] <2> write_backup: tp.tv_sec = 1210063211,  
> stp.tv_sec = 1210062911, tp.tv_usec = 477263, stp.tv_usec = 75365, et 
> =  300401, mpx_total_kbytes[TWIN_INDEX = 0] = 0
>  10:40:11.488 [26311] <2> write_backup: no data written for fragment,

> backspace before writing tapemark
>  10:40:11.488 [26311] <2> io_ioctl: command (2)MTBSF 1 from
>  (bptm.c.17872) on drive index 0
>
>  ==> ./bpjobd/log.050608 <==
>  10:40:20.129 [18402] <2> process_active_job: Frame type is 1 (jobid
>  4782) Sock (14)
>  10:40:20.149 [18402] <2> parse_packet: Index 33 Field m_nKilobytes 
> Value  <0>
>  10:40:20.149 [18402] <2> parse_packet: Index 36 Field m_nKbPerSec 
> Value  <0>
>  10:40:20.149 [18402] <2> push_monitor_data: Written 40 for Monitor 0

> 10:40:20.150 [18402] <2> process_active_job: Frame type is 3 (jobid
>  4782) Sock (14)
>  10:40:20.200 [18402] <2> process_active_job: Begin to append
>  (JOBTRYFILE) job (4782)
>  10:40:20.200 [18402] <2> process_active_job: KBW 1210063212 0 0,000
>
>
>  ==> ./bptm/050608_00001.log <==
>  10:40:26.277 [26311] <2> io_ioctl: command (0)MTWEOF 1 from
>  (bptm.c.18281) on drive index 0
>  10:40:47.469 [26311] <2> read_brm_msg: STOP BACKUP winds01_1210062735
>  10:40:47.469 [26311] <2> io_terminate_tape: writing empty backup 
> header,  drive index 0, copy 1 10:40:47.470 [26311] <2>
io_terminate_tape:
>  reposition to previous tapemark and rewrite header 10:40:47.470 
> [26311]  <2> io_ioctl: command (2)MTBSF 1 from
>  (bptm.c.8164) on drive index 0
>  10:40:51.095 [26311] <2> io_ioctl: command (1)MTFSF 1 from
>  (bptm.c.8167) on drive index 0
>  10:40:51.115 [26311] <2> io_read_back_header: drive index 0, reading

> backup header
>  10:40:51.286 [26311] <2> io_read_block: read error on media id 
> W041L1,  drive index 0 reading header block, No space left on device
>  10:40:51.287 [26311] <2> io_read_back_header: drive index 0, tapemark

> or  ENOSPC encountered
>  10:40:51.287 [26311] <2> io_ioctl: command (2)MTBSF 1 from
>  (bptm.c.8185) on drive index 0
>  10:40:51.307 [26311] <2> io_ioctl: command (0)MTWEOF 1 from
>  (bptm.c.8229) on drive index 0
>
>
>  10:40:51.286: "No space left on device"? What's that about? It's 
> *NOT*  about a filesystem on the server or on the client: None of my 
> fs are  100% full. Not even close. Or is the tape full? If so, why was

> it  mounted in the first place? And why is it not full, when I backup 
> with a  shorter start delay (I verified this to be true)? What does 
> "No space  left on device"
>  mean?
>
>  In ./bpbrm/050608_00001.log, there's:
>
>  10:40:11.353 [26315] <2> bpbrm sighandler: signal 14 caught by bpbrm
>  10:40:11.362 [26315] <2> bpbrm sighandler: bpbrm timeout after 300  
> seconds
>  10:40:11.362 [26315] <2> clear_held_signals: clearing signal mask 
> stack,  mask_stack_depth = 0
>  10:40:11.362 [26315] <2> inform_client_of_status: INF - Server status

> =
>  41
>  10:40:11.475 [26301] <2> bpbrm brm_sigcld: SIGCLD caught by bpbrm
>  10:40:11.476 [26301] <2> bpbrm brm_sigcld: bpbrm child 26315 
> exit_status  = 41, signal_status = 0
>  10:40:11.476 [26301] <2> bpbrm brm_sigcld: child 26315 exited with  
> status 41: network connection timed out
>  10:40:11.476 [26301] <2> bpbrm send_status_to_parent: bpbrm child is

> done, but the media manager child is not.
>  10:40:11.476 [26301] <2> bpbrm tell_mm: sending media manager msg:
>  STOP BACKUP winds01_1210062735
>
>  Hm. Is that "bpbrm timeout after 300 seconds" the root cause? How do 
> I  make that timeout be larger? Is it CLIENT_READ_TIMEOUT or  
> CLIENT_CONNECT_TIMEOUT?
>
>
>
>
>  Reg. network settings: I WILL verify them. But what's confusing: Why 
> do  I only have problems, when the start delay is large? Why do other

> policies/schedules work fine (using the same client and same server)?
>  Further, I get the same symptoms, when I backup a different client,  
> which is in the same subnet (ie. no backup to tape with large start  
> delay; backup to tape works with small start delay; backup to disk 
> works  with both large & small start delay). Actually, another reason 
> why I  doubt network problems: It always used to work fine (and still
is).
>  client & server are connected to a 1 Gbit switch. That switch is not

> overloaded. No changes on the cabling or something like that has been

> done. ping (~0.4 ms) & traceroute (1 hop) work just fine.
>  The network is not heavily loaded. According to netstat, there are 0

> Ierrs and 0 Oerrs and also 0 Collis. I don't think that there are  
> network problems - what makes you think, that there are network  
> problems, considering that the fact, that networking works just fine?
>  What other parameters would you think that I should check to verify,

> that there's indeed no network problem?
>
>  As far as isolating goes: I think I pretty much do have the problem
>  isolated: Backup fails, when: start_delay = large & backup -> tape.
>  It does not depend on the client. So the question is: Why does backup

> to  tape fail with a (41), when I have a large start delay in place?
>
>  You know, what's blowing my mind is, that I have these problems only

> when I backup to tape with a large delay up front. Why is that so?
>
>  Michael
>
>
>
> This email (including any attachments) may contain confidential and/or

> privileged information or information otherwise protected from
disclosure.
>  If you are not the intended recipient, please notify the sender  
> immediately, do not copy this message or any attachments and do not 
> use it  for any purpose or disclose its content to any person, but 
> delete this  message and any attachments from your system. Astrium 
> disclaims any and all  liability if this email transmission was virus 
> corrupted, altered or  falsified.
>  ---------------------------------------------------------------------
>  Astrium Limited, Registered in England and Wales No. 2449259  
> REGISTERED OFFICE:-  Gunnels Wood Road, Stevenage, Hertfordshire, SG1 
> 2AS, England
>



--
Alexander

This email (including any attachments) may contain confidential and/or
privileged information or information otherwise protected from disclosure.
If you are not the intended recipient, please notify the sender
immediately, do not copy this message or any attachments and do not use it
for any purpose or disclose its content to any person, but delete this
message and any attachments from your system. Astrium disclaims any and all
liability if this email transmission was virus corrupted, altered or
falsified.
---------------------------------------------------------------------
Astrium Limited, Registered in England and Wales No. 2449259
REGISTERED OFFICE:-
Gunnels Wood Road, Stevenage, Hertfordshire, SG1 2AS, England

_______________________________________________
Veritas-bu maillist  -  Veritas-bu AT mailman.eng.auburn DOT edu
http://mailman.eng.auburn.edu/mailman/listinfo/veritas-bu