Veritas-bu

Re: [Veritas-bu] (41) network connection timed out

2008-05-06 10:37:39
Subject: Re: [Veritas-bu] (41) network connection timed out
From: "WEAVER, Simon \(external\)" <simon.weaver AT astrium.eads DOT net>
To: "Michael Schmarck" <michael.schmarck AT habmalnefrage DOT de>, <veritas-bu AT mailman.eng.auburn DOT edu>
Date: Tue, 6 May 2008 15:22:36 +0100
Hi Michael
It is strange. I think this is why status 41 is often seen as a range of
possible problems. From my experience it has been Network, tape, or even
AV related. Sometimes even a reboot of the client can resolve many, if
not all problems.

Glad the logging has helped, it was unfortunate that several errors
seemed to point to a problem that was "not" really a problem, but at
least its been resolved.

Thanks, Simon 

-----Original Message-----
From: veritas-bu-bounces AT mailman.eng.auburn DOT edu
[mailto:veritas-bu-bounces AT mailman.eng.auburn DOT edu] On Behalf Of Michael
Schmarck
Sent: Tuesday, May 06, 2008 3:11 PM
To: veritas-bu AT mailman.eng.auburn DOT edu
Subject: Re: [Veritas-bu] (41) network connection timed out

Simon,

yes, the verbose logging really helped. But the "No space left on
device" message really was wrong. I now made the tape available again
and was able to write on it.

I suppose that some NBU process misinterpreted bprm failing to mean,
that there's no space left on the device.

What I still find strange is, that I do NOT have these problems, when I
backup to disk. It's as if bprm (or whatever process) ignores the
CLIENT_READ_TIMEOUT setting.

Now, that was fine - *for* *me*. But there might be cases, where this
would not be fine. So, why is the setting ignored, when a backup goes to
disk?

Strange, isn't it?

Regards,
Michael

WEAVER, Simon (external) <simon.weaver AT astrium.eads DOT net> wrote:

> 
> At least the logging verbose settings has helped. Strange you were 
> getting an odd message about "No space left on device" - I can only 
> guess it may have been a contributing factor for the status 41 perhaps
?
> 
> Simon
> 
> -----Original Message-----
> From: a.skwar AT gmail DOT com [mailto:a.skwar AT gmail DOT com] On Behalf Of 
> Michael Schmarck
> Sent: Tuesday, May 06, 2008 2:39 PM
> To: WEAVER, Simon (external); Jeff Lightner
> Cc: ckstehman AT pepco DOT com; Veritas NetBackup Mailinglist
> Subject: Re: [Veritas-bu] (41) network connection timed out
> 
> Simon,
> 
> same issue.
> 
> I did some more tests, though, and found that I just had to increase 
> the CLIENT_READ_TIMEOUT value. It needs to be >= BPSTART_TIMEOUT.
> (Although I don't think that this is actually true - I rather think, 
> that CLIENT_READ_TIMEOUT needs to be >= the actual start delay; but I 
> think it's a good rule of thumb, that it should be >=
BPSTART_TIMEOUT).
> 
> Jeff, was that error really so extremely obvious? Or what dead horse 
> did you mean?
> 
> On Tue, May 6, 2008 at 3:00 PM, WEAVER, Simon (external) 
> <simon.weaver AT astrium.eads DOT net> wrote:
>>
>>  michael
>>  as a test, can you freeze this tape, and see if it will mount 
>> another
> 
>> tape and re-try ?
>>
>>
>>  -----Original Message-----
>>  From: a.skwar AT gmail DOT com [mailto:a.skwar AT gmail DOT com] On Behalf 
>> Of 
>> Michael  Schmarck
>>
>> Sent: Tuesday, May 06, 2008 10:09 AM
>>  To: WEAVER, Simon (external)
>>
>> Cc: ckstehman AT pepco DOT com; veritas-bu AT mailman.eng.auburn DOT edu;
>>  veritas-bu-bounces AT mailman.eng.auburn DOT edu
>>
>>
>> Subject: Re: [Veritas-bu] (41) network connection timed out
>>
>>  Hello.
>>
>>  On Tue, May 6, 2008 at 8:08 AM, WEAVER, Simon (external) 
>> <simon.weaver AT astrium.eads DOT net> wrote:
>>  >
>>  >  Michael
>>  >  Have you enabled full verbose logging on the client then ?
>>
>>  I now have, yes. (VERBOSE=5 and ran mklogdir and restarted netbackup
>>  services)
>>
>>  On the client, there's:
>>
>>  ==> vnetd/log.050608 <==
>>  09:14:17.056 [23198] <2> vnet_generate_hashes: vnet_vnetd.c.1733:
>>  hash_str1: 98760e9bccc1e15dcce295fe8a2e2988
>>  09:14:17.057 [23198] <2> vnet_generate_hashes: vnet_vnetd.c.1734:
>>  hash_str2: ecffe5e6e4e09cfd2429495d72ced5cd
>>  09:14:17.057 [23198] <2> process_connect_forward_socket:
> vnetd.c.1901:
>>  ipc_string: /tmp/vnet-23198210058057056385000000000-GvaWtT
>>  09:14:17.057 [23198] <2> process_connect_forward_socket:
> vnetd.c.1914:
>>  hash_str1: 98760e9bccc1e15dcce295fe8a2e2988
>>  09:14:17.093 [23198] <2> vnet_send_network_socket:
vnet_vnetd.c.1529:
>>  hash_str2: ecffe5e6e4e09cfd2429495d72ced5cd
>>  09:14:17.095 [23198] <2> ProcessRequests: vnetd.c.367: status: 0 
>> 0x00000000
>>  09:14:17.095 [23198] <4> main: Terminating with status 0 
>> 09:14:17.130
> 
>> [23196] <2> vnet_send_network_socket: vnet_vnetd.c.1529:
>>  hash_str2: 7698a562cfb22c368336dc7a63ddcaab
>>  09:14:17.132 [23196] <2> ProcessRequests: vnetd.c.367: status: 0 
>> 0x00000000
>>  09:14:17.132 [23196] <4> main: Terminating with status 0
>>
>>  ==> bpbkar/050608_00001.log <==
>>  09:14:17.434 [23201] <4> bpbkar main: INF - setenv BPEND_TIMEOUT=300
>>  09:14:17.434 [23201] <4> bpbkar main: INF - setenv RESTARTED=0
>>  09:14:17.434 [23201] <4> bpbkar main: INF - setenv
>>  BACKUPID=winds01_1210058054
>>  09:14:17.434 [23201] <4> bpbkar main: INF - setenv
>>  UNIXBACKUPTIME=1210058054
>>  09:14:17.434 [23201] <4> bpbkar main: INF - setenv BACKUPTIME=Tue 
>> May
>>  6 09:14:14 2008
>>
>>  09:14:17.434 [23201] <4> bpbkar main: INF - Inform when done
>>  09:14:17.434 [23201] <4> bpbkar main: INF - Echo keepalives
>>  09:14:17.435 [23201] <4> bpbkar notify: INF - START bpstart_notify
>>  09:14:17.435 [23201] <4> bpbkar notify:
>>  /usr/openv/netbackup/bin/bpstart_notify.RACE winds01 RACE 
>> Daily-Online  INCR
>>
>>  ==> bpcd/050608_00001.log <==
>>  09:14:17.260 [23201] <2> bpcd main: child_args[32] = -jobgrpid 
>> 09:14:17.260 [23201] <2> bpcd main: child_args[33] = 4779 
>> 09:14:17.260
> 
>> [23201] <2> bpcd main: child_args[34] = -tir 09:14:17.260 [23201] <2>
> 
>> bpcd main: child_args[35] = -tir_plus 09:14:17.260 [23201] <2> bpcd
>>  main: child_args[36] = -use_otm 09:14:17.260 [23201] <2> bpcd main:
>>  child_args[37] = -use_ofb 09:14:17.260 [23201] <2> bpcd main:
>>  child_args[38] = -b 09:14:17.260 [23201] <2> bpcd main:
>> child_args[39] =
>>  winds01_1210058054 09:14:17.260 [23201] <2> bpcd main: 
>> child_args[40]
> 
>> =  -nfsok 09:14:17.260 [23201] <2> bpcd main: Before execvp of 
>> command
>>
>>  On the server, there's of course a lot more.
>>
>>  ==> ./bptm/050608_00001.log <==
>>  10:32:21.109 [26311] <2> parse_resource_strings: Parsed message type

>> 18,  version 0, 4 parameters
>>  10:32:21.109 [26311] <2> nbjm_media_request: Job control returned to
> 
>> BPTM
>>  10:32:21.111 [26311] <2> drivename_open: Called with Create 1, file 
>> SEAGATE.ULTRIUM06242-XXX.000
>>  10:32:21.117 [26311] <2> drivename_lock: lock established
>>  10:32:21.117 [26311] <2> drivename_write: Called with mode 0
>>  10:32:21.127 [26311] <2> mount_open_media: Waiting for mount of 
>> media
> 
>> id
>>  W041L1 (copy 1) on server winds06.
>>  10:32:21.128 [26311] <4> create_tpreq_file: symlink to path 
>> /dev/rmt/0cbn
>>  10:32:21.134 [26311] <2> manage_scsi_reserve: SCSI RESERVE
>>  10:32:21.136 [26311] <4> expandpath:
>>  
>> /usr/openv/netbackup/db/media/tpreq/drive_SEAGATE.ULTRIUM06242-XXX.00
>> 0
>>  10:32:21.194 [26311] <2> process_brm_msg: no pending message from 
>> bpbrm
>>
>>  ==> ./vnetd/log.050608 <==
>>  10:32:21.590 [26317] <2> logparams: /usr/openv/bin/vnetd
>>  10:32:21.591 [26317] <2> ProcessRequests: vnetd.c.284: msg: VNETD 
>> ACCEPT  FROM 10.0.1.26.48478 TO 10.0.1.26.13724 fd = 5
>>  10:32:21.592 [26317] <2> vnet_version_accept: vnetd.c.1141:
>>  *actual_version: 4 0x00000004
>>  10:32:21.592 [26317] <2> ProcessRequests: vnetd.c.344: msg: Request
>>  VN_REQUEST_SERVICE_SOCKET(6)
>>  10:32:21.592 [26317] <2> process_service_socket: vnetd.c.1523:
>>  service_name: tldcd
>>  10:32:21.600 [26317] <2> vnet_send_network_socket:
vnet_vnetd.c.1529:
>>  hash_str2:
>>  10:32:21.601 [26317] <2> ProcessRequests: vnetd.c.346: status: 0 
>> 0x00000000
>>  10:32:21.601 [26317] <4> main: Terminating with status 0
>>
>>  ==> ./bpjobd/log.050608 <==
>>  10:32:30.049 [18402] <2> process_active_job: Frame type is 2 (jobid
>>  4782) Sock (14)
>>  10:32:30.049 [18402] <2> parse_packet: Index 55 Field m_nOperation 
>> Value  <0>
>>  10:32:30.049 [18402] <2> push_monitor_data: Written 24 for Monitor 0
>>  10:32:30.049 [18402] <2> process_active_job: Frame type is 3 (jobid
>>  4782) Sock (14)
>>  10:32:30.049 [18402] <2> process_active_job: Begin to append
>>  (JOBTRYFILE) job (4782)
>>  10:32:30.049 [18402] <2> process_active_job: MOUNTING 1210062741
>> W041L1
>>
>>  Then the tape W041L1 gets mounted.
>>
>>  10:33:30.054 [18402] <2> process_active_job: MOUNTED 1210062802
>> W041L1  POSITIONING 1210062802 W041L1 25
>>
>>
>>
>>  ==> ./bptm/050608_00001.log <==
>>  10:40:11.477 [26311] <2> write_backup: write_data() returned, 
>> exit_status = 0, CINDEX = 0, TWIN_INDEX = 0, backup_status = -7
>>  10:40:11.487 [26311] <2> write_backup: tp.tv_sec = 1210063211, 
>> stp.tv_sec = 1210062911, tp.tv_usec = 477263, stp.tv_usec = 75365, et

>> =  300401, mpx_total_kbytes[TWIN_INDEX = 0] = 0
>>  10:40:11.488 [26311] <2> write_backup: no data written for fragment,
> 
>> backspace before writing tapemark
>>  10:40:11.488 [26311] <2> io_ioctl: command (2)MTBSF 1 from
>>  (bptm.c.17872) on drive index 0
>>
>>  ==> ./bpjobd/log.050608 <==
>>  10:40:20.129 [18402] <2> process_active_job: Frame type is 1 (jobid
>>  4782) Sock (14)
>>  10:40:20.149 [18402] <2> parse_packet: Index 33 Field m_nKilobytes 
>> Value  <0>
>>  10:40:20.149 [18402] <2> parse_packet: Index 36 Field m_nKbPerSec 
>> Value  <0>
>>  10:40:20.149 [18402] <2> push_monitor_data: Written 40 for Monitor 0
> 
>> 10:40:20.150 [18402] <2> process_active_job: Frame type is 3 (jobid
>>  4782) Sock (14)
>>  10:40:20.200 [18402] <2> process_active_job: Begin to append
>>  (JOBTRYFILE) job (4782)
>>  10:40:20.200 [18402] <2> process_active_job: KBW 1210063212 0 0,000
>>
>>
>>  ==> ./bptm/050608_00001.log <==
>>  10:40:26.277 [26311] <2> io_ioctl: command (0)MTWEOF 1 from
>>  (bptm.c.18281) on drive index 0
>>  10:40:47.469 [26311] <2> read_brm_msg: STOP BACKUP 
>> winds01_1210062735
>>  10:40:47.469 [26311] <2> io_terminate_tape: writing empty backup 
>> header,  drive index 0, copy 1 10:40:47.470 [26311] <2>
> io_terminate_tape:
>>  reposition to previous tapemark and rewrite header 10:40:47.470 
>> [26311]  <2> io_ioctl: command (2)MTBSF 1 from
>>  (bptm.c.8164) on drive index 0
>>  10:40:51.095 [26311] <2> io_ioctl: command (1)MTFSF 1 from
>>  (bptm.c.8167) on drive index 0
>>  10:40:51.115 [26311] <2> io_read_back_header: drive index 0, reading
> 
>> backup header
>>  10:40:51.286 [26311] <2> io_read_block: read error on media id 
>> W041L1,  drive index 0 reading header block, No space left on device
>>  10:40:51.287 [26311] <2> io_read_back_header: drive index 0, 
>> tapemark
> 
>> or  ENOSPC encountered
>>  10:40:51.287 [26311] <2> io_ioctl: command (2)MTBSF 1 from
>>  (bptm.c.8185) on drive index 0
>>  10:40:51.307 [26311] <2> io_ioctl: command (0)MTWEOF 1 from
>>  (bptm.c.8229) on drive index 0
>>
>>
>>  10:40:51.286: "No space left on device"? What's that about? It's
>> *NOT*  about a filesystem on the server or on the client: None of my 
>> fs are  100% full. Not even close. Or is the tape full? If so, why 
>> was
> 
>> it  mounted in the first place? And why is it not full, when I backup

>> with a  shorter start delay (I verified this to be true)? What does 
>> "No space  left on device"
>>  mean?
>>
>>  In ./bpbrm/050608_00001.log, there's:
>>
>>  10:40:11.353 [26315] <2> bpbrm sighandler: signal 14 caught by bpbrm
>>  10:40:11.362 [26315] <2> bpbrm sighandler: bpbrm timeout after 300 
>> seconds
>>  10:40:11.362 [26315] <2> clear_held_signals: clearing signal mask 
>> stack,  mask_stack_depth = 0
>>  10:40:11.362 [26315] <2> inform_client_of_status: INF - Server 
>> status
> 
>> =
>>  41
>>  10:40:11.475 [26301] <2> bpbrm brm_sigcld: SIGCLD caught by bpbrm
>>  10:40:11.476 [26301] <2> bpbrm brm_sigcld: bpbrm child 26315 
>> exit_status  = 41, signal_status = 0
>>  10:40:11.476 [26301] <2> bpbrm brm_sigcld: child 26315 exited with 
>> status 41: network connection timed out
>>  10:40:11.476 [26301] <2> bpbrm send_status_to_parent: bpbrm child is
> 
>> done, but the media manager child is not.
>>  10:40:11.476 [26301] <2> bpbrm tell_mm: sending media manager msg:
>>  STOP BACKUP winds01_1210062735
>>
>>  Hm. Is that "bpbrm timeout after 300 seconds" the root cause? How do

>> I  make that timeout be larger? Is it CLIENT_READ_TIMEOUT or 
>> CLIENT_CONNECT_TIMEOUT?
>>
>>
>>
>>
>>  Reg. network settings: I WILL verify them. But what's confusing: Why

>> do  I only have problems, when the start delay is large? Why do other
> 
>> policies/schedules work fine (using the same client and same server)?
>>  Further, I get the same symptoms, when I backup a different client, 
>> which is in the same subnet (ie. no backup to tape with large start 
>> delay; backup to tape works with small start delay; backup to disk 
>> works  with both large & small start delay). Actually, another reason

>> why I  doubt network problems: It always used to work fine (and still
> is).
>>  client & server are connected to a 1 Gbit switch. That switch is not
> 
>> overloaded. No changes on the cabling or something like that has been
> 
>> done. ping (~0.4 ms) & traceroute (1 hop) work just fine.
>>  The network is not heavily loaded. According to netstat, there are 0
> 
>> Ierrs and 0 Oerrs and also 0 Collis. I don't think that there are 
>> network problems - what makes you think, that there are network 
>> problems, considering that the fact, that networking works just fine?
>>  What other parameters would you think that I should check to verify,
> 
>> that there's indeed no network problem?
>>
>>  As far as isolating goes: I think I pretty much do have the problem
>>  isolated: Backup fails, when: start_delay = large & backup -> tape.
>>  It does not depend on the client. So the question is: Why does 
>> backup
> 
>> to  tape fail with a (41), when I have a large start delay in place?
>>
>>  You know, what's blowing my mind is, that I have these problems only
> 
>> when I backup to tape with a large delay up front. Why is that so?
>>
>>  Michael
>>
>>
>>
>> This email (including any attachments) may contain confidential 
>> and/or
> 
>> privileged information or information otherwise protected from
> disclosure.
>>  If you are not the intended recipient, please notify the sender 
>> immediately, do not copy this message or any attachments and do not 
>> use it  for any purpose or disclose its content to any person, but 
>> delete this  message and any attachments from your system. Astrium 
>> disclaims any and all  liability if this email transmission was virus

>> corrupted, altered or  falsified.
>>  
>> ---------------------------------------------------------------------
>>  Astrium Limited, Registered in England and Wales No. 2449259 
>> REGISTERED OFFICE:-  Gunnels Wood Road, Stevenage, Hertfordshire, SG1

>> 2AS, England
>>
> 
> 
> 
> --
> Alexander
> 
> This email (including any attachments) may contain confidential and/or

> privileged information or information otherwise protected from
disclosure.
> If you are not the intended recipient, please notify the sender 
> immediately, do not copy this message or any attachments and do not 
> use it for any purpose or disclose its content to any person, but 
> delete this message and any attachments from your system. Astrium 
> disclaims any and all liability if this email transmission was virus 
> corrupted, altered or falsified.
> ---------------------------------------------------------------------
> Astrium Limited, Registered in England and Wales No. 2449259 
> REGISTERED OFFICE:- Gunnels Wood Road, Stevenage, Hertfordshire, SG1 
> 2AS, England

--
Michael

_______________________________________________
Veritas-bu maillist  -  Veritas-bu AT mailman.eng.auburn DOT edu
http://mailman.eng.auburn.edu/mailman/listinfo/veritas-bu

This email (including any attachments) may contain confidential and/or
privileged information or information otherwise protected from disclosure.
If you are not the intended recipient, please notify the sender
immediately, do not copy this message or any attachments and do not use it
for any purpose or disclose its content to any person, but delete this
message and any attachments from your system. Astrium disclaims any and all
liability if this email transmission was virus corrupted, altered or
falsified.
---------------------------------------------------------------------
Astrium Limited, Registered in England and Wales No. 2449259
REGISTERED OFFICE:-
Gunnels Wood Road, Stevenage, Hertfordshire, SG1 2AS, England

_______________________________________________
Veritas-bu maillist  -  Veritas-bu AT mailman.eng.auburn DOT edu
http://mailman.eng.auburn.edu/mailman/listinfo/veritas-bu