Veritas-bu

Re: [Veritas-bu] (41) network connection timed out

2008-05-06 10:40:06
Subject: Re: [Veritas-bu] (41) network connection timed out
From: Michael Schmarck <michael.schmarck AT habmalnefrage DOT de>
To: veritas-bu AT mailman.eng.auburn DOT edu
Date: Tue, 06 May 2008 16:10:33 +0200
Simon,

yes, the verbose logging really helped. But the "No space left on
device" message really was wrong. I now made the tape available again
and was able to write on it.

I suppose that some NBU process misinterpreted bprm failing to mean,
that there's no space left on the device.

What I still find strange is, that I do NOT have these problems,
when I backup to disk. It's as if bprm (or whatever process) ignores
the CLIENT_READ_TIMEOUT setting.

Now, that was fine - *for* *me*. But there might be cases, where
this would not be fine. So, why is the setting ignored, when a backup
goes to disk?

Strange, isn't it?

Regards,
Michael

WEAVER, Simon (external) <simon.weaver AT astrium.eads DOT net> wrote:

> 
> At least the logging verbose settings has helped. Strange you were
> getting an odd message about "No space left on device" - I can only
> guess it may have been a contributing factor for the status 41 perhaps ?
> 
> Simon
> 
> -----Original Message-----
> From: a.skwar AT gmail DOT com [mailto:a.skwar AT gmail DOT com] On Behalf Of 
> Michael
> Schmarck
> Sent: Tuesday, May 06, 2008 2:39 PM
> To: WEAVER, Simon (external); Jeff Lightner
> Cc: ckstehman AT pepco DOT com; Veritas NetBackup Mailinglist
> Subject: Re: [Veritas-bu] (41) network connection timed out
> 
> Simon,
> 
> same issue.
> 
> I did some more tests, though, and found that I just had to increase the
> CLIENT_READ_TIMEOUT value. It needs to be >= BPSTART_TIMEOUT.
> (Although I don't think that this is actually true - I rather think,
> that CLIENT_READ_TIMEOUT needs to be >= the actual start delay; but I
> think it's a good rule of thumb, that it should be >= BPSTART_TIMEOUT).
> 
> Jeff, was that error really so extremely obvious? Or what dead horse did
> you mean?
> 
> On Tue, May 6, 2008 at 3:00 PM, WEAVER, Simon (external)
> <simon.weaver AT astrium.eads DOT net> wrote:
>>
>>  michael
>>  as a test, can you freeze this tape, and see if it will mount another
> 
>> tape and re-try ?
>>
>>
>>  -----Original Message-----
>>  From: a.skwar AT gmail DOT com [mailto:a.skwar AT gmail DOT com] On Behalf 
>> Of
>> Michael  Schmarck
>>
>> Sent: Tuesday, May 06, 2008 10:09 AM
>>  To: WEAVER, Simon (external)
>>
>> Cc: ckstehman AT pepco DOT com; veritas-bu AT mailman.eng.auburn DOT edu;
>>  veritas-bu-bounces AT mailman.eng.auburn DOT edu
>>
>>
>> Subject: Re: [Veritas-bu] (41) network connection timed out
>>
>>  Hello.
>>
>>  On Tue, May 6, 2008 at 8:08 AM, WEAVER, Simon (external)
>> <simon.weaver AT astrium.eads DOT net> wrote:
>>  >
>>  >  Michael
>>  >  Have you enabled full verbose logging on the client then ?
>>
>>  I now have, yes. (VERBOSE=5 and ran mklogdir and restarted netbackup
>>  services)
>>
>>  On the client, there's:
>>
>>  ==> vnetd/log.050608 <==
>>  09:14:17.056 [23198] <2> vnet_generate_hashes: vnet_vnetd.c.1733:
>>  hash_str1: 98760e9bccc1e15dcce295fe8a2e2988
>>  09:14:17.057 [23198] <2> vnet_generate_hashes: vnet_vnetd.c.1734:
>>  hash_str2: ecffe5e6e4e09cfd2429495d72ced5cd
>>  09:14:17.057 [23198] <2> process_connect_forward_socket:
> vnetd.c.1901:
>>  ipc_string: /tmp/vnet-23198210058057056385000000000-GvaWtT
>>  09:14:17.057 [23198] <2> process_connect_forward_socket:
> vnetd.c.1914:
>>  hash_str1: 98760e9bccc1e15dcce295fe8a2e2988
>>  09:14:17.093 [23198] <2> vnet_send_network_socket: vnet_vnetd.c.1529:
>>  hash_str2: ecffe5e6e4e09cfd2429495d72ced5cd
>>  09:14:17.095 [23198] <2> ProcessRequests: vnetd.c.367: status: 0
>> 0x00000000
>>  09:14:17.095 [23198] <4> main: Terminating with status 0 09:14:17.130
> 
>> [23196] <2> vnet_send_network_socket: vnet_vnetd.c.1529:
>>  hash_str2: 7698a562cfb22c368336dc7a63ddcaab
>>  09:14:17.132 [23196] <2> ProcessRequests: vnetd.c.367: status: 0
>> 0x00000000
>>  09:14:17.132 [23196] <4> main: Terminating with status 0
>>
>>  ==> bpbkar/050608_00001.log <==
>>  09:14:17.434 [23201] <4> bpbkar main: INF - setenv BPEND_TIMEOUT=300
>>  09:14:17.434 [23201] <4> bpbkar main: INF - setenv RESTARTED=0
>>  09:14:17.434 [23201] <4> bpbkar main: INF - setenv
>>  BACKUPID=winds01_1210058054
>>  09:14:17.434 [23201] <4> bpbkar main: INF - setenv
>>  UNIXBACKUPTIME=1210058054
>>  09:14:17.434 [23201] <4> bpbkar main: INF - setenv BACKUPTIME=Tue May
>>  6 09:14:14 2008
>>
>>  09:14:17.434 [23201] <4> bpbkar main: INF - Inform when done
>>  09:14:17.434 [23201] <4> bpbkar main: INF - Echo keepalives
>>  09:14:17.435 [23201] <4> bpbkar notify: INF - START bpstart_notify
>>  09:14:17.435 [23201] <4> bpbkar notify:
>>  /usr/openv/netbackup/bin/bpstart_notify.RACE winds01 RACE
>> Daily-Online  INCR
>>
>>  ==> bpcd/050608_00001.log <==
>>  09:14:17.260 [23201] <2> bpcd main: child_args[32] = -jobgrpid
>> 09:14:17.260 [23201] <2> bpcd main: child_args[33] = 4779 09:14:17.260
> 
>> [23201] <2> bpcd main: child_args[34] = -tir 09:14:17.260 [23201] <2>
> 
>> bpcd main: child_args[35] = -tir_plus 09:14:17.260 [23201] <2> bpcd
>>  main: child_args[36] = -use_otm 09:14:17.260 [23201] <2> bpcd main:
>>  child_args[37] = -use_ofb 09:14:17.260 [23201] <2> bpcd main:
>>  child_args[38] = -b 09:14:17.260 [23201] <2> bpcd main:
>> child_args[39] =
>>  winds01_1210058054 09:14:17.260 [23201] <2> bpcd main: child_args[40]
> 
>> =  -nfsok 09:14:17.260 [23201] <2> bpcd main: Before execvp of command
>>
>>  On the server, there's of course a lot more.
>>
>>  ==> ./bptm/050608_00001.log <==
>>  10:32:21.109 [26311] <2> parse_resource_strings: Parsed message type
>> 18,  version 0, 4 parameters
>>  10:32:21.109 [26311] <2> nbjm_media_request: Job control returned to
> 
>> BPTM
>>  10:32:21.111 [26311] <2> drivename_open: Called with Create 1, file
>> SEAGATE.ULTRIUM06242-XXX.000
>>  10:32:21.117 [26311] <2> drivename_lock: lock established
>>  10:32:21.117 [26311] <2> drivename_write: Called with mode 0
>>  10:32:21.127 [26311] <2> mount_open_media: Waiting for mount of media
> 
>> id
>>  W041L1 (copy 1) on server winds06.
>>  10:32:21.128 [26311] <4> create_tpreq_file: symlink to path
>> /dev/rmt/0cbn
>>  10:32:21.134 [26311] <2> manage_scsi_reserve: SCSI RESERVE
>>  10:32:21.136 [26311] <4> expandpath:
>>  
>> /usr/openv/netbackup/db/media/tpreq/drive_SEAGATE.ULTRIUM06242-XXX.000
>>  10:32:21.194 [26311] <2> process_brm_msg: no pending message from
>> bpbrm
>>
>>  ==> ./vnetd/log.050608 <==
>>  10:32:21.590 [26317] <2> logparams: /usr/openv/bin/vnetd
>>  10:32:21.591 [26317] <2> ProcessRequests: vnetd.c.284: msg: VNETD
>> ACCEPT  FROM 10.0.1.26.48478 TO 10.0.1.26.13724 fd = 5
>>  10:32:21.592 [26317] <2> vnet_version_accept: vnetd.c.1141:
>>  *actual_version: 4 0x00000004
>>  10:32:21.592 [26317] <2> ProcessRequests: vnetd.c.344: msg: Request
>>  VN_REQUEST_SERVICE_SOCKET(6)
>>  10:32:21.592 [26317] <2> process_service_socket: vnetd.c.1523:
>>  service_name: tldcd
>>  10:32:21.600 [26317] <2> vnet_send_network_socket: vnet_vnetd.c.1529:
>>  hash_str2:
>>  10:32:21.601 [26317] <2> ProcessRequests: vnetd.c.346: status: 0
>> 0x00000000
>>  10:32:21.601 [26317] <4> main: Terminating with status 0
>>
>>  ==> ./bpjobd/log.050608 <==
>>  10:32:30.049 [18402] <2> process_active_job: Frame type is 2 (jobid
>>  4782) Sock (14)
>>  10:32:30.049 [18402] <2> parse_packet: Index 55 Field m_nOperation
>> Value  <0>
>>  10:32:30.049 [18402] <2> push_monitor_data: Written 24 for Monitor 0
>>  10:32:30.049 [18402] <2> process_active_job: Frame type is 3 (jobid
>>  4782) Sock (14)
>>  10:32:30.049 [18402] <2> process_active_job: Begin to append
>>  (JOBTRYFILE) job (4782)
>>  10:32:30.049 [18402] <2> process_active_job: MOUNTING 1210062741
>> W041L1
>>
>>  Then the tape W041L1 gets mounted.
>>
>>  10:33:30.054 [18402] <2> process_active_job: MOUNTED 1210062802
>> W041L1  POSITIONING 1210062802 W041L1 25
>>
>>
>>
>>  ==> ./bptm/050608_00001.log <==
>>  10:40:11.477 [26311] <2> write_backup: write_data() returned,
>> exit_status = 0, CINDEX = 0, TWIN_INDEX = 0, backup_status = -7
>>  10:40:11.487 [26311] <2> write_backup: tp.tv_sec = 1210063211,
>> stp.tv_sec = 1210062911, tp.tv_usec = 477263, stp.tv_usec = 75365, et
>> =  300401, mpx_total_kbytes[TWIN_INDEX = 0] = 0
>>  10:40:11.488 [26311] <2> write_backup: no data written for fragment,
> 
>> backspace before writing tapemark
>>  10:40:11.488 [26311] <2> io_ioctl: command (2)MTBSF 1 from
>>  (bptm.c.17872) on drive index 0
>>
>>  ==> ./bpjobd/log.050608 <==
>>  10:40:20.129 [18402] <2> process_active_job: Frame type is 1 (jobid
>>  4782) Sock (14)
>>  10:40:20.149 [18402] <2> parse_packet: Index 33 Field m_nKilobytes
>> Value  <0>
>>  10:40:20.149 [18402] <2> parse_packet: Index 36 Field m_nKbPerSec
>> Value  <0>
>>  10:40:20.149 [18402] <2> push_monitor_data: Written 40 for Monitor 0
> 
>> 10:40:20.150 [18402] <2> process_active_job: Frame type is 3 (jobid
>>  4782) Sock (14)
>>  10:40:20.200 [18402] <2> process_active_job: Begin to append
>>  (JOBTRYFILE) job (4782)
>>  10:40:20.200 [18402] <2> process_active_job: KBW 1210063212 0 0,000
>>
>>
>>  ==> ./bptm/050608_00001.log <==
>>  10:40:26.277 [26311] <2> io_ioctl: command (0)MTWEOF 1 from
>>  (bptm.c.18281) on drive index 0
>>  10:40:47.469 [26311] <2> read_brm_msg: STOP BACKUP winds01_1210062735
>>  10:40:47.469 [26311] <2> io_terminate_tape: writing empty backup
>> header,  drive index 0, copy 1 10:40:47.470 [26311] <2>
> io_terminate_tape:
>>  reposition to previous tapemark and rewrite header 10:40:47.470
>> [26311]  <2> io_ioctl: command (2)MTBSF 1 from
>>  (bptm.c.8164) on drive index 0
>>  10:40:51.095 [26311] <2> io_ioctl: command (1)MTFSF 1 from
>>  (bptm.c.8167) on drive index 0
>>  10:40:51.115 [26311] <2> io_read_back_header: drive index 0, reading
> 
>> backup header
>>  10:40:51.286 [26311] <2> io_read_block: read error on media id
>> W041L1,  drive index 0 reading header block, No space left on device
>>  10:40:51.287 [26311] <2> io_read_back_header: drive index 0, tapemark
> 
>> or  ENOSPC encountered
>>  10:40:51.287 [26311] <2> io_ioctl: command (2)MTBSF 1 from
>>  (bptm.c.8185) on drive index 0
>>  10:40:51.307 [26311] <2> io_ioctl: command (0)MTWEOF 1 from
>>  (bptm.c.8229) on drive index 0
>>
>>
>>  10:40:51.286: "No space left on device"? What's that about? It's
>> *NOT*  about a filesystem on the server or on the client: None of my
>> fs are  100% full. Not even close. Or is the tape full? If so, why was
> 
>> it  mounted in the first place? And why is it not full, when I backup
>> with a  shorter start delay (I verified this to be true)? What does
>> "No space  left on device"
>>  mean?
>>
>>  In ./bpbrm/050608_00001.log, there's:
>>
>>  10:40:11.353 [26315] <2> bpbrm sighandler: signal 14 caught by bpbrm
>>  10:40:11.362 [26315] <2> bpbrm sighandler: bpbrm timeout after 300
>> seconds
>>  10:40:11.362 [26315] <2> clear_held_signals: clearing signal mask
>> stack,  mask_stack_depth = 0
>>  10:40:11.362 [26315] <2> inform_client_of_status: INF - Server status
> 
>> =
>>  41
>>  10:40:11.475 [26301] <2> bpbrm brm_sigcld: SIGCLD caught by bpbrm
>>  10:40:11.476 [26301] <2> bpbrm brm_sigcld: bpbrm child 26315
>> exit_status  = 41, signal_status = 0
>>  10:40:11.476 [26301] <2> bpbrm brm_sigcld: child 26315 exited with
>> status 41: network connection timed out
>>  10:40:11.476 [26301] <2> bpbrm send_status_to_parent: bpbrm child is
> 
>> done, but the media manager child is not.
>>  10:40:11.476 [26301] <2> bpbrm tell_mm: sending media manager msg:
>>  STOP BACKUP winds01_1210062735
>>
>>  Hm. Is that "bpbrm timeout after 300 seconds" the root cause? How do
>> I  make that timeout be larger? Is it CLIENT_READ_TIMEOUT or
>> CLIENT_CONNECT_TIMEOUT?
>>
>>
>>
>>
>>  Reg. network settings: I WILL verify them. But what's confusing: Why
>> do  I only have problems, when the start delay is large? Why do other
> 
>> policies/schedules work fine (using the same client and same server)?
>>  Further, I get the same symptoms, when I backup a different client,
>> which is in the same subnet (ie. no backup to tape with large start
>> delay; backup to tape works with small start delay; backup to disk
>> works  with both large & small start delay). Actually, another reason
>> why I  doubt network problems: It always used to work fine (and still
> is).
>>  client & server are connected to a 1 Gbit switch. That switch is not
> 
>> overloaded. No changes on the cabling or something like that has been
> 
>> done. ping (~0.4 ms) & traceroute (1 hop) work just fine.
>>  The network is not heavily loaded. According to netstat, there are 0
> 
>> Ierrs and 0 Oerrs and also 0 Collis. I don't think that there are
>> network problems - what makes you think, that there are network
>> problems, considering that the fact, that networking works just fine?
>>  What other parameters would you think that I should check to verify,
> 
>> that there's indeed no network problem?
>>
>>  As far as isolating goes: I think I pretty much do have the problem
>>  isolated: Backup fails, when: start_delay = large & backup -> tape.
>>  It does not depend on the client. So the question is: Why does backup
> 
>> to  tape fail with a (41), when I have a large start delay in place?
>>
>>  You know, what's blowing my mind is, that I have these problems only
> 
>> when I backup to tape with a large delay up front. Why is that so?
>>
>>  Michael
>>
>>
>>
>> This email (including any attachments) may contain confidential and/or
> 
>> privileged information or information otherwise protected from
> disclosure.
>>  If you are not the intended recipient, please notify the sender
>> immediately, do not copy this message or any attachments and do not
>> use it  for any purpose or disclose its content to any person, but
>> delete this  message and any attachments from your system. Astrium
>> disclaims any and all  liability if this email transmission was virus
>> corrupted, altered or  falsified.
>>  ---------------------------------------------------------------------
>>  Astrium Limited, Registered in England and Wales No. 2449259
>> REGISTERED OFFICE:-  Gunnels Wood Road, Stevenage, Hertfordshire, SG1
>> 2AS, England
>>
> 
> 
> 
> --
> Alexander
> 
> This email (including any attachments) may contain confidential and/or
> privileged information or information otherwise protected from disclosure.
> If you are not the intended recipient, please notify the sender
> immediately, do not copy this message or any attachments and do not use it
> for any purpose or disclose its content to any person, but delete this
> message and any attachments from your system. Astrium disclaims any and
> all liability if this email transmission was virus corrupted, altered or
> falsified.
> ---------------------------------------------------------------------
> Astrium Limited, Registered in England and Wales No. 2449259
> REGISTERED OFFICE:-
> Gunnels Wood Road, Stevenage, Hertfordshire, SG1 2AS, England

-- 
Michael

_______________________________________________
Veritas-bu maillist  -  Veritas-bu AT mailman.eng.auburn DOT edu
http://mailman.eng.auburn.edu/mailman/listinfo/veritas-bu