Simon,
yes, the verbose logging really helped. But the "No space left on
device" message really was wrong. I now made the tape available again
and was able to write on it.
I suppose that some NBU process misinterpreted bprm failing to mean,
that there's no space left on the device.
What I still find strange is, that I do NOT have these problems,
when I backup to disk. It's as if bprm (or whatever process) ignores
the CLIENT_READ_TIMEOUT setting.
Now, that was fine - *for* *me*. But there might be cases, where
this would not be fine. So, why is the setting ignored, when a backup
goes to disk?
Strange, isn't it?
Regards,
Michael
WEAVER, Simon (external) <simon.weaver AT astrium.eads DOT net> wrote:
>
> At least the logging verbose settings has helped. Strange you were
> getting an odd message about "No space left on device" - I can only
> guess it may have been a contributing factor for the status 41 perhaps ?
>
> Simon
>
> -----Original Message-----
> From: a.skwar AT gmail DOT com [mailto:a.skwar AT gmail DOT com] On Behalf Of
> Michael
> Schmarck
> Sent: Tuesday, May 06, 2008 2:39 PM
> To: WEAVER, Simon (external); Jeff Lightner
> Cc: ckstehman AT pepco DOT com; Veritas NetBackup Mailinglist
> Subject: Re: [Veritas-bu] (41) network connection timed out
>
> Simon,
>
> same issue.
>
> I did some more tests, though, and found that I just had to increase the
> CLIENT_READ_TIMEOUT value. It needs to be >= BPSTART_TIMEOUT.
> (Although I don't think that this is actually true - I rather think,
> that CLIENT_READ_TIMEOUT needs to be >= the actual start delay; but I
> think it's a good rule of thumb, that it should be >= BPSTART_TIMEOUT).
>
> Jeff, was that error really so extremely obvious? Or what dead horse did
> you mean?
>
> On Tue, May 6, 2008 at 3:00 PM, WEAVER, Simon (external)
> <simon.weaver AT astrium.eads DOT net> wrote:
>>
>> michael
>> as a test, can you freeze this tape, and see if it will mount another
>
>> tape and re-try ?
>>
>>
>> -----Original Message-----
>> From: a.skwar AT gmail DOT com [mailto:a.skwar AT gmail DOT com] On Behalf
>> Of
>> Michael Schmarck
>>
>> Sent: Tuesday, May 06, 2008 10:09 AM
>> To: WEAVER, Simon (external)
>>
>> Cc: ckstehman AT pepco DOT com; veritas-bu AT mailman.eng.auburn DOT edu;
>> veritas-bu-bounces AT mailman.eng.auburn DOT edu
>>
>>
>> Subject: Re: [Veritas-bu] (41) network connection timed out
>>
>> Hello.
>>
>> On Tue, May 6, 2008 at 8:08 AM, WEAVER, Simon (external)
>> <simon.weaver AT astrium.eads DOT net> wrote:
>> >
>> > Michael
>> > Have you enabled full verbose logging on the client then ?
>>
>> I now have, yes. (VERBOSE=5 and ran mklogdir and restarted netbackup
>> services)
>>
>> On the client, there's:
>>
>> ==> vnetd/log.050608 <==
>> 09:14:17.056 [23198] <2> vnet_generate_hashes: vnet_vnetd.c.1733:
>> hash_str1: 98760e9bccc1e15dcce295fe8a2e2988
>> 09:14:17.057 [23198] <2> vnet_generate_hashes: vnet_vnetd.c.1734:
>> hash_str2: ecffe5e6e4e09cfd2429495d72ced5cd
>> 09:14:17.057 [23198] <2> process_connect_forward_socket:
> vnetd.c.1901:
>> ipc_string: /tmp/vnet-23198210058057056385000000000-GvaWtT
>> 09:14:17.057 [23198] <2> process_connect_forward_socket:
> vnetd.c.1914:
>> hash_str1: 98760e9bccc1e15dcce295fe8a2e2988
>> 09:14:17.093 [23198] <2> vnet_send_network_socket: vnet_vnetd.c.1529:
>> hash_str2: ecffe5e6e4e09cfd2429495d72ced5cd
>> 09:14:17.095 [23198] <2> ProcessRequests: vnetd.c.367: status: 0
>> 0x00000000
>> 09:14:17.095 [23198] <4> main: Terminating with status 0 09:14:17.130
>
>> [23196] <2> vnet_send_network_socket: vnet_vnetd.c.1529:
>> hash_str2: 7698a562cfb22c368336dc7a63ddcaab
>> 09:14:17.132 [23196] <2> ProcessRequests: vnetd.c.367: status: 0
>> 0x00000000
>> 09:14:17.132 [23196] <4> main: Terminating with status 0
>>
>> ==> bpbkar/050608_00001.log <==
>> 09:14:17.434 [23201] <4> bpbkar main: INF - setenv BPEND_TIMEOUT=300
>> 09:14:17.434 [23201] <4> bpbkar main: INF - setenv RESTARTED=0
>> 09:14:17.434 [23201] <4> bpbkar main: INF - setenv
>> BACKUPID=winds01_1210058054
>> 09:14:17.434 [23201] <4> bpbkar main: INF - setenv
>> UNIXBACKUPTIME=1210058054
>> 09:14:17.434 [23201] <4> bpbkar main: INF - setenv BACKUPTIME=Tue May
>> 6 09:14:14 2008
>>
>> 09:14:17.434 [23201] <4> bpbkar main: INF - Inform when done
>> 09:14:17.434 [23201] <4> bpbkar main: INF - Echo keepalives
>> 09:14:17.435 [23201] <4> bpbkar notify: INF - START bpstart_notify
>> 09:14:17.435 [23201] <4> bpbkar notify:
>> /usr/openv/netbackup/bin/bpstart_notify.RACE winds01 RACE
>> Daily-Online INCR
>>
>> ==> bpcd/050608_00001.log <==
>> 09:14:17.260 [23201] <2> bpcd main: child_args[32] = -jobgrpid
>> 09:14:17.260 [23201] <2> bpcd main: child_args[33] = 4779 09:14:17.260
>
>> [23201] <2> bpcd main: child_args[34] = -tir 09:14:17.260 [23201] <2>
>
>> bpcd main: child_args[35] = -tir_plus 09:14:17.260 [23201] <2> bpcd
>> main: child_args[36] = -use_otm 09:14:17.260 [23201] <2> bpcd main:
>> child_args[37] = -use_ofb 09:14:17.260 [23201] <2> bpcd main:
>> child_args[38] = -b 09:14:17.260 [23201] <2> bpcd main:
>> child_args[39] =
>> winds01_1210058054 09:14:17.260 [23201] <2> bpcd main: child_args[40]
>
>> = -nfsok 09:14:17.260 [23201] <2> bpcd main: Before execvp of command
>>
>> On the server, there's of course a lot more.
>>
>> ==> ./bptm/050608_00001.log <==
>> 10:32:21.109 [26311] <2> parse_resource_strings: Parsed message type
>> 18, version 0, 4 parameters
>> 10:32:21.109 [26311] <2> nbjm_media_request: Job control returned to
>
>> BPTM
>> 10:32:21.111 [26311] <2> drivename_open: Called with Create 1, file
>> SEAGATE.ULTRIUM06242-XXX.000
>> 10:32:21.117 [26311] <2> drivename_lock: lock established
>> 10:32:21.117 [26311] <2> drivename_write: Called with mode 0
>> 10:32:21.127 [26311] <2> mount_open_media: Waiting for mount of media
>
>> id
>> W041L1 (copy 1) on server winds06.
>> 10:32:21.128 [26311] <4> create_tpreq_file: symlink to path
>> /dev/rmt/0cbn
>> 10:32:21.134 [26311] <2> manage_scsi_reserve: SCSI RESERVE
>> 10:32:21.136 [26311] <4> expandpath:
>>
>> /usr/openv/netbackup/db/media/tpreq/drive_SEAGATE.ULTRIUM06242-XXX.000
>> 10:32:21.194 [26311] <2> process_brm_msg: no pending message from
>> bpbrm
>>
>> ==> ./vnetd/log.050608 <==
>> 10:32:21.590 [26317] <2> logparams: /usr/openv/bin/vnetd
>> 10:32:21.591 [26317] <2> ProcessRequests: vnetd.c.284: msg: VNETD
>> ACCEPT FROM 10.0.1.26.48478 TO 10.0.1.26.13724 fd = 5
>> 10:32:21.592 [26317] <2> vnet_version_accept: vnetd.c.1141:
>> *actual_version: 4 0x00000004
>> 10:32:21.592 [26317] <2> ProcessRequests: vnetd.c.344: msg: Request
>> VN_REQUEST_SERVICE_SOCKET(6)
>> 10:32:21.592 [26317] <2> process_service_socket: vnetd.c.1523:
>> service_name: tldcd
>> 10:32:21.600 [26317] <2> vnet_send_network_socket: vnet_vnetd.c.1529:
>> hash_str2:
>> 10:32:21.601 [26317] <2> ProcessRequests: vnetd.c.346: status: 0
>> 0x00000000
>> 10:32:21.601 [26317] <4> main: Terminating with status 0
>>
>> ==> ./bpjobd/log.050608 <==
>> 10:32:30.049 [18402] <2> process_active_job: Frame type is 2 (jobid
>> 4782) Sock (14)
>> 10:32:30.049 [18402] <2> parse_packet: Index 55 Field m_nOperation
>> Value <0>
>> 10:32:30.049 [18402] <2> push_monitor_data: Written 24 for Monitor 0
>> 10:32:30.049 [18402] <2> process_active_job: Frame type is 3 (jobid
>> 4782) Sock (14)
>> 10:32:30.049 [18402] <2> process_active_job: Begin to append
>> (JOBTRYFILE) job (4782)
>> 10:32:30.049 [18402] <2> process_active_job: MOUNTING 1210062741
>> W041L1
>>
>> Then the tape W041L1 gets mounted.
>>
>> 10:33:30.054 [18402] <2> process_active_job: MOUNTED 1210062802
>> W041L1 POSITIONING 1210062802 W041L1 25
>>
>>
>>
>> ==> ./bptm/050608_00001.log <==
>> 10:40:11.477 [26311] <2> write_backup: write_data() returned,
>> exit_status = 0, CINDEX = 0, TWIN_INDEX = 0, backup_status = -7
>> 10:40:11.487 [26311] <2> write_backup: tp.tv_sec = 1210063211,
>> stp.tv_sec = 1210062911, tp.tv_usec = 477263, stp.tv_usec = 75365, et
>> = 300401, mpx_total_kbytes[TWIN_INDEX = 0] = 0
>> 10:40:11.488 [26311] <2> write_backup: no data written for fragment,
>
>> backspace before writing tapemark
>> 10:40:11.488 [26311] <2> io_ioctl: command (2)MTBSF 1 from
>> (bptm.c.17872) on drive index 0
>>
>> ==> ./bpjobd/log.050608 <==
>> 10:40:20.129 [18402] <2> process_active_job: Frame type is 1 (jobid
>> 4782) Sock (14)
>> 10:40:20.149 [18402] <2> parse_packet: Index 33 Field m_nKilobytes
>> Value <0>
>> 10:40:20.149 [18402] <2> parse_packet: Index 36 Field m_nKbPerSec
>> Value <0>
>> 10:40:20.149 [18402] <2> push_monitor_data: Written 40 for Monitor 0
>
>> 10:40:20.150 [18402] <2> process_active_job: Frame type is 3 (jobid
>> 4782) Sock (14)
>> 10:40:20.200 [18402] <2> process_active_job: Begin to append
>> (JOBTRYFILE) job (4782)
>> 10:40:20.200 [18402] <2> process_active_job: KBW 1210063212 0 0,000
>>
>>
>> ==> ./bptm/050608_00001.log <==
>> 10:40:26.277 [26311] <2> io_ioctl: command (0)MTWEOF 1 from
>> (bptm.c.18281) on drive index 0
>> 10:40:47.469 [26311] <2> read_brm_msg: STOP BACKUP winds01_1210062735
>> 10:40:47.469 [26311] <2> io_terminate_tape: writing empty backup
>> header, drive index 0, copy 1 10:40:47.470 [26311] <2>
> io_terminate_tape:
>> reposition to previous tapemark and rewrite header 10:40:47.470
>> [26311] <2> io_ioctl: command (2)MTBSF 1 from
>> (bptm.c.8164) on drive index 0
>> 10:40:51.095 [26311] <2> io_ioctl: command (1)MTFSF 1 from
>> (bptm.c.8167) on drive index 0
>> 10:40:51.115 [26311] <2> io_read_back_header: drive index 0, reading
>
>> backup header
>> 10:40:51.286 [26311] <2> io_read_block: read error on media id
>> W041L1, drive index 0 reading header block, No space left on device
>> 10:40:51.287 [26311] <2> io_read_back_header: drive index 0, tapemark
>
>> or ENOSPC encountered
>> 10:40:51.287 [26311] <2> io_ioctl: command (2)MTBSF 1 from
>> (bptm.c.8185) on drive index 0
>> 10:40:51.307 [26311] <2> io_ioctl: command (0)MTWEOF 1 from
>> (bptm.c.8229) on drive index 0
>>
>>
>> 10:40:51.286: "No space left on device"? What's that about? It's
>> *NOT* about a filesystem on the server or on the client: None of my
>> fs are 100% full. Not even close. Or is the tape full? If so, why was
>
>> it mounted in the first place? And why is it not full, when I backup
>> with a shorter start delay (I verified this to be true)? What does
>> "No space left on device"
>> mean?
>>
>> In ./bpbrm/050608_00001.log, there's:
>>
>> 10:40:11.353 [26315] <2> bpbrm sighandler: signal 14 caught by bpbrm
>> 10:40:11.362 [26315] <2> bpbrm sighandler: bpbrm timeout after 300
>> seconds
>> 10:40:11.362 [26315] <2> clear_held_signals: clearing signal mask
>> stack, mask_stack_depth = 0
>> 10:40:11.362 [26315] <2> inform_client_of_status: INF - Server status
>
>> =
>> 41
>> 10:40:11.475 [26301] <2> bpbrm brm_sigcld: SIGCLD caught by bpbrm
>> 10:40:11.476 [26301] <2> bpbrm brm_sigcld: bpbrm child 26315
>> exit_status = 41, signal_status = 0
>> 10:40:11.476 [26301] <2> bpbrm brm_sigcld: child 26315 exited with
>> status 41: network connection timed out
>> 10:40:11.476 [26301] <2> bpbrm send_status_to_parent: bpbrm child is
>
>> done, but the media manager child is not.
>> 10:40:11.476 [26301] <2> bpbrm tell_mm: sending media manager msg:
>> STOP BACKUP winds01_1210062735
>>
>> Hm. Is that "bpbrm timeout after 300 seconds" the root cause? How do
>> I make that timeout be larger? Is it CLIENT_READ_TIMEOUT or
>> CLIENT_CONNECT_TIMEOUT?
>>
>>
>>
>>
>> Reg. network settings: I WILL verify them. But what's confusing: Why
>> do I only have problems, when the start delay is large? Why do other
>
>> policies/schedules work fine (using the same client and same server)?
>> Further, I get the same symptoms, when I backup a different client,
>> which is in the same subnet (ie. no backup to tape with large start
>> delay; backup to tape works with small start delay; backup to disk
>> works with both large & small start delay). Actually, another reason
>> why I doubt network problems: It always used to work fine (and still
> is).
>> client & server are connected to a 1 Gbit switch. That switch is not
>
>> overloaded. No changes on the cabling or something like that has been
>
>> done. ping (~0.4 ms) & traceroute (1 hop) work just fine.
>> The network is not heavily loaded. According to netstat, there are 0
>
>> Ierrs and 0 Oerrs and also 0 Collis. I don't think that there are
>> network problems - what makes you think, that there are network
>> problems, considering that the fact, that networking works just fine?
>> What other parameters would you think that I should check to verify,
>
>> that there's indeed no network problem?
>>
>> As far as isolating goes: I think I pretty much do have the problem
>> isolated: Backup fails, when: start_delay = large & backup -> tape.
>> It does not depend on the client. So the question is: Why does backup
>
>> to tape fail with a (41), when I have a large start delay in place?
>>
>> You know, what's blowing my mind is, that I have these problems only
>
>> when I backup to tape with a large delay up front. Why is that so?
>>
>> Michael
>>
>>
>>
>> This email (including any attachments) may contain confidential and/or
>
>> privileged information or information otherwise protected from
> disclosure.
>> If you are not the intended recipient, please notify the sender
>> immediately, do not copy this message or any attachments and do not
>> use it for any purpose or disclose its content to any person, but
>> delete this message and any attachments from your system. Astrium
>> disclaims any and all liability if this email transmission was virus
>> corrupted, altered or falsified.
>> ---------------------------------------------------------------------
>> Astrium Limited, Registered in England and Wales No. 2449259
>> REGISTERED OFFICE:- Gunnels Wood Road, Stevenage, Hertfordshire, SG1
>> 2AS, England
>>
>
>
>
> --
> Alexander
>
> This email (including any attachments) may contain confidential and/or
> privileged information or information otherwise protected from disclosure.
> If you are not the intended recipient, please notify the sender
> immediately, do not copy this message or any attachments and do not use it
> for any purpose or disclose its content to any person, but delete this
> message and any attachments from your system. Astrium disclaims any and
> all liability if this email transmission was virus corrupted, altered or
> falsified.
> ---------------------------------------------------------------------
> Astrium Limited, Registered in England and Wales No. 2449259
> REGISTERED OFFICE:-
> Gunnels Wood Road, Stevenage, Hertfordshire, SG1 2AS, England
--
Michael
_______________________________________________
Veritas-bu maillist - Veritas-bu AT mailman.eng.auburn DOT edu
http://mailman.eng.auburn.edu/mailman/listinfo/veritas-bu
|