Veritas-bu

Re: [Veritas-bu] (41) network connection timed out

2008-05-06 09:28:23
Subject: Re: [Veritas-bu] (41) network connection timed out
From: "WEAVER, Simon \(external\)" <simon.weaver AT astrium.eads DOT net>
To: "Michael Schmarck" <michael.schmarck AT habmalnefrage DOT de>
Date: Tue, 6 May 2008 14:00:34 +0100
michael
as a test, can you freeze this tape, and see if it will mount another
tape and re-try ? 

-----Original Message-----
From: a.skwar AT gmail DOT com [mailto:a.skwar AT gmail DOT com] On Behalf Of 
Michael
Schmarck
Sent: Tuesday, May 06, 2008 10:09 AM
To: WEAVER, Simon (external)
Cc: ckstehman AT pepco DOT com; veritas-bu AT mailman.eng.auburn DOT edu;
veritas-bu-bounces AT mailman.eng.auburn DOT edu
Subject: Re: [Veritas-bu] (41) network connection timed out

Hello.

On Tue, May 6, 2008 at 8:08 AM, WEAVER, Simon (external)
<simon.weaver AT astrium.eads DOT net> wrote:
>
>  Michael
>  Have you enabled full verbose logging on the client then ?

I now have, yes. (VERBOSE=5 and ran mklogdir and restarted netbackup
services)

On the client, there's:

==> vnetd/log.050608 <==
09:14:17.056 [23198] <2> vnet_generate_hashes: vnet_vnetd.c.1733:
hash_str1: 98760e9bccc1e15dcce295fe8a2e2988
09:14:17.057 [23198] <2> vnet_generate_hashes: vnet_vnetd.c.1734:
hash_str2: ecffe5e6e4e09cfd2429495d72ced5cd
09:14:17.057 [23198] <2> process_connect_forward_socket: vnetd.c.1901:
ipc_string: /tmp/vnet-23198210058057056385000000000-GvaWtT
09:14:17.057 [23198] <2> process_connect_forward_socket: vnetd.c.1914:
hash_str1: 98760e9bccc1e15dcce295fe8a2e2988
09:14:17.093 [23198] <2> vnet_send_network_socket: vnet_vnetd.c.1529:
hash_str2: ecffe5e6e4e09cfd2429495d72ced5cd
09:14:17.095 [23198] <2> ProcessRequests: vnetd.c.367: status: 0
0x00000000
09:14:17.095 [23198] <4> main: Terminating with status 0 09:14:17.130
[23196] <2> vnet_send_network_socket: vnet_vnetd.c.1529:
hash_str2: 7698a562cfb22c368336dc7a63ddcaab
09:14:17.132 [23196] <2> ProcessRequests: vnetd.c.367: status: 0
0x00000000
09:14:17.132 [23196] <4> main: Terminating with status 0

==> bpbkar/050608_00001.log <==
09:14:17.434 [23201] <4> bpbkar main: INF - setenv BPEND_TIMEOUT=300
09:14:17.434 [23201] <4> bpbkar main: INF - setenv RESTARTED=0
09:14:17.434 [23201] <4> bpbkar main: INF - setenv
BACKUPID=winds01_1210058054
09:14:17.434 [23201] <4> bpbkar main: INF - setenv
UNIXBACKUPTIME=1210058054
09:14:17.434 [23201] <4> bpbkar main: INF - setenv BACKUPTIME=Tue May
6 09:14:14 2008

09:14:17.434 [23201] <4> bpbkar main: INF - Inform when done
09:14:17.434 [23201] <4> bpbkar main: INF - Echo keepalives
09:14:17.435 [23201] <4> bpbkar notify: INF - START bpstart_notify
09:14:17.435 [23201] <4> bpbkar notify:
/usr/openv/netbackup/bin/bpstart_notify.RACE winds01 RACE Daily-Online
INCR

==> bpcd/050608_00001.log <==
09:14:17.260 [23201] <2> bpcd main: child_args[32] = -jobgrpid
09:14:17.260 [23201] <2> bpcd main: child_args[33] = 4779 09:14:17.260
[23201] <2> bpcd main: child_args[34] = -tir 09:14:17.260 [23201] <2>
bpcd main: child_args[35] = -tir_plus 09:14:17.260 [23201] <2> bpcd
main: child_args[36] = -use_otm 09:14:17.260 [23201] <2> bpcd main:
child_args[37] = -use_ofb 09:14:17.260 [23201] <2> bpcd main:
child_args[38] = -b 09:14:17.260 [23201] <2> bpcd main: child_args[39] =
winds01_1210058054 09:14:17.260 [23201] <2> bpcd main: child_args[40] =
-nfsok 09:14:17.260 [23201] <2> bpcd main: Before execvp of command

On the server, there's of course a lot more.

==> ./bptm/050608_00001.log <==
10:32:21.109 [26311] <2> parse_resource_strings: Parsed message type 18,
version 0, 4 parameters
10:32:21.109 [26311] <2> nbjm_media_request: Job control returned to
BPTM
10:32:21.111 [26311] <2> drivename_open: Called with Create 1, file
SEAGATE.ULTRIUM06242-XXX.000
10:32:21.117 [26311] <2> drivename_lock: lock established
10:32:21.117 [26311] <2> drivename_write: Called with mode 0
10:32:21.127 [26311] <2> mount_open_media: Waiting for mount of media id
W041L1 (copy 1) on server winds06.
10:32:21.128 [26311] <4> create_tpreq_file: symlink to path
/dev/rmt/0cbn
10:32:21.134 [26311] <2> manage_scsi_reserve: SCSI RESERVE
10:32:21.136 [26311] <4> expandpath:
/usr/openv/netbackup/db/media/tpreq/drive_SEAGATE.ULTRIUM06242-XXX.000
10:32:21.194 [26311] <2> process_brm_msg: no pending message from bpbrm

==> ./vnetd/log.050608 <==
10:32:21.590 [26317] <2> logparams: /usr/openv/bin/vnetd
10:32:21.591 [26317] <2> ProcessRequests: vnetd.c.284: msg: VNETD ACCEPT
FROM 10.0.1.26.48478 TO 10.0.1.26.13724 fd = 5
10:32:21.592 [26317] <2> vnet_version_accept: vnetd.c.1141:
*actual_version: 4 0x00000004
10:32:21.592 [26317] <2> ProcessRequests: vnetd.c.344: msg: Request
VN_REQUEST_SERVICE_SOCKET(6)
10:32:21.592 [26317] <2> process_service_socket: vnetd.c.1523:
service_name: tldcd
10:32:21.600 [26317] <2> vnet_send_network_socket: vnet_vnetd.c.1529:
hash_str2:
10:32:21.601 [26317] <2> ProcessRequests: vnetd.c.346: status: 0
0x00000000
10:32:21.601 [26317] <4> main: Terminating with status 0

==> ./bpjobd/log.050608 <==
10:32:30.049 [18402] <2> process_active_job: Frame type is 2 (jobid
4782) Sock (14)
10:32:30.049 [18402] <2> parse_packet: Index 55 Field m_nOperation Value
<0>
10:32:30.049 [18402] <2> push_monitor_data: Written 24 for Monitor 0
10:32:30.049 [18402] <2> process_active_job: Frame type is 3 (jobid
4782) Sock (14)
10:32:30.049 [18402] <2> process_active_job: Begin to append
(JOBTRYFILE) job (4782)
10:32:30.049 [18402] <2> process_active_job: MOUNTING 1210062741 W041L1

Then the tape W041L1 gets mounted.

10:33:30.054 [18402] <2> process_active_job: MOUNTED 1210062802 W041L1
POSITIONING 1210062802 W041L1 25



==> ./bptm/050608_00001.log <==
10:40:11.477 [26311] <2> write_backup: write_data() returned,
exit_status = 0, CINDEX = 0, TWIN_INDEX = 0, backup_status = -7
10:40:11.487 [26311] <2> write_backup: tp.tv_sec = 1210063211,
stp.tv_sec = 1210062911, tp.tv_usec = 477263, stp.tv_usec = 75365, et =
300401, mpx_total_kbytes[TWIN_INDEX = 0] = 0
10:40:11.488 [26311] <2> write_backup: no data written for fragment,
backspace before writing tapemark
10:40:11.488 [26311] <2> io_ioctl: command (2)MTBSF 1 from
(bptm.c.17872) on drive index 0

==> ./bpjobd/log.050608 <==
10:40:20.129 [18402] <2> process_active_job: Frame type is 1 (jobid
4782) Sock (14)
10:40:20.149 [18402] <2> parse_packet: Index 33 Field m_nKilobytes Value
<0>
10:40:20.149 [18402] <2> parse_packet: Index 36 Field m_nKbPerSec Value
<0>
10:40:20.149 [18402] <2> push_monitor_data: Written 40 for Monitor 0
10:40:20.150 [18402] <2> process_active_job: Frame type is 3 (jobid
4782) Sock (14)
10:40:20.200 [18402] <2> process_active_job: Begin to append
(JOBTRYFILE) job (4782)
10:40:20.200 [18402] <2> process_active_job: KBW 1210063212 0 0,000


==> ./bptm/050608_00001.log <==
10:40:26.277 [26311] <2> io_ioctl: command (0)MTWEOF 1 from
(bptm.c.18281) on drive index 0
10:40:47.469 [26311] <2> read_brm_msg: STOP BACKUP winds01_1210062735
10:40:47.469 [26311] <2> io_terminate_tape: writing empty backup header,
drive index 0, copy 1 10:40:47.470 [26311] <2> io_terminate_tape:
reposition to previous tapemark and rewrite header 10:40:47.470 [26311]
<2> io_ioctl: command (2)MTBSF 1 from
(bptm.c.8164) on drive index 0
10:40:51.095 [26311] <2> io_ioctl: command (1)MTFSF 1 from
(bptm.c.8167) on drive index 0
10:40:51.115 [26311] <2> io_read_back_header: drive index 0, reading
backup header
10:40:51.286 [26311] <2> io_read_block: read error on media id W041L1,
drive index 0 reading header block, No space left on device
10:40:51.287 [26311] <2> io_read_back_header: drive index 0, tapemark or
ENOSPC encountered
10:40:51.287 [26311] <2> io_ioctl: command (2)MTBSF 1 from
(bptm.c.8185) on drive index 0
10:40:51.307 [26311] <2> io_ioctl: command (0)MTWEOF 1 from
(bptm.c.8229) on drive index 0


10:40:51.286: "No space left on device"? What's that about? It's *NOT*
about a filesystem on the server or on the client: None of my fs are
100% full. Not even close. Or is the tape full? If so, why was it
mounted in the first place? And why is it not full, when I backup with a
shorter start delay (I verified this to be true)? What does "No space
left on device"
mean?

In ./bpbrm/050608_00001.log, there's:

10:40:11.353 [26315] <2> bpbrm sighandler: signal 14 caught by bpbrm
10:40:11.362 [26315] <2> bpbrm sighandler: bpbrm timeout after 300
seconds
10:40:11.362 [26315] <2> clear_held_signals: clearing signal mask stack,
mask_stack_depth = 0
10:40:11.362 [26315] <2> inform_client_of_status: INF - Server status =
41
10:40:11.475 [26301] <2> bpbrm brm_sigcld: SIGCLD caught by bpbrm
10:40:11.476 [26301] <2> bpbrm brm_sigcld: bpbrm child 26315 exit_status
= 41, signal_status = 0
10:40:11.476 [26301] <2> bpbrm brm_sigcld: child 26315 exited with
status 41: network connection timed out
10:40:11.476 [26301] <2> bpbrm send_status_to_parent: bpbrm child is
done, but the media manager child is not.
10:40:11.476 [26301] <2> bpbrm tell_mm: sending media manager msg:
STOP BACKUP winds01_1210062735

Hm. Is that "bpbrm timeout after 300 seconds" the root cause? How do I
make that timeout be larger? Is it CLIENT_READ_TIMEOUT or
CLIENT_CONNECT_TIMEOUT?




Reg. network settings: I WILL verify them. But what's confusing: Why do
I only have problems, when the start delay is large? Why do other
policies/schedules work fine (using the same client and same server)?
Further, I get the same symptoms, when I backup a different client,
which is in the same subnet (ie. no backup to tape with large start
delay; backup to tape works with small start delay; backup to disk works
with both large & small start delay). Actually, another reason why I
doubt network problems: It always used to work fine (and still is).
client & server are connected to a 1 Gbit switch. That switch is not
overloaded. No changes on the cabling or something like that has been
done. ping (~0.4 ms) & traceroute (1 hop) work just fine.
The network is not heavily loaded. According to netstat, there are 0
Ierrs and 0 Oerrs and also 0 Collis. I don't think that there are
network problems - what makes you think, that there are network
problems, considering that the fact, that networking works just fine?
What other parameters would you think that I should check to verify,
that there's indeed no network problem?

As far as isolating goes: I think I pretty much do have the problem
isolated: Backup fails, when: start_delay = large & backup -> tape.
It does not depend on the client. So the question is: Why does backup to
tape fail with a (41), when I have a large start delay in place?

You know, what's blowing my mind is, that I have these problems only
when I backup to tape with a large delay up front. Why is that so?

Michael

This email (including any attachments) may contain confidential and/or
privileged information or information otherwise protected from disclosure.
If you are not the intended recipient, please notify the sender
immediately, do not copy this message or any attachments and do not use it
for any purpose or disclose its content to any person, but delete this
message and any attachments from your system. Astrium disclaims any and all
liability if this email transmission was virus corrupted, altered or
falsified.
---------------------------------------------------------------------
Astrium Limited, Registered in England and Wales No. 2449259
REGISTERED OFFICE:-
Gunnels Wood Road, Stevenage, Hertfordshire, SG1 2AS, England

_______________________________________________
Veritas-bu maillist  -  Veritas-bu AT mailman.eng.auburn DOT edu
http://mailman.eng.auburn.edu/mailman/listinfo/veritas-bu