Veritas-bu

Re: [Veritas-bu] (41) network connection timed out

2008-05-06 05:26:45
From: "Michael Schmarck" <michael.schmarck AT habmalnefrage DOT de>
To: "WEAVER, Simon (external)" <simon.weaver AT astrium.eads DOT net>
Date: Tue, 6 May 2008 11:09:15 +0200
Hello.

On Tue, May 6, 2008 at 8:08 AM, WEAVER, Simon (external)
<simon.weaver AT astrium.eads DOT net> wrote:
>
>  Michael
>  Have you enabled full verbose logging on the client then ?

I now have, yes. (I set VERBOSE=5, ran mklogdir and restarted the NetBackup services.)
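For reference, roughly what I did to turn it on (standard UNIX install paths assumed; the restart scripts may differ on your platform/version):

```shell
# Enable full verbose logging on the client.
echo "VERBOSE = 5" >> /usr/openv/netbackup/bp.conf

# Create the per-process log directories (bpcd, bpbkar, vnetd, ...).
/usr/openv/netbackup/logs/mklogdir

# Restart the NetBackup daemons so they pick up the new verbosity
# (script names/locations vary by platform and NetBackup version).
/usr/openv/netbackup/bin/bp.kill_all
/usr/openv/netbackup/bin/bp.start_all
```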

On the client, there's:

==> vnetd/log.050608 <==
09:14:17.056 [23198] <2> vnet_generate_hashes: vnet_vnetd.c.1733:
hash_str1: 98760e9bccc1e15dcce295fe8a2e2988
09:14:17.057 [23198] <2> vnet_generate_hashes: vnet_vnetd.c.1734:
hash_str2: ecffe5e6e4e09cfd2429495d72ced5cd
09:14:17.057 [23198] <2> process_connect_forward_socket: vnetd.c.1901:
ipc_string: /tmp/vnet-23198210058057056385000000000-GvaWtT
09:14:17.057 [23198] <2> process_connect_forward_socket: vnetd.c.1914:
hash_str1: 98760e9bccc1e15dcce295fe8a2e2988
09:14:17.093 [23198] <2> vnet_send_network_socket: vnet_vnetd.c.1529:
hash_str2: ecffe5e6e4e09cfd2429495d72ced5cd
09:14:17.095 [23198] <2> ProcessRequests: vnetd.c.367: status: 0 0x00000000
09:14:17.095 [23198] <4> main: Terminating with status 0
09:14:17.130 [23196] <2> vnet_send_network_socket: vnet_vnetd.c.1529:
hash_str2: 7698a562cfb22c368336dc7a63ddcaab
09:14:17.132 [23196] <2> ProcessRequests: vnetd.c.367: status: 0 0x00000000
09:14:17.132 [23196] <4> main: Terminating with status 0

==> bpbkar/050608_00001.log <==
09:14:17.434 [23201] <4> bpbkar main: INF - setenv BPEND_TIMEOUT=300
09:14:17.434 [23201] <4> bpbkar main: INF - setenv RESTARTED=0
09:14:17.434 [23201] <4> bpbkar main: INF - setenv BACKUPID=winds01_1210058054
09:14:17.434 [23201] <4> bpbkar main: INF - setenv UNIXBACKUPTIME=1210058054
09:14:17.434 [23201] <4> bpbkar main: INF - setenv BACKUPTIME=Tue May
6 09:14:14 2008

09:14:17.434 [23201] <4> bpbkar main: INF - Inform when done
09:14:17.434 [23201] <4> bpbkar main: INF - Echo keepalives
09:14:17.435 [23201] <4> bpbkar notify: INF - START bpstart_notify
09:14:17.435 [23201] <4> bpbkar notify:
/usr/openv/netbackup/bin/bpstart_notify.RACE winds01 RACE Daily-Online
INCR

==> bpcd/050608_00001.log <==
09:14:17.260 [23201] <2> bpcd main: child_args[32] = -jobgrpid
09:14:17.260 [23201] <2> bpcd main: child_args[33] = 4779
09:14:17.260 [23201] <2> bpcd main: child_args[34] = -tir
09:14:17.260 [23201] <2> bpcd main: child_args[35] = -tir_plus
09:14:17.260 [23201] <2> bpcd main: child_args[36] = -use_otm
09:14:17.260 [23201] <2> bpcd main: child_args[37] = -use_ofb
09:14:17.260 [23201] <2> bpcd main: child_args[38] = -b
09:14:17.260 [23201] <2> bpcd main: child_args[39] = winds01_1210058054
09:14:17.260 [23201] <2> bpcd main: child_args[40] = -nfsok
09:14:17.260 [23201] <2> bpcd main: Before execvp of command

On the server, there's of course a lot more.

==> ./bptm/050608_00001.log <==
10:32:21.109 [26311] <2> parse_resource_strings: Parsed message type
18, version 0, 4 parameters
10:32:21.109 [26311] <2> nbjm_media_request: Job control returned to BPTM
10:32:21.111 [26311] <2> drivename_open: Called with Create 1, file
SEAGATE.ULTRIUM06242-XXX.000
10:32:21.117 [26311] <2> drivename_lock: lock established
10:32:21.117 [26311] <2> drivename_write: Called with mode 0
10:32:21.127 [26311] <2> mount_open_media: Waiting for mount of media
id W041L1 (copy 1) on server winds06.
10:32:21.128 [26311] <4> create_tpreq_file: symlink to path /dev/rmt/0cbn
10:32:21.134 [26311] <2> manage_scsi_reserve: SCSI RESERVE
10:32:21.136 [26311] <4> expandpath:
/usr/openv/netbackup/db/media/tpreq/drive_SEAGATE.ULTRIUM06242-XXX.000
10:32:21.194 [26311] <2> process_brm_msg: no pending message from bpbrm

==> ./vnetd/log.050608 <==
10:32:21.590 [26317] <2> logparams: /usr/openv/bin/vnetd
10:32:21.591 [26317] <2> ProcessRequests: vnetd.c.284: msg: VNETD
ACCEPT FROM 10.0.1.26.48478 TO 10.0.1.26.13724 fd = 5
10:32:21.592 [26317] <2> vnet_version_accept: vnetd.c.1141:
*actual_version: 4 0x00000004
10:32:21.592 [26317] <2> ProcessRequests: vnetd.c.344: msg: Request
VN_REQUEST_SERVICE_SOCKET(6)
10:32:21.592 [26317] <2> process_service_socket: vnetd.c.1523:
service_name: tldcd
10:32:21.600 [26317] <2> vnet_send_network_socket: vnet_vnetd.c.1529:
hash_str2:
10:32:21.601 [26317] <2> ProcessRequests: vnetd.c.346: status: 0 0x00000000
10:32:21.601 [26317] <4> main: Terminating with status 0

==> ./bpjobd/log.050608 <==
10:32:30.049 [18402] <2> process_active_job: Frame type is 2 (jobid
4782) Sock (14)
10:32:30.049 [18402] <2> parse_packet: Index 55 Field m_nOperation Value <0>
10:32:30.049 [18402] <2> push_monitor_data: Written 24 for Monitor 0
10:32:30.049 [18402] <2> process_active_job: Frame type is 3 (jobid
4782) Sock (14)
10:32:30.049 [18402] <2> process_active_job: Begin to append
(JOBTRYFILE) job (4782)
10:32:30.049 [18402] <2> process_active_job: MOUNTING 1210062741 W041L1

Then the tape W041L1 gets mounted.

10:33:30.054 [18402] <2> process_active_job: MOUNTED 1210062802 W041L1
POSITIONING 1210062802 W041L1 25



==> ./bptm/050608_00001.log <==
10:40:11.477 [26311] <2> write_backup: write_data() returned,
exit_status = 0, CINDEX = 0, TWIN_INDEX = 0, backup_status = -7
10:40:11.487 [26311] <2> write_backup: tp.tv_sec = 1210063211,
stp.tv_sec = 1210062911, tp.tv_usec = 477263, stp.tv_usec = 75365, et
= 300401, mpx_total_kbytes[TWIN_INDEX = 0] = 0
10:40:11.488 [26311] <2> write_backup: no data written for fragment,
backspace before writing tapemark
10:40:11.488 [26311] <2> io_ioctl: command (2)MTBSF 1 from
(bptm.c.17872) on drive index 0

==> ./bpjobd/log.050608 <==
10:40:20.129 [18402] <2> process_active_job: Frame type is 1 (jobid
4782) Sock (14)
10:40:20.149 [18402] <2> parse_packet: Index 33 Field m_nKilobytes Value <0>
10:40:20.149 [18402] <2> parse_packet: Index 36 Field m_nKbPerSec Value <0>
10:40:20.149 [18402] <2> push_monitor_data: Written 40 for Monitor 0
10:40:20.150 [18402] <2> process_active_job: Frame type is 3 (jobid
4782) Sock (14)
10:40:20.200 [18402] <2> process_active_job: Begin to append
(JOBTRYFILE) job (4782)
10:40:20.200 [18402] <2> process_active_job: KBW 1210063212 0 0,000


==> ./bptm/050608_00001.log <==
10:40:26.277 [26311] <2> io_ioctl: command (0)MTWEOF 1 from
(bptm.c.18281) on drive index 0
10:40:47.469 [26311] <2> read_brm_msg: STOP BACKUP winds01_1210062735
10:40:47.469 [26311] <2> io_terminate_tape: writing empty backup
header, drive index 0, copy 1
10:40:47.470 [26311] <2> io_terminate_tape: reposition to previous
tapemark and rewrite header
10:40:47.470 [26311] <2> io_ioctl: command (2)MTBSF 1 from
(bptm.c.8164) on drive index 0
10:40:51.095 [26311] <2> io_ioctl: command (1)MTFSF 1 from
(bptm.c.8167) on drive index 0
10:40:51.115 [26311] <2> io_read_back_header: drive index 0, reading
backup header
10:40:51.286 [26311] <2> io_read_block: read error on media id W041L1,
drive index 0 reading header block, No space left on device
10:40:51.287 [26311] <2> io_read_back_header: drive index 0, tapemark
or ENOSPC encountered
10:40:51.287 [26311] <2> io_ioctl: command (2)MTBSF 1 from
(bptm.c.8185) on drive index 0
10:40:51.307 [26311] <2> io_ioctl: command (0)MTWEOF 1 from
(bptm.c.8229) on drive index 0


10:40:51.286: "No space left on device"? What's that about? It's *NOT*
a filesystem on the server or on the client: none of my filesystems are
anywhere near 100% full. Or is the tape full? If so, why was it mounted
in the first place? And why is it not full when I back up with a shorter
start delay (I verified this to be true)? What does "No space left on
device" mean here?
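One way I could try to check whether the tape really is at end-of-medium (bpmedialist ships with NetBackup; mt is the OS tape utility, shown here with the device path from the bptm log above - whether this tells the whole story is my assumption):

```shell
# Ask NetBackup what it thinks of the media (look for FULL/FROZEN flags).
/usr/openv/netbackup/bin/admincmd/bpmedialist -m W041L1

# Ask the OS tape driver about the drive's status/position
# (device path taken from the create_tpreq_file line in the bptm log).
mt -f /dev/rmt/0cbn status
```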

In ./bpbrm/050608_00001.log, there's:

10:40:11.353 [26315] <2> bpbrm sighandler: signal 14 caught by bpbrm
10:40:11.362 [26315] <2> bpbrm sighandler: bpbrm timeout after 300
seconds
10:40:11.362 [26315] <2> clear_held_signals: clearing signal mask
stack, mask_stack_depth = 0
10:40:11.362 [26315] <2> inform_client_of_status: INF - Server status = 41
10:40:11.475 [26301] <2> bpbrm brm_sigcld: SIGCLD caught by bpbrm
10:40:11.476 [26301] <2> bpbrm brm_sigcld: bpbrm child 26315
exit_status = 41, signal_status = 0
10:40:11.476 [26301] <2> bpbrm brm_sigcld: child 26315 exited with
status 41: network connection timed out
10:40:11.476 [26301] <2> bpbrm send_status_to_parent: bpbrm child is
done, but the media manager child is not.
10:40:11.476 [26301] <2> bpbrm tell_mm: sending media manager msg:
STOP BACKUP winds01_1210062735

Hm. Is that "bpbrm timeout after 300 seconds" the root cause? How do
I make that timeout larger? Is it CLIENT_READ_TIMEOUT or
CLIENT_CONNECT_TIMEOUT?
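If it is CLIENT_READ_TIMEOUT (its default of 300 seconds would at least match the message), I suppose raising it on the media server would look like this - my assumption that this is the timeout bpbrm is hitting, and 1800 is just an arbitrary test value:

```shell
# Raise the client read timeout on the media server (default is 300 s,
# matching "bpbrm timeout after 300 seconds"). Assumption: this is the
# timeout bpbrm is hitting; 1800 s is an arbitrary test value.
echo "CLIENT_READ_TIMEOUT = 1800" >> /usr/openv/netbackup/bp.conf

# Verify the setting is in place.
grep CLIENT_READ_TIMEOUT /usr/openv/netbackup/bp.conf
```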




Regarding network settings: I WILL verify them. But here's what's
confusing: why do I only have problems when the start delay is large?
Why do other policies/schedules work fine (using the same client and
the same server)? Further, I get the same symptoms when I back up a
different client in the same subnet (i.e. no backup to tape with a
large start delay; backup to tape works with a small start delay;
backup to disk works with both large and small start delays). Another
reason why I doubt network problems: it always used to work fine (and
still does). Client and server are connected to a 1 Gbit switch, and
that switch is not overloaded. No changes to the cabling or anything
like that have been made. ping (~0.4 ms) and traceroute (1 hop) work
just fine. The network is not heavily loaded. According to netstat,
there are 0 Ierrs, 0 Oerrs and 0 Collis. I don't think there are
network problems - what makes you think there are, considering that
networking works just fine? What other parameters should I check to
verify that there's indeed no network problem?
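These are the checks I have run (or plan to run) from the media server against the client; bpclntcmd ships with NetBackup, the rest are standard OS tools:

```shell
# Basic connectivity from the media server (winds06) to the client (winds01).
ping winds01
traceroute winds01
netstat -i          # interface errors / collisions

# Name resolution as NetBackup itself sees it.
/usr/openv/netbackup/bin/bpclntcmd -hn winds01

# And from the client side: which server does the client resolve/trust?
/usr/openv/netbackup/bin/bpclntcmd -pn
```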

As far as isolating goes: I think I pretty much have the problem
isolated already: the backup fails when start_delay is large AND the
target is tape. It does not depend on the client. So the question is:
why does backup to tape fail with a (41) when I have a large start
delay in place?

You know, what's blowing my mind is that I have these problems only
when I back up to tape with a large delay up front. Why is that?

Michael
_______________________________________________
Veritas-bu maillist  -  Veritas-bu AT mailman.eng.auburn DOT edu
http://mailman.eng.auburn.edu/mailman/listinfo/veritas-bu