Veritas-bu

[Veritas-bu] Informix backup aborting in the middle

2001-01-26 08:02:37
Subject: [Veritas-bu] Informix backup aborting in the middle
From: Gustavo Leite de Mendonca Chaves gustavo AT cpqd.com DOT br
Date: Fri, 26 Jan 2001 11:02:37 -0200
Dear managers,

I'm backing up a large Informix database using the NBU informix
agent.  The backup master is running NBU 3.2 over Solaris 2.6 and the
client is running Informix 7.31.UC4 over Solaris 2.6.

The database consists of 62 dbspaces and the (daily) scheduled backups
are started with the following command:

        onbar -b -L [0|1] dbspaces...

Lately, more often than not, these backups have aborted after having
backed up about 60-70% of the dbspaces.  The error doesn't happen
always in the same place and sometimes the backups finishes
successfully.  I even changed the backup start time but to no avail.

I've digged through all the logs I could find and I'm still in the
dark.  I've noticed some messages like 'network connection timed out'
but I don't understand who's waiting for whom and if there's some
configurable *TIMEOUT variable that I could increase to avoid the
problems.

I'm sending small parts of the logs I read in the hope that
some of you can spot the real problem.

Thank you for any help.

Gustavo.

========== bar_act.log on client
2001-01-26 02:10:04 1666  1661 /home/informix/bin/onbar_d -b -L 1 rootdbs 
logdbs physdbs maindbs regfdbs01 regfdbs02 regfdbs03 regfdbs04 regfdbs05 
regfdbs06 regfdbs07 regfdbs08 regfdbs09 regfdbs10 regfdbs11 regfdbs12 regfdbs13 
regfdbs14 regfdbs15 regfdbs16 regfdbs17 regfdbs18 regfdbs19 regfidx01 dercdbs01 
dercdbs02 dercdbs03 dercdbs04 dercdbs05 dercdbs06 dercdbs07 dercdbs08 dercdbs09 
dercdbs10 dercdbs11 dercdbs12 dercdbs13 dercidx01 hashdbs01 hashdbs02 hashdbs03 
hashdbs04 expecd0dbs01 expecd0dbs02 expecd0idx01 remudia0dbs01 remudia0dbs02 
remudia0idx01 remudiadbs01 remudiaidx01 expecdespdbs01 expecdespidx01 
dadosdbs01 dadosdbs02 dadosdbs03 dadosdbs04 dadosdbs05 dadosidx01 hashdbs05 
hashdbs06 hashidx01 flatdbs01
2001-01-26 02:10:14 1666  1661 WARNING: BAR_BSALIB_PATH is undefined; trying 
default path.
2001-01-26 02:30:24 12313  1666 XBSA Error (BSACreateObject): A system error 
occurred. Aborting XBSA session.
2001-01-26 02:30:24 12313  1666 XBSA Error (BSATerminate): The transaction was 
aborted.
2001-01-26 02:30:25 1666  1661 The ON-Bar process 12313 exited with a problem 
(exit code 3 (0x3), signal 0).
2001-01-26 03:52:44 11050  1666 XBSA Error (BSASendData): A system error 
occurred. Aborting XBSA session.

========== online.log on client
03:52:46  Archive on regfdbs13 ABORTED.
03:52:46  Aborted by client.
03:52:46  Archive on regfdbs14 ABORTED.
03:52:46  Aborted by client.
03:52:46  Archive on regfdbs12 ABORTED.
03:52:46  Aborted by client.
03:52:48  Archive on hashidx01 ABORTED.
03:52:48  Aborted by client.

========== bperror output on server:
01/26/01 03:50:41 poseidon merlin  backup by informix on client merlin using 
class informix-merlin:  network connection timed out
01/26/01 03:52:44 poseidon merlin  client merlin aborted
01/26/01 03:54:02 poseidon merlin  backup by informix on client merlin using 
class informix-merlin:  client did not start

========== bpbrm.log on server:
03:50:32 [27217] <2> getsockconnected: host=merlin service=bpcd 
address=10.202.72.105 protocol=tcp reserved port=13782
03:50:32 [27217] <2> getsockconnected: Connect to merlin on port 996
03:50:32 [27217] <2> bpcr_get_peername_rqst: Server peername length = 8
03:50:32 [27217] <4> bpbrm write_msg_to_progress_file: INF - Data socket = 
poseidon.4989
03:50:32 [27217] <4> bpbrm write_msg_to_progress_file: INF - Name socket = 
poseidon.4987
03:50:32 [27217] <4> bpbrm write_msg_to_progress_file: INF - New data socket = 
poseidon.4993
03:50:32 [27217] <4> bpbrm write_msg_to_progress_file: INF - Backup id = 
merlin_0980488227
03:50:32 [27217] <4> bpbrm write_msg_to_progress_file: INF - Compression = 0
03:50:32 [27217] <4> bpbrm write_msg_to_progress_file: INF - Multiplexing = 8
03:50:32 [27217] <4> bpbrm write_msg_to_progress_file: INF - Client read 
timeout = 900
03:50:33 [27217] <4> bpbrm write_msg_to_progress_file: INF - Media mount 
timeout = 0
03:52:44 [27217] <16> bpbrm read_backup_start: client merlin aborted

========== bprd.log on server
03:50:22 [27146] <4> connected_peer: Connection from host merlin, 
10.202.72.105, on non-reserved port 60805
03:50:22 [27146] <2> getsockconnected: host=poseidon service=bpdbm 
address=10.202.72.97 protocol=tcp non-reserved port=13721
03:50:22 [27146] <4> get_type_of_client_port: db_getCLIENT() failed: no entity 
was found (227)
03:50:22 [27146] <2> getsockconnected: host=poseidon service=bpdbm 
address=10.202.72.97 protocol=tcp non-reserved port=13721
03:50:22 [27146] <2> db_getCLIENT_by_hostname: db_CLIENTreceive: no entity was 
found 227
03:50:22 [27146] <2> logparams: /usr/openv/netbackup/bin/bpsched bpsched -ru 
informix -rg informix -ct 6 -client merlin -class informix-merlin -cv -pl 
/usr/openv/netbackup/ext/db_ext/informix/infx_db_comm/infxbsa.19426.1 -UB 
-clnt_lc_messages C -clnt_lc_time C -clnt_lc_ctype C -clnt_lc_collate C 
-clnt_lc_numeric C 
03:50:22 [27146] <4> wait_for_child: start
03:50:42 [25678] <2> wait_for_child: child exit_status = 41 signal_status = 0
03:50:42 [25678] <16> bkarfiles: User backup failed (client = merlin user = 
informix group = informix): network connection timed out
03:50:42 [25678] <2> mail_msg: entered; status = 41
03:50:42 [25678] <2> mail_msg: Attempting to send mail to informix on merlin
03:50:42 [25678] <2> mail_msg: Getservbyname returned port 13782
03:50:42 [25678] <2> getsockconnected: host=merlin service=bpcd 
address=10.202.72.105 protocol=tcp reserved port=13782
03:50:42 [25678] <2> getsockconnected: Connect to merlin on port 992
03:50:43 [25678] <2> mail_msg: CMD_SOCK from bpcr = 5
03:50:43 [25678] <2> mail_msg: STAT_SOCK from bpcr = 8
03:50:43 [25678] <2> bpcr_get_version_rqst: bpcd version: 03200000
03:50:43 [25678] <2> bpcr_get_version_rqst: bpcd version: 03200000
03:50:43 [25678] <16> bpcr_send_mail_rqst: couldn't read the user's (informix) 
configuration file
03:50:43 [25678] <16> mail_msg: BACKUP EXIT STATUS = 41
03:50:43 [25678] <2> getsockconnected: host=merlin service=bpcd 
address=10.202.72.105 protocol=tcp reserved port=13782
03:50:43 [25678] <2> getsockconnected: Connect to merlin on port 996
03:50:45 [25678] <4> process_request: EXIT STATUS 41
03:50:45 [20736] <2> getsockconnected: host=poseidon service=bpdbm 
address=10.202.72.97 protocol=tcp non-reserved port=13721

========== bpsched.log on server:
03:50:24 [21213] <4> read_worklists_from_file: reading 
/usr/openv/netbackup/bin/bpsched.d/worklist.27157 for work to do
03:50:24 [21213] <4> log_in_errorDB: added backup job for client merlin, class 
informix-merlin, schedule Default-Policy part 0 to NetBackup scheduler work 
queue
03:50:25 [21213] <2> getsockconnected: Connect to poseidon on port 1004
03:50:27 [21213] <4> available_media: (BPTM) 9714 -- num active = 1, num 
available = 8, num inuse = 1
03:50:27 [21213] <4> ?: ------------------------------------
03:50:27 [21213] <4> ?: >>  CLIENT: merlin (informix) BIRTH: 980488227  
PRIORITY: 110
03:50:27 [21213] <4> ?:     CLASS: informix-merlin  SCHED: Default-Policy(2)   
03:50:27 [21213] <4> ?:     STUNIT: 9714 (poseidon)
03:50:27 [21213] <4> ?:     PID: 27189  STATUS: -1  TRIES: 0  SPID: 27157 (UB)
03:50:27 [21213] <4> ?: ------------------------------------
03:50:27 [21213] <4> log_in_errorDB: started backup job for client merlin, 
class informix-merlin, schedule Default-Policy on storage unit 9714
03:50:29 [27189] <2> getsockconnected: Connect to poseidon on port 1003
03:50:29 [27189] <2> getsockconnected: Connect to merlin on port 997
03:50:33 [21213] <2> unmount_all_tapes: initiating bptm -U on poseidon
03:50:33 [21213] <2> bpcr_ex_client_cmd: getservbyname returned port 13782
03:50:33 [21213] <2> getsockconnected: Connect to poseidon on port 996
03:50:33 [21213] <2> bpcr_ex_client_cmd: CLIENT_CMD_SOCK from bpcr = 2
03:50:33 [21213] <2> bpcr_ex_client_cmd: CLIENT_STAT_SOCK from bpcr = 5
03:50:33 [21213] <16> get_mesgQ: msgget failed (-1):  No such file or directory 
(2) 
03:50:34 [27189] <2> getsockconnected: Connect to merlin on port 992
03:50:40 [21841] <2> getsockconnected: Connect to merlin on port 992
03:50:41 [21213] <4> ?: ------------------------------------
03:50:41 [21213] <4> ?: <<  CLIENT: merlin (informix) BIRTH: 980484642  
PRIORITY: 110
03:50:41 [21213] <4> ?:     CLASS: informix-merlin  SCHED: Default-Policy(2)   
03:50:41 [21213] <4> ?:     STUNIT: 9714 (poseidon)
03:50:41 [21213] <4> ?:     PID: 21841  STATUS: 41  TRIES: 1  SPID: 25686 (UB)
03:50:41 [21213] <4> ?: ------------------------------------
03:50:41 [21213] <4> ?: merlin exited with status 41 (network connection timed 
out)
03:50:41 [21213] <2> unmount_all_tapes: initiating bptm -U on zeus
03:50:41 [21213] <2> bpcr_ex_client_cmd: getservbyname returned port 13782
03:50:41 [21213] <2> getsockconnected: Connect to zeus on port 992
03:50:41 [25686] <16> log_in_errorDB: backup by informix on client merlin using 
class informix-merlin:  network connection timed out
03:50:41 [21213] <2> bpcr_ex_client_cmd: CLIENT_CMD_SOCK from bpcr = 2
03:50:41 [21213] <2> bpcr_ex_client_cmd: CLIENT_STAT_SOCK from bpcr = 5
03:50:42 [25686] <16> bpsched: backup by informix on client merlin using class 
informix-merlin:  network connection timed out
03:50:42 [25686] <2> getsockconnected: Connect to merlin on port 992

========== bptm.log on server:
03:50:30 [27211] <2> bptm: INITIATING: -pid 27206 -den 13 -rt 8 -rn 1 -stunit 
9714 -cj 2 -mpx 8 -p diario -maxfrag 500 -mediasvr poseidon 
03:50:30 [27211] <2> send_brm_msg: PID of bpxm = 27211
03:50:30 [27211] <2> process_mpx_protocol: wait for initial START BACKUP 
message from bpbrm
03:50:31 [27211] <2> read_brm_msg: START BACKUP -b merlin_0980488227 -bt 
980488227 -c merlin -cl informix-merlin -st 2 -rl 2 -rp 2592000 -sl 
Default-Policy -ct 6 -rclnt merlin -rclnthostname merlin -L 
/usr/openv/netbackup/ext/db_ext/informix/infx_db_comm/infxbsa.19426.1 -ru 
informix -lcmsg C -lctime C -jobid 132149
03:50:32 [27211] <2> validate_parameters: maximum fragment size (in Kbytes) is 
500000
03:50:32 [27211] <2> io_init: using 64512 data buffer size
03:50:32 [27211] <2> io_init: CINDEX 0, sched Kbytes for monitoring = 10000
03:50:32 [27211] <2> io_init: using 4 data buffers
03:50:32 [27211] <2> io_init: child delay = 20, parent delay = 30 (milliseconds)
03:50:32 [27211] <2> mpx_setup_shm: buffer control shared memory address is 
0xef750000, size is 800, shmid is 31114
03:50:32 [27211] <2> mpx_setup_shm: buf control for CINDEX 0 is 0xef750000
03:50:32 [27211] <2> mpx_setup_shm: shared memory address for group 0 is 
0xef400000, size is 1032192, shmid is 9815
03:50:32 [27211] <2> mpx_setup_shm: shared memory address for CINDEX 0 is 
0xef400000, group 0, num_active 1
03:50:32 [27211] <2> mpx_start_child: backup child for merlin_0980488227 is pid 
27215, CINDEX = 0
03:50:32 [27211] <2> send_brm_msg: DATASOCKET merlin_0980488227 4993
03:50:32 [27211] <2> add_to_vmhost_list: added poseidon to vmhost list
03:50:33 [27211] <2> select_media: consider allowing retention level 2 on media 
(T00332) that is 5 since its enabled by bp.conf file
03:50:33 [27211] <2> select_media: consider allowing retention level 2 on media 
(T00128) that is 5 since its enabled by bp.conf file
03:50:33 [27211] <2> select_media: consider allowing retention level 2 on media 
(T00338) that is 8 since its enabled by bp.conf file
03:50:33 [27211] <2> select_media: consider allowing retention level 2 on media 
(T00149) that is 8 since its enabled by bp.conf file
03:50:33 [27211] <2> select_media: consider allowing retention level 2 on media 
(T00107) that is 5 since its enabled by bp.conf file
03:50:33 [27211] <2> select_media: consider allowing retention level 2 on media 
(T00410) that is 8 since its enabled by bp.conf file
03:50:33 [27211] <2> db_lock_media: unable to lock media at offset 322 (T00422)
03:50:33 [27211] <2> check_available_drives: checking drives, about to request 
media id T00421
03:50:33 [27211] <2> db_byid: search for media id T00422
03:50:33 [27211] <2> db_byid: T00422 found at offset 322
03:50:33 [27211] <2> query_one_volume: server returned:  1 T00422 ------ 11 
T00422 -------- 8 1 poseidon 00_001_TLD 68 0 0 0 0 root root 5 diario free_pool 
980377665 980464934 980473271 980473271 0 0 1 0 0 - 0 0 32 0 0 0 0 0 - - ------
03:50:33 [27211] <2> db_lock_media: unable to lock media at offset 322 (T00422)
03:50:33 [27211] <2> select_media: selected media id T00421 for backup, 
merlin(rl = 2) <----------
03:50:33 [27211] <2> mount_open_media: Waiting for mount of media id T00421 on 
server poseidon.
03:50:33 [27221] <2> bptm: INITIATING: -U 
03:50:33 [27221] <2> db_byid: search for media id T00422
03:50:33 [27221] <2> db_byid: T00422 found at offset 322
03:50:33 [27221] <2> db_lock_media: unable to lock media at offset 322 (T00422)
03:50:33 [27221] <2> bptm: EXITING with status 0 <----------
03:50:34 [27211] <2> process_brm_msg: no pending message from bpbrm
03:50:37 [27215] <2> io_set_recvbuf: setting receive network buffer to 32032 
bytes
03:50:37 [27215] <2> fill_buffer: VSMInit () failed: 2d



<Prev in Thread] Current Thread [Next in Thread>