Veritas-bu

[Veritas-bu] NDMP Fulls fail w/error 99

2010-06-23 15:29:22
Subject: [Veritas-bu] NDMP Fulls fail w/error 99
From: Roy McMorran <mcmorran AT mdibl DOT org>
To: Veritas-bu AT mailman.eng.auburn DOT edu
Date: Wed, 23 Jun 2010 15:29:11 -0400
I have a case open with Symantec, but I thought I'd try to tap the 
collective wisdom of this list to see if anyone has encountered this before.

I am consistently seeing a failure of full (only full) NDMP backups on 
one volume on a NetApp filer.  The last successful full backup of this 
volume was three weekends ago.  However, full backups of four other 
(smaller) volumes on the same filer run successfully, as do all 
incrementals on all volumes.

This is what we are seeing in the detailed status report:

06/16/2010 04:27:43 - Error bpbrm (pid=12985) db_FLISTsend failed: 
premature eof encountered (233)
06/16/2010 04:27:44 - Error ndmpagent (pid=12992) write socket err 4 
(Interrupted system call)
06/16/2010 04:27:45 - Error ndmpagent (pid=12992) FH send to BRM failed 
- Ino - 2884 0 2884 0 2454527 0 0 8 33261 mcmorran staff 2884 1267655394 
1195639440 1239117488 0000020e1df4040000000000019fda9b 27253403
06/16/2010 04:27:45 - Error ndmpagent (pid=12992) NDMP backup failed, 
path = /vol/vol4/
06/16/2010 04:28:28 - Error bptm (pid=12993) io_ioctl_ndmp (MTBSF) 
failed on media id DC4000, drive index 5, return code 18 
(NDMP_XDR_DECODE_ERR) (bptm.c.8703)
06/16/2010 04:28:29 - end writing; write time: 2:56:58
NDMP backup failure  (99)


So far we have increased MAX_FILES_PER_ADD up to 25000 and then 40000 
but no joy there.

The LAN and fibre channel connections look to be OK.  No errors or 
collisions visible on any interface or switch.

There are no errors in the messages files of the server or the filer.  
Symantec has verbose logs for bptm, bpbrm, ndmp, ndmpagent, etc. and the 
ndmpd log from the filer, but so far hasn't found anything helpful.  
I've included some excerpts below.

A test of dump to null (from the command line) on the filer ran to 
completion without errors.

The filer is a NetApp 3020 running OnTap 7.2.5.  It has one LTO4 drive 
fibre-attached.  In the same (SL-500) library there are three other 
drives that are attached to the master/media server.  Netbackup is 
Version 7.  The Master/media server is running Solaris 10.

Any ideas, or tips for troubleshooting?  Thanks much.

Some log excerpts:
bpbrm log:
04:27:03.230 [12985] <2> db_end: Need to collect reply
04:27:42.832 [12985] <2> get_long: (2) premature end of file (byte 1)
04:27:42.952 [12985] <2> db_getdata: get_string() failed: I/O error (5) 
premature end of file encountered (-9)
04:27:42.953 [12985] <2> db_end: no DONE from db_getreply(): premature 
eof encountered
04:27:42.953 [12985] <2> db_FLISTsend: db_end_sts() failed: premature 
eof encountered (233)
04:27:42.987 [12985] <16> bpbrm main: db_FLISTsend failed: premature eof 
encountered (233)
04:27:43.348 [12985] <2> vnet_vnetd_service_socket: 
../../libvlibs/vnet_vnetd.c.2054: VN_REQUEST_SERVICE_SOCKET: 6 0x00000006
04:27:43.348 [12985] <2> vnet_vnetd_service_socket: 
../../libvlibs/vnet_vnetd.c.2068: service: bpjobd
04:27:43.363 [12985] <2> job_connect: SO_KEEPALIVE set on socket 5 for 
client nbusvr
04:27:43.363 [12985] <2> logconnections: BPJOBD CONNECT FROM 
209.222.195.241.37331 TO 209.222.195.241.13724
04:27:43.363 [12985] <2> job_authenticate_connection: ignoring VxSS 
authentication check for now...
04:27:43.364 [12985] <2> job_connect: Connected to the host nbusvr 
contype 10 jobid <143965> socket <5>
04:27:43.364 [12985] <2> job_connect: Connected on port 37331
04:27:43.484 [12985] <2> job_monitoring_exex: ACK disconnect
04:27:43.484 [12985] <2> job_disconnect: Disconnected
04:27:43.508 [12985] <4> db_error_add_to_file: db_FLISTsend failed: 
premature eof encountered (233)
04:27:43.551 [12985] <2> signal_ndmpagent: sending signal=1 to ndmpagent 
on nbusvr, client_pid=12992
04:27:43.551 [12985] <2> bpcr_send_signal: Ignoring connect_opts = 
0x01030202
04:27:43.551 [12985] <2> read_client: ?
04:27:43.551 [12985] <2> read_client: opendir() failed: nbusvr: No such 
file or directory (2)

bptm log:
04:27:25.432 [12993] <2> NdmpSession: [1164] Sending 15 (GET_KBYTES) ""
04:27:45.068 [12993] <2> NdmpSession: [1164] Reply 15 "786532096 0 
12289564", error = 0
04:28:15.072 [12993] <2> NdmpSession: [1165] Sending 15 (GET_KBYTES) ""
04:28:28.005 [12993] <2> NdmpSession: [1166] Received 8 
(FAILED_ALL_PATHS) ""
04:28:28.005 [12993] <2> NdmpSession: [1166] Replying error = 0
04:28:28.006 [12993] <2> NdmpSession: [1165] Reply 15 "791310912 0 
12364233", error = 0
04:28:28.006 [12993] <16> write_data: ndmpagent reports failure - all paths
04:28:28.006 [12993] <2> write_backup: write_data() returned, 
exit_status = 99, CINDEX = 0, TWIN_INDEX = 0, backup_status = -8
04:28:28.006 [12993] <2> write_backup: tp.tv_sec = 1276676908, 
stp.tv_sec = 1276666291, tp.tv_usec = 6090, stp.tv_usec = 921462, et = 
37732520, mpx_total_kbytes[TWIN_INDEX = 0] = 2211481024
04:28:28.006 [12993] <2> io_terminate_tape: writing empty backup header, 
drive index 5, copy 1
04:28:28.006 [12993] <2> io_terminate_tape: reposition to previous 
tapemark and rewrite header
04:28:28.006 [12993] <2> io_ioctl: command (2)MTBSF 1 from (bptm.c.8703) 
on drive index 5
04:28:28.715 [12993] <16> io_ioctl: io_ioctl_ndmp (MTBSF) failed on 
media id DC4000, drive index 5, return code 18 (NDMP_XDR_DECODE_ERR) 
(bptm.c.8703)
04:28:28.715 [12993] <2> send_MDS_msg: DEVICE_STATUS 1 7884 nbusvr 
DC4000 4000449 LTO4-4 2000022 POSITION_ERROR 0 0
04:28:29.444 [12993] <2> log_media_error: successfully wrote to error 
file - 06/16/10 04:28:29 DC4000 5 POSITION_ERROR LTO4-4
04:28:29.444 [12993] <2> io_close: closing 
/usr/openv/netbackup/db/media/tpreq/drive_LTO4-4, from bptm.c.18975
04:28:29.556 [12993] <2> io_close: ignore ndmp close error, NDMP_EOF_ERR
04:28:29.556 [12993] <2> drivename_write: Called with mode 1
04:28:29.556 [12993] <2> drivename_unlock: unlocked
04:28:29.556 [12993] <2> drivename_close: Called for file LTO4-4
04:28:29.557 [12993] <2> tpunmount: NOP: MEDIA_DONE 0 143965 0 DC4000 
4000449 0 {4A61577C-7908-11DF-BD6F-0003BAC24451}
04:28:29.557 [12993] <2> send_brm_msg: EXIT netapp-e0d_1276638844 99
04:28:29.560 [12993] <2> NdmpSession: ndmp_public_session_destory: 
destroying session 0x100937f90
04:28:29.750 [12993] <2> NdmpSession: ndmp_net_read: connection 
1009b3b20: (9) end of file detected
04:28:29.750 [12993] <2> NdmpSession: ndmp_message_receive: 
xdr_ndmp_header failed, unable to XDR decode
04:28:29.750 [12993] <2> NdmpSession: ndmp_net_read: connection 
1009b3b20: (9) end of file detected
04:28:29.750 [12993] <2> NdmpSession: ndmp_xm_wait_for_connect_close: 
ERROR ndmp_process_message failed, status = 18
04:28:29.750 [12993] <2> NdmpSession: 1009b3b20 ->  (9) 1174 [0] 
04:28:29 NDMP_CONNECT_CLOSE (0x902) NDMP_NO_ERR (0x0)
04:28:29.944 [12993] <2> bptm: EXITING with status 99 <----------

filer ndmpd.log:
Jun 16 04:27:45 EDT [ndmpd:47]: NDMP message type: NDMP_MOVER_GET_STATE
Jun 16 04:27:45 EDT [ndmpd:47]: NDMP message replysequence: 1212
Jun 16 04:27:45 EDT [ndmpd:47]: Message Header:
Jun 16 04:27:45 EDT [ndmpd:47]: Sequence 0
Jun 16 04:27:45 EDT [ndmpd:47]: Timestamp 0
Jun 16 04:27:45 EDT [ndmpd:47]: Msgtype 1
Jun 16 04:27:45 EDT [ndmpd:47]: Method 2560
Jun 16 04:27:45 EDT [ndmpd:47]: ReplySequence 1212
Jun 16 04:27:45 EDT [ndmpd:47]: Error NDMP_NO_ERR
Jun 16 04:27:45 EDT [ndmpd:47]: Error code: NDMP_NO_ERR
Jun 16 04:27:45 EDT [ndmpd:47]: Record number: 34554391
Jun 16 04:27:45 EDT [ndmpd:47]: Mover state: NDMP_MOVER_STATE_ACTIVE
Jun 16 04:27:45 EDT [ndmpd:48]: NDMP message type: NDMP_DATA_GET_STATE_V2
Jun 16 04:27:45 EDT [ndmpd:48]: NDMP message replysequence: 1131
Jun 16 04:27:45 EDT [ndmpd:48]: Message Header:
Jun 16 04:27:45 EDT [ndmpd:48]: Sequence 0
Jun 16 04:27:45 EDT [ndmpd:48]: Timestamp 0
Jun 16 04:27:45 EDT [ndmpd:48]: Msgtype 1
Jun 16 04:27:45 EDT [ndmpd:48]: Method 1024
Jun 16 04:27:45 EDT [ndmpd:48]: ReplySequence 1131
Jun 16 04:27:45 EDT [ndmpd:48]: Error NDMP_NO_ERR
Jun 16 04:27:45 EDT [ndmpd:48]: Error code: NDMP_NO_ERR
Jun 16 04:27:45 EDT [ndmpd:48]: NDMP message type: NDMP_DATA_ABORT
Jun 16 04:27:45 EDT [ndmpd:48]: NDMP message replysequence: 1132
Jun 16 04:27:45 EDT [ndmpd:48]: Message Header:
Jun 16 04:27:45 EDT [ndmpd:48]: Sequence 0
Jun 16 04:27:45 EDT [ndmpd:48]: Timestamp 0
Jun 16 04:27:45 EDT [ndmpd:48]: Msgtype 1
Jun 16 04:27:45 EDT [ndmpd:48]: Method 1027
Jun 16 04:27:45 EDT [ndmpd:48]: ReplySequence 1132
Jun 16 04:27:45 EDT [ndmpd:48]: Error NDMP_NO_ERR
Jun 16 04:27:45 EDT [ndmpd:47]: Message NDMP_NOTIFY_MOVER_HALTED sent
Jun 16 04:27:45 EDT [ndmpd:47]: Message Header:
Jun 16 04:27:45 EDT [ndmpd:47]: Sequence 1217
Jun 16 04:27:45 EDT [ndmpd:47]: Timestamp 1276676865
Jun 16 04:27:45 EDT [ndmpd:47]: Msgtype 0
Jun 16 04:27:45 EDT [ndmpd:47]: Method 1283
Jun 16 04:27:45 EDT [ndmpd:47]: ReplySequence 0
Jun 16 04:27:45 EDT [ndmpd:47]: Error NDMP_NO_ERR
Jun 16 04:27:45 EDT [ndmpd:47]: Reason: 1
Jun 16 04:28:27 EDT [ndmpd:48]: Message: NDMP_NOTIFY_DATA_HALTED sent
Jun 16 04:28:27 EDT [ndmpd:48]: Message Header:
Jun 16 04:28:27 EDT [ndmpd:48]: Sequence 2766
Jun 16 04:28:27 EDT [ndmpd:48]: Timestamp 1276676907
Jun 16 04:28:27 EDT [ndmpd:48]: Msgtype 0
Jun 16 04:28:27 EDT [ndmpd:48]: Method 1281
Jun 16 04:28:27 EDT [ndmpd:48]: ReplySequence 0
Jun 16 04:28:27 EDT [ndmpd:48]: Error NDMP_NO_ERR
Jun 16 04:28:27 EDT [ndmpd:48]: Reason: 2
Jun 16 04:28:27 EDT [ndmpd:48]: Message NDMP_LOG_MESSAGE sent
Jun 16 04:28:27 EDT [ndmpd:48]: Message Header:
Jun 16 04:28:27 EDT [ndmpd:48]: Sequence 2767
Jun 16 04:28:27 EDT [ndmpd:48]: Timestamp 1276676907
Jun 16 04:28:27 EDT [ndmpd:48]: Msgtype 0
Jun 16 04:28:27 EDT [ndmpd:48]: Method 1539
Jun 16 04:28:27 EDT [ndmpd:48]: ReplySequence 0
Jun 16 04:28:27 EDT [ndmpd:48]: Error NDMP_NO_ERR
Jun 16 04:28:27 EDT [ndmpd:48]: Log type: 0
Jun 16 04:28:27 EDT [ndmpd:48]: Message id: 0
Jun 16 04:28:27 EDT [ndmpd:48]: Log message: Aborted by client
Jun 16 04:28:27 EDT [ndmpd:48]: Associated message valid: 0
Jun 16 04:28:27 EDT [ndmpd:48]: Associated message sequence: 0
Jun 16 04:28:27 EDT [ndmpd:48]: Error code: NDMP_NO_ERR
Jun 16 04:28:27 EDT [ndmpd:48]: NDMP message type: NDMP_DATA_STOP
Jun 16 04:28:27 EDT [ndmpd:48]: NDMP message replysequence: 1133
Jun 16 04:28:27 EDT [ndmpd:48]: Message Header:
Jun 16 04:28:27 EDT [ndmpd:48]: Sequence 0
Jun 16 04:28:27 EDT [ndmpd:48]: Timestamp 0
Jun 16 04:28:27 EDT [ndmpd:48]: Msgtype 1
Jun 16 04:28:27 EDT [ndmpd:48]: Method 1031
Jun 16 04:28:27 EDT [ndmpd:48]: ReplySequence 1133
Jun 16 04:28:27 EDT [ndmpd:48]: Error NDMP_NO_ERR
Jun 16 04:28:27 EDT [ndmpd:48]: Error code: NDMP_NO_ERR

filer backup log:
dmp Tue Jun 15 18:01:12 EDT /vol/vol4/(0) Phase_change (III)
dmp Tue Jun 15 18:46:31 EDT /vol/vol4/(0) Phase_change (IV)
dmp Wed Jun 16 04:27:45 EDT /vol/vol4/(0) Tape_close (ndmp)
dmp Wed Jun 16 04:27:45 EDT /vol/vol4/(0) Abort (2159666 MB)
dmp Wed Jun 16 04:27:45 EDT /vol/vol4/(0) Log_msg (reg inodes: 2860708 
other inodes: 7324 dirs: 127786 nt dirs: 31694 nt inodes: 32206 acls: 801)
dmp Wed Jun 16 04:27:45 EDT /vol/vol4/(0) Log_msg (Phase 3: directories 
dumped: 159481)

-- 
Roy McMorran
Systems Administrator
MDI Biological Laboratory
mcmorran AT mdibl DOT org


_______________________________________________
Veritas-bu maillist  -  Veritas-bu AT mailman.eng.auburn DOT edu
http://mailman.eng.auburn.edu/mailman/listinfo/veritas-bu