Veritas-bu

[Veritas-bu] External event caused rewind during write - Media Servers Windows

2004-09-30 00:24:37
Subject: [Veritas-bu] External event caused rewind during write - Media Servers Windows
From: alessandro.dasilveira AT veritas DOT com (Alessandro Da Silveira)
Date: Thu, 30 Sep 2004 00:24:37 -0400
Or you can try the steps below for each operating system:

Sun Management Center (SyMON) polls the tape drives while the Windows
media servers are writing to tape, which causes the drives to reset and
return status 84.
To determine whether SyMON is running:
# ps -ef | grep symond
This should return the process ID of the symond daemon if it is running on
the system. Use the following command to stop the process:
# /opt/SUNWsymon/sbin/sm_control stop

Check the event viewer for the time frame in which the status code 84 was
seen. If the tape drive is showing a cleaning requested message, this error
can be resolved by cleaning the drive.
In one instance, the "nolargefiles" option was NOT enabled on the AIX file
system. The catalog backup, which was over 1 GB in size, would not back up to
disk and failed with a media write error, even though there was ample space
available on the file system. After changing the catalog compression option
to compress every three days, and compressing the catalog manually using:
     /usr/openv/netbackup/bin/admincmd/bpimage -compress -allclients
the catalog backup ran to disk with no issue.
There is an issue with AIX 4.x being unable to allocate more than 1 GB
per file even if "nolargefiles" is not set for the file system.
This was proven by attempting to create a file larger

Drive is being downed by NetBackup on HP-UX 11.0 and 11i
In order for NetBackup to use Fast Tape Positioning (Locate Block),
TapeAlert, and Resume logic on HP-UX, it must be configured to use the device
files for the sctl pass-through driver. This applies to both fibre- and
SCSI-connected tape drives. A quick way to verify this is to check the
/dev/sctl directory for the controller, target, and LUN entries of the tape
devices.
An example would be:
# ls -al /dev/sctl
crw-rw-rw- 1 root sys 203 0x042100 Oct 25 14:21 c4t2l1
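The directory check can also be scripted. The following is only a minimal sketch, not a NetBackup tool: has_sctl_entry is a hypothetical helper name, and the directory argument (defaulting to /dev/sctl) exists only so the function can be tried against a scratch directory.

```shell
#!/bin/sh
# Hypothetical helper (not part of NetBackup): succeed if a pass-through
# device file named c<controller>t<target>l<lun> exists in the sctl
# directory.
# Usage: has_sctl_entry CONTROLLER TARGET LUN [DIR]
has_sctl_entry() {
    dir=${4:-/dev/sctl}
    [ -e "$dir/c${1}t${2}l${3}" ]
}

# Example for a drive on controller 4, target 2, LUN 1:
# if has_sctl_entry 4 2 1; then
#     echo "pass-through path present"
# else
#     echo "no /dev/sctl entry for this drive"
# fi
```

The function only tests for the existence of the entry; it does not confirm that NetBackup is actually configured to use that path.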
Fast Tape Positioning (Locate Block)
This is the capability to position directly to the block where the data is
written. Without Locate Block, restore performance may be slow, depending on
where on the tape the data is located.
TapeAlert
NetBackup also needs this pass-through driver configured so that its AVRD
process can pick up TapeAlerts. TapeAlert is a technology developed by HP to
request drive cleaning and report status, hardware defects, read/write
errors, etc. Please review TechNotes 231772, 231451, and 242872 for further
information regarding the HP TapeAlert technology. The links for these
TechNotes are in the Related Section.
NetBackup will still function to a degree when not configured in this
manner. It looks for the pass-through path and, when that is not there, uses
the driver that is configured. As a result, TapeAlerts are not picked up and
I/O or EIO type errors occur.
Resume Logic
The last piece of this situation is "Resume Logic". Resume logic was
implemented in NetBackup to recover from a LIP on a fibre channel loop. It
can also recover from a normal fibre channel glitch, such as a cable being
pulled and reinserted. This resume logic is implemented in the bptm code.
Resume logic will retry the drive 5 times at 3-minute intervals before
failing the backup. Without the pass-through path, the backup fails
immediately because the resume logic is bypassed completely.

Media read and write errors are being generated when using DLT drives on
AIX.
Exit Status Code 84: media write error
Exit Status Code 85: media read error

AIX has a default tape block size of 512 bytes. This causes I/O problems on
DLT tape drives. The solution is to change this to 0, which means variable
length. Use 'smit' to alter the default setting, as follows:
SMIT - DEVICES - TAPE DRIVE - CHANGE / SHOW CHARACTERISTICS OF A TAPE DRIVE
- SELECT DRIVE
The menu that then appears includes an option for Block size. This should be
set to 0.

AIX cannot write an image to media because a system call received a
parameter that is not valid.
Example of the error message seen in the bptm log:
01:14:22 [24214] <16> write_data: cannot write image to media id H1017, drive index 0, A system call received a parameter that is not valid.
01:14:22 [24214] <2> log_media_error: successfully wrote to error file - 05/23/01 01:14:22 H1017 0 WRITE_ERROR
01:14:22 [24214] <2> check_error_history: called from bptm line 12272, EXIT_Status = 84
01:14:26 [24214] <8> check_error_history: DOWN'ing drive index 0, it has had at least 5 errors in last 12 hour(s)
This error occurs when the tape drives are configured to use fixed-length
blocks. NetBackup expects variable-length blocks, so writes to the tape fail
with media write errors, which in turn DOWNs the drive.
If the block_size is set to a value other than 0, either change it with smit
or run the following, replacing rmt# with the tape device name:
# chdev -l rmt# -a block_size=0
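
To check the current value before changing it, the output of lsattr can be tested. This is a minimal sketch under stated assumptions: is_variable_block is a hypothetical helper name, and the line shape it parses (attribute name in the first field, value in the second) is an assumption about the lsattr output that should be confirmed on your AIX level.

```shell
#!/bin/sh
# Hypothetical helper: read the output of
#   lsattr -E -l rmtN -a block_size
# on stdin and succeed only if the reported value is 0 (variable length).
# Assumed line shape:
#   block_size 512 BLOCK size (0=variable length) True
is_variable_block() {
    awk '$1 == "block_size" { found = 1; ok = ($2 == "0") }
         END { if (found && ok) exit 0; exit 1 }'
}

# Example (rmt0 is a placeholder device name):
# lsattr -E -l rmt0 -a block_size | is_variable_block ||
#     echo "rmt0 is using fixed-length blocks"
```

The function fails when the attribute is missing from the input as well as when it is non-zero, so an unexpected output format is reported as a problem rather than silently passing.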

Monitoring software should be disabled while a backup is being written. It
has been found that this software can cause external events, such as
rewinds, that may corrupt the data on the tape.
NetBackup Technical Support and drive vendor technical support have found
that monitoring software should be disabled while data is being written to
tape. If it is not, the software may cause an external event, such as a
rewind, during the backup. If this occurs, the data on the tape will be
corrupted. If this is suspected, the NBU tape header and the contents of the
tape should be dumped with OS commands such as dd or tcopy. It may also be
advisable to run "/usr/openv/netbackup/bin/admincmd/bpmedialist -m
<mediaid> -mcontents -U" to mount and read through the tape and display the
backup IDs.
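
To find which tapes to inspect, the bptm log can be searched for the message reported in this thread ("FREEZING media id ..., External event caused rewind during write"). This is a minimal sketch: frozen_by_rewind is a hypothetical helper name, BPTM_LOG is a placeholder for your bptm log file, and the function assumes each log entry sits on a single line.

```shell
#!/bin/sh
# Hypothetical helper: print the unique media IDs that bptm froze because
# of an external rewind. Reads bptm log text on stdin and assumes each
# entry is on one line, in the form
#   ... write_backup: FREEZING media id AAA027, External event caused rewind during write, ...
frozen_by_rewind() {
    sed -n 's/.*FREEZING media id \([^,]*\), External event caused rewind during write.*/\1/p' |
        sort -u
}

# Example (BPTM_LOG is a placeholder path to a bptm log file):
# for id in $(frozen_by_rewind < "$BPTM_LOG"); do
#     /usr/openv/netbackup/bin/admincmd/bpmedialist -m "$id" -mcontents -U
# done
```

Each media ID is printed once even if the message was logged several times, so the bpmedialist pass mounts each tape only once.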
This is a list of known monitoring software that should be disabled during
backups:
SAN Management software application listing:
Sun:
SRS - Sun Resource System Monitor
SRC - Sun Resource Center
SMC - Sun Management Console
ESM - Enterprise SAN Manager
HP:
EMS - Unix hosts only
Top Tools - Windows hosts only
OpenView
Compaq:
Insight Manager - Windows and Linux (cmascsid is the agent) hosts only
Fujitsu:
SAN InSite
Computer Associates:
BrightStor SAN Manager
Qlogic:
SANsurfer
EMC:
ESN
Dot Hill:
SANpath
InControl - Vendor unknown
SANMAN - Vendor unknown
Control - BM

When attempting to duplicate images to Linear Tape-Open (LTO) Generation 2
tape drives, the job fails with a status 84 error. All backups to the LTO-2
drive complete successfully.
Details:
When attempting to duplicate images from LTO Generation 1 or LTO Generation 2
tape drives to LTO-2 drives, the job fails with a status 84 error and the
bptm log indicates a SCSI I/O error. No errors related to the SCSI card are
reported in the system or application event logs.
Hardware Configuration
The tape library had two LTO Gen1 drives and two LTO Gen2 drives. It was
connected through two Adaptec 3940 SCSI controllers and one Adaptec 39160
controller. The robot and one LTO Gen1 drive were connected to one 3940, and
the other LTO Gen1 drive was connected to the other 3940. Both LTO Gen2
drives were connected to a single port on the 39160 controller.
Solution
After various tests, and once another SCSI cable had been obtained, the
second LTO Gen2 drive was connected to the second SCSI port on the 39160
controller, which resolved the issue. The Adaptec 39160 is a dual-port SCSI
controller with two separate interfaces available for connecting external
devices.

Large backups fail with status 84 (Media write error) on HP-UX 11.11
systems. 
Exact Error Message 
Media write error 
Details:
Please check whether the following are true:
1. HP-UX media servers
2. IBM-manufactured LTO tape drives
3. The system is using an atdd driver
4. A6795A Fibre Channel Tachyon XL2 PCI host bus adapter (HBA)
This combination causes data phase errors when a large amount of data is
being streamed to tape drives. The issue seems to be with the way an atdd
driver interacts with an HBA driver. HP and IBM are investigating the issue.


Document ID: 251370
Emulex default settings are incorrect for the LP9802, LP982, LP952, LP9xxx
Family, LP850, LP8xxx Family and LP7000E when utilizing NetBackup.
Exact Error Message 
Status Code 84: Media write error 
Details:
When a backup is running and one of the other Microsoft Windows NT or
Microsoft Windows 2000 systems is rebooted, the backup fails with status
code 84. The cause can be a SCSI reset being issued to the drives, in which
case the default Emulex settings have not been changed.
When you install one of the specialized drivers (for example, version
5-2.13a4) for the LP9802, LP982, LP952, LP9xxx Family, LP850, LP8xxx Family
and LP7000E on Microsoft Windows NT or Microsoft Windows 2000, the default
settings are incorrect. The default settings are:
Disabled: "Disable Target Reset for Tape Devices"
Enabled: "Use PLOGI instead of PDISC after LIP"
(See Figure 1.)
The correct settings are:
Enable "Disable Target Reset for Tape Devices"
Disable "Use PLOGI instead of PDISC after LIP"
(See Figure 2.)

After changing these settings, select File and Apply. The server must then
be rebooted for the new settings to take effect.

Status code 84 is generated, along with an error in the bptm log
Exact Error Message
Status code 84: media write error; and in the bptm log: "WriteFile failed
with: Data error (cyclic redundancy check)"
Details:
Check the event viewer for the time frame in which the status code 84 was
seen. If the tape drive is showing a cleaning requested message, this error
can be resolved by cleaning the drive.

Regards,

Alessandro Silveira
Professional Services Manager Latin America
VERITAS Software Brazil

-----Original Message-----
From: Daniel Suzuki
To: veritas-bu AT mailman.eng.auburn DOT edu
Sent: 29/9/2004 23:36
Subject: [Veritas-bu] External event caused rewind during write - Media
Servers Windows

Dear all,

Can someone help me solve this problem? The site is very critical, because
this customer is migrating from EDM (EMC²/Legato) to Veritas NetBackup, and
tomorrow is the deadline for shutting down the EDM solution and its
technical support.

Related Problem:
The tape rewinds during a backup write on the WINDOWS media servers. The
problem occurs with any client server, regardless of operating system.
The ENABLE_SCSI_RESERVE file is active on all media servers.
It is not possible to disable the "speed write" option on the bridge or tape
drives, because the STK equipment does not use this option.
Is a SIZE_DATA_BUFFERS value greater than 64k supported on Windows? Our
SIZE_DATA_BUFFERS is currently set to 256k on all our media servers. Do you
think this could be generating the error? Also, the IBM driver allows
variable block length. Should we use the VERITAS drivers?

Buffer configuration for all media servers:
NET_BUFFER_SZ = 65536
SIZE_DATA_BUFFERS = 262144
NUMBER_DATA_BUFFERS = 8
MPX_RESTORE_DELAY = 360

SITE CONSISTS OF:
Drives: 10 x IBM LTO-2 - firmware 38D0
Robots: 2 x STK L80 (5 LTO-2 tape drives each)
SSO: Not being used
Interconnect: Fibre Channel and SCSI with Bridges (STK SN3300 - firmware
5.3.12)
HBA: Emulex LP9002L - firmware 3.92A2
Master Server NetBackup DC 4.5 FP7 - Solaris 8
Media Server NetBackup DC 4.5 FP7 - Windows 2000
Media Server NetBackup DC 4.5 FP7 - Windows 2003
There are also other UNIX media servers (AIX, HP-UX and Solaris).

LOG BPTM:
23:50:13.778 [792.3376] <2> write_data: write of 262144 bytes indicated only 262144 bytes were written, err = 1100
23:50:13.778 [792.3376] <2> write_backup: write_data() returned, exit_status = 0, CINDEX = 2, TWIN_INDEX = 0, backup_status = -3
23:50:13.778 [792.3376] <2> send_brm_msg: MEDIA NOT READY
23:50:15.731 [792.3376] <2> io_close: closing C:\Program Files\VERITAS\NetBackup\db\media\tpreq/AAA027, from .\bptm.c.15768
23:50:15.731 [792.3376] <2> write_backup: block position check: actual 1897457, expected 1897458
23:50:15.731 [792.3376] <2> getsockconnected: host=veri0007 service=bpjobd address=172.16.100.7 protocol=tcp non-reserved port=13723
23:50:15.731 [792.3376] <2> logconnections: BPJOBD CONNECT FROM 172.16.100.13.2824 TO 172.16.100.7.13723
23:50:15.731 [792.3376] <2> job_connect: Connected to the host veri0007 contype 10 jobid <5135> socket <632>
23:50:15.731 [792.3376] <2> job_connect: Connected on port 2824
23:50:15.731 [792.3376] <2> set_job_details: Done
23:50:15.731 [792.3376] <2> job_monitoring: Timeout cut off was tv_sec = 1096253422, tv_usec = 731000
23:50:15.965 [792.3376] <2> job_monitoring: ACK disconnect
23:50:15.965 [792.3376] <2> job_disconnect: Disconnected
23:50:15.965 [792.3376] <2> getsockconnected: host=veri0007 service=bpdbm address=172.16.100.7 protocol=tcp non-reserved port=13721
23:50:15.965 [792.3376] <2> logconnections: BPDBM CONNECT FROM 172.16.100.13.2825 TO 172.16.100.7.13721
23:50:16.403 [792.3376] <16> write_backup: FREEZING media id AAA027, External event caused rewind during write, all data on media is lost

Thanks

_______________________________________________
Veritas-bu maillist  -  Veritas-bu AT mailman.eng.auburn DOT edu
http://mailman.eng.auburn.edu/mailman/listinfo/veritas-bu

