Re: [Bacula-users] btape fill failure on HP LTO6/4 drives

On 01/04/14 10:26, Roberts, Ben wrote:
>> It appears that the OS tape driver does not properly
>> implement back space record after an EOT.  This is a defect of the operating
>> system driver, but it is not fatal for Bacula.

>> You will very likely see this defect show up when Bacula fills a tape and
>> writes the final EOT mark then tries to verify that the last block was
>> written correctly.  Due to the OS driver defect this will fail, but it is
>> only a check and your data may still be good.  The results from btape
>> re-reading what was written do not look encouraging, and they may indicate
>> that the last block is not properly written.  I personally would be worried.

> Indeed I was seeing the same failure to backspace over EOT error in the job
> logs:
> End of Volume "GSA784L6" at 3910:8005 on device "drive-1-tapestore1" (/dev/rmt/1mbn). Write of 64512 bytes got 0.
> Error: Backspace record at EOT failed. ERR=I/O error
> End of medium on Volume "GSA784L6" Bytes=3,909,704,343,552 Blocks=60,604,295 at 28-Mar-2014 21:28

I am getting exactly the same symptoms, also under Solaris 11, except this time 
with an LTO2 drive. The drive worked perfectly under Solaris 10, though, and I 
only started seeing this after upgrading to 11.1.

The btape fill/m test gave me this at the end of the first tape:

Wrote block=3160000, file,blk=204,13499 VolBytes=203,857,855,488 rate=20.62 MB/s
08-May 16:59 btape JobId 0: End of Volume "TestVolume1" at 204:15112 on device "lto" (/dev/rmt/4cbn). Write of 64512 bytes got 0.
08-May 16:59 btape JobId 0: Error: Backspace record at EOT failed. ERR=I/O error
btape: btape.c:2702 Last block at: 204:15111 this_dev_block_num=15112
btape: btape.c:2737 End of tape 204:-1. Volume Bytes=203,961,913,344. Write rate = 20.60 MB/s
08-May 16:59 btape JobId 0: End of medium on Volume "TestVolume1" Bytes=203,961,913,344 Blocks=3,161,612 at 08-May-2014 16:59.

>> This can happen if you are not running the tape drive in the right mode;
>> /dev/rmt/0mbn is what is recommended in the manual.

I'm using /dev/rmt/4cbn, as under Solaris 10, which I believe is correct.

So I tried a few experiments. Using mt to test the first tape, I found this:

mt -f /dev/rmt/4 rew
mt -f /dev/rmt/4cbn fsf 204
mt -f /dev/rmt/4cbn fsr 15111
... gave me an I/O error

After some trial and error, I found I could fsf/fsr forward to 204/14360, but
if I tried to seek to 204/14361, I got an I/O error. The last 51 records appear
to be missing from the tape. Not good.
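
The trial and error above can be automated. What follows is only a rough sketch, not something from btape or the Bacula tree: it binary-searches for the highest record count that the drive can still skip to within one tape file. The names probe, last_good_record and the TAPE variable are made up for illustration, and it assumes that mt exits non-zero on an I/O error and that every count beyond the first failing one fails too. Each probe rewinds and skips forward again, so expect it to be slow on a nearly full tape.

```shell
# Tape device to test; the no-rewind node used in this message is only a default.
TAPE=${TAPE:-/dev/rmt/4cbn}

# probe FILE COUNT: succeed if the drive can rewind, skip FILE filemarks,
# then skip COUNT records without mt reporting an error.
probe() {
    mt -f "$TAPE" rewind >/dev/null 2>&1 &&
    mt -f "$TAPE" fsf "$1" >/dev/null 2>&1 &&
    mt -f "$TAPE" fsr "$2" >/dev/null 2>&1
}

# last_good_record FILE BAD: binary search between 0 (assumed reachable)
# and BAD (a record count already known to fail). Prints the largest
# count for which probe succeeds.
last_good_record() {
    file=$1
    lo=0
    hi=$2
    while [ $((hi - lo)) -gt 1 ]; do
        mid=$(( (lo + hi) / 2 ))
        if probe "$file" "$mid"; then
            lo=$mid
        else
            hi=$mid
        fi
    done
    echo "$lo"
}

# Run only when called with two arguments, e.g.: sh bisect.sh 204 15111
if [ "$#" -eq 2 ]; then
    last_good_record "$1" "$2"
fi
```

Saved under a name such as bisect.sh (also hypothetical), it would be run with the file number and a record count already known to fail; it needs about 14 probes for a file of 15,000 records.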

I have difficulty believing that the Solaris mtio and/or st modules fail to
handle EOT properly, or that Bacula does. However, this is a different OS,
different build, and different version of GCC (although it's the same Bacula
source). I guess it's worth looking everywhere for the culprit.

I have a DLT IV drive attached to the same system, so I tried btape on that,
and the fill/m test worked perfectly.

I also tried running my Solaris 10 Bacula build in a solaris10 branded zone,
but got the same results. (I wanted to eliminate my Solaris 11/GCC4 build of
Bacula as the cause - the Solaris 10 version was compiled with GCC3.)

All that's left now, that I can possibly think of, is that both the S10 and S11
versions of Bacula are 32-bit builds. I'll try building 64-bit binaries, and
install Sun Studio 12 if I have to. It may be a data structure
type/stride/alignment problem with ioctls - the block numbers on the LTO test
go much higher than on the DLT, and Sun's binary compatibility guarantee does
not include device interfaces.

OTOH, could this be a buffering problem? (Long shot, I know) :-) The SCSI 
channel the LTO2 is on is clocking at 20 MB/s, but the other channel, where the 
DLT is, is only at 5 MB/s. Could writes to the LTO be so far ahead that not 
enough tape is left between logical and physical EOT for the data already 
written by btape by the time the LEOT was reported?

>> If you can really do multi-volume restores correctly and you have verified
>> that every byte is correct then you are probably OK.
> I've successfully restored 3 jobs that were written in the same way. All
> three were bpipe backups of zfs streams so I'm fairly confident the restore
> is byte-perfect, or the zfs recv would have bailed out. I've updated the
> config with "Backward Space Record = no" to disable this check.

I'm not convinced that "Backward Space Record = no" is a good idea, given the
results I got with btape. I wouldn't trust the backup even if I managed to pull
back the data without any *reported* errors, and in any case I *like* having
the last block re-read. What I have done as a temporary workaround is to use
Maximum Volume Bytes. I know that with compression I get about 270 GB per tape,
so I have limited the volumes to 260 GB. That way I still get the last block
test and I have some time to figure out what the problem is (yesterday I was
considering a re-install of Solaris 10).
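
For anyone wanting to try the same workaround, it amounts to something like the fragment below in the Director's Pool resource. Treat it as an illustration rather than my actual config: the pool name is made up, and the "GB" suffix is assumed to be read as decimal gigabytes (as far as I know a plain "G" is taken as 2^30 bytes, so check the size modifiers in the manual for your version).

```
Pool {
  Name = LTO2-Pool                # hypothetical pool name
  Pool Type = Backup
  Maximum Volume Bytes = 260GB    # stop short of the ~270 GB seen with compression
}
```

As far as I'm aware, a changed pool limit only applies to volumes labelled afterwards; existing volumes would need to be updated from bconsole.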

I have to admit that I'm completely stumped over this one - I'm not accustomed 
to that and I *don't* like it :-)

Allan

