Re: [Bacula-users] Tape full before it should.
2017-01-09 10:05:38
On 09/01/17 13:45, Kern Sibbald wrote:
Hello,
The status Bacula received was -1, which means that the tape
drive reported a hardware end of tape (i.e. an end of tape
marker was seen. This can happen for the following reasons:
1. You reached the hardware end of tape marker at 150GB, but the
marker was placed in the wrong place on the tape when it was
manufactured. I.e. the tape cassette is defective.
Kern, that's not a good interpretation of the problem.
LTO tapes don't have a "hardware end of tape marker" as you might
expect with DAT or other older unidirectional tapes.
Because of the serpentine layout of the tape, the beginning of the
tape is also the end of the tape and the servo track (factory
written and unchangeable) contains "offset distance from end of the
reel" information.
Serpentine means:
1: The tape winds to the end of the reel, heads move slightly (onto
the next track) and then the tape winds back into the cartridge.
2: The heads move to the next track again.
3: This process is then repeated until the last track pair is
completed.
4: Data is written to the tape in both directional passes.
When the end of the last track is reached, the tape has been wound
back into the cartridge.
What this means is that the maximum seek time is approximately half
of one track length (~900 metres) and that's around 35GB, even if
you're seeking several hundred GB into the tape - ie: Whilst the
seek command is a linear offset, actual seeking on a LTO is
2-dimensional - "N track and X offset". The tape's internal chip
records the 2D location of files and data blocks, so that there's
never any need to linearly seek along all tracks from the start of
the tape.
LTO heads are constructed so that drives do read-after-write
verification on the fly in both directions. A bacula verification
pass is normally unncecessary because detected errors result in the
data being rewritten to the tape immediately.
If there are errors, the drive will attempt to rewrite the data
several times.(*) If all rewrites fail then it will flag an
uncorrectable error - "The tape is bad and should be discarded"(**).
Bacula interprets this as an end-of-tape error
(*) This means that errors on a tape result in 2 effects if there
are a lot of errors
1: There's a massive slowdown in reported despooling speed for
jobs and tape "full" capacity is reduced somewhat from the
theoretical values (somewhere between 90%-250% of _uncompressed_
capacity would be a normal tape)
2: When reading the tape's RFID chip, it will say that they tape
is somewhere between "97"-"99"% full, but the total amount of data
it says has been written since last labelled is significantly less
than the _uncompressed_ value of the tape.
(**) The same effect will occur if the heads are dirty or damaged -
and it DOES happen(***). Once a contaminated tape finds its way into
a drive and fouls the heads you can pretty much guarantee that all
subsequent tapes will have reported problems, but until the heads
are cleaned or repaired you won't know if the tapes are wrecked or
OK.
(***) We had a bad batch of HP LTO5s contaminate multiple drives
before we realised what was happening. We're still cleaning up the
mess 3 years later.
Drive error codes actually indicate "drive problem", "tape problem"
or "unable to work out which is the problem", but the effect is the
same as far as bacula's concerned. There are a slew of other error
codes.
LTO tapes wear out rapidly with repeated use. The lifespan of a LTO
tape is claimed to be "up to" 162 complete writes but in reality
it's more like 10-20% of this number before degradation is
significant. We're seeing tapes with 20-30 write cycles down to 60%
of original capacity and thanks to rewrites the despool speeds are
_very slow_.
Apart from interrogating the tape drive and tape cartridge chip
(Kern and I have been discussing how to handle this on the fly),
Despooling speed is a critical indicator of tape health. If it
suddenly drops off, this is cause for alarm.
2. You are using some tape driver (e.g. the ibm tape driver)
rather than the Linux st tape driver. The ibm tape driver does
not work correctly with Bacula.
Having encountered this problem, the described issue is not
consistent with the IBM driver error (which comes form "ERROR 0:
Success" messages).
In the case of a IBM driver, the tape can be labelled and written
quite happily. Problems occur when attempts are made to seek to EOD
on a tape with _existing_ data - the error 0 message fools bacula
into thinking the operatiopn has failed.
My opinion:
The error reported and the fact that it took 31 minutes to write
150Gb before erroring out points to fouled heads.
Load a cleaning tape(****) and try writing a new tape.
If that writes ok, then discard the errored tape (and possibly the
one before that). If not then the drive will need return-to-base
repairs and the test tape/last tape and one before that should be
discarded.
(****) NEVER share a cleaning tape between drives. Yes I know that's
what libraries do with dedicated cleaning tape slots, but it's a
really fast way of cross-contaminating hardware. Don't do it.
If you don't have a LTO tape cartridge reader (www.mptapes.com),
then the next best thing is to ensure you have the latest version of
sg3_tools installed, and use sg_read_attr to interrogate the chip.
You should also install the IBM or HP drive management tools (even
if this means installing windows) and interrogate drive health.
tapeinfo and loaderinfo utilities are useful but incomplete for this
kind of diagnosis.
I've been working through the various sg attribute pages trying to
see which ones are useful. Drives actually log a _large_ amount of
data internally about the last few hundred tapes used, but unless
you ask the right questions you won't get any answers out of them
(HP and IBM drive tools ask those questions, of course - and know
how to interpret the answers)
Best regards,
Kern
On 01/09/2017 04:29 AM, Gi Dot wrote:
Hi all,
At the data centre we are using IBM-LTO tape - 3.0TB
compressed, 1.5T uncompressed. Last 2 nights a backup was
running and it stopped at about 150GB size and bacula marked
the tape as full.
Since the total amount of backed up data sometimes could
be huge, I have purged the volume straight away before the
tape was inserted. There is a total of 10 jobs, and the
first job holds the biggest data, somewhere around 500GB to
2TB at a time. Backup failed at the first job, at 150GB
size.
| 3,053 | db01Job | 2017-01-08 01:00:03 | B | F | 43,942 | 150,874,925,633 | f
Excerpt from the logs:
07-Jan 05:00 phisbackupdns1-dir JobId 3052: shell command: run AfterJob "/usr/lib64/bacula/delete_catalog_backup"
08-Jan 01:00 phisbackupdns1-dir JobId 3053: Start Backup JobId 3053, Job=phisdb01Job.2017-01-08_01.00.00_52
08-Jan 01:00 phisbackupdns1-dir JobId 3053: Using Device "Drive0"
08-Jan 01:00 phisbackupdns1-sd JobId 3053: Volume "A00053L5" previously written, moving to end of data.
08-Jan 01:01 phisbackupdns1-sd JobId 3053: Warning: For Volume "A00053L5":
The number of files mismatch! Volume=1955 Catalog=0
Correcting Catalog
08-Jan 01:31 phisbackupdns1-sd JobId 3053: End of Volume "A00053L5" at 2106:1 on device "Drive0" (/dev/nst1). Write of 64512
bytes got -1.
08-Jan 01:31 phisbackupdns1-sd JobId 3053: Re-read of last block succeeded.
08-Jan 01:31 phisbackupdns1-sd JobId 3053: End of medium on Volume "A00053L5" Bytes=150,990,400,512 Blocks=2,340,501 at 08-Ja
n-2017 01:31.
08-Jan 01:31 phisbackupdns1-sd JobId 3053: 3307 Issuing autochanger "unload slot 2, drive 0" command.
08-Jan 01:33 phisbackupdns1-sd JobId 3053: No slot defined in catalog (slot=0) for Volume "A00032L5" on "Drive0" (/dev/nst1).
08-Jan 01:33 phisbackupdns1-sd JobId 3053: Cartridge change or "update slots" may be required.
08-Jan 01:33 phisbackupdns1-sd JobId 3053: Warning: mount.c:217 Open device "Drive0" (/dev/nst1) Volume "A00032L5" failed: ER
R=dev.c:513 Unable to open device "Drive0" (/dev/nst1): ERR=No medium found
Hardware compression is enabled:
# tapeinfo -f /dev/nst1
Product Type: Tape Drive
Vendor ID: 'IBM '
Product ID: 'ULT3580-TD5 '
Revision: 'G360'
Attached Changer API: No
SerialNumber: '10WT008032'
MinBlock: 1
MaxBlock: 8388608
SCSI ID: 1
SCSI LUN: 0
Ready: yes
BufferedMode: yes
Medium Type: 0x58
Density Code: 0x58
BlockSize: 0
DataCompEnabled: yes
DataCompCapable: yes
DataDeCompEnabled: yes
CompType: 0x1
DeCompType: 0x1
BOP: yes
Block Position: 0
Partition 0 Remaining Kbytes: -1
Partition 0 Size in Kbytes: -1
ActivePartition: 0
EarlyWarningSize: 0
NumPartitions: 0
MaxPartitions: 1
Pool configuration for the volume:
Pool {
Name = ADHOC
Label Format = "ADHOC_Vol"
Pool Type = Backup
Recycle = yes
AutoPrune = yes
Storage = ibmts3310
Volume Retention = 12h
Recycle Current Volume = Yes
}
Side note: I just realized that I missed the "Volume Use Duration = 10h" directive in the pool. Reason being is the same tape would be in the drive for 2 nights (Saturday and Sunday), since there is no operator around to change a tape. The tape supposed to be recycled on Sunday night.
Appreciate if anyone can enlighten me as to why the tape is full way earlier compared to the size that it is able to contain.
Thanks.
------------------------------------------------------------------------------
Check out the vibrant tech community on one of the world's most
engaging tech sites, SlashDot.org! http://sdm.link/slashdot
_______________________________________________
Bacula-users mailing list
Bacula-users AT lists.sourceforge DOT net
https://lists.sourceforge.net/lists/listinfo/bacula-users
------------------------------------------------------------------------------
Check out the vibrant tech community on one of the world's most
engaging tech sites, SlashDot.org! http://sdm.link/slashdot
_______________________________________________
Bacula-users mailing list
Bacula-users AT lists.sourceforge DOT net
https://lists.sourceforge.net/lists/listinfo/bacula-users
|
------------------------------------------------------------------------------
Check out the vibrant tech community on one of the world's most
engaging tech sites, SlashDot.org! http://sdm.link/slashdot _______________________________________________
Bacula-users mailing list
Bacula-users AT lists.sourceforge DOT net
https://lists.sourceforge.net/lists/listinfo/bacula-users
|
|
|