Bacula-users

Re: [Bacula-users] Tape full before it should.

2017-01-18 22:20:33
Subject: Re: [Bacula-users] Tape full before it should.
From: Gi Dot <gadidot AT gmail DOT com>
To: Alan Brown <ajb2 AT mssl.ucl.ac DOT uk>
Date: Thu, 19 Jan 2017 11:19:12 +0800
Kern, Alan,

Thanks for the advice. A bit over the top for me to digest, but I'll work on it.

On Mon, Jan 9, 2017 at 11:04 PM, Alan Brown <ajb2 AT mssl.ucl.ac DOT uk> wrote:
On 09/01/17 13:45, Kern Sibbald wrote:
Hello,

The status Bacula received was -1, which means that the tape drive reported a hardware end of tape (i.e. an end of tape marker was seen.  This can happen for the following reasons:

1. You reached the hardware end of tape marker at 150GB, but the marker was placed in the wrong place on the tape when it was manufactured.  I.e. the tape cassette is defective.

Kern, that's not a good interpretation of the problem.


LTO tapes don't have a "hardware end of tape marker" as you might expect with DAT or other older unidirectional tapes.


Because of the serpentine layout of the tape, the beginning of the tape is also the end of the tape and the servo track (factory written and unchangeable) contains "offset distance from end of the reel" information.

Serpentine means:

1: The tape winds to the end of the reel, heads move slightly (onto the next track) and then the tape winds back into the cartridge.
2: The heads move to the next track again.
3: This process is then repeated until the last track pair is completed.
4: Data is written to the tape in both directional passes.

When the end of the last track is reached, the tape has been wound back into the cartridge.

What this means is that the maximum seek time is approximately half of one track length (~900 metres) and that's around 35GB, even if you're seeking several hundred GB into the tape - ie: Whilst the seek command is a linear offset, actual seeking on a LTO is 2-dimensional - "N track and X offset". The tape's internal chip records the 2D location of files and data blocks, so that there's never any need to linearly seek along all tracks from the start of the tape.

LTO heads are constructed so that drives do read-after-write verification on the fly in both directions. A bacula verification pass is normally unncecessary because detected errors result in the data being rewritten to the tape immediately.

If there are errors, the drive will attempt to rewrite the data several times.(*) If all rewrites fail then it will flag an uncorrectable error - "The tape is bad and should be discarded"(**). Bacula interprets this as an end-of-tape error


(*) This means that errors on a tape result in 2 effects if there are a lot of errors 

   1: There's a massive slowdown in reported despooling speed for jobs and tape "full" capacity is reduced somewhat from the theoretical values (somewhere between 90%-250% of _uncompressed_ capacity would be a normal tape)

   2: When reading the tape's RFID chip, it will say that they tape is somewhere between "97"-"99"% full, but the total amount of data it says has been written since last labelled is significantly less than the _uncompressed_ value of the tape.

(**) The same effect will occur if the heads are dirty or damaged - and it DOES happen(***). Once a contaminated tape finds its way into a drive and fouls the heads you can pretty much guarantee that all subsequent tapes will have reported problems, but until the heads are cleaned or repaired you won't know if the tapes are wrecked or OK.

(***) We had a bad batch of HP LTO5s contaminate multiple drives before we realised what was happening. We're still cleaning up the mess 3 years later.


Drive error codes actually indicate "drive problem", "tape problem" or "unable to work out which is the problem", but the effect is the same as far as bacula's concerned. There are a slew of other error codes.



LTO tapes wear out rapidly with repeated use. The lifespan of a LTO tape is claimed to be "up to" 162 complete writes but in reality it's more like 10-20% of this number before degradation is significant. We're seeing tapes with 20-30 write cycles down to 60% of original capacity and thanks to rewrites the despool speeds are _very slow_.


Apart from interrogating the tape drive and tape cartridge chip (Kern and I have been discussing how to handle this on the fly), Despooling speed is a critical indicator of tape health. If it suddenly drops off, this is cause for alarm.




2. You are using some tape driver (e.g. the ibm tape driver) rather than the Linux st tape driver.  The ibm tape driver does not work correctly with Bacula.


Having encountered this problem, the described issue is not consistent with the IBM driver error (which comes form "ERROR 0: Success" messages).

In the case of a IBM driver, the tape can be labelled and written quite happily. Problems occur when attempts are made to seek to EOD on a tape with _existing_ data - the error 0 message fools bacula into thinking the operatiopn has failed.


My opinion:

The error reported and the fact that it took 31 minutes to write 150Gb before erroring out points to fouled heads.
Load a cleaning tape(****) and try writing a new tape.
If that writes ok, then discard the errored tape (and possibly the one before that). If not then the drive will need return-to-base repairs and the test tape/last tape and one before that should be discarded.

(****) NEVER share a cleaning tape between drives. Yes I know that's what libraries do with dedicated cleaning tape slots, but it's a really fast way of cross-contaminating hardware. Don't do it.


If you don't have a LTO tape cartridge reader (www.mptapes.com), then the next best thing is to ensure you have the latest version of sg3_tools installed, and use sg_read_attr to interrogate the chip. 


You should also install the IBM or HP drive management tools (even if this means installing windows) and interrogate drive health.


tapeinfo and loaderinfo utilities are useful but incomplete for this kind of diagnosis.


I've been working through the various sg attribute pages trying to see which ones are useful. Drives actually log a _large_ amount of data internally about the last few hundred tapes used, but unless you ask the right questions you won't get any answers out of them (HP and IBM drive tools ask those questions, of course - and know how to interpret the answers)






Best regards,
Kern

On 01/09/2017 04:29 AM, Gi Dot wrote:
Hi all,


At the data centre we are using IBM-LTO tape - 3.0TB compressed, 1.5T uncompressed. Last 2 nights a backup was running and it stopped at about 150GB size and bacula marked the tape as full.

Since the total amount of backed up data sometimes could be huge, I have purged the volume straight away before the tape was inserted. There is a total of 10 jobs, and the first job holds the biggest data, somewhere around 500GB to 2TB at a time. Backup failed at the first job, at 150GB size.

| 3,053 | db01Job          | 2017-01-08 01:00:03 | B    | F     |   43,942 |   150,874,925,633 | f  

Excerpt from the logs:
07-Jan 05:00 phisbackupdns1-dir JobId 3052: shell command: run AfterJob "/usr/lib64/bacula/delete_catalog_backup"
08-Jan 01:00 phisbackupdns1-dir JobId 3053: Start Backup JobId 3053, Job=phisdb01Job.2017-01-08_01.00.00_52
08-Jan 01:00 phisbackupdns1-dir JobId 3053: Using Device "Drive0"
08-Jan 01:00 phisbackupdns1-sd JobId 3053: Volume "A00053L5" previously written, moving to end of data.
08-Jan 01:01 phisbackupdns1-sd JobId 3053: Warning: For Volume "A00053L5":
The number of files mismatch! Volume=1955 Catalog=0
Correcting Catalog
08-Jan 01:31 phisbackupdns1-sd JobId 3053: End of Volume "A00053L5" at 2106:1 on device "Drive0" (/dev/nst1). Write of 64512
bytes got -1.
08-Jan 01:31 phisbackupdns1-sd JobId 3053: Re-read of last block succeeded.
08-Jan 01:31 phisbackupdns1-sd JobId 3053: End of medium on Volume "A00053L5" Bytes=150,990,400,512 Blocks=2,340,501 at 08-Ja
n-2017 01:31.
08-Jan 01:31 phisbackupdns1-sd JobId 3053: 3307 Issuing autochanger "unload slot 2, drive 0" command.
08-Jan 01:33 phisbackupdns1-sd JobId 3053: No slot defined in catalog (slot=0) for Volume "A00032L5" on "Drive0" (/dev/nst1).
08-Jan 01:33 phisbackupdns1-sd JobId 3053: Cartridge change or "update slots" may be required.
08-Jan 01:33 phisbackupdns1-sd JobId 3053: Warning: mount.c:217 Open device "Drive0" (/dev/nst1) Volume "A00032L5" failed: ER
R=dev.c:513 Unable to open device "Drive0" (/dev/nst1): ERR=No medium found


Hardware compression is enabled:
# tapeinfo -f /dev/nst1
Product Type: Tape Drive
Vendor ID: 'IBM     '
Product ID: 'ULT3580-TD5     '
Revision: 'G360'
Attached Changer API: No
SerialNumber: '10WT008032'
MinBlock: 1
MaxBlock: 8388608
SCSI ID: 1
SCSI LUN: 0
Ready: yes
BufferedMode: yes
Medium Type: 0x58
Density Code: 0x58
BlockSize: 0
DataCompEnabled: yes
DataCompCapable: yes
DataDeCompEnabled: yes
CompType: 0x1
DeCompType: 0x1
BOP: yes
Block Position: 0
Partition 0 Remaining Kbytes: -1
Partition 0 Size in Kbytes: -1
ActivePartition: 0
EarlyWarningSize: 0
NumPartitions: 0
MaxPartitions: 1

Pool configuration for the volume:
Pool {
  Name = ADHOC
  Label Format = "ADHOC_Vol"
  Pool Type = Backup
  Recycle = yes                       
  AutoPrune = yes                     
  Storage = ibmts3310
  Volume Retention = 12h
  Recycle Current Volume = Yes
}

Side note: I just realized that I missed the "Volume Use Duration = 10h" directive in the pool. Reason being is the same tape would be in the drive for 2 nights (Saturday and Sunday), since there is no operator around to change a tape. The tape supposed to be recycled on Sunday night.


Appreciate if anyone can enlighten me as to why the tape is full way earlier compared to the size that it is able to contain.

Thanks.





------------------------------------------------------------------------------
Check out the vibrant tech community on one of the world's most 
engaging tech sites, SlashDot.org! http://sdm.link/slashdot


_______________________________________________
Bacula-users mailing list
Bacula-users@lists.sourceforge.net
https://lists.sourceforge.net/lists/listinfo/bacula-users




------------------------------------------------------------------------------
Check out the vibrant tech community on one of the world's most 
engaging tech sites, SlashDot.org! http://sdm.link/slashdot


_______________________________________________
Bacula-users mailing list
Bacula-users@lists.sourceforge.net
https://lists.sourceforge.net/lists/listinfo/bacula-users



------------------------------------------------------------------------------
Check out the vibrant tech community on one of the world's most
engaging tech sites, SlashDot.org! http://sdm.link/slashdot
_______________________________________________
Bacula-users mailing list
Bacula-users@lists.sourceforge.net
https://lists.sourceforge.net/lists/listinfo/bacula-users


------------------------------------------------------------------------------
Check out the vibrant tech community on one of the world's most
engaging tech sites, SlashDot.org! http://sdm.link/slashdot
_______________________________________________
Bacula-users mailing list
Bacula-users AT lists.sourceforge DOT net
https://lists.sourceforge.net/lists/listinfo/bacula-users