Bacula-users

[Bacula-users] Write Error but tape status not set to "Error".

2009-07-06 13:51:06
Subject: [Bacula-users] Write Error but tape status not set to "Error".
From: Richard Frewin <Richard.Frewin AT astro.cf.ac DOT uk>
To: Bacula Users <bacula-users AT lists.sourceforge DOT net>
Date: Mon, 6 Jul 2009 18:18:01 +0100
Hi All,

   First, I want to say thanks to all the Bacula developers for doing such
a great job.  Second this email is going to get a bit long as I'm trying
to include all the relevant details ... sorry.  The subject is an accurate
executive summary if you need one :-)

   Background: I upgraded from 2.4.3 to 3.0.1 for director and SD last
week (on the same Scientific Linux 4.2 machine).  Clients are at everything
from 1.38.11 upwards (though the "easy-to-get-to" ones are now 3.0.0 or
3.0.1 :-).  Tape is an Overland Neo 2000 with a single HP LTO3 drive.

   Let me state now that, yes, I know I have a dodgy tape in my
autoloader.  Looking back at the logs I can now see that the last time
this tape (MHB107L3) was full it had an error (DIR and SD were 2.4.3 at
the time):

   28-Mar 23:10 beunoeth-sd JobId 38837: Committing spooled data to Volume 
"MHB107L3". Despooling 46,841,852,967 bytes ...
   28-Mar 23:32 beunoeth-sd JobId 38837: Error: block.c:568 Write error at 
359:15499 on device "LTO3" (/dev/nst0). ERR=Input/output error.
   28-Mar 23:33 beunoeth-sd JobId 38837: Error: Re-read of last block OK, but 
block numbers differ. Last block=471331 Current block=473270.
   28-Mar 23:33 beunoeth-sd JobId 38837: End of medium on Volume "MHB107L3" 
Bytes=359,976,895,488 Blocks=5,579,998 at 28-Mar-2009 23:33.
   28-Mar 23:33 beunoeth-sd JobId 38837: 3307 Issuing autochanger "unload slot 
8, drive 0" command.

Of course, I didn't notice this error as we have approximately 60 jobs
(and I'm far too lazy to carefully check all the emails) AND the tape was
not marked as Error status in the catalog AND the job didn't fail.

  This weekend that tape was recycled and reused and the following log was
produced for one of the jobs writing to it:

   04-Jul 06:00 beunoeth-dir JobId 43745: Start Backup JobId 43745, 
Job=SkaroASTROWORLD.2009-07-04_06.00.00_37
   04-Jul 06:02 beunoeth-dir JobId 43745: Using Device "LTO3"
   04-Jul 06:04 beunoeth-sd JobId 43745: Spooling data ...
   04-Jul 13:05 beunoeth-sd JobId 43745: User specified spool size reached.
   04-Jul 13:05 beunoeth-sd JobId 43745: Writing spooled data to Volume. 
Despooling 50,000,033,355 bytes ...
   04-Jul 15:35 beunoeth-sd JobId 43745: Despooling elapsed time = 01:14:39, 
Transfer rate = 11.16 M bytes/second
   04-Jul 15:38 beunoeth-sd JobId 43745: Spooling data again ...
   04-Jul 19:37 beunoeth-sd JobId 43745: User specified spool size reached.
   04-Jul 19:37 beunoeth-sd JobId 43745: Writing spooled data to Volume. 
Despooling 50,000,033,735 bytes ...
   04-Jul 21:03 beunoeth-sd JobId 43745: Despooling elapsed time = 01:25:21, 
Transfer rate = 9.763 M bytes/second
   04-Jul 21:05 beunoeth-sd JobId 43745: Spooling data again ...
   05-Jul 00:51 beunoeth-sd JobId 43745: User specified spool size reached.
   05-Jul 00:51 beunoeth-sd JobId 43745: Writing spooled data to Volume. 
Despooling 50,000,033,476 bytes ...
   05-Jul 04:13 beunoeth-sd JobId 43745: Error: block.c:568 Write error at 
397:11999 on device "LTO3" (/dev/nst0). ERR=Input/output error.
   05-Jul 04:14 beunoeth-sd JobId 43745: Fatal error: Re-read of last block: 
block numbers differ by more than one.
   Probable tape misconfiguration and data loss. Read block=288633 Want 
block=290548.
   05-Jul 04:14 beunoeth-sd JobId 43745: Fatal error: spool.c:298 Fatal append 
error on device "LTO3" (/dev/nst0): ERR=dev.c:474 Unable to open device "LTO3" 
(/dev/nst0): ERR=No medium found

   05-Jul 04:14 beunoeth-sd JobId 43745: Despooling elapsed time = 00:28:22, 
Transfer rate = 29.37 M bytes/second
   05-Jul 04:15 beunoeth-sd JobId 43745: Fatal error: fd_cmds.c:181 FD command 
not found: 
   05-Jul 04:15 beunoeth-sd JobId 43745: Job write elapsed time = 22:11:39, 
Transfer rate = 1.875 M bytes/second
   05-Jul 07:49 skaro-fd JobId 43745: Fatal error: backup.c:944 Network send 
error to SD. ERR=Broken pipe
   05-Jul 07:49 beunoeth-sd JobId 43745: Fatal error: fd_cmds.c:170 Command 
error with FD, hanging up. Append data error.

   05-Jul 07:49 beunoeth-dir JobId 43745: Error: Bacula beunoeth-dir 3.0.1 
(30Apr09): 05-Jul-2009 07:49:46
     Build OS:               i686-pc-linux-gnu redhat 4.2
     JobId:                  43745
     Job:                    SkaroASTROWORLD.2009-07-04_06.00.00_37
     Backup Level:           Full
     Client:                 "skaro-fd" 3.0.0 (06Apr09) 
i386-portbld-freebsd7.2,freebsd,7.2-STABLE
     FileSet:                "SkaroASTROWORLD" 2006-11-08 18:53:33
     Pool:                   "MonthlyPool" (From Run pool override)
     Catalog:                "MyCatalog" (From Client resource)
     Storage:                "Overland" (From Job resource)
     Scheduled time:         04-Jul-2009 06:00:00
     Start time:             04-Jul-2009 06:02:14
     End time:               05-Jul-2009 07:49:46
     Elapsed time:           1 day 1 hour 47 mins 32 secs
     Priority:               10
     FD Files Written:       156,357
     SD Files Written:       156,357
     FD Bytes Written:       149,802,789,209 (149.8 GB)
     SD Bytes Written:       149,844,100,532 (149.8 GB)
     Rate:                   1613.4 KB/s
     Software Compression:   26.1 %
     VSS:                    no
     Encryption:             no
     Accurate:               no
     Volume name(s):         MHB113L3|MHB107L3
     Volume Session Id:      208
     Volume Session Time:    1246395161
     Last Volume Bytes:      151,181,549,568 (151.1 GB)
     Non-fatal FD errors:    0
     SD Errors:              1
     FD termination status:  Error
     SD termination status:  Error
     Termination:            *** Backup Error ***

   05-Jul 07:49 beunoeth-dir JobId 43745: Rescheduled Job 
SkaroASTROWORLD.2009-07-04_06.00.00_37 at 05-Jul-2009 07:49 to re-run in 3000 
seconds (05-Jul-2009 08:39).

Due to old / slow server hardware and a rubbish network, we use (up to 6)
concurrent jobs spooling 50 Gb at a time to keep the tape heads moving as
much as possible - so jobs on tapes are interleaved.  Jobs vary in size
from a few to 500+ Gb and according to the catalog there are parts of 11
jobs written to this tape.  BTW: I'm using "Reschedule On Error" and
"Rerun Failed Levels".

   As you can see I've (potentially) lost a large part of this weekend's
backups (but hay, I have a bad tape :-( what do I expect).  I'm now
waiting until the resubmitted job finishes before I can check the tape for
read errors.

Now to the questions:

 * Back in March when the first error occurred ("Error: Re-read of last
block OK, but block numbers differ ... ") why did the tape not get changed
to status "Error" in the catalog and the job fail ? (surely there's
potential data loss here) [this may have been fixed since 2.4.3].

 * Why do errors that produce ominous sounding log messages like "Probable
tape misconfiguration and data loss" not mark the affected tape as error
in the catalog ?

 * Why did the error message not say which tape had the I/O error (it took
me a while to figure out which tape was actually loaded at "05-Jul 04:13").

 * Why did the job finally fail with a "Network send" error from the FD
after there was an I/O error from the SD three and a half hours earlier ?
(something is not behaving well on hardware errors).

Finally, I want to stress that I'm not moaning - I payed nothing for this
software - and I am really happy with the value for money (it's pulled me
out of a number of large holes already :-) ... and I appreciate how
complex the SD code must be and that this is a corner-case, etc. ...  
and if I was a programmer I'd be submitting patches :-)

   Cheers ...
--
 ----------------------------------------------------------------------
  Richard Frewin                | Email: Richard.Frewin AT astro.cf.ac DOT uk
  IT Systems Manager            |
  PHYSX Computing Support Team  |             Tel: +44 (0)29 208 75282
  Cardiff University            |             Fax: +44 (0)29 208 74056
  The Parade                    |
  Cardiff              CF24 3AA |      http://www.astro.cardiff.ac.uk/
 ______________________________________________________________________
               "When the straw breaks the camel's back,
               blood stains the sand.." - Roaring Jelly

------------------------------------------------------------------------------
_______________________________________________
Bacula-users mailing list
Bacula-users AT lists.sourceforge DOT net
https://lists.sourceforge.net/lists/listinfo/bacula-users

<Prev in Thread] Current Thread [Next in Thread>
  • [Bacula-users] Write Error but tape status not set to "Error"., Richard Frewin <=