Bacula-users

Re: [Bacula-users] Catastrophic error. Cannot write overflow block to device "LTO4"

2011-07-10 22:37:21
Subject: Re: [Bacula-users] Catastrophic error. Cannot write overflow block to device "LTO4"
From: Dan Langille <dan AT langille DOT org>
To: stevecs AT chaven DOT com
Date: Sun, 10 Jul 2011 22:34:55 -0400
Resending, with additional information.

On Jul 10, 2011, at 3:18 PM, Steve Costaras wrote:

 
-----Original Message-----
From: Dan Langille [mailto:dan AT langille DOT org]
Sent: Sunday, July 10, 2011 12:58 PM
To: stevecs AT chaven DOT com
Cc: bacula-users AT lists.sourceforge DOT net
Subject: Re: [Bacula-users] Catastrophic error. Cannot write overflow block to device "LTO4"

>>
>> 2) Since everything is spooled first, there should be NO error that cancels a job. A tape drive could fail, a tape could burst into flame; all that would be needed is for Bacula to know that there was an issue and give the admin a simple prompt: do you want to fix the issue or cancel? The admin fixes the problem, and then Bacula is told to restart from the last block that was stored successfully or, if need be, from the beginning of the spooled data file.

>This I do know: although at first glance this seems easy to do, it is not. If it were trivial to do, I assure you, it would already be in place.

>> Canceling jobs that run for days over TBs of data is just screwed up.

>I suggest running smaller jobs. I don't mean to sound trite, but that really is the solution. Given that the alternative is non-trivial, the sensible choice is, I'm afraid, to cancel the job.

I'm already kicking off 20+ jobs for a single system. This does not work when we're talking about going over the 100TB, nearly 200TB, mark. And when these errors happen, it does not matter how many jobs you have, as /all/ outstanding jobs fail when you have concurrency (in this case, all jobs that were queued and were not even writing to the same tape were canceled).

This sounds like a configuration issue.  Queued jobs should not be cancelled when a previous job cancels.  FYI, I've never seen this happen on my systems.  I think this is something you need to follow up on.

This does not happen with any other enterprise backup software, not that they should be 100% mimicked.
With the data sizes we have today, I don't see why there are not better error-handling checks/routines.

This is open source software.  Stuff gets written because someone wants it.  Clearly, nobody who wants it has written it. That is why it does not exist.

But sorry, that's not helping you find a solution.  James Harper has some good points. :)  I hope it leads somewhere.

-- 
Dan Langille - http://langille.org
