Hello,

we are running Bacula 3.0.1 on several Debian "Lenny" 5.0 servers. The
Debian packages are self-compiled, backporting the sources from Debian
unstable to stable.

On one site with a tape drive, since the upgrade we have been having
problems when the customer forgets to change the tape: the jobs are no
longer cancelled.

I configured Bacula so that the tape is ejected when the last backup
(the catalog job) ends. This is the device in bacula-sd.conf:
Device {
  Name = LTO2
  Media Type = LTO2
  Archive Device = /dev/nst0
  AutomaticMount = yes;               # when device opened, read it
  AlwaysOpen = yes;
  RemovableMedia = yes;
  RandomAccess = no;
  Offline On Unmount = yes;           # eject tape on unmount
  Maximum Spool Size = 9900000000;
  Spool Directory = /var/spool/bacula;
}
I used "Offline On Unmount = yes;", so when the catalog job ends and
unmounts the tape, the tape is ejected. So if the customer forgets to
change the tape, this usually means that no tape is inserted.

In bacula-dir.conf we have configured the jobs this way:
JobDefs {
  Name = "ServerJob"
  Type = Backup
  Level = Full
  Schedule = "WeeklyCycle"
  Storage = LTO-HENRY
  Messages = Standard
  Pool = Daily
  Priority = 10
  # After 10 1/2 hours from scheduled time cancel job if volume not mounted
  Max Wait Time = 37800
  # After 10 1/2 hours from scheduled time cancel job if not started
  Max Start Delay = 37800
  SpoolData = yes
}
# Define jobs
Job {
  Name = "HenryBackup"
  JobDefs = "ServerJob"
  Client = henry-fd
  FileSet = "Full Set Linux"
  Priority = 10
  Write Bootstrap = "/var/lib/bacula/henry.bsr"
  # mount tape before first backup
  Run Before Job = "/usr/local/bin/bacula_mount_tape"
}
Job {
  Name = "FalstaffBackup"
  JobDefs = "ServerJob"
  Client = falstaff-fd
  FileSet = "Full Set Linux"
  Priority = 30
  Write Bootstrap = "/var/lib/bacula/falstaff.bsr"
}
... (other servers jobs)
# Backup the catalog database (after the nightly save)
Job {
  Name = "HenryCatalog"
  JobDefs = "ServerJob"
  Client = henry-fd
  Level = Full
  FileSet = "Catalog"
  Schedule = "WeeklyCycleAfterBackup"
  # This creates an ASCII copy of the catalog
  RunBeforeJob = "/usr/local/bin/backup_bacula_catalog"
  # This unmounts the tape after the last backup of the day or after
  # failure/cancel
  RunScript {
    Command = "/usr/local/bin/bacula_unmount_tape"
    RunsWhen = After
    RunsOnFailure = yes
    RunsOnClient = no
    RunsOnSuccess = yes
  }
  Write Bootstrap = "/var/lib/bacula/BackupCatalog.bsr"
  Priority = 100                      # run after main backup
  # After (10 1/2 hours - 20 minutes) from scheduled time cancel job if
  # volume not mounted
  # this must be 20 minutes less than Max Wait Time for backup jobs
  Max Wait Time = 36600
  # After (10 1/2 hours - 20 minutes) from scheduled time cancel job if
  # not started
  # this must be 20 minutes less than Max Start Delay for backup jobs
  Max Start Delay = 36600
}
/usr/local/bin/bacula_mount_tape contains the following command:
#!/bin/sh
/usr/bin/bconsole < /usr/local/share/bacula/mount.txt
/usr/local/share/bacula/mount.txt contains:
mount
quit
/usr/local/bin/bacula_unmount_tape contains the following command:
#!/bin/sh
/usr/bin/bconsole < /usr/local/share/bacula/unmount.txt
/usr/local/share/bacula/unmount.txt contains:
unmount
quit
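
The two wrapper scripts and their command files could also be folded
into a single script that feeds bconsole on stdin. This is only a
sketch under assumptions: the function names bconsole_commands and
tape_action are hypothetical, and naming the storage explicitly with
"storage=" is an assumption added here, so that bconsole would not have
to prompt if more than one Storage resource were defined.

```sh
#!/bin/sh
# Sketch only: the function names are hypothetical, not part of Bacula.

# Print the bconsole commands for one tape action (mount or unmount)
# on the named storage, followed by quit.
bconsole_commands() {
    action=$1
    storage=$2
    case $action in
        mount|unmount) ;;
        *) echo "unsupported action: $action" >&2; return 1 ;;
    esac
    printf '%s storage=%s\nquit\n' "$action" "$storage"
}

# Feed the generated commands to bconsole on stdin, as the original
# scripts do with mount.txt and unmount.txt.
tape_action() {
    cmds=$(bconsole_commands "$1" "$2") || return 1
    printf '%s\n' "$cmds" | /usr/bin/bconsole
}
```

A wrapper would then end with a call such as "tape_action mount
LTO-HENRY", using the Storage name from the JobDefs.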

Before the upgrade (Bacula version 2.4.4), "Max Wait Time = 37800" was
enough to get the job cancelled when no tape was inserted.
Note: this did not work until we put "Max Start Delay = 36600" in the
catalog job, which is 20 minutes less than for the first job.
The first job is scheduled at 22:00, the catalog job at 22:10.

I think that if both jobs have "Max Wait Time = 37800", this is what
happens:
- 22:00 first backup starts, no tape is inserted, the job starts waiting
- 22:10 catalog job is queued
- 8:30 first job is cancelled -> catalog job starts waiting on the
  tape mount; it would be cancelled 10 1/2 hours after this time

If "Max Start Delay = 36600" is used for the catalog job:
- 22:00 first backup starts, no tape is inserted, the job starts waiting
- 22:10 catalog job is queued
- 8:20 catalog job has not started, it is cancelled
- 8:30 first job is cancelled

Could this be correct? As I said, it did work with 2.4.4.
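
The clock times in the two scenarios above can be checked with a short
shell sketch. It assumes, as the comments in the configuration do, that
each timer counts from the job's scheduled time; the function name
cancel_time is hypothetical and not part of Bacula.

```sh
#!/bin/sh
# cancel_time HH:MM SECONDS
# Print the wall-clock time (HH:MM, wrapping at midnight) reached
# SECONDS after HH:MM. Hypothetical helper, not part of Bacula.
cancel_time() {
    start=$1
    secs=$2
    h=${start%%:*}
    m=${start##*:}
    # Strip one leading zero so that "08" is not read as octal.
    h=${h#0}; h=${h:-0}
    m=${m#0}; m=${m:-0}
    total=$(( (h * 60 + m + secs / 60) % 1440 ))
    printf '%02d:%02d\n' $(( total / 60 )) $(( total % 60 ))
}

cancel_time 22:00 37800   # first job, Max Wait Time       -> 08:30
cancel_time 22:10 36600   # catalog job, Max Start Delay   -> 08:20
cancel_time 08:30 37800   # catalog job waiting from 08:30 -> 19:00
```

The last line is the first scenario: a catalog job that only starts
waiting at 8:30 would not be cancelled until 19:00.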

After the upgrade (Bacula version 3.0.1) the jobs are no longer
cancelled, even with this configuration.
A "status director" shows the following:
*s dir
henry-dir Version: 3.0.1 (30 April 2009) x86_64-pc-linux-gnu debian 5.0.1
Daemon started 30-Jun-09 07:17, 18 Jobs run since started.
 Heap: heap=5,373,952 smbytes=226,039 max_bytes=226,616 bufs=743 max_bufs=748

Scheduled Jobs:
Level  Type    Pri  Scheduled        Name              Volume
===================================================================
Full   Backup   10  06-Jul-09 22:00  HenryBackup       *unknown*
Full   Backup   20  06-Jul-09 22:00  KatherineBackup   *unknown*
Full   Backup   30  06-Jul-09 22:00  FalstaffBackup    *unknown*
Full   Backup   40  06-Jul-09 22:00  Server2003Backup  *unknown*
Full   Backup   50  06-Jul-09 22:00  MortimerBackup    *unknown*
Full   Backup  100  06-Jul-09 22:10  HenryCatalog      *unknown*
====

Running Jobs:
Console connected at 04-Jul-09 15:30
Console connected at 05-Jul-09 15:30
Console connected at 06-Jul-09 15:14
Console connected at 06-Jul-09 15:24
JobId  Level  Name                                     Status
=============================================================================
 2363  Full   HenryBackup.2009-07-03_22.00.00_35       is waiting for a mount request
 2364  Full   KatherineBackup.2009-07-03_22.00.00_36   is waiting execution
 2365  Full   FalstaffBackup.2009-07-03_22.00.00_37    is waiting execution
 2366  Full   Server2003Backup.2009-07-03_22.00.00_38  is waiting execution
 2367  Full   MortimerBackup.2009-07-03_22.00.00_39    is waiting execution
 2368  Full   HenryCatalog.2009-07-03_22.10.00_41      is waiting execution
====

Terminated Jobs:
JobId  Level    Files    Bytes  Status  Finished         Name
==========================================================================
 2353  Full   399,299  43.32 G  OK      02-Jul-09 05:03  FalstaffBackup
 2354  Full    36,717  9.304 G  OK      02-Jul-09 05:48  Server2003Backup
 2355  Full    31,812  662.1 M  OK      02-Jul-09 05:58  MortimerBackup
 2356  Full         1  286.9 M  OK      02-Jul-09 06:01  HenryCatalog
 2357  Full   277,949  80.08 G  OK      03-Jul-09 01:35  HenryBackup
 2358  Full    90,621  9.272 G  OK      03-Jul-09 02:32  KatherineBackup
 2359  Full   398,727  43.73 G  OK      03-Jul-09 05:06  FalstaffBackup
 2360  Full    36,722  9.184 G  OK      03-Jul-09 05:51  Server2003Backup
 2361  Full    31,812  663.7 M  OK      03-Jul-09 06:04  MortimerBackup
 2362  Full         1  287.1 M  OK      03-Jul-09 06:07  HenryCatalog

A "status storage" hangs. In this condition, if we try to stop the
storage daemon with the init script, it does not stop: it is necessary
to kill -9 the process and then restart it.

After killing the storage daemon and restarting it we got these messages
(we are using TLS to encrypt all communications):
06-Jul 15:35 henry-dir JobId 0: Fatal error: authenticate.c:120 Director unable to authenticate with Storage daemon at "henry.cooporso.it:9103". Possible causes:
Passwords or names not the same or
Maximum Concurrent Jobs exceeded on the SD or
SD networking messed up (restart daemon).
Please see http://www.bacula.org/en/rel-manual/Bacula_Freque_Asked_Questi.html#SECTION003760000000000000000 for help.
06-Jul 15:35 henry-dir JobId 0: Failed to connect to Storage daemon.
04-Jul 13:00 henry-sd JobId 2363: Please mount Volume "Venerdi1" or label a new one for:
    Job:          HenryBackup.2009-07-03_22.00.00_35
    Storage:      "LTO2" (/dev/nst0)
    Pool:         PoolVenerdi1
    Media type:   LTO2
06-Jul 15:35 henry-dir JobId 0: Error: bsock.c:346 Wrote 19 bytes to client:192.168.10.5:36131, but only 0 accepted.
05-Jul 05:00 henry-sd JobId 2363: Please mount Volume "Venerdi1" or label a new one for:
    Job:          HenryBackup.2009-07-03_22.00.00_35
    Storage:      "LTO2" (/dev/nst0)
    Pool:         PoolVenerdi1
    Media type:   LTO2
06-Jul 15:35 henry-dir JobId 0: Error: bsock.c:346 Wrote 19 bytes to client:192.168.10.5:36131, but only 0 accepted.
06-Jul 05:00 henry-sd JobId 2363: Please mount Volume "Venerdi1" or label a new one for:
    Job:          HenryBackup.2009-07-03_22.00.00_35
    Storage:      "LTO2" (/dev/nst0)
    Pool:         PoolVenerdi1
    Media type:   LTO2
06-Jul 15:34 henry-sd JobId 2363: Fatal error: Job 2363 canceled.
06-Jul 15:34 henry-sd JobId 2363: Fatal error: fd_cmds.c:170 Command error with FD, hanging up. Append data error.
06-Jul 15:35 henry-dir JobId 0: Fatal error: bsock.c:135 Unable to connect to Storage daemon on henry.cooporso.it:9103. ERR=Connection refused
06-Jul 15:35 henry-dir JobId 0: Error: bsock.c:346 Wrote 19 bytes to client:192.168.10.5:36131, but only 0 accepted.
06-Jul 15:35 henry-dir JobId 0: Error: openssl.c:86 TLS shutdown failure.: ERR=error:1409F07F:SSL routines:SSL3_WRITE_PENDING:bad write retry
06-Jul 15:35 henry-dir JobId 0: Fatal error: bsock.c:135 Unable to connect to Storage daemon on henry.cooporso.it:9103. ERR=Connection refused
06-Jul 15:35 henry-dir JobId 0: Error: bsock.c:346 Wrote 19 bytes to client:192.168.10.5:36131, but only 0 accepted.
06-Jul 15:35 henry-dir JobId 0: Error: openssl.c:86 TLS shutdown failure.: ERR=error:1409F07F:SSL routines:SSL3_WRITE_PENDING:bad write retry
06-Jul 15:35 henry-dir JobId 0: Fatal error: bsock.c:135 Unable to connect to Storage daemon on henry.cooporso.it:9103. ERR=Connection refused
04-Jul 08:30 henry-dir JobId 2363: Fatal error: Max wait time exceeded. Job canceled.
06-Jul 15:35 henry-dir JobId 2363: Error: message.c:730 Operator mail program terminated in error.
CMD=/usr/lib/bacula/bsmtp -h localhost -f "(Bacula henry.cooporso.it) backupadmin@localhost" -s "Bacula: Intervention needed for HenryBackup.2009-07-03_22.00.00_35" backupadmin@localhost
ERR=Child died from signal 15: Termination

After experiencing the problem for the first time, I modified the
configuration this way:
JobDefs {
  Name = "ServerJob"
  Type = Backup
  Level = Full
  Schedule = "WeeklyCycle"
  Storage = LTO-HENRY
  Messages = Standard
  Pool = Daily
  Priority = 10
  # After 10 1/2 hours from scheduled time cancel job if volume not mounted
  Max Wait Time = 37800
  # After 10 1/2 hours from scheduled time cancel job if not started
  Max Start Delay = 37800
  # After 20 hours from scheduled time cancel job if not completed
  Max Run Sched Time = 72000
  SpoolData = yes
}
# Backup the catalog database (after the nightly save)
Job {
  Name = "HenryCatalog"
  JobDefs = "ServerJob"
  Client = henry-fd
  Level = Full
  FileSet = "Catalog"
  Schedule = "WeeklyCycleAfterBackup"
  # This creates an ASCII copy of the catalog
  RunBeforeJob = "/usr/local/bin/backup_bacula_catalog"
  # This unmounts the tape after the last backup of the day or after
  # failure/cancel
  RunScript {
    Command = "/usr/local/bin/bacula_unmount_tape"
    RunsWhen = After
    RunsOnFailure = yes
    RunsOnClient = no
    RunsOnSuccess = yes
  }
  Write Bootstrap = "/var/lib/bacula/BackupCatalog.bsr"
  Priority = 100                      # run after main backup
  # After (10 1/2 hours - 20 minutes) from scheduled time cancel job if
  # volume not mounted
  # this must be 20 minutes less than Max Wait Time for backup jobs
  Max Wait Time = 36600
  # After (10 1/2 hours - 20 minutes) from scheduled time cancel job if
  # not started
  # this must be 20 minutes less than Max Start Delay for backup jobs
  Max Start Delay = 36600
  # After (20 hours - 20 minutes) from scheduled time cancel job if not
  # completed
  # this must be 20 minutes less than Max Run Sched Time for backup jobs
  Max Run Sched Time = 70800
}
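
All the timeouts in this configuration are given in seconds, so a
mistyped value is easy to miss. A minimal shell sketch to convert them
back to hours and minutes for comparison with the comments; the
function name secs_to_hm is hypothetical and not part of Bacula.

```sh
#!/bin/sh
# Convert a timeout given in seconds to hours and minutes ("HhMMm").
# Hypothetical helper for checking the config, not part of Bacula.
secs_to_hm() {
    secs=$1
    printf '%dh%02dm\n' $(( secs / 3600 )) $(( secs % 3600 / 60 ))
}

# Values taken from the JobDefs and the catalog job.
for t in 37800 36600 72000 70800; do
    printf '%s s = %s\n' "$t" "$(secs_to_hm "$t")"
done
```

This gives 10h30m, 10h10m, 20h00m and 19h40m, so each catalog value is
20 minutes below the corresponding backup value, as the comments say.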

I added "Max Run Sched Time = 72000", but the jobs got stuck anyway
(they are not running).
It seems that "Max Wait Time" is either ignored or causes the storage
daemon to stop responding.

Tell me if you need more information or the complete config files.
Best regards
Massimiliano Ferrero
--
Massimiliano Ferrero
Midhgard s.r.l.
C/so Svizzera 185 bis
c/o centro Piero della Francesca
10149 - Torino
tel. +39-0117575375
fax +39-0117768576
e-mail: m.ferrero AT midhgard DOT it
sito web: http://www.midhgard.it