[Bacula-users] Problem with "Max wait time" when not changing tape

Hello, we are running bacula 3.0.1 on several Debian "Lenny" 5.0 servers

Debian packages are self-compiled backporting sources from the debian 
unstable distribution to the stable one

On one environment with a tape drive since the upgrade we are 
experiencing problems when the customer forget to change the tape: jobs 
are no more cancelled.
I configured bacula so that when last backup (catalog job) ends the tape 
is ejected, this is the device in bacula-sd.conf

Device {
  Name = LTO2
  Media Type = LTO2
  Archive Device = /dev/nst0
  AutomaticMount = yes;               # when device opened, read it
  AlwaysOpen = yes;
  RemovableMedia = yes;
  RandomAccess = no;
  Offline On Unmount = yes;           # eject tape on unmount
  Maximum Spool Size = 9900000000;
  Spool Directory = /var/spool/bacula;
}

I used "Offline On Unmount = yes;" and when the catalog job ends it 
unmount the tape


So usually if the customer forget to change the tape this means that no 
tape is inserted.

In the bacula-dir.conf we have configured the jobs this way:

JobDefs {
  Name = "ServerJob"
  Type = Backup
  Level = Full
  Schedule = "WeeklyCycle"
  Storage = LTO-HENRY
  Messages = Standard
  Pool = Daily
  Priority = 10
  # After 10 1/2 hours from scheduled time cancel job if volume not mounted
  Max Wait Time = 37800
  # After 10 1/2 hours from scheduled time cancel job if not started
  Max Start Delay = 37800
  SpoolData = yes
}

# Define jobs
Job {
  Name = "HenryBackup"
  JobDefs = "ServerJob"
  Client = henry-fd
  FileSet = "Full Set Linux"
  Priority = 10
  Write Bootstrap = "/var/lib/bacula/henry.bsr"
  # mount tape before first backup
  Run Before Job = "/usr/local/bin/bacula_mount_tape"
}

Job {
  Name = "FalstaffBackup"
  JobDefs = "ServerJob"
  Client = falstaff-fd
  FileSet = "Full Set Linux"
  Priority = 30
  Write Bootstrap = "/var/lib/bacula/falstaff.bsr"
}

... (other servers jobs)

# Backup the catalog database (after the nightly save)
Job {
  Name = "HenryCatalog"
  JobDefs = "ServerJob"
  Client = henry-fd
  Level = Full
  FileSet = "Catalog"
  Schedule = "WeeklyCycleAfterBackup"
  # This creates an ASCII copy of the catalog
  RunBeforeJob = "/usr/local/bin/backup_bacula_catalog"
  # This unmount the tape after last backup of the day or after 
failure/cancel
  RunScript {
    Command = "/usr/local/bin/bacula_unmount_tape"
    RunsWhen = After
    RunsOnFailure = yes
    RunsOnClient  = no
    RunsOnSuccess = yes
  }
  Write Bootstrap = "/var/lib/bacula/BackupCatalog.bsr"
  Priority = 100                   # run after main backup
  # After (10 1/2 hours - 20 minutes) from scheduled time cancel job if 
volume not mounted
  # this must be 20 minute less than Max Wait Time for backup jobs
  Max Wait Time = 36600
  # After (10 1/2 hours - 20 minutes) from scheduled time cancel job if 
not started
  # this must be 20 minute less than Max Start Delay for backup jobs
  Max Start Delay = 36600
}



/usr/local/bin/bacula_mount_tape contains the following command:

#!/bin/sh

/usr/bin/bconsole < /usr/local/share/bacula/mount.txt



/usr/local/share/bacula/mount.txt contains:

mount
quit




/usr/local/bin/bacula_unmount_tape contains the following command:

#!/bin/sh

/usr/bin/bconsole < /usr/local/share/bacula/unmount.txt



/usr/local/share/bacula/unmount.txt contains:
unmount
quit




Before the upgrade (bacula version 2.4.4) "Max Wait Time = 37800" was 
enough to get the job cancelled when no tape was inserted



Note: this didn't work until in the catalog job we put "Max Start Delay 
= 36600" that is 20 minute less than first job.
The first job is scheduled at 22:00, catalog job is scheduled at 22:10
I think that if both jobs have "Max Wait Time = 37800" this is what happens:
- 22:00 first backup starts, no tape is inserted, the job starts waiting
- 22:10 catalog job is queued
- 8:30   first job is cancelled -> catalog job starts waiting on the 
tape mount, it would get deleted after 10:30 hours after this time

If "Max Start Delay = 36600" is used for the catalog jobs:

- 22:00 first backup starts, no tape is inserted, the job starts waiting
- 22:10 catalog job is queued
- 8:20   catalog job has not started, it is canceled
- 8:30   first job is cancelled

Could this be correct? As I said it did work with 2.4.4




After the upgrade (bacula version 3.0.1) the jobs are not deleted 
anymore, even with this configuration.
A status director shows the following:

*s 
dir                                                                             
                                              

henry-dir Version: 3.0.1 (30 April 2009) x86_64-pc-linux-gnu debian 
5.0.1                                                       
Daemon started 30-Jun-09 07:17, 18 Jobs run since 
started.                                                                       

 Heap: heap=5,373,952 smbytes=226,039 max_bytes=226,616 bufs=743 
max_bufs=748                                                   

Scheduled Jobs:
Level          Type     Pri  Scheduled          Name               Volume
===================================================================================
Full           Backup    10  06-Jul-09 22:00    HenryBackup        
*unknown*      
Full           Backup    20  06-Jul-09 22:00    KatherineBackup    
*unknown*      
Full           Backup    30  06-Jul-09 22:00    FalstaffBackup     
*unknown*      
Full           Backup    40  06-Jul-09 22:00    Server2003Backup   
*unknown*      
Full           Backup    50  06-Jul-09 22:00    MortimerBackup     
*unknown*      
Full           Backup   100  06-Jul-09 22:10    HenryCatalog       
*unknown*      
====                                                                            
   


Running Jobs:
Console connected at 04-Jul-09 15:30
Console connected at 05-Jul-09 15:30
Console connected at 06-Jul-09 15:14
Console connected at 06-Jul-09 15:24
 JobId Level   Name                       Status
======================================================================
  2363 Full    HenryBackup.2009-07-03_22.00.00_35 is waiting for a mount 
request
  2364 Full    KatherineBackup.2009-07-03_22.00.00_36 is waiting 
execution     
  2365 Full    FalstaffBackup.2009-07-03_22.00.00_37 is waiting 
execution      
  2366 Full    Server2003Backup.2009-07-03_22.00.00_38 is waiting 
execution    
  2367 Full    MortimerBackup.2009-07-03_22.00.00_39 is waiting 
execution      
  2368 Full    HenryCatalog.2009-07-03_22.10.00_41 is waiting 
execution        
====                                                                            


Terminated Jobs:
 JobId  Level    Files      Bytes   Status   Finished        Name
====================================================================
  2353  Full    399,299    43.32 G  OK       02-Jul-09 05:03 FalstaffBackup
  2354  Full     36,717    9.304 G  OK       02-Jul-09 05:48 
Server2003Backup
  2355  Full     31,812    662.1 M  OK       02-Jul-09 05:58 
MortimerBackup 
  2356  Full          1    286.9 M  OK       02-Jul-09 06:01 
HenryCatalog   
  2357  Full    277,949    80.08 G  OK       03-Jul-09 01:35 
HenryBackup    
  2358  Full     90,621    9.272 G  OK       03-Jul-09 02:32 
KatherineBackup
  2359  Full    398,727    43.73 G  OK       03-Jul-09 05:06 
FalstaffBackup 
  2360  Full     36,722    9.184 G  OK       03-Jul-09 05:51 
Server2003Backup
  2361  Full     31,812    663.7 M  OK       03-Jul-09 06:04 
MortimerBackup 
  2362  Full          1    287.1 M  OK       03-Jul-09 06:07 
HenryCatalog   


a "status storage" hangs
in this condition if we try to stop the storage daemon with init script 
it does not stop
it is necessary to kill -9 the process and then restart it
after killing the storage daemon and restarting it we got these messages 
(we are using TLS to encrypt all comunications):

06-Jul 15:35 henry-dir JobId 0: Fatal error: authenticate.c:120 Director 
unable to authenticate with Storage daemon at "henry.cooporso.it:9103". 
Possible 
causes:                                                                         
                        

Passwords or names not the same 
or                                                                              
                 

Maximum Concurrent Jobs exceeded on the SD 
or                                                                              
      

SD networking messed up (restart 
daemon).                                                                        
                

Please see 
http://www.bacula.org/en/rel-manual/Bacula_Freque_Asked_Questi.html#SECTION003760000000000000000
 
for help.           
06-Jul 15:35 henry-dir JobId 0: Failed to connect to Storage 
daemon.                                                            
04-Jul 13:00 henry-sd JobId 2363: Please mount Volume "Venerdi1" or 
label a new one for:                                        
    Job:          
HenryBackup.2009-07-03_22.00.00_35                                              
                               

    Storage:      "LTO2" 
(/dev/nst0)                                                                     
                        

    Pool:         
PoolVenerdi1                                                                    
                               

    Media type:   
LTO2                                                                            
                               

06-Jul 15:35 henry-dir JobId 0: Error: bsock.c:346 Wrote 19 bytes to 
client:192.168.10.5:36131, but only 0 accepted.            
05-Jul 05:00 henry-sd JobId 2363: Please mount Volume "Venerdi1" or 
label a new one for:
    Job:          HenryBackup.2009-07-03_22.00.00_35
    Storage:      "LTO2" (/dev/nst0)
    Pool:         PoolVenerdi1
    Media type:   LTO2
06-Jul 15:35 henry-dir JobId 0: Error: bsock.c:346 Wrote 19 bytes to 
client:192.168.10.5:36131, but only 0 accepted.
06-Jul 05:00 henry-sd JobId 2363: Please mount Volume "Venerdi1" or 
label a new one for:
    Job:          HenryBackup.2009-07-03_22.00.00_35
    Storage:      "LTO2" (/dev/nst0)
    Pool:         PoolVenerdi1
    Media type:   LTO2
06-Jul 15:34 henry-sd JobId 2363: Fatal error: Job 2363 canceled.
06-Jul 15:34 henry-sd JobId 2363: Fatal error: fd_cmds.c:170 Command 
error with FD, hanging up. Append data error.

06-Jul 15:35 henry-dir JobId 0: Fatal error: bsock.c:135 Unable to 
connect to Storage daemon on henry.cooporso.it:9103. ERR=Connection refused
06-Jul 15:35 henry-dir JobId 0: Error: bsock.c:346 Wrote 19 bytes to 
client:192.168.10.5:36131, but only 0 accepted.
06-Jul 15:35 henry-dir JobId 0: Error: openssl.c:86 TLS shutdown 
failure.: ERR=error:1409F07F:SSL routines:SSL3_WRITE_PENDING:bad write retry
06-Jul 15:35 henry-dir JobId 0: Fatal error: bsock.c:135 Unable to 
connect to Storage daemon on henry.cooporso.it:9103. ERR=Connection refused
06-Jul 15:35 henry-dir JobId 0: Error: bsock.c:346 Wrote 19 bytes to 
client:192.168.10.5:36131, but only 0 accepted.
06-Jul 15:35 henry-dir JobId 0: Error: openssl.c:86 TLS shutdown 
failure.: ERR=error:1409F07F:SSL routines:SSL3_WRITE_PENDING:bad write retry
06-Jul 15:35 henry-dir JobId 0: Fatal error: bsock.c:135 Unable to 
connect to Storage daemon on henry.cooporso.it:9103. ERR=Connection refused
04-Jul 08:30 henry-dir JobId 2363: Fatal error: Max wait time exceeded. 
Job canceled.
06-Jul 15:35 henry-dir JobId 2363: Error: message.c:730 Operator mail 
program terminated in error.
CMD=/usr/lib/bacula/bsmtp -h localhost -f "(Bacula henry.cooporso.it) 
backupadmin@localhost" -s "Bacula: Intervention needed 
forHenryBackup.2009-07-03_22.00.00_35" backupadmin@localhost
ERR=Child died from signal 15: Termination

After experiencing the problem a first time I modified the configuration 
this way:

JobDefs {
  Name = "ServerJob"
  Type = Backup
  Level = Full
  Schedule = "WeeklyCycle"
  Storage = LTO-HENRY
  Messages = Standard
  Pool = Daily
  Priority = 10
  # After 10 1/2 hours from scheduled time cancel job if volume not mounted
  Max Wait Time = 37800
  # After 10 1/2 hours from scheduled time cancel job if not started
  Max Start Delay = 37800
  # After 20 hours from scheduled time cancel job if not completed
  Max Run Sched Time = 72000
  SpoolData = yes
}

# Backup the catalog database (after the nightly save)
Job {
  Name = "HenryCatalog"
  JobDefs = "ServerJob"
  Client = henry-fd
  Level = Full
  FileSet = "Catalog"
  Schedule = "WeeklyCycleAfterBackup"
  # This creates an ASCII copy of the catalog
  RunBeforeJob = "/usr/local/bin/backup_bacula_catalog"
  # This unmount the tape after last backup of the day or after 
failure/cancel
  RunScript {
    Command = "/usr/local/bin/bacula_unmount_tape"
    RunsWhen = After
    RunsOnFailure = yes
    RunsOnClient  = no
    RunsOnSuccess = yes
  }
  Write Bootstrap = "/var/lib/bacula/BackupCatalog.bsr"
  Priority = 100                   # run after main backup
  # After (10 1/2 hours - 20 minutes) from scheduled time cancel job if 
volume not mounted
  # this must be 20 minute less than Max Wait Time for backup jobs
  Max Wait Time = 36600
  # After (10 1/2 hours - 20 minutes) from scheduled time cancel job if 
not started
  # this must be 20 minute less than Max Start Delay for backup jobs
  Max Start Delay = 36600
  # After 20 hours from scheduled time cancel job if not completed
  # this must be 20 minute less than Max Run Sched Time for backup jobs
  Max Run Sched Time = 70800
}

I added "Max Run Sched Time = 72000" but the job got stuck anyway (they 
are not running).
It seems like "Max Wait Time" is ignored or causes the storage daemon to 
stop responding.

Tell me if you need more info or complete config files.


Best regards
Massimiliano Ferrero

-- 

Massimiliano Ferrero
Midhgard s.r.l.
C/so Svizzera 185 bis
c/o centro Piero della Francesca
10149 - Torino
tel. +39-0117575375
fax  +39-0117768576
e-mail: m.ferrero AT midhgard DOT it
sito web: http://www.midhgard.it


------------------------------------------------------------------------------
_______________________________________________
Bacula-users mailing list
Bacula-users AT lists.sourceforge DOT net
https://lists.sourceforge.net/lists/listinfo/bacula-users