[Bacula-users] SD waiting indefinitely for FD to connect

2008-07-22 05:39:48
Subject: [Bacula-users] SD waiting indefinitely for FD to connect
From: Christian Gaul <christian.gaul AT otop DOT de>
To: bacula-users AT lists.sourceforge DOT net
Date: Tue, 22 Jul 2008 11:09:06 +0200
Hello List,

Moving on with my implementation of Bacula as a second level of data
security, I am having problems with timeouts.

I have manually set every timeout I can find in the documentation (in
case the defaults have changed since the documentation was written).
Also, to make sure Bacula isn't waiting on stale TCP/IP connections, I
have modified the IP configuration of bacula-sd / bacula-dir to time out
after a much shorter period (minutes/hours instead of days). I have
upgraded from 2.2.8 to 2.4.1, but the problem remains and I cannot
figure out why Bacula insists on acting this way.

Background: I have a couple of hourly jobs that take Incremental backups
and a couple of daily jobs that take Full backups. When one of the
backups stalls in the state "Jobname.date is waiting for Client
clientname-fd to connect to Storage FileStorage", it will sometimes
cause all jobs that come after it to stall as well.

After one day, the job list looks roughly like this:

Running Jobs:
 JobId Level   Name                       Status
======================================================================
  4932 Full    Server1ServerImages.2008-07-21_10.10.18 has been canceled
  4942 Full    Windows1_Backup.2008-07-21_11.30.28 is waiting for Client windows1-fd to connect to Storage FileStorage
  4971 Full    BackupCatalog.2008-07-21_14.07.11 has been canceled
  4972 Full    FailJob.2008-07-21_14.07.12 has been canceled
  4973 Increme  Server2Home.2008-07-21_15.00.13 has been canceled
  4974 Increme  Util-MySQL.2008-07-21_15.00.14 has been canceled
  4975 Increme  MailHome.2008-07-21_15.00.15 has been canceled
  4976 Increme  CVSRoot.2008-07-21_15.00.16 has been canceled
  4977 Increme  SVNRoot.2008-07-21_15.00.17 has been canceled
  4978 Increme  LDAP-Dump.2008-07-21_15.00.18 has been canceled
  4979 Increme  Server3Data.2008-07-21_15.00.19 has been canceled
  4980 Increme  Server2Home.2008-07-21_16.00.20 has been canceled
  4981 Increme  Util-MySQL.2008-07-21_16.00.21 has been canceled
  4982 Increme  MailHome.2008-07-21_16.00.22 has been canceled
  4983 Increme  CVSRoot.2008-07-21_16.00.23 has been canceled
  4984 Increme  SVNRoot.2008-07-21_16.00.24 has been canceled
--- snip ---
  5105 Increme  Server4Home.2008-07-22_10.00.25 is waiting execution
  5106 Increme  Server4WWW.2008-07-22_10.00.26 has been canceled
  5107 Increme  Server4Config.2008-07-22_10.00.27 has been canceled
  5108 Increme  CVSRoot.2008-07-22_10.00.28 has been canceled
  5109 Increme  SVNRoot.2008-07-22_10.00.29 has been canceled
  5110 Increme  LDAP-Dump.2008-07-22_10.00.30 has been canceled
  5111 Increme  Server3Data.2008-07-22_10.00.31 is waiting execution
  5112 Increme  Windows1_Backup.2008-07-22_10.00.32 has been canceled
  5113 Increme  Windows2_Backup.2008-07-22_10.00.33 has been canceled
  5114 Increme  Windows3_Backup.2008-07-22_10.00.34 has been canceled
  5115 Full    Windows4_Backup.2008-07-22_10.00.35 is waiting execution
  5116 Full    Windows5_Backup.2008-07-22_10.00.36 is waiting execution
  5117 Full    Server1ServerConfigs.2008-07-22_10.05.37 is waiting execution
  5118 Full    Server1ServerLogs.2008-07-22_10.05.38 is waiting execution
  5119 Full    Server1ProductionData.2008-07-22_10.05.39 is waiting execution
====

Now, when I try to cancel the job manually:

*cancel jobid=4942
3904 Job Windows1_Backup.2008-07-21_11.30.28 not found.

I think this is because the job hasn't actually started yet.

Relevant config file excerpts:

bacula-sd.conf:

Storage {                             # definition of myself
  Name = bacula-sd
  SDPort = 9103                  # Director's port     
  WorkingDirectory = "/var/lib/bacula"
  Pid Directory = "/var/run"
  Maximum Concurrent Jobs = 20
  ---snip---
  Heartbeat Interval = 5
  Client Connect Wait = 2n
}


bacula-dir.conf:

Director {                            # define myself
  Name = bacula-dir
  DIRport = 9101                # where we listen for UA connections
  QueryFile = "/usr/libexec/bacula/query.sql"
  WorkingDirectory = "/var/lib/bacula"
  PidDirectory = "/var/run"
  Maximum Concurrent Jobs = 20
  --snip--
  Heartbeat Interval = 5
  FD Connect Timeout = 30n      #n = minutes
  SD Connect Timeout = 30n      #n = minutes
}

Job config:
JobDefs {
  Name = "WindowsJob"
  Type = Backup
  Storage = FileStorage
  SpoolData = no
  Max Start Delay = 55n #n = minutes
  Max Wait Time = 55n
  Incremental Max Wait Time = 15n
  Differential Max Wait Time = 15n
  Rerun Failed Levels = yes
 --snip--
}
Job {
  Name = "Windows1 Backup"
  JobDefs = "WindowsJob"
  Client = "windows1-fd"
  FileSet = "WindowsNoDDrive"
}

I found the definition of the time modifiers after digging through the
documentation, and unless "n" means months rather than minutes, I am
clueless as to why the timeout takes so long. Also, even if that one
job hangs and takes the whole Storage FileStorage with it, why doesn't
the daemon accept more connections? The worst part is that even the
tape jobs get cancelled because a Windows client happens to be off.
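
To make the arithmetic explicit, here is a minimal Python sketch of how
I read the time modifiers. It is not Bacula code: the unit table, the
prefix matching of abbreviations and the treatment of a bare number as
seconds are my assumptions from the documentation, and to_seconds is
just a name made up for illustration.

```python
import re

# Assumed table of (unit name, seconds). This is my reading of the
# documentation, not a copy of Bacula's parser. "n" is taken as minutes.
# Order matters: an abbreviation resolves to the first name it prefixes,
# so a bare "m" lands on months here (also an assumption).
UNITS = [
    ("n", 60),
    ("seconds", 1),
    ("months", 30 * 24 * 3600),
    ("minutes", 60),
    ("mins", 60),
    ("hours", 3600),
    ("days", 24 * 3600),
    ("weeks", 7 * 24 * 3600),
]

# One token is a number optionally followed by a unit, e.g. "30n" or "1 hour".
TOKEN = re.compile(r"(\d+)\s*([a-z]*)")


def unit_seconds(unit):
    """Return the number of seconds for one (possibly abbreviated) unit."""
    if unit == "":
        return 1  # assumption: a bare number is in seconds
    for name, seconds in UNITS:
        if name.startswith(unit):
            return seconds
    raise ValueError("unknown time modifier: %r" % unit)


def to_seconds(spec):
    """Convert a spec such as '30n' or '1 hour 30 min' to seconds."""
    text = spec.strip().lower()
    tokens = TOKEN.findall(text)
    # Anything left after removing all tokens means the spec is malformed.
    leftover = TOKEN.sub("", text).strip()
    if not tokens or leftover:
        raise ValueError("cannot parse time spec: %r" % spec)
    return sum(int(count) * unit_seconds(unit) for count, unit in tokens)


# The values from the config excerpts above.
for spec in ("2n", "15n", "30n", "55n"):
    print(spec, "=", to_seconds(spec), "seconds")
```

Under that reading, 30n is 1800 seconds and 55n is 3300 seconds, so
none of the configured values should come anywhere near a full day.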

Additional information I can think of:
The Windows client was off when the job was started.
The DNS name of the client was not resolvable (because it was off) when
the job started.
The SD normally (when a client is not on the network) gets a
host-not-found error and fails the jobs correctly.
The only thing I can do to get Bacula to start jobs again is restart
(actually kill) the bacula-dir.
The SD status during this whole time is totally idle.

I would move the Windows jobs to an extra SD just for them, but then I
would not be able to migrate a monthly image to tape. Any other ideas
on how I could solve this issue? I have tried looking for
documentation on "Copy jobs" and "Copy Pools" but have not been able to
find anything online or in the documentation for 2.4.1.

Luckily this is still in development and is not the primary backup
service, but before taking it to a larger scale I need some help on this.

Thank you for your time,
Christian Gaul

P.S.: My last question about bacula-sd marking some media WRPROT is
still unresolved. Randomly, when labeling media, bacula-sd will throw a
read-only error and the media will stay WRPROT even through reboots /
SCSI reset / shutdown -h / media change. I haven't seen it since the
switch to 2.4.1, so maybe it has gone away. I'll ask again if I see it
again.

-- 
Christian Gaul
otop AG
55116 Mainz
Rheinstraße 105-107
fon: 06131 / 57 63 - 310
fax: 06131 / 57 63 - 500
web: http://www.otop.de

Vorsitzender des Aufsichtsrats: Christof Glasmacher
Vorstand: Dirk Flug
Amtsgericht Mainz - HRB 7647 -


