Hello List,
Moving on with my implementation of Bacula as a second level of data
security, I am having problems with timeouts.
I have manually set every timeout I can find in the documentation (to
make sure the defaults haven't changed since the documentation was
written). Also, to make sure Bacula isn't waiting on stale TCP/IP
connections, I have modified the bacula-sd / bacula-dir IP configuration
to time out after a much shorter time (minutes/hours instead of days). I
upgraded from 2.2.8 to 2.4.1, but the problem remained, and I cannot
figure out why Bacula insists on acting this way.
Context: I have a couple of hourly jobs that take Incremental backups
and a couple of daily jobs that take Full backups. When one of the
backups stalls in the state "Jobname.date is waiting for Client
clientname-fd to connect to Storage FileStorage", it will sometimes
cause all jobs that come after it to stall as well.
After one day, the job list looks something like this:
Running Jobs:
JobId Level Name Status
======================================================================
4932 Full Server1ServerImages.2008-07-21_10.10.18 has been canceled
4942 Full Windows1_Backup.2008-07-21_11.30.28 is waiting for Client windows1-fd to connect to Storage FileStorage
4971 Full BackupCatalog.2008-07-21_14.07.11 has been canceled
4972 Full FailJob.2008-07-21_14.07.12 has been canceled
4973 Increme Server2Home.2008-07-21_15.00.13 has been canceled
4974 Increme Util-MySQL.2008-07-21_15.00.14 has been canceled
4975 Increme MailHome.2008-07-21_15.00.15 has been canceled
4976 Increme CVSRoot.2008-07-21_15.00.16 has been canceled
4977 Increme SVNRoot.2008-07-21_15.00.17 has been canceled
4978 Increme LDAP-Dump.2008-07-21_15.00.18 has been canceled
4979 Increme Server3Data.2008-07-21_15.00.19 has been canceled
4980 Increme Server2Home.2008-07-21_16.00.20 has been canceled
4981 Increme Util-MySQL.2008-07-21_16.00.21 has been canceled
4982 Increme MailHome.2008-07-21_16.00.22 has been canceled
4983 Increme CVSRoot.2008-07-21_16.00.23 has been canceled
4984 Increme SVNRoot.2008-07-21_16.00.24 has been canceled
--- snip ---
5105 Increme Server4Home.2008-07-22_10.00.25 is waiting execution
5106 Increme Server4WWW.2008-07-22_10.00.26 has been canceled
5107 Increme Server4Config.2008-07-22_10.00.27 has been canceled
5108 Increme CVSRoot.2008-07-22_10.00.28 has been canceled
5109 Increme SVNRoot.2008-07-22_10.00.29 has been canceled
5110 Increme LDAP-Dump.2008-07-22_10.00.30 has been canceled
5111 Increme Server3Data.2008-07-22_10.00.31 is waiting execution
5112 Increme Windows1_Backup.2008-07-22_10.00.32 has been canceled
5113 Increme Windows2_Backup.2008-07-22_10.00.33 has been canceled
5114 Increme Windows3_Backup.2008-07-22_10.00.34 has been canceled
5115 Full Windows4_Backup.2008-07-22_10.00.35 is waiting execution
5116 Full Windows5_Backup.2008-07-22_10.00.36 is waiting execution
5117 Full Server1ServerConfigs.2008-07-22_10.05.37 is waiting execution
5118 Full Server1ServerLogs.2008-07-22_10.05.38 is waiting execution
5119 Full Server1ProductionData.2008-07-22_10.05.39 is waiting execution
====
Now, when I try to manually cancel the job by its JobId:
*cancel jobid=4942
3904 Job Windows1_Backup.2008-07-21_11.30.28 not found.
I think this is because the job hasn't actually started yet.
Relevant config file excerpts:
bacula-sd.conf:
Storage {                             # definition of myself
  Name = bacula-sd
  SDPort = 9103                       # Director's port
  WorkingDirectory = "/var/lib/bacula"
  Pid Directory = "/var/run"
  Maximum Concurrent Jobs = 20
  ---snip---
  Heartbeat Interval = 5
  Client Connect Wait = 2n
}
bacula-dir.conf:
Director {                            # define myself
  Name = bacula-dir
  DIRport = 9101                      # where we listen for UA connections
  QueryFile = "/usr/libexec/bacula/query.sql"
  WorkingDirectory = "/var/lib/bacula"
  PidDirectory = "/var/run"
  Maximum Concurrent Jobs = 20
  --snip--
  Heartbeat Interval = 5
  FD Connect Timeout = 30n            #n = minutes
  SD Connect Timeout = 30n            #n = minutes
}
Job config:
JobDefs {
  Name = "WindowsJob"
  Type = Backup
  Storage = FileStorage
  SpoolData = no
  Max Start Delay = 55n               #n = minutes
  Max Wait Time = 55n
  Incremental Max Wait Time = 15n
  Differential Max Wait Time = 15n
  Rerun Failed Levels = yes
  --snip--
}
Job {
  Name = "Windows1 Backup"
  JobDefs = "WindowsJob"
  Client = "windows1-fd"
  FileSet = "WindowsNoDDrive"
}
I found the definition of the time units after digging through the
documentation, and unless "n" means months rather than minutes, I am
clueless as to why the timeout takes so long. Also, even if that one
job hangs and takes the whole FileStorage storage down with it, why
doesn't the daemon accept more connections? The worst part is that even
the tape jobs get cancelled just because a Windows client happens to be
off.
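To put numbers on that, here is a minimal sketch of the arithmetic. It
assumes "n" really does mean minutes, which is my reading of the
documentation and not something confirmed against the Bacula source. The
unit table and the helper name to_seconds are hypothetical and only
cover the units used here. The two timestamps are taken from the job
names in the listing above (job 4942 and the last job shown).

```python
import string
from datetime import datetime

# Hypothetical unit table: "n" = minutes is an assumption taken from my
# reading of the documentation; a bare number is treated as seconds.
ASSUMED_UNITS = {"s": 1, "n": 60}


def to_seconds(spec):
    """Convert a value such as '55n' or '5' to seconds (assumed units)."""
    spec = spec.strip().lower()
    number = spec.rstrip(string.ascii_lowercase)
    unit = spec[len(number):] or "s"
    return int(number) * ASSUMED_UNITS[unit]


# Timeout-related directives exactly as set in the excerpts above.
configured = {
    "Client Connect Wait": "2n",
    "FD Connect Timeout": "30n",
    "SD Connect Timeout": "30n",
    "Max Start Delay": "55n",
    "Max Wait Time": "55n",
    "Incremental Max Wait Time": "15n",
    "Differential Max Wait Time": "15n",
}

longest = max(to_seconds(value) for value in configured.values())

# Job 4942 was queued at 11:30:28 on 2008-07-21; the newest job in the
# listing was queued at 10:05:39 on 2008-07-22 and 4942 was still waiting.
queued = datetime(2008, 7, 21, 11, 30, 28)
last_seen = datetime(2008, 7, 22, 10, 5, 39)
stalled_for = int((last_seen - queued).total_seconds())

print(f"longest configured timeout: {longest} s")
print(f"job 4942 waiting for at least: {stalled_for} s")
print(f"ratio: {stalled_for / longest:.1f}x")
```

If the assumption about "n" holds, the job has been waiting many times
longer than the largest timeout I configured, so none of these
directives seems to apply to this particular wait state.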
Additional information I can think of:
The Windows client was off when the job was started.
The DNS name of the client was not resolvable (because it was off) when
the job started.
Normally, when a client is not on the network, the SD gets a
host-not-found error and fails the jobs correctly.
The only way I can get Bacula to start jobs again is to restart
(actually kill) bacula-dir.
The SD status is completely idle during this whole time.
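As a possible stopgap, the client could be probed before a job is
queued. The following is a minimal sketch, not a fix for the underlying
stall: the function name client_reachable is hypothetical, and port 9102
is an assumption (the usual default File Daemon port, which does not
appear in my excerpts). Whether calling something like this from a
pre-job script actually avoids the hang is untested.

```python
import socket
import sys


def client_reachable(host, port=9102, timeout=5.0):
    """Return True if host resolves and accepts a TCP connection on port.

    Port 9102 is assumed to be the File Daemon port; adjust as needed.
    DNS failures, refused connections and timeouts all count as
    'not reachable' (they are all subclasses of OSError).
    """
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        return False


if __name__ == "__main__" and len(sys.argv) > 1:
    # Exit status 0 if the client answers, 1 otherwise, so the result
    # can be used from a shell script.
    sys.exit(0 if client_reachable(sys.argv[1]) else 1)
```

Called with a client name such as windows1-fd as its only argument, the
script exits non-zero when the machine is off or its name does not
resolve, which are the two conditions that were true when job 4942 was
started.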
I could move the Windows jobs to a separate SD just for them, but then
I would not be able to migrate a monthly image to tape. Any other ideas
on how I could solve this? I have looked for documentation on "Copy
jobs" and "Copy Pools" but have not found anything online or in the
documentation for 2.4.1.
Luckily this is still in development and not the primary backup service,
but before taking it to a larger scale I need some help with this.
Thank you for your time,
Christian Gaul
P.S.: My last question, about bacula-sd marking some media WRPROT, is
still unresolved. Randomly, while labeling media, bacula-sd throws a
read-only error and the media stays WRPROT even through reboots /
SCSI reset / shutdown -h / media change. I haven't seen it since the
switch to 2.4.1, so maybe it has gone away. I'll ask again if I see it
again.
--
Christian Gaul
otop AG
55116 Mainz
Rheinstraße 105-107
fon: 06131 / 57 63 - 310
fax: 06131 / 57 63 - 500
web: http://www.otop.de
Vorsitzender des Aufsichtsrats: Christof Glasmacher
Vorstand: Dirk Flug
Amtsgericht Mainz - HRB 7647 -
_______________________________________________
Bacula-users mailing list
Bacula-users AT lists.sourceforge DOT net
https://lists.sourceforge.net/lists/listinfo/bacula-users