Hi,
I had a Bacula director daemon die on me today; after simply restarting it, everything looks fine again.
Before it stopped, some strange things happened with the backup it was pulling in from a client.
The server (running the director and storage daemon) is a VM running Bacula 5.2.5 on Ubuntu Server 12.04 LTS.
As file daemon I am using the Bacula Systems Enterprise Windows client 6.0.6.
The server was created by cloning another VM that has been working flawlessly for months. After it was cloned, the machine name was changed, the database was cleaned up by purging all jobs, files, volumes and everything else I could find, and finally the config files were cleaned out so I could start adding a new set of clients and jobs.
Currently, there are two clients and two jobs defined.
The clients are two almost identical (and very old) Windows machines running the same application; they differ only in name and address (and database content).
One of these jobs has been running successfully for a week; the other was added yesterday and was started for the first time around 3 AM this morning.
Now the oldest client was still backed up successfully, but with the new one it almost looks as if a number of things went wrong at the same time. Essentially, it looks like the connection between FD and SD was lost, while at the same time the connection between FD and director, which is running on the same machine as the SD, remained up.
But the client is not where I am focusing now; I’m trying to find out what happened to the director at or after that moment.
When I came in this morning, I discovered that the director daemon was no longer running.
The log file ends like this (names edited):
25-Mar 05:25 bacula-dir-2 JobId 14975: Rescheduled Job client2.2014-03-24_09.15.32_13 at 25-Mar-2014 05:25 to re-run in 900 seconds (25-Mar-2014 05:40).
25-Mar 05:25 bacula-dir-2 JobId 14976: Job client2.2014-03-25_05.25.16_58 waiting 900 seconds for scheduled start time.
25-Mar 05:26 bacula-dir-2 JobId 14976: Fatal error: Max run time exceeded. Job canceled.
25-Mar 05:26 bacula-dir-2 JobId 14976: Fatal error: Job canceled because max start delay time exceeded.
25-Mar 05:25 bacula-dir-2 JobId 14976: Job client2.2014-03-25_05.25.16_58 waiting 900 seconds for scheduled start time.
25-Mar 05:26 bacula-dir-2 JobId 14976: Fatal error: Max run time exceeded. Job canceled.
25-Mar 05:26 bacula-dir-2 JobId 14976: Fatal error: Job canceled because max start delay time exceeded.
Which is strange in more than one regard:
· The job is rescheduled to run in 900 seconds, then times out a minute later.
· This happens twice in a row, but the clock also seems to have run backwards in between, so I guess I’m just seeing the same messages written to the log twice.
· There is no maximum run time defined in my config, so the default of 6 days should apply, but this client was only added to the .conf yesterday. In fact, the FDs are running speed-capped at 1.5 Mbps on 2 Mbps connections; the backup was expected to take somewhere between 16 and 20 hours to finish, but it failed (and hence the reschedule) after 2.5 hours. The other client completed in 15 hours, and that one’s database is slightly smaller.
· There is no indication as to why the daemon stopped.
All I can add is that it still mailed this job’s result to me, so the director must have stopped after the job was considered finished and before I arrived at about 7:30.
I checked other log files (syslog etc.), but found no indication there either.
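
As a rough cross-check of the transfer times mentioned above, the sketch below estimates how much data fits through the 1.5 Mbps cap in a given number of hours. It is only an approximation: it assumes decimal units (1 Mbps = 1,000,000 bits per second), that the cap is fully used the whole time, and it ignores compression and protocol overhead. The function name is made up for this illustration.

```python
# Rough upper bound on data moved through a rate-limited link.
# Assumptions (not from the Bacula configuration): decimal units,
# the cap is saturated the whole time, no compression or overhead.

def gigabytes_transferred(rate_mbps: float, hours: float) -> float:
    """Return the data volume in GB moved at rate_mbps over the given hours."""
    bits = rate_mbps * 1_000_000 * hours * 3600
    return bits / 8 / 1_000_000_000


if __name__ == "__main__":
    # 2.5 h = time until the failure, 16 and 20 h = expected duration
    for hours in (2.5, 16, 20):
        gb = gigabytes_transferred(1.5, hours)
        print(f"{hours:>4} h at 1.5 Mbps ~ {gb:.1f} GB")
```

Under these assumptions, the 2.5 hours before the failure correspond to at most about 1.7 GB, against roughly 10.8 to 13.5 GB for the expected 16 to 20 hour run, so the job failed long before it could have been close to complete.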