Yes, the database is large. The File table has 370+ million records and Bacula is pruning. I'll see what we can do with mysqltuner. I have a feeling I'm occasionally getting the timeout error because there are so many records. If I can't find a cure for the situation, I'm thinking of splitting up our backups using two catalog servers -- one for production and one for test/dev. Do you think this is wise? Currently, our DBAs don't support Postgres so it may not be an option. Would Oracle be an option instead?
No. Even the Enterprise version only has support for the Catalog in MySQL, PostgreSQL and SQLite databases. If you do not have success with tuning, or a migration to PostgreSQL is not feasible, then yes, maybe splitting the catalog is a good idea. Perhaps you can also prune more Files (keeping Jobs and Volumes, but not Files). The File table grows rapidly with a large number of files if they are not pruned properly.
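If pruning Files more aggressively is an option, it can be driven by the Client resource's retention settings, or done by hand from bconsole. A minimal sketch with hypothetical names and illustrative (not recommended) retention periods:

```
# bacula-dir.conf: keep Job/Volume history, but expire File records sooner
Client {
  Name = backup-client-fd      # hypothetical client name
  File Retention = 30 days     # File table rows eligible for pruning after 30 days
  Job Retention = 6 months     # Job records kept much longer
  AutoPrune = yes              # prune automatically at end of job
}
```

The same pruning can be triggered manually in bconsole with "prune files client=backup-client-fd" once the File Retention period has passed.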
I'll have to check on the "--enable-batch-insert" option. Is there a Bacula command to see what options Bacula was built with? I did not set up our installation and am new to Bacula.
If you still have the source installation directory, you will find this information in the bacula-7.0.5/config.out file, on the "Batch insert enabled" line. I'm quite sure this defaults to yes, so I think your Bacula was built with batch insert enabled.
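A quick way to check that line, assuming the source tree is still at bacula-7.0.5/ (the path above; adjust if yours differs):

```
# Show the batch-insert line from the saved configure summary
grep -i "batch insert" bacula-7.0.5/config.out
```

If nothing is printed, the summary may word the line slightly differently in your version; searching for just "batch" should still find it.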
Would you have any idea why a new duplicate job is starting when one was rescheduled? I looked over the configs I know about, but could not find an option that starts a new job under any circumstances. At least the job will have two more chances to run if I can prevent the new duplicate job from starting when the lock is detected. I know that won't cure the database problem, but I have a feeling the rescheduled tries might complete successfully (thinking the lock will be gone by then) until the DB can be tuned.
It seems that JobId 140080 was scheduled to start, but this was not possible because JobId 139901 had not finished (with or without error) due to the MySQL error. So I suppose (I don't know what your schedule is) that the second job was started not because of the reschedule, but because it was in your schedule. Also, a job will only be rescheduled if it fails, and that does not seem to be the case for JobId 139901: it was still running with that MySQL error. Could you confirm this, please, along with your schedule definitions?
You're welcome! I hope I can give you some tips here. It seems you have a database tuning issue, because of "Lock wait timeout exceeded; try restarting transaction". I'm not sure about your database size, but based on your JobIds, it seems large. You can try http://mysqltuner.com/ if you're using MySQL. If you have a really large database, you can consider migrating to PostgreSQL if tuning MySQL does not solve this problem.
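Since the error in question is "Lock wait timeout exceeded", two InnoDB settings are worth a look alongside mysqltuner's suggestions. A my.cnf sketch; the values here are illustrative, not tuned recommendations for your workload:

```
[mysqld]
# Hold more of the File table's data and indexes in memory
# (commonly sized to a large fraction of RAM on a dedicated DB host)
innodb_buffer_pool_size = 4G

# Seconds a transaction waits for a row lock before raising
# "Lock wait timeout exceeded; try restarting transaction" (default 50)
innodb_lock_wait_timeout = 120
```

Raising the timeout only buys the batch insert more time to wait out the lock; the buffer pool size is usually the bigger lever on a 370-million-row File table.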
Have you built your Bacula with the "--enable-batch-insert" option? This is a good idea when dealing with a large number of files.
2015-09-10 23:47:22 bacula-dir JobId 139901: Fatal error: sql_create.c:870 Fill File table Query failed: INSERT INTO File (FileIndex, JobId, PathId, FilenameId, LStat, MD5, DeltaSeq) SELECT batch.FileIndex, batch.JobId, Path.PathId, Filename.FilenameId, batch.LStat, batch.MD5, batch.DeltaSeq FROM batch JOIN Path ON (batch.Path = Path.Path) JOIN Filename ON (batch.Name = Filename.Name): ERR=Lock wait timeout exceeded; try restarting transaction
This happens when a database lock times out during a backup and the job is rescheduled. For some reason, a new job seems to start up as soon as the error is detected. I posted about this issue earlier and someone mentioned it is happening because I configured Bacula to do that (or at least that's the impression I got from the post). Would anyone know which config setting would start up a new job for the client backup when an error like a lock timeout is detected? So far, I've only found settings for rescheduling, not restarting, such as the ones below:
Reschedule Interval = 1 hour
Reschedule Times = 3
Cancel Lower Level Duplicates = yes
Allow Duplicate Jobs = no
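For reference, those directives live in the Job resource, and the two reschedule lines only take effect when "Reschedule On Error" is set. A sketch with hypothetical resource names:

```
Job {
  Name = "backup-client-job"           # hypothetical job name
  Client = backup-client-fd            # hypothetical client
  Reschedule On Error = yes            # required for the two lines below to apply
  Reschedule Interval = 1 hour
  Reschedule Times = 3
  Allow Duplicate Jobs = no
  Cancel Lower Level Duplicates = yes
}
```

None of these directives start a new job on their own; they only control retries of, and duplicate handling for, jobs that the Director already queued.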
Obviously, the backup is getting canceled because of the last two settings above. But what setting is causing a new job to be created when a lock timeout error is detected that says it has rescheduled the job for 3600 seconds later?
I realize it appears I may need to do some database fixing/tuning. But my immediate question is why a new job is being created when one has already been rescheduled.