[Bacula-users] Storage Daemon crash

2010-07-13 15:39:53
Subject: [Bacula-users] Storage Daemon crash
From: LampZy <lampzy AT gmail DOT com>
To: Bacula Users List <bacula-users AT lists.sourceforge DOT net>
Date: Tue, 13 Jul 2010 12:36:29 -0700

Hello,

I'm using Bacula-5.0.2 on CentOS-5.5 x86_64.

Every month or so the storage daemon crashes, and I need help debugging
the problem.

It crashed again this morning. Yesterday I set up a new backup job.
Because this was its first run, the job was upgraded from Incremental to
Full and was queued until all the other Incremental jobs had finished.
Those jobs all finished successfully, and Bacula then found a suitable
tape for the Full backup and started the job. Authentication with the FD
on the client failed and the job failed, but the SD crashed as well.

In the system log I found only this line:
----
Jul 13 00:00:13 csebackup2 bacula-sd: Bacula interrupted by signal 11: 
Segmentation violation
----
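
To see how often this happens, the system log can be scanned for that message. The following is only a minimal Python sketch. It assumes the message sits on a single line in the log file, in the format shown above. The function name `find_daemon_signals` is made up for illustration and is not part of Bacula.

```python
import re

# Matches syslog lines such as:
#   Jul 13 00:00:13 csebackup2 bacula-sd: Bacula interrupted by signal 11: Segmentation violation
# The line layout is assumed from the single excerpt quoted above.
SIGNAL_RE = re.compile(
    r"^(?P<stamp>\w{3}\s+\d+\s+\d{2}:\d{2}:\d{2})\s+(?P<host>\S+)\s+"
    r"(?P<daemon>bacula-\w+): Bacula interrupted by signal "
    r"(?P<signal>\d+): (?P<name>.*)$"
)


def find_daemon_signals(lines):
    """Return (timestamp, daemon, signal number, signal name) per matching line."""
    hits = []
    for line in lines:
        m = SIGNAL_RE.match(line.strip())
        if m:
            hits.append(
                (m["stamp"], m["daemon"], int(m["signal"]), m["name"].strip())
            )
    return hits
```

It could be fed an open log file, for example `find_daemon_signals(open(path))`, where `path` is wherever syslog writes on the backup server (the location is an assumption and depends on the syslog configuration).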

The traceback file in the working directory has this:
----
ptrace: No such process.
/data/bacula_working/16717: No such file or directory.
$1 = 0
/opt/bacula-5.0.2/scripts/btraceback.gdb:2: Error in sourced command file:
No symbol "exename" in current context.
----

Here is the log entry for the failed backup job:
----
12-Jul 23:04 csebackup2.ucsd.edu-dir JobId 1010: No prior Full backup 
Job record found.
12-Jul 23:04 csebackup2.ucsd.edu-dir JobId 1010: No prior or suitable 
Full backup found in catalog. Doing FULL backup.
12-Jul 23:04 csebackup2.ucsd.edu-dir JobId 1010: Start Backup JobId 
1010, Job=lilliput.2010-07-12_23.04.01_23
12-Jul 23:55 csebackup2.ucsd.edu-sd JobId 1010: 3307 Issuing autochanger 
"unload slot 8, drive 0" command.
12-Jul 23:59 csebackup2.ucsd.edu-dir JobId 1010: Max configured use 
duration exceeded. Marking Volume "CSE009L4" as Used.
12-Jul 23:59 csebackup2.ucsd.edu-dir JobId 1010: Recycled volume "CSE011L4"
12-Jul 23:59 csebackup2.ucsd.edu-dir JobId 1010: Using Volume "CSE011L4" 
from 'Scratch' pool.
12-Jul 23:59 csebackup2.ucsd.edu-dir JobId 1010: Using Device "Drive-1"
12-Jul 23:59 csebackup2.ucsd.edu-dir JobId 1010: Fatal error: Unable to 
authenticate with File daemon at "lilliput.ucsd.edu:9102". Possible causes:
Passwords or names not the same or
Maximum Concurrent Jobs exceeded on the FD or
FD networking messed up (restart daemon).
Please see
http://www.bacula.org/en/rel-manual/Bacula_Freque_Asked_Questi.html#SECTION003760000000000000000
for help.
12-Jul 23:59 csebackup2.ucsd.edu-dir JobId 1010: Fatal error: Network 
error with FD during Backup: ERR=No data available
13-Jul 00:00 csebackup2.ucsd.edu-dir JobId 1010: Fatal error: No Job 
status returned from FD.
13-Jul 00:00 csebackup2.ucsd.edu-dir JobId 1010: Error: Bacula 
csebackup2.ucsd.edu-dir 5.0.2 (28Apr10): 13-Jul-2010 00:00:14
   Build OS:               x86_64-unknown-linux-gnu redhat
   JobId:                  1010
   Job:                    lilliput.2010-07-12_23.04.01_23
   Backup Level:           Full (upgraded from Incremental)
   Client:                 "lilliput.ucsd.edu-fd"
   FileSet:                "lilliput-files" 2010-07-12 23:04:01
   Pool:                   "FullTapes" (From Job FullPool override)
   Catalog:                "MainCatalog" (From Client resource)
   Storage:                "Tape" (From Job resource)
   Scheduled time:         12-Jul-2010 23:04:01
   Start time:             12-Jul-2010 23:04:04
   End time:               13-Jul-2010 00:00:14
   Elapsed time:           56 mins 10 secs
   Priority:               10
   FD Files Written:       0
   SD Files Written:       0
   FD Bytes Written:       0 (0 B)
   SD Bytes Written:       0 (0 B)
   Rate:                   0.0 KB/s
   Software Compression:   None
   VSS:                    no
   Encryption:             no
   Accurate:               no
   Volume name(s):
   Volume Session Id:      20
   Volume Session Time:    1278979090
   Last Volume Bytes:      1 (1 B)
   Non-fatal FD errors:    0
   SD Errors:              0
   FD termination status:  Error
   SD termination status:  Error
   Termination:            *** Backup Error ***
----

The next Full backup job that started after that only reported that it
could not connect to the Storage Daemon:
----
13-Jul 06:29 csebackup2.ucsd.edu-dir JobId 1012: Warning: bsock.c:129 
Could not connect to Storage daemon on csebackup2.ucsd.edu:9103. 
ERR=Connection refused
Retrying ...
13-Jul 06:33 csebackup2.ucsd.edu-dir JobId 1012: Fatal error: 
bsock.c:135 Unable to connect to Storage daemon on 
csebackup2.ucsd.edu:9103. ERR=Connection refused
----
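
"Connection refused" here just means nothing was listening on the SD port any more. A quick probe like the one below could run between jobs to notice a dead SD before the next backup fails. This is a minimal sketch and not a Bacula tool: the function name `sd_port_open` is hypothetical, and the only values taken from the log above are the host and port (csebackup2.ucsd.edu:9103).

```python
import socket


def sd_port_open(host, port=9103, timeout=5.0):
    """Return True if a TCP connection to host:port succeeds, False otherwise.

    This only checks that something accepts connections on the port;
    it does not speak the Bacula protocol or authenticate.
    """
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers connection refused, timeouts and name-resolution failures.
        return False
```

For example, `sd_port_open("csebackup2.ucsd.edu")` would be expected to return False while the SD is down and True once it has been restarted.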

Any idea where to look for the problem?

Thanks
Peter
