[Bacula-users] Fatal error: Authorization key rejected by Storage daemon

Hi,

So having quite a frustrating issue with bacula on one of our CentOS hosts. 
Without pattern, I will receive this error on some days:

12-Oct 09:13 dir JobId 24468: Start Backup JobId 24468, Job=<job>
12-Oct 09:13 dir JobId 24468: Using Device "chg0_drive0"
12-Oct 09:46 fd JobId 24468: Fatal error: Authorization key rejected by Storage 
daemon.
Please see 
http://www.bacula.org/en/rel-manual/Bacula_Freque_Asked_Questi.html#SECTION003760000000000000000
 for help.
12-Oct 09:46 fd JobId 24468: Fatal error: Failed to authenticate Storage daemon.
12-Oct 09:46 dir JobId 24468: Fatal error: Bad response to Storage command: 
wanted 2000 OK storage , got 2902 Bad storage

If the backup actually DOES work, it takes forever for the job to complete, 
even for very little data (connected to 1gb switch).

Brief Network Setup:

We utilize a lot of VLANs on our network so our bacula server is accessible by 
a set IP on each VLAN (ex. VLAN id= 25, bacula server listens on x.x.25.100 on 
that VLAN). All our servers run on XenServer hosts and we've never had an issue 
with bacula running until this particular host (1.5+ years).

Good Host:

Below is snippets of debug output from a working host with 0 issues-

fd: job.c:263-0 <dird: Hello Director dir calling
fd: job.c:279-0 Executing Hello command.
fd: cram-md5.c:73-0 send: auth cram-md5 <1146528753.1350051814@fd> ssl=0
fd: cram-md5.c:133-0 cram-get received: auth cram-md5 
<1771224273.1350052057@dir> ssl=0
fd: cram-md5.c:152-0 sending resp to challenge: u8/6LAM+5A+Jc4/nzR/NiC
fd: job.c:263-0 <dird: JobId=24472 Job=<job name> SDid=1197 SDtime=1347377831 
Authorization=<authorization key>

>>From there, it goes into-

fd: job.c:263-0 <dird: storage address=<FQDN of bacula director server> 
port=9103 ssl=0
fd: job.c:279-0 Executing storage  command.
fd: job.c:1474-0 StorageCmd: storage address=<FQDN of bacula server>port=9103 
ssl=0
fd: bsock.c:220-0 Current host[ipv4:10.10.x.100:9103] All 
host[ipv4:10.10.x.100:9103]
fd: bsock.c:154-0 who=Storage daemon host=<FQDN of bacula director server> 
port=9103
fd: cram-md5.c:133-0 cram-get received: auth cram-md5 
<1575779100.1350052057@sd> ssl=0
fd: cram-md5.c:152-0 sending resp to challenge: B9/fK0xcX6E+K4+oZl/KPB
fd: cram-md5.c:80-0 send: auth cram-md5 <786312287.1350051814@ovpn1-fd> ssl=0
fd: cram-md5.c:99-0 Authenticate OK A9/4P+/TBjg+N8dRc4/aPD
fd: job.c:263-0 <dird: backup

And the job completes in adequate time.

Problem Host:

The output on the problem host is completely different. First, for connecting, 
it looks like this:

clcd-fd: job.c:263-0 <dird: storage address=<FQDN of bacula server> port=9103 
ssl=0
clcd-fd: job.c:279-0 Executing storage  command.
clcd-fd: job.c:1474-0 StorageCmd: storage address=<FQDN of bacula server> 
port=9103 ssl=0
clcd-fd: bsock.c:220-0 Current host[ipv4:10.10.5x.100:9103] All 
host[ipv4:10.10.5x.100:9103] host[ipv4:10.10.5x.100:65535] 
host[ipv4:10.10.7x.100:65535] host[ipv4:10.10.x.x:65535] 
host[ipv4:10.10.x.100:65535] host[ipv4:10.10.x.10:65535] 
host[ipv4:10.10.1x.100:65535] host[ipv4:10.10.2x.100:65535] 
host[ipv4:10.10.2x.100:65535] host[ipv4:10.10.5x.100:65535] 
host[ipv4:10.10.5x.100:65535] host[ipv4:10.10.5x.100:65535] 
host[ipv4:10.10.5x.100:65535]
clcd-fd: bsock.c:220-0 Current host[ipv4:10.10.5x.100:9103] All 
host[ipv4:10.10.5x.100:9103] host[ipv4:10.10.5x.100:9103] 
host[ipv4:10.10.7x.100:65535] host[ipv4:10.10.x.x:65535] 
host[ipv4:10.10.x.100:65535] host[ipv4:10.10.x.10065535] 
host[ipv4:10.10.1x.100:65535] host[ipv4:10.10.2x.100:65535] 
host[ipv4:10.10.2x.100:65535] host[ipv4:10.10.5x.100:65535] 
host[ipv4:10.10.5x.100:65535] host[ipv4:10.10.5x.100:65535] 
host[ipv4:10.10.5x.100:65535]
clcd-fd: bsock.c:220-0 Current host[ipv4:10.10.7x.100:9103] All 
host[ipv4:10.10.5x.100:9103] host[ipv4:10.10.5x.100:9103] 
host[ipv4:10.10.7x.100:9103] host[ipv4:10.10.x.x:65535] 
host[ipv4:10.10.x.100:65535] host[ipv4:10.10.x.100:5535] 
host[ipv4:10.10.1x.100:65535] host[ipv4:10.10.2x.100:65535] 
host[ipv4:10.10.2x.100:65535] host[ipv4:10.10.5x.100:65535] 
host[ipv4:10.10.5x.100:65535] host[ipv4:10.10.5x.100:65535] 
host[ipv4:10.10.5x.100:65535]
clcd-fd: bsock.c:220-0 Current host[ipv4:10.10.x.x:9103] All 
host[ipv4:10.10.5x.100:9103] host[ipv4:10.10.5x.100:9103] 
host[ipv4:10.10.7x.100:9103] host[ipv4:10.10.x.x:9103] 
host[ipv4:10.10.x.100:65535] host[ipv4:10.10.x.100:6555] 
host[ipv4:10.10.1x.100:65535] host[ipv4:10.10.2x.100:65535] 
host[ipv4:10.10.2x.100:65535] host[ipv4:10.10.5x.100:65535] 
host[ipv4:10.10.5x.100:65535] host[ipv4:10.10.5x.100:65535] 
host[ipv4:10.10.5x.100:65535]

I changed all the IP's for security purposes, but hopefully you get the idea.

Once it connects to the right IP, it passes authorization-

fd: job.c:263-0 <dird: storage address=<FQDN of bacula server> port=9103 ssl=0
fd: job.c:279-0 Executing storage  command.
fd: job.c:1474-0 StorageCmd: storage address=<FQDN of bacula server> port=9103 
ssl=0
fd: bsock.c:220-0 Current host[ipv4:10.10.1x.100:9103] All 
host[ipv4:10.10.1x.100:9103]
fd: bsock.c:154-0 who=Storage daemon host=<FQDN of bacula server> port=9103
fd: cram-md5.c:133-0 cram-get received: auth cram-md5 
<1356737337.1350055466@chef-sd> ssl=0
fd: cram-md5.c:152-0 sending resp to challenge: I//Wz5ZIkS9l/9+sE8Q8cC
fd: cram-md5.c:80-0 send: auth cram-md5 <57904060.1350055466@clcd-fd> ssl=0
fd: cram-md5.c:99-0 Authenticate OK gkUm/T1qS/cEuT/4+4RasB
fd: job.c:263-0 <dird: backup


Once it finally gets through that stage of the process, I get a ton of these 
messages (there's a lot more than shown below)-

fd: heartbeat.c:96-0 wait_intr=0 stop=0
fd: heartbeat.c:96-0 wait_intr=0 stop=0
fd: heartbeat.c:96-0 wait_intr=0 stop=0
fd: heartbeat.c:96-0 wait_intr=0 stop=0
fd: heartbeat.c:96-0 wait_intr=0 stop=0
fd: heartbeat.c:96-0 wait_intr=0 stop=0
fd: heartbeat.c:96-0 wait_intr=0 stop=0
fd: heartbeat.c:96-0 wait_intr=0 stop=0
fd: heartbeat.c:96-0 wait_intr=0 stop=0
fd: heartbeat.c:96-0 wait_intr=0 stop=0
fd: heartbeat.c:96-0 wait_intr=0 stop=0
fd: heartbeat.c:96-0 wait_intr=0 stop=0

Then eventually -

fd: heartbeat.c:143-0 Send kill to heartbeat id
fd: backup.c:211-0 end blast_data ok=1
fd: job.c:282-0 Quit command loop. Canceled=0
fd: job.c:388-0 Calling term_find_files
fd: job.c:391-0 Done with term_find_files
fd: jcr.c:181-0 write_last_jobs seek to 192
fd: job.c:393-0 Done with free_jcr

So, after comparing the two, I'm not sure why the problem host tries to connect 
to every IP on each VLAN that the bacula server runs on, instead of just the 
appropriate one like the working hosts.

We don't have heartbeat configured on any clients. We use all the default ports 
for bacula.

I wasn't able to get debug output for a failure (shocking), but with what I see 
so far, something is really odd with the networking on the problem host.

We have numerous hosts on this VLAN that have no problem connecting to the 
bacula server. All other VLANs work fine as well.

Anyone have any ideas? TIA

+----------------------------------------------------------------------
|This was sent by rgrobbel AT ajboggs DOT com via Backup Central.
|Forward SPAM to abuse AT backupcentral DOT com.
+----------------------------------------------------------------------



------------------------------------------------------------------------------
Don't let slow site performance ruin your business. Deploy New Relic APM
Deploy New Relic app performance management and know exactly
what is happening inside your Ruby, Python, PHP, Java, and .NET app
Try New Relic at no cost today and get our sweet Data Nerd shirt too!
http://p.sf.net/sfu/newrelic-dev2dev
_______________________________________________
Bacula-users mailing list
Bacula-users AT lists.sourceforge DOT net
https://lists.sourceforge.net/lists/listinfo/bacula-users