Bacula-users

Re: [Bacula-users] Hostname not found

2017-02-12 06:10:54
Subject: Re: [Bacula-users] Hostname not found
From: Kern Sibbald <kern AT sibbald DOT com>
To: Dan Langille <dan AT langille DOT org>
Date: Sun, 12 Feb 2017 12:09:21 +0100
Hello Dan,

Well, then we are back to my theory that FreeBSD wants a particular length in some of the system calls.
In general, Bacula uses lengths that work for both IPv4 and IPv6, which means that if you are using IPv4, the packet lengths in some calls are longer than the really need to be.  This may or may not be true in your case, but is a path to look at.

Best regards,
Kern

On 02/11/2017 11:08 PM, Dan Langille wrote:
On Feb 11, 2017, at 2:11 AM, Kern Sibbald <kern AT sibbald DOT com> wrote:

Hello Dan,

Well, I am happy to know that it is not Bacula related :-)

Well, it's not solved yet.  This just in.

11-Feb 21:11 bacula-sd-01-sd JobId 256374: Elapsed time=03:20:24, Transfer rate=29.63 M Bytes/second
11-Feb 21:11 bacula-sd-01-sd JobId 256374: Error: bsock.c:375 Socket has errors=1 on call to client:Hostname not found:9103
11-Feb 21:11 bacula-sd-01-sd JobId 256374: Fatal error: fd_cmds.c:142 Read data not accepted
11-Feb 21:11 bacula-sd-01-sd JobId 256374: Error: bsock.c:375 Socket has errors=1 on call to client:Hostname not found:9103


9103?  That must be an SD for which someone cannot figure out the hostname based on the IP address.

Would it not be easier to list the IP address when the hostname cannot be found? That would give a starting point.

Let's go to the SD in question: bacula-sd-01-sd which is at bacula-sd-01.int.unixathome.org (the Read Storage listed below). From there,
I try to resolve the write SD (tape01 at tape01.int.unixathome.org), which is what this message must pertain to.

$ dig +short tape01.int.unixathome.org
10.55.0.103

$ dig +short -x 10.55.0.103
tape01.int.unixathome.org.

That looks OK.  Perhaps I misunderstand which hostname it's trying to provide.

Is there something else I should verify with respect to DNS in this instance?

I looked at the code: The message comes from line 374 of src/lib/bsock.c

The 'Hostname not found' comes from line 646 of src/lib/address_conf.c

The code invokes getnameinfo(3) and I looked up that info .http://www.unix.com/man-page/FreeBSD/3/getnameinfo/

I'm happy to patch with debugging code for this, but I'm out of time this weekend.



The full job output is here: 

11-Feb 17:50 bacula-dir JobId 256374: Warning: FileSet MD5 digest not found.
11-Feb 17:50 bacula-dir JobId 256374: Copying using JobId=255999 Job=knew_jail_snapshots.2017-02-05_03.04.02_47
11-Feb 17:50 bacula-dir JobId 256374: Bootstrap records written to /usr/local/bacula/working/bacula-dir.restore.10.bsr
11-Feb 17:50 bacula-dir JobId 256374: Start Copying JobId 256374, Job=CopyToTape-Full-LTO4.2017-02-11_17.50.17_22
11-Feb 17:50 bacula-dir JobId 256374: Using Device "vDrive-6" to read.
11-Feb 17:50 bacula-sd-01-sd JobId 256374: Ready to read from volume "FullAuto-1377" on file device "vDrive-6" (/usr/local/bacula/volumes).

lots of reading skipped.....

11-Feb 21:00 bacula-sd-01-sd JobId 256374: Ready to read from volume "FullAuto-1514" on file device "vDrive-6" (/usr/local/bacula/volumes).
11-Feb 21:00 bacula-sd-01-sd JobId 256374: Forward spacing Volume "FullAuto-1514" to file:block 0:214.
11-Feb 21:11 bacula-sd-01-sd JobId 256374: Error: bsock.c:453 Wrote 65540 bytes to client:Hostname not found:9103, but only 32768 accepted.
11-Feb 21:11 bacula-sd-01-sd JobId 256374: Fatal error: read.c:284 Error sending to File daemon. ERR=Broken pipe
11-Feb 21:11 bacula-sd-01-sd JobId 256374: Elapsed time=03:20:24, Transfer rate=29.63 M Bytes/second
11-Feb 21:11 bacula-sd-01-sd JobId 256374: Error: bsock.c:375 Socket has errors=1 on call to client:Hostname not found:9103
11-Feb 21:11 bacula-sd-01-sd JobId 256374: Fatal error: fd_cmds.c:142 Read data not accepted
11-Feb 21:11 bacula-sd-01-sd JobId 256374: Error: bsock.c:375 Socket has errors=1 on call to client:Hostname not found:9103
11-Feb 21:11 bacula-dir JobId 256374: Error: Bacula bacula-dir 7.4.4 (28Sep16):
 Build OS:               amd64-portbld-freebsd11.0 freebsd 11.0-RELEASE-p6
 Prev Backup JobId:      255999
 Prev Backup Job:        knew_jail_snapshots.2017-02-05_03.04.02_47
 New Backup JobId:       256375
 Current JobId:          256374
 Current Job:            CopyToTape-Full-LTO4.2017-02-11_17.50.17_22
 Backup Level:           Full
 Client:                 crey-fd
 FileSet:                "EmptyCopyToTape" 2011-02-20 20:53:31
 Read Pool:              "FullFile" (From Command input)
 Read Storage:           "bacula-sd-01-file" (From Pool resource)
 Write Pool:             "FullsLTO4" (From Command input)
 Write Storage:          "tape01" (From Command input)
 Catalog:                "MyCatalog" (From Client resource)
 Start time:             11-Feb-2017 17:50:50
 End time:               11-Feb-2017 21:11:14
 Elapsed time:           3 hours 20 mins 24 secs
 Priority:               10
 SD Files Written:       404,305
 SD Bytes Written:       356,347,133,351 (356.3 GB)
 Rate:                   29636.3 KB/s
 Volume name(s):         000013L4
 Volume Session Id:      143
 Volume Session Time:    1486562541
 Last Volume Bytes:      1,187,871,455,232 (1.187 TB)
 SD Errors:              3
 SD termination status:  Error
 Termination:            *** Copying Error ***




-- 
Dan Langille - BSDCan / PGCon





Kern

On 02/10/2017 09:46 PM, Dan Langille wrote:
On Feb 10, 2017, at 12:55 PM, Dan Langille <dan AT langille DOT org> wrote:

On Feb 10, 2017, at 12:20 PM, Kern Sibbald <kern AT sibbald DOT com> wrote:

Hello,

I suspect that this is a problem with the FreeBSD networking implementation.  If I remember right on FreeBSD, when doing name lookups, if the packet size is not *exactly* what FreeBSD wants, it fails the call.  On Linux and other machines (Solaris, Mac), as long as the packet size is equal or greater than what is needed the OS call succeeds.  If I am not mistaken, Bacula allocates space for the larger of IPv4 and IPv6 (which is always IPv6), and so if you are using an IPv4 network, Bacula may send OS calls with a packet size larger than actually required.

I just spoke with a FreeBSD developer.  They are unaware of anything special in the FreeBSD ports tree for patching FreeBSD when it comes to doing name lookups.  Specifically, gethostby*(), getipnodeby*() just work...

If you can reproduce/encounter a situation which fails, we will look at it and fix it.  In short, I do not think this is an issue with the FreeBSD networking implementation.

I suspect it's a local DNS misconfiguration on one of my hosts.  Which ones, I don't know yet.  Your first post mentioned the SDs, so I checked them. They seem OK now.  I will verify them again if I see them again.

For testing purposes, I will use dig +short and verify that all of these FQDN resolve on the FD, the SD, the director, and from from a fourth host:

The FQDN of the FD (even though it is not used in a Copy job)
The FQDN of the read SD
The FQDN of the write SD
The FQDN of the Director

I will also check that the PTR record for the A record also resolves back to the FQDN.

If this is the case, I would consider it a FreeBSD bug.  For me to fix it is a bit complicated, because I need to know exactly what call is failing and the values that FreeBSD wants.  By the way, it is possible this is already fixed in the Enterprise version where FreeBSD is supported too.  If that is the case, in my next round of backporting to start next week, it will get fixed.

If this was the case, I'd expect apps on FreeBSD to be failing everywhere.  They aren't.  I've never patched anything for DNS issues either.

I suspect it's more likely to be an issue on one of the Bacula nodes in question (SD or FD) where there is a local DNS issue.

For documentation: I found that one of my three name servers was not correctly resolving PTR records.  That is:

 $ dig +short -x 10.52.0.1 @10.55.0.1

Whereas the other two nameservers:

$ dig +short -x 10.52.0.1 @10.55.0.13
bast.int.unixathome.org.

$ dig +short -x 10.52.0.1 @10.55.0.73
bast.int.unixathome.org.

That issue since been fixed.  It took a while to track down.  That nameserver is running pfSense, using their GUI.  The issue is right there on the page:

Note: IN-ADDR.ARPA will be automaticaly included in config files when reverse zone option is checked.

The checkbox is four controls farther down the page.

The expected zone 0.55.10.in-addr.arpa was actually zone 0.55.10.in-addr.arpa.in-addr.arpa

I'll keep an eye on the Bacula logs to see if this shows up again.

Thank you.

-- 
Dan Langille - BSDCan / PGCon





------------------------------------------------------------------------------
Check out the vibrant tech community on one of the world's most
engaging tech sites, SlashDot.org! http://sdm.link/slashdot
_______________________________________________
Bacula-users mailing list
Bacula-users AT lists.sourceforge DOT net
https://lists.sourceforge.net/lists/listinfo/bacula-users