RE: Cluster backup
2006-05-31 07:49:35
> From: Nicola Mauri
> Sent: 31 May 2006 11:38
>
> We are constantly encountering strange errors with DLEs that
> refer to cluster virtual addresses.
>
> virtualA /apps/a lev 0 FAILED [data timeout]
> virtualB /apps/b RESULTS MISSING
>
> The disklist contains:
>
> node1 /etc full # Physical node 1
> node2 /etc full # Physical node 2
> virtualA /apps/a full # virtual address A
> virtualB /apps/b full # virtual address B
>
> Error messages are not predictable and may change every day.
> They completely disappear if - in the disklist file - we
> replace the virtual address with the node's physical address
> which is running the service (and is currently mounting the
> shared partition we need to backup). Obviously, services and
> partitions might be relocated to another cluster node, so
> this approach won't work.
>
> I guess this happens because the Amanda server treats "node1",
> "virtualA" and "virtualB" as three distinct hosts, whereas
> in some situations they may refer to a single physical host,
> with a single Amanda client instance responding.
>
> Can someone suggest how to solve this issue and how to
> configure Amanda to back up a cluster environment?
We had similar behaviour with machines using virtual IP addresses which we
eventually tracked down to inconsistent netmasks.
Taking cyrus1 as an example ...
    [root@cyrus1 amanda]# ifconfig -a
    eth0      Link encap:Ethernet  HWaddr 00:11:85:E7:40:75
              inet addr:128.240.233.72  Bcast:128.240.255.255  Mask:255.255.0.0
    eth0:1    Link encap:Ethernet  HWaddr 00:11:85:E7:40:75
              inet addr:128.240.233.238  Bcast:128.240.233.255  Mask:255.255.255.0
(Note the difference in netmasks: eth0 is configured via DHCP, eth0:1 is
configured statically.)
When the Amanda server (ucsbs2, also on 128.240.233.x) sends a request to
cyrus1, the reply comes back to ucsbs2 from cyrus (which is the address
configured on eth0:1). I guess this is because the system sees eth0:1 as being
more specific. Of course the Amanda server just drops the reply, as it knows
that it didn't ask cyrus for anything.
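To make the "more specific" guess concrete, here is a small Python sketch. It is only a toy model of that guess, not how the kernel actually chooses a source address. The addresses and masks come from the ifconfig output above; the function name guess_reply_source and the server address 128.240.233.10 are made up for illustration (the real address of ucsbs2 is only known to be on 128.240.233.x).

```python
import ipaddress

# Addresses and netmasks as shown in the ifconfig output for cyrus1.
INTERFACES = {
    "eth0": ipaddress.ip_interface("128.240.233.72/255.255.0.0"),
    "eth0:1": ipaddress.ip_interface("128.240.233.238/255.255.255.0"),
}


def guess_reply_source(dest, interfaces):
    """Toy model: choose the interface whose network contains dest and has
    the longest prefix. On a tie, the first-listed interface wins."""
    dest = ipaddress.ip_address(dest)
    candidates = [
        (iface.network.prefixlen, name)
        for name, iface in interfaces.items()
        if dest in iface.network
    ]
    if not candidates:
        return None
    # max() returns the first maximal item, so listing order breaks ties.
    return max(candidates, key=lambda c: c[0])[1]


# Hypothetical server address; only the 128.240.233.x network is known.
SERVER = "128.240.233.10"
print(guess_reply_source(SERVER, INTERFACES))
```

Under this model the /24 on eth0:1 beats the /16 on eth0, so the reply would leave from 128.240.233.238 (cyrus) rather than from the address the server contacted.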
Making the netmasks consistent resulted in replies coming back from the main
interface.
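If you want to check a client for this kind of mismatch, something along the following lines might do. It is a minimal sketch that assumes you have already collected each interface's address and netmask by hand (it does not parse ifconfig output); the function name mismatched_netmasks is invented here, and the sample data is taken from the cyrus1 output above.

```python
import ipaddress
from collections import defaultdict


def mismatched_netmasks(addresses):
    """Return {device: [netmasks]} for devices whose aliases disagree.

    An alias such as 'eth0:1' is treated as belonging to device 'eth0'.
    """
    masks = defaultdict(set)
    for name, addr in addresses.items():
        device = name.split(":", 1)[0]
        masks[device].add(str(ipaddress.ip_interface(addr).netmask))
    return {dev: sorted(m) for dev, m in masks.items() if len(m) > 1}


# Addresses and netmasks from the ifconfig output for cyrus1.
cyrus1 = {
    "eth0": "128.240.233.72/255.255.0.0",
    "eth0:1": "128.240.233.238/255.255.255.0",
}
print(mismatched_netmasks(cyrus1))
```

An empty result means every alias on a device uses the same netmask as the device itself.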
Of course this doesn't help you as you want replies from the primary machine
address for some DLEs and from the floating address for others.
One suggestion we had, before we realised the issue was the netmask, was to
use chbind
http://www.solucorp.qc.ca/miscprj/s_context.hc?s1=2&s2=6&s3=3&s4=0&full=0&prjstate=1&nodoc=0
or the interface (aka bind) option in xinetd to run multiple instances of the
Amanda client, each responding on a different address (whether that will
actually cause the responses to come from the right IP address, I don't know).
Paul