RE: Cluster backup

> From: Nicola Mauri
> Sent: 31 May 2006 11:38
> 
> We are constantly encountering strange errors with DLEs that 
> refer to cluster virtual addresses. 
> 
>   virtualA /apps/a lev 0 FAILED [data timeout] 
>   virtualB /apps/b  RESULTS MISSING 
>   
> The disklist contains: 
> 
>   node1     /etc       full    # Physical node 1 
>   node2     /etc       full    # Physical node 2 
>   virtualA  /apps/a    full    # virtual address A 
>   virtualB  /apps/b    full    # virtual address B 
> 
> Error messages are not predictable and may change every day. 
> They completely disappear if - in the disklist file - we 
> replace the virtual address with the node's physical address 
> which is running the service (and is currently mounting the 
> shared partition we need to back up). Obviously, services and 
> partitions might be relocated to another cluster node, so 
> this approach won't work. 
> 
> I guess this happens because amanda server treats "node1", 
> "virtualA" and "virtualB" like three distinct hosts, whereas 
> in some situations they may refer to a single physical host, 
> with a single amanda client instance responding. 
> 
> Can someone suggest how to solve this issue and how to 
> configure Amanda to back up a cluster environment? 

We had similar behaviour with machines using virtual IP addresses which we 
eventually tracked down to inconsistent netmasks.

Taking cyrus1 as an example ...

[root@cyrus1 amanda]# ifconfig -a
eth0      Link encap:Ethernet  HWaddr 00:11:85:E7:40:75  
          inet addr:128.240.233.72  Bcast:128.240.255.255  Mask:255.255.0.0

eth0:1    Link encap:Ethernet  HWaddr 00:11:85:E7:40:75  
          inet addr:128.240.233.238  Bcast:128.240.233.255  Mask:255.255.255.0

(note difference in netmasks - eth0 is configured via DHCP, eth0:1 is 
configured statically).

When the Amanda server (ucsbs2, also on 128.240.233.x) sends a request to
cyrus1, the reply comes back to ucsbs2 from cyrus (which is the address
configured on eth0:1).  I guess this is because the system sees eth0:1 as being
more specific.  Of course the Amanda server just drops the reply, as it knows
that it didn't ask cyrus for anything.
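
To illustrate the "more specific" guess, here is a rough Python sketch that 
picks the interface whose subnet matches the destination with the longest 
prefix.  It is only a simplified model, not the kernel's actual source address 
selection, and the server address 128.240.233.10 is made up (all we know is 
that ucsbs2 is on 128.240.233.x).

```python
import ipaddress

# Addresses and netmasks as shown in the ifconfig output above.
INTERFACES = {
    "eth0": ipaddress.ip_interface("128.240.233.72/255.255.0.0"),
    "eth0:1": ipaddress.ip_interface("128.240.233.238/255.255.255.0"),
}


def most_specific_interface(dest, interfaces):
    """Return (name, address) of the interface whose subnet contains dest
    with the longest prefix. Simplified model, not the real kernel logic."""
    dest_ip = ipaddress.ip_address(dest)
    candidates = [
        (name, iface)
        for name, iface in interfaces.items()
        if dest_ip in iface.network
    ]
    name, iface = max(candidates, key=lambda c: c[1].network.prefixlen)
    return name, str(iface.ip)


# Hypothetical address for ucsbs2; the real one is not given.
SERVER = "128.240.233.10"
print(most_specific_interface(SERVER, INTERFACES))
# -> ('eth0:1', '128.240.233.238')
```

With both interfaces on the same netmask neither would be more specific, which 
fits with the replies coming from the main interface once the netmasks 
were made consistent.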

Making the netmasks consistent resulted in replies coming back from the main 
interface.

Of course this doesn't help you as you want replies from the primary machine 
address for some DLEs and from the floating address for others.

One suggestion we had, before we realised the issue was the netmask, was to 
use chbind

 
http://www.solucorp.qc.ca/miscprj/s_context.hc?s1=2&s2=6&s3=3&s4=0&full=0&prjstate=1&nodoc=0

or the interface (aka bind) option in xinetd to run multiple instances of the 
amanda client each responding on a different address (whether that will 
actually cause the responses to come from the right IP address I don't know).
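
The general idea behind binding can be sketched in Python: a UDP socket bound 
to one specific address sends its replies from that address.  This is only an 
illustration of socket behaviour using the loopback address; the function name 
is made up, and it does not settle whether amandad started from xinetd with the 
bind option actually behaves this way.

```python
import socket


def reply_source(bind_addr):
    """Bind a UDP responder to bind_addr, send it a request, and return
    the source IP the client sees on the reply."""
    responder = socket.socket(socket.AF_INET, socket.SOCK_DGRAM)
    client = socket.socket(socket.AF_INET, socket.SOCK_DGRAM)
    try:
        responder.settimeout(2.0)
        client.settimeout(2.0)
        # Port 0 lets the OS pick a free port on the chosen address.
        responder.bind((bind_addr, 0))
        port = responder.getsockname()[1]

        client.sendto(b"request", (bind_addr, port))
        _, requester = responder.recvfrom(1024)

        # The reply leaves from the address the responder is bound to.
        responder.sendto(b"reply", requester)
        _, (src_ip, _src_port) = client.recvfrom(1024)
        return src_ip
    finally:
        responder.close()
        client.close()


print(reply_source("127.0.0.1"))
```

On a machine with several addresses configured, calling reply_source() with 
each address in turn would show whether replies leave from the address that 
was asked.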

Paul