Cluster Notification from xxxxxxx (REBOOT (CLUSTER TAKEOVER)) WARNING

karinegh

Hi,
I received this warning message. I would appreciate it if someone could help me understand it and find out what the problem is.



Many thanks for your help.
 

Attachments

  • error.txt (12.3 KB)
Hi,

can we have more info? From the log I can see you have a NetApp V-Series (the IBM rebranded version) in a cluster configuration.
There are two nodes, GBFLR1002 and GBFLR1001. It seems the second one (GBFLR1001) went down and its operation was taken over by the first, GBFLR1002 (so it now represents itself as GBFLR1002/GBFLR1001).
I do not see the reason for that in the log. There is also a wrong configuration of the autosupport system (which sends errors and logs to IBM and/or local admins) - so you need to check the file mentioned in the message (/etc/log/autosupport/200804062004.0 - on the V-Series) and repair autosupport
(see "options autosupport" on the V-Series console).
So check the consoles of both cluster members - you should be able to find out what's wrong there.
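For reference, listing and fixing the autosupport settings on a 7-mode console looks roughly like this (the mail host and addresses below are placeholders, and the output shown is abbreviated - your filer will list many more options):

filer> options autosupport
autosupport.enable           on
autosupport.mailhost         mailhost.example.com
autosupport.to               admin@example.com
...
filer> options autosupport.enable on
filer> options autosupport.to storage-team@example.com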

Without more info I can't help you any further.

Harry
 
I've seen two problems:
rg0 - volume or volume group is offline
WINS is not resolving.

It appears you've had a disk failure and the cluster wants to fail over but cannot find its cluster mate.

Good Luck
 
Hi,

sorry Steven - disk scrubbing is a normal process - it looks for disk problems, and here it found no errors - see the messages:
scrubbing for /aggr0/plex0/rg0 started at 01:00, suspended at 07:00 (with no errors)
scrubbing for /aggr1/plex0/rg1 started at 01:00 (resumed from a previous suspended run), ended at 04:23 with no errors

So it seems to me that scrubbing is set to run daily from 01:00 to 07:00 - in any case, no errors there, so that is not the problem.

WINS? Yes, there seems to be a misconfiguration there, but it really should have no effect on takeover ... a takeover can occur in case of a network failure, but not just because the WINS server is unreachable.
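If you want to verify the scrub schedule yourself, it can be checked from the 7-mode console (command and option names from Data ONTAP 7-mode; the duration option is in minutes, so a 01:00-07:00 window would show as 360):

filer> aggr scrub status
filer> options raid.scrub.schedule
filer> options raid.scrub.duration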

and takeover DID occur (as you can see from the GBFLR1002/GBFLR1001 name on the last three lines)

Harry
 
Thanks Harry - well, that's two items out of the way for this troubleshooting exercise. What about the application level, karinegh - did anything happen at that level that you know of?
Or is this message just an informational type of error log?
 
Thanks a lot for the reply, and sorry for the delay.
Harry_Redl, you'll find attached all the logs - I hope they will help (and then help me to understand the problem :)).
I noticed that there is a problem with WINS, and as for autosupport, unfortunately we don't have a NetApp support account.

thanks for your help.
 

Attachments

  • Cluster Notification from GBFLR1002 (REBOOT (CLUSTER TAKEOVER)) WARNING.zip (28.4 KB)
Hi,

went through the log and have to correct myself:
a) it is not a V-Series, it looks like a normal FAS system (N series)
b) the failed node is GBFLR1002, not GBFLR1001 (that one is the surviving one)

As it is an IBM-branded device, autosupport should be configured to send data to IBM, not to NetApp:
autosupport.support.transport https
autosupport.support.url eccgw01.boulder.ibm.com/support/electronic/nas

Still cannot see the reason for the failure - the log just says the node failed - no power, FC, shelf, or disk issue ...

The thing is that everything can be OK - GBFLR1002 can be working (ready to work), but it cannot resume its services because you have
cf.giveback.auto.enable off
so it does not automatically start the giveback process after a reboot.

What I would do is connect to the RLM (if you have one) or the serial console of
GBFLR1002 and see what it says. The best case is if you see "Waiting for giveback" - in that case, log in to the surviving node and issue "cf giveback".
If it does not wait for giveback, then I need to know the message.
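The check-and-giveback sequence on the surviving node would look roughly like this (console sketch - the exact wording of the status output varies by Data ONTAP release):

GBFLR1001> cf status
GBFLR1001 has taken over GBFLR1002.
GBFLR1002 is ready for giveback.
GBFLR1001> cf giveback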

Harry

P.S. How do you manage the filer? I see you do not have SSH access enabled. Are you using FilerView?
 
hi,
I want to let you know that each time the problem happens I have to log into the surviving node and issue cf giveback, which means the dead node was waiting for giveback. So if I set cf.giveback.auto.enable to on, will this fix the problem?
Could you explain to me why the node is rebooting? Is it a maintenance (test) process?

To monitor the filer I'm using FilerView.

Many thanks.
 
Hi,

setting cf.giveback.auto.enable to "on" can solve the problem of the rebooted node staying down - but you still need to find the cause of the rebooting. In that log I did not see anything that could explain it. There are more logs on the appliance you can check - I would try looking in Filer -> Audit Logs (using FilerView)
or looking for /etc/log/auditlog (using CIFS or NFS).
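If you do decide to enable automatic giveback, it is a single option on the console - you can confirm the current value first by running the option name without an argument (console sketch, 7-mode syntax):

filer> options cf.giveback.auto.enable
cf.giveback.auto.enable      off
filer> options cf.giveback.auto.enable on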

Hope it helps

Harry
 
Hi,
I am trying to set the autosupport.support.url option, but I couldn't - it says that it is a read-only option. Would you show me how I can modify it?

Many thanks.
 
Hi,

yes, it is read-only - I forgot, sorry. This must be one of the first IBM-labeled releases of Data ONTAP - have you considered upgrading?
I tried to look into the ONTAP registry to check if it can be changed there, but that is not an option.
Anyhow - that is not the reason for the reboot.
What about the audit logs?

Harry
 