Procedure for amnesia recovery
I’m deploying a two-node Sun Cluster environment, and while running some tests in it I got into some interesting situations. One of them was the “amnesia” problem, which I posted about in the OpenSolaris forums. Thanks to Ira Pramanick, who wrote this procedure to recover from such a scenario:
Running into a scenario where you are hitting amnesia protection
*and* are not able to bring up one of the servers (the one that
went down last and needs to be the first to come up) can indeed
happen. We have a step-by-step emergency procedure to recover
from such a scenario. Here’s that step-by-step procedure:
Scenario: Two-node cluster (nodes A and B) with one quorum device (QD),
nodeA has gone bad, and amnesia protection is preventing
nodeB from booting up.
– Boot nodeB in non-cluster mode (boot -x).
– Edit nodeB’s file /etc/cluster/ccr/infrastructure as follows:
  – Change the value of “cluster.properties.installmode” from
    “disabled” to “enabled”.
  – Change the number of votes for nodeA from “1” to “0”,
    in the following property line:
  – Delete all lines with “cluster.quorum_devices” to remove
    knowledge of the quorum device.
– Run command:
/usr/cluster/lib/sc/ccradm -i /etc/cluster/ccr/infrastructure -o
– Reboot nodeB in cluster mode.
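The file edits in the steps above can be sketched as a small Python helper. This is only an illustration under assumptions, not part of Ira's procedure: it assumes each property sits on its own line as a key followed by whitespace and a value, and the function name `patch_infrastructure` and the key `example.nodeA.votes` are made up for the example. You have to pass in the real name of nodeA's vote property from your own file. Work on a backup copy, and the ccradm step is still needed afterwards.

```python
import re

# Assumption: each property sits on one line as "key<whitespace>value".
PROPERTY_LINE = re.compile(r"^(\S+)(\s+)(.*)$")


def patch_infrastructure(text, node_a_vote_key):
    """Return an edited copy of the infrastructure file contents.

    node_a_vote_key is supplied by the caller: it is the name of the
    property that holds nodeA's vote count in your own file.
    """
    patched = []
    for line in text.splitlines():
        # Drop every line that mentions the quorum device.
        if "cluster.quorum_devices" in line:
            continue
        match = PROPERTY_LINE.match(line)
        if match:
            key, sep, value = match.groups()
            value = value.strip()
            if key == "cluster.properties.installmode" and value == "disabled":
                # Switch install mode from "disabled" to "enabled".
                line = key + sep + "enabled"
            elif key == node_a_vote_key and value == "1":
                # Take nodeA's vote away.
                line = key + sep + "0"
        patched.append(line)
    return "\n".join(patched) + "\n"


if __name__ == "__main__":
    # Made-up sample contents; "example.nodeA.votes" is a placeholder key,
    # not a real CCR property name.
    sample = (
        "cluster.name\tdemo\n"
        "cluster.properties.installmode\tdisabled\n"
        "example.nodeA.votes\t1\n"
        "cluster.quorum_devices.1.name\td4\n"
    )
    print(patch_infrastructure(sample, "example.nodeA.votes"), end="")
```

The helper only transforms text and leaves reading and writing the file to the caller, so it can be tried on a copy before touching /etc/cluster/ccr/infrastructure on nodeB.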
Pretty soon down the line, you would want to get your 2nd node
back up and running, and the QD configured back in.
And, as for determining which of the 2 scenarios in the email
thread below is the one that you were hitting, please do save the
console messages or dig through syslog to see what messages were
logged during the time of your tests. Those will help pinpoint
exactly which scenario you are hitting.
All the cases that we have been discussing in this thread are
abnormal scenarios, in that all of these represent more than
a single point of failure in the system. Solaris Cluster guarantees
surviving all single points of failure, and tries to be robust to
multiple points of failure, surviving many of these as well.
Hope this helps,