Startup, Quorum and Forest Level Failover
26 June 2020 07:09 PM
Quorum is used to decide whether a node is evicted from a cluster or kept in it, but is quorum also required when starting a cluster?
What is Quorum?
Each node in a cluster communicates with all of the other nodes in the cluster at periodic intervals. This periodic communication, known as a heartbeat, circulates key information about host status and availability between the nodes in a cluster. The cluster uses the heartbeat to determine if a node in the cluster is unavailable. This determination is based on a vote from each node in the cluster, based on each node's view of the current state of the cluster. To vote a node out of the cluster, there must be a quorum of nodes voting to remove a node. A quorum occurs if more than 50% of the total number of nodes in the cluster (including any nodes that are down) vote the same way.
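The voting rule above can be sketched in a few lines of Python. This is only an illustration of the "more than 50% of the total number of nodes" arithmetic, not MarkLogic's implementation; the function name `has_quorum` and its parameters are hypothetical.

```python
def has_quorum(votes_in_agreement: int, total_nodes: int) -> bool:
    """Return True when strictly more than 50% of ALL nodes vote the same way.

    total_nodes counts every node configured in the cluster, including
    nodes that are currently down (they simply cannot cast a vote).
    """
    if total_nodes <= 0:
        raise ValueError("total_nodes must be positive")
    if not 0 <= votes_in_agreement <= total_nodes:
        raise ValueError("votes_in_agreement must be between 0 and total_nodes")
    # Integer arithmetic avoids floating-point comparisons:
    # votes / total > 0.5  is equivalent to  2 * votes > total.
    return 2 * votes_in_agreement > total_nodes


# Five-node cluster, two nodes down: the three remaining nodes form a quorum.
print(has_quorum(3, 5))
# Four-node cluster, two nodes voting: exactly 50% is not a quorum.
print(has_quorum(2, 4))
```

Note that the denominator is the configured cluster size, not the number of nodes currently reachable, which is why an evenly split cluster can never vote a node out.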
Depending on cluster configuration, this quorum may or may not be required even during startup of a cluster.
On a cluster without forest-level failover configured, no quorum is required to bring up the Admin UI. Once you bring up the server hosting the Security (Schemas and Modules) database, you can access the Admin UI.
On a cluster with shared-disk failover configured, no quorum is required to bring up the Admin UI either. Once you bring up the server hosting the Security (Schemas and Modules) database, you can access the Admin UI.
On a cluster with local-disk failover configured, a quorum is required before operations can start (e.g. accessing the Admin UI). If you do not have a quorum, the MarkLogic administrator will have to intervene to bring up the required number of hosts.
After a power outage, it is expected that all hosts will be powered up simultaneously. The server is designed to handle this well, so there is no need to serialize server startup; in fact, a simultaneous startup of all hosts in a cluster is preferred. If there is a reason to serialize server startup (such as not wanting to overwhelm the SAN), that is fine too. Just be aware that normal cluster operation will begin only at the point where you have a quorum.
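The three startup cases above can be summarized as a small decision function. This is a simplified sketch of the rules as described in this article, not a MarkLogic API: the `Failover` enum, the function name and its parameters are all hypothetical, and for the local-disk case the sketch models only the quorum condition (it does not model which host holds the Security forests or their replicas).

```python
from enum import Enum


class Failover(Enum):
    """Hypothetical labels for the three configurations discussed above."""
    NONE = "none"
    SHARED_DISK = "shared-disk"
    LOCAL_DISK = "local-disk"


def admin_ui_reachable(failover: Failover, hosts_up: int, total_hosts: int,
                       security_host_up: bool) -> bool:
    """Approximate whether the Admin UI can be reached during startup."""
    if failover is Failover.LOCAL_DISK:
        # Local-disk failover: more than half of all hosts must be up.
        # Assumption: forest and replica placement is ignored here.
        return 2 * hosts_up > total_hosts
    # No failover or shared-disk failover: no quorum needed, only the
    # host holding the Security database has to be running.
    return security_host_up


# One host of five is up, and it holds the Security database.
print(admin_ui_reachable(Failover.NONE, 1, 5, True))
print(admin_ui_reachable(Failover.LOCAL_DISK, 1, 5, True))
# Serialized startup: operations begin once the third of five hosts is up.
print(admin_ui_reachable(Failover.LOCAL_DISK, 3, 5, True))
```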
Why do we need to achieve Quorum of more than 50%? Understanding network partitioning, or the "split brain" problem
For failover to occur, you must have a quorum of participant nodes (defined as "n/2 + 1", with n/2 rounded down, i.e. more than half of the n hosts in the cluster). This is what protects you against the risk of network partitioning: if a node can't communicate with more than half the hosts in a cluster, it cannot tell whether it is on the losing side of a network partition.
If you were to put N hosts in one data center and N hosts in another, neither data center would be able to determine that it is the surviving one in the event of a network problem, because each side would see exactly half of the hosts. If you want to create a cluster that spans multiple data centers, you need at least one more machine in a third location that the two data centers can use to break the tie.
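To make the arithmetic concrete, the sketch below computes the quorum size for a cluster and checks which side of a network partition, if any, still holds a quorum. The function names are hypothetical and the model is deliberately simple: each group is a set of hosts that can still reach one another, and no group can reach any other.

```python
from typing import List


def quorum_size(total_hosts: int) -> int:
    """Smallest number of hosts that is more than half: n/2 + 1, rounded down."""
    return total_hosts // 2 + 1


def groups_with_quorum(group_sizes: List[int]) -> List[bool]:
    """For each isolated group of hosts, report whether it holds a quorum."""
    total = sum(group_sizes)
    needed = quorum_size(total)
    return [size >= needed for size in group_sizes]


# Two data centers with 3 hosts each, link between them lost:
# a quorum needs 4 of 6 hosts, so neither side can act.
print(quorum_size(6), groups_with_quorum([3, 3]))

# Same setup plus one tie-breaker host in a third location that can
# still reach the first data center: 4 of 7 hosts form a quorum.
print(quorum_size(7), groups_with_quorum([4, 3]))
```

With an even split no group ever reaches "n/2 + 1", which is why the extra host in a third location is needed: it guarantees that at most one side can reach a majority.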
Read more on network partitioning at: https://en.wikipedia.org/wiki/Split-brain_(computing)