MarkLogic Fundamentals - High Availability & False Failovers
19 May 2020 02:09 PM
What does failover look like in MarkLogic Server?
Failover in MarkLogic Server is designed to address data node ("d-node") or forest-level failure. D-node failures include operating system crashes, MarkLogic Server restarts, power failures, and persistent system failures (hardware failures, for example). A forest-level failure is any disk I/O or other failure that results in an error state on the forest.
Failover in MarkLogic Server is "hot" in the sense that switchover occurs immediately to failover hosts already running in the same cluster, with no node restarts required. Failing back from a failover host to the primary host, however, needs to be done manually and requires a node restart.
When a node is perceived as no longer communicating with the rest of the cluster, and a quorum of greater than 50% of the hosts in the cluster votes to remove the affected node, a failover event occurs automatically. A node is considered to no longer be communicating with the rest of the cluster when it fails to respond to cluster-wide heartbeats within the configured host timeout.
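The heartbeat and quorum rules above can be sketched in a few lines of Python. This is an illustrative model only, not MarkLogic's actual implementation: the timeout value and function names are invented for the example, and the real host timeout is a cluster configuration setting.

```python
# Hypothetical sketch of the failover decision rules described above.
# Not MarkLogic internals; names and values are invented for illustration.

HOST_TIMEOUT_SECONDS = 30.0  # assumed value; the real timeout is configurable


def is_unresponsive(seconds_since_heartbeat: float) -> bool:
    """A host is considered unresponsive once it has failed to respond
    to cluster-wide heartbeats for longer than the host timeout."""
    return seconds_since_heartbeat > HOST_TIMEOUT_SECONDS


def quorum_votes_to_remove(total_hosts: int, votes_to_remove: int) -> bool:
    """Failover happens only when strictly more than 50% of ALL hosts
    in the cluster vote to remove the affected node."""
    return 2 * votes_to_remove > total_hosts
```

Note the strict inequality: in a 3-host cluster, 2 votes are a quorum, but in a 4-host cluster, 2 votes are exactly 50% and do not suffice; 3 are required. This is also why clusters with an odd number of hosts tolerate ties less ambiguously.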
What does false failover look like in MarkLogic Server?
A false failover occurs when a node is actually healthy, but is too busy to respond to cluster-wide heartbeats within the host timeout, so the rest of the cluster votes it out and fails over its forests. There are many possible causes of a busy node or cluster, and one that is often overlooked is the underlying infrastructure, especially when virtualization is involved. Virtualization lets you get more out of your resources by allowing VMs to share them, under the assumption that not all systems will need their assigned resources at the same time. However, if multiple VMs come under load simultaneously, they can outstrip the available physical resources, because more than 100% of those resources have been assigned to the VMs. This condition is known as resource starvation.
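The overcommitment arithmetic is simple enough to sketch. The function and values below are hypothetical, for illustration only; in practice you would pull the assigned vCPU counts from your hypervisor's inventory.

```python
# Hypothetical sketch: VM sizes and the physical core count are invented.

def overcommit_ratio(physical_vcpus: int, assigned_vcpus: list[int]) -> float:
    """Ratio of vCPUs promised to guest VMs versus physical cores available.
    A ratio above 1.0 means more than 100% of the hardware has been assigned,
    so simultaneous load on several VMs can starve the others."""
    return sum(assigned_vcpus) / physical_vcpus


# A 32-core host running four VMs assigned 16, 16, 8, and 8 vCPUs:
ratio = overcommit_ratio(physical_vcpus=32, assigned_vcpus=[16, 16, 8, 8])
# 48 / 32 = 1.5: the host is 150% committed
```

A ratio of 1.5 does no harm while the VMs are mostly idle, which is exactly why the problem is easy to overlook; it only surfaces when several VMs demand their full allocation at once, and a MarkLogic node on that host may then miss heartbeats through no fault of its own.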
What should I do about false failover events in MarkLogic Server?
Long-term correction of false failover events requires better alignment between your system's workloads and its hardware provisioning. You could, for example, reduce the workload, spread the same workload over more time, or increase your system's hardware provisioning. All of these tactics free up the affected nodes to respond to the cluster-wide heartbeat in a more timely manner, thereby avoiding false failover events. You can read more about aligning your workloads and hardware footprint at: