Knowledgebase:
MarkLogic Fundamentals - High Availability & False Failovers
19 May 2020 02:09 PM

Summary
Overlarge workloads, underprovisioned environments, or a combination of the two often result in false failovers, where MarkLogic Server perceives an overloaded node as unavailable. A failover event redistributes the affected node’s traffic to the remaining nodes in the cluster. A false failover event, unfortunately, redistributes an overloaded node’s workload to the remaining nodes, which are now fewer in number and likely just as overloaded. While it’s possible to mitigate this scenario in the short term by allowing more time for nodes to talk to one another, long-term correction requires throttling workloads, increasing the environment’s hardware provisioning, or a combination of the two.

What does failover look like in MarkLogic Server?
High availability systems require continuity within a cluster. MarkLogic Server delivers high availability by providing fault tolerance - if a node in a MarkLogic cluster fails, other nodes automatically pick up the workload so that the data stored in forests is always available. 

More specifically, failover in MarkLogic Server is designed to address data node (“d-node”) or forest-level failure. D-node failures can include operating system crashes, MarkLogic Server restarts, power failures, or persistent system failures (hardware failures, for example). A forest-level failure is any disk I/O or other failure that results in an error state on the forest. 

Failover in MarkLogic Server is "hot" in the sense that switchover occurs immediately to failover hosts already running in the same cluster, with no node restarts required. Failing back from a failover host to the primary host, however, needs to be done manually and requires a node restart.

When a node is perceived as no longer communicating with the rest of the cluster, and a quorum of greater than 50% of the nodes in the cluster votes to remove the affected node, a failover event occurs automatically. A node is considered to be no longer communicating with the rest of the cluster when it fails to respond to clusterwide heartbeats within the defined host timeout.
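The voting rule above can be illustrated with a minimal Python sketch. This is not MarkLogic's implementation: the function names are hypothetical, and it assumes the cluster size used for the 50% threshold counts every node in the cluster, which the description above does not spell out.

```python
def is_unresponsive(seconds_since_heartbeat: float, host_timeout: float) -> bool:
    """A peer treats a node as unresponsive once it has been silent longer than the host timeout."""
    return seconds_since_heartbeat > host_timeout


def should_fail_over(silence_seen_by_peers: list[float],
                     cluster_size: int,
                     host_timeout: float) -> bool:
    """Fail over only when more than 50% of the cluster's nodes vote to remove the node.

    silence_seen_by_peers holds, for each voting peer, the seconds since it last
    heard a heartbeat from the affected node (illustrative values only).
    """
    votes = sum(1 for s in silence_seen_by_peers if is_unresponsive(s, host_timeout))
    # Strict majority: exactly 50% is not enough.
    return votes * 2 > cluster_size


# 3-node cluster: both peers have heard nothing for longer than 30 seconds.
print(should_fail_over([35.0, 40.0], cluster_size=3, host_timeout=30.0))
# 5-node cluster: only two of the four peers have lost contact.
print(should_fail_over([35.0, 40.0, 5.0, 3.0], cluster_size=5, host_timeout=30.0))
```

In the first call two of three nodes vote for removal, so failover proceeds. In the second call only two of five do, so the node stays in the cluster.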

What does false failover look like in MarkLogic Server?
False failover events in MarkLogic Server occur when a node is present and working, but so overloaded that it can no longer respond to clusterwide heartbeats within the specified host timeout. In other words, during a false failover event the affected node is so busy that it cannot communicate its status to the other nodes in the cluster, and consequently cannot prevent them from voting to remove it from the cluster.

A busy node or cluster can have many causes, and one that is often overlooked is the infrastructure, especially when virtualization is involved. Virtualization lets you get more out of your hardware by allowing VMs to share resources, under the assumption that not all systems will need their assigned resources at the same time. However, if multiple VMs are under load simultaneously, they can outstrip the available physical resources because more than 100% of those resources have been assigned to the VMs. This condition is called "resource starvation".
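The overcommitment arithmetic can be made concrete with a short Python sketch. The function names and the vCPU figures are hypothetical and chosen only for illustration. They are not taken from any hypervisor's API or from a real environment.

```python
def overcommit_ratio(assigned_per_vm: list[int], physical: int) -> float:
    """Ratio of resources assigned to VMs versus what the host physically has.

    A value above 1.0 means more than 100% of the resource has been assigned.
    """
    if physical <= 0:
        raise ValueError("physical capacity must be positive")
    return sum(assigned_per_vm) / physical


def is_starved(demand_per_vm: list[int], physical: int) -> bool:
    """True when simultaneous demand from all VMs exceeds the physical capacity."""
    return sum(demand_per_vm) > physical


# Four VMs, each assigned 8 vCPUs, on a host with 16 physical cores.
print(overcommit_ratio([8, 8, 8, 8], physical=16))  # 2.0, i.e. 200% assigned

# Mostly idle VMs fit comfortably; four busy VMs do not.
print(is_starved([2, 1, 3, 2], physical=16))  # False
print(is_starved([8, 7, 8, 6], physical=16))  # True
```

An overcommitted host is not starved by default. Starvation appears only when enough VMs become busy at the same time, which is why it is easy to overlook.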

What should I do about false failover events in MarkLogic Server?
Recall that a node is voted out when it can no longer respond to the rest of the cluster within the specified host timeout. It might be possible to mitigate false failovers in the short term by temporarily increasing the environment’s XDQP and host timeouts. Larger timeouts give all the nodes in the cluster more time to respond to clusterwide heartbeats, which under heavy load should decrease the frequency of false failover events. That said - DO NOT get in the habit of simply increasing your timeouts to larger and larger values. Increasing timeouts to avoid false failovers is, at best, a temporary, short-term tactic.
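To see why a larger timeout reduces false failovers without fixing the underlying load, consider the following Python sketch. The response delays and timeout values are invented for illustration and are not MarkLogic defaults or measurements.

```python
def count_missed_heartbeats(response_delays: list[float], host_timeout: float) -> int:
    """Count heartbeat responses that arrived later than the host timeout."""
    return sum(1 for delay in response_delays if delay > host_timeout)


# Hypothetical response delays (seconds) from an overloaded node.
delays = [5.0, 12.0, 28.0, 35.0, 41.0, 33.0, 9.0]

print(count_missed_heartbeats(delays, host_timeout=30.0))  # 3
print(count_missed_heartbeats(delays, host_timeout=45.0))  # 0
```

Raising the timeout from 30 to 45 seconds hides all three late responses, but the node is exactly as slow as before. If the load keeps growing, the delays will eventually exceed the new timeout as well.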

Long-term correction of false failover events requires better alignment between your system’s workloads and its hardware provisioning. You could, for example, reduce the workload, spread the same workload over more time, or increase your system’s hardware provisioning. Any of these tactics would free up the affected nodes to respond to the clusterwide heartbeat in a more timely manner, thereby avoiding false failover events. You can read more about aligning your workloads and hardware footprint at:

  1. MarkLogic Performance: Understanding System Resources
  2. Performance Issues in MarkLogic Server: what they look like - and what you should do about them
