Knowledgebase:
Should I flip failed over forests back to their respective masters? What are the risks if I leave them?
29 August 2018 11:27 AM

Introduction

The notion of "flipping" control back (from a failed-over replica forest to its configured master) has been covered in a previous Knowledgebase article:

https://help.marklogic.com/Knowledgebase/Article/View/427/0/scripting-failover-flipping-replica-forests-back-to-their-masters-using-xquery

In this Knowledgebase article, we will discuss the pros and cons of leaving failed-over forests as they are. Should control be returned to the master forests after a failover event?

Best Practices

Can it be considered good practice to leave forests in their failed-over state?

As long as the original configured master shows a "sync replicating" state on the database status page, you know it is still ready to take over if the configured replica (the current acting master) fails at a later time. This means that High Availability is still preserved across the cluster in spite of the earlier failover event.
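You can also confirm this without the Admin UI by querying the forest states directly. The following is a minimal XQuery sketch to run in Query Console; the database name "Documents" is a placeholder for your own database:

    xquery version "1.0-ml";

    (: Report the state of every forest attached to a database, replicas
       included. "Documents" is a placeholder database name. :)
    let $db := xdmp:database("Documents")
    for $forest in xdmp:database-forests($db, fn:true())
    let $status := xdmp:forest-status($forest)
    return fn:concat(
      fn:string($status//*:forest-name), " : ",
      fn:string($status//*:state)
    )

In a healthy failed-over configuration you would expect to see each acting master reporting "open" and its counterpart reporting "sync replicating".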

In summary, the main reasons to fail back the forests to their initial configured state are as follows:

  • Your operating state will match your configured state, which avoids surprises caused by assumptions based on the configuration or naming of forests (e.g. someone somewhere may assume that forest-001-r is a replica forest and not check whether it is currently acting master due to a failover event that took place some time in the past; a sketch for detecting this situation follows this list). This is especially important if your team does not maintain a runbook for your MarkLogic cluster.
    • Additionally, if you restart your cluster in a failed-over state, the configured masters will take over again, so your running state will differ before and after the restart. This could complicate diagnosis of any problems involving the restart (e.g. if the restart was in response to a problem, or if a problem surfaces after the restart).
  • Both master and replica forests do work for updates (every update applied to an acting master is also applied to its replica), although only acting master forests service queries. Presumably you sized your cluster and distributed your forests to spread the load; in a failed-over state, the load is likely to be uneven across the hosts in your cluster, and failing the forests back to their respective masters restores that even distribution.
  • There may also be implications for backup and restore if you have an unusual distribution of master and acting-master (replica) forests, which could create further work for you; these issues are covered in separate Knowledgebase articles.
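On the naming point above, you do not have to rely on forest names to spot a failover. A minimal XQuery sketch along these lines (again, "Documents" is a placeholder database name) lists configured master forests that are not currently in their "open" state, i.e. forests whose replica is likely acting as master:

    xquery version "1.0-ml";

    (: List configured master forests that are not "open" - i.e. forests
       that have likely failed over to a replica. :)
    let $db := xdmp:database("Documents")
    for $forest in xdmp:database-forests($db)  (: configured masters only :)
    let $state := fn:string(xdmp:forest-status($forest)//*:state)
    where $state ne "open"
    return fn:concat(xdmp:forest-name($forest), " is ", $state,
      " - a replica may currently be acting as master")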

Conclusion

In the event of a forest failover, as long as your previous master forests are in their expected "sync replicating" state, the risk of leaving the forests in a failed-over state is minimal; any disturbance that takes the acting master forest offline (such as a forest restart) will simply cause failover to happen again, so you continue to have High Availability.

However, a forest failover can be a symptom of a larger problem: a particular host that is encountering issues for any number of possible reasons. Keeping track of when forests fail over on a given host can be a useful first line of enquiry into a system that is showing early warning signs of a problem.

From a system-management perspective, flipping failed-over forests back to their respective masters can be considered part of an ongoing approach to managing and maintaining general cluster health.

In the event of a failover, if the failover details are logged and the forests are failed back to their respective masters, subsequent failover events become apparent at a glance: it is easy to quickly review the status tab of a given database and confirm that all the master forests are in their "open" state (with their replica forests all "sync replicating").

Adopting a policy of logging what happened and resolving the issue by failing the forests back makes each failover an event that gets triaged; in the longer run, this will make future events easier to spot and could potentially provide data that gives you advance warning of an inherent issue with a given host in your cluster.
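As a concrete illustration of such a policy, the sketch below combines both steps: it logs what it finds and fails the affected forests back, using the xdmp:forest-restart approach described in the Knowledgebase article linked in the introduction. It assumes local-disk failover and a placeholder database name ("Documents"); treat it as a starting point and test it on a non-production cluster first.

    xquery version "1.0-ml";

    import module namespace admin = "http://marklogic.com/xdmp/admin"
      at "/MarkLogic/admin.xqy";

    (: For each configured master that is "sync replicating" (and has
       therefore failed over), restart the replica currently acting as
       master; the in-sync configured master then takes over again. :)
    let $config := admin:get-configuration()
    let $db := xdmp:database("Documents")
    for $master in xdmp:database-forests($db)
    where fn:string(xdmp:forest-status($master)//*:state) eq "sync replicating"
    return
      for $replica in admin:forest-get-replicas($config, $master)
      where fn:string(xdmp:forest-status($replica)//*:state) eq "open"
      return (
        xdmp:log(fn:concat("Failing back ", xdmp:forest-name($master),
          " by restarting acting master ", xdmp:forest-name($replica))),
        xdmp:forest-restart($replica)
      )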
