High Availability and Failover in MarkLogic FAQ
28 October 2021 05:42 PM

How does MarkLogic Server's high-availability work in AWS?

AWS provides fault tolerance within a geographic region through the use of Availability Zones (AZs), while MarkLogic provides that ability through Local Disk Failover (LDF). If you’re using AWS, the best practice is to place each MarkLogic node/EC2 instance in a different Availability Zone within a single region, so that a given data forest is in one AZ (AZ A) while its LDF forest is in a different AZ (AZ B). This way, in the event that Availability Zone A becomes unavailable, the forests on the host in Availability Zone A will fail over to their LDF forests on the host in Availability Zone B, thereby ensuring high availability within your MarkLogic cluster.
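To illustrate the placement rule above, here is a minimal sketch in Python. The helper function and the host/AZ names are hypothetical, not a MarkLogic API; the point is simply that each data forest's LDF host must sit in a different Availability Zone than the data forest's own host.

```python
# Sketch: AZ-aware LDF replica placement (hypothetical helper, not a
# MarkLogic API). Given each host's Availability Zone, pick an LDF host
# for every data forest that lives in a *different* AZ.

def pick_ldf_host(data_host: str, host_az: dict) -> str:
    """Return a host whose AZ differs from data_host's AZ."""
    data_az = host_az[data_host]
    for host, az in sorted(host_az.items()):
        if az != data_az:
            return host
    raise ValueError("no host available outside AZ " + data_az)

# Three-node cluster, one node per AZ (the recommended layout).
host_az = {"host-a": "us-east-1a", "host-b": "us-east-1b", "host-c": "us-east-1c"}

# Map each host's data forests to an LDF host in another AZ.
placement = {h: pick_ldf_host(h, host_az) for h in host_az}
```

With one node per AZ, any single AZ outage leaves every forest's LDF copy reachable in a surviving AZ.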


Should failover be configured for the Security forest?

A cluster is not functional without its Security database. Consequently, it’s important to ensure the high availability of the Security database’s forest by configuring failover for that forest.
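As a sketch of the check implied above, the snippet below verifies that a forest has at least one replica on a different host. The dictionary shape here is a simplified, hypothetical representation of a forest's configuration; in practice you would inspect this through the Admin UI or MarkLogic's Management REST API.

```python
# Sketch: confirm the Security forest has an LDF replica on another host.
# The dict layout is a hypothetical simplification, not MarkLogic's actual
# configuration format.

def has_failover(forest: dict) -> bool:
    """True if the forest has at least one replica on a different host."""
    return any(r["host"] != forest["host"] for r in forest.get("replicas", []))

security = {
    "name": "Security",
    "host": "host-a",
    "replicas": [{"name": "Security-R", "host": "host-b"}],
}
```

A forest with no replicas (or replicas only on its own host) would fail this check and leave the cluster unable to authenticate requests if its host goes down.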


Should my forests have more than one Local Disk Failover forest?

High availability through Local Disk Failover with one LDF forest is designed to allow the cluster to survive the failure of a single host. If you're using AWS, careful forest placement across AWS availability zones can provide high availability even in the event of an entire availability zone going down. With rare exceptions, additional LDF forests are not worth the additional complexity and cost for the vast majority of MarkLogic deployments.

If you configure Local Disk Failover with one LDF forest, coupled with Database Replication and Backups, you would have enough copies of your data to survive failures ranging from a single host to an entire availability zone.

Do I still have high-availability post failover? What happens to the data forest? How can I fail back my forests to the way they were?

When a failover event occurs, the LDF forest takes over as the acting data forest, and the configured data forest assumes the role of the acting LDF forest as soon as it is successfully restarted. At this point, as long as both forests are still available, the cluster continues to be highly available, but with the forests reversing their originally intended roles.

To fail back the forests to their originally intended roles, you will need to wait until the acting data forest (the originally intended LDF) and the acting LDF (the originally intended data forest) are synchronized, then manually restart the acting data forest/intended LDF. At that point, the acting LDF/intended data forest “fails back” to take over its original role of acting data forest, and the acting data forest/intended LDF will once again assume its original role of acting LDF. In short, failover is automatic, but failing back requires a manual restart of the acting data forest/intended LDF.

When failing back, it's very important to wait until the forests are synchronized - if you fail back before the forests are fully synchronized, you'll lose any data in the acting data forest that has yet to be propagated back to the acting LDF/intended data forest.
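The failover/fail-back sequence above can be sketched as a toy state machine. The class and method names are illustrative only; real fail-back is performed by restarting the forest through MarkLogic itself.

```python
# Sketch: a toy model of LDF failover and manual fail-back, following the
# sequence described above. Names are illustrative, not a MarkLogic API.

class ForestPair:
    def __init__(self, data: str, ldf: str):
        self.acting_data = data    # forest currently serving as data forest
        self.acting_ldf = ldf      # forest currently serving as LDF
        self.synchronized = True

    def failover(self):
        """Host of the acting data forest fails: roles swap automatically."""
        self.acting_data, self.acting_ldf = self.acting_ldf, self.acting_data
        self.synchronized = False  # restarted forest must catch up first

    def restart_acting_data(self):
        """Manual fail-back: only safe once the forests are synchronized."""
        if not self.synchronized:
            raise RuntimeError("forests not synchronized - fail-back would lose data")
        self.acting_data, self.acting_ldf = self.acting_ldf, self.acting_data

pair = ForestPair(data="F1", ldf="F1-R")
pair.failover()              # F1-R is now the acting data forest
pair.synchronized = True     # replication has caught up
pair.restart_acting_data()   # F1 resumes its original data role
```

Note that `restart_acting_data` refuses to swap roles while the forests are out of sync, mirroring the warning above about losing unpropagated data.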

