This article outlines the manual procedures for failing back after a failover event.
What is failover?
Failover in MarkLogic Server provides high availability for data nodes in the event of a d-node or forest-level failure. With failover enabled and configured, a host can go offline or unresponsive and a MarkLogic Server cluster automatically and gracefully recovers from the outage, continuing to process queries without any immediate action needed by an administrator.
MarkLogic offers support for two varieties of failover at the forest level, both of which provide a high-availability solution for data nodes.
- Local-disk failover: Allows you to specify a forest on another host to serve as a replica forest, which takes over if the forest's host goes offline. Local-disk failover keeps multiple copies of the forest on different nodes/filesystems.
- Shared-disk failover: Allows you to specify alternate nodes within a cluster to host forests if a forest's primary host goes offline. Shared-disk failover keeps a single copy of the forest.
More information can be found at:
How does failover work?
The mechanism for how MarkLogic Server automatically fails over is described in our documentation at: How Failover Works
When does failover occur?
Scenarios that trigger a forest to failover are discussed in detail at:
High level overview of failing back after a failover event
If failover is configured, other hosts in the cluster automatically assume control of the forests (or replicas of the forests) of the failed host. However, when the failed host comes back up, control of those forests does not automatically transfer back to the original host. Manual intervention is required to fail back. If you have a failed-over forest and want to fail back, you'll need to:
- Restart either the forest or the current host of that forest, if using shared-disk failover
- Restart the acting data forest or restart the host of that forest, if using local-disk failover. You should only do this if the original primary forest is in the sync replicating state, which indicates that it is up-to-date and ready to take over. Updates written to an acting primary forest must be synchronized to acting replicas, else those updates will be lost after failing back. After restarting the acting data forest, the intended primary data forest will automatically open on the intended primary host.
Make sure the primary host is safely back online before attempting to fail back the forest.
You can read more about this procedure at: Reverting a Failed Over Forest Back to the Primary Host
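The precondition above can be written down as a small check. This is a minimal sketch, not MarkLogic API code: `ForestStatus` and `safe_to_fail_back` are hypothetical names, and the state string is assumed to match what your cluster reports for the forest. Only the "sync replicating" requirement and the "primary host back online" requirement come from the procedure described here.

```python
from dataclasses import dataclass


@dataclass
class ForestStatus:
    """Hypothetical snapshot of one failed-over forest (not a MarkLogic type)."""
    failover_type: str           # "local-disk" or "shared-disk"
    primary_host_online: bool    # is the intended primary host back up?
    intended_primary_state: str  # state reported for the intended primary forest


def safe_to_fail_back(status: ForestStatus) -> bool:
    """Return True when restarting the acting forest (or its host) is safe."""
    if not status.primary_host_online:
        # Never fail back to a host that is not safely online again.
        return False
    if status.failover_type == "shared-disk":
        # Single forest copy: there is nothing to synchronize first.
        return True
    if status.failover_type == "local-disk":
        # The intended primary must have caught up with the acting primary.
        return status.intended_primary_state == "sync replicating"
    raise ValueError(f"unknown failover type: {status.failover_type!r}")
```

A check like this is only a gate in front of the restart. The restart itself is still performed by an administrator.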
Local disk failover procedure for attaching replicas directly to the database and clearing the intended primary forests' error states
If your primary data forests are in an error state, you'll need to clear those errors before failing back. This will usually require unmounting the primary forest copy, then directly mounting the local disk failover forest copy (or "LDF") to the relevant database. That procedure looks like:
- Make sure to turn OFF the rebalancer/reindexer at the database level - you don't want to unintentionally move data across forests when manually altering your database's forest topology.
- Break forest-level replication between the intended LDF replica (aka "acting primary") and the intended primary forest that is currently in an error state
- Detach the intended primary forest from the database
- Attach the intended LDF replica (aka acting primary) forest directly to the database
- Make sure the database is online
- Delete the intended primary forest that is in an error state
- Create a new forest with the same name as the now deleted intended primary forest
- Re-establish forest-level replication between the intended LDF replica (aka acting primary) forest and the newly created intended primary forest
- Let bulk replication repopulate the intended primary forest
- After bulk replication is finished, fail back as described above, so the intended primary forest is once again the acting primary forest, and the intended LDF replica is once again the acting LDF replica forest
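The order of these steps matters: the intended primary forest should not be deleted until the replica is attached and the database is confirmed online. The sketch below illustrates that ordering with a runner that stops at the first step that fails. It is an illustration only: `STEPS`, `run_procedure`, and the `execute` callback are hypothetical names, and `execute` stands in for whatever an administrator actually does to carry out each step.

```python
# Ordered steps, paraphrased from the procedure above.
STEPS = [
    "turn off the rebalancer and reindexer on the database",
    "break replication between the intended primary and its LDF replica",
    "detach the intended primary forest from the database",
    "attach the LDF replica (acting primary) directly to the database",
    "confirm the database is online",
    "delete the intended primary forest",
    "create a new forest with the intended primary's name",
    "re-establish replication between the LDF replica and the new forest",
    "wait for bulk replication to finish",
    "fail back",
]


def run_procedure(steps, execute):
    """Run steps in order; stop at the first step for which execute() is falsy.

    Returns the list of steps that completed successfully.
    """
    completed = []
    for step in steps:
        if not execute(step):
            # Halt here so later, destructive steps are never attempted.
            break
        completed.append(step)
    return completed
```

For example, if the "confirm the database is online" step fails, the runner returns only the first four steps and the delete step is never attempted.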
What is the procedure for failing forests back to the primary host in cases where the replicas are directly attached to the database?
If intended LDF replicas are instead directly attached to the relevant database, forest or host restarts will not fail back correctly. Instead, you must rename the relevant forests:
- Forests that are currently attached to the database can be renamed from their LDF replica naming scheme to the desired primary forest naming scheme.
- Conversely, unattached primary forests can be renamed as LDF replicas, then configured as LDF replicas for the relevant database.
- At this point, the server should detect that the current primary (which was previously the LDF replica) will have more recent data than the current LDF replica (which was previously the primary), which should then cause the server to populate the current LDF replica from the current primary
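The renames above amount to swapping two names. The sketch below assumes that two forests cannot share a name at any moment, so it moves the old primary to a temporary name first. The temporary suffix, the example forest names, and both function names are hypothetical, and the step order is one possible sequence under that assumption rather than a documented MarkLogic procedure.

```python
def rename_plan(primary_name, replica_name, temp_suffix="-tmp"):
    """Return (old_name, new_name) pairs that swap the two forest names."""
    temp = primary_name + temp_suffix
    return [
        (primary_name, temp),          # move the unattached old primary aside
        (replica_name, primary_name),  # attached forest takes the primary name
        (temp, replica_name),          # old primary takes the replica name
    ]


def apply_renames(names, plan):
    """Apply a rename plan to a set of names, rejecting any collision."""
    names = set(names)
    for old, new in plan:
        if old not in names:
            raise KeyError(f"no forest named {old!r}")
        if new in names:
            raise ValueError(f"forest name {new!r} is already in use")
        names.remove(old)
        names.add(new)
    return names
```

Running the plan against a model of the current names before touching the cluster is a cheap way to catch a collision in advance.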
What should be done in case of a disk failure?
In the unlikely event a logical volume is lost, you'll want to restore from a copy of your data. That copy can take the form of:
- Local disk failover (LDF) replicas within the same cluster (assuming those copies are fully synchronized)
- Database Replication copies in your replication cluster (again, assuming those copies are fully synchronized)
- Backups, which might be missing updates made since the backup was taken
You can restore from backups if you can afford to lose updates subsequent to that backup's timestamp and/or can re-apply whatever updates happened after the backup was taken.
If you would instead prefer not to lose updates, then use LDF replicas to sync back to replacement primary forests created on new volumes, failing back manually when done. In the event that data was moved across forests in some way after the backup was taken, it would be best to use LDF replicas instead, which avoids the possibility of database corruption in the form of duplicate URIs.
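The choice between these copies can be summarized as a small decision function. This is a sketch of the reasoning above, not a MarkLogic tool: the function name, its parameters, and the returned labels are hypothetical. Preferring a synchronized LDF replica over a synchronized Database Replication copy is also an assumption, since the text does not rank the two.

```python
from typing import Optional


def choose_restore_source(
    ldf_synced: bool,
    dbrep_synced: bool,
    can_lose_updates: bool,
    data_moved_since_backup: bool,
) -> Optional[str]:
    """Pick a copy to restore from, or None if no option is clearly safe."""
    if ldf_synced:
        # Fully synchronized local copy: no updates are lost.
        return "ldf-replica"
    if dbrep_synced:
        # Fully synchronized copy in the replication cluster.
        return "database-replication"
    if can_lose_updates and not data_moved_since_backup:
        # A backup is acceptable only if losing (or re-applying) later
        # updates is tolerable and no data moved across forests since then.
        return "backup"
    # Nothing is clearly safe; this needs manual review.
    return None
```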
Database Replication will allow you to maintain copies of forests on databases in multiple MarkLogic Server clusters. Once the replica database in the replica cluster is fully synchronized with its primary database, you may break replication between the two and then go on to use the replica cluster/database as the primary. Note: To enable Database Replication, a license key that includes Database Replication is required. You'll also need to ensure that all hosts are:
- Running the same maintenance release of MarkLogic Server
- Using the same Operating System
- Correctly configured for Database Replication
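The host prerequisites above could be verified with a check along these lines. It is a minimal sketch under stated assumptions: `HostInfo` and `replication_prereq_problems` are hypothetical names, and the host details are assumed to be collected by some other means, since nothing here queries a cluster.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class HostInfo:
    """Hypothetical per-host facts, gathered however suits your environment."""
    name: str
    version: str                  # MarkLogic Server release string
    os: str                       # operating system identifier
    replication_configured: bool  # Database Replication set up on this host?


def replication_prereq_problems(hosts):
    """Return a list of human-readable problems; an empty list means OK."""
    problems = []
    if len({h.version for h in hosts}) > 1:
        problems.append("hosts run different MarkLogic Server releases")
    if len({h.os for h in hosts}) > 1:
        problems.append("hosts use different operating systems")
    for h in hosts:
        if not h.replication_configured:
            problems.append(f"{h.name}: Database Replication not configured")
    return problems
```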
Summary
- It's possible to have multiple copies of your data in a MarkLogic Server deployment
- Under normal operations, these copies are synchronized with one another
- Should failover events occur in a cluster, or catastrophic events occur to an entire cluster, you can shift traffic to the available previously synchronized copies
- Failing back is a manual operation
- Make sure to re-synchronize copies that were offline with online copies
- Shifting previously offline copies to acting primary before re-synchronization may result in data loss, as offline forests can overwrite updates previously committed to LDF forests serving as acting primaries while the intended primary forests were offline