"Wait Replication" scenarios and their resolutions
08 September 2020 09:57 PM
Forests in MarkLogic Server may be in one of several mount states. After mounting, local disk failover forests and database replication forests should eventually reach the sync replicating or async replicating state. On occasion, however, these forests get stuck in the wait replication state. This knowledgebase article itemizes many of these wait replication scenarios, along with the operational tactics to use in response.
Wait replication scenarios
Wait replication as a result of lack of quorum
A quorum in MarkLogic Server represents more than 50% of the total number of nodes in the cluster. It's very important to note that this is the total number of configured nodes, regardless of group membership, forest assignment, or whether nodes are running: if a machine exists in the hosts.xml configuration file and in the list of hosts in the Admin UI, it contributes to the total count.
While it's possible to run a MarkLogic cluster with only a subset of the configured nodes up, it's not a recommended configuration. In addition, if the number of active nodes in your cluster drops to 50% or less of the configured total, quorum is lost and you might see forests in the wait replication state.
What to do about it? You'll need to alter your cluster's configuration to meet the quorum requirement. That can mean either removing missing nodes from the cluster's configuration (essentially telling the cluster to stop looking for those missing nodes), or alternatively bringing up nodes that are currently part of the configuration, but not actively returning heartbeats (effectively letting the cluster see nodes it expects to be there).
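The quorum arithmetic described above can be sketched in a few lines of Python. This is only an illustration of the "more than 50% of configured nodes" rule; the function names are hypothetical and nothing here calls a MarkLogic API.

```python
def has_quorum(configured_hosts: int, active_hosts: int) -> bool:
    """Return True when strictly more than half of the configured hosts are active.

    configured_hosts counts every host in the cluster configuration,
    whether or not it is currently running.
    """
    return 2 * active_hosts > configured_hosts


def hosts_needed_for_quorum(configured_hosts: int) -> int:
    """Smallest number of active hosts that is more than 50% of the total."""
    return configured_hosts // 2 + 1


# A 4-host cluster with 2 hosts up sits at exactly 50%, which is not a quorum.
print(has_quorum(4, 2))            # False
print(hosts_needed_for_quorum(4))  # 3
```

Both remedies are visible in the arithmetic: bringing a third host up satisfies the 4-host threshold, and so does removing one missing host from the configuration, since 2 of 3 is more than 50%.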
You can read more about quorum at the following knowledgebase articles:
Wait replication as a result of mixed file permissions
The root MarkLogic process is simply a restarter process which waits for the non-root (daemon) process to exit. If the daemon process exits abnormally, for any reason, the root process will fork and exec another process under the daemon process. The root process runs no XQuery scripts, opens no sockets, and accesses no database files.
While it's possible to run the MarkLogic process as a non-root user, be very careful about forest file permissions. If your configured MarkLogic user doesn't have the necessary permissions, you might see wait replication and an inability to correctly fail over to local disk failover forests when necessary. In that case, you'll need to set your forest file permissions correctly to move forward. You can read more about running the MarkLogic process as a non-root user at:
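Separately from the reference above, one way to spot mixed ownership is to walk the forest directory and list anything not owned by the configured MarkLogic user. The sketch below is a generic file-system check, not a MarkLogic tool. The function name is hypothetical, and the directory and user name mentioned afterwards are assumptions to adjust for your installation.

```python
import os
from typing import List


def paths_not_owned_by(root: str, uid: int) -> List[str]:
    """Return every directory and file under root whose owner is not uid."""
    mismatched = []
    for dirpath, _dirnames, filenames in os.walk(root):
        # Check the directory itself, then each file directly inside it.
        candidates = [dirpath] + [os.path.join(dirpath, name) for name in filenames]
        for path in candidates:
            # lstat inspects the entry itself rather than following symlinks.
            if os.lstat(path).st_uid != uid:
                mismatched.append(path)
    return sorted(mismatched)
```

For example, assuming the forests live under /var/opt/MarkLogic/Forests and the server runs as the user daemon, calling paths_not_owned_by("/var/opt/MarkLogic/Forests", pwd.getpwnam("daemon").pw_uid) would list the paths whose ownership needs attention. An empty list means ownership is consistent. It does not by itself prove that the permission bits are correct.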
Wait replication due to upgrading in the wrong order
Per our documentation, when upgrading you must first upgrade your replica environment, then subsequently upgrade your master environment.
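If your cluster upgrades aren't done in the correct order, you're going to need to manually run http://(hostname of node hosting the Security forest):8001/security-upgrade-go.xqy?force=true.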
You can read more about upgrading environments using database replication at:
Wait replication because you downgraded
MarkLogic Server does not support downgrades. If you do attempt to downgrade your installation, your replica forests will be stuck in wait replication.
What to do about it? As in the case of upgrading in the wrong order, you'll need to manually run http://(hostname of node hosting the Security forest):8001/security-upgrade-go.xqy?force=true. You can read more about MarkLogic Server and downgrades at:
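Separately from the reference above, the forced security upgrade URL can be assembled for a given host with a small helper. This is a sketch that only builds the string quoted in this article. The function name and the example hostname are hypothetical, and the request itself still has to be issued by an authenticated administrator.

```python
from urllib.parse import urlencode, urlunsplit


def security_upgrade_url(security_host: str, port: int = 8001) -> str:
    """Build the forced security-upgrade URL for the host holding the Security forest."""
    query = urlencode({"force": "true"})
    return urlunsplit(("http", f"{security_host}:{port}", "/security-upgrade-go.xqy", query, ""))


# Hypothetical hostname, for illustration only.
print(security_upgrade_url("ml-node-1.example.com"))
```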
Wait replication because your master and replica forest names don't match
By default, the "Connect Forests by Name" option is set to true. This means the server has certain expectations around how master and replica forests should be named.
What to do about it? Set "Connect Forests by Name" to false, then manually connect master and replica forests. You can read more about wait replication due to forest name mismatch at:
Wait replication as a result of merge blackouts (completely disabled merges)
What is merging and why do we need merge blackouts?
MarkLogic Server performs lazy deletes: a deleted document is marked obsolete but is not physically removed. Merges are when obsolete documents are actually deleted, in bulk, while also optimizing your data. Merge blackouts prevent this deferred deletion and optimization from happening.
Merge blackouts can also sometimes result in wait replication. Consider a database that has both master and local disk failover forests, where you have configured a merge blackout with the “disable merges completely” option (instead of the “limit merges to” option). If any node holding some of these forests fails during the merge blackout period, then as soon as the failed node comes back online, all the forests associated with that node go into the “wait replication” state until the merge blackout period ends or the blackout is manually removed.
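The condition described above can be modelled as a simple predicate. This is a hedged sketch of the article's description only. The class and function names are hypothetical and do not correspond to any MarkLogic API.

```python
from dataclasses import dataclass
from datetime import datetime
from typing import Iterable


@dataclass
class MergeBlackout:
    start: datetime
    end: datetime
    # True models "disable merges completely"; False models "limit merges to".
    disables_merges_completely: bool


def waits_for_replication(node_back_online_at: datetime,
                          blackouts: Iterable[MergeBlackout]) -> bool:
    """True if a failed node rejoined during a blackout that fully disables merges."""
    return any(
        b.disables_merges_completely and b.start <= node_back_online_at < b.end
        for b in blackouts
    )


blackout = MergeBlackout(datetime(2020, 9, 8, 22), datetime(2020, 9, 9, 2), True)
print(waits_for_replication(datetime(2020, 9, 8, 23), [blackout]))  # True
```

Under this model, the forests on the returning node stay in wait replication until the blackout's end time passes or the blackout is removed from the list.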