"Wait Replication" scenarios and their resolutions
07 July 2020 06:34 PM
Forests in MarkLogic Server may be in one of several mount states. After mounting, local disk failover forests and database replication forests should eventually reach the sync replicating or async replicating state. On occasion, however, these forests get stuck in the wait replication state. This knowledgebase article itemizes many of these wait replication scenarios, along with the operational tactics to use in response.
Wait replication scenarios
Wait replication as a result of lack of quorum
A quorum in MarkLogic Server is more than 50% of the total number of nodes in the cluster. It's very important to note that this is the total number of configured nodes, regardless of group membership, forest assignment, or whether nodes are running: if a machine exists in the hosts.xml configuration file and in the list of hosts in the Admin UI, it contributes to the total count.
While it's possible to run a MarkLogic cluster with only a subset of the configured nodes up, it's not a recommended configuration. In addition, if the number of active nodes in your cluster falls to 50% or less of the configured total, you might run into forests in the wait replication state due to the lack of quorum.
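As a rough illustration of the quorum rule described above, the following minimal sketch checks whether a given number of active hosts is strictly more than 50% of the configured hosts. The function name and parameters are hypothetical and are not part of any MarkLogic API; the host counts would have to come from your own cluster configuration.

```python
def has_quorum(configured_hosts: int, active_hosts: int) -> bool:
    """Return True when strictly more than 50% of the configured hosts are active.

    configured_hosts counts every host in the cluster configuration,
    whether or not it is currently running.
    """
    if configured_hosts <= 0:
        raise ValueError("configured_hosts must be positive")
    if not 0 <= active_hosts <= configured_hosts:
        raise ValueError("active_hosts must be between 0 and configured_hosts")
    # Integer comparison, equivalent to active / configured > 0.5,
    # without any floating-point rounding.
    return 2 * active_hosts > configured_hosts
```

Under this rule, a 4-host cluster with 2 hosts up does not have quorum, because exactly 50% is not more than 50%.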
What to do about it? You'll need to alter your cluster's configuration to meet the quorum requirement. That can mean either removing missing nodes from the cluster's configuration (essentially telling the cluster to stop looking for those missing nodes), or alternatively bringing up nodes that are currently part of the configuration, but not actively returning heartbeats (effectively letting the cluster see nodes it expects to be there).
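The arithmetic behind the two options above can be sketched as follows. This is only an illustration of the "more than 50% of configured hosts" rule; the function names are hypothetical, and the actual removal or restart of hosts is done through your normal administrative procedures.

```python
def hosts_to_start(configured: int, active: int) -> int:
    """Minimum number of additional configured hosts that must come online."""
    needed = configured // 2 + 1  # smallest count strictly above 50%
    return max(0, needed - active)


def hosts_to_remove(configured: int, active: int) -> int:
    """Minimum number of missing hosts to remove from the configuration."""
    if active < 1:
        raise ValueError("at least one active host is required")
    # After removing r missing hosts the rule is 2 * active > configured - r,
    # so the smallest r is configured - 2 * active + 1 (never below zero).
    return max(0, configured - 2 * active + 1)
```

For example, with 5 configured hosts and only 2 active, either 1 more host must come online, or 2 of the missing hosts must be removed from the configuration (leaving 2 active out of 3 configured).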
You can read more about quorum at the following knowledgebase articles:
Wait replication as a result of mixed file permissions
The root MarkLogic process is simply a restarter process which waits for the non-root (daemon) process to exit. If the daemon process exits abnormally, for any reason, the root process will fork and exec another process under the daemon process. The root process runs no XQuery scripts, opens no sockets, and accesses no database files.
While it's possible to run the MarkLogic process as a non-root user, be very careful about forest file permissions. If your configured MarkLogic user doesn't have the necessary permissions on the forest files, you might see forests in wait replication and an inability to fail over correctly to local disk failover forests when necessary. In that case you'll need to set your forest file permissions correctly to move forward. You can read more about running the MarkLogic process as a non-root user at:
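One way to look for mixed ownership in a forest directory is sketched below. This is a minimal example under stated assumptions: it only compares file owners against an expected user ID and does not inspect mode bits, and the forest path and user name in the usage note are placeholders rather than values taken from this article.

```python
import os
from typing import Iterable, Iterator, List, Tuple


def iter_owners(root: str) -> Iterator[Tuple[str, int]]:
    """Yield (path, owner uid) for root and everything beneath it."""
    yield root, os.stat(root).st_uid
    for dirpath, dirnames, filenames in os.walk(root):
        for name in dirnames + filenames:
            path = os.path.join(dirpath, name)
            # lstat reports the owner of a symlink itself, not of its target.
            yield path, os.lstat(path).st_uid


def mismatched_owners(entries: Iterable[Tuple[str, int]], expected_uid: int) -> List[str]:
    """Return the paths whose owner differs from expected_uid."""
    return [path for path, uid in entries if uid != expected_uid]
```

On a Unix-like system, a call might look like `mismatched_owners(iter_owners("/path/to/forest"), pwd.getpwnam("daemon").pw_uid)` after `import pwd`, where both the path and the user name are hypothetical and should be replaced with your own forest directory and configured MarkLogic user. An empty result means every entry is owned by the expected user.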
Wait replication due to upgrading in the wrong order
Per our documentation, when upgrading you must first upgrade your replica environment, and only then upgrade your master environment.
If your cluster upgrades aren't done in the correct order, you're going to need to:
You can read more about upgrading environments using database replication at:
Wait replication because you downgraded
MarkLogic Server does not support downgrades. If you do attempt to downgrade your installation, your replica forests will be stuck in wait replication.
What to do about it? As in the case of upgrading in the wrong order, you'll need to manually run http://(hostname of node hosting the Security forest):8001/security-upgrade-go.xqy?force=true. You can read more about MarkLogic Server and downgrades at:
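The forced security upgrade URL given above can be assembled for a specific host as in the sketch below. The function name is hypothetical, the default port is the 8001 shown in the URL above, and the example deliberately only builds the URL: actually requesting it requires administrative credentials, which are not shown here.

```python
from urllib.parse import urlencode, urlunsplit


def security_upgrade_url(security_host: str, port: int = 8001) -> str:
    """Build the forced security-upgrade URL for the host holding the Security forest."""
    query = urlencode({"force": "true"})
    netloc = f"{security_host}:{port}"
    return urlunsplit(("http", netloc, "/security-upgrade-go.xqy", query, ""))
```

For a hypothetical host named `ml-node-1.example.com`, this yields `http://ml-node-1.example.com:8001/security-upgrade-go.xqy?force=true`.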
Wait replication because your master and replica forest names don't match
By default, the "Connect Forests by Name" option is set to true. This means the server expects master and replica forest names to match, and forests whose names don't match will not be connected to one another automatically.
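A quick way to spot a naming mismatch is to compare the two lists of forest names. The sketch below assumes that "matching" means the master and replica forest names are identical strings, which is an assumption made for illustration; the function name and the sample forest names are hypothetical.

```python
from typing import Iterable, Set, Tuple


def unmatched_forests(
    master_names: Iterable[str], replica_names: Iterable[str]
) -> Tuple[Set[str], Set[str]]:
    """Return (masters with no same-named replica, replicas with no same-named master)."""
    masters, replicas = set(master_names), set(replica_names)
    return masters - replicas, replicas - masters
```

For example, `unmatched_forests(["Documents-1", "Documents-2"], ["Documents-1", "Documents-2-replica"])` reports `Documents-2` on the master side and `Documents-2-replica` on the replica side as having no counterpart.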
What to do about it? Set "Connect Forests by Name" to false, then manually connect master and replica forests. You can read more about wait replication due to forest name mismatch at: