Knowledgebase: MarkLogic Server
Replacing a failed MarkLogic node in a cluster: a step by step walkthrough
03 April 2019 10:21 AM

Introduction

In this knowledgebase article, we are working on the premise that a host in your cluster has been completely destroyed, that primary forests on the failed host have failed over to their replicas, and that steps need to be taken to introduce a replacement host to get the cluster back up and running.

We start with some general assumptions for this scenario:

  • We have a 3-node cluster already configured for High Availability (including the necessary auxiliary databases)
  • The data is all contained in one database (for the sake of simplicity)
  • Each host in the cluster contains 2 primary forests and 2 replica forests

Cluster topology

Here is an overview of the cluster topology:

Host Name | Primary Forest 1 | Primary Forest 2 | Replica Forest 1 | Replica Forest 2
Host A    | Data-1           | Data-2           | Data-5-R         | Data-6-R
Host B    | Data-3           | Data-4           | Data-1-R         | Data-2-R
Host C    | Data-5           | Data-6           | Data-3-R         | Data-4-R

In addition, Host B will also contain replicas for the vital auxiliary forests: Schemas-1-R and Security-1-R. Host C will contain Schemas-2-R and Security-2-R.

Failure Scenario

Host B will be unexpectedly terminated.  As a result, the following forests for the application database will need to be detached and removed:

Host Name | Primary Forest 1 | Primary Forest 2 | Replica Forest 1 | Replica Forest 2
Host B    | Data-3           | Data-4           | Data-1-R         | Data-2-R

As Host B also contains the replica auxiliary forests (for the Security and Schemas database), these will also need to be removed before Host B can be taken out of the cluster.

Walkthrough: a step-by-step guide

The attached Query Console workspace (KB-607.xml) runs through all the necessary steps to set up a newly configured 3-node cluster for this scenario; feel free to review all 5 tabs in the workspace to gain insight into how everything is set up.

1. Overview

The cluster status before Host B is removed from the cluster is as follows; note that the Forests for Host B are all highlighted in the images below:

Status screenshots: the Schemas database, the Security database, and the "Data" database.

2. Create the Failover Scenario

To simulate the failure in this scenario, we're going to stop MarkLogic on Host B by issuing sudo service MarkLogic stop at the command prompt on that host.  You'll need to give MarkLogic some time to perform the failover.

This is what you should see after the failover has taken place:

Status screenshots: the Schemas database, the Security database, and the "Data" database.

After failover has taken place, you should see:

  • The Data database is still online
  • The Data database contains the same number of documents as it did prior to failover (200,000)
  • The four forests that were mounted on Host B are now all listed as being in an error state
  • The replica forests for the two primary forests are now showing an open state (see the verification sketch below)
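
If you want to check this from Query Console rather than the Admin GUI status pages, a minimal verification sketch is shown below. It assumes the database is called Data and uses the forest names from the topology above.

    xquery version "1.0-ml";

    (: Post-failover checks against the Data database :)
    (
      (: Fragment count - should still match the pre-failover figure (200,000 here) :)
      xdmp:eval(
        'xdmp:estimate(fn:collection())', (),
        <options xmlns="xdmp:eval">
          <database>{xdmp:database("Data")}</database>
        </options>),

      (: The replicas that took over from Host B's primary forests should report "open" :)
      for $forest in ("Data-3-R", "Data-4-R")
      return fn:concat($forest, ": ", xdmp:forest-status(xdmp:forest($forest))//*:state)
    )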

Recovery - Step 1: Detach and remove the Host B Auxiliary Forests

The first task is to ensure the two auxiliary forests for the Schemas and Security databases are removed.

Detach the Schemas Replica Forest

In the Admin GUI go to: Configure > Forests > Schemas > Configure Tab > Forest Replicas and uncheck Schemas-1-R and click ok

Note: these changes will not be applied until you have clicked on the ok button

Detach the Security Replica Forest

In the Admin GUI go to: Configure > Forests > Security > Configure Tab > Forest Replicas and uncheck Security-1-R and click ok

Note: these changes will not be applied until you have clicked on the ok button

The above steps are scripted in the first tab of the attached Query Console workspace (KB-607-Failover.xml)
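
A minimal sketch of what such a scripted detach might look like with the Admin API, assuming the forest names used in this article:

    xquery version "1.0-ml";
    import module namespace admin = "http://marklogic.com/xdmp/admin"
      at "/MarkLogic/admin.xqy";

    (: Remove the replicas that lived on the failed host from their master forests :)
    let $config := admin:get-configuration()
    let $config := admin:forest-remove-replica($config,
      xdmp:forest("Schemas"), xdmp:forest("Schemas-1-R"))
    let $config := admin:forest-remove-replica($config,
      xdmp:forest("Security"), xdmp:forest("Security-1-R"))
    return admin:save-configuration($config)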

Delete the Schemas Replica Forest

In the Admin GUI go to: Configure > Forests > Schemas-1-R > Configure Tab and click delete and follow the on-screen prompts to delete the forest configuration

Delete the Security Replica Forest

In the Admin GUI go to: Configure > Forests > Security-1-R > Configure Tab and click delete and follow the on-screen prompts to delete the forest configuration

Note: although the above steps are scripted in the second tab of the attached Query Console workspace (KB-607-Failover.xml), the admin:forest-delete builtin will not allow you to delete a forest that is currently unavailable; the call will fail with an XDMP-FORESTMNT exception.
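
For reference, a scripted configuration-only delete would look something like the sketch below; in this scenario, however, the call fails with XDMP-FORESTMNT because the forests' host is unreachable, so the delete has to be performed through the Admin GUI as described above.

    xquery version "1.0-ml";
    import module namespace admin = "http://marklogic.com/xdmp/admin"
      at "/MarkLogic/admin.xqy";

    (: Delete the forest configuration only; false() means "do not delete data".
       NOTE: this raises XDMP-FORESTMNT while the forests' host is down. :)
    let $config := admin:get-configuration()
    let $config := admin:forest-delete($config,
      (xdmp:forest("Schemas-1-R"), xdmp:forest("Security-1-R")), fn:false())
    return admin:save-configuration($config)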

Recovery - Step 2: Remove 'dead' primary forests and replicas and reinstate failed over forests as master forests

Start by disabling the rebalancer on the database until the problem has been completely resolved; to do this, go to Configure > Databases > Data > Configure Tab and set enable rebalancer to false.  This will stop any documents from being moved around while the maintenance work is carried out:

The above step is scripted in the third tab of the attached Query Console workspace (KB-607-Failover.xml)
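
A minimal scripted equivalent (assuming the database is named Data, as in this walkthrough) could look like this:

    xquery version "1.0-ml";
    import module namespace admin = "http://marklogic.com/xdmp/admin"
      at "/MarkLogic/admin.xqy";

    (: Disable the rebalancer on the Data database for the duration of the maintenance work :)
    let $config := admin:get-configuration()
    let $config := admin:database-set-rebalancer-enable($config,
      xdmp:database("Data"), fn:false())
    return admin:save-configuration($config)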

Detach and delete the 'dead' replicas

We're going to start by removing the Data-1-R and the Data-2-R replica forests from the database.

Go to Configure > Forests > Data-1 > Configure Tab and uncheck the entry under forest replicas to remove the Data-1-R replica from the Data-1 forest:

Go to Configure > Forests > Data-2 > Configure Tab and uncheck the entry under forest replicas to remove the Data-2-R replica from the Data-2 forest:

The above step is scripted in the fourth tab of the attached Query Console workspace (KB-607-Failover.xml)
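
This mirrors the earlier replica detach; a sketch for these two forests, again assuming the forest names from the topology above:

    xquery version "1.0-ml";
    import module namespace admin = "http://marklogic.com/xdmp/admin"
      at "/MarkLogic/admin.xqy";

    (: Detach the 'dead' replicas from their master forests on Host A :)
    let $config := admin:get-configuration()
    let $config := admin:forest-remove-replica($config,
      xdmp:forest("Data-1"), xdmp:forest("Data-1-R"))
    let $config := admin:forest-remove-replica($config,
      xdmp:forest("Data-2"), xdmp:forest("Data-2-R"))
    return admin:save-configuration($config)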

Go to Configure > Forests > Data-1-R > Configure Tab and use the delete button to remove the forest:

Note that the confirmation screen will force you to perform a configuration-only delete as the original forest data is no longer available.  Click ok to confirm:

Go to Configure > Forests > Data-2-R > Configure Tab and use the delete button to remove the forest:

Again, the confirmation screen will force you to perform a configuration-only delete as the original forest data is no longer available.  Click ok to confirm:

Note: although the above steps are scripted in the fifth tab of the attached Query Console workspace (KB-607-Failover.xml), the admin:forest-delete builtin will not allow you to delete a forest that is currently unavailable; the call will fail with an XDMP-FORESTMNT exception.

At this stage, the database should still be completely available and you should now see 2 error messages reported on the database status page (Configure > Databases > Data > Status Tab):

Remove the replica relationships, detach forests Data-3 and Data-4, and attach the replicas as master forests

The next step will cause a small outage while the configuration changes are being made.

First, we need to remove the replicas (Data-3-R and Data-4-R) from their respective master forests so we can attach them to the database as primary forests.  To do this:

Using the Admin GUI go to Configure > Forests > Data-3 > Configure Tab and under the forest replicas section, uncheck Data-3-R to remove it as a replica:

Go to Configure > Forests > Data-4 > Configure Tab and under the forest replicas section, uncheck Data-4-R to remove it as a replica:

The above step is scripted in the sixth tab of the attached Query Console workspace (KB-607-Failover.xml)

Now go to Configure > Databases > Data > Forests > Configure Tab:

  • Uncheck Data-3 and Data-4 to remove them from the database
  • Check Data-3-R and Data-4-R to add them to the database
  • Click ok to save the changes

The above step is scripted in the seventh tab of the attached Query Console workspace (KB-607-Failover.xml)
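
A sketch of this swap using the Admin API (forest and database names assumed to match the topology above):

    xquery version "1.0-ml";
    import module namespace admin = "http://marklogic.com/xdmp/admin"
      at "/MarkLogic/admin.xqy";

    (: Swap the failed masters out of the Data database and attach
       the surviving replicas in their place :)
    let $config := admin:get-configuration()
    let $db := xdmp:database("Data")
    let $config := admin:database-detach-forest($config, $db, xdmp:forest("Data-3"))
    let $config := admin:database-detach-forest($config, $db, xdmp:forest("Data-4"))
    let $config := admin:database-attach-forest($config, $db, xdmp:forest("Data-3-R"))
    let $config := admin:database-attach-forest($config, $db, xdmp:forest("Data-4-R"))
    return admin:save-configuration($config)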

You should now see that there are no further errors reported on the database status page for the Data database:

Delete the configuration for Data-3 and Data-4

We now need to delete the configuration for the Data-3 and Data-4 forests before we can safely remove the 'dead' host from the cluster.

Go to Configure > Forests > Data-3 > Configure Tab and use the delete button to remove the forest:

Click ok to confirm the deletion of the configuration information:

Go to Configure > Forests > Data-4 > Configure Tab and use the delete button to remove the forest:

Click ok to confirm the deletion of the configuration information:

Note: although the above steps are scripted in the eighth tab of the attached Query Console workspace (KB-607-Failover.xml), the admin:forest-delete builtin will not allow you to delete a forest that is currently unavailable; the call will fail with an XDMP-FORESTMNT exception.

You can now safely remove the host from the cluster.

Recovery - Step 3: Remove 'dead' host configuration

Using the Admin GUI go to Configure > Hosts to view the current cluster topology:

Note that the status for the 'dead' host is disconnected and there are no Forests listed for that host. Click on the hostname for that host to get to the configuration.

From there you can use the remove button, taking care to ensure that you're editing the configuration for the correct host (the host name field will tell you):

Read the warning and confirm the action using the ok button:

After the restart, you should verify that there are only two hosts available in the cluster:

Recovery - Step 4: Adding the replacement host to the cluster

Install MarkLogic Server on your new host, initialize it, and join it to the existing cluster.

Adding the missing forests to the new host

From the Admin GUI on your newly added host go to: Configure > Forests > Create Tab and manually add the 6 forests that were deleted in earlier steps:

Important Note: the hostname listed next to host indicates the host on which these forests will be created.

This step is scripted in the first tab of the attached Query Console workspace (KB-607-Recovery.xml)
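
A minimal sketch of the forest creation (run it on the new host, or replace xdmp:host() with xdmp:host("new-host-name"), where "new-host-name" is a placeholder for your replacement host's name):

    xquery version "1.0-ml";
    import module namespace admin = "http://marklogic.com/xdmp/admin"
      at "/MarkLogic/admin.xqy";

    (: Recreate the six forests that were deleted earlier.  The empty
       sequence for the data directory means "use the default directory". :)
    let $host   := xdmp:host()
    let $config := admin:get-configuration()
    let $config := admin:forest-create($config, "Data-3", $host, ())
    let $config := admin:forest-create($config, "Data-4", $host, ())
    let $config := admin:forest-create($config, "Data-1-R", $host, ())
    let $config := admin:forest-create($config, "Data-2-R", $host, ())
    let $config := admin:forest-create($config, "Schemas-1-R", $host, ())
    let $config := admin:forest-create($config, "Security-1-R", $host, ())
    return admin:save-configuration($config)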

Attach the missing replica forests for the Security and Schemas databases

From the Admin GUI go to: Configure > Forests > Security > Configure Tab and add Security-1-R as a forest replica:

From the Admin GUI go to: Configure > Forests > Schemas > Configure Tab and add Schemas-1-R as a forest replica:

This step is scripted in the second tab of the attached Query Console workspace (KB-607-Recovery.xml)
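
A scripted sketch of these two replica attachments (the master forests are assumed to use the default names Security and Schemas):

    xquery version "1.0-ml";
    import module namespace admin = "http://marklogic.com/xdmp/admin"
      at "/MarkLogic/admin.xqy";

    (: Re-establish the auxiliary replicas on the new host :)
    let $config := admin:get-configuration()
    let $config := admin:forest-add-replica($config,
      xdmp:forest("Security"), xdmp:forest("Security-1-R"))
    let $config := admin:forest-add-replica($config,
      xdmp:forest("Schemas"), xdmp:forest("Schemas-1-R"))
    return admin:save-configuration($config)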

Attach the replicas for the 4 forests for the Data database

From the Admin GUI go to: Configure > Forests > Data-1 > Configure Tab and add Data-1-R as a forest replica and use the ok button to save the changes:

Go to: Configure > Forests > Data-2 > Configure Tab and add Data-2-R as a forest replica and use the ok button to save the changes:

Go to: Configure > Forests > Data-3-R > Configure Tab and add Data-3 as a forest replica and use the ok button to save the changes:

Go to: Configure > Forests > Data-4-R > Configure Tab and add Data-4 as a forest replica and use the ok button to save the changes:

This step is scripted in the third tab of the attached Query Console workspace (KB-607-Recovery.xml)
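
A sketch of the same four attachments; note the reversed roles for forests 3 and 4, where Data-3-R and Data-4-R are currently the attached masters and the newly created Data-3 and Data-4 become their replicas:

    xquery version "1.0-ml";
    import module namespace admin = "http://marklogic.com/xdmp/admin"
      at "/MarkLogic/admin.xqy";

    let $config := admin:get-configuration()
    let $config := admin:forest-add-replica($config, xdmp:forest("Data-1"), xdmp:forest("Data-1-R"))
    let $config := admin:forest-add-replica($config, xdmp:forest("Data-2"), xdmp:forest("Data-2-R"))
    (: Reversed: the -R forests are the current masters :)
    let $config := admin:forest-add-replica($config, xdmp:forest("Data-3-R"), xdmp:forest("Data-3"))
    let $config := admin:forest-add-replica($config, xdmp:forest("Data-4-R"), xdmp:forest("Data-4"))
    return admin:save-configuration($config)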

Conclusion

At the end of the process, your database status should look like this:

The only task that remains (after the new replicas have caught up) is to establish Data-3 and Data-4 as the master forests.

To do this you would need to take the following steps (sketched below):

  • Remove Data-3 and Data-4 as replica forests (for Data-3-R and Data-4-R)
  • Detach Data-3-R and Data-4-R from the database
  • Attach Data-3-R as a replica for Data-3, and Data-4-R as a replica for Data-4
  • Attach Data-3 and Data-4 back to the database
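
A minimal sketch of those steps, assuming the forest names used throughout this article (as with the earlier swap, expect a brief outage while the changes are applied):

    xquery version "1.0-ml";
    import module namespace admin = "http://marklogic.com/xdmp/admin"
      at "/MarkLogic/admin.xqy";

    let $config := admin:get-configuration()
    let $db := xdmp:database("Data")
    (: 1. Remove the reversed replica relationships :)
    let $config := admin:forest-remove-replica($config, xdmp:forest("Data-3-R"), xdmp:forest("Data-3"))
    let $config := admin:forest-remove-replica($config, xdmp:forest("Data-4-R"), xdmp:forest("Data-4"))
    (: 2. Detach Data-3-R and Data-4-R from the database :)
    let $config := admin:database-detach-forest($config, $db, xdmp:forest("Data-3-R"))
    let $config := admin:database-detach-forest($config, $db, xdmp:forest("Data-4-R"))
    (: 3. Make Data-3-R and Data-4-R replicas of Data-3 and Data-4 again :)
    let $config := admin:forest-add-replica($config, xdmp:forest("Data-3"), xdmp:forest("Data-3-R"))
    let $config := admin:forest-add-replica($config, xdmp:forest("Data-4"), xdmp:forest("Data-4-R"))
    (: 4. Attach Data-3 and Data-4 back to the database :)
    let $config := admin:database-attach-forest($config, $db, xdmp:forest("Data-3"))
    let $config := admin:database-attach-forest($config, $db, xdmp:forest("Data-4"))
    return admin:save-configuration($config)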

After doing this, your final database status should look like this:

And the cluster host status should look like this:

Remember to re-enable the rebalancer if you wish to continue using it.
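
For example (assuming the database is named Data):

    xquery version "1.0-ml";
    import module namespace admin = "http://marklogic.com/xdmp/admin"
      at "/MarkLogic/admin.xqy";

    (: Re-enable the rebalancer that was disabled at the start of the recovery :)
    let $config := admin:get-configuration()
    return admin:save-configuration(
      admin:database-set-rebalancer-enable($config, xdmp:database("Data"), fn:true()))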



Attachments 
 
 KB-607.xml (4.73 KB)
 KB-607-Failover.xml (5.36 KB)
 KB-607-Recovery.xml (2.80 KB)