Replacing a failed MarkLogic node in a cluster: a step by step walkthrough
03 April 2019 10:21 AM
Introduction

In this knowledgebase article, we are working on the premise that a host in your cluster has been completely destroyed, that primary forests on the failed host have failed over to their replicas, and that steps need to be taken to introduce a new replacement host to get the cluster back up and running. We start with some general assumptions for this scenario:
Cluster topology

Here is an overview of the cluster topology:
In addition, Host B will also contain replicas for the vital auxiliary forests: Schemas-1-R and Security-1-R. Host C will contain Schemas-2-R and Security-2-R.

Failure Scenario

Host B will be unexpectedly terminated. For the application, these Forests will need to be detached and removed:
As Host B also contains the replica auxiliary forests (for the Security and Schemas databases), these will also need to be removed before Host B can be taken out of the cluster.

Walkthrough: a step-by-step guide

The attached Query Console workspace (KB-607.xml) runs through all the necessary steps to set up a newly configured 3-node cluster for this scenario; feel free to review all 5 tabs in the workspace to gain insight into how everything is set up.

1. Overview

The cluster status before Host B is removed from the cluster is as follows; note that the Forests for Host B are all highlighted in the images below:

The Schemas Database
The Security Database
The "Data" Database

2. Create the Failover Scenario

Host B will be stopped. You'll need to give MarkLogic some time to perform the failover. To illustrate the failure in this scenario, we're going to issue

This is what you should see after the failover has taken place:

The Schemas Database
The Security Database
The "Data" Database

After failover has taken place, you should see:
Recovery - Step 1: Detach and remove the Host B Auxiliary Forests

The first task is to ensure the two auxiliary forests for the Schemas and Security databases are removed.

Detach the Schemas Replica Forest

In the Admin GUI go to Configure > Forests > Schemas > Configure Tab > Forest Replicas, uncheck Schemas-1-R and click ok.

Note: these changes will not be applied until you have clicked on the ok button.

Detach the Security Replica Forest

In the Admin GUI go to Configure > Forests > Security > Configure Tab > Forest Replicas, uncheck Security-1-R and click ok.

Note: these changes will not be applied until you have clicked on the ok button.

The above steps are scripted in the first tab of the attached Query Console workspace (KB-607-Failover.xml).

Delete the Schemas Replica Forest

In the Admin GUI go to Configure > Forests > Schemas-1-R > Configure Tab, click delete and follow the on-screen prompts to delete the forest configuration.

Delete the Security Replica Forest

In the Admin GUI go to Configure > Forests > Security-1-R > Configure Tab, click delete and follow the on-screen prompts to delete the forest configuration.

Note: the above steps are scripted in the second tab of the attached Query Console workspace (KB-607-Failover.xml). However, the admin:forest-delete builtin will not allow you to delete a forest that is currently unavailable; instead the call will fail with an XDMP-FORESTMNT exception.

Recovery - Step 2: Remove 'dead' primary forests and replicas and reinstate failed over forests as master forests

Start by disabling the rebalancer on the database until the problem has been completely resolved; to do this go to Configure > Databases > Data > Configure Tab and set enable rebalancer to false.
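For readers who prefer to script the rebalancer change, here is a minimal sketch using the Admin API. It is an independent illustration and not the contents of the attached workspace; it assumes the application database is named Data, as in this article, and that it is run in Query Console by a user with admin privileges.

```xquery
xquery version "1.0-ml";

import module namespace admin = "http://marklogic.com/xdmp/admin"
    at "/MarkLogic/admin.xqy";

(: Assumption: the application database is called "Data", as in this article. :)
let $config := admin:get-configuration()
let $db-id  := admin:database-get-id($config, "Data")
(: Switch the rebalancer off for the duration of the maintenance work. :)
let $config := admin:database-set-rebalancer-enable($config, $db-id, fn:false())
return admin:save-configuration($config)
```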
This will stop any documents from being moved around until the maintenance work has been completed.

The above step is scripted in the third tab of the attached Query Console workspace (KB-607-Failover.xml).

Detach and delete the 'dead' replicas

We're going to start by removing the Data-1-R and the Data-2-R replica forests from the database.

Go to Configure > Forests > Data-1 > Configure Tab and uncheck the entry under forest replicas to remove the Data-1-R replica from the Data-1 forest.

Go to Configure > Forests > Data-2 > Configure Tab and uncheck the entry under forest replicas to remove the Data-2-R replica from the Data-2 forest.

The above step is scripted in the fourth tab of the attached Query Console workspace (KB-607-Failover.xml).

Go to Configure > Forests > Data-1-R > Configure Tab and use the delete button to remove the forest. Note that the confirmation screen will force you to perform a configuration-only delete, as the original forest data is no longer available. Click ok to confirm.

Go to Configure > Forests > Data-2-R > Configure Tab and use the delete button to remove the forest. Again, the confirmation screen will force you to perform a configuration-only delete, as the original forest data is no longer available. Click ok to confirm.

Note: the above steps are scripted in the fifth tab of the attached Query Console workspace (KB-607-Failover.xml). However, the admin:forest-delete builtin will not allow you to delete a forest that is currently unavailable; instead the call will fail with an XDMP-FORESTMNT exception.

At this stage, the database should still be completely available and you should now see 2 error messages reported on the database status page (Configure > Databases > Data > Status Tab).

Detach forests Data-3 and Data-4, detach the replicas and re-attach the replicas as master forests

The next step will cause a small outage while the configuration changes are being made.
First, we need to remove the replicas (Data-3-R and Data-4-R) from their respective master forests so we can add them back to the database as primary forests. To do this:

Using the Admin GUI go to Configure > Forests > Data-3 > Configure Tab and under the forest replicas section, uncheck Data-3-R to remove it as a replica.

Go to Configure > Forests > Data-4 > Configure Tab and under the forest replicas section, uncheck Data-4-R to remove it as a replica.

The above step is scripted in the sixth tab of the attached Query Console workspace (KB-607-Failover.xml).

Now go to Configure > Databases > Data > Forests > Configure Tab:
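The change made on this page, as described by the section heading, is to detach the unavailable Data-3 and Data-4 forests from the Data database and attach Data-3-R and Data-4-R in their place. A minimal Admin API sketch of that change follows. It is an illustration and not the contents of the attached workspace, and it assumes that Data-3-R and Data-4-R have already been removed as replicas in the previous step.

```xquery
xquery version "1.0-ml";

import module namespace admin = "http://marklogic.com/xdmp/admin"
    at "/MarkLogic/admin.xqy";

(: Assumption: database and forest names are those used in this article. :)
let $config := admin:get-configuration()
let $db-id  := admin:database-get-id($config, "Data")

(: Detach the unavailable primaries that lived on the failed host. :)
let $config := admin:database-detach-forest(
                 $config, $db-id, admin:forest-get-id($config, "Data-3"))
let $config := admin:database-detach-forest(
                 $config, $db-id, admin:forest-get-id($config, "Data-4"))

(: Attach the former replicas so they serve as the primary forests. :)
let $config := admin:database-attach-forest(
                 $config, $db-id, admin:forest-get-id($config, "Data-3-R"))
let $config := admin:database-attach-forest(
                 $config, $db-id, admin:forest-get-id($config, "Data-4-R"))

return admin:save-configuration($config)
```

All four changes are applied to one configuration and saved once, which keeps the window in which the database is missing forests as short as possible.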
The above step is scripted in the seventh tab of the attached Query Console workspace (KB-607-Failover.xml).

You should now see that there are no further errors reported on the database status page for the Data database.

Delete the configuration for Data-3 and Data-4

We now need to delete the configuration for the Data-3 and Data-4 forests before we can safely remove the 'dead' host from the cluster.

Go to Configure > Forests > Data-3 > Configure Tab and use the delete button to remove the forest. Click ok to confirm the deletion of the configuration information.

Go to Configure > Forests > Data-4 > Configure Tab and use the delete button to remove the forest. Click ok to confirm the deletion of the configuration information.

Note: the above steps are scripted in the eighth tab of the attached Query Console workspace (KB-607-Failover.xml). However, the admin:forest-delete builtin will not allow you to delete a forest that is currently unavailable; instead the call will fail with an XDMP-FORESTMNT exception.

You can now safely remove the host from the cluster.

Recovery - Step 3: Remove 'dead' host configuration

Using the Admin GUI go to Configure > Hosts to view the current cluster topology. Note that the status for the 'dead' host is disconnected and there are no Forests listed for that host. Click on the hostname for that host to get to the configuration.
From there you can use the remove button, taking care to ensure that you're editing the configuration for the correct host (the host name field will tell you). Read the warning and confirm the action using the ok button.

After the restart, you should verify that there are only two hosts available in the cluster.

Recovery - Step 4: Adding the replacement host to the cluster

Install MarkLogic Server on your new host, initialize it and join it to the existing cluster.

Adding the missing forests to the new host

From the Admin GUI on your newly added host go to Configure > Forests > Create Tab and manually add the 6 forests that were deleted in earlier steps.

Important Note: the hostname listed next to host indicates the host on which these forests will be created.

This step is scripted in the first tab of the attached Query Console workspace (KB-607-Recovery.xml).

Attach the missing replica forests for the Security and Schemas databases

From the Admin GUI go to Configure > Forests > Security > Configure Tab and add Security-1-R as a forest replica.

From the Admin GUI go to Configure > Forests > Schemas > Configure Tab and add Schemas-1-R as a forest replica.

This step is scripted in the second tab of the attached Query Console workspace (KB-607-Recovery.xml).

Attach the replicas for the 4 forests for the Data database

From the Admin GUI go to Configure > Forests > Data-1 > Configure Tab, add Data-1-R as a forest replica and use the ok button to save the changes.

Go to Configure > Forests > Data-2 > Configure Tab, add Data-2-R as a forest replica and use the ok button to save the changes.

Go to Configure > Forests > Data-3-R > Configure Tab, add Data-3 as a forest replica and use the ok button to save the changes.

Go to Configure > Forests > Data-4-R > Configure Tab, add Data-4 as a forest replica and use the ok button to save the changes.

This step is scripted in the third tab of the attached Query Console workspace (KB-607-Recovery.xml).

Conclusion

At the end of the process, your database status should look like this:

The only task that remains (after the new replicas have caught up) is to establish Data-3 and Data-4 as the master forests. To do this you'd need to detach them as replica forests, remove Data-3-R and Data-4-R from the database, attach Data-3-R as a replica for Data-3 and Data-4-R as a replica for Data-4, and then attach Data-3 and Data-4 back to the database.

After doing this, your final database status should look like this:

And the cluster host status should look like this:

Remember to re-enable the rebalancer if you wish to continue using it.
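The final swap described above can be sketched with the Admin API as follows. This is a hedged illustration and has not been verified against a live cluster: the helper local:swap-roles is a hypothetical name introduced here, the forest and database names are those used in this article, and the script should only be run once the new replicas have caught up. Expect a brief outage while the configuration is applied.

```xquery
xquery version "1.0-ml";

import module namespace admin = "http://marklogic.com/xdmp/admin"
    at "/MarkLogic/admin.xqy";

(: Hypothetical helper: makes $primary the master forest in the database
   and $replica its replica, following the order given in the text. :)
declare function local:swap-roles(
  $config as element(configuration),
  $db-id as xs:unsignedLong,
  $primary as xs:string,
  $replica as xs:string
) as element(configuration)
{
  let $p := admin:forest-get-id($config, $primary)
  let $r := admin:forest-get-id($config, $replica)
  (: 1. Detach $primary from its current role as a replica of $replica. :)
  let $config := admin:forest-remove-replica($config, $r, $p)
  (: 2. Remove $replica from the database. :)
  let $config := admin:database-detach-forest($config, $db-id, $r)
  (: 3. Attach $replica as a replica of $primary. :)
  let $config := admin:forest-add-replica($config, $p, $r)
  (: 4. Attach $primary back to the database. :)
  return admin:database-attach-forest($config, $db-id, $p)
};

let $config := admin:get-configuration()
let $db-id  := admin:database-get-id($config, "Data")
let $config := local:swap-roles($config, $db-id, "Data-3", "Data-3-R")
let $config := local:swap-roles($config, $db-id, "Data-4", "Data-4-R")
return admin:save-configuration($config)
```

If you want to re-enable the rebalancer afterwards, admin:database-set-rebalancer-enable with fn:true() as the last argument reverses the earlier change.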