AWS Cluster Repair: Replacing a Missing EBS Volume
16 February 2021 02:32 PM
Customers using the MarkLogic AWS CloudFormation Templates may encounter a situation where someone has deleted an EBS volume that stored MarkLogic data (mounted at /var/opt/MarkLogic). Because the volume and its associated data are no longer available, the host is unable to rejoin the cluster.
Getting the host to rejoin the cluster can be complicated, but it is typically worth the effort if you are running an HA configuration with primary and replica forests.
This article details the procedures to get the host to rejoin the cluster.
Preparing the New Volume and New Host
The easiest way to create the new volume is from a snapshot of an existing host's MarkLogic data volume. This saves the work of manually copying configuration files between hosts, which would otherwise be necessary to get the host to rejoin the cluster.
In the AWS EC2 Dashboard:Elastic Block Store:Volumes section, create a snapshot of the data volume from one of the operational hosts.
Next, in the AWS EC2 Dashboard:Elastic Block Store:Snapshots section, create a new volume from the snapshot in the correct zone and note the new volume id for use later.
(optional) Update the name of the new volume to match the format of the other data volumes
(optional) Delete the snapshot
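The snapshot and volume steps above can be sketched with the AWS CLI. The volume id and availability zone below are placeholders for values from your environment:

```shell
SOURCE_VOLUME=vol-0123456789abcdef0   # placeholder: data volume of a healthy host
AZ=us-east-1a                         # placeholder: zone of the replacement instance

# Snapshot the healthy host's MarkLogic data volume
SNAP_ID=$(aws ec2 create-snapshot \
    --volume-id "$SOURCE_VOLUME" \
    --description "MarkLogic data volume for cluster repair" \
    --query SnapshotId --output text)

# Wait for the snapshot to complete, then create the new volume
# in the correct zone; note the printed volume id for later steps.
aws ec2 wait snapshot-completed --snapshot-ids "$SNAP_ID"
aws ec2 create-volume \
    --snapshot-id "$SNAP_ID" \
    --availability-zone "$AZ" \
    --query VolumeId --output text
```

The same operations can of course be performed in the EC2 console as described above.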
Edit the Auto Scaling Group associated with the missing host and increase the Desired Capacity by 1. This triggers the Auto Scaling Group to launch a replacement instance.
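If you prefer the AWS CLI, the capacity change can be made directly; the group name below is a placeholder:

```shell
# Increase Desired Capacity so the Auto Scaling Group launches a replacement instance
aws autoscaling set-desired-capacity \
    --auto-scaling-group-name MarkLogic-ASG-1 \
    --desired-capacity 2
```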
Attaching the New Volume to the New Instance
Once the instance is online and startup is complete, connect to the new instance via ssh.
Ensure MarkLogic is not running by stopping the service and checking for any remaining processes.
Remove /var/opt/MarkLogic if it exists and is mounted on the root partition.
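A minimal sketch of the two steps above on the new instance; the `mountpoint` check guards against deleting data on a correctly mounted volume:

```shell
# Stop the service and confirm nothing is left running
sudo service MarkLogic stop
pgrep -l MarkLogic || echo "no MarkLogic processes running"

# Remove /var/opt/MarkLogic only when it is NOT a separate mount,
# i.e. the directory lives on the root partition
if [ -d /var/opt/MarkLogic ] && ! mountpoint -q /var/opt/MarkLogic; then
    sudo rm -rf /var/opt/MarkLogic
fi
```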
Edit /var/local/mlcmd and update the volume id listed in the MARKLOGIC_EBS_VOLUME variable to the id of the volume created above.
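The edit can be done with any editor; a hedged sed one-liner is sketched below. The volume id is a placeholder, and you should inspect the file first in case the variable carries extra fields whose format must be preserved:

```shell
NEW_VOLUME=vol-0fedcba9876543210   # placeholder: id of the volume created above

# Replace the MARKLOGIC_EBS_VOLUME value in place, keeping a .bak copy
sudo sed -i.bak "s|^\(MARKLOGIC_EBS_VOLUME=\).*|\1${NEW_VOLUME}|" /var/local/mlcmd
grep MARKLOGIC_EBS_VOLUME /var/local/mlcmd   # confirm the change
```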
Run mlcmd to attach and mount the new volume at /var/opt/MarkLogic on the instance.
Remove the contents of /var/opt/MarkLogic/Forests (if any exist).
Run mlcmd to sync the new volume information to the DynamoDB table.
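If your AMI ships the managed-cluster mlcmd scripts under /opt/MarkLogic/mlcmd, the three steps above might look like the following. The paths and subcommand names are assumptions to verify against the scripts installed on your AMI before running:

```shell
# Subcommand names assumed from the MarkLogic managed-cluster tooling;
# verify under /opt/MarkLogic/mlcmd before running.
sudo /opt/MarkLogic/mlcmd/bin/mlcmd init-volumes-from-system   # attach and mount the new volume
df -h /var/opt/MarkLogic                                       # confirm the mount

sudo rm -rf /var/opt/MarkLogic/Forests/*                       # clear stale forest data, if any

sudo /opt/MarkLogic/mlcmd/bin/mlcmd sync-volumes-to-mdb        # sync volume info to DynamoDB
```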
Configuring MarkLogic With Empty /var/opt/MarkLogic
If you did not create your volume from a snapshot as detailed above, complete the following steps. If you created your volume from a snapshot, skip this section and continue with Configuring MarkLogic and Rejoining Existing Cluster.
Configuring MarkLogic and Rejoining Existing Cluster
Note the host-id of the missing host found in /var/opt/MarkLogic/hosts.xml.
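One way to read the host entries, assuming the standard hosts.xml layout with one element per line, is a simple grep; note the id paired with the missing host's name:

```shell
# Print host names and ids so the missing host's id can be noted
sudo grep -E '<(host-name|host-id)>' /var/opt/MarkLogic/hosts.xml
```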
Start MarkLogic and review the ErrorLog for any issues.
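On a standard Linux install the ErrorLog lives in the default location below:

```shell
# Review recent entries; use tail -f instead to follow the log
# live while forests synchronize
sudo tail -n 50 /var/opt/MarkLogic/Logs/ErrorLog.txt
```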
You should see messages about forests synchronizing (if you have local-disk failover enabled with replicas) and changing state from wait or async replication to sync replication. Once all forests are either open or sync replicating, the cluster is fully operational with the correct number of hosts.
At this point you can fail back to the primary forests on the new instances to rebalance the workload for the cluster.
You can also re-enable the xdqp ssl enabled setting by changing the value back to true on the Group Configuration page, if you disabled it as part of these procedures.
Update the Userdata In the Auto Scaling Group
To ensure that the correct volume will be attached if the instance is terminated, the Userdata needs to be updated in a Launch Configuration.
Copy the Launch Configuration associated with the missing host.
Edit the details, updating the volume id in the Userdata to match the new volume created above.
Edit the Auto Scaling Group associated with the new node
Change the Launch Configuration to the one that was just created and save the Auto Scaling Group.
Now that normal operations have been restored, it is a good time to verify that you have all the necessary database backups, and to review your backup schedule to ensure it meets your requirements.