Knowledgebase: Administration
Detecting and Reporting Failover Events
18 June 2019 11:02 AM

SUMMARY

This article will help MarkLogic Administrators to monitor the health of their MarkLogic cluster. By studying the attached scripts, you will learn how to find out which hosts are down and which forests have failed over, enabling you to take the necessary recovery actions.

Initial Setup

On a separate Linux host (not a member of the cluster), download the file attachments from this article, making sure that they all reside within the same directory.

Here is a general description of each file:

cluster-name.conf - Example configuration file used by the scripts. It holds the settings for monitoring one MarkLogic cluster.

ml-ck-for-life.sh - A very simple, low-load check that all the nodes of a cluster are up and running.

ml-ck-for-health.sh - A more detailed check for essential cluster functionality with alerting (paging and/or emails to DBAs) if warranted. This script relies on at least one external XQuery file (mon-report-failed-over-forests.xqy) and makes use of the REST MGMT API as well as REST XQuery requests.

mon-report-failed-over-forests.xqy - External XQuery file used by ml-ck-for-health.sh
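
To give a feel for the two kinds of request mentioned above, here is a minimal sketch of a Management API call on port 8002 and an XQuery evaluation on port 8000, both made with curl. The function names and the NODE value are hypothetical, and the attached scripts may well issue different requests; treat this as an illustration, not as an excerpt from ml-ck-for-health.sh.

```shell
# Hypothetical values; the real ones come from your CONF file.
NODE=node1.my-company.com
USER_PW_MGMT=rest-manager-user:rest-manager-password
USER_PW_XQ=user-name:user-password

# Ask the Management API (port 8002) for the status of the cluster's hosts.
mgmt_host_status() {
  curl --silent --anyauth --user "$USER_PW_MGMT" \
    "http://${NODE}:8002/manage/v2/hosts?view=status&format=json"
}

# Evaluate an XQuery file through the REST eval endpoint (port 8000).
# $1 is the path to the .xqy file, which curl reads and URL-encodes.
eval_xquery_file() {
  curl --silent --anyauth --user "$USER_PW_XQ" -X POST \
    --data-urlencode "xquery@$1" "http://${NODE}:8000/v1/eval"
}
```

For example, `eval_xquery_file mon-report-failed-over-forests.xqy` would submit the attached XQuery file, assuming the user in USER_PW_XQ is permitted to use the eval endpoint.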

 

Preparing the CONF File for Use on Your Cluster

Before running the scripts, the cluster-name.conf needs to be customized for your specific cluster. Start by changing the file name to match the name of your cluster, e.g.,

$ mv cluster-name.conf some-other-name.conf

Where "some-other-name" is the actual name of the cluster, or of the application that is hosted on that cluster.

Next, you will need to customize some of the variables inside the CONF file itself. Here are the contents of the cluster-name.conf file, as downloaded:

CLUSTER_NAME="CLUSTER-NAME"
CLUSTER_NODES=( node1.my-company.com node2.my-company.com node3.my-company.com )
# MarkLogic Credentials for the REST Management port - 8002
USER_PW_MGMT=rest-manager-user:re-manager-password
# MarkLogic Credentials for the XQuery eval port - 8000
USER_PW_XQ=user-name:user-password
UNIX_USER=unix-user-name
PAGE_ADDRESSES=ml.alert.page@my-company.com
MAIL_ADDRESSES=ml.alert.mail@my-company.com

---------  end of listing ---------

For CLUSTER_NAME, provide the cluster-name listed in the cluster's /var/opt/MarkLogic/clusters.xml file.

For CLUSTER_NODES, write in the host-names for each node in your cluster.

For USER_PW_MGMT, provide the user-name and password of the REST Management user, in the format name:password.

For USER_PW_XQ, provide the user-name and password of the user who will execute the XQuery scripts, in the format name:password.

For UNIX_USER, provide a local Unix username that has read, write, and execute (rwx) access rights for the directory that holds the downloaded files.

For PAGE_ADDRESSES and MAIL_ADDRESSES, provide the email addresses that should be alerted whenever there is a failover event.
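
Before the first run, it can be worth checking that none of the variables described above was left out of the CONF file. The following is a minimal sketch of such a check; check_conf is a hypothetical helper that is not part of the attachments, and it only looks for an uncommented, non-empty assignment of each variable without validating the values.

```shell
# Hypothetical helper: report any CONF variable that is missing or empty.
# The body runs in a subshell, so it leaves the caller's variables untouched.
check_conf() (
  conf_file=$1
  missing=0
  [ -r "$conf_file" ] || { echo "cannot read $conf_file" >&2; exit 2; }
  for name in CLUSTER_NAME CLUSTER_NODES USER_PW_MGMT USER_PW_XQ \
              UNIX_USER PAGE_ADDRESSES MAIL_ADDRESSES; do
    # Look for an assignment at the start of a line with a non-empty value.
    if ! grep -Eq "^${name}=.+" "$conf_file"; then
      echo "missing or empty: $name" >&2
      missing=1
    fi
  done
  exit "$missing"
)
```

Running `check_conf some-other-name.conf && echo "CONF looks complete"` prints the confirmation only when all seven variables are present.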

Periodicity

The ml-ck-for-health.sh script is designed to be run repeatedly at a fixed interval to keep tabs on system health, for example from a cron job. An interval of 5 to 120 minutes is a reasonable range. Ten minutes is a good choice if you would like to be woken up (on average) within 5 minutes of a failover event.
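
As an illustration of the ten-minute case, the sketch below builds a crontab line that changes into the script directory and runs the health check with the arguments shown under Example Usage. The install directory /opt/ml-monitor and the log file name are assumptions made for this example only; substitute your own paths and names.

```shell
# Hypothetical install directory and log file; adjust for your monitoring host.
MON_DIR=/opt/ml-monitor

# Run the health check every 10 minutes and append its output to a log file.
CRON_LINE="*/10 * * * * cd $MON_DIR && ./ml-ck-for-health.sh some-other-name.conf MY-CLUSTER-NAME >> $MON_DIR/ml-ck-for-health.log 2>&1"
echo "$CRON_LINE"

# To install it for the current user, appending to any existing crontab:
#   ( crontab -l 2>/dev/null; echo "$CRON_LINE" ) | crontab -
```

Changing into the directory first keeps the scripts, the CONF file, and the XQuery file together, which matches the setup described earlier.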

Setting up SSH Passwordless Login

Section (6), FOREST STATUS CHANGE, of the ml-ck-for-health.sh monitoring script requires ssh access to the cluster hosts, because it greps through the MarkLogic Server ErrorLogs on those hosts. To let this part of the script run without prompting for a password, set up "ssh passwordless login" between the monitoring host and all of the cluster hosts. There are many examples of how to do this on the internet, for example: http://www.tecmint.com/ssh-passwordless-login-using-ssh-keygen-in-5-easy-steps/ Alternatively, this monitoring section can be commented out.
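
For orientation, the usual OpenSSH procedure is to generate a key pair once on the monitoring host, copy the public key to each cluster host, and then confirm that a login succeeds without a prompt. The sketch below only prints those commands so they can be reviewed before being run by hand. It assumes the ssh login name is the same as UNIX_USER, which the article does not state, and the node names are the placeholders from the CONF listing.

```shell
# Hypothetical values; take them from your own CONF file.
UNIX_USER=unix-user-name
NODES="node1.my-company.com node2.my-company.com node3.my-company.com"

# Print the setup commands instead of running them.
ssh_setup_commands() {
  # 1. Generate a key pair on the monitoring host (once).
  echo "ssh-keygen -t ed25519"
  # 2. Install the public key on every cluster host.
  for node in $NODES; do
    echo "ssh-copy-id ${UNIX_USER}@${node}"
  done
  # 3. Verify: BatchMode makes ssh fail instead of asking for a password.
  for node in $NODES; do
    echo "ssh -o BatchMode=yes ${UNIX_USER}@${node} true"
  done
}
ssh_setup_commands
```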

Also regarding section (6), the “grep” command is set up to search the latest 10 minutes of the ErrorLog. If the script is configured to run less often than every 10 minutes, the “grep” command line should be adapted to cover the full period between script runs.
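
One possible way to select a longer window is sketched below. It is not the command used in ml-ck-for-health.sh, which may be built differently. The sketch assumes that each ErrorLog line begins with a timestamp of the form YYYY-MM-DD HH:MM:SS, and that GNU date and GNU grep are available, as they are on a typical Linux host. The function names are hypothetical.

```shell
# Print one "YYYY-MM-DD HH:MM" prefix per minute for the last N minutes.
# Relies on the -d option of GNU date.
minute_prefixes() {
  n=$1
  i=0
  while [ "$i" -lt "$n" ]; do
    date -d "$i minutes ago" '+%Y-%m-%d %H:%M'
    i=$((i + 1))
  done
}

# Print the lines of log file $2 that carry a timestamp from the last $1 minutes.
# grep -F treats each prefix as a fixed string; the match is not anchored
# to the start of the line.
recent_lines() {
  minute_prefixes "$1" | grep -F -f - "$2"
}
```

For a script that runs every 30 minutes, `recent_lines 30 /var/opt/MarkLogic/Logs/ErrorLog.txt` would cover the whole interval, where the log path shown is the usual Linux default and may differ on your hosts.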

Example Usage

You are now ready to execute the failover monitoring scripts! Here is how you would execute them:


$ ./ml-ck-for-health.sh some-other-name.conf MY-CLUSTER-NAME

$ ./ml-ck-for-life.sh some-other-name.conf

Here, "some-other-name" and MY-CLUSTER-NAME are your actual CONF file name and cluster name, as described above.

Monitoring Multiple Clusters

Given a monitoring machine with a directory of cluster configuration files in the style of cluster-name.conf, you can iterate through those files to monitor a suite of clusters from a single machine. It should be fairly easy to build a custom shell script that loops over the various cluster CONF files.
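
A minimal sketch of such a loop follows. monitor_all is a hypothetical wrapper, not one of the attachments. It assumes that the scripts and all CONF files live in the same directory, and that the second argument of ml-ck-for-health.sh can be taken from the CLUSTER_NAME line of each CONF file.

```shell
# Hypothetical wrapper: run both checks for every CONF file in directory $1.
# Set DRY_RUN=1 to print the commands instead of executing them.
monitor_all() (
  cd "$1" || exit 1
  for conf in *.conf; do
    [ -e "$conf" ] || continue   # the directory holds no CONF files
    # Read CLUSTER_NAME from the CONF file and strip any double quotes.
    cluster_name=$(sed -n 's/^CLUSTER_NAME=//p' "$conf" | tr -d '"')
    if [ "${DRY_RUN:-0}" = 1 ]; then
      echo "./ml-ck-for-life.sh $conf"
      echo "./ml-ck-for-health.sh $conf $cluster_name"
    else
      ./ml-ck-for-life.sh "$conf"
      ./ml-ck-for-health.sh "$conf" "$cluster_name"
    fi
  done
)
```

Running it once with DRY_RUN=1 shows which commands would be issued for each cluster before anything is executed.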

Final Thoughts and Limitations

Please be aware that the ml-ck-for-health.sh script is only partially implemented. In particular, the Replication Lag and Replication Failure sections are left as exercises for the user.

This script is presented as a backup, lowest-common-denominator monitoring solution. For a more complete solution, you should explore other options, such as Splunk or Nagios.

Attachments
 
 cluster-name.conf (0.41 KB)
 ml-ck-for-health.sh (10.82 KB)
 ml-ck-for-life.sh (0.90 KB)
 mon-report-failed-over-forests.xqy (2.77 KB)