How to handle XDQP-TIMEOUT on a busy cluster
24 February 2020 04:28 PM
When a cluster is under heavy load, the error log may show a large number of XDQP-TIMEOUT messages. Often, a subset of hosts in the cluster becomes so busy that the forests they host are unmounted and remounted repeatedly. Depending on your database and group settings, remounting a forest can be very time-consuming, because every host in the cluster is forced to do the extra work of index detection.
Every time a forest remounts, the error log will show a lot of messages like these:
This can go on for several minutes and costs more downtime than necessary, since you already know the indexes for each database.
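To gauge how often the timeouts occur before changing any settings, it can help to count them per minute. The following is a minimal Python sketch, not a MarkLogic tool. The function name `count_xdqp_timeouts` is hypothetical, and the sketch assumes each log line begins with a timestamp of the form `YYYY-MM-DD HH:MM:SS`, so that the first 16 characters identify the minute; adjust `prefix_len` if your log lines differ.

```python
from collections import Counter


def count_xdqp_timeouts(lines, marker="XDQP-TIMEOUT", prefix_len=16):
    """Count lines containing `marker`, bucketed by a leading timestamp prefix.

    Assumption: every line starts with 'YYYY-MM-DD HH:MM:SS...', so the first
    16 characters ('YYYY-MM-DD HH:MM') identify the minute it was written.
    """
    buckets = Counter()
    for line in lines:
        if marker in line:
            # Group matching lines by the minute they were logged.
            buckets[line[:prefix_len]] += 1
    return buckets
```

Passing an open file handle for the error log as `lines` returns a `Counter` keyed by minute. A burst of counts that lines up with forest remounts suggests the hosts are being declared offline under load rather than actually failing.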
Improving the situation
Here are some suggestions for improving this situation:
Repeat steps 1-4 for all active databases.
Now tweak the group settings to make the cluster less sensitive to an occasional busy host:
The database-level changes tell the server to speed up cluster startup time when a server node is perceived to be offline. The group changes cause the hosts in that group to be a little more forgiving before declaring a host to be offline, which prevents forests from being unmounted when it is not really needed.
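If you have many databases or groups, changes like these could be scripted rather than made by hand. The sketch below only builds the requests and does not send them. It rests on several assumptions that should be checked against the Management REST API documentation for your MarkLogic version: that the API listens on port 8002, that database and group properties are updated with a PUT to `/manage/v2/databases/{name}/properties` and `/manage/v2/groups/{name}/properties`, and that the relevant property names are `index-detection`, `host-timeout` and `xdqp-timeout`. The function names are hypothetical, and the timeout values are supplied by the caller; none are recommended here.

```python
import json


def database_properties_request(database, host="localhost", port=8002):
    """Build (method, url, body) to turn off index detection for one database.

    Assumption: 'index-detection' is the property name and 'none' disables it.
    """
    url = f"http://{host}:{port}/manage/v2/databases/{database}/properties"
    body = json.dumps({"index-detection": "none"})
    return "PUT", url, body


def group_properties_request(group, host_timeout, xdqp_timeout,
                             host="localhost", port=8002):
    """Build (method, url, body) to change a group's timeout settings.

    Assumption: 'host-timeout' and 'xdqp-timeout' are the property names.
    The values come from the caller; this sketch does not suggest any.
    """
    url = f"http://{host}:{port}/manage/v2/groups/{group}/properties"
    body = json.dumps({"host-timeout": host_timeout,
                       "xdqp-timeout": xdqp_timeout})
    return "PUT", url, body
```

Keeping request construction separate from sending makes it easy to review exactly what would change on each database and group before applying it with an authenticated HTTP client of your choice.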
If, after making these changes, you are still seeing XDQP-TIMEOUT messages, the next step is to contact MarkLogic Support for assistance. You should also alert your development team, in case a stray query is causing the data nodes to gather too many results.