How to handle XDQP-TIMEOUT on a busy cluster
24 February 2020 04:28 PM

Introduction

When a cluster is under heavy load, the error log may show many XDQP-TIMEOUT messages. Often, a subset of hosts in the cluster becomes so busy that the forests they host are unmounted and remounted repeatedly. Depending on your database and group settings, remounting a forest can be very time-consuming, because all hosts in the cluster are forced to do the extra work of index detection.

Forest Remounts

Every time a forest remounts, the error log will show many messages like these:

2012-08-27 06:50:33.146 Debug: Detecting indexes for database my-schemas
2012-08-27 06:50:33.146 Debug: Detecting indexes for database Triggers
2012-08-27 06:50:35.370 Debug: Detected indexes for database Last-Login: sln
2012-08-27 06:50:35.370 Debug: Detected indexes for database Triggers: sln
2012-08-27 06:50:35.370 Debug: Detected indexes for database Schemas: sln
2012-08-27 06:50:35.370 Debug: Detected indexes for database Modules: sln
2012-08-27 06:50:35.373 Debug: Detected indexes for database Security: sln
2012-08-27 06:50:35.485 Debug: Detected indexes for database my-modules: sln
2012-08-27 06:50:35.773 Debug: Detected indexes for database App-Services: sln
2012-08-27 06:50:35.773 Debug: Detected indexes for database Fab: sln
2012-08-27 06:50:35.805 Debug: Detected indexes for database Documents: ss, fp

... and so on ...

This can go on for several minutes and costs more downtime than necessary, since you already know the indexes for each database.
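To get a rough idea of how much time index detection is costing, the log lines can be summarized with a small script. The following is only a sketch: it assumes the log line format shown above, the function name `index_detection_summary` is hypothetical, and the log file path is whatever you pass on the command line.

```python
import re
import sys
from datetime import datetime

# Matches lines in the format shown above, for example:
# 2012-08-27 06:50:33.146 Debug: Detecting indexes for database my-schemas
# 2012-08-27 06:50:35.370 Debug: Detected indexes for database Triggers: sln
LINE_RE = re.compile(
    r"^(\d{4}-\d{2}-\d{2} \d{2}:\d{2}:\d{2}\.\d{3}) Debug: "
    r"(?:Detecting|Detected) indexes for database ([^:\s]+)"
)


def index_detection_summary(lines):
    """Return (first timestamp, last timestamp, database names), or None
    if no index detection messages were found."""
    stamps = []
    databases = set()
    for line in lines:
        match = LINE_RE.match(line.strip())
        if match is None:
            continue
        stamps.append(datetime.strptime(match.group(1), "%Y-%m-%d %H:%M:%S.%f"))
        databases.add(match.group(2))
    if not stamps:
        return None
    return min(stamps), max(stamps), databases


if __name__ == "__main__" and len(sys.argv) > 1:
    # Usage: python summarize.py <path to an error log file>
    with open(sys.argv[1], encoding="utf-8", errors="replace") as log:
        summary = index_detection_summary(log)
    if summary is None:
        print("No index detection messages found.")
    else:
        first, last, names = summary
        print(f"{len(names)} databases, {(last - first).total_seconds():.3f}s "
              f"between first and last index detection message")
```

Note that this reports the span between the first and last matching message in the whole file, so for a log covering several remounts it is more useful to feed it one remount episode at a time.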

Improving the situation

Here are some suggestions for improving this situation:

  1. Browse to Admin UI -> Databases -> my-database-name
  2. Set ‘index detection’ to ‘none’
  3. Set ‘expunge locks’ to ‘none’
  4. Click OK to make this change effective.

Repeat steps 1-4 for all active databases.
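If there are many databases, the same change could be scripted instead of clicked through. The sketch below assumes the MarkLogic Management REST API is reachable (commonly on port 8002) with digest authentication, and that the database properties are named `index-detection` and `expunge-locks`; verify both against the documentation for your version. The helper names, host, and credentials are hypothetical.

```python
import json
import urllib.parse
import urllib.request


def database_properties_request(base_url, database):
    """Build a PUT request that sets index detection and expunge locks
    to 'none' for one database (property names are assumptions)."""
    payload = {"index-detection": "none", "expunge-locks": "none"}
    name = urllib.parse.quote(database, safe="")
    return urllib.request.Request(
        url=f"{base_url}/manage/v2/databases/{name}/properties",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="PUT",
    )


def send(request, base_url, user, password):
    """Send a prepared request using HTTP digest authentication."""
    passwords = urllib.request.HTTPPasswordMgrWithDefaultRealm()
    passwords.add_password(None, base_url, user, password)
    opener = urllib.request.build_opener(
        urllib.request.HTTPDigestAuthHandler(passwords)
    )
    with opener.open(request) as response:
        return response.status


# Example (hypothetical host, credentials, and database names):
# base = "http://localhost:8002"
# for db in ["Documents", "my-modules"]:
#     send(database_properties_request(base, db), base, "admin", "password")
```

Building the request separately from sending it makes it easy to print and review the payload before anything is changed on the cluster.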

Now tweak the group settings to make the cluster less sensitive to an occasional busy host:

  1. Browse to Admin UI -> Groups -> E-Nodes
  2. Set ‘xdqp timeout’ to 30
  3. Set ‘host timeout’ to 90
  4. Click OK to make this change effective.
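The group timeouts above could also be applied from a script. This is a minimal sketch under the same kind of assumptions: the Management REST API is available (commonly on port 8002) with digest authentication, and the group properties are named `xdqp-timeout` and `host-timeout`. The function names, host, and credentials are hypothetical; the group name E-Nodes and the values 30 and 90 come from the steps above.

```python
import json
import urllib.parse
import urllib.request


def group_timeouts_request(base_url, group, xdqp_timeout=30, host_timeout=90):
    """Build a PUT request that sets the XDQP and host timeouts for a
    group (property names are assumptions)."""
    payload = {"xdqp-timeout": xdqp_timeout, "host-timeout": host_timeout}
    name = urllib.parse.quote(group, safe="")
    return urllib.request.Request(
        url=f"{base_url}/manage/v2/groups/{name}/properties",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="PUT",
    )


def apply_group_timeouts(base_url, group, user, password):
    """Send the timeout change using HTTP digest authentication."""
    passwords = urllib.request.HTTPPasswordMgrWithDefaultRealm()
    passwords.add_password(None, base_url, user, password)
    opener = urllib.request.build_opener(
        urllib.request.HTTPDigestAuthHandler(passwords)
    )
    with opener.open(group_timeouts_request(base_url, group)) as response:
        return response.status


# Example (hypothetical host and credentials):
# apply_group_timeouts("http://localhost:8002", "E-Nodes", "admin", "password")
```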

The database-level changes skip index detection and lock expunging when forests are mounted, which speeds up recovery when a host is perceived to be offline. The group changes make the hosts in that group a little more forgiving before declaring a host offline, which prevents forests from being unmounted when it is not really needed.

If you still see XDQP-TIMEOUT errors after making these changes, the next step is to contact MarkLogic Support for assistance. You should also alert your development team, in case a stray query is causing the data nodes to gather too many results.

Related Reading

XML Data Query Protocol (XDQP)
