Knowledgebase: MarkLogic Server
Hung Messages in the ErrorLog
05 November 2021 03:37 PM

Summary

Hung messages in the ErrorLog indicate that MarkLogic Server was blocked while waiting on host resources, typically I/O or CPU. 

Debug Level

The presence of Debug-level Hung messages in the ErrorLog does not indiciate a critical problem, but it does indicate that the server is under load and intermittently unresponsive for some period of time. A server that is logging Debug-level Hung messages should be closely monitored and the reason(s) for the hangs should be understood.  You'll get a debug message if the hang time is greater than or equal to the Group's XDQP timeout. 

Warning Level

When the duration of the Hung message is greater than or equal to two times the Group's XDQP timeout setting, the Hung message will appear at the Warning log level. Consequently, if the host is unresponsive to the rest of the cluster (that is, they have not received a heartbeat for the group's host timeout number of seconds), it may trigger a failover.

Common Causes

Hung messages in the ErrorLog have been traced back to the following root causes:

  • MarkLogic Server is installed on a Virtual Machine (VM), and
    • The VM does not have sufficient resources provisioned for peak use; or
    • The underlying hardware is not provisioned with enough resources for peak use.
  • MarkLogic Server is using disk space on a Storage Area Network (SAN) or Network Attached Storage (NAS) device, and
    • The SAN or NAS is not provisioned to handle peak load; or
    • The network that connects the host to the storage system is not provisioned to handle peak load.
  • Other enterprise level software is running on the same hardware as MarkLogic Server. MarkLogic Server is designed with the assumption that it is running on dedicated hardware.
  • A file backup or a virus scan utility is running against the same disk where forest data is stored, overloading the I/O capabilities of the storage system.
  • There is insufficient I/O bandwidth for the merging of all forests simultaneously.
  • Re-indexing overloads the I/O capabilities of the storage system.
  • A query that performs extremely poorly, or a number of such queries, caused host resource exhaustion.

Forest Failover

If the cause of the Hung message further causes the server to be unresponsive to cluster heartbeat requests from other servers in the cluster, for a duration greater than the host timeout, then the host will be considered unavailable and will be voted out of the cluster by a quorum of its peers.  If this happens, and failover is configured for forests stored on the unresponsive host, the forests will fail over.  

Debugging Tips

Look at system statistics (such as SAR data) and system logs from your server for entries that occurred during the time-span of the Hung message.  The goal is to pinpoint the resource bottleneck that is the root cause.

Provisioning Recommendation

The host on which MarkLogic Server runs needs to be correctly provisioned for peak load. 

MarkLogic recommends that your storage subsystem simultaneously support:

  •     20MB/s read throughput, per forest
  •     20MB/s write throughput, per forest

We have found that customers who are able to sustain these throughput rates have not encountered operational problems related to storage resources.

Configuration Tips

If the Hung message occurred during a I/O intensive background task (such as database backup, merge or reindexing), consider setting of decreasing the backgound IO Limit - This group level configuration controls the I/O resources that background I/O tasks will consume.

If the Hung message occurred during a database merge, consider decreasing the merge priority in the database’s Merge Policy.  For example, if the priority is set to "normal", then try decreasing it to "lower".

 

(9 vote(s))
Helpful
Not helpful

Comments (0)