Hung Messages in the ErrorLog
05 November 2021 03:37 PM
Hung messages in the ErrorLog indicate that MarkLogic Server was blocked while waiting on host resources, typically I/O or CPU.
The presence of Debug-level Hung messages in the ErrorLog does not indicate a critical problem, but it does indicate that the server is under load and intermittently unresponsive for some period of time. A server that is logging Debug-level Hung messages should be closely monitored, and the reason(s) for the hangs should be understood. A Hung message is logged at the Debug level when the hang time is greater than or equal to the Group's XDQP timeout.
When the duration of the hang is greater than or equal to two times the Group's XDQP timeout setting, the Hung message appears at the Warning log level. In addition, if the host is unresponsive to the rest of the cluster (that is, the other hosts have not received a heartbeat from it for the Group's host timeout number of seconds), a failover may be triggered.
Hung messages in the ErrorLog have been traced back to the following root causes:
If the cause of the Hung message further causes the server to be unresponsive to cluster heartbeat requests from other servers in the cluster, for a duration greater than the host timeout, then the host will be considered unavailable and will be voted out of the cluster by a quorum of its peers. If this happens, and failover is configured for forests stored on the unresponsive host, the forests will fail over.
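The thresholds described above can be summarized in a short Python sketch. This is only an illustration of the rules as stated in this article, not MarkLogic code: the function names are hypothetical, and the timeout values in the usage lines are illustrative, so substitute the values configured for your own Group.

```python
from typing import Optional


def hung_log_level(hang_seconds: float, xdqp_timeout: float) -> Optional[str]:
    """Return the log level a hang of this duration would be reported at.

    Returns None when the hang is shorter than the XDQP timeout, in which
    case no Hung message is expected.
    """
    if hang_seconds >= 2 * xdqp_timeout:
        return "Warning"
    if hang_seconds >= xdqp_timeout:
        return "Debug"
    return None


def may_trigger_failover(hang_seconds: float, host_timeout: float) -> bool:
    """True when the host has been unresponsive for longer than the host
    timeout, so its peers may vote it out of the cluster."""
    return hang_seconds > host_timeout


# Illustrative values only; read the real ones from your Group configuration.
XDQP_TIMEOUT = 10
HOST_TIMEOUT = 30

for seconds in (5, 12, 25, 45):
    print(seconds, hung_log_level(seconds, XDQP_TIMEOUT),
          may_trigger_failover(seconds, HOST_TIMEOUT))
```

Whether forests actually fail over still depends on failover being configured for the forests on the unresponsive host.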
Look at system statistics (such as SAR data) and system logs from your server for entries that occurred during the time-span of the Hung message. The goal is to pinpoint the resource bottleneck that is the root cause.
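To line up Hung messages with SAR data, it can help to extract the time span of each hang from the ErrorLog first. The Python sketch below makes two assumptions that you should check against your own logs: that each entry looks like `<timestamp> <Level>: Hung <N> sec`, and that the message is written when the hang ends, so the hang began N seconds earlier. The sample lines are made up for illustration, and the function name is hypothetical.

```python
import re
from datetime import datetime, timedelta

# Assumed ErrorLog line format; adjust the pattern if your logs differ.
HUNG_RE = re.compile(
    r"^(?P<ts>\d{4}-\d{2}-\d{2} \d{2}:\d{2}:\d{2}\.\d{3}) "
    r"(?P<level>Debug|Warning): Hung (?P<secs>\d+) sec"
)


def hung_windows(lines):
    """Return (start, end, level) tuples for every Hung message found."""
    windows = []
    for line in lines:
        match = HUNG_RE.match(line)
        if match is None:
            continue
        end = datetime.strptime(match.group("ts"), "%Y-%m-%d %H:%M:%S.%f")
        # Assumption: the message is logged at the end of the hang.
        start = end - timedelta(seconds=int(match.group("secs")))
        windows.append((start, end, match.group("level")))
    return windows


# Made-up sample lines, for illustration only.
sample = [
    "2021-11-05 15:37:10.468 Warning: Hung 22 sec",
    "2021-11-05 15:40:00.000 Info: some unrelated entry",
]

for start, end, level in hung_windows(sample):
    print(f"{level}: inspect system statistics from {start} to {end}")
```

The resulting windows can then be compared with the I/O wait, CPU, and memory figures recorded for the same intervals.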
The host on which MarkLogic Server runs needs to be correctly provisioned for peak load.
MarkLogic recommends that your storage subsystem simultaneously support:
We have found that customers who are able to sustain these throughput rates have not encountered operational problems related to storage resources.
If the Hung message occurred during an I/O-intensive background task (such as a database backup, merge, or reindexing), consider setting or decreasing the background I/O limit. This group-level configuration controls the I/O resources that background I/O tasks will consume.
If the Hung message occurred during a database merge, consider decreasing the merge priority in the database’s Merge Policy. For example, if the priority is set to "normal", then try decreasing it to "lower".