Knowledgebase:
Understanding Slow Infrastructure Notifications
14 November 2017 06:37 PM

MarkLogic Server Logging: Slow Infrastructure Notifications

Introduction

As of MarkLogic Server versions 8.0-6.8 and 9.0-3, MarkLogic Server has added logging to note and help diagnose slowness.

Details

MarkLogic Server makes extensive use of file system and network resources of each host within the cluster. Because the performance of MarkLogic Server is dependent on the performance of the file systems and network, information is written to the server logs when interacting with these devices becomes slow. Effective with the release of MarkLogic Server versions 8.0-6.8/9.0-3 and beyond, the server will log an error message when an XDQP network protocol send packet, a file system journal write, a file system label write, or a mapped file sync takes longer than one second to complete. For operations where the size of the data involved is known, the write or send performance slower than 1 MB per second will be logged. Size is taken into consideration so that a large write or send doesn’t generate a log entry merely as a result of its size. Additionally, logging of these messages is contingent on the slow operation being less than half the average rate for the particular operation: the slowness of the operation has to be somewhat sudden for the slowdown to be of note. The average rate is derived from figures that can be found for forests in xdmp:forest-status(), and connection-based figures found in xdmp:host-status().

Log messages will be written to the server log file ErrorLog.txt. In order to avoid the log being unnecessarily noisy, the following criteria are considered:

  • No message for any individual file or network connection will be reported more than once a minute

  • The message is logged at Notice level if the duration of the operation is less than two seconds

  • The message is logged at Warning level if the duration of the operation is greater than two seconds

  • If a message has been logged at the Notice level, the one minute period of time during which no message will be logged will be honored even if the same target device encounters a Warning level threshold during the period

Some control of the frequency of the logging messages is available through trace events (see How to use diagnostic trace events):

  • The trace event 'No Slow Warnings' completely turns off this logging

  • The trace event 'Fewer Slow Warnings' doubles all the time limits listed above to decrease the number of log entries

  • The trace event 'More Slow Warnings' halves all the time limits listed above to increase the number of log entries

  • The trace event 'No Slow Warning Rate Threshold' removes the 1 MB/sec threshold above which a delay must take place before logging a message

  • The trace event 'No Slow Warning Interval' removes the one minute interval restriction between log messages for any individual file or network connection

Examples of some of the log messages:

2017-07-06 12:10:33.553 Warning: Slow fsync /var/opt/MarkLogic/Forests/doc-stress-F4/Label, 2.044 sec
2017-07-06 12:11:50.868 Notice: Slow open /var/opt/MarkLogic/Forests/doc-stress-F10/00000183/Label, 1.659 sec

2017-07-06 12:11:59.734 Warning: Slow utime /var/opt/MarkLogic/Forests/doc-stress-F10/Label, 2.006 sec

2017-07-06 12:11:59.735 Warning: Slow fsync /var/opt/MarkLogic/Forests/doc-stress- F10/00000182/Obsolete, 2.278 sec
2017-07-06 12:12:00.924 Notice: Slow utime /var/opt/MarkLogic/Forests/doc-stress-F19/Label, 1.061 sec

2017-07-06 11:07:49.261 Warning: Slow sync_file_range /var/opt/MarkLogic/Forests/doc-stress-R12/Journals/Journal-20170706-175428-494356-14993636680966550-16521532899203342486-9000200, 512 KB in 35.05 sec
2017-07-06 11:07:49.262 Warning: Slow fsync /var/opt/MarkLogic/Forests/doc-stress-R6/Label, 35.7 sec

2017-07-06 00:34:32.645 Notice: Slow send 172.18.19.140:17800-172.18.19.207:7999, 282.8 KB in 1.353 sec; check host rh7-intel64-80-7.example.com

While every system is different, some of the example messages above are for operations that should complete in milliseconds, so a warning of taking two seconds will often be slow by several orders of magnitude. For example, in replication configurations, fsync is performed against the Label file of each forest on each node once per second, so a slow fsync timing of two seconds is significantly out of line with expectations.

Monitoring the Logs

Slow infrastructure messages can and should be monitored as part of normal cluster management and maintenance. There is no exhaustive list that will sum up every possible resource or connection that can receive these messages, and additional entries may be added over time. Monitoring scripts or rules should take the following conditions into consideration:

  • Most of the messages will appear in the ErrorLog.txt file

  • An occasional message can appear in a specific port number error log (e.g., 8000_ErrorLog.txtwhen direct connect operations such as mlcp are used

  • The fourth token in the line will be Slow

  • The message level can be any of Info, Notice, or Warning

  • The name of the resource will follow in the fifth token

  • There will most always be a duration time, listed in seconds

  • There will usually be a rate listed in the scale most appropriate to the particular instance (B, K, MB, etc.)

  • There will always be a diagnostic pointing to the device/operation that is being deemed slow

  • There will sometimes be a diagnostic pointing to another host where the source of the problem may lie

These guidelines should aid in producing monitoring scripts or rules for rule-based log monitoring applications.

(16 vote(s))
Helpful
Not helpful

Comments (0)