System clock synchronization and XDMP-CLOCKSKEW
28 April 2020 04:14 PM
MarkLogic Server expects the system clocks to be synchronized across all the nodes in a cluster, as well as between Primary and Replica clusters. The acceptable level of clock skew (or drift) between hosts is less than 0.5 seconds. Values greater than 30 seconds will trigger XDMP-CLOCKSKEW errors and could impact cluster availability.
Cluster Hosts should use NTP to maintain proper clock synchronization.
How MarkLogic Uses Clock Time
MarkLogic hosts include a precise time of day in the XDQP heartbeat messages they send to each other. When a host processes an incoming XDQP heartbeat message, it compares the time of day in the message against its own clock. If the difference is large enough, the host reports a CLOCKSKEW in the ErrorLog.
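The comparison described above can be sketched as follows. This is a minimal illustration, not MarkLogic's internal implementation: the function name `classify_skew` and the three labels are hypothetical, and the thresholds are simply the 0.5 second and 30 second figures quoted at the top of this article.

```python
from datetime import datetime, timedelta, timezone

# Thresholds taken from the figures quoted earlier in this article.
ACCEPTABLE_SKEW = timedelta(seconds=0.5)
ERROR_SKEW = timedelta(seconds=30)


def classify_skew(heartbeat_time: datetime, local_time: datetime) -> str:
    """Classify the gap between a heartbeat timestamp and the local clock.

    Hypothetical helper for illustration only.
    """
    skew = abs(local_time - heartbeat_time)
    if skew > ERROR_SKEW:
        return "clockskew"   # large enough to be reported as XDMP-CLOCKSKEW
    if skew < ACCEPTABLE_SKEW:
        return "ok"          # within the recommended level
    return "marginal"        # above the recommended level, below the error level


now = datetime(2020, 4, 28, 16, 14, 0, tzinfo=timezone.utc)
print(classify_skew(now - timedelta(seconds=0.2), now))  # ok
print(classify_skew(now - timedelta(seconds=45), now))   # clockskew
```

Note that the receiving host can only compare the timestamp in the message with its own clock at the moment of processing, so any delay between sending and processing is indistinguishable from a real clock difference.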
MarkLogic does not thoroughly test clusters in a clock skewed configuration, as it is not a valid configuration. As a result, we do not know all of the ways that a MarkLogic Server Cluster would fail. However, there are some areas where we have noticed issues:
If MarkLogic Server detects a clock skew, it will write a message to the error log such as one of the following:
If one of these lines appears in the error log, or you see repeated XDMP-CLOCKSKEW errors over an extended time period, verify the clock skew between the hosts in the cluster. However, this warning can also appear when there is no actual clock skew, for example on a system under load, or when a failed host comes back online. In these cases the errors will typically clear within a short amount of time, once the load on the system is reduced.
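To tell a brief burst of warnings from errors that persist over an extended period, one option is to measure how long the XDMP-CLOCKSKEW lines span in the log. The sketch below assumes each log line starts with a `YYYY-MM-DD HH:MM:SS.mmm` timestamp; that prefix format, the function name `clockskew_span`, and the sample lines are assumptions for illustration, not actual MarkLogic log output.

```python
from datetime import datetime, timedelta


def clockskew_span(lines):
    """Return (count, span) for log lines that mention XDMP-CLOCKSKEW.

    Assumes the first 23 characters of each line are a timestamp such as
    '2020-04-28 16:14:00.000'; lines that do not parse are skipped.
    """
    stamps = []
    for line in lines:
        if "XDMP-CLOCKSKEW" not in line:
            continue
        try:
            stamps.append(datetime.strptime(line[:23], "%Y-%m-%d %H:%M:%S.%f"))
        except ValueError:
            continue
    if not stamps:
        return 0, timedelta(0)
    return len(stamps), max(stamps) - min(stamps)


# Synthetic sample lines, not real MarkLogic output.
sample = [
    "2020-04-28 16:14:00.000 Warning: XDMP-CLOCKSKEW (synthetic sample)",
    "2020-04-28 16:20:00.000 Info: unrelated message (synthetic sample)",
    "2020-04-28 16:34:00.000 Warning: XDMP-CLOCKSKEW (synthetic sample)",
]
count, span = clockskew_span(sample)
print(count, span)
```

A span of a few seconds around a restart or load spike is consistent with the transient case described above; a span of many minutes or hours suggests the clocks themselves should be checked.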
Time Sync Config
NTP is the recommended solution for maintaining system clock synchronization.
(1) NTP clients on Linux
The most common Linux NTP clients are ntpd and chrony. Either of these can be used to ensure your hosts stay synchronized to a central NTP time source. You can check the NTP settings and manually update the date if needed.
The instructions in the link below go over the process of checking the ntpd service and updating the date manually using the ntpdate command.
The following Server Fault article goes over the process of forcing chrony to manually update and step the time using the chronyc command.
Running the applicable command on the affected servers should resolve the CLOCKSKEW errors for the short term.
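Before or after stepping the clock, it can be useful to see how far the local clock is from NTP time. On hosts running chrony, the `chronyc tracking` command reports this on its "System time" line. The sketch below parses that line from captured output; the function name `system_time_offset` is hypothetical, and the sample text is an abbreviated excerpt in the format shown in the chrony documentation.

```python
import re

# Matches e.g. "System time     : 0.000006523 seconds slow of NTP time"
_SYSTEM_TIME = re.compile(
    r"^System time\s*:\s*([0-9.]+) seconds (fast|slow) of NTP time", re.MULTILINE
)


def system_time_offset(tracking_output: str) -> float:
    """Return the local clock offset in seconds from `chronyc tracking` output.

    Positive means the local clock is ahead of (fast of) NTP time,
    negative means it is behind (slow of) NTP time.
    """
    match = _SYSTEM_TIME.search(tracking_output)
    if match is None:
        raise ValueError("no 'System time' line found")
    seconds = float(match.group(1))
    return seconds if match.group(2) == "fast" else -seconds


sample = """Stratum         : 3
System time     : 0.000006523 seconds slow of NTP time
Last offset     : -0.000006747 seconds
"""
offset = system_time_offset(sample)
print(f"offset from NTP time: {offset:+.9f} s")
```

An absolute offset well under the 0.5 second level mentioned earlier suggests that the host's clock is not the cause of the CLOCKSKEW messages.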
If the ntpd or chrony service is not running, you can still use the ntpdate or chronyc command to update the system clock, but you will need to configure a time service to ensure accurate time is maintained and to avoid future CLOCKSKEW errors. For more information on setting up a time synchronization service, see the following KB article:
(2) NTP clients on Windows
Windows servers can be configured to retrieve time directly from an NTP server, or from a Primary Domain Controller (PDC) in the root of an Active Directory forest that is configured as an NTP server. The following link includes information on configuring NTP on a Windows server, as well as configuring a PDC as an NTP server.
(3) VMware time synchronization
If your systems are VMware virtual machines, you may need to take the additional step of disabling time synchronization of the virtual machine. By default the VMware daemon will synchronize the Guest OS to the Host OS once per minute, which may interfere with ntpd settings. Through the vSphere Admin UI, you can disable time synchronization between the Guest OS and Host OS in the virtual machine settings.
Configuring Virtual Machine Options
This will prevent regular time synchronization, but synchronization will still occur during some VMware operations, such as Guest OS boots and reboots or resuming a virtual machine. To disable VMware clock sync completely, you need to edit the .vmx file for the virtual machine and set several synchronization properties to false. Details can be found in the following VMware Blog:
Completely Disable Time Synchronization for your VM
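As a rough way to audit a .vmx file, the sketch below lists which synchronization properties are not yet disabled. The property names in `SYNC_KEYS` are an assumption based on VMware's published guidance on disabling time synchronization and should be checked against the linked article for your VMware version; the function name `enabled_sync_keys` is hypothetical.

```python
# Assumed property names; verify against VMware's documentation.
SYNC_KEYS = (
    "tools.syncTime",
    "time.synchronize.continue",
    "time.synchronize.restore",
    "time.synchronize.resume.disk",
    "time.synchronize.shrink",
    "time.synchronize.tools.startup",
)


def enabled_sync_keys(vmx_text: str) -> list:
    """Return the keys from SYNC_KEYS that are not explicitly set to 0/FALSE.

    A key that is absent from the file is reported too, on the assumption
    that an unset property leaves that synchronization active.
    """
    settings = {}
    for line in vmx_text.splitlines():
        key, sep, value = line.partition("=")
        if sep:
            settings[key.strip()] = value.strip().strip('"').lower()
    return [k for k in SYNC_KEYS if settings.get(k) not in ("0", "false")]


sample_vmx = '''tools.syncTime = "0"
time.synchronize.continue = "FALSE"
time.synchronize.restore = "TRUE"
'''
print(enabled_sync_keys(sample_vmx))
```

For a real virtual machine, the text would be read from the machine's .vmx file while the VM is powered off, since the file is rewritten by VMware while the VM is running.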
(4) AWS EC2 time synchronization
For AWS EC2 instances, if you are seeing CLOCKSKEW messages in a MarkLogic cluster, changing the clock source from the default xen to tsc may help.
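On Linux, the active and available clock sources are exposed under sysfs, so the current setting can be checked before making any change. The sketch below reads those files; the helper names `read_clocksource` and `should_switch_to_tsc` are hypothetical, and the check only reports the state without changing it.

```python
from pathlib import Path

# Standard Linux sysfs location for the first clocksource device.
CLOCKSOURCE_DIR = Path("/sys/devices/system/clocksource/clocksource0")


def read_clocksource(base: Path = CLOCKSOURCE_DIR):
    """Return (current, available) clock sources as reported by the kernel."""
    current = (base / "current_clocksource").read_text().strip()
    available = (base / "available_clocksource").read_text().split()
    return current, available


def should_switch_to_tsc(current: str, available: list) -> bool:
    """True when the host is on the xen clock source and tsc is offered."""
    return current == "xen" and "tsc" in available


if (CLOCKSOURCE_DIR / "current_clocksource").exists():
    current, available = read_clocksource()
    print("current:", current, "available:", available)
    print("candidate for tsc:", should_switch_to_tsc(current, available))
```

Switching the clock source is done outside this script (for example via the kernel command line or by writing to `current_clocksource` as root) and should follow AWS guidance for the instance type in use.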
Other sources for Clock Skew
(1) Overloaded Host leading to Clock Skew
If there is a long delay between when an XDQP heartbeat message is encoded on the sending host and when it is decoded on the receiving host, the delay will be interpreted as a CLOCKSKEW. Below are some of the combinations which can lead to CLOCKSKEW.
If you see a CLOCKSKEW message in the ErrorLog combined with other messages (Hung messages, Slow Warning), then the server is likely overloaded and thrashing. Messages reporting broken XDQP connections (
(2) XDQP Thread start fail leading to Clock Skew
When MarkLogic starts up, it tries to raise the limit on the number of processes per user on the system to at least 16384. If MarkLogic is not started as root, it can only raise the soft limit (for the number of processes per user) up to the hard limit, which could cause XDQP thread startup to fail. You can get the current setting with the shell command ulimit -u; make sure the number of processes per user is at least 16384.
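The same check can be scripted with Python's standard `resource` module, which reports both the soft and hard limits for the current user. This is a minimal sketch for Linux hosts; the helper names `nproc_limit_ok` and `can_reach_required` are hypothetical, and the script should be run as the user that starts MarkLogic.

```python
import resource

REQUIRED_NPROC = 16384  # minimum processes per user cited above


def nproc_limit_ok(soft: int, required: int = REQUIRED_NPROC) -> bool:
    """True when the soft limit is unlimited or at least `required`."""
    return soft == resource.RLIM_INFINITY or soft >= required


def can_reach_required(hard: int, required: int = REQUIRED_NPROC) -> bool:
    """True when a non-root process could raise its soft limit to `required`."""
    return hard == resource.RLIM_INFINITY or hard >= required


soft, hard = resource.getrlimit(resource.RLIMIT_NPROC)
print("soft limit ok:", nproc_limit_ok(soft))
print("hard limit allows 16384:", can_reach_required(hard))
```

If the second check is false, a non-root process cannot raise its own soft limit far enough, and the hard limit needs to be increased in the system configuration.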