Warning messages for lagging operations
21 September 2016 11:47 AM
In releases 8.0-5.3 and 7.0-6.4, we've added code to detect lagged operations and log warnings.
Fast Query Timestamp
Every forest has a notion of a "fast query timestamp", also sometimes referred to as a "nonblocking timestamp". This is the maximum timestamp at which a query can run without waiting for the forest's timestamp to advance; it indicates the most current time at which the forest has complete state to answer a query. There are several reasons for forests to have this timestamp.
The first has to do with transaction commits, during which the forest places a finger on the commit timestamp for the duration of the commit. The point of this is to ensure that queries perceive committed transactions to be atomic. There can be multiple (even many) transactions with a finger on various timestamps at any given point in time.
The second has to do with asynchronous database replication, in which case each replicated journal frame is accompanied by an appropriate fast query timestamp from the master forest, sampled when the frame was journaled. The forest in the replica database will advance its fast query timestamp to track the journal stream. If replication is interrupted for some reason, the timestamp will stay fixed until replication resumes.
There is now code to detect and warn that a forest's fast query timestamp has lagged the cluster commit timestamp by an excessive amount. For forests in a master database, this means 30 seconds. For forests in a replica database, the warning starts at 60 seconds. The complaint frequency automatically backs off to once every 5 minutes for each forest, with one final warning when the lag is no longer excessive to properly identify when the issue was resolved. The text of the warning looks like
This warning will help flag any problems with overly long transactions that can hold up queries. The warning helps flag the lag issue earlier, rather than later.
There are times when it takes a very long time to write a journal frame, which may result in a lagged timestamp. Reasons can include underprovisioned disk, oversubscribed VM environments, VM migration, etc. These incidents will now get flagged by a new warning message like the following whenever writing a journal frame exceeds 30 seconds:
Another addition is a canary thread that wakes up each second, checks how long it was asleep, and warns if it was longer than 10 seconds. That message looks like
Information on these and other events, including how to control the time limits:
Information on database replication lag: