Knowledgebase:
Performance Issues in MarkLogic Server: what they look like - and what you should do about them
01 June 2021 07:22 PM

Overview

Performance issues in MarkLogic Server typically involve either 1) unnecessary waiting on locks or 2) overlarge workloads. The goal of this knowledgebase article is to give a high-level overview of both of these classes of performance issue, along with guidelines on what each looks like - and what you should do about it.

Waiting on Locks

We often see customer applications waiting on unnecessary read or write locks. In MarkLogic Server, read-only queries run lock-free at a timestamp, while update transactions acquire read and write locks - so requests that run as updates unnecessarily can end up serialized behind locks they never needed.

What does waiting on read or write locks look like? You can see read and write lock activity in our Monitoring History dashboard at port 8002, in the Lock Rate, Lock Wait Load, Lock Hold Load, and Deadlock Wait Load displays. This scenario will typically present with low resource utilization but spikes in the read/write lock displays and high request latency.

What should you do when faced with unnecessary read or write locks? Remediation of this scenario almost always goes through optimization of the request code, the data model, or both. Additional hardware resources will not help in this case because there is no hardware resource bound present. You can learn more about data model optimizations through MarkLogic University's On-Demand courses, in particular XML and JSON Data Modeling Best Practices and Impact of Normalization: Lessons Learned.
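
As a concrete illustration - read-only queries run lock-free at a system timestamp, while update transactions acquire locks. A minimal XQuery sketch (using only built-in functions) that shows which mode the current statement is running in:

    xquery version "1.0-ml";

    (: xdmp:request-timestamp() returns the system timestamp for a
       read-only, lock-free query, and the empty sequence for an
       update transaction that acquires read/write locks. :)
    if (fn:exists(xdmp:request-timestamp()))
    then "read-only: running lock-free at a timestamp"
    else "update: acquiring read/write locks"

Relatedly, declaring the prolog option xdmp:update "false" forces a statement to run as a lock-free query - code that unexpectedly triggers update-mode detection then fails fast instead of silently taking locks.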

Relevant Knowledgebase articles:

  1. Understanding XDMP Deadlock
  2. How Do Updates Work in MarkLogic Server?
  3. Fast vs Strict Locking
  4. Read Only Queries Run at a Timestamp & Update Transactions use Locks
  5. Performance Theory: Tales From MarkLogic Support

Overlarge Workloads

Overlarge workloads typically take two forms: a. too many concurrent workloads, or b. work-intensive individual requests.

Too Many Concurrent Workloads

With regard to too many concurrent workloads - we often see clusters exhibit poor performance when subjected to many more workloads than they can reasonably handle. In this scenario, any individual workload may be fine on its own - but when the total amount of work across many concurrently running workloads is large, the end result is often oversubscription of the underlying resources.

What do too many concurrent workloads look like? You can see this scenario in our Monitoring History at port 8002, in the Disk I/O, CPU, Memory Footprint, App Server Request Rate, App Server Latency, or Task Server Queue Size displays. This scenario will typically present with spikes in both App Server Latency and App Server Request Rate, and correlated maximum-level plateaus in one or more of the aforementioned hardware resource utilization charts.
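
Beyond the dashboard, you can also sample the outstanding requests on an app server directly. A minimal sketch, assuming an app server named "MyApp" (substitute your own); the *:request-status element name comes from the xdmp:server-status structure, so verify it against the documentation for your release:

    xquery version "1.0-ml";

    (: Count the requests currently executing on one app server on
       this host. A sustained count well above your normal level,
       together with rising latency, suggests oversubscription. :)
    fn:count(
      xdmp:server-status(xdmp:host(), xdmp:server("MyApp"))
        //*:request-status
    )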

What should you do when faced with too many concurrent workloads? Remediation of this scenario almost always involves adding more of the rate-limiting hardware resource(s). This assumes, of course, that the request code and data model are both already fully optimized. If either could be further optimized, then it might be possible to serve a higher request count with the same resources - see the "Work Intensive Individual Requests" section, below. More rarely, in circumstances where traffic spikes are unpredictable - but likely - we've seen customers incorporate load shedding or traffic management techniques into their application architectures. For example, when request times pass a certain threshold, traffic is then routed through a less resource-hungry code path, as in the sketch below.
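
For illustration, here is a minimal load-shedding sketch along those lines. The local:answer helper, the 500ms threshold, and the fallback to an unfiltered xdmp:estimate are all assumptions for the example - the real threshold and cheap code path will be application-specific:

    xquery version "1.0-ml";

    declare function local:answer($query as cts:query)
    {
      (: xdmp:elapsed-time() returns how long the current request
         has been running. :)
      if (xdmp:elapsed-time() gt xs:dayTimeDuration("PT0.5S")) then
        (: Already slow - shed load by answering from the indexes
           alone with an unfiltered estimate. :)
        <result-count>{xdmp:estimate(cts:search(fn:doc(), $query))}</result-count>
      else
        (: Fast so far - run the full, more expensive search. :)
        <results>{(cts:search(fn:doc(), $query))[1 to 10]}</results>
    };

    local:answer(cts:word-query("example"))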

Note that concurrent workloads include both request workloads and maintenance activities such as merging or reindexing. If your cluster is not able to serve both requests and maintenance activities, then the remediation tactics are the same as listed above: either a. add more of the rate-limiting hardware resource(s) to serve both, or b. incorporate load shedding or traffic management techniques, such as restricting maintenance activities to periods when the necessary resources are actually available (see the sketch below).
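
As one concrete example of restricting maintenance activity, the Admin API can turn the reindexer off during peak hours and back on afterwards (for example, from a pair of scheduled tasks). A sketch, assuming a database named "Documents":

    xquery version "1.0-ml";
    import module namespace admin = "http://marklogic.com/xdmp/admin"
      at "/MarkLogic/admin.xqy";

    (: Disable the reindexer for the "Documents" database; run the
       same code with fn:true() in an off-peak task to re-enable it.
       Merge blackout periods serve the same purpose for merges. :)
    let $config := admin:get-configuration()
    let $config := admin:database-set-reindexer-enable(
                     $config, xdmp:database("Documents"), fn:false())
    return admin:save-configuration($config)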

Relevant Knowledgebase articles:

  1. When submitting lots of parallel queries, some subset of those queries take much longer - why?
  2. How reindexing works, and its impact on performance
  3. MarkLogic Server I/O Requirements Guide
  4. Sizing E-nodes
  5. Performance Theory: Tales From MarkLogic Support

Work Intensive Individual Requests

With regard to work-intensive individual requests - we often see clusters exhibit poor performance when individual requests attempt to do too much work. Too much work can mean an unoptimized query, but it can also appear when an otherwise optimized query works over a dataset that has grown past its original hardware specification.

What do work-intensive requests look like? You can see this scenario in our Monitoring History at port 8002, in the Disk I/O, CPU, Memory Footprint, App Server Request Rate, App Server Latency, or Task Server Queue Size displays. This scenario will typically present with spikes in one or more system resources (Disk I/O, CPU, Memory Footprint) and in App Server Latency. In contrast to the "Too Many Concurrent Workloads" scenario, App Server Request Rate should not exhibit a spike.

What should you do when faced with work-intensive requests? As with too many concurrent workloads, it's sometimes possible to address this situation with additional hardware resources. More typically, though, remediation in this scenario involves finding additional efficiencies via code or data model optimizations. Code optimizations can be guided by xdmp:plan() and xdmp:query-trace(), as sketched below. You can learn more about data model optimizations through MarkLogic University's On-Demand courses, in particular XML and JSON Data Modeling Best Practices and Impact of Normalization: Lessons Learned. If the increase in work is rooted in data growth, it's also possible to reduce the amount of data in play - customers pursuing this route typically run periodic data purges or use features like Tiered Storage.
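
For example - a minimal sketch of those two diagnostics, using a stand-in word query:

    xquery version "1.0-ml";

    (: xdmp:plan() shows how a search will resolve against the
       indexes, without running it in full. :)
    xdmp:plan(cts:search(fn:doc(), cts:word-query("marklogic"))),

    (: xdmp:query-trace(fn:true()) turns on query tracing for the
       current request; the trace of index resolution and filtering
       is written to the server's ErrorLog. :)
    xdmp:query-trace(fn:true()),
    (cts:search(fn:doc(), cts:word-query("marklogic")))[1 to 3]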

Relevant Knowledgebase articles:

  1. Gathering information to troubleshoot long-running queries
  2. Fast searches: resolving from the indexes vs. filtering
  3. What do I do about XDMP-LISTCACHEFULL errors?
  4. Resolving XDMP-EXPNTREECACHEFULL errors
  5. When should I look into query or data model tuning?
  6. Performance Theory: Tales From MarkLogic Support

Additional Resources

  1. Monitoring MarkLogic Guide
  2. Query Performance and Tuning Guide
  3. Performance: Understanding System Resources

 
