Knowledgebase:
Best practices for load testing and resource provisioning with MarkLogic Server
08 February 2021 01:25 PM

Introduction

MarkLogic Server delivers performance at scale, whether that scale is in the amount of data, the number of users, or the number of parallel requests. However, people do run into performance issues from time to time. Most of those performance issues can be found ahead of time through well-constructed, well-executed load testing, which in turn informs how the system should be provisioned.

There are three main aspects to load testing against and resource provisioning for MarkLogic:

  1. Building your load testing suite
  2. Examining your load testing results
  3. Addressing hot spots

Building your load testing suite

The biggest issue we see with problematic load testing suites is unrepresentative load. The inaccuracy can take the form of missing requests, missing query inputs, unanticipated query inputs, unanticipated or underestimated data growth rates, or even a population of requests that skews toward a different load profile than production traffic. For example, a given load test might heavily exercise query performance, only for ingest requests to represent the majority of traffic in production. Alternatively, one kind of query might represent the bulk of a given load test when in reality it is dwarfed by the number of invocations of a different kind of query.
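
One simple way to guard against a skewed request mix is to drive the load test from observed production proportions rather than a convenient default. Below is a minimal Python sketch of that idea; the request types and weights are hypothetical placeholders, and in practice you would derive them from access logs or production monitoring.

    import random

    # Approximate share of production traffic per request type.
    # These numbers are hypothetical; derive real ones from access logs.
    PRODUCTION_MIX = {
        "search_query": 0.55,
        "document_read": 0.25,
        "ingest": 0.15,
        "analytics_report": 0.05,
    }

    def sample_requests(n):
        """Sample n request types in proportions that mirror production traffic."""
        kinds = list(PRODUCTION_MIX.keys())
        weights = list(PRODUCTION_MIX.values())
        return random.choices(kinds, weights=weights, k=n)

    if __name__ == "__main__":
        plan = sample_requests(1000)
        for kind in PRODUCTION_MIX:
            print(kind, plan.count(kind))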

Ultimately, to be useful, a given load test needs to be representative of production traffic: the less representative it is, the less its results will tell you about how your application will actually behave in production.

Examining your load testing results

Beginning with version 7.0, MarkLogic Server ships with a Monitoring History dashboard, available from any host in your cluster on port 8002 at /history (for example, http://localhost:8002/history). The Monitoring History dashboard illustrates the usage of resources such as CPU, RAM, and disk I/O, both at the cluster level and for individual hosts. It also illustrates the occurrence of read and write locks over time. It's important to get a handle on both resource and lock usage in the course of your load test, as both will limit the performance of your application, but the way to address the resulting performance issues depends on which class of usage is most prevalent.
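
If you also want to capture resource snapshots programmatically while a load test runs, the Management API served on the same port can be queried directly. The following is a minimal Python sketch, assuming a user with Management API read access (for example, the manage-user role); the hostname and credentials are placeholders, and the exact fields in the response vary by MarkLogic version.

    import requests
    from requests.auth import HTTPDigestAuth

    MANAGE_HOST = "http://localhost:8002"        # any host in the cluster
    AUTH = HTTPDigestAuth("admin", "admin")      # placeholder credentials

    def host_status_summary():
        """Fetch a point-in-time host status summary from the Management API."""
        response = requests.get(
            f"{MANAGE_HOST}/manage/v2/hosts",
            params={"view": "status", "format": "json"},
            auth=AUTH,
        )
        response.raise_for_status()
        return response.json()

    if __name__ == "__main__":
        # Inspect the JSON for CPU, memory, and I/O figures for each host.
        print(host_status_summary())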

Addressing hot spots

By having a representative load test and closely examining your load testing results, you'll likely find hot spots: slow-performing parts of your application. MarkLogic Server's Monitoring History allows you to correlate resource and lock usage over time against the workload being submitted by your load tests. Once you find a hot spot, it's worthwhile to examine it more closely by running those requests in isolation or at larger scales. For example, you could run 4x and 16x the number of parallel requests, or 4x and 16x the number of inputs to an individual request, both of which will give you an idea of how the suspect requests scale in response to increased load.
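
As a rough illustration, the following Python sketch replays a suspect request at 1x, 4x, and 16x concurrency and reports latency at each level. The target URL is a hypothetical placeholder for whichever application endpoint your load test identified as a hot spot.

    import time
    from concurrent.futures import ThreadPoolExecutor
    import requests

    # Hypothetical endpoint standing in for the suspect request.
    TARGET_URL = "http://app-host:8010/my-app/suspect-query?q=example"

    def timed_request(_):
        start = time.perf_counter()
        requests.get(TARGET_URL).raise_for_status()
        return time.perf_counter() - start

    def run_at_concurrency(concurrency, total_requests=200):
        """Run the suspect request at a fixed concurrency and report latency."""
        with ThreadPoolExecutor(max_workers=concurrency) as pool:
            latencies = sorted(pool.map(timed_request, range(total_requests)))
        median = latencies[len(latencies) // 2]
        p95 = latencies[int(0.95 * (len(latencies) - 1))]
        print(f"concurrency={concurrency:3d}  median={median:.3f}s  p95={p95:.3f}s")

    if __name__ == "__main__":
        for factor in (1, 4, 16):
            run_at_concurrency(factor)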

Once you've found a hot spot, what should you do about it? That ultimately depends on the kind of usage you're seeing in your cluster's Monitoring History. If it's clear that your suspect requests are running into a resource bound (for example, 100% utilization of CPU, RAM, or disk I/O), then you'll need to provision more of that limiting resource (more machines, more powerful machines, or both), or else reduce the amount of load on the system as currently provisioned. It may also be possible to re-architect the suspect request to be more efficient with regard to its resource usage.

Alternatively, you may find that your system is not, in fact, hitting a resource bound: there appear to be plenty of spare CPU cycles, free RAM, and low amounts of disk I/O. If you're seeing poor performance in that situation, it's almost always the case that you'll instead see large spikes in the number of read/write locks taken as your suspect requests work through the system. Provisioning more hardware resources may help to a small degree in the presence of read/write locks, but what really needs to happen is for the requests to be re-architected to take as few locks as possible, and preferably to run completely lock-free.
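
As one illustration of reducing lock pressure, the sketch below splits a single large update into smaller independent batches so that each transaction holds write locks on fewer documents for less time. The endpoint, payload shape, and batch size are hypothetical placeholders for your own application's update request; the right re-architecture depends entirely on what your hot spot is actually doing.

    import requests

    UPDATE_ENDPOINT = "http://app-host:8010/my-app/update"  # hypothetical endpoint
    BATCH_SIZE = 100                                         # tune per your own testing

    def submit_in_batches(doc_uris, session=None):
        """Submit updates in small batches rather than one large transaction."""
        session = session or requests.Session()
        for start in range(0, len(doc_uris), BATCH_SIZE):
            batch = doc_uris[start:start + BATCH_SIZE]
            response = session.post(UPDATE_ENDPOINT, json={"uris": batch})
            response.raise_for_status()

    if __name__ == "__main__":
        uris = [f"/content/doc-{i}.json" for i in range(10_000)]
        submit_in_batches(uris)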
