RAMblings - Opinions on Scaling Memory in MarkLogic Server
10 December 2021 05:07 PM
This not-too-technical article covers a number of questions about MarkLogic Server and its use of memory:
Let’s say you have an existing MarkLogic environment that’s running acceptably well. You have made sure that it does not abuse the infrastructure on which it’s running. It meets your SLA (maybe expressed as something like “99% of searches return within 2 seconds, with availability of 99.99%”). Several things about your applications have helped achieve this success:
As such, your application’s performance is largely determined by the number of disk accesses required to satisfy any given query. Most of the processing involved is related to our major data structures:
Fulfilling a query can involve tens, hundreds, or even thousands of accesses to these data structures, which reside on disk in files within stand directories. (The triple world in particular tends to exhibit the greatest variability and computational burden.)
Of course, MarkLogic is designed so that the great majority of these accesses do not need to touch the on-disk structures. Instead, the server caches termlists, range indexes, triples, etc., which are kept in RAM in the following places:
One key notion to keep in mind is that the in-memory operations (the “hit” cases above) complete in about a microsecond of computation. The go-to-disk penalty (the “miss” cases) costs at least one disk access, which takes a handful of milliseconds plus even more computation than a hit. That represents a difference on the order of 10,000 times slower.
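To put rough numbers on that gap, here is a back-of-the-envelope sketch. The 1-microsecond and 5-millisecond figures are illustrative assumptions chosen to match the "microsecond" and "handful of milliseconds" descriptions above, not measured values:

```python
# Back-of-the-envelope: effective cost of a lookup as a function of
# cache hit ratio. Latency figures are illustrative assumptions.
HIT_US = 1.0        # in-memory hit: roughly a microsecond of computation
MISS_US = 5_000.0   # disk miss: a handful of milliseconds (5 ms assumed)

def effective_latency_us(hit_ratio: float) -> float:
    """Average cost per lookup, in microseconds."""
    return hit_ratio * HIT_US + (1.0 - hit_ratio) * MISS_US

for hr in (0.99, 0.98, 0.92):
    print(f"hit ratio {hr:.0%}: {effective_latency_us(hr):8.1f} us per lookup")
```

Even at a 98% hit ratio, the average lookup costs about 100 microseconds — the rare misses dominate, which is why cache efficiency matters so much.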
Nonetheless, you are running acceptably. Your business is succeeding and growing. However, there are a number of forces stealthily working against your enterprise continuing in this happy state.
In the best of all worlds, you have been measuring your system diligently and can sense when your response time is starting to degrade. In the worst of all worlds, you perform some kind of operational / application / server / operating system upgrade and performance falls off a cliff.
Let’s look under the hood and see how pressure is building on your infrastructure. Specifically, let’s look at consumption of memory and effectiveness of the key caching structures in the server.
Recall that the response time of a MarkLogic application is driven predominantly by how many disk operations are needed to complete a query. This, in turn, is driven by how many termlist and range index requests are initiated by the application through MarkLogic Server and how many of those do not “hit” in the List Cache and in-memory Range Indexes. Each one of those “misses” generates disk activity, as well as a significant amount of additional computation.
All the forces listed above contribute to decreasing cache efficiency, in large part because they all use more RAM. A fixed-size cache can hold only a fraction of the on-disk structure that it attempts to optimize. If the on-disk size keeps growing (a good thing, right?) then the existing cache will be less effective at satisfying requests. If more users are accessing the system, they will ask in total for a wider range of data. As applications are enriched, new on-disk structures will be needed (additional range indexes, additional index types, etc.). And when did any software upgrade use LESS memory?
There’s a caching concept from the early days of modern computing (the Sixties, before many of you were born) called the “folding ratio”. You take the total size of a data structure and divide it by the size of the “cache” that sits in front of it. This yields a dimensionless number that serves as a rough indicator of cache efficiency (and lets you track changes to it). A way to compute this for your environment is to take the total on-disk size of your database and divide it by the total amount of RAM in your cluster. Let’s say each of your nodes has 128GB of RAM and 10 disks of about 1TB each that are about half full. So, the folding ratio of each node (the shared-nothing approach of MarkLogic allows us to consider each node individually) in this configuration at this moment is (10 x 1TB x 50%) / 128GB, or about 40 to 1.
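The arithmetic from that example, written out (using the decimal TB/GB convention from the text):

```python
# Folding ratio: total on-disk data size divided by the RAM in front of it.
TB = 1_000_000_000_000
GB = 1_000_000_000

on_disk = 10 * 1 * TB * 0.50   # 10 disks, 1 TB each, about half full: 5 TB
ram = 128 * GB                  # RAM per node

folding_ratio = on_disk / ram   # ~39, i.e. "about 40 to 1"
print(f"folding ratio: about {folding_ratio:.0f} to 1")
```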
This number by itself is neither good nor bad. It’s just a way to track changes in load. As the ratio gets larger, cache hit ratio will decrease (or, more to the point, the cache miss ratio will increase) and response time will grow. Remember, the difference between a hit ratio of 98% versus a hit ratio of 92% (both seem pretty good, you say) is a factor of four in resulting disk accesses! That’s because one is a 2% miss ratio and the other is an 8% miss ratio.
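The factor-of-four claim falls straight out of the miss ratios:

```python
# Two hit ratios that both "seem pretty good" differ by 4x in disk traffic.
miss_good = 1 - 0.98   # a 98% hit ratio leaves 2% misses
miss_ok   = 1 - 0.92   # a 92% hit ratio leaves 8% misses

factor = miss_ok / miss_good
print(f"{factor:.1f}x the disk accesses")
```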
Consider the guidelines that MarkLogic provides regarding provisioning: 2 vCPUs and 8GB RAM to support a primary forest that is being updated and queried. The maximum recommended size of a single forest is about 400GB, so the folding ratio of such a forest is 400GB / 8GB, or about 50 to 1. This suggests that the configuration outlined a couple of paragraphs back is at about 80% of capacity. It would be time to think about growing RAM before too long. What will happen if you delay?
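Comparing the example node against the guideline-derived maximum (the exact arithmetic gives 78%, which the text rounds to "about 80%"):

```python
# How close is the example node to the guideline-derived folding ratio?
guideline_ratio = 400 / 8      # 400 GB forest over 8 GB RAM: 50 to 1
observed_ratio = 5_000 / 128   # 5 TB over 128 GB, from the earlier example

headroom_used = observed_ratio / guideline_ratio
print(f"about {headroom_used:.0%} of guideline capacity")
```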
Since MarkLogic is a shared-nothing architecture, the caches on any given node will behave independently from those on the other nodes. Each node will therefore exhibit its own measure of cache efficiency. Since a distributed system operates at the speed of its slowest component, it is likely that the node with the most misses will govern the response time of the cluster as a whole.
At some point, response time degradation will become noticeable and it will become time to remedy the situation. The miss ratios on your List Cache and your page-in rate for your Range Indexes will grow to the point at which your SLA might no longer be met.
Many installations are surprised by the rapidity of this degradation. But recall that the various forces mentioned above all operate in parallel, and their effects compound. The load on your caches will grow more than linearly over time. So be vigilant: measure, measure, and measure!
In the best of all possible worlds, you have a test system that mirrors your production environment and exhibits this behavior in advance of production. One approach is to experiment with reducing the memory on the test system, say by configuring VMs for a given memory size (and adjusting huge pages and cache sizes proportionately), to see where things degrade unacceptably. You could measure:
When you find the memory size at which things degrade unacceptably, use it to project the largest folding ratio that your workload can tolerate. Or you can be a bit clever and do the same additional calculations for the List Cache and anonymous memory:
[To be fair, on occasion a resource other than RAM can become oversubscribed (beyond the scope of this discussion):
Alternatively, you may know you need to add RAM because you have an emergency on your hands: you observe that MarkLogic is issuing Low Memory warnings, you have evidence of heavy swap usage, your performance is often abysmal, and/or the operating system’s OOM (out of memory) killer is often taking down your MarkLogic instance. It is important to pay attention to the warnings that MarkLogic issues, above and beyond any that come from the OS.
You need to tune your queries so as to avoid bad practices (see the discussion at the beginning of this article) that waste memory and other resources, and you will almost certainly need to add RAM to your installation. The tuning exercise can be labor-intensive and time-consuming; it is often best to throw lots of RAM at the problem to get past the emergency at hand.
So, how to add more RAM to your cluster? There are three distinct techniques:
Each of these options has its pros and cons. Various considerations:
Option 1: Scale Vertically
On the surface, this is the simplest way to go, but adding more RAM to each node means upgrading all nodes. If you already have separate e- and d-nodes, it is likely that only the d-nodes need the increased RAM.
In an on-prem (or, more properly, non-cloud) environment this is a bunch of procurement and IT work. In the worst case, your RAM is already maxed out so scaling vertically is not an option.
In a cloud deployment, the cloud provider dictates what options you have. Adding RAM may also drag along additional CPUs on every node, which requires added MarkLogic licenses as well as a larger payment to the cloud provider. The increased RAM tends to come in big chunks (often only 1.5x or 2x options); it’s generally not easy to get just the 20% more RAM (say) that you want. But this may be premature cost optimization: it may be best just to add heaps of RAM, stabilize the situation, and then scale RAM back as feasible. Once you are past the emergency, you can begin to implement longer-term strategies.
This approach also does not, in most cases, add any network bandwidth or storage bandwidth and capacity, and it runs the small risk of simply moving the bottleneck from RAM to something else.
Option 2: Scale Horizontally
This approach adds whole nodes to the existing complex. It has the net effect of adding RAM, CPU, bandwidth and capacity. It requires added licenses, and payment to the cloud provider (or a capital procurement if on-prem). The granularity of expansion can be controlled; if you have an existing cluster of (2n+1) nodes, the smallest increment that makes sense in an HA context is 2 more nodes (to preserve quorum determination) giving (2n+3) nodes. In order to make use of the RAM in the new nodes, rebalancing will be required. When the rebalancing is complete, the new RAM will be utilized.
This option tends to be optimal in terms of granularity, especially in already larger clusters. To add 20% of aggregate RAM to a 25-node cluster, you would add 6 nodes to make a 31-node cluster (maintaining the odd number of nodes for HA). You would be adding 24%, which is better than having to add 50% if you had to scale all 25 nodes by 50% because that was what your cloud provider offered.
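The node-count arithmetic above can be sketched as a small helper. This is an illustrative function (the name `nodes_to_add` is mine, not a MarkLogic tool), assuming identical nodes and the even-increment rule from the discussion of HA quorum:

```python
import math

def nodes_to_add(current_nodes: int, ram_fraction_wanted: float) -> int:
    """Smallest even number of identical nodes to add so aggregate RAM
    grows by at least ram_fraction_wanted while the cluster size stays
    odd (preserving quorum determination)."""
    needed = math.ceil(current_nodes * ram_fraction_wanted)
    return needed if needed % 2 == 0 else needed + 1

added = nodes_to_add(25, 0.20)
print(added, "nodes ->", 25 + added, f"total, +{added / 25:.0%} RAM")
```

For the 25-node example this yields 6 added nodes, a 31-node cluster, and a 24% RAM increase, matching the text.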
Option 3: Scale Functionally
Scaling functionally means adding new nodes to the cluster as e-nodes and reconfiguring the existing e/d-nodes to be d-nodes. This frees up RAM on the d-side (specifically by dramatically reducing the need for the Expanded Tree Cache and memory for query evaluation), which goes toward restoring a good folding ratio. Recent experience suggests about 15% of RAM can be recovered in this manner.
More licenses are again required, plus installation and admin work to reconfigure the cluster. You need to make sure that the network can handle the increase in XDQP traffic between e-nodes and d-nodes, but this is not typically a problem. The resulting cluster tends to run more predictably. One of our largest production clusters typically runs its d-nodes at nearly 95% memory usage as reported by MarkLogic as the first number in an error log line. It can get away with running so full because it is a classical search application whose d-node RAM usage does not fluctuate much. Memory usage on e-nodes is a different story, especially when the application uses SQL or Optic. In such a situation, on-demand allocation of large Join Tables can cause abrupt increases in memory usage. That’s why our advice for combined e/d-nodes is to run below 80% to allow headroom for query processing.
Thereafter, the two groups of nodes can be scaled independently depending on how the workload evolves.
Here are a few key takeaways from this discussion: