Summary
If you have already optimized your queries and data (removing unused indexes, dropping older data, etc.), you might be looking to size or scale your environment to ensure it meets your current and/or future requirements. This article is intended to provide high-level guidance around some of the main areas to consider when thinking about sizing or scaling a MarkLogic Server environment.
While scaling is now easier, thanks to the flexibility of virtualization and cloud technologies, we would still recommend that customers work with MarkLogic Sales and/or Professional Services teams to review and advise on any changes whenever possible. Precise sizing and scaling advice is outside the scope of the MarkLogic Support team.
MarkLogic Server Resource Requirements
MarkLogic Server is just one part of an environment – the health of a cluster depends on the health of the underlying infrastructure, such as disk I/O, network bandwidth, memory, and CPU. Therefore, as a first step, we would recommend reviewing and considering MarkLogic Server's resource needs, which are available within its Installation Guide:
Memory, Disk Space, and Swap Space Requirements
https://docs.marklogic.com/guide/installation-guide/en/requirements-and-database-compatibility/memory,-disk-space,-and-swap-space-requirements.html
Identifying Resource Contention/Starvation
You are, no doubt, already tracking the performance of queries, whether that be in your current or candidate environment, but it is also important to check for and track resource bottlenecks. Some high I/O and CPU activity, as well as increased memory utilization, may not necessarily be a cause for concern and can just indicate the system is operating properly. However, you will want to look for evidence of resource contention/starvation, which might impact cluster performance, if not now, then potentially in the near future.
The MarkLogic Server hosts will indicate issues encountered with resources in their ErrorLogs, and such messages could include details on slow infrastructure or background tasks, lagging operations, and hosts low on memory (RAM), disk space and/or other resources.
The Monitoring Dashboard and Monitoring History can be useful MarkLogic Server features to help you understand bottlenecks and what to do next. Some key areas to look for resource contention/starvation include:
Memory
Check the ErrorLogs for any Warning-level memory-related messages, which will indicate the areas involved, for example:
Warning: Memory low: forest+cache=97%phys
Warning: Memory low: huge+anon+swap+file=128%phys
Nearby "Info" level messages on the host can provide further information on the areas involved. Some potential paths for remediation for low memory situations are outlined within the following knowledgebase article:
Memory Consumption Logging and Status
https://help.marklogic.com/Knowledgebase/Article/View/memory-consumption-logging-and-status
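As a rough illustration of checking the ErrorLogs for the warnings quoted above, the following Python sketch scans log lines for "Memory low" messages and extracts the areas involved and the percentage of physical memory reported. The regular expression is derived only from the two example messages in this article; the function name and the unrelated sample line are hypothetical, and real ErrorLog lines may carry additional prefixes or other message variants.

```python
import re

# Pattern derived from the two example warnings in this article, e.g.
# "Warning: Memory low: forest+cache=97%phys". Other variants may exist.
MEMORY_LOW = re.compile(
    r"Warning: Memory low: (?P<areas>[a-z+]+)=(?P<pct>\d+)%phys"
)


def find_memory_warnings(lines):
    """Return a list of (areas, percent_of_physical) for each warning found."""
    found = []
    for line in lines:
        match = MEMORY_LOW.search(line)
        if match:
            areas = match.group("areas").split("+")
            found.append((areas, int(match.group("pct"))))
    return found


sample = [
    "Warning: Memory low: forest+cache=97%phys",
    "Info: some unrelated message",  # hypothetical non-matching line
    "Warning: Memory low: huge+anon+swap+file=128%phys",
]

warnings = find_memory_warnings(sample)
for areas, pct in warnings:
    print(f"{pct}% of physical memory: {', '.join(areas)}")
```

In practice the lines would be read from each host's ErrorLog file rather than from an in-memory list.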
For D/E-nodes, also check that the memory on each host is well-balanced between the group-level caches, in-memory content, and App Server work plus the Operating System. A "Rule of Thirds" provides a conceptual explanation of this, which is covered in the following knowledgebase article:
Group caches and Linux huge pages
https://help.marklogic.com/Knowledgebase/Article/View/15/0/group-caches-and-linux-huge-pages
A number of questions specifically on the scaling of memory are also covered in the following knowledgebase article:
RAMblings - Opinions on Scaling Memory in MarkLogic Server
https://help.marklogic.com/knowledgebase/article/View/ramblings---opinions-on-scaling-memory-in-marklogic-server
Caches
If you intend to scale physical memory, then you should consider whether MarkLogic Server's group-level caches need to be re-configured. During the installation process, MarkLogic sets memory and other settings based on the characteristics of the computer on which it is running. For the group-level caches, automatic sizing is usually recommended. However, with automatic cache sizing, hosts with more than 256GB of RAM are given the same group cache settings as a host with 256GB. These settings can be changed by switching to manual cache sizing.
Group Level Cache Settings based on RAM
https://help.marklogic.com/Knowledgebase/Article/View/group-level-cache-settings-based-on-ram
Check for queries that are contending for the caches. If the caches are not efficiently used, you will also see high I/O utilization on D-nodes. Cache hits are good, and indicate the query is running in an optimized fashion. Cache misses indicate that the query could not retrieve its results directly from the cache and had to read the data from disk. Disk I/O is expensive relative to reading from memory. Cache misses indicate that the query might be able to be optimized, either by rewriting the parts of the query that have cache misses to better take advantage of the indexes, or by adding indexes that the query can use.
A simple way to review cache hit/miss data is via the "Databases" section in the Monitoring History, which shows details for the List Cache, Expanded Tree Cache and Compressed Tree Cache. It also shows the triple-related caches, the Triple Cache and Triple Value Cache. Unlike other MarkLogic caches, these can shrink and grow, only taking up memory when entries need to be added to them. Further information on sizing caches and understanding cache statistics may be found via the following resources:
Semantic Graph Developer's Guide: Sizing Caches
https://docs.marklogic.com/guide/semantics/indexes#id_28957
Tuning Queries with query-meters and query-trace
https://docs.marklogic.com/guide/performance/query_meters
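The cache hit and miss counts discussed in this section can be reduced to a hit ratio for easier comparison over time. The following Python sketch shows the arithmetic only; the counter values and the 90% review threshold are hypothetical and are not MarkLogic recommendations.

```python
def hit_ratio(hits, misses):
    """Return hits / (hits + misses), or None when there were no lookups."""
    total = hits + misses
    if total == 0:
        return None
    return hits / total


# Hypothetical (hits, misses) counters, as might be read from monitoring data.
caches = {
    "List Cache": (950, 50),
    "Expanded Tree Cache": (600, 400),
    "Compressed Tree Cache": (0, 0),
}

REVIEW_BELOW = 0.90  # assumed threshold, for illustration only

needs_review = []
for name, (hits, misses) in caches.items():
    ratio = hit_ratio(hits, misses)
    if ratio is None:
        print(f"{name}: no lookups recorded")
        continue
    print(f"{name}: {ratio:.1%} hits")
    if ratio < REVIEW_BELOW:
        needs_review.append(name)
```

A low ratio on its own is only a prompt to look at the queries involved; it does not say which query is responsible.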
I/O Bandwidth
It is important to provision the appropriate amount of I/O bandwidth, where each forest will typically need a minimum of 20MB/sec read and 20MB/sec write. Further information on MarkLogic Server's I/O requirements may be found within the following knowledgebase article:
MarkLogic Server I/O Requirements Guide
https://help.marklogic.com/knowledgebase/article/View/11/0/marklogic-server-io-requirements-guide
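To make the per-forest guideline concrete, the following Python sketch multiplies the 20MB/sec read and 20MB/sec write minimums by the number of forests on a host and compares the result with a measured figure. The forest count and the measured bandwidth values are hypothetical.

```python
MIN_READ_MB_PER_SEC = 20   # per forest, from the guideline above
MIN_WRITE_MB_PER_SEC = 20  # per forest, from the guideline above


def minimum_host_bandwidth(forest_count):
    """Return the (read, write) minimum in MB/sec for a host's forests."""
    return (
        forest_count * MIN_READ_MB_PER_SEC,
        forest_count * MIN_WRITE_MB_PER_SEC,
    )


# Hypothetical host: 6 forests, with bandwidth measured during testing.
read_needed, write_needed = minimum_host_bandwidth(6)
measured_read, measured_write = 150, 100  # MB/sec, assumed test results

read_ok = measured_read >= read_needed
write_ok = measured_write >= write_needed
print(f"read:  need {read_needed} MB/sec, measured {measured_read}, ok={read_ok}")
print(f"write: need {write_needed} MB/sec, measured {measured_write}, ok={write_ok}")
```

These figures are minimums, so a host that only just meets them has no margin for heavier activity.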
Generally, when provisioning local disk, there is already some awareness of performance guidance from the vendors of the I/O controllers or disks being used on hosts. We have seen situations in the past where actual available bandwidth has been much different from expected, but at a minimum the expected values will provide a decent baseline for comparison against eventual testing results. If not already known, we would recommend contacting the vendors of the disk I/O related hardware used by the hosts before testing.
Look out for evidence of I/O Wait, which is the percentage of CPU time spent waiting for I/O operations to complete on a host. Common causes of I/O Wait include slow storage devices, disk congestion, faulty hardware and file system issues. I/O Wait can be monitored via technologies such as:
MarkLogic Server Monitoring History
https://docs.marklogic.com/guide/monitoring/history
Sar, from the sysstat package (external link)
https://github.com/sysstat/sysstat/
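For Linux hosts, I/O Wait can also be derived directly from the aggregate "cpu" line of /proc/stat by comparing two samples. The following Python sketch shows that calculation. The two sample lines are hypothetical counter values, and the function names are illustrative; the tools listed above report the same kind of figure without custom code.

```python
# Field order of the aggregate "cpu" line in Linux /proc/stat. Guest time is
# omitted because it is already included in the user and nice counters.
FIELDS = ["user", "nice", "system", "idle", "iowait", "irq", "softirq", "steal"]


def parse_cpu_line(line):
    """Parse the aggregate 'cpu' line into a dict of cumulative tick counts."""
    parts = line.split()
    if not parts or parts[0] != "cpu":
        raise ValueError("expected the aggregate 'cpu' line")
    return dict(zip(FIELDS, (int(value) for value in parts[1:9])))


def percent_between(before, after, key):
    """Percentage of CPU time spent in 'key' between two samples."""
    total = sum(after.values()) - sum(before.values())
    if total <= 0:
        return 0.0
    return 100.0 * (after[key] - before[key]) / total


# Hypothetical samples taken some seconds apart.
first = parse_cpu_line("cpu  1000 50 300 8000 150 0 20 0 0 0")
second = parse_cpu_line("cpu  1300 150 400 8300 350 0 20 0 0 0")

iowait_pct = percent_between(first, second, "iowait")
print(f"I/O Wait over the interval: {iowait_pct:.1f}%")
```

On a live host the two lines would be read from /proc/stat with a short sleep in between.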
Network
Network utilization should be monitored. Depending upon the size of the cluster, network traffic can be substantial (in clusters of 50 or more hosts) or small (1-3 hosts). Query workload also matters: queries that request large numbers of documents can increase the load on the network.
CPU
To recap, some high CPU utilization may not be a cause of concern, as there are workloads and tasks that are known to be CPU intensive, such as certain queries, filtering, ingestion, reindexing, rebalancing and merging (note that merge activity will show up as nice % in CPU statistics).
Remediation for high CPU might include tuning code to see if there is a way to make better use of MarkLogic caches and reduce E-node operations. Otherwise, for sizing, adding capacity can alleviate a CPU bottleneck, so you might look into the option of adding E-nodes/cores.
Disk Space
Disk utilization is an important part of the host's ecosystem. The results of filling the file system can have disastrous effects on server performance and data integrity. It is very important to ensure that your host always has an appropriate amount of free disk space. Sufficient disk space beyond the bare minimum requirement should be available in order to handle influx of data into your system for at least the amount of time it takes to provision more capacity. Further information on MarkLogic's disk space requirements may be found in the following knowledgebase article:
Understanding MarkLogic Minimum Disk Space Requirements
https://help.marklogic.com/Knowledgebase/Article/View/284/0/understanding-marklogic-minimum-disk-space-requirements
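One way to reason about "sufficient disk space beyond the bare minimum" is to estimate how many days of data growth the spare capacity can absorb and compare that with the time needed to provision more. The following Python sketch shows that arithmetic; every number in it is hypothetical, and the minimum requirement itself should be taken from the article above rather than from this example.

```python
def days_of_headroom(free_gb, minimum_required_free_gb, daily_growth_gb):
    """Days of growth that fit in the free space above the required minimum."""
    usable = free_gb - minimum_required_free_gb
    if usable <= 0:
        return 0.0
    if daily_growth_gb <= 0:
        return float("inf")
    return usable / daily_growth_gb


# Hypothetical host: 900 GB free, 500 GB of which must stay free,
# with data growing at 20 GB per day.
headroom = days_of_headroom(900, 500, 20)

PROVISIONING_LEAD_TIME_DAYS = 30  # assumed time to add capacity
enough = headroom >= PROVISIONING_LEAD_TIME_DAYS
print(f"headroom: {headroom:.0f} days, lead time: {PROVISIONING_LEAD_TIME_DAYS} days, ok={enough}")
```

In this example the host would fill its spare capacity before new storage could be provisioned, so capacity should be added sooner.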
Other Areas to Consider
Have You Planned for Failover Situations?
Host resource utilization may vary greatly after a failover event, and such situations should be sized and tested accordingly. Memory utilization on D-nodes in particular can change significantly. For example, if preload is turned off for range indexes, a host that properly served 6 primary forests and 6 failover forests could find itself with inadequate memory when it is serving 9 primary forests and 3 failover forests after a node failure. Likewise, those failover forests might not have impacted cache utilization on that host before the failover, but once active, they consume cache resources.
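The 6/6 to 9/3 example above can be reproduced with simple arithmetic. The following Python sketch assumes a three-host cluster in which the failed host's primary forests have their failover copies spread evenly across the surviving hosts; the article does not state the cluster size, so both of these are assumptions, and the function name is hypothetical.

```python
def after_single_host_failure(hosts, primaries_per_host, failovers_per_host):
    """Return (primary, failover) forest counts per surviving host.

    Assumes the failed host's primaries fail over evenly to the survivors.
    """
    survivors = hosts - 1
    if survivors < 1:
        raise ValueError("at least two hosts are needed for failover")
    promoted, remainder = divmod(primaries_per_host, survivors)
    if remainder:
        raise ValueError("primaries do not divide evenly across survivors")
    return primaries_per_host + promoted, failovers_per_host - promoted


# Assumed three-host cluster, each host serving 6 primary and 6 failover forests.
primary, failover = after_single_host_failure(3, 6, 6)
print(f"each surviving host: {primary} primary, {failover} failover forests")
```

Sizing memory and caches for the post-failover count of primary forests, rather than the normal count, is what the example above is pointing at.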
Will You Be Changing the Number of Data Nodes?
If scaling data nodes horizontally, you will likely want to take advantage of the new node arrangement by redistributing your database data across all the data nodes in a well-balanced way. The following knowledgebase articles contain information on best practice on how this can be achieved:
MarkLogic Fundamentals - How should I scale out my cluster?
https://help.marklogic.com/Knowledgebase/Article/View/how-should-i-build-out-my-cluster
Considerations when scaling out your MarkLogic instance
https://help.marklogic.com/Knowledgebase/Article/View/162/0/considerations-when-scaling-out-your-marklogic-instance
Are You Running MarkLogic Server as Non-root User?
MarkLogic Server's root process makes a number of OS-specific settings to allow the product to run optimally. However, some customers choose to run MarkLogic Server without the watchdog process running as root. If, as part of scaling, you will be using systems with different specifications than before, you should consider applying, for the non-root user that runs MarkLogic Server, the settings that the root process would otherwise have made. These modifications are detailed within the following knowledgebase article:
Pitfalls Running MarkLogic Process as non-root user
https://help.marklogic.com/Knowledgebase/Article/View/306/0/pitfalls-running-marklogic-process-as-non-root-user
Are There Any Licensing Implications?
If you are scaling your environment, consider if there will be any licensing implications as part of any change. If you have any questions in this area, you are welcome to open a Support ticket or simply fill out the form on the following page and we will be in touch with you:
How Can We Help?
https://www.progress.com/company/contact?s=marklogic
Test Any Changes
As always, we would recommend thoroughly testing any potential changes in a lower environment that is representative of Production (including under a representative Production load) before applying them in Production, to identify any issues or changes in performance.
Enable Request Monitoring
The Request Monitoring feature enables you to configure logging of information related to requests, including metrics collected during request execution. This feature lets you enable logging of internal preset metrics for requests on specific endpoints. You can also log custom request data by calling the provided Request Logging APIs. This logged information may help you evaluate server performance.
Endpoints and Request Monitoring
https://docs.marklogic.com/guide/performance/request_monitoring
Issues After Scaling
If you run into issues after making changes to your infrastructure, you are welcome to contact MarkLogic Support for assistance. You may also find the following resource useful:
Performance Issues in MarkLogic Server: what they look like - and what you should do about them
https://help.marklogic.com/Knowledgebase/Article/View/performance-issues-in-marklogic-server-what-they-look-like---and-what-you-should-do-about-them
References & Further Reading
Performance: Understanding System Resources
https://developer.marklogic.com/learn/understanding-system-resources/