Knowledgebase : Performance Tuning

Summary

When changing the amount of RAM on your MarkLogic Server host, there are additional considerations such as cache settings and swap space.

Group Cache Settings

As a ‘Rule of Thumb’, the memory allocated to group caches (List, Compressed Tree and Expanded Tree) on a host should come out to be about 1/3 to 3/8 of main memory. Increasing the group caches beyond this ratio can result in excessive swapping which will adversely affect performance.

  • For E/D-nodes: this can be distributed as 1/8 of main memory dedicated to your list cache, 1/16 to your compressed tree cache, and 1/8 to your expanded tree cache.
  • For E-nodes: Can be configured with a larger Expanded Tree Cache; the List Cache and the Compressed Tree Cache can be set to 128MB each. 
  • For D-nodes: Can be configured with a larger List Cache and Compressed Tree Cache; the Expanded Tree Cache can be set to 128MB. 
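
For example, on a 64GB E/D-node host this works out to roughly an 8GB List Cache (1/8), a 4GB Compressed Tree Cache (1/16), and an 8GB Expanded Tree Cache (1/8), about 20GB in total, which is roughly 1/3 of main memory.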

Swap Space (Linux)

  • Linux Huge Pages should be set to 3/8 the size of your physical memory. 
  • Swap space should be set to the size of your physical memory minus the size of your Huge Pages (because Linux Huge Pages are not swapped), or 32GB, whichever is lower.

For example: if you have 96GB RAM, Huge Pages should be set to 36GB and swap space to 32GB.

Swap Space (Solaris)

Swap space should be twice the size of physical memory.

Solaris technical note:

Why do we recommend 2x the size of main memory allocated to swap space? When MarkLogic allocates virtual memory on Solaris with mmap and the MAP_ANON option, we do not specify MAP_NORESERVE, and instead let the kernel reserve swap space at the time of allocation.  We do this so that if we run out of swap space we find out by the memory allocation failing with an error, rather than the process getting killed at some inopportune time with SIGBUS.  The process could be using nearly all of physical memory, which explains why you need at least 1x physical memory in swap space.

MarkLogic Server uses the standard fork() and exec() idiom to run the pstack program.  The pstack program can be a critically important tool for the diagnosis of server problems.  We can’t use vfork() to run pstack instead of fork() because it’s not safe for multithreaded programs.  When a process calls fork(), the kernel makes a virtual copy of all the address space of the process, so it also reserves swap space for this anonymously mapped memory for the forked process.  Of course, immediately after forking, the forked process calls exec() on the pstack program, which frees that reserved memory.  Unlike Linux, Solaris doesn’t overbook swap space, so if the kernel cannot reserve the swap space, fork() fails.  That's why you need 2X physical memory for swap space on Solaris.

Page File (Windows)

On a Windows system, the page file should be twice the size of physical memory. You can set the page file size in the virtual memory section in the advanced system settings from the Windows Control Panel. 

Performance Solution?

Increasing the amount of RAM is not a "cure all" for performance issues on the server.  Even when memory appears to be the resource bottleneck, increasing RAM may not be the required solution.  However, here are a few scenarios where increasing RAM on your server may be appropriate:

  • You have a need to increase group cache sizes because your cache miss rate is too high (hit rate too low) or your queries are failing with cache full errors. In this case, increasing RAM can give you additional flexibility in how the group caches are configured.  However, an alternative solution that could result in even greater performance improvements may involve reworking your queries and index settings so that the queries can be fully resolved during the index resolution phase of query evaluation.
  • While monitoring your server for swap utilization, you noticed that the server is swapping often, and you have already verified that your memory and cache settings are within the MarkLogic recommendations.  The system should be configured so that swapping does not occur during normal operations, as swapping severely degrades performance. If that is the case, then adding RAM may improve performance. 

Increasing RAM on your server may only be a temporary fix.  If your queries do not scale, then, as the data size in your forests grows, you may once again hit the issues that caused you to increase your RAM in the first place. If this is the case, evaluate your queries and indexes to make them more efficient.

Best Practice for Adding an Index in Production

Summary

It is sometimes necessary to remove or add an index to your production cluster. For a large database with more than a few GB of content, the resulting workload from reindexing your database can be a time- and resource-intensive process that can affect query performance while the server is reindexing. This article points out some strategies for avoiding some of the pain points associated with changing your database configuration on a production cluster.

Preparing your Server for Production

In general, high performance production search implementations run with tight controls on the automatic features of MarkLogic Server. 

  • Re-indexer disabled by default
  • Format-compatibility set to the latest format
  • Index-detection set to none.
  • On a very large cluster (several dozen or more hosts), consider running with expunge-locks set to none
  • On large clusters with insufficient resources, consider bumping up the default group settings
    • xdqp-timeout: from 10 to 30
    • host-timeout: from 30 to 90

Raising the xdqp and host timeouts prevents the server from prematurely disconnecting a busy data node, which could otherwise trigger a false failover event. However, these changes also increase the time it takes for a legitimate failover to occur in an HA configuration. 
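
If you prefer to script these group-level changes rather than use the Admin UI, the following is a minimal sketch using the Admin API; it assumes a group named "Default" and the admin:group-set-xdqp-timeout / admin:group-set-host-timeout functions available in recent server versions.

xquery version "1.0-ml";
import module namespace admin = "http://marklogic.com/xdmp/admin" at "/MarkLogic/admin.xqy";

(: Raise the group timeouts; "Default" is a placeholder for your group name. :)
let $config := admin:get-configuration()
let $group  := admin:group-get-id($config, "Default")
let $config := admin:group-set-xdqp-timeout($config, $group, 30)
let $config := admin:group-set-host-timeout($config, $group, 90)
return admin:save-configuration($config)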

Preparing to Re-index

When an index configuration must be changed in production, you should:

  • First, set index-detection back to automatic
  • Then, make the index configuration change

When you have Database Replication Configured:

If you have to add or modify indexes on a database which has database replication configured, make sure the same changes are made on the Replica cluster as well. Starting with MarkLogic Server version 9.0-7, index data is also replicated from the Master to the Replica, but the server does not automatically check that both sides have the same index settings. Reindexing is disabled by default on a replica cluster. However, when the database replication configuration is removed (such as after a disaster), the replica database will reindex as necessary. So it is important that the Replica database index configuration matches the Master's to avoid unnecessary reindexing.

Note: If you are on a version prior to 9.0-7 - When adding/updating index settings, it is recommended that you update the settings on the Replica database before updating those on the Master database; this is because changes to the index settings on the Replica database only affect newly replicated documents and will not trigger reindexing on existing documents.

Further reading -

Master and Replica Database Index Settings

Database Replication - Indexing on Replica Explained

  • Finally, the reindexer should be enabled during off-hours to reindex the content.

Reindexing works by reloading all the URIs that are affected by the index change; this process tends to create many new and deleted fragments, which then need to be merged. Given that reindexing is very CPU and disk I/O intensive, the re-indexer-throttle can be set to 3 or 2 to minimize the impact of the reindex.
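
Enabling the reindexer and setting the throttle can also be scripted from Query Console with the Admin API. The following is a minimal sketch; the database name "Documents" is a placeholder, and it assumes the admin:database-set-reindexer-throttle and admin:database-set-reindexer-enable functions available in recent server versions.

xquery version "1.0-ml";
import module namespace admin = "http://marklogic.com/xdmp/admin" at "/MarkLogic/admin.xqy";

(: Enable the reindexer at a reduced throttle for the off-hours window. :)
let $config := admin:get-configuration()
let $db     := xdmp:database("Documents")   (: placeholder database name :)
let $config := admin:database-set-reindexer-throttle($config, $db, 2)
let $config := admin:database-set-reindexer-enable($config, $db, fn:true())
return admin:save-configuration($config)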

After the Re-index

After the re-index has completed, it is important to return to the old settings by disabling the reindexer and setting index-detection back to none.

If you're reindexing over several nights or weekends, be sure to allow some time for the merging to complete. So for example, if your regular busy time starts at 5AM, you may want to disable the reindexer at around midnight to make sure all your merging is completed before business hours.

By following the above recommendations, you should be able to complete a large re-index without any disruption to your production environment.

BEST PRACTICES FOR EXPORTING AND IMPORTING DATA IN BULK

Handling large amounts of data can be expensive in terms of both computing resources and runtime. It can also sometimes result in application errors or partial execution. In general, if you’re dealing with large amounts of data as either output or input, the most scalable and robust approach is to break-up that workload into a series of smaller and more manageable batches.

Of course there are other available tactics. It should be noted, however, that most of those other tactics will have serious disadvantages compared to batching. For example:

  • Configuring time limit settings through Admin UI to allow for longer request timeouts - since you can only increase timeouts so much, this is best considered a short term tactic for only slightly larger workloads.
  • Eliminating resource bottlenecks by adding more resources – often easier to implement compared to modifying application code, though with the downside of additional hardware and software license expense. Like increased timeouts, there can be a point of diminishing returns when throwing hardware at a problem.
  • Tuning queries to improve your query efficiency – this is actually a very good tactic to pursue, in general. However, if workloads are sufficiently large, even the most efficient implementation of your request will eventually need to work over subset batches of your inputs or outputs.

For more detail on the above non-batching options, please refer to XDMP-CANCELED vs. XDMP-EXTIME.

WAYS TO EXPORT LARGE AMOUNTS OF DATA FROM MARKLOGIC SERVER

1.    If you can’t break-up the data into a series of smaller batches - use xdmp:save to write out the full results from query console to the desired folder, specified by the path on your file system. For details, see xdmp:save.

2.    If you can break-up the data into a series of smaller batches:

            a.    Use batch tools like MLCP, which can export bulk output from MarkLogic server to flat files, a compressed ZIP file, or an MLCP database archive. For details, see Exporting Content from MarkLogic Server.

            b.    Reduce the size of the desired result set until it saves successfully, then save the full output in a series of batches.

            c.    Page through result set:

                               i.     If dealing with documents, cts:uris is excellent for paging through a list of URIs (see the sketch after this list). Take a look at cts:uris for more details.

                               ii.     If using Semantics

                                             1.    Consider exporting the triples from the database using the Semantics REST endpoints.

                                              2.    Take a look at the URL parameters start and pageLength - these parameters can be configured in your SPARQL query request to return the results in batches.  See GET /v1/graphs/sparql for further details.
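
As a rough illustration of items 1 and 2.c above, the following sketch pages through the URIs in a collection and saves each document in the current batch to the file system. The collection name, batch size, page number, and output path are illustrative placeholders, and it assumes the URI lexicon is enabled (required by cts:uris).

xquery version "1.0-ml";

(: Page through a collection and save one batch of documents per run. :)
let $batch-size := 1000
let $page       := 1   (: increment this per run, or drive it from a loop :)
let $start      := ($page - 1) * $batch-size + 1
let $uris       := cts:uris((), (), cts:collection-query("my-collection"))
                     [$start to $start + $batch-size - 1]
for $uri at $i in $uris
return xdmp:save(fn:concat("/tmp/export/", $page, "-", $i, ".xml"), fn:doc($uri))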

WAYS TO IMPORT LARGE AMOUNTS OF DATA INTO MARKLOGIC SERVER

1.    If you’re looking to update more than a few thousand fragments at a time, you'll definitely want to use some sort of batching.

             a.     For example, you could run a script in batches of, say, 2000 fragments by doing something like [1 to 2000], and filtering out fragments that already have your newly added element. You could also look into using batch tools like MLCP.

             b.    Alternatively, you could split your input into smaller batches, then spawn each of those batches as jobs on the Task Server, which has a configurable queue (see the sketch after this list). See:

                            i.     xdmp:spawn

                            ii.    xdmp:spawn-function

2.    Alternatively, you could use an external/community developed tool like CoRB to batch process your content. See Using Corb to Batch Process Your Content - A Getting Started Guide

3.    If using Semantics and querying triples with SPARQL:

              a.    You can make use of the LIMIT keyword to further restrict the result set size of your SPARQL query. See The LIMIT Keyword

              b.    You can also use the OFFSET keyword for pagination. This keyword can be used with the LIMIT and ORDER BY keywords to retrieve different slices of data from a dataset. For example, you can create pages of results with different offsets. See  The OFFSET Keyword
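
The following is a minimal sketch of splitting an input URI list into batches and spawning each batch to the Task Server (item 1.b above). The collection name, batch size, and the per-document update (adding a <processed> element) are illustrative placeholders; it also assumes the URI lexicon is enabled, and you should take care not to overflow the Task Server queue.

xquery version "1.0-ml";

(: Split the URI list into batches and spawn one Task Server job per batch. :)
let $uris       := cts:uris((), (), cts:collection-query("to-process"))
let $batch-size := 2000
for $start in (1 to fn:count($uris))[. mod $batch-size = 1]
let $batch := $uris[$start to $start + $batch-size - 1]
return
  xdmp:spawn-function(
    function() {
      for $uri in $batch
      where fn:empty(fn:doc($uri)/*/processed)   (: skip already-processed docs :)
      return xdmp:node-insert-child(fn:doc($uri)/*, <processed>true</processed>)
    },
    <options xmlns="xdmp:eval"><update>true</update></options>)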

Introduction

This article outlines various factors influencing the performance of the xdmp:collection-delete function and provides general best practices for improving the performance of large collection deletes.

What are collections?

Collections in MarkLogic Server are used to organize documents in a database. Collections are a powerful and high-performance mechanism to define and manage subsets of documents.

How are collections different from directories?

Although both collections and directories can be used for organizing documents in a database, there are some key differences. For example:

  • Directories are hierarchical, whereas collections are not. Consequently, collections do not require member documents to conform to any URI patterns. Additionally, any document can belong to any collection, and any document can also belong to multiple collections
  • You can delete all documents in a collection with the xdmp:collection-delete function. Similarly, you can delete all documents in a directory (as well as all recursive subdirectories and any documents in those directories) with a different function call - xdmp:directory-delete
  • You can set properties on a directory. You cannot set properties on a collection

For further details, see Collections versus Directories.

What is the use of the xdmp:collection-delete function?

xdmp:collection-delete is used to delete all documents in a database that belong to a given collection - regardless of their membership in other collections.

  • Use of this function always results in the specified unprotected collection disappearing. For details, see Implicitly Defining Unprotected Collections
  • Removing a document from a collection and using xdmp:collection-delete are similarly contingent on users having appropriate permissions to update the document(s) in question. For details, see Collections and Security
  • If there are no documents in the specified collection, then nothing is deleted, and the function still returns the empty sequence

What factors affect performance of xdmp:collection-delete?

The speed of xdmp:collection-delete depends on several factors:

Is there a fast operation mode available within the call xdmp:collection-delete?

Yes. The call xdmp:collection-delete("collection-uri") can potentially be fast in that it won't retrieve fragments. Be aware, however, that xdmp:collection-delete will retrieve fragments (and therefore perform much more slowly) when your database is configured with any of the following:

What are the general best practices in order to improve the performance of large collection deletes?

  • Batch your deletes
    • You could use an external/community developed tool like CoRB to batch process your content
    • Tools like CoRB allow you to create a "query module" (this could be a call to cts:uris to identify documents from a number of collections) and a "transform module" that works on each URI returned. CoRB will run the URI query and will use the results to feed a thread pool of worker threads. This can be very useful when dealing with large bulk processing. See: Using Corb to Batch Process Your Content - A Getting Started Guide
  • Alternatively, you could split your input (for example, URIs of documents inside a collection that you want to delete) into smaller batches
    • Spawn each of those batches to jobs on the Task Server instead of trying to delete an entire collection in a single transaction
    • Use xdmp:spawn-function to kick off deletions of one document at a time - be careful not to overflow the task server queue, however
      • Don't spawn single document deletes
      • Instead, make batches of size that work most efficiently in your specific use case
    • One of the restrictions on the Task Server is that there is a set queue size - you should be able to increase the queue size as necessary
  • Scope deletes more narrowly with the use of cts:collection-query (see the sketch below)
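
As a minimal sketch of the scoping and batching ideas above, the following deletes one modest batch per run, scoped by a cts:collection-query combined with a directory query. The collection name, directory, and batch size are illustrative placeholders, and it assumes the URI lexicon is enabled.

xquery version "1.0-ml";

(: Delete one batch of documents scoped to a collection and a directory. :)
let $query := cts:and-query((
                cts:collection-query("stale-data"),
                cts:directory-query("/archive/", "infinity")))
for $uri in cts:uris((), (), $query)[1 to 500]
return xdmp:document-delete($uri)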

Related knowledgebase articles:

 

Introduction

MarkLogic Server is engineered to scale out horizontally by easily adding forests and nodes. Be aware, however, that when adding resources horizontally, you may also be introducing additional demand on the underlying resources.

Details

On a single node, you will see some performance improvement from adding additional forests, due to increased parallelization. There is a point of diminishing returns, though, where the number of forests can overwhelm the available resources such as CPU, RAM, or I/O bandwidth. Internal MarkLogic research (as of April 2014) shows the sweet spot to be around six forests per host (assuming modern hardware). Note that there is a hard limit of 1024 primary forests per database, and it is a general recommendation that the total number of forests should not grow beyond 1024 per cluster.

At cluster level, you should see performance improvements in adding additional hosts, but attention should be paid to any potentially shared resources. For example, since resources such as CPU, RAM, and I/O bandwidth would now be split across multiple nodes, overall performance is likely to decrease if additional nodes are provisioned virtually on a single underlying server. Similarly, when adding additional nodes to the same underlying SAN storage, you'll want to pay careful attention to making sure there's enough I/O bandwidth to accommodate the number of nodes you want to connect.

More generally, additional capacity above a bottleneck generally exacerbates performance issues. If you find your performance has actually decreased after horizontally scaling out some part of your stack, it is likely that a part of your infrastructure below the part at which you made changes is being overwhelmed by the additional demand introduced by the added capacity.

Summary

MarkLogic Server clusters are built on a distributed, shared nothing architecture.  Typical query loads will maximize resource utilization only when database content is evenly distributed across D-node hosts in a cluster.  That is, optimal performance will occur when the amount of concurrent work required of each node in a cluster is equivalent.   Having your data balanced across the forests in your cluster is necessary in order to achieve optimal performance.

If all of the forests in a multi-forest database are present from the time when the database was created, the forests will likely each have approximately the same number of documents. If forests were added later on, the newer forests will tend to have fewer documents. In cases like this, rebalancing the forests may be in order.

Default Document Forest Assignment (bucket assignment policy)

Earlier versions of MarkLogic Server used a default document forest assignment policy (the legacy policy). Currently, bucket is the default assignment policy, and the rebalancer enable setting for a database defaults to true.

Under the bucket assignment policy, a new document in a multi-forest database gets assigned to a forest based on a hash of its URI.  For practical purposes, the default forest assignment is random. In most cases, the default behavior is sufficient to guarantee evenly distributed content.  

There are API functions that allow you to determine where a document resides or will reside:

  • The xdmp:document-assign() function can be used to determine the forest to which a document URI will be assigned. 
  • For existing documents, document updates will occur in the same forest as the existing document. The xdmp:document-forest() function can be used to determine which forest a document is assigned to, as shown in the sketch below. 
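
A minimal sketch of both built-ins, run against the current database; the two URIs are illustrative placeholders.

xquery version "1.0-ml";

let $forest-count := fn:count(xdmp:database-forests(xdmp:database()))
return (
  (: name of the forest holding an existing document :)
  xdmp:forest-name(xdmp:document-forest("/existing/doc.xml")),
  (: 1-based position of the forest to which a new URI would be assigned :)
  xdmp:document-assign("/new/doc.xml", $forest-count)
)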

In-Forest Placement

'In-forest placement' is a technique that is used to override the default document forest assignments.  

Both xdmp:document-insert() and xdmp:document-load() allow you to specify the forest in which the document will be inserted.
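
A minimal sketch of in-forest placement using the long positional form of xdmp:document-insert (newer releases also accept an options map); the URI, content, and forest name are illustrative placeholders.

xquery version "1.0-ml";

(: Insert a document directly into a named forest. :)
xdmp:document-insert(
  "/placed/doc.xml",
  <doc>example</doc>,
  xdmp:default-permissions(),
  (),                                (: collections :)
  0,                                 (: quality :)
  xdmp:forest("Documents-001"))      (: target forest id(s); placeholder name :)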

mlcp has a -fastload option which will insert content directly into the target forests.  See Time vs. Correctness: Understanding -fastload Tradeoffs to understand the tradeoffs when using this option.

Some common open source document loading tools also support in-forest placement. RecordLoader (http://developer.marklogic.com/code/recordloader) and XQsync (http://developer.marklogic.com/code/xqsync) support in-forest placement with the OUTPUT_FORESTS property setting.

Rebalancing

MarkLogic 7 introduced database rebalancing using a database rebalancer configured with one of several assignment policies.

A database rebalancer consists of two parts:

  1. an assignment policy for data insert and rebalancing, and
  2. a rebalancer for data movement.

The rebalancer can be configured with one of several assignment policies, which define what is considered 'balanced' for a database. The rebalancer runs on each forest and consults the database's assignment policy to determine which documents do not 'belong to' this forest and then pushes them to the correct forests. You can read more about database rebalancing at http://docs.marklogic.com/guide/admin/database-rebalancing

For a brand new database, the rebalancer is enabled by default and the assignment policy is bucket.  For older versions (before ML 7), by default, the assignment was done using legacy policy.  Upgrades do not change the assignment policy for a database.

(Note that rebalancing forests may result in forests that contain many deleted fragments. To recover disk space, you may wish to force some forests to merge.)

Before Rebalancing, Consider This …

Before embarking on a process to rebalance the documents in your database, consider that rebalancing is generally slower than clearing the database and reloading. The reason is that rebalancing involves updating documents, and updates are more expensive than inserts. Rebalancing the forests may not be the best solution. If you have the luxury of clearing the database and reloading everything, do it.  However, if the database must be available throughout the rebalancing process, then using the rebalancer may be appropriate.

Summary

A long common URI prefix may lead to an imbalance in data distribution among the forests. 

Observation

The database assignment policy is set to 'bucket', the rebalancer is enabled, and no fragments are pending rebalancing; however, data is imbalanced across the forests associated with the database. A few forests have a higher number of fragments compared to the other forests.

Root cause

Under the bucket assignment policy, a document URI is hashed to a specific bucket. The bucket policy algorithm maps a document’s URI to one of 16K “buckets,” with each bucket being associated with a forest. A table mapping buckets to forests is stored in memory for fast assignment.

The bucket algorithm does not consider the whole URI when computing the hash; bucket determination relies largely on the initial characters of the URI.

If document URIs share a long common prefix, they can all hash to the same value and land in the same bucket, even if they have different suffixes; the distribution is therefore skewed when there is a large common prefix.

Analysis

To confirm whether there is an uneven number of fragments between different forests in the database, you can run the query below, which returns 100 sample document URIs from each forest. Review whether there is a common prefix in the document URIs in the forests with the higher number of fragments.

xquery version "1.0-ml";

for $i in xdmp:database-forests(xdmp:database('<dbname>'))
    let $uri := for $j in cts:uris((),(),(),(), $i)[1 to 100]
                return <uri>{$j}</uri>
return <forests><forest>{$i}</forest><uris>{$uri}</uris></forests>

Recommendation

We recommend that document URIs not have long names or long common prefixes. In some cases, values that are common across many document URIs can be moved into collections instead.

Example uri -  /Prime/InternationalTradeDay/Activity/AccountId/ABC0001/BusinessDate/2021-06-14/CurrencyCode/USD/ID/ABC0001-XYZ-123.json

Could instead be /ABC0001-XYZ-123.json, with the collections "USD" and "Prime", and a date element in the document with the value "2021-06-14".

The above is just an example, but the suggestion is to use a URI naming pattern that avoids a large common prefix, or to use collections instead. 

You can use the xdmp:document-assign built-in to verify whether URIs are being distributed as expected by the bucket algorithm, as shown in the sketch below.

https://docs.marklogic.com/xdmp:document-assign
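
For example, the following sketch tallies how xdmp:document-assign would spread 1,000 URIs that share a long common prefix across a hypothetical 12-forest database; the prefix and forest count are illustrative.

xquery version "1.0-ml";

let $forest-count := 12
let $assignments :=
  for $i in 1 to 1000
  return xdmp:document-assign(
           fn:concat("/Prime/InternationalTradeDay/Activity/AccountId/ABC", $i, ".json"),
           $forest-count)
for $f in fn:distinct-values($assignments)
order by $f
return <forest index="{$f}" count="{fn:count($assignments[. eq $f])}"/>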

Additional Resources

Summary

A forest reindex timeout error may occur when there are transactions holding update locks on documents for an extended period of time. A reindexer process is started as a result of a database index change or a major MarkLogic Server upgrade.  The reindexer process will not complete until after update locks are released.

Example error text seen in the MarkLogic Server ErrorLog.txt file:

XDMP-FORESTERR: Error in reindex of forest Documents: SVC-EXTIME: Time limit exceeded

Detail

Long running transactions can occur if MarkLogic Server is participating in a distributed transaction environment. In this case transactions are managed through a Resource Manager. Each transaction is executed as a two-phase commit. In the first phase, the transaction will be prepared for a commit or a rollback. The actual commit or rollback will occur in the second phase. More details about XA transactions can be found in the Application Developer's Guide - Understanding Transactions in MarkLogic Server

In a situation where the Resource Manager gets disconnected between the two phases, all transactions may be left in a "prepare" state within MarkLogic Server. The Resource Manager maintains transaction information and will clean up transactions left in "prepare" state after a successful reconnect. In the rare case where this doesn't happen, all transactions left in "prepare" state will stay in the system until they are cleaned up manually. The method to manually intervene is described in the XCC Developers Guide - Heuristically Completing a Stalled Transaction.

In order for an XA transaction to take place, it needs to prepare the execution for the commit. If updates are being made to pre-existing documents, update locks are held against the URIs for those documents. When reindexing occurs during this process, the reindexer will wait for these locks to be released before it can successfully reindex the affected documents.   Because the reindexer is unable to complete due to these pending XA transactions, the hosts in the cluster are unable to finish the reindexing task and will eventually throw a timeout error.

Mitigation

To avoid this kind of reindexer timeout, it is recommended that the database be checked for outstanding XA transactions in the "prepare" state before starting a reindexing process. There are two ways to verify whether the database has outstanding transactions in the "prepare" state:

  • In the Admin UI, navigate  to each forest of the database and review the status page; or
  • Run the following XQuery code (in Query Console):

    xquery version "1.0-ml"; 
    declare namespace fo = "http://marklogic.com/xdmp/status/forest";   

    for $f in xdmp:database-forests(xdmp:database()) 
    return    
      xdmp:forest-status($f)//fo:transaction-coordinator[fo:decision-state = 'prepare']

In the case where there are transactions in the "prepare" state, a roll-back can be executed:

  • In the Admin UI, click on the "rollback" link for each transaction; or
  • Run the following XQuery code (in Query Console):

    xquery version "1.0-ml"; 
    declare namespace fo = "http://marklogic.com/xdmp/status/forest";

    for $f in xdmp:database-forests(xdmp:database()) 
    return    
      for $id in xdmp:forest-status($f)//fo:transaction-coordinator[fo:decision-state = 'prepare']/fo:transaction-id/fn:string()
      return
        xdmp:xa-complete($f, $id, fn:false(), fn:false())

Introduction

Users of Java based batch processing applications, such as CoRB, XQSync, mlcp and the hadoop connector may have seen an error message containing "Premature EOF, partial header line read". Depending on how exceptions are managed, this may cause the Java application to exit with a stacktrace or to simply output the exception (and trace) into a log and continue.

What does it mean?

The premature EOF exception generally occurs in situations where the connection to a particular application server was lost while the XCC driver was in the process of reading a result set. This can happen in a few possible scenarios:

  • The host became unavailable due to a hardware issue, segfault or similar issue;
  • The query timeout expired (although this is much more likely to yield an XDMP-EXTIME exception with a "Time limit exceeded" message);
  • Network interruption - a possible indicator of a network reliability problem such as a misconfigured load balancer or a fault in some other network hardware.

What does the full error message look like?

An example:

INFO: completed 5063408/14048060, 103 tps, 32 active threads
 Feb 14, 2013 7:04:19 AM com.marklogic.developer.SimpleLogger logException
 SEVERE: fatal error
 com.marklogic.xcc.exceptions.ServerConnectionException: Error parsing HTTP
 headers: Premature EOF, partial header line read: ''
 [Session: user=admin, cb={default} [ContentSource: user=admin,
 cb={none} [provider: address=localhost/127.0.0.1:8223, pool=0/64]]]
 [Client: XCC/4.2-8]
 at
 com.marklogic.xcc.impl.handlers.AbstractRequestController.runRequest(AbstractRequestController.java:116)
 at com.marklogic.xcc.impl.SessionImpl.submitRequest(SessionImpl.java:268)
 at com.marklogic.developer.corb.Transform.call(Unknown Source)
 at com.marklogic.developer.corb.Transform.call(Unknown Source)
 at java.util.concurrent.FutureTask$Sync.innerRun(FutureTask.java:334)
 at java.util.concurrent.FutureTask.run(FutureTask.java:166)
 at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:471)
 at java.util.concurrent.FutureTask$Sync.innerRun(FutureTask.java:334)
 at java.util.concurrent.FutureTask.run(FutureTask.java:166)
 at
 java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1110)
 at
 java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:603)
 at java.lang.Thread.run(Thread.java:679)
 Caused by: java.io.IOException: Error parsing HTTP headers: Premature EOF,
 partial header line read: ''
 at com.marklogic.http.HttpHeaders.nextHeaderLine(HttpHeaders.java:283)
 at com.marklogic.http.HttpHeaders.parseResponseHeaders(HttpHeaders.java:248)
 at com.marklogic.http.HttpChannel.parseHeaders(HttpChannel.java:297)
 at com.marklogic.http.HttpChannel.receiveMode(HttpChannel.java:270)
 at com.marklogic.http.HttpChannel.getResponseCode(HttpChannel.java:174)
 at
 com.marklogic.xcc.impl.handlers.EvalRequestController.serverDialog(EvalRequestController.java:68)
 at
 com.marklogic.xcc.impl.handlers.AbstractRequestController.runRequest(AbstractRequestController.java:78)
 ... 11 more
 2013-02-14 07:04:19.271 WARNING [12] (AbstractRequestController.runRequest):
 Cannot obtain connection: Connection refused

Configuration / Code: things to try when you first see this message

A possible cause of errors like this is the JVM initiating a garbage collection that takes long enough to exceed the server timeout setting. If this is the case, try adding the -XX:+UseConcMarkSweepGC Java option.

Setting the "keep-alive" value to zero for the affected XDBC application server will disable socket pooling and may help to prevent this condition from arising; with keep-alive set to zero, sockets will not be re-used. Disabling keep-alive is not expected to have a significant negative impact on performance, although thorough testing is nevertheless advised.

Summary

Here we discuss various methods for sharing metering data with Support:  telemetry in MarkLogic 9 and exporting monitoring data.

Discussion

Telemetry

In MarkLogic 9, enabling telemetry collects, encrypts, packages, and sends diagnostic and system-level usage information about MarkLogic clusters, including metering, with minimal impact to performance. Telemetry sends information about your MarkLogic Servers to a protected and secure location where it can be accessed by the MarkLogic Technical Support Team to facilitate troubleshooting and monitor performance.  For more information see Telemetry.

Meters database

If telemetry is not enabled, make sure that monitoring history is enabled and data has been collected covering the time of the incident.  See Enabling Monitoring History on a Group for more details.

Exporting data

One of the attached scripts can be used in lieu of a Meters database backup. They will provide the raw metering XML files from a defined period of time and can be reloaded into MarkLogic and used with the standard tools.

exportMeters.xqy

This XQuery export script needs to be executed in Query Console against the Meters database and will generate zip files stored in the defined folder for the defined period of time.

Variables for start and end times, batch size, and output directory are set at the top of the script.

get-raw.sh

This bash version will use MLCP to perform a similar export but requires an XDBC server and MLCP installed. By default the script creates output in a subdirectory called meters-export. See the attached script for details. An example command line is

./get-raw.sh localhost admin admin "2018-04-12T00:00:00" "2018-04-14T00:00:00"

Backup of Meters database

A backup of the full Meters database will provide all the available raw data and is very useful, but is often very large and difficult to transfer, so an export of a defined time range is often requested.

Summary

Executing searches as "unfiltered" is an important performance optimization, particularly for large result sets. This article describes how "filtered" and "unfiltered" searches work, and what tradeoffs each option entails. In general, unfiltered index resolution is fast and filtering is slow. It is often possible to explicitly perform a search either unfiltered or filtered. 

Filtered Searches

In a typical search, MarkLogic Server will first do index resolution on the D-nodes, which produces unfiltered search results. As a second step, the Server will then filter those unfiltered results on the E-nodes to remove false positives, which produces the filtered search results.

Unfiltered Searches

If you want to maximize your query performance, you will want to avoid filtering whenever possible - try to structure your documents and configure your indexes to maximize both query accuracy and speed through unfiltered index resolution alone. You can use the "unfiltered" option in both cts:search() and search:search() to test the accuracy of your unfiltered queries.
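
As a simple check, the sketch below runs the same cts:search() both unfiltered and filtered and compares the counts; a difference indicates false positives that only filtering removes. The element-value query is an illustrative placeholder.

xquery version "1.0-ml";

let $query := cts:element-value-query(xs:QName("status"), "active")
return (
  fn:count(cts:search(fn:doc(), $query, "unfiltered")),
  fn:count(cts:search(fn:doc(), $query, "filtered"))
)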

SUMMARY

This article discusses the MarkLogic group-level caches and Linux Huge Page configurations.

Group Caches

MarkLogic utilizes caches to increase retrieval performance of frequently-accessed objects. In particular, MarkLogic caches:

1. Expanded trees (Expanded Tree Cache)

On any groups that have app servers configured for your application (E-nodes), the Expanded Tree Cache is used to hold frequently-accessed XML documents. This cache is used as workspace for holding the result set of a particular query. MarkLogic recommends that most customers set the Expanded Tree Cache size to 1/8th of the physical memory on the server, up to the maximum allowed size.

For groups that only manage forest content and do not have app servers configured (D-nodes), the Expanded Tree Cache is used only during the process of reindexing content. The cache size should be set to 1024 for D-nodes.

2. Compressed trees (Compressed Tree Cache)

On any groups that do not manage forest content (E-nodes), the Compressed Tree Cache is unused, and should be set to 128.

For groups that manage forest content (D-nodes), the Compressed Tree Cache is used to hold recently-accessed XML content in a compressed form. Its purpose is to minimize random disk reads for frequently-accessed content. MarkLogic recommends that most customers set the Compressed Tree Cache size to 1/16th of the physical memory on the server, up to the maximum allowed size.

3. Lists (List Cache)

On any groups that do not manage forest content (E-nodes), the List Cache is unused, and should be set to 128.

For groups that manage forest content (D-nodes), the List Cache is used to hold recently-accessed index termlists. Its purpose is to minimize disk reads for frequently-accessed index terms, which are used for almost every MarkLogic XQuery. MarkLogic recommends that most customers set the List Cache size to 1/8th of the physical memory on the server, up to the maximum allowed size.

4. Triples (Triple Cache)

The triple cache holds blocks of compressed triples from disk.  

If a cache page has not been accessed after the given amount of time, it's released from the cache.  It is flushed using a least recently used algorithm, so the cache memory shrinks as pages in the cache time out.

See Group Level Cache Settings based on RAM for more information on default settings and maximums.

Logging

If you have logging set at Debug level, the error log will give information on the cache sizes at startup:

2022-07-18 04:46:16.310 Debug: Initializing list cache: 24576 MB, 8 partitions at 3072 MB/partition
2022-07-18 04:46:16.327 Debug: Initializing compressed tree cache: 12288 MB, 16 partitions at 768 MB/partition
2022-07-18 04:46:16.357 Debug: Initializing expanded tree cache: 24576 MB, 16 partitions at 1536 MB/partition
2022-07-18 04:46:16.388 Debug: Initializing triple cache: 12288 MB, 16 partitions at 768 MB/partition
2022-07-18 04:46:16.389 Debug: Initializing triple value cache: 24576 MB, 32 partitions at 768 MB/partition

Rule of Thirds

By default, MarkLogic Server will allocate roughly one third of physical memory to the aforementioned caches, but the server will try to utilize as much memory as possible. The "Rule of Thirds" provides a conceptual explanation of how MarkLogic uses memory on a server:

  • One third of physical memory for MarkLogic group-level caches
  • One third of physical memory for in-memory content (range indexes and in-memory stands)
  • One third of physical memory for workspace, app server overhead, and Linux filesystem buffer

It is very common for Linux servers running MarkLogic to show high memory utilization. In fact, it is desirable to have MarkLogic utilize much of the memory on the server. However, the server should use very little swap, as that will have a severe negative impact on performance. Adhering to the Rule of Thirds should generally ensure that a server is properly sized, and any cases of memory-related performance degradations should be compared against this rule to identify improper sizing.

Huge Pages

MarkLogic server memory use falls into two major categories: large block and small block. Caches and in-memory stands look for large blocks of contiguous memory space, while range indexes, workspace memory, and the Linux filesystem buffer utilize smaller blocks of memory. In order to efficiently allocate the large blocks of memory for the group-level caches and in-memory stands, MarkLogic recommends the usage of Linux Huge Pages. Instead of the kernel allocating 4k pages of memory, huge pages are 2048k in size and can be quickly allocated for larger blocks of memory. At a minimum, MarkLogic recommends allocating enough huge pages to cover the group-level caches (roughly one third of physical memory). The upper end of recommended huge pages includes both the caches and in-memory stands.
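
For example, on a 128GB host, 3/8 of physical memory is 48GB, which corresponds to 24576 of the 2048k huge pages (vm.nr_hugepages = 24576).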

The Installation Guide for All Platforms offers the following guidelines for setting up Linux Huge pages:

On Linux systems, MarkLogic recommends setting Linux Huge Pages to 3/8 the size of your physical memory, and should be configured to reserve the space at boot time. For details on setting up Huge Pages, see the following Red Hat Enterprise Linux (RHEL) KB:

How can I configure huge pages in Red Hat Enterprise Linux

If you have Huge Pages set up on a Linux system, your swap space on that machine should be equal to the size of your physical memory minus the size of your Huge Pages (because Linux Huge Pages are not swapped), or 32GB, whichever is lower. For example, if you have 64 GB of physical memory and Huge Pages are set to 24 GB, then the required swap space is the lower of (64-24) GB and 32 GB, which is 32 GB.

At system startup on Linux machines, MarkLogic Server logs a message to the ErrorLog.txt file showing the Huge Page size, and the message indicates if the size is below the recommended level.

Further Reading

Linux Huge Pages and Transparent Huge Pages

Group Level Cache Based on RAM size

Knowledgebase: Memory Consumption Logging and Status 

Knowledgebase: RAMblings - Opinions on Scaling Memory in MarkLogic Server 

MarkLogic default Group Level Cache and Huge Pages settings

The table below shows the default (and recommended) group level cache settings based on a few common RAM configurations for the 9.0-9.1 release of MarkLogic Server:

Total RAM List Cache Compressed Tree Cache Expanded Tree Cache Triple Cache Triple Value Cache Default Huge Page Ranges
8192 (8GB) 1024 (1 partition) 512 (1 partition) 1024 (1 partition) 512 (1 partition) 1024 (2 partitions) 1280 to 1994
16384 (16GB) 2048 (1 partition) 1024 (2 partitions) 2048 (1 partition) 1024 (2 partitions) 2048 (2 partitions) 2560 to 3616
24576 (24GB) 3072 (1 partition) 1536 (2 partitions) 3072 (1 partition) 1536 (2 partitions) 3072 (4 partitions) 3840 to 4896
32768 (32GB) 4096 (2 partitions) 2048 (3 partitions) 4096 (2 partitions) 2048 (3 partitions) 4096 (6 partitions) 5120 to 6176
49152 (48GB) 6144 (2 partitions) 3072 (4 partitions) 6144 (2 partitions) 3072 (4 partitions) 6144 (8 partitions) 7680 to 8736
65536 (64GB) 8064 (3 partitions) 4032 (6 partitions) 8064 (3 partitions) 4096 (6 partitions) 8192 (11 partitions) 10080 to 11136
98304 (96GB) 12160 (4 partitions) 6080 (8 partitions) 12160 (4 partitions) 6144 (8 partitions) 12160 (16 partitions) 15200 to 16256
131072 (128GB) 16384 (6 partitions) 8192 (11 partitions) 16384 (6 partitions) 8192 (11 partitions) 16384 (22 partitions) 20480 to 21020
147456 (144GB) 18432 (6 partitions) 9216 (12 partitions) 18432 (6 partitions) 9216 (12 partitions) 18432 (24 partitions) 23040 to 24096
262144 (256GB) 32768 (9 partitions) 16384 (11 partitions) 32768 (9 partitions) 16128 (22 partitions) 32256 (32 partitions) 40320 to 42432
524288 (512GB) 65536 (22 partitions) 32768 (32 partitions) 65536 (32 partitions) 32768 (32 partitions) 65536 (32 partitions) 81920 to 82460

Note that these values are safe to use for MarkLogic 7 and above.

For all the databases that ship with MarkLogic Server, the Huge Pages ranges on this table will cover the out-of-the box configuration. Note that adding more forests will cause the second value in the range to increase.

From MarkLogic Server 9.0-7 and above

In the 9.0-7 release and above (and all versions of MarkLogic 10), automatic cache sizing was introduced; this setting is usually recommended.

Note: For RAM size greater than 256GB, group cache settings are configured the same as for 256GB with automatic cache sizing. These can be changed using manual cache sizing.

Maximum group level cache settings

Assuming a Server configured with 256GB RAM (and above), these are the maximum sizes for the three main group level caches and will utilise 180GB (184320MB) per host for the Group Level Caches:

  • Expanded Tree Cache - 73728 (72GB) (with 9 8GB partitions)
  • List Cache - 73728 (72GB) (with 9 8GB partitions)
  • Compressed Tree Cache - 36864 (36GB) (with 11 3 GB partitions)

We have found that configuring 4GB partitions for the Expanded Tree Cache and the List Cache generally works well in most cases; for this you would set the number of partitions to 18

For the Compressed Tree Cache the number of partitions can be set to 22.

Important note

  • The maximum number of configurable partitions is 32
  • Each cache partition should be no more than 8192 MB

Introduction

This article is intended to give you enough information to enable you to understand the output from query console's profiler.

Details

Query

Consider the following XQuery:

xquery version '1.0-ml';
let $path := '/Users/chamlin/Downloads/medsamp2012.xml'
let $citations := xdmp:document-get($path)//MedlineCitation
for $citation at $i in $citations
return
xdmp:document-insert(fn:concat("/",$i,$citation/PMID/fn:string(),".xml"), $citation)

This FLWOR expression will load an XML file into memory, then find each MedlineCitation element and insert it as a document in the database.  Although this example is very simple, it should give us enough information to understand what the profiler does and how to interpret its output.

Scenario / Walkthrough

Setup

  • Download the small dataset for medline at http://www.nlm.nih.gov/databases/dtd/medsamp2012.xml and save it to your MarkLogic machine
  • Open a buffer in Query Console
  • Load the XML fragments into your nominated database by executing the XQuery shown above, altering $path so it points to your downloaded medsamp2012.xml file
  • You should have loaded 156 Medline XML fragments in your database if everything worked correctly.  If you receive an error, make sure that the file is available to MarkLogic and has the correct permissions to allow access

Profile the query

Now run the same query again, only this time, ensure "Profile" is selected before you hit the run button.

You should see output similar to the following:

[Image: QConsole profiler output]

 

The header shows overall statistics for the query:

Profile 627 Expressions PT0.286939S
The number of XQuery expression evaluations, along with the elapsed time for the entire query expressed as an xs:dayTimeDuration (hence the PT prefix)

The table shows the profiler results for the expressions evaluated in the request, one row for each expression:

Module:Line No.:Col No. - The point in the code where the expression can be found.
Count - The number of times the expression was evaluated.
Shallow % - The percentage of time spent evaluating a particular expression compared to the entire query, excluding any time spent evaluating any sub-expressions.
Shallow µs - The time (in microseconds) taken for all the evaluations of a particular expression. This excludes time spent evaluating any sub-expressions.
Deep % - The percentage of time spent evaluating a particular expression compared to the entire query, including any time spent evaluating any sub-expressions.
Deep µs - The time (in microseconds) taken for all the evaluations of a particular expression. This includes time spent evaluating any sub-expressions.
Expression - The particular XQuery expression being profiled and timed. Expressions can represent FLWOR expressions, literal values, arithmetic operations, functions, function calls, and other expressions.

Shallow time vs deep time

In profiler output you will usually want to pay the most attention to expressions that have a large shallow time.  These expressions are doing the most work, exclusive of work done in sub-expressions.

If an expression has a very large deep time, but almost no shallow time, then most of the time is being spent in sub-expressions.

For example, in the profiler output shown, the FLWOR expression at .main:2:0 has the most deep time since it includes the other expressions, but not a lot of shallow time since it doesn't do much work itself. The expression at .main:3:45 has a lot of deep time, but that all comes from the sub-expression at .main:3:18, which takes the most time.

Sorting

The default sorting of the table is by Shallow % descending.  This is generally a good view, as it will bring the expressions taking the most shallow time to the top.  You can sort on a different column by clicking on the column header.

Cold vs warm

Timings may change for a query if you execute it more than once, due to the caching performed by MarkLogic.  A query will be slower if it needs data that is not available in the caches (cold) vs where much of the information is available from caches (warm).  This is by design and gives better performance as the system runs and caches frequently used information.

Lazy evaluation

Another characteristic of MarkLogic Server is its use of lazy evaluation.  A relatively expensive evaluation may return quickly without performing the work necessary to produce results.  Then, when the results are needed, the work will actually be performed and the evaluation time will be assigned at that point.  This can give surprising results.

Wrapping an expression in xdmp:eager() will evaluate it eagerly, giving a better idea of how much time it really takes because the time for evaluation will be better allocated in the profiler output.
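
A minimal sketch of this technique; the word query is an illustrative placeholder.

xquery version "1.0-ml";

(: Force eager evaluation so the profiler charges the search work to this
   expression rather than to wherever the lazy result is first consumed. :)
let $results := xdmp:eager(cts:search(fn:doc(), cts:word-query("example")))
return fn:count($results)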

Further reading

Background

A database consists of one or more forests. A forest is a collection of documents (mostly XML trees, thus the name), implemented as a physical directory on disk. Each forest holds a set of documents and all their indexes. 

When a new document is loaded into MarkLogic Server, the server puts this document in an in-memory stand and writes the action to an on-disk journal to maintain transactional integrity in case of system failure. After enough documents are loaded, the in-memory stand will fill up and be flushed to disk, written out as an on-disk stand. As more documents are loaded, they go into a new in-memory stand. At some point this in-memory stand fills up as well, and the in-memory stand gets written as yet another new on-disk stand.

To read a single term list, MarkLogic must read the term list data from each individual stand and unify the results. To keep the number of stands to a manageable level where that unification isn't a performance concern, MarkLogic runs merges in the background. A merge takes some of the stands on disk and creates a new singular stand out of them, coalescing and optimizing the indexes and data, as well as removing any previously deleted fragments.
Each forest has its own in-memory stand and set of on-disk stands. Loading and indexing content is a largely parallelizable activity so splitting the loading effort across forests and potentially across machines in a cluster can help scale the ingestion work.

Deletions and Multi-Version Concurrency Control (MVCC)

What happens if you delete or change a document? If you delete a document, MarkLogic marks the document as deleted but does not immediately remove it from disk. The deleted document will be removed from query results based on its deletion markings, and the next merge of the stand holding the document will bypass the deleted document when writing the new stand. MarkLogic treats any changed document like a new document, and treats the old version like a deleted document.

This approach is known in database circles as Multi-Version Concurrency Control (MVCC).
In an MVCC system changes are tracked with a timestamp number which increments for each transaction as the database changes. Each fragment gets its own creation-time (the timestamp at which it was created) and deletion-time (the timestamp at which it was marked as deleted, starting at infinity for fragments not yet deleted).

For a request that doesn't modify data the system gets a performance boost by skipping the need for any URI locking. The query is viewed as running at a certain timestamp, and throughout its life it sees a consistent view of the database at that timestamp, even as other (update) requests continue forward and change the data.

Updates and Deadlocks

An update request, because it isn't read-only, has to use read/write locks to maintain system integrity while making changes. Read-locks block for write-locks; write-locks block for both read and write-locks. An update has to obtain a read-lock before reading a document and a write-lock before changing (adding, deleting, modifying) a document. Lock acquisition is ordered, first-come first-served, and locks are released automatically at the end of a request.

In any lock-based system you have to worry about deadlocks, where two or more updates are stalled waiting on locks held by the other. In MarkLogic deadlocks are automatically detected with a background thread. When the deadlock happens on the same host in a cluster, the update farthest along (with the most locks) wins and the other update gets restarted. When it happens on different hosts, because lock count information isn't in the wire protocol, both updates start over.

MarkLogic differentiates queries from updates using static analysis. Before running a request, it looks at the code to determine if it includes any calls to update functions. If so, it's an update. If not, it's a query. Even if at execution time the update doesn't actually invoke the updating function, it still runs as an update.

For the most part, locking is not under the control of the user. The one exception is the xdmp:lock-for-update($uri) call, which requests a write-lock on a document URI without actually having to issue a write, and in fact without the URI even having to exist.

When a request potentially touches millions of documents (such as sorting a large data set to find the most recent items), a query request that runs lock-free will outperform an update request that needs to acquire read-locks and write-locks. In some cases you can speed up the query work by isolating the update work to its own transactional context. This technique only works if the update doesn't have a dependency on the outer query, but that turns out to be a common case.

For example, let's say you want to execute a content search and record the user's search string to the database for tracking purposes. The database update doesn't need to be in the same transactional context as the search itself, and would slow things down if it were. In this case it's better to run the search in one context (read-only and lock-free) and the update in a different context. See the xdmp:eval() and xdmp:invoke() functions for documentation on how to invoke a request from within another request and manage the transactional contexts between the two.
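
A minimal sketch of that pattern, assuming the <isolation>different-transaction</isolation> option to xdmp:eval; the log document structure and URI scheme are illustrative placeholders.

xquery version "1.0-ml";

(: Run the search lock-free; record the search term in its own transaction. :)
let $term    := "marklogic"
let $results := cts:search(fn:doc(), cts:word-query($term))[1 to 10]
let $log :=
  xdmp:eval(
    'declare variable $term external;
     xdmp:document-insert(
       fn:concat("/search-log/", xdmp:random(), ".xml"),
       <search-log><term>{$term}</term><when>{fn:current-dateTime()}</when></search-log>)',
    (xs:QName("term"), $term),
    <options xmlns="xdmp:eval"><isolation>different-transaction</isolation></options>)
return $results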

Document Lifecycle

Let's track the lifecycle of a document from first load to deletion until the eventual removal from disk. A document load request acquires a write-lock for the target URI as part of the xdmp:document-load() function call. If any other request is already doing a write to the same URI, our load will block for it, and vice versa. At some point, when the full update request completes successfully (without any errors that would implicitly cause a rollback), the actual insertion work begins, processing the queue of update work orders.

MarkLogic starts by parsing and indexing the document contents, converting the document from XML to a compressed binary fragment representation. The fragment gets added to the in-memory stand. At this point the fragment is considered a nascent fragment, a term you'll see sometimes on the administration console status pages. Being nascent means it exists in a stand but hasn't been fully committed. (On a technical level, nascent fragments have creation and deletion timestamps both set to infinity, so they can be managed by the system while not appearing in queries prematurely.) If you're doing a large transactional insert you'll accumulate a lot of nascent fragments while the documents are being processed. They stay nascent until they've been committed.

Once the fragment is placed into the in-memory stand, the request is ready to commit. It obtains the next timestamp value, journals its intent to commit the transaction, and then makes the fragment available by setting the creation timestamp for the new fragment to the transaction's timestamp. At this point it's a durable transaction, replayable in the event of server failure, and it's available to any new queries that run at this timestamp or later, as well as any updates from this point forward (even those in progress). As the request terminates, the write-lock gets released.

Our document lives for a time in the in-memory stand, fully queryable and durable, until at some point the in-memory stand fills up and gets written to disk. Our document is now in an on-disk stand. Sometime later, based on merge algorithms, the on-disk stand will get merged with some other on-disk stands to produce a new on-disk stand. The fragment will be carried over, its tree data and indexes incorporated into the larger stand. This might happen several times.

At some point a new request makes a change to the document, such as with an xdmp:node-replace() call. The request making the change first obtains a read-lock on the URI when it first accesses the document, then promotes the read-lock to a write-lock when executing the xdmp:node-replace() call. If another write-lock were already present on the URI from another executing update, the read-lock would have blocked until the other write-lock released. If another read-lock were already present, the lock promotion to a write-lock would have blocked. Assuming the update request finishes successfully, the work runs similar to before: parsing and indexing the document, writing it to the in-memory stand as a nascent fragment, acquiring a timestamp, journaling the work, and setting the creation timestamp to make the fragment live. Because it's an update, it has to mark the old fragment as deleted also, and does that by setting the deletion timestamp of the original fragment to the transaction timestamp. This combination effectively replaces the old fragment with the new. When the request concludes, it releases its locks. Our document is now deleted, replaced by the new version.
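
A minimal sketch of such an update (the URI and element names are hypothetical): reading the document takes the read-lock, and the xdmp:node-replace() call promotes it to a write-lock.

xquery version "1.0-ml";
let $doc := fn:doc("/docs/example.xml")   (: read-lock acquired here :)
return xdmp:node-replace($doc/doc/name, <name>Jane Smith</name>)
   (: write-lock taken here; the old fragment is marked deleted at commit :)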

The old fragment still exists on disk, of course. In fact, any query that was already in progress before the update incremented the timestamp, or any query doing time travel with an old timestamp, can still see it. Eventually the on-disk stand holding the fragment will be merged again, at which point the old fragment will be completely removed from the system. It won't be written into the new on-disk stand. That is, unless the administration "merge timestamp" was set to allow deep time travel. In that case it will live on, sticking around in case any new queries want to time travel to see old fragments.

Summary

Hung messages in the ErrorLog indicate that MarkLogic Server was blocked while waiting on host resources, typically I/O or CPU. 

Debug Level

The presence of Debug-level Hung messages in the ErrorLog does not indicate a critical problem, but it does indicate that the server is under load and intermittently unresponsive for some period of time. A server that is logging Debug-level Hung messages should be closely monitored and the reason(s) for the hangs should be understood. You'll get a Debug-level message if the hang time is greater than or equal to the Group's XDQP timeout.

Warning Level

When the duration of the Hung message is greater than or equal to two times the Group's XDQP timeout setting, the Hung message will appear at the Warning log level. Additionally, if the host is unresponsive to the rest of the cluster (that is, the other hosts have not received a heartbeat from it for the group's host timeout number of seconds), a failover may be triggered.

Common Causes

Hung messages in the ErrorLog have been traced back to the following root causes:

  • MarkLogic Server is installed on a Virtual Machine (VM), and
    • The VM does not have sufficient resources provisioned for peak use; or
    • The underlying hardware is not provisioned with enough resources for peak use.
  • MarkLogic Server is using disk space on a Storage Area Network (SAN) or Network Attached Storage (NAS) device, and
    • The SAN or NAS is not provisioned to handle peak load; or
    • The network that connects the host to the storage system is not provisioned to handle peak load.
  • Other enterprise level software is running on the same hardware as MarkLogic Server. MarkLogic Server is designed with the assumption that it is running on dedicated hardware.
  • A file backup or a virus scan utility is running against the same disk where forest data is stored, overloading the I/O capabilities of the storage system.
  • There is insufficient I/O bandwidth for the merging of all forests simultaneously.
  • Re-indexing overloads the I/O capabilities of the storage system.
  • A query that performs extremely poorly, or a number of such queries, caused host resource exhaustion.

Forest Failover

If the cause of the Hung message further causes the server to be unresponsive to cluster heartbeat requests from other servers in the cluster, for a duration greater than the host timeout, then the host will be considered unavailable and will be voted out of the cluster by a quorum of its peers.  If this happens, and failover is configured for forests stored on the unresponsive host, the forests will fail over.  

Debugging Tips

Look at system statistics (such as SAR data) and system logs from your server for entries that occurred during the time-span of the Hung message.  The goal is to pinpoint the resource bottleneck that is the root cause.

Provisioning Recommendation

The host on which MarkLogic Server runs needs to be correctly provisioned for peak load. 

MarkLogic recommends that your storage subsystem simultaneously support:

  •     20MB/s read throughput, per forest
  •     20MB/s write throughput, per forest

We have found that customers who are able to sustain these throughput rates have not encountered operational problems related to storage resources.

Configuration Tips

If the Hung message occurred during an I/O-intensive background task (such as a database backup, merge, or reindex), consider setting or decreasing the background IO limit. This group-level configuration controls the I/O resources that background I/O tasks will consume.

If the Hung message occurred during a database merge, consider decreasing the merge priority in the database’s Merge Policy.  For example, if the priority is set to "normal", then try decreasing it to "lower".

 

Indexing Best Practices

MarkLogic Server indexes records (or documents/fragments) on ingest. When a database's index configuration is changed, the server will consequently reindex all matching records.

Indexing and reindexing can be a CPU and I/O intensive operation. Reindexing creates a lot of new fragments, with the original fragments being marked for deletion. These deleted fragments will then need to be merged out. All of this activity can potentially affect query performance, especially in systems with under-provisioned hardware.

Reindexing in Production

If you need to add or modify an index on a production cluster, consider scheduling the reindex during a time when your cluster is less busy. If your database is too large to completely reindex during a single period of low usage, consider running the reindex over several periods of time. For example, if your low usage period is during a weekend, the process may look like:

  • Change your index configuration on a Friday night
  • Let the reindex run for most of the weekend
  • To pause the reindex, set the reindexer-enable field to 'false' for the database being reindexed (a sketch using the Admin API appears below). Be sure to allow sufficient time for the associated merging to complete before system load comes back.
  • If needed, reindexing can continue over the next weekend - the reindexer process will pick up where it left off before it was disabled.

You can refer to https://help.marklogic.com/Knowledgebase/Article/View/18/15/how-reindexing-works-and-its-impact-on-performance for more details on invoking reindexing on production.
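
The reindexer can be paused through the Admin UI or with a script. Here is a minimal sketch using the Admin API, assuming a hypothetical database named "Documents":

xquery version "1.0-ml";
import module namespace admin = "http://marklogic.com/xdmp/admin"
  at "/MarkLogic/admin.xqy";

(: Disable the reindexer for the database; pass fn:true() to resume later. :)
let $config := admin:get-configuration()
let $config := admin:database-set-reindexer-enable(
                 $config, xdmp:database("Documents"), fn:false())
return admin:save-configuration($config)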

When you have Database Replication Configured

If you have to add or modify indexes on a database which has database replication configured, make sure the same changes are made on the Replica cluster as well. Starting with MarkLogic Server version 9.0-7, index data is also replicated from the Master to the Replica, but the server does not automatically check that both sides have the same index settings. Reindexing is disabled by default on a replica cluster. However, when the database replication configuration is removed (such as after a disaster), the replica database will reindex as necessary. So it is important that the Replica database index configuration matches the Master's to avoid unnecessary reindexing.

Further reading -

Master and Replica Database Index Settings

Database Replication - Indexing on Replica Explained

Avoid Unused Range Indexes, Fields, and Path Indexes

In addition to taking up extra disk space, Range, Field, and Path Indexes require extra work when it's time to reindex. Field and Path indexes may also require extra indexing passes.

Avoid Using Namespaces to Implement Multi-Tenancy

It's a common use case to want to create some kind of partition (or multiple partitions) between documents in a particular database. In such a scenario it's far better to 1) constrain the partitioning information to a particular element in a document (then include a clause over that element in your searches), than it is to 2) attempt to manage partitions via unique element namespaces corresponding to each partition. For example, given two documents in two different partitions, you'll want them to look like this:

1a. <doc><partition>partition1</partition><name>Joe Smith</name></doc>

1b. <doc><partition>partition2</partition><name>John Smith</name></doc>

...vs. something like this:

2a. <doc xmlns:p="http://partition1"><p:name>Joe Smith</p:name></doc>

2b. <doc xmlns:p="http://partition2"><p:name>John Smith</p:name></doc>

Why is #1 better? In terms of searching the data once it's indexed, there's actually not much of a difference - one could easily create searches to accommodate both approaches. The issue is how the indexing works in practice. MarkLogic Server indexes all content on ingest. In scenario #2, every time a new partition is created, a new range element index needs to be defined in the Admin UI, which means your index settings have changed, which means the server now needs to reindex all of your content - not just the documents corresponding to the newly introduced partition. In contrast, for scenario #1, all that needs to be done is to ingest the documents corresponding to the new partition, which are then indexed just like all the other existing content. The searches in scenario #1 would still need to change, since they would not yet include a clause for the new partition (for example: cts:element-value-query(xs:QName("partition"), "partition2")) - but changing the searches is ultimately far less intrusive than reindexing your entire database, as scenario #2 would require. Note that in scenario #2 the searches would need to change as well, in addition to the database-wide reindex.
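
A minimal sketch of approach #1 (element names follow the example documents above): the partition clause is simply another term in the query, so adding a partition never requires an index change.

xquery version "1.0-ml";
declare variable $partition := "partition2";

cts:search(
  fn:doc(),
  cts:and-query((
    cts:element-value-query(xs:QName("partition"), $partition),
    cts:element-word-query(xs:QName("name"), "Smith")
  )))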

Keep an Eye on I/O Throughput

Reindexing can lead to heavy merge activity and may lead to disk I/O bottlenecks if not managed carefully. If you have a system that is available 24-7 with no downtime window, then you may need to throttle the reindexer in order to keep the disk I/O to a minimum. We suggest the following database settings for reindexing a system that must always remain in use:

  • reindexer-throttle = 3
  • large-size-threshold = 1048576

You can also adjust the following group settings to help limit background I/O:

  • background-io-limit = 100

This will limit the background I/O for that group to 100 MB/sec per host across all hosts in that group. This should only be configured if merges are causing problems; it is a way of throttling back the I/O used by the merging process. 100 is a good starting point, and may be increased in increments of 50 if you find that your merges are progressing too slowly. Proceed with caution, as too low a background IO limit can have negative performance or even catastrophic consequences.
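
The setting can be changed in the Admin UI under the group configuration, or with a script. The following is a minimal sketch, assuming the Admin API function admin:group-set-background-io-limit and a group named "Default":

xquery version "1.0-ml";
import module namespace admin = "http://marklogic.com/xdmp/admin"
  at "/MarkLogic/admin.xqy";

(: Limit background I/O (merges, backups, reindexing) to 100 MB/sec per host. :)
let $config := admin:get-configuration()
let $config := admin:group-set-background-io-limit(
                 $config, admin:group-get-id($config, "Default"), 100)
return admin:save-configuration($config)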

General Recommendations

In general, your indexing/reindexing and subsequent search experience will be better if you follow the practices described above: remove unused range, field, and path indexes; avoid using namespaces to implement multi-tenancy; and schedule and throttle reindexing so that it stays within your system's I/O capacity.

Summary

This article contains a high level overview of Transparent Huge Pages and Huge Pages. It covers the configuration of Huge Pages and offers advice as to when Huge Pages should be used and how they can be configured.

To the Linux kernel, "pages" describe a unit of memory; by default a page is 4 KiB (4096 bytes), which you can confirm from the terminal by issuing a call to getconf PAGESIZE. Huge Pages are much larger, typically 2048 KiB (2 MiB) each on x86.

Huge Pages (and Transparent Huge Pages) allow areas of memory to be reserved for resources which are likely to be accessed frequently, such as group level caches. Enabling (and configuring) Huge Pages can increase performance because - when enabled - caches should always be resident in memory.

Huge Pages

In general you should follow the recommendation stated in the MarkLogic Installation Guide for All Platforms, which states:

On Linux systems, MarkLogic recommends setting Linux Huge Pages to around 3/8 the size of your physical memory. For details on setting up Huge Pages, refer to the following KB

Group Caches and Linux Huge Pages

Caution

Since the OS and server perform many memory allocations that do not and cannot use huge pages, it may not be possible to configure the full 3/8 for huge pages.  It is not advised to configure more than 3/8 of memory to huge pages.

Calculating the number of Huge Pages to configure:

On an x86 system the default Huge Page size is 2048 KiB (2 MiB). This can be confirmed using the command "cat /proc/meminfo | grep Hugepagesize". On a system with 64GiB of physical memory, 3/8 of memory is 24GiB, which corresponds to 12288 Huge Pages.

Alternatively, MarkLogic provides a recommended range for the number of Huge Pages that should be used.  This recommendation can be seen in the ErrorLog.txt file located in /var/opt/MarkLogic/Logs/ just after the server is started.  Right after starting the server, look for a message that looks like this:

2019-09-01 17:33:14.894 Info: Linux Huge Pages: detected 0, recommend 11360 to 15413

The lower bound includes all group-level caches, while the upper bound also includes in-memory stand sizes.

Allocating Huge Pages

Since Huge Pages require large areas of contiguous physical memory, it is advised to allocate huge pages at boot time.  This can be accomplished by placing vm.nr_hugepages = 11360 into the /etc/sysctl.conf file.

Transparent Huge Pages

The Transparent Huge Page (THP) implementation in the Linux kernel includes functionality that provides compaction. Compaction operations are system level processes that are resource intensive, potentially causing resource starvation to the MarkLogic process. Using static Huge Pages is the preferred memory configuration for several high performance database platforms including MarkLogic Server. The recommended method to disable THP on Red Hat Enterprise Linux (RHEL) 7 and 8, is to disable it in the grub2 configuration file, and then rebuild the grub2 configuration.  The following articles from Red Hat detail the process of disabling THP:

How to disable transparent hugepages (THP) on Red Hat Enterprise Linux 7

How to disable transparent hugepage (THP) on Red Hat Enterprise Linux 8

Previous Releases

If you are using Red Hat Enterprise Linux or CentOS 6, you must turn off Transparent Huge Pages (Transparent Huge Pages are configured automatically by the operating system).

The preferred method to disable Transparent HugePages is to add "transparent_hugepage=never" to the kernel boot line in the "/etc/grub.conf" file.

This solution (disabling Transparent HugePages) is covered in detail in this article on RedHat's website

[ref:https://access.redhat.com/documentation/en-us/red_hat_enterprise_linux/7/html/virtualization_tuning_and_optimization_guide/sect-virtualization_tuning_optimization_guide-memory-tuning#sect-Virtualization_Tuning_Optimization_Guide-Memory-Huge_Pages]

[ref: http://docs.marklogic.com/guide/installation/intro#id_11335]


RHEL 6 | CentOS 6: On kernels newer than kernel-2.6.32-504.23.4, a race condition can manifest itself as a system crash in region_* functions when HugePages are enabled. [1]

Resolution options:
1. Update kernel to 2.6.32-504.40.1.el6 or later, which includes the fix for this issue.

2. Downgrade the kernel to a version before 2.6.32-504.23.4.

3. Disable Huge Pages completely, which might impact MarkLogic’s performance by ~5-15%.
    (Set the value of vm.nr_hugepages=0)

References:
[1] [ref: RedHat Knowledge-base article: https://access.redhat.com/solutions/1992703]

Summary

MarkLogic performs best if swap space is not used.  There are other knowledge base articles that discuss sizing recommendations when configuring your MarkLogic Server. This article discusses the Linux swappiness setting that can limit the amount of swap activity in the event that swap space is required.

Details

Beginning with kernel version 3.5 (RHEL7) and 2.6.32-303 (RHEL 6.4),  MarkLogic suggests setting vm.swappiness to '1', and our recommendation is for it to be set to a value no greater than 10.  With older Linux kernel versions, vm.swappiness can be set to zero, but we do not recommend setting swappiness to 0 on newer kernels. 

You can check the swappiness setting on your Linux servers with the following command: 

    sysctl -a | grep swapp

If it doesn't exist already, add vm.swappiness=1 to the /etc/sysctl.conf file and execute the following command to apply this change:

     sudo sysctl -f

Depending on kernel version, the default Linux value for swappiness is 40-60. This is appropriate for a desktop system that runs a variety of applications but has a relatively small amount of memory. The default setting is not appropriate for a server system intended to run a single dedicated process such as MarkLogic Server.

A warning on swappiness

The behaviour of swappiness on newer Linux kernels has changed. On kernels for Linux greater than 3.5 and 2.6.32-303, setting swappiness to 0 will more aggressively avoid swapping, which increases the risk of out-of-memory (OOM) killing under strong memory and I/O pressure. To achieve the same behavior of swappiness as previous versions in which the recommendation was to set swappiness to 0, set swappiness to the value of 1. We do not recommend setting swappiness to 0 on newer kernels. 

Other Settings

While you are making changes, the following kernel settings will also help:

vm.dirty_background_ratio=1

vm.dirty_ratio=40

vm.dirty_background_ratio is the percentage of system memory that can be filled with “dirty” pages — memory pages that still need to be written to disk — before the pdflush/flush/kdmflush background processes kick in to write it to disk. We suggest setting it to 1%, so if your virtual server has 64 GB of memory that’s ~2/3 GB of data that can be sitting in RAM before something is done.

vm.dirty_ratio is the absolute maximum amount of system memory that can be filled with dirty pages before everything must get committed to disk. When the system gets to this point all new I/O blocks until dirty pages have been written to disk. This is often the source of long I/O pauses, but is a safeguard against too much data being cached unsafely in memory.

Swappiness explained

When the computer runs out of memory, it has to kick out some memory.  At that moment, it has 2 choices:

1)    Kick out pages of mmap'ed files. This is cheaper when the mapping is read-only (for example, MarkLogic range indexes are read-only), since those pages can be re-read from the file later.

2)    Kick out anon memory.

If swappiness is large, Linux would prefer #2 over #1.  If swappiness is small, Linux would prefer #1 over #2.

Additional Reading

Knowledgebase: Swap Space Requirements 

Knowledgebase: Memory Consumption Logging and Status 

Knowledgebase: RAMblings - Opinions on Scaling Memory in MarkLogic Server 

MarkLogic Linux Tuned Profile

Summary

The tuned tuning service can change operating system settings to improve performance for certain workloads. Different tuned profiles are available and choosing the profile that best fits your use case simplifies configuration management and system administration. You can also write your own profiles, or extend the existing profiles if further customization is needed. The tuned-adm command allows users to switch between different profiles.

RedHat Performance and Tuning Guide: tuned and tuned-adm

  • tuned-adm list will list the available profiles
  • tuned-adm active will list the active profile

Creating a MarkLogic Tuned Profile

Using the throughput-performance profile, we can create a custom tuned profile for MarkLogic Server. First create the directory for the MarkLogic profile:

sudo mkdir /usr/lib/tuned/MarkLogic/

Next, create the tuned.conf file that will include the throughput-performance profile, along with our recommended configuration:

#
# tuned configuration
#

[main]
summary=Optimize for MarkLogic Server on Bare Metal
include=throughput-performance

[sysctl]
vm.swappiness = 1
vm.dirty_ratio = 40
vm.dirty_background_ratio=1

[vm]
transparent_hugepages=never

Activating the MarkLogic Tuned Profile

Now when we run tuned-adm list it should show us the default profiles, as well as our new MarkLogic profile:

$ tuned-adm list
Available profiles:
- MarkLogic                   - Optimize for MarkLogic Server
- balanced                    - General non-specialized tuned profile
- desktop                     - Optimize for the desktop use-case
- hpc-compute                 - Optimize for HPC compute workloads
- latency-performance         - Optimize for deterministic performance at the cost of increased power consumption
- network-latency             - Optimize for deterministic performance at the cost of increased power consumption, focused on low latency network performance
- network-throughput          - Optimize for streaming network throughput, generally only necessary on older CPUs or 40G+ networks
- powersave                   - Optimize for low power consumption
- throughput-performance      - Broadly applicable tuning that provides excellent performance across a variety of common server workloads
- virtual-guest               - Optimize for running inside a virtual guest
- virtual-host                - Optimize for running KVM guests
Current active profile: virtual-guest

Now we can make MarkLogic the active profile:

$ sudo tuned-adm profile MarkLogic

And then check the active profile:

$ tuned-adm active
Current active profile: MarkLogic

Disabling the Tuned Daemon

The tuned daemon does have some overhead, and so MarkLogic recommends that it be disabled. When the daemon is disabled, tuned will only apply the profile settings and then exit. Update the /etc/tuned/tuned-main.conf and set the following value:

daemon = 0

References

There are various operating system settings that MarkLogic prescribes for best performance. During the startup of a MarkLogic Server instance, some of these parameters are set to the recommended values. These parameters include:

  • File descriptor limit
  • Number of processes per user
  • Swappiness
  • Dirty background ratio
  • Max sectors
  • Read ahead

For some settings, Info level error log messages are recorded to indicate that these values were changed.  For example, the MarkLogic Server error log might include a line similar to:

2020-03-03 12:40:25.512 Info: Reduced Linux kernel swappiness to 1
2020-03-03 12:40:25.512 Info: Reduced Linux kernel dirty background ratio to 1
2020-03-03 12:40:25.513 Info: Reduced Linux kernel read ahead to 512KB for vda
2020-03-03 12:40:25.513 Info: Increased Linux kernel max sectors to 2048KB for vda

Introduction: the decimal type

In order to be compliant with the XQuery specification and to satisfy the needs of customers working with financial data, MarkLogic Server implements a decimal type, available in XQuery and server-side JavaScript.

The decimal type has been implemented for very specific requirements: decimals have about a dozen more bits of precision than doubles, but they take up more memory and arithmetic operations over them are much slower.

Use the double type where possible

Unless you have a specific requirement to use the decimal data type, in most cases it's better and faster to use the double data type to represent large numbers.

Specific details about the decimal data type

If you still want or need to use the decimal data type, below are its limitations and details on how exactly it is implemented in MarkLogic Server:

o   Precision

  • How many decimal digits of precision does it have?

The MarkLogic implementation of xs:decimal is designed to meet the XQuery specification requirement of at least 18 decimal digits of precision. In practice, up to 19 decimal digits can be represented with full fidelity.
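
A small sketch of the difference in practice: two 19-digit integers that differ by 1 remain distinct as xs:decimal values, but collapse to the same xs:double (doubles carry roughly 15-16 decimal digits).

xquery version "1.0-ml";
let $d := xs:decimal("1234567890123456789") - xs:decimal("1234567890123456788")
let $f := xs:double("1234567890123456789") - xs:double("1234567890123456788")
return ($d, $f)   (: returns 1 and 0 :)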

  • If it is a binary number, how many binary digits of precision does it have?

A decimal number is represented inside MarkLogic with 64 binary bits of digits, plus an additional 64 bits holding the sign and a scale (which specifies where the decimal point is).

  • What are the exact upper and lower bounds of its precision?

-18446744073709551615 to 18446744073709551615 

Any operation producing a number outside this range will result in an XDMP-DECOVRFLW (decimal overflow) error.

o   Scale

  • Does it have a fixed scale or floating scale?

It has a floating scale.

  • What are the limitations on the scale?

-20 to 0

So the smallest non-zero magnitude you can represent is 1 * (10^-20) and the largest is 18446744073709551615.

  • Is the scale binary or decimal?

Decimal

  • How many decimal digits can it scale?

20

  • How many binary digits can it scale?

N/A

  • What is the smallest number it can represent and the largest?

smallest: -1*(2^64)
closest to zero: 1*(10^-20)
largest: (2^64)

  • Are all integers safe or does it have a limited safe range for integers?

It can represent 64 bit unsigned integers with full fidelity.

 

o   Limitations

  • Does it have binary rounding errors?

The division algorithm on Linux in particular does convert to an 80-bit binary floating point representation to calculate reciprocals - which can result in binary rounding errors. Other arithmetic algorithms work solely in base 10.

  • What numeric errors can it throw and when?

Overflow: The number is too big or too small to represent
Underflow: The number is too close to zero to represent
Loss of precision: The result has too many digits of precision (essentially the 64-bit digits value has overflowed)

  • Can it represent floating point values, such as NaN, -Infinity, +Infinity, etc.?

 No

o   Implementation

  • How is the DECIMAL data type implemented?

It has a representation with 64 bits of digits, a sign, and a base-10 exponent (fixed to range from -20 to 0). So the value is calculated like this:

sign * digits * (10 ^ exponent)

  • How many bytes does it consume?

On disk, for example in triple indexes, it's not a fixed size as it uses integer compression. At maximum, the decimal scalar type consumes 16 bytes per value: eight bytes of digits, four bytes of sign, and four bytes of scale. It is not space efficient but it keeps the digits aligned on eight-byte boundaries.

SUMMARY

This article will help MarkLogic Administrators and System Architects who need to understand how to provision the I/O capacity of their MarkLogic installation.

MarkLogic Disk Usage

Databases in MarkLogic Server are made up of forests. Individual forests are made up of stands. In the interests of both read and write performance, MarkLogic Server doesn't update data already on disk. Instead, it simply writes to the current in-memory stand, which will then contain the latest version of any new or changed fragments, and old versions are marked as obsolete. The current in-memory stand will eventually become yet another on-disk stand in a particular forest.

Ultimately, however, the more stands or obsolete fragments there are in a forest, the more time it takes to resolve a query. Merges are a background process that reduce the number of stands and purge obsolete fragments in each forest in a database, thereby improving the time it takes to resolve queries. Because merges are so important to the optimal operation of MarkLogic Server, it's important to provision the appropriate amount of I/O bandwidth, where each forest will typically need 20MB/sec read and 20MB/sec write. For example, a machine hosting four forests will typically need sufficient I/O bandwidth for both 80MB/sec read and 80MB/sec write.
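
As a quick sanity check on an existing host, the following sketch counts the forests assigned to the current host and reports the read and write bandwidth it should be provisioned for, at 20MB/sec per forest:

xquery version "1.0-ml";
let $forest-count := fn:count(xdmp:host-forests(xdmp:host()))
return
  <io-provisioning>
    <forests>{$forest-count}</forests>
    <read-mb-per-sec>{20 * $forest-count}</read-mb-per-sec>
    <write-mb-per-sec>{20 * $forest-count}</write-mb-per-sec>
  </io-provisioning>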

Determining I/O Bandwidth

One way to determine I/O bandwidth would be to use a synthetic benchmarking utility to return the available read and write bandwidth for the system as currently provisioned. While useful in terms of getting a ballpark sense of the I/O capacity, this approach unfortunately does not provide any information about the real world demand that will ultimately be placed on that capacity.

Another way would be to actually load test a candidate provisioning against the application you're going to deploy on this cluster. If you start from our general recommendations (from MarkLogic: Understanding System Resources) then do an application level load test (paying special attention to I/O heavy activities like ingestion or re-indexing, and the subsequent merging), the system metrics from that load test will then tell you what, if any, bottlenecks or extra capacity may exist on the system across not only your I/O subsystem, but for your CPU and RAM usage as well.

For both of these approaches (measuring capacity via synthetic benchmarks or measuring both capacity and demand vs. synthetic application load), it would also be useful to have some sense of the theoretical available I/O bandwidth before doing any testing. In other words, if you're provisioning shared storage like SAN or NAS, your storage admin should have some idea of the bandwidth available to each of the hosts. If you're provisioning local disk, you probably already have some performance guidance from the vendors of the I/O controllers or disks being used in your nodes. We've seen situations in the past where actual available bandwidth has been much different from expected, but at a minimum the expected values will provide a decent baseline for comparison against your eventual testing results.

Additional Resources

 

With the release of MarkLogic Server versions 8.0-8 and 9.0-4, memory use broken out by major areas is periodically recorded to the error log. These diagnostic messages can be useful for quickly identifying memory resource consumption at a glance, and can aid in determining where to investigate memory-related issues.

Error Log Message and Description of Details

At one hour intervals, an Info level log message will be written to the server error log in the following format:

Info: Memory 46% phys=255137 size=136452(53%) rss=20426(8%) huge=97490(38%) anon=1284(0%) swap=1(0%) file=37323(14%) forest=49883(19%) cache=81920(32%) registry=1(0%)

The error log entry contains memory-related figures for non-zero statistics: raw figures are in megabytes; percentages are relative to the amount of physical memory reported by the operating system. Except for phys, all values are for the MarkLogic Server process alone.  The figures include

Memory: percentage of physical memory consumed by the MarkLogic Server process
phys: size of physical memory in the machine
size: total process memory for the MarkLogic process; basically huge+anon+swap+file on Linux.  This includes memory-mapped files, even if they are not currently in physical memory.
swap: swap consumed by the MarkLogic Server process
rss: Resident Set Size reported by the operating system
anon: anonymous mapped memory used by the MarkLogic Server
file: total amount of RAM for memory-mapped data files used by the MarkLogic Server (the MarkLogic Server executable itself, for example, is memory-mapped by the operating system, but is not included in this figure)
forest: forest-related memory allocated by the MarkLogic Server process
cache: user-configured cache memory (list cache, expanded tree cache, etc.) consumed by the MarkLogic Server process
registry: memory consumed by registered queries
huge: huge page memory reserved by the operating system
join: memory consumed by joins for active running queries within the MarkLogic Server process
unclosed: unclosed memory, signifying memory consumed by unclosed or obsolete stands still held by the MarkLogic Server process

In addition to reporting once an hour, the Info level error log entry is written whenever the amount of main memory used by MarkLogic Server changes by more than five percent from one check to the next. MarkLogic Server will check the raw metering data obtained from the operating system once per minute. If metering is disabled, the check will not occur and no log entries will be made.

With the release of MarkLogic Server versions 8.0-8 and 9.0-5, this same information will be available in the output from the function xdmp:host-status().

<host-status xmlns="http://marklogic.com/xdmp/status/host">
. . .
<memory-process-size>246162</memory-process-size>
<memory-process-rss>27412</memory-process-rss>
<memory-process-anon>54208</memory-process-anon>
<memory-process-rss-hwm>73706</memory-process-rss-hwm>
<memory-process-swap-size>0</memory-process-swap-size>
<memory-system-pagein-rate>0</memory-system-pagein-rate>
<memory-system-pageout-rate>14.6835</memory-system-pageout-rate>
<memory-system-swapin-rate>0</memory-system-swapin-rate>
<memory-system-swapout-rate>0</memory-system-swapout-rate>
<memory-size>147456</memory-size>
<memory-file-size>279</memory-file-size>
<memory-forest-size>1791</memory-forest-size>
<memory-unclosed-size>0</memory-unclosed-size>
<memory-cache-size>40960</memory-cache-size>
<memory-registry-size>1</memory-registry-size>
. . .
</host-status>
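
These figures can also be pulled programmatically. A minimal sketch that extracts a few of the memory elements (values are in MB) from xdmp:host-status() for the current host:

xquery version "1.0-ml";
declare namespace hs = "http://marklogic.com/xdmp/status/host";

let $status := xdmp:host-status(xdmp:host())
return
  <memory>
    <size>{fn:data($status/hs:memory-size)}</size>
    <process-size>{fn:data($status/hs:memory-process-size)}</process-size>
    <process-rss>{fn:data($status/hs:memory-process-rss)}</process-rss>
    <forest-size>{fn:data($status/hs:memory-forest-size)}</forest-size>
    <cache-size>{fn:data($status/hs:memory-cache-size)}</cache-size>
  </memory>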


Additionally, with the release of MarkLogic Server 8.0-9.3 and 9.0-7, Warning-level log messages will be reported when the host may be low on memory.  The messages will indicate the areas involved, for example:

Warning: Memory low: forest+cache=97%phys

Warning: Memory low: huge+anon+swap+file=128%phys

The messages are reported if the total memory used by the mentioned areas is greater than 90% of physical memory (phys). As best practice for most use cases, the total of the areas should not be more than around 80% of physical memory, and should be even less if you are using the host for query processing.

Both forest and file include memory-mapped files; for example, range indexes. Since the OS manages the paging in/out of the files, it knows and reports the actual RAM in use; MarkLogic reports the amount of RAM needed if all the mapped files were in memory at once. That's why MarkLogic can even report more than 100% of RAM in use: if all the memory-mapped files were required at once, the machine would be out of memory.

Data Encryption Scenario: An encrypted file cannot be memory-mapped and is instead decrypted and read into anon memory. Since the file that is decrypted in memory is not file-backed it cannot be paged out. Therefore, even though encrypted files do not require more memory than unencrypted files, they become memory-resident and require physical memory to be allocated when they are read.

If the hosts are encountering these warnings, memory use should be monitored closely.

Remedial action to support memory requirements might include:

  • Adding more physical memory to each of the hosts;
  • Adding additional hosts to the cluster to spread the data across;
  • Adding additional forests to any under-utilized hosts.

Other action might include:

  • Archiving/dropping any older forest data that is no longer used;
  • Reviewing the group level cache settings to ensure they are not set too high, as they make up the cache part of the total. For reference, default (and recommended) group level cache settings based on common RAM configurations may be found in our Group Level Cache Settings based on RAM Knowledge base article.

Summary

This enhancement to MarkLogic Server allows for easy periodic monitoring of memory consumption over time, and records it in a summary fashion in the same place as other data pertaining to the operation of a running node in a cluster. Since all these figures have at their source raw Meters data, more in-depth investigation should start with the Meters history. However, having this information available at a glance can aid in identifying whether memory-related resources need to be explored when investigating performance, scale, or other like issues during testing or operation.

Additional Reading

Knowledgebase: RAMblings - Opinions on Scaling Memory in MarkLogic Server 

Monitoring History

The Monitoring History feature allows you to capture and view critical performance data from your cluster. Monitoring History capture is enabled at the group level. Once the performance data has been collected, you can view the data in the Monitoring History page.

By default, the performance data is stored in the Meters database. A consolidated Meters database that captures performance metrics from multiple groups can be configured, if there is more than one group in the cluster.

Monitoring History Data Retention Policy

How long the performance data should be kept in the Meters database before it is deleted can be configured with the data retention policy. (http://docs.marklogic.com/guide/monitoring/history#id_80656)

If it is observed that meters data is not being cleared according to the retention policy, the first place to check would be the range indexes configured for the Meters database.

Range indexes and the Meters Database

The Meters database is configured with a set of range indexes which, if not configured correctly (or not present), can prevent the Meters database from being cleaned up according to the configured retention policy.

It is possible to have missing or misconfigured range indexes in either of the below scenarios

  •  if the cluster was upgraded from a version of ML before 7.0 and the upgrade had some issues
  •  if the indexes were manually created (when using another database for meters data instead of the default Meters database)

The size of the meters database can grow significantly as the cluster grows, so it is important that the meters database is cleared per the retention policy.

The required indexes (as of 8.0-5 and 7.0-6) are attached as an ML Configuration Manager package(http://docs.marklogic.com/guide/admin/config_manager#id_38038). Once these are added, the Meters database will reindex and the older data should be deleted.

Note that deletion of data older than the retention period occurs no sooner than the retention period; data older than the retention period may still be maintained for an unspecified amount of time.

Related documentation

http://docs.marklogic.com/guide/monitoring

https://help.marklogic.com/Knowledgebase/Article/View/259/0/metering-database-disk-space-requirements


Summary

When restarting very large forests, some customers have noted that it may take a while for them to mount. While the forests are mounting, the database is unable to come online, thus impacting the availability of your main site. This article shows you how to change a few database settings to improve forest-mounting time.

When encountering delays with forest mounting time after restarts, we usually recommend the following settings:

format-compatibility set to the latest format
expunge-locks set to none
index-detection set to none

Additionally, some customers might be able to spread out the work of memory mapping forest indexes by setting preload-mapped-data to false - though it should be noted that instead of the necessary time being taken during the mounting of the forest, memory-mapped file data will be loaded on demand through page faults as the server accesses it.

While the above settings should help with forest mounting time, in general, their effects can be situationally dependent. You can read more about each of these settings in our documentation here: http://docs.marklogic.com/admin-help/database. In particular:


1) Regarding format compatibility: "The automatic detection occurs during database startup and after any database configuration changes, and can take some time and system resources for very large forests and for very large clusters. The default value of automatic is recommended for most installations." While automatic is recommended in most cases, you should try changing this setting if you're seeing long forest mount times.

2) Regarding expunge-locks: "Setting this to none is only recommended to speed cluster startup time for extremely large clusters. The default setting of automatic, which cleans up the locks as they expire, is recommended for most installations."

3) Regarding index-detection: "This detection occurs during database startup and after any database configuration changes, and can take some time and system resources for very large forests and for very large clusters. Setting this to none also causes queries to use the current database index settings, even if some settings have not completed reindexing. The default value of automatic is recommended for most installations."

It may also be worth considering why forests are taking a long time to mount. If your data size has grown significantly over the lifetime of the affected database, it might be the case that your forests are now overly large, in which case a better approach might be to instead distribute the data across more forests.

Summary

I/O schedulers determine the order in which block I/O operations are passed to the storage subsystem.

Linux Kernel 4.x

The Linux 4.x Kernel, used by Red Hat Enterprise Linux (RHEL 8), CentOS 8 and Amazon Linux 2, has 3 I/O schedulers that can be used with MarkLogic Server:

  • none - No reordering of I/O operations (Default for Amazon Linux 2)
  • mq-deadline - Reordering to minimize I/O (Default for RHEL 8)
  • kyber - Token based I/O scheduling

Linux Kernel 3.x

The Linux 3.x Kernel, used by RHEL 7 and CentOS 7, also has 3 I/O schedulers that can be used with MarkLogic Server:

  • none - No reordering of I/O operations
  • deadline - Reordering to minimize I/O
  • noop - Inserts I/O requests into a FIFO queue

Recommended IO Schedulers

Three I/O schedulers are recommended for use with MarkLogic Server:

  • deadline / mq-deadline
    • configured by setting elevator=deadline as a kernel boot parameter
  • noop
    • configured by setting elevator=noop as a kernel boot parameter
  • none
    • configured by setting elevator=none as a kernel boot parameter

Note: [none] is recommended for SSDs, NVMEs [1] and guest OS virtual machines [2]

Choosing a Scheduler

If your MarkLogic host has intelligent I/O controllers (hardware RAID) or only uses SSDs/NVMEs, choose none or noop. If you're unsure, choose deadline or mq-deadline.

The deadline Scheduler

The deadline scheduler attempts to minimize I/O latency by enforcing start service times for each incoming request. As I/O requests come in, they are assigned an expiry time (the deadline for that request). At the point where the expiry time for that request is reached, the scheduler forces the service of that request at the location on the disk. While it is doing this, any other requests within easy reach (without requiring too much movement) are attempted. Where possible, the scheduler attempts completion of any I/O request before the expiry time is met.

The deadline scheduler can be used in situations where the host is not concerned with "fairness" for all processes residing on the system. The concern is rather where the system requires I/O requests are not stalled for long periods.

The deadline scheduler can be considered the best choice given a host where one process dominates disk I/O. Most database servers are a natural fit for this category.

The mq-deadline Scheduler

The mq-deadline scheduler is the adaptation of the deadline scheduler to support multi-threading.

The noop Scheduler

The noop scheduler performs no scheduling optimizations, but does support request merging.

All incoming I/O requests are pushed onto a FIFO queue and left to the block device to manage. Intelligent disk controllers will manage the priority from there. In any situation where a hardware controller (an HBA or similar controller attached to a SAN) can manage scheduling - or where disk seek times are not important (such as on SSDs) - any extra work performed by the scheduler at Linux kernel level is wasted.

The noop scheduler can be considered the best choice when MarkLogic Server is hosted on VMware.

The Kyber Scheduler

The Kyber scheduler is a recent scheduler inspired by active queue management techniques used for network routing. It uses a token-based system for managing requests: a queueing token is required to allocate a request (which prevents starvation of requests), and a dispatch token limits the operations of a certain priority on a given device. The scheduler also defines a target latency and tunes itself to reach that latency goal. The implementation of the algorithm is relatively simple and it is deemed efficient for fast devices.

Finding the Active Scheduler

The active scheduler is identified in the file /sys/block/[device-name]/queue/scheduler, and is the option surrounded by square brackets.
The example below shows that the 'noop' scheduler is currently configured for the block device sdb:

> cat /sys/block/sdb/queue/scheduler

[noop] anticipatory deadline cfq

References

[1] Unable to change IO scheduler on nvme device
[2] What is the suggested I/O scheduler to improve disk performance when using Red Hat Enterprise Linux with virtualization?

Introduction
 
The MarkLogic Java Client API's 'DatabaseClient' instance represents a database connection sharable across threads. The connection is stateless, except that authentication is done the first time a client interacts with the database via a Document Manager, Query Manager, or other manager. For instance, you may instantiate a DatabaseClient as follows:
 
// Create the database client

DatabaseClient client = DatabaseClientFactory.newClient(host, port,
                                          user, password, authType);

And release it as follows:
// release the client
client.release();

Details on DatabaseClient Usage

To use the Java Client API efficiently, it helps to know a little bit about what goes on behind the scenes.

You specify the e-node or load balancer host when you create a database client object.  Internally, the database client object instantiates an Apache HttpClient object to communicate with the host.

The internal Apache HttpClient object creates a connection pool for the host.  The connection pool makes it possible to reuse a single persistent HTTP connection for many requests, typically improving performance.

Setting up the connection pool has a cost, however.

As a result, we strongly recommend that applications create one database client for each unique combination of host, database, and user.  Applications should share the database client across threads.  In addition, applications should keep a reference to the database client for the entire life of the application interaction with that host.


For instance, a servlet might create the database client during initialization and release the database client during destruction. The same servlet may also use two separate database client instances with different permissions, one for read-only users and one with read/write permissions for editors. In the latter case, both client instances are used throughout the life of the servlet and destroyed during client destruction.

Summary

This article briefly looks at the performance implications of ad hoc queries versus passing external variables to a query in a module

Details

Programmatically, you can achieve similar results by dynamically generating ad hoc queries on the client as you can by defining your queries in modules and passing in external variable values as necessary.

Dynamically generating ad hoc queries on the client side results in each of your queries being compiled and linked with library modules before they can be evaluated - for every query you submit. In contrast, queries in modules only experience that performance overhead the first time they're invoked.

While it's possible to submit queries to MarkLogic Server in any number of ways, in terms of performance, it's far better to define your queries in modules, passing in external variable values as necessary.
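
As a minimal sketch (the module path and variable name are hypothetical), a pre-deployed module can be invoked with xdmp:invoke() and the runtime values passed as external variables, so the module is compiled once and reused:

xquery version "1.0-ml";
(: /queries/search.xqy is assumed to declare: declare variable $term external; :)
xdmp:invoke(
  "/queries/search.xqy",
  (xs:QName("term"), "marklogic"))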

Summary

MarkLogic does not enforce a programmatic upper limit on how many range indexes you *can* have. This leaves open the question of how many range indexes should be used in your application. The answer is that you should have as many as the application requires, with the caveat that there are some infrastructure limits that should be taken into account. For instance:

1. More Memory Mapped file Handles (file fd)

The OS limits how many file handles a given process can have open at any point in time. This limit therefore affects how many range index files, and therefore how many range indexes, a given MarkLogic process can have. However, higher file handle limits can be configured on most platforms (ulimit, vm.max_map_count).

2. More RAM requirement 

The in-memory footprint of a node includes in-memory structures such as the in-memory list cache, in-memory tree cache, in-memory range indexes, the in-memory reverse index (if reverse queries are enabled), and the in-memory triple index (if triple positions are enabled); multiply those by the total number of forests, plus a buffer.

A large number of range indexes can result in a huge index expansion in memory use. The values mentioned above are also in addition to the memory that MarkLogic Server requires to maintain its HTTP servers, perform merges, reindex, rebalance, and process queries.

Tip: Memory consumption can be reduced by configuring a database to optimize range indexes for minimum memory usage (memory-size); the default is configured for maximum performance (facet-time).

UI : Admin UI > Databases > {database-name} > Configure > range index optimize [facet-time or memory-size]

API : admin:database-set-range-index-optimize 
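
A minimal sketch using the API named above, for a hypothetical database named "Documents":

xquery version "1.0-ml";
import module namespace admin = "http://marklogic.com/xdmp/admin"
  at "/MarkLogic/admin.xqy";

(: Optimize range indexes for minimum memory usage rather than facet speed. :)
let $config := admin:get-configuration()
let $config := admin:database-set-range-index-optimize(
                 $config, xdmp:database("Documents"), "memory-size")
return admin:save-configuration($config)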

3. Longer Merge Times (Bigger stands due to Large index expansion)

A large number of range indexes ends up expanding the data in forests. For a given host size and number of hosts, larger stand sizes in a forest will make range index queries faster; however, they will also make merge times slower. If we want both queries and merges to be fast with a large number of range indexes, we will need to scale out the number of physical hosts.

4. More CPU, Disk & IO requirement 

Merges are I/O-intensive processes; this, combined with frequent updates/loads, could result in CPU as well as I/O bottlenecks.

5. Longer Forest Mount times

In general, each configured range index with data takes two memory-mapped files per stand.

A typical busy host has on the order of 10 forests, each forest with on the order of 10 stands; So a typical busy host has on the order of 100 stands.

Now for 100 stands -

  • With 100 range indexes, we have in the order of 10,000 files to open and map when the server starts up.
  • While for 1,000 range indexes, we have in the order of 100,000 files to open and map when the server starts up.
  • While for 10,000 range indexes, we have in the order of 1,000,000 mapped files to open and map when the server starts up.

As we increase our range indexes, at some point the server will take an unreasonably long time to start up (unless we throw equivalent processing power at it).

The amount of time one is willing to wait for the server to start up is not a hard limit, but the question should be "what is 'reasonable' behavior for server start-up in the eyes of the server administrator, based on the current hardware."

Conclusion

Range indexes on the order of a thousand start affecting performance if not managed properly and if the above considerations are not accounted for. In most scenarios the solution to the problem is not about "how many indexes can we configure", but rather "how many indexes do we need".

MarkLogic considers configured range indexes on the order of 100 a "reasonable" limit, because it results in "reasonable" behavior of the server.

Tips for Best Performance for Solutions with lots of Range Indexes

Before launching your application, review the number of range indexes and work to 1) remove ones that are not being used, and 2) consolidate any range indexes that are mutually redundant. This will help you stay under the recommended limit of roughly 100 range indexes.

On systems that already have a large number of range indexes (say 100+), merging multiple stands may become a performance issue, so you will need to think about easing the query and merge load. Here are some strategies for easing the load on your system:

  1. Increase merge-max-size from 32768 to 49152 on your database. This will create larger stands and will lower the number of merges that need to be performed (a sketch of this change using the Admin API appears after this list).
  2. There is a configuration setting "preload mapped data" (default false); leaving it as false will speed up the merging of forest stands. Bear in mind that this comes at the cost of slower query performance immediately after forest mounts.
  3. If your system begins to slow down due to merging activity, you can spread the load by adding more hosts & forests to your cluster. The smaller forests and stands will merge and load faster when there are more CPU cores and IO bandwidth to service them.
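
For the first strategy, here is a minimal sketch using the Admin API, assuming a hypothetical database named "Documents" (the value is in MB):

xquery version "1.0-ml";
import module namespace admin = "http://marklogic.com/xdmp/admin"
  at "/MarkLogic/admin.xqy";

(: Raise merge-max-size from the 32768 MB default to 49152 MB. :)
let $config := admin:get-configuration()
let $config := admin:database-set-merge-max-size(
                 $config, xdmp:database("Documents"), 49152)
return admin:save-configuration($config)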

Further Reading

Performance implications of updating Module and Schema databases

This article briefly looks at the performance implications of adding or modifying modules or schemas to live (production) databases.

Details

When XQuery modules or schemas are referenced for the first time after upload, they are parsed and then cached in memory so that subsequent access is faster.

When a module is added or updated, the modules cache is invalidated and every module (for all Modules databases within the cluster) will need to be parsed again before it can be evaluated by MarkLogic Server.

Special consideration should be made when updating modules or schemas in a production environment as reparsing can impact the performance of MarkLogic server for the duration that the cache is being rebuilt.

MarkLogic was designed with the assumption that modules and schemas are rarely updated. As such, the recommendation is that updates to modules or schemas in production environments are made during periods of low activity or out of hours.

Further reading

Overview

Performance issues in MarkLogic Server typically involve either 1) unnecessary waiting on locks or 2) overlarge workloads. The goal of this knowledgebase article is to give a high level overview of both of these classes of performance issue, as well as some guidelines in terms of what they look like - and what you should do about them.

Waiting on Locks

We often see customer applications waiting on unnecessary read or write locks. 

What does waiting on read or write locks look like? You can see read or write lock activity in our Monitoring History dashboard at port 8002 in the Lock Rate, Lock Wait Load, Lock Hold Load, and Deadlock Wait Load displays. This scenario will typically present with low resource utilization, but spikes in the read/write lock displays and high request latency.

What should you do when faced with unnecessary read or write locks? Remediation of this scenario pretty much always goes through optimization of either request code, data model, or both. Additional hardware resources will not help in this case because there is no hardware resource bound present. You can learn more about data model optimizations through MarkLogic University's On-Demand courses, in particular XML and JSON Data Modeling Best Practices and Impact of Normalization: Lessons Learned

Relevant Knowledgebase articles:

  1. Understanding XDMP Deadlock
  2. How Do Updates Work in MarkLogic Server?
  3. Fast vs Strict Locking
  4. Read Only Queries Run at a Timestamp & Update Transactions use Locks
  5. Performance Theory: Tales From MarkLogic Support

Overlarge Workloads

Overlarge workloads typically take two forms: a. too many concurrent workloads or b. work intensive individual requests

Too Many Concurrent Workloads

With regard to too many concurrent workloads - we often see clusters exhibit poor performance when subjected to many more workloads than the cluster can reasonably handle. In this scenario, any individual workload could be fine - but when the total amount of work over many, many concurrently running workloads is large, the end result is often the oversubscription of the underlying resources.

What does too many concurrent workloads look like? You can see this scenario in our Monitoring History at port 8002, in the Disk I/O, CPU, Memory Footprint, App Server Request Rate, App Server Latency, or Task Server Queue Size displays. This scenario will typically present with spikes in both App Server Latency and App Server Request Rate, and correlated maximum level plateaus in one or more of the aforementioned hardware resource utilization charts.

What should you do when faced with too many concurrent workloads? Remediation of this scenario pretty much always involves the addition of more rate-limiting hardware resource(s). This assumes, of course, that request code and/or data model are both already fully optimized. If either could be further optimized, then it might be possible to enable a higher request count given the same amount of resources - see the "Work Intensive Individual Requests" section, below. Rarely, in circumstances where traffic spikes are unpredictable - but likely - we’ve seen customers incorporate load shedding or traffic management techniques in their application architectures. For example, when request times pass a certain threshold, traffic is then routed through a less resource hungry code path.

Note that concurrent workloads entail both request workload and maintenance activities such as merging or reindexing. If your cluster is not able to serve both requests and maintenance activities, then the remediation tactics are the same as listed above: you either need to a. add more rate-limiting hardware resource(s) to serve both, or b. incorporate load shedding or traffic management techniques, like restricting maintenance activities to periods where the necessary resources are indeed available.

Relevant Knowledgebase articles:

  1. When submitting lots of parallel queries, some subset of those queries take much longer - why?
  2. How reindexing works, and its impact on performance
  3. MarkLogic Server I/O Requirements Guide
  4. Sizing E-nodes
  5. Performance Theory: Tales From MarkLogic Support

Work Intensive Individual Requests

With regard to work intensive individual requests - we often see clusters exhibit poor performance when individual requests attempt to do too much work. Too much work can entail an unoptimized query, but it can also be seen when an otherwise optimized query attempts to work over a dataset that has grown past its original hardware specification.

What do work intensive requests look like? You can see this scenario in our Monitoring History at port 8002, in the Disk I/O, CPU, Memory Footprint, App Server Request Rate, App Server Latency, or Task Server Queue Size displays. This scenario will typically present with spikes in one or more system resources (Disk I/O, CPU, Memory Footprint) and in App Server Latency. In contrast to the "Too Many Concurrent Workloads" scenario, App Server Request Rate should not exhibit a spike.

What should you do when faced with work intensive requests? As in the case with too many concurrent requests, it's sometimes possible for customers to address this situation with additional hardware resources. However, remediation in this scenario more typically involves finding additional efficiencies via code or data model optimizations. Code optimizations can be made with the use of xdmp:plan() and xdmp:query-trace(). You can learn more about data model optimizations through MarkLogic University's On-Demand courses, in particular XML and JSON Data Modeling Best Practices and Impact of Normalization: Lessons Learned. If the increase in work is rooted in data growth, it's also possible to reduce the amount of data. Customers pursuing this route typically perform periodic data purges or use features like Tiered Storage.
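
As a minimal illustration of those two diagnostics (the query and search terms below are purely illustrative), xdmp:plan returns an XML description of how a search would be resolved from the indexes, and xdmp:query-trace enables per-request index-resolution tracing, with the output written to the server's ErrorLog:

xquery version "1.0-ml";
(: show how the indexes would resolve this (illustrative) search :)
xdmp:plan(cts:search(fn:doc(), cts:word-query("marklogic"))),
(: enable query tracing for the remainder of this request;
   trace output is written to the ErrorLog :)
xdmp:query-trace(fn:true()),
cts:search(fn:doc(), cts:word-query("marklogic"))[1 to 10]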

Relevant Knowledgebase articles:

  1. Gathering information to troubleshoot long-running queries
  2. Fast searches: resolving from the indexes vs. filtering
  3. What do I do about XDMP-LISTCACHEFULL errors?
  4. Resolving XDMP-EXPNTREECACHEFULL errors
  5. When should I look into query or data model tuning?
  6. Performance Theory: Tales From MarkLogic Support

Additional Resources

  1. Monitoring MarkLogic Guide
  2. Query Performance and Tuning Guide
  3. Performance: Understanding System Resources

 

Summary

This article lists some common system and MarkLogic Server settings that can affect the performance of a MarkLogic cluster.

Details

From MarkLogic System Requirements:

I/O Schedulers

The deadline I/O scheduler is required on Red Hat Linux platforms. The deadline scheduler is optimized to ensure efficient disk I/O for multi-threaded processes, and MarkLogic Server can have many simultaneous threads. For information on the deadline scheduler, see the Red Hat documentation.

Note that on VMware-hosted servers, the noop scheduler is recommended.

You can read more about I/O schedulers in the following MarkLogic knowledgebase article:

     Notes on IO schedulers.

Huge Pages

At system startup on Linux machines, MarkLogic Server logs a message to the ErrorLog.txt file showing the Huge Page size, and the message indicates if the size is below the recommended level.

If you are using Red Hat 6, you must turn off Transparent Huge Pages (Transparent Huge Pages are configured automatically by the operating system).

You can also read more about huge pages, transparent huge pages, and group cache settings at the following MarkLogic knowledgebase articles:

     Linux Huge Pages and Transparent Huge Pages

     Group Caches and Linux Huge Pages

MarkLogic Server Configurations 

The following items relate to default MarkLogic Server configurations and their relationship to indexes – either index population during ingest or index reads at query time – especially in the context of avoiding lock contention when requests execute in parallel.

There’s a collection of settings that are enabled by default in the server, whose values we often recommend changing from their defaults when users run into performance issues. Those are:

  1. If not needed, directory creation should be set to manual
  2. If not needed, maintain last modified should be set to false
  3. If not needed, maintain directory last modified should be set to false
  4. If not needed, inherit permissions should be set to false
  5. If not needed, inherit collections should be set to false
  6. If not needed, inherit quality should be set to false
  7. If you’re likely to use URI or collection lexicon functions, both URI lexicon and collection lexicon should be set to true
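
For example, a few of these settings can be changed programmatically with the Admin API. The sketch below assumes a database named "Documents" and shows only three of the settings; verify the admin:database-set-* setter names against the Admin API documentation for your server version:

xquery version "1.0-ml";
import module namespace admin = "http://marklogic.com/xdmp/admin" at "/MarkLogic/admin.xqy";

let $config := admin:get-configuration()
let $dbid   := xdmp:database("Documents")   (: hypothetical database name :)
let $config := admin:database-set-directory-creation($config, $dbid, "manual")
let $config := admin:database-set-maintain-last-modified($config, $dbid, fn:false())
let $config := admin:database-set-collection-lexicon($config, $dbid, fn:true())
return admin:save-configuration($config)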

You can read more about these settings and how they relate to overall multi-thread/multi-request system performance in the following knowledgebase articles:

  - When submitting lots of parallel queries, some subset of those queries take much longer - Why?

  - Read only queries run at a timestamp - Update transactions use locks

  - Indexing best practices: https://help.marklogic.com/Knowledgebase/Article/View/113/0/indexing-best-practices

  - What is a directory in MarkLogic: https://help.marklogic.com/Knowledgebase/Article/View/73/0/what-is-a-directory-in-marklogic

  - Understanding XDMP Deadlock: https://help.marklogic.com/Knowledgebase/Article/View/17/0/understanding-xdmp-deadlock

ATTENTION

This knowledgebase article dates from 2014 - which is a long time ago in terms of available hardware and MarkLogic development. While some of the fundamental principles in the article below still apply, you'll find more recent specific guidance in the "Performance Testing with MarkLogic" whitepaper.


Performance Theory: Tales From MarkLogic Support

This article is a snapshot of the talk that Jason Hunter and Franklin Salonga gave at MarkLogic World 2014, titled "Performance Theory: Tales From The MarkLogic Support Desk." Jason Hunter is Chief Architect and Frank Salonga is Lead Engineer at MarkLogic.

MarkLogic is extremely well-designed, and from the ground up it’s built for speed, yet many of our support cases have to do with performance. Often that’s because people are following historical conventions that no longer apply. Today, there are big-memory systems using a 64-bit address space with lots of CPU cores, holding disks that are insanely fast (but that haven’t grown in speed as much as they have in size*), hooked together by high-speed bandwidth. MarkLogic lives natively in this new reality, and that changes the guidelines you want to follow for finding optimal performance in your database.

The Top 10 (Actually 16) Tips

The following is a list of the top 16 tips for realizing optimal performance when using MarkLogic, all based on some of the common problems encountered by our customers:

1. Buy Enough Iron
MarkLogic is optimized for server-grade systems, those just to the left of the hockey-stick price jump. Today (April 2014) that means 16 cores, 128-256 Gigs of RAM, 8-20 TB of disk, 2 disk controllers.

2. Aim for 100KB docs +/- 2 Orders of Magnitude
MarkLogic’s internal algorithms are optimized for documents around 100 KB (remember, in MarkLogic, each document should be one unit of query and should be thought of more like a relational row than a table). You can go down to 1 KB, but below that the memory/disk/lock overhead per document starts to be troublesome. And, you can go up to 10 MB, but above that line the time to read a document off disk starts to be noticeable.

3. Avoid Fragmentation
Just avoid it, but if you must, then understand the tradeoffs.  See also Search and Fragmentation.

4. Think of MarkLogic Like an Only Child
It’s not a bug to use 100 percent of the CPU—that’s a feature. MarkLogic assumes you want maximum performance given available resources. If you’re using shared resources (a SAN, a virtual machine) you may want to impose restrictions that limit what MarkLogic can use.

5. Six Forests, Six Replicas
Every use case is different, but in general deployments of MarkLogic 7 are proving optimal with 6 forests on each computer and (if doing High Availability) 6 replicas.

6. Earlier Indexing is Better Indexing
Adding an index after loading requires touching every document with data relating to that index. Turning off an index is instant, but no space will be reclaimed until the re-index occurs. A little thought into index settings before loading will save you time.

7. Filtering: Your Friend or Foe
Indexes isolate candidate documents, then filtering verifies the hits. Filtering lets you get accurate results even without accurate indexes (e.g., a case sensitive query without the case sensitive index). So, watch out, as filtering can hide bad index settings! If you really trust the indexes, you can use “unfiltered.” It is best to perfect your index settings in a small test environment, then apply them to production.

8. Use Meaningful Markup If You Can
If you can use meaningful markup (where the tags describe the content they hold) you get both prettier XML and XML that’s easier to write indexes against.

9. Don’t Try to Outsmart Merging
Contact support if you plan to change any of the advanced merge settings (max size, min size, min ratio, timeout periods). You shouldn’t usually tweak these. If you’re thinking about merge settings, you’re probably underprovisioned (See Recommendation #1).

10. Big Reads Go In Queries, Not Updates
Hurrah! Using MVCC for transaction processing means lock-free reads. But, to be a “read” your module can’t include any update calls. This is determined by static analysis in advance, so even if the update call isn’t made, it still changes your behavior. Locks are cheap but they’re not free, and any big search to find the top 10 results will lock the full result set during the sort. Whenever possible, do update calls in a separate nested transaction context using xdmp:invoke() with an option specifying “different-transaction”.
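
A minimal sketch of that pattern follows; the module path and external variable name are hypothetical, and the invoked module is assumed to declare the variable as external and perform the update. The surrounding module then contains only reads and stays lock-free:

xquery version "1.0-ml";
(: run the update in its own transaction so this module remains a lock-free read :)
xdmp:invoke(
  "/audit/record-view.xqy",                 (: hypothetical update module :)
  (xs:QName("uri"), "/docs/example.xml"),   (: external variable passed to that module :)
  <options xmlns="xdmp:eval">
    <isolation>different-transaction</isolation>
  </options>),
(: ...the rest of this module performs only read operations... :)
cts:search(fn:doc(), cts:word-query("example"))[1 to 10]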

11. Taste Test
Load a bit of data early, so you can get an idea about rates, sizes, and loads. Different index settings will affect performance and sizes. Test at a few sizes because some things scale linearly, some logarithmically.

12. Measure
Measure before. Measure after. Measure at all levels. When you know what’s normal, you can isolate when something goes different. MarkLogic 7 can internally capture “Monitoring History” to a Meters database. There are also tools such as Cacti, Ganglia, Nagios, Graphite, and others.

13. Keep a Staging Box
A staging box (or cluster) means you can measure changes in isolation (new application code, new indexes, new data models, MarkLogic upgrades, etc.). If you’re running on a cluster, then stage on a cluster (because you’ll see the effects of distribution, like net traffic and 2-phase commits). With AWS it’s easier than ever to “spin up” a cluster to test something.

14. Adjust as Needed
You need to be measuring so you know what is normal and then know what you should adjust. So, what can you adjust?

  • Code: Adjusting your code often provides the biggest bang
  • Memory sizes: The defaults assume a combo E-node/D-node server
  • Indexes: Best in advance, maybe during tasting. Or, try on staging
  • Cluster size and forest distribution: This is much easier in MarkLogic 7

15. Follow Our Advice on Swap Space
Our release notes tell you:

  • Windows: 2x the physical memory
  • Linux: 1x the physical memory (minus any huge pages), or 32GB, whichever is lower
  • Solaris: 1x-2x the physical memory

MarkLogic doesn’t intend to leverage swap space! But, for an OS to give memory to MarkLogic, it wants the swap space to exist. Remember, disk is 100x cheaper than RAM, and this helps us use the RAM.

16. Don’t Forget New Features
MarkLogic has plenty of features that help with performance, including MLCP, tiered storage, and semantics. With the MLCP fast-load option, you can perform forest assignments on the client, and directly insert to that forest. It's really a sharp tool, but don't use it if you're changing forest topology or assignment policies. With tiered storage, you can use HDFS as cheap mass storage of data that doesn't need high performance. Remember, you can "partition" data (i.e. based on dates) and let it age to slower disks. With semantics, you have a whole new way to model your data, which in many cases can produce easier-to-optimize queries.

That’s it! With these pro tips, you should be able to handle the most common performance issues. 

*With regard to storage, as you add capacity, it is critical that you add throughput in order to maintain a fast system (http://tylermuth.wordpress.com/2011/11/02/a-little-hard-drive-history-and-the-big-data-problem/)

Summary

There are index settings that may be problematic if your documents contain encoded binary data (such as Base64 encoded binary).  This article identifies a couple of these index settings and explains the potential pitfall.

Details

When word lexicons or string range indexes are enabled, each stand in the database's forests will contain a file called the 'atom data' file.  The contents of this file include all of the relevant unique tokens - potentially all the unique tokens in the forest (stand).  If your documents contain encoded binary data, all of the encoded binary may be replicated as atom data and stored in the atom data file.

Pitfall: There is an undocumented limit on the size of the atom data file of 4GB.  If this limit is exceeded for the content of a forest, then stand merges will begin to fail with the error

    "XDMP-FORESTERR: Error in merge of forest forest-nameSVC-MAPBIG: Mapped file too large to map: NNN bytes: '\path\Forests\forest-name\stand-id\AtomData'"

Workarounds

There are a few options that you can pursue to get around this problem:

1. Do not include encoded binary data in your documents.  An alternative is to store the binary content separately using MarkLogic Server's support for binary documents, and to include a reference to the binary document in the original (see the sketch after this list).

2. If word lexicons are required, and the encoded binary data is limited to a finite number of elements in your documents, then you can create word query exclusions for those elements. In the MarkLogic Server Admin UI, word query element exclusions can be configured by navigating to -> Configure -> Databases -> {database-name} -> Word Query -> Exclude tab. 

3. If a string range index is defined on an element that contains encoded binary, then you can either remove the string range index or change the document data model so that the element containing the encoded binary is not shared with an element that requires a string range index. 
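
As a sketch of the first option (the URIs, element names, and binary payload below are all hypothetical), the binary content is stored as its own binary document and the original document simply carries a reference to it:

xquery version "1.0-ml";
(: store the binary payload separately and reference it by URI :)
let $payload := binary { xs:hexBinary("DEADBEEF") }   (: illustrative binary content :)
return (
  xdmp:document-insert("/binaries/invoice-123.bin", $payload),
  xdmp:document-insert("/invoices/invoice-123.xml",
    <invoice>
      <id>123</id>
      <scan-ref>/binaries/invoice-123.bin</scan-ref>
    </invoice>)
)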

 

 

This not-too-technical article covers a number of questions about MarkLogic Server and its use of memory:

  • How MarkLogic uses memory;
  • Why you might need more memory;
  • When you might need more memory;
  • How you can add more memory.

Let’s say you have an existing MarkLogic environment that’s running acceptably well.  You have made sure that it does not abuse the infrastructure on which it’s running.  It meets your SLA (maybe expressed as something like “99% of searches return within 2 seconds, with availability of 99.99%”).   Several things about your applications have helped achieve this success:

As such, your application’s performance is largely determined by the number of disk accesses required to satisfy any given query.  Most of the processing involved is related to our major data structures:

Fulfilling a query can involve tens, hundreds or even thousands of accesses to these data structures, which reside on disk in files within stand directories.   (The triple world especially tends to exhibit the greatest variability and computational burden.)

Of course, MarkLogic is designed so that the great majority of these accesses do not need to access the on-disk structures.  Instead, the server caches termlists, range indexes, triples, etc. which are kept in RAM in the following places:

  • termlists are cached in the List Cache, which is allocated at startup time (according to values found in config files) and managed by MarkLogic Server.  When a termlist is needed, the cache is first consulted to see whether the termlist in question is present.   If so, no disk access is required.  Otherwise, the termlist is read from disk involving files in the stand such as ListIndex and ListData.
  • range indexes are held in memory-mapped areas of RAM and managed by the operating system’s virtual memory management system.  MarkLogic allocates the space for the in-memory version of the range index, causes the file to be loaded in (either on-demand or via pre-load option), and thereafter treats it as an in-memory array structure.  Any re-reading of previously paged-out data is performed transparently by the OS.  Needless to say, this last activity slows down operation of the server and should be kept to a minimum. 

One key notion to keep in mind is that the in-memory operations (the “hit” cases above) operate at speeds of about a microsecond or so of computation.  The go-to-disk penalty (the “miss” cases) cost at least one disk access which takes a handful of milliseconds plus even more computation than a hit case.  This represents a difference on the order of 10,000 times slower. 

Nonetheless, you are running acceptably.  Your business is succeeding and growing.  However, there are a number of forces stealthily working against your enterprise continuing in this happy state. 

  • Your database is getting larger (more and perhaps larger documents).
  • More users are accessing your applications.
  • Your applications are gaining new or expanded capabilities.
  • Your software is being updated on a regular basis.
  • You are thinking about new operational procedures (e.g. encryption).

In the best of all worlds, you have been measuring your system diligently and can sense when your response time is starting to degrade.  In the worst of all worlds, you perform some kind of operational / application / server / operating system upgrade and performance falls off a cliff.

Let’s look under the hood and see how pressure is building on your infrastructure.  Specifically, let’s look at consumption of memory and effectiveness of the key caching structures in the server.

Recall that the response time of a MarkLogic application is driven predominantly by how many disk operations are needed to complete a query.  This, in turn, is driven by how many termlist and range index requests are initiated by the application through MarkLogic Server and how many of those do not “hit” in the List Cache and in-memory Range Indexes.  Each one of those “misses” generates disk activity, as well as a significant amount of additional computation.

All the forces listed above contribute to decreasing cache efficiency, in large part because they all use more RAM.  A fixed size cache can hold only a fraction of the on-disk structure that it attempts to optimize.  If the on-disk size keeps growing (a good thing, right?) then the existing cache will be less effective at satisfying requests.  If more users are accessing the system, they will ask in total for a wider range of data.  As applications are enriched, new on-disk structures will be needed (additional range indexes, additional index types, etc.)  And when did any software upgrade use LESS memory?

There’s a caching concept from the early days of modern computing (the Sixties, before many of you were born) called “folding ratio”.  You take the total size of a data structure and divide it by the size of the “cache” that sits in front of it.  This yields a dimensionless number that serves as a rough indicator of cache efficiency (and lets you track changes to it).   A way to compute this for your environment is to take the total on-disk size of your database and divide it by the total amount of RAM in your cluster.  Let’s say each of your nodes has 128GB of RAM and 10 disks of about 1TB each that are about half full.  So, the folding ratio of each node of this configuration at this moment (the shared-nothing approach of MarkLogic allows us to consider each node individually) is (10 x 1TB x 50%) / 128GB, or about 40 to 1.
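
Expressed as a quick back-of-the-envelope calculation (pure arithmetic, using the numbers from the example above):

xquery version "1.0-ml";
let $on-disk-tb := 10 * 1 * 0.5    (: 10 disks x 1TB each, about half full = 5TB :)
let $ram-tb     := 128 div 1024    (: 128GB of RAM, expressed in TB :)
return $on-disk-tb div $ram-tb     (: folding ratio for this host: 40 :)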

This number by itself is neither good nor bad.  It’s just a way to track changes in load.  As the ratio gets larger, cache hit ratio will decrease (or, more to the point, the cache miss ratio will increase) and response time will grow.   Remember, the difference between a hit ratio of 98% versus a hit ratio of 92% (both seem pretty good, you say) is a factor of four in resulting disk accesses!  That’s because one is a 2% miss ratio and the other is an 8% miss ratio.

Consider the guidelines that MarkLogic provides regarding provisioning: 2 VCPUs and 8GB RAM to support a primary forest that is being updated and queried.  The maximum recommended size of a single forest is about 400 GB, so the folding ratio of such a forest is 400GB / 8GB or about 50 to 1.  This suggests that the configuration outlined a couple of paragraphs back is at about 80% of capacity.  It would be time to think about growing RAM before too long.  What will happen if you delay?

Since MarkLogic is a shared-nothing architecture, the caches on any given node will behave independently from those on the other nodes.  Each node will therefore exhibit its own measure of cache efficiency.  Since a distributed system operates at the speed of its slowest component, it is likely that the node with the most misses will govern the response time of the cluster as a whole.

At some point, response time degradation will become noticeable and it will become time to remedy the situation.  The miss ratios on your List Cache and your page-in rate for your Range Indexes will grow to the point at which your SLA might no longer be met. 

Many installations get surprised by the rapidity of this degradation.  But recall, the various forces mentioned above are all happening in parallel, and their effect is compounding.  The load on your caches will grow more than linearly over time.  So be vigilant and measure, measure, and measure!

In the best of all possible worlds, you have a test system that mirrors your production environment that exhibits this behavior in advance of production.  One approach is to experiment with reducing the memory on the test system by, say, configuring VMs for a given memory size (and adjusting huge pages and cache sizes proportionately) to see where things degrade unacceptably.  You could measure:

  • Response time: where does it degrade by 2x, say?
  • List cache miss ratio: at what point does it double, say?
  • Page-in rate: at what point does it increase by 2x, say?

When you find the memory size at which things degraded unacceptably, use that to project the largest folding ratio that your workload can tolerate.  Or you can be a bit clever and do the same additional calculations for ListCache and Anonymous memory:

  • Compute the sum of the sizes of all ListIndex + ListData files in all stands and divide by size of ListCache.  This gives the folding ratio for this host of the termlist world.
  • Similarly, compute the sum of the sizes of all RangeIndex files and divide by the size of anonymous memory.  This gives the folding ratio for the range index world on this host.  This is where encryption can bite you.  At least for a period of time, both the encrypted and the un-encrypted versions of a range index must be present in memory.  This effectively doubles your folding ratio and can send you over the edge in a hurry.  [Note: depending on your application, there may be additional in-memory derivatives of range indexes built to optimize for facets, sorting of results, … all taking up additional RAM.]

[To be fair, on occasion a resource other than RAM can become oversubscribed (beyond the scope of this discussion):

  • IOPs and I/O bandwidth (both at the host and storage level);
  • Disk capacity (too full leads to slowness on some storage devices, or to inability to merge);
  • Wide-area network bandwidth / latency / consistency (causes DR to push back and stall primary);
  • CPU saturation (this is rare for traditional search-style applications, but showing up more in the world of SQL, SPARQL and Optic, often accompanied by memory pressure due to very large Join Tables.  Check your query plans!);
  • Intra-cluster network bandwidth (both at host and switch/backbone layer, also rare)].

Alternatively, you may know you need to add RAM because you have an emergency on your hands: you observe that MarkLogic is issuing Low Memory warnings, you have evidence of heavy swap usage, your performance is often abysmal, and/or the operating system’s OOM (out of memory) killer is often taking down your MarkLogic instance.  It is important to pay attention to the warnings that MarkLogic issues, above and beyond any that come from the OS. 

You need to tune your queries so as to avoid bad practices (see the discussion in the beginning of this article) that waste memory and other resources, and almost certainly add RAM to your installation.   The tuning exercise can be labor-intensive and time-consuming; it is often best to throw lots of RAM at the problem to get past the emergency at hand.

So, how to add more RAM to your cluster?  There are three distinct techniques:

  • Scale vertically:  Just add more RAM to the hosts you already have.
  • Scale horizontally:  Add more nodes to your cluster and re-distribute the data
  • Scale functionally:  Convert your existing e/d-nodes into d-nodes and add new e-nodes

Each of these options has its pros and cons.   Various considerations:

  • Granularity:   Say you want to increase RAM by 20%.  Is there an option to do just this?
  • Scope:  Do you upgrade all nodes?  Upgrade some nodes?   Add additional nodes?
  • Cost:  Will there be unanticipated costs beyond just adding RAM (or nodes)?
  • Operational impact:  What downtime is needed?  Will you need to re-balance?
  • Timeliness: How can you get back to acceptable operation as quickly as possible?

Option 1: Scale Vertically

On the surface, this is the simplest way to go.  Adding more RAM to each node requires upgrading all nodes.  If you already have separate e- and d-nodes, then it is likely that just the d-nodes should get the increased RAM.

In an on-prem (or, more properly, non-cloud) environment this is a bunch of procurement and IT work.  In the worst case, your RAM is already maxed out so scaling vertically is not an option.

In a cloud deployment, the cloud provider dictates what options you have.  Adding RAM may drag along additional CPUs to all nodes also, which requires added MarkLogic licenses as well as larger payment to the cloud provider.  The increased RAM tends to come in big chunks (only 1.5x or 2x options).  It’s generally not easy to get just the 20% more RAM (say) that you want.  But this may be premature cost optimization; it may be best just to add heaps of RAM, stabilize the situation, and then scale RAM back as feasible.  Once you are past the emergency, you should begin to implement longer-term strategies.

This approach also does not add network bandwidth or, in most cases, storage bandwidth and capacity, and runs the small risk of simply moving the bottleneck away from RAM and onto something else.

Option 2: Scale Horizontally

This approach adds whole nodes to the existing complex.  It has the net effect of adding RAM, CPU, bandwidth and capacity.   It requires added licenses, and payment to the cloud provider (or a capital procurement if on-prem).  The granularity of expansion can be controlled; if you have an existing cluster of (2n+1) nodes, the smallest increment that makes sense in an HA context is 2 more nodes (to preserve quorum determination) giving (2n+3) nodes.  In order to make use of the RAM in the new nodes, rebalancing will be required.  When the rebalancing is complete, the new RAM will be utilized.

This option tends to be optimal in terms of granularity, especially in already larger clusters.  To add 20% of aggregate RAM to a 25-node cluster, you would add 6 nodes to make a 31-node cluster (maintaining the odd number of nodes for HA).  You would be adding 24%, which is better than having to add 50% if you had to scale all 25 nodes by 50% because that was what your cloud provider offered.

Option 3: Scale Functionally

Scaling functionally means adding new nodes to the cluster as e-nodes and reconfiguring the existing e/d-nodes to be d-nodes.  This frees up RAM on the d-side (specifically by dramatically reducing the need for Expanded Tree Cache and memory for query evaluation), which goes toward restoring a good folding ratio.  Recent experience says about 15% of RAM can be recovered in this manner.

More licenses are again required, plus installation and admin work to reconfigure the cluster.  You need to make sure that the network can handle the increase in XDQP traffic from e-nodes to d-nodes, but this is not typically a problem.  The resulting cluster tends to run more predictably.  One of our largest production clusters typically runs its d-nodes at nearly 95% memory usage, as reported by MarkLogic as the first number in an error log line.  It can get away with running so full because it is a classical search application whose d-node RAM usage does not fluctuate much.  Memory usage on e-nodes is a different story, especially when the application uses SQL or Optic.  In such a situation, on-demand allocation of large Join Tables can cause an abrupt increase in memory usage. That’s why our advice for combined e/d-nodes is to run below 80% to allow headroom for query processing.

Thereafter, the two groups of nodes can be scaled independently depending on how the workload evolves.

Here are a few key takeaways from this discussion:

  • Measure performance when it is acceptable, not just when it is poor.
  • Do whatever it takes to stabilize in an emergency situation.
  • Correlate metrics with acceptable / marginal performance to determine a usable folding ratio.
  • If you have to make a guess, try to achieve no worse than a 50:1 ratio and go from there.
  • Measure and project the growth rate of your database.
  • Figure out how much RAM needs to be added to accommodate projected growth.
  • Test this hypothesis if you can on your performance cluster.

Overview

Update transactions run with readers/writers locks, obtaining locks as needed for documents accessed in the transaction. Because update transactions only obtain locks as needed, update statements always see the latest version of a document. The view is still consistent for any given document from the time the document is locked. Once a document is locked, any update statements in other transactions wait for the lock to be released before updating the document.

Read only query transactions run at a particular system timestamp, instead of acquiring locks, and have a read-consistent view of the database. That is, the query transaction runs at a point in time where all documents are in a consistent state.

The system timestamp is a number maintained by MarkLogic Server that increases every time a change or a set of changes occurs in any of the databases in a system (including configuration changes from any host in a cluster). Each fragment stored in a database has system timestamps associated with it to determine the range of timestamps during which the fragment is valid.

On a clustered system where there are multiple hosts, the timestamps need to be coordinated across all hosts. MarkLogic Server does this by passing the timestamp in every message communicated between hosts of the cluster, including the heartbeat message. Typically, the message carries two important pieces of information:

  • The origin host id
  • The precise time on the host at the time that heartbeat took place

In addition to the heartbeat information, the "Label" file for each forest in the database is written as changes are made. The Label file also contains timestamp information; this is what each host uses to ascertain the current "view" of the data at a given moment in time. This technique is what allows queries to be executed at a 'point in time' to give insight into the data within a forest at that moment.
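
For illustration, a read-only query can be evaluated explicitly at a chosen timestamp through the eval options; the sketch below simply reuses the current request timestamp, but any valid timestamp still retained by the forests can be supplied:

xquery version "1.0-ml";
(: capture a timestamp, then evaluate a read-only query "as of" that point in time :)
let $ts := xdmp:request-timestamp()
return
  xdmp:eval(
    "fn:count(fn:doc())",
    (),
    <options xmlns="xdmp:eval">
      <timestamp>{$ts}</timestamp>
    </options>)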

You can learn more about transactions in MarkLogic Server by reading the Understanding Transactions in MarkLogic Server section of the MarkLogic Server Application Developers Guide.

The distribute timestamps option on an Application Server specifies how the latest timestamp is distributed after updates. This affects the performance of updates and the timeliness of read-after-write query results from other hosts in the group.

When set to fast, updates return as quickly as possible. No special timestamp notification messages are broadcast to other hosts. Instead, timestamps are distributed to other hosts when any other message is sent. The maximum amount of time that could pass before other hosts see the update timestamp is one second, because a heartbeat message is sent to other hosts every second.

When set to strict, updates immediately broadcast timestamp notification messages to every other host in the group. Updates do not return until their timestamp has been distributed. This ensures timeliness of read-after-write query results from other hosts in the group.

When set to cluster, updates immediately broadcast timestamp notification messages to every other host in the cluster. Updates do not return until their timestamp has been distributed. This ensures timeliness of read-after-write query results from any host in the cluster, so requests made to any app server on any host in the cluster will see immediately consistent results.

The default value for the "distribute timestamps" option is fast. The remainder of this article applies when fast mode is used.

Read after Write in Fast Mode

We will look at the different scenarios for the case where a read occurs in a transaction immediately following an update transaction.

  • If the read transaction is executed against an application server on the same node of the cluster (or any node that participated in the update) then the read will execute at a timestamp equal to or greater than the time that the update occurred.
  • If the read is executed in the context of an update transaction, then, by acquiring locks, the view of the documents will be the latest version of the documents.
  • If the read is executed in a query transaction, then the query will execute at the latest timestamp that the host on which it was executed is aware of. Although this will always produce a transactionally consistent view of the database, it may not return the latest updates. The remainder of this article addresses this case.

Consider an XCC (Java) test program that exercises read-after-write against two different hosts in a cluster.

The example performs the following steps:

  • Instantiates two XCC ContentSource Objects - each connecting to a different host in the cluster.
  • Establishes a short loop (which runs the enclosed steps 10 times)
    • Creates a unique UUID which is used as a URI for the Document
    • Establishes a session with the first host in the cluster and performs the following:
      • Gets the timestamp (session.getCurrentServerPointInTime()) and writes it out to the console / stdout
      • Inserts a simple, single-element document (<ok/>) as a document-node into a given database
      • Gets the timestamp again and writes it out to the console / stdout
    • The session with the first host is then closed. A new session is established with the second host and the following steps are performed:
      • Gets the timestamp at the start of the session and writes it out to the console / stdout
      • An attempt is made to retrieve the document which was just inserted
    • On success the second session will be closed.
    • If the document could not be read successfully, an immediate retry attempt follows thereafter - which will result in a successful retrieval.

Running this test will yield one of two results for each iteration of the loop:

Query Transaction at Timestamp that includes Update

Most of the time, you will find that the timestamps are in lockstep between the two hosts - note that there is no difference between the timestamp reported by getCurrentServerPointInTime() after the document has been inserted and the timestamp reported by the second host's connection just before the attempt is made to retrieve the document.

----------------- START OF INSERT / READ CYCLE (1) -----------------
First host timestamp before document is inserted: 	13673327800295300
First host timestamp after document is inserted: 	13673328229180040
Second host timestamp before document is read: 	13673328229180040
------------------ END OF INSERT / READ CYCLE (1) ------------------

However, you may also see this:

----------------- START OF INSERT / READ CYCLE (10) -----------------
First host timestamp before document is inserted: 	13673328311216780
First host timestamp after document is inserted: 	13673328322546380
Second host timestamp before document is read: 	13673328311216780
------------------ END OF INSERT / READ CYCLE (10) ------------------

Note that on this run, the timestamps are out of sync; at the point where getCurrentServerPointInTime() is called, the timestamp for the second connection is still the one from just before the document was inserted.

Yet this also returns results that include the updates; in the interval between the timestamp being written to the console and the construction and submission of the newAdhocQuery(), the document has become available and was successfully retrieved during the read process.

The path with an immediate retry

Now let's explore what happens when the read only query transaction runs at a point in time that does not include the updates:

----------------- START OF INSERT / READ CYCLE (2) -----------------
First host timestamp before document is inserted: 	13673328229180040
First host timestamp after document is inserted: 	13673328240679460
Second host timestamp before document is read: 		13673328229180040
WARNING: Immediate read failed; performing an immediate retry
Second host timestamp for read retry: 		13673328240679460
Result Sequence below:
<?xml version="1.0" encoding="UTF-8"?>
<ok/>
------------------ END OF INSERT / READ CYCLE (2) ------------------

Note that on this occasion, we see an outcome that starts much like the previous example; the timestamps mismatch and we see that we've hit the point in the code where our validation of the response fails.

Also note that the timestamp at the point where the retry takes place is now back in step; from this, we can see that the document should be available even before the retry request is executed. Under these conditions, the response (the result) is also written to stdout so we can be sure the document was available on this attempt.

Multi Version Concurrency Control

In order to guarantee that the "holistic" view of the data is current and available in a read-only query transaction across each host in the cluster, two things need to take place:

  • All forests need to be up-to-date and all pending transactions need to be committed.
  • Each host must be in complete agreement as to the 'last known good' (safest) timestamp from which the query can be allowed to take place.

In all situations, to ensure a complete (and reliable) view of the data, the read-only query transaction must take place at the lowest known timestamp across the cluster.

With every message between nodes in the cluster, the latest timestamp information is communicated across each host in the cluster - the first "failed" attempt to read the document necessitates communication between each host in the cluster - and by doing so, this action propagates a new "agreed" timestamp across every node in the cluster.

It is because of this that the retry will always work; at the point where the immediate read after write fails, timestamp changes are propagated, and the new timestamp provides a waypoint at which the retry query can take place. This is why the single retry is always guaranteed to succeed.

Summary

When used as a file system, GFS needs to be tuned for optimal performance with MarkLogic Server.

Recommendations

Specifically, we recommend tuning the demote_secs and statfs_fast parameters. The demote_secs parameter determines the amount of time GFS will wait before demoting a lock on a file that is not in use. (GFS uses a time-based locking system.) One of the ways that MarkLogic Server makes queries go fast is its use of memory mapped index files. When index files are stored on a GFS filesystem, locks on these memory-mapped files are demoted purely on the basis of demote_secs, regardless of use. This is because they are not accessed using a method that keeps the lock active -- the server interacts with the memory map, not direct access to the on-disk file.

When a GFS lock is demoted, pages from the memory-mapped index files are removed from cache. When the server makes another request of the memory-mapped file, GFS must acquire another lock and the requested page(s) from the on-disk file must be read back into cache. The lock reacquisition process, as well as the I/O needed to load data from disk into cache, may cause noticeable performance degradation.

Starting with MarkLogic Server 4.0-4, MarkLogic introduced an optimization for GFS. From that maintenance release forward, MarkLogic gets the status of its memory-mapped files every hour, which results in the retention of the GFS locks on those files so that they do not get demoted. Therefore, it is important that demote_secs is equal to or greater than one hour. It is also recommended that the tuning parameter statfs_fast is set to "1" (true), which makes statfs on GFS faster.

Using gfs_tool, you should be able to set the demote_secs and statfs_fast parameters to the following values:

demote_secs 3600

statfs_fast 1

While we're discussing tuning a Linux filesystem, the following Linux tuning tips are also worth noting:

  • Use the deadline elevator (aka I/O scheduler), rather than cfq, on all hosts in the cluster. This has been added to our installation requirements for RHEL. With RHEL-4, this requires the elevator=deadline option at boot time. With RHEL-5, this can be changed at any time via /sys/block/*/queue/scheduler
  • If you are running on a VM slice, then the noop I/O scheduler is recommended.
  • Set the following kernel tuning parameters:

Edit /etc/sysctl.conf:

vm.swappiness = 0

vm.dirty_background_ratio=1

vm.dirty_ratio=40

Use sudo sysctl -p to apply these changes.

  • It is very important to have at least one journal per host that will mount the filesystem. If the number of hosts exceeds the number of journals, performance will suffer. It is, unfortunately, impossible to add more journals without rebuilding the entire filesystem, so be sure to set journals up for each host during your initial build.

 

Working with RedHat

Should you run into GFS-related problems, running the following script will provide all the information that you need in order to work with the Red Hat Support Team:


mkdir /tmp/debugfs

mount -t debugfs none /tmp/debugfs

mkdir /tmp/$(hostname)-hangdata

cp -rf /tmp/debugfs/dlm/ /tmp/$(hostname)-hangdata

cp -rf /tmp/debugfs/gfs2/ /tmp/$(hostname)-hangdata

echo 1 > /proc/sys/kernel/sysrq 

echo 't' > /proc/sysrq-trigger 

sleep 60

cp /var/log/messages /tmp/$(hostname)-hangdata/

clustat > /tmp/$(hostname)-hangdata/clustat.out

cman_tool services > /tmp/$(hostname)-hangdata/cman_tool-services.out

mount -l > /tmp/$(hostname)-hangdata/mount-l.out

ps aux > /tmp/$(hostname)-hangdata/ps-aux.out

tar cjvf /tmp/$(hostname)-hangdata.tar.bz /tmp/$(hostname)-hangdata/

umount /tmp/debugfs/

rm -rf /tmp/debugfs

rm -rf /tmp/$(hostname)-hangdata

Introduction

MarkLogic is supported on the XFS filesystem. The minimum system requirements can be found here:

https://developer.marklogic.com/products/marklogic-server/requirements-9.0

The default mount options will generally give good performance, assuming the underlying hardware is capable enough in terms of IO performance and durability of writes, but if you can test your system adequately, you can consider different mount options.

The values provided here are just general recommendations; if you wish to fine-tune your storage performance, you need to ensure that you do adequate testing, both with MarkLogic and with low-level tools such as fio:

http://freecode.com/projects/fio

1. I/O Schedulers

Unless you have a directly connected single HDD or SSD, noop is usually the best choice, see here for more details:

https://help.marklogic.com/Knowledgebase/Article/View/8/0/notes-on-io-schedulers

2. XFS Mount options

relatime - The default atime behaviour is relatime, which has almost no overhead compared to noatime but still maintains sane atime values. All Linux filesystems use this as the default now (since around 2.6.30), but XFS has used relatime-like behaviour since 2006, so no one should really need to use noatime on XFS for performance reasons.

attr2 - This option enables an "opportunistic" improvement in the way inline extended attributes are stored on disk. It is the default and should be kept as such in most scenarios.

inode64 - To sum up, this allows XFS to create inodes anywhere and not worry about backwards compatibility, which should result in better scalability. See here for more information: https://access.redhat.com/solutions/67091

sunit=x,swidth=y - XFS allows you to specify RAID settings. This enables the file system to optimize its read and write access for RAID alignment, e.g. by committing data as complete stripe sets for maximum throughput. These RAID optimizations can significantly improve performance, but only if your partition is properly aligned, or if you avoid misalignment by creating the XFS filesystem on a device without partitions.

largeio, swalloc - These options are intended to further optimize streaming performance on RAID storage. You need to do your own testing.

isize=512 - XFS allows inlining of data into inodes to avoid the need for additional blocks and the corresponding expensive extra disk seeks for directories. In order to use this efficiently, the inode size should be increased to 512 bytes or larger.

allocsize=131072k (or larger) - XFS can be tuned to a fixed allocation size for optimal streaming write throughput. This setting could have a significant impact on interim space usage on systems with many parallel write and create operations.

As with any advice of this nature, we strongly advise that you always do your own testing to ensure that options you choose are stable and reliable for your workload.

What is DLS?

The Document Library Service (DLS) enables you to create and maintain versions of managed documents in MarkLogic Server. Access to managed documents is controlled using a check-out/check-in model. You must first check out a managed document before you can perform any update operations on the document. A checked out document can only be updated by the user who checked it out; another user cannot update the document until it is checked back in and then checked out by the other user. 

Searching across latest version of managed documents

To track document changes, you can store versions of a document by defining a retention policy in DLS.  However, it is often the latest version of the document that most people are interested in. MarkLogic provides a function, dls:documents-query, which helps you access the latest versions of the managed documents in the database. There are situations where there is a performance overhead in using this function: when the database has millions of managed documents, you may see some performance overhead in accessing all the latest versions. This is an intrinsic issue caused by the large number of files and the join across properties.

How can one improve the search performance?

A simple workaround is to add your latest versions in a collection (say "latest"). Instead of the API dls:documents-query, you can then use a collection query on this "latest" collection. Below are two approaches that you can use - while the first approach can be used for new changes (inserts/updates), the second approach should be used to modify the existing managed documents in the database.

1.) To add new inserts/updates to "latest" collection

Below are two files, manage.xqy, and update.xqy that can be used for new inserts/updates.

In manage.xqy, we do an insert and manage, and manipulate the collections such that the numbered document has the "historic" collection and the latest document has the "latest" collection. You have to use xdmp:document-add-collections() and xdmp:document-remove-collections() when doing the insert and manage because it's not really managed until after the transaction is done.

In update.xqy, we do the checkout-update-checkin with the "historic" collection (so that we don't inherit the "latest" collection from the latest document), and then add "latest" and remove "historic" from the latest document. 

(: manage.xqy :)
xquery version "1.0-ml";
import module namespace dls = "http://marklogic.com/xdmp/dls" at "/MarkLogic/dls.xqy";
dls:document-insert-and-manage(
  "/stuff.xml",
  fn:false(),
  <test>one</test>,
  "created",
  (xdmp:permission("dls-user", "read"),
   xdmp:permission("dls-user", "update")),
  "historic"),
xdmp:document-add-collections(
  "/stuff.xml",
  "latest"),
xdmp:document-remove-collections(
  "/stuff.xml",  "historic")

(: update.xqy :)
xquery version "1.0-ml";
import module namespace dls = "http://marklogic.com/xdmp/dls" at "/MarkLogic/dls.xqy";
dls:document-checkout-update-checkin(
  "/stuff.xml",
  <test>three</test>,
  "three",
  fn:true(),
  (),
  ("historic")),
dls:document-add-collections(
  "/stuff.xml",
  "latest"),
dls:document-remove-collections(
  "/stuff.xml",
  "historic")

2.) To add the already existing managed documents to the "latest" collection

To add the latest version of documents already existing in your database to the "latest" collection you can do the following in batches.

xquery version "1.0-ml";
import module namespace dls = "http://marklogic.com/xdmp/dls" at "/MarkLogic/dls.xqy";
declare variable $start external ;
declare variable $end   external ;
for $uri in cts:search(fn:collection(), dls:documents-query())[$start to $end]/document-uri(.) 
return xdmp:document-add-collections($uri, ("latest"))
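
To work through a large database in manageable chunks, the module above can be queued repeatedly with different ranges - for example via xdmp:spawn on the Task Server. The module path and chunk sizes below are hypothetical; adjust the ranges to the number of managed documents you have:

xquery version "1.0-ml";
(: queue the batch module (path is hypothetical) in chunks of 1,000 documents :)
for $i in 0 to 99
let $start := $i * 1000 + 1
let $end   := $start + 999
return
  xdmp:spawn(
    "/admin/add-latest-collection.xqy",
    (xs:QName("start"), $start,
     xs:QName("end"),   $end))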

This way you can segregate historical and latest version of the managed documents and then, instead of using dls:documents-query, you can use the "latest" collection to search across the latest version of managed documents.
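
For example, a search that previously joined through dls:documents-query can then be expressed as a simple collection-constrained search (the search terms are illustrative):

xquery version "1.0-ml";
(: before: latest versions resolved via dls:documents-query (a properties join)
   cts:search(fn:collection(),
              cts:and-query((dls:documents-query(), cts:word-query("invoice"))))
   after: latest versions resolved purely from the "latest" collection :)
cts:search(fn:collection("latest"), cts:word-query("invoice"))[1 to 10]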

Note: Although this workaround may work when you want search across the latest version of managed documents, it does not solve all the cases. dls:documents-query is used internally in many dls.xqy calls so not all functionality will be improved.

SUMMARY

Some MarkLogic Server sites are installed in a 1Gb network environment. At some point, your cluster growth may require an upgrade to 10Gb ethernet. Here are some hints for knowing when to migrate up to 10Gb ethernet, as well as some ways to work around the limit prior to making the move to 10Gb.

General Approach

A good way to check if you need more network bandwidth is to monitor the network packet retransmission rate on each host.  To do this, use the "sar -n EDEV 5" shell command. [For best results, make sure you have an updated version of sar]

Sample results:

# sar -n EDEV 5 3
...
10:41:44 AM     IFACE   rxerr/s   txerr/s    coll/s  rxdrop/s  txdrop/s  txcarr/s  rxfram/s  rxfifo/s  txfifo/s
10:41:49 AM        lo      0.00      0.00      0.00      0.00      0.00      0.00      0.00      0.00      0.00
10:41:49 AM      eth0      0.00      0.00      0.00      0.00      0.00      0.00      0.00      0.00      0.00

10:41:49 AM     IFACE   rxerr/s   txerr/s    coll/s  rxdrop/s  txdrop/s  txcarr/s  rxfram/s  rxfifo/s  txfifo/s
10:41:54 AM        lo      0.00      0.00      0.00      0.00      0.00      0.00      0.00      0.00      0.00
10:41:54 AM      eth0      0.00      0.00      0.00      0.00      0.00      0.00      0.00      0.00      0.00

10:41:54 AM     IFACE   rxerr/s   txerr/s    coll/s  rxdrop/s  txdrop/s  txcarr/s  rxfram/s  rxfifo/s  txfifo/s
10:41:59 AM        lo      0.00      0.00      0.00      0.00      0.00      0.00      0.00      0.00      0.00
10:41:59 AM      eth0      0.00      0.00      0.00      0.00      0.00      0.00      0.00      0.00      0.00

Average:        IFACE   rxerr/s   txerr/s    coll/s  rxdrop/s  txdrop/s  txcarr/s  rxfram/s  rxfifo/s  txfifo/s
Average:           lo      0.00      0.00      0.00      0.00      0.00      0.00      0.00      0.00      0.00
Average:         eth0      0.00      0.00      0.00      0.00      0.00      0.00      0.00      0.00      0.00


Explanation of terms:

FIELD      DESCRIPTION
IFACE LAN interface
rxerr/s Bad packets received per second
txerr/s Bad packets transmitted per second
coll/s Collisions per second
rxdrop/s Received packets dropped per second because buffers were full
txdrop/s Transmitted packets dropped per second because buffers were full
txcarr/s Carrier errors per second while transmitting packets
rxfram/s Frame alignment errors on received packets per second
rxfifo/s FIFO overrun errors per second on received packets
txfifo/s FIFO overrun errors per second on transmitted packets

If the value of txerr/s or txcarr/s is non-zero, that means that packets sent by this host are being dropped over the network, and that this host needs to retransmit.  By default, a host will wait 200ms for an acknowledgment packet before taking this retransmission step. This delay is significant for MarkLogic Server and will factor into overall cluster performance.  You may use this as an indicator that it's time to upgrade (or debug) your network.

Other Considerations

10 gigabit ethernet requires special cables.  These cables are expensive and easy to break.  If a cable is bent even slightly out of specification, you will not get 10 gigabit performance out of it, so be sure to work with your IT department to ensure that everything is installed per the manufacturer's specification. Once installed, double-check that you are actually getting 10Gb throughput from the installed network.

Another option is to use bonded ethernet to increase network bandwidth from 1Gb to 2Gb or 4Gb prior to jumping to 10Gb.  A description of bonded ethernet lies beyond the scope of this article, but your IT department should be familiar with it and be able to help you set it up.

 

Introduction

The performance and resource consumption of E-nodes is determined by the kind of queries executed, in addition to the distribution and amount of data. For example, if there are 4 forests in the cluster and a query asks for only the top-10 results, then the E-node receives a total of 4 x 10 results and must determine the top 10 among those 40. If there are 8 forests, then the E-node has to sort through 8 x 10 results.

Performance Test for Sizing E-Nodes:

To size E-nodes, it’s best to determine first how much workload a single E-node can handle, and then scale up accordingly.

Set up your performance test so it is at scale and so that it only talks to a single E-node. Start the Application Server settings with something like

  • threads = 32
  • backlog = 512
  • keep alive = 0
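
For instance, those starting values can be applied with the Admin API. The group and app server names below are hypothetical, and the admin:appserver-set-* setter names should be verified against the Admin API documentation for your server version:

xquery version "1.0-ml";
import module namespace admin = "http://marklogic.com/xdmp/admin" at "/MarkLogic/admin.xqy";

let $config    := admin:get-configuration()
let $group     := admin:group-get-id($config, "Default")
let $appserver := admin:appserver-get-id($config, $group, "perf-test-app")   (: hypothetical app server :)
let $config    := admin:appserver-set-threads($config, $appserver, 32)
let $config    := admin:appserver-set-backlog($config, $appserver, 512)
let $config    := admin:appserver-set-keep-alive-timeout($config, $appserver, 0)
return admin:save-configuration($config)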

Crank up the number of threads for the test from low to high, and observe the amount of resources being used on the E-node (cpu, memory, network). Measure both response time and throughput during these tests.

  • When the number of threads are low, you should be getting the best response time. This is what the end user would experience when the site is not busy.
  • When the number of threads are high, you will see longer response time, but you should be getting more throughput.

As you increase the number of threads, you will eventually run out of resources on the E-node - most likely memory. The goal is to identify the number of active threads at the point where the system's memory is exhausted, because that is the maximum number of threads your E-node can handle.

Additional Tuning of E-nodes

Thrashing

  • If you notice thrashing before MarkLogic is able to reach a memory consumption equilibrium, continue decreasing the thread count so that the RAM-per-thread ratio approaches 'pmap total memory' divided by the number of threads.
  • The backlog setting can be used to queue up requests without consuming significant resources.
  • Adjusting backlog along with some of the timeout settings might give a reasonable user experience comparable to, or even better than, what you may see with high thread counts. 

As you continue to decrease the thread count and make other adjustments, the mean time to failure will likely increase until the settings are such that equilibrium is reached before all the memory resources are consumed - at which time we do not expect to see any additional memory failures.

Swap, RAM & Cache for E-nodes

  • Make sure that the E-nodes have swap space equal to the size of RAM (if the node has less than 32GB of RAM) or 32 GB (if the node has 32GB or more of RAM)
  • For E-nodes, you can minimize the List Cache and Compressed Tree Cache - set them to 1GB each - in your group-level configuration (a configuration sketch follows this list).
  • Your Expanded Tree Cache (group level parameter) should be at least equal to 1/8 of RAM, but you can further increase the Expanded Tree Cache so that all three caches (List, Compressed, Expanded) in combination are up to 1/3 of RAM.
  • Another important group configuration parameter is Expanded Tree Cache Partitions. A good starting point is 2-3 GB per partition, but it should not be more than 12 GB per partition. The greater the number of partitions, the greater the capacity for handling concurrent query loads.
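
As a sketch only, the group-level cache settings above can also be scripted with the Admin API. Sizes are in MB, the group name "Default" is a placeholder, and the example values (1GB List Cache, 1GB Compressed Tree Cache, 32GB Expanded Tree Cache split into 12 partitions of roughly 2.7GB each) should be adjusted to your host's RAM:

  xquery version "1.0-ml";
  import module namespace admin = "http://marklogic.com/xdmp/admin"
      at "/MarkLogic/admin.xqy";

  let $config := admin:get-configuration()
  let $group  := admin:group-get-id($config, "Default")   (: placeholder group name :)
  (: example E-node values - all sizes are in MB; adjust to your RAM :)
  let $config := admin:group-set-list-cache-size($config, $group, 1024)
  let $config := admin:group-set-compressed-tree-cache-size($config, $group, 1024)
  let $config := admin:group-set-expanded-tree-cache-size($config, $group, 32768)
  let $config := admin:group-set-expanded-tree-cache-partitions($config, $group, 12)
  return admin:save-configuration($config)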

Growing your Cluster

As your application, data and usage changes over time, it is important to periodically revisit your cluster sizings and re-run your performance tests.

 

Summary

MarkLogic Server expects the system clocks to be synchronized across all the nodes in a cluster, as well as between Primary and Replica clusters. The acceptable level of clock skew (or drift) between hosts is less than 0.5 seconds, and values greater than 30 seconds will trigger XDMP-CLOCKSKEW errors, and could impact cluster availability.

Cluster Hosts should use NTP to maintain proper clock synchronization.

Inside MarkLogic Clock Time usage

MarkLogic hosts include a precise time of day in the XDQP heartbeat messages they send to each other. When a host processes an incoming XDQP heartbeat message, it compares the time of day in the message against its own clock. If the difference is large enough, the host will report a CLOCKSKEW in its ErrorLog.

Clock Skew

MarkLogic does not thoroughly test clusters in a clock skewed configuration, as it is not a valid configuration. As a result, we do not know all of the ways that a MarkLogic Server Cluster would fail. However, there are some areas where we have noticed issues:

  • Local disk failover may not perform properly as the inter-forest negotiations regarding which forest has the most up to date content may not produce the correct results.
  • Database replication can hang.
  • SSL certificate verification may fail on the time range.

If MarkLogic Server detects a clock skew, it will write a message to the error log such as one of the following:

  • Warning: Heartbeat: XDMP-CLOCKSKEW: Detected clock skew: host hostname.domain.com skewed by NN seconds
  • Warning: XDQPServerConnection::init: nnn.nnn.nnn.nnn XDMP-CLOCKSKEW: Detected clock skew: host host.domain.local skewed by NN seconds
  • Warning: Excessive clock skew detected; suggest using NTP (NN seconds skew with hostname)

If one of these lines appears in the error log, or you see repeated XDMP-CLOCKSKEW errors over an extended time period, the clock skew between the hosts in the cluster should be verified. However, do not be alarmed if this warning appears even if there is no clock skew. This message may appear on a system under load, or at the same time as a failed host comes back online. In these cases the errors will typically clear within a short amount of time, once the load on the system is reduced.

Time Sync Config

NTP is the recommended solution for maintaining system clock synchronization.

(1) NTP clients on Linux

The most common Linux NTP clients are ntpd and chrony. Either of these can be used to ensure your hosts stay synchronized to a central NTP time source. You can check the NTP settings, and manually update the date if needed.

The instructions in the link below go over the process of checking the ntpd service and updating the date manually using the ntpdate command.
The following Server Fault article goes over the process of forcing chrony to manually update and step the time using the chronyc command.

Running the applicable command on the affected servers should resolve the CLOCKSKEW errors for the short term.

If the ntpd or chrony service is not running, you can still use the ntpdate or chronyc command to update the system clock, but you will need to configure a time service to ensure accurate time is maintained and to avoid future CLOCKSKEW errors. For more information on setting up a time synchronization service, see the following KB article:

(2) NTP clients on Windows

Windows servers can be configured to retrieve time directly from an NTP server, or from a Primary Domain Controller (PDC) in the root of an Active Directory forest that is configured as an NTP server. The following link includes information on configuring NTP on a Windows server, as well as configuring a PDC as an NTP server.

https://support.microsoft.com/en-us/help/816042/how-to-configure-an-authoritative-time-server-in-windows-server

(3) VMWare time synchronization

If your systems are VMware virtual machines, you may need to take the additional step of disabling time synchronization of the virtual machine. By default, the VMware daemon will synchronize the Guest OS to the Host OS once per minute, which may interfere with ntpd settings. Through the vSphere Admin UI, you can disable time synchronization between the Guest OS and Host OS in the virtual machine settings.

Configuring Virtual Machine Options

This will prevent regular time synchronization, but synchronization will still occur during some VMware operations such as Guest OS boots/reboots and resuming a virtual machine, among others. To disable VMware clock sync completely, you need to edit the .vmx file for the virtual machine to set several synchronization properties to false. Details can be found in the following VMware Blog:

Completely Disable Time Synchronization for your VM

(4) AWS EC2 time synchronization

For AWS EC2 instances, if you are noticing CLOCKSKEW in your MarkLogic cluster, you may benefit from changing the clock source from the default xen to tsc.

Other sources for Clock Skew

(1) Overloaded Host leading to Clock Skew

If for some reason there is a long delay between when an XDQP heartbeat message is encoded on the sending host and when it is decoded on the receiving host, it will be interpreted as a CLOCKSKEW. Below are some of the conditions that can lead to CLOCKSKEW.

  • If a sending host is overloaded enough that heartbeat messages are taking a long time to be sent, it could be reported as a transient CLOCKSKEW by the receiver.
  • If a receiving host is overloaded enough that a long time elapses between sending time and processing time, it can be reported as a transient CLOCKSKEW.

If you see a CLOCKSKEW message in the ErrorLog combined with other messages (Hung messages, Slow Warnings), then the server is likely overloaded and thrashing. Messages reporting broken XDQP connections (Stopping XDQPServerConnection) are a good indication that a host was overloaded and hung for long enough that other hosts disconnected.

(2) XDQP Thread start fail leading to Clock Skew

When MarkLogic Server starts up, it tries to raise the number of processes per user (the ulimit setting) on the system to at least 16384. However, if MarkLogic is not started as root, it can only raise the soft limit (for the number of processes per user) up to the hard limit, which could cause XDQP thread startup to fail. You can check the current setting with the shell command ulimit -u; make sure the number of processes per user is at least 16384.

Further Reading

Summary

Sometimes, following a manual merge, a number of deleted fragments -- usually a small number -- are left behind after the merge completes. In a system that is undergoing steady updates, you will observe that the number of deleted fragments goes up and down, but never drops to zero.

 

Options

There are a couple of approaches to resolve this issue:

  1.  If you have access to Query Console, you should run xdmp:merge() with an explicit merge timestamp (e.g. the return value of xdmp:request-timestamp()). This will cause the server to discard all deleted fragments (see the sketch after this list).

  2.  If you do not have access to the Query Console, just wait an hour and do the merge again from the Admin GUI.
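
As a minimal sketch of option 1, the merge timestamp can be supplied through the xdmp:merge() options element and run from Query Console:

  xquery version "1.0-ml";
  xdmp:merge(
    <options xmlns="xdmp:merge">
      <merge-timestamp>{xdmp:request-timestamp()}</merge-timestamp>
    </options>)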

 

Explanation

The hour window was added to avoid XDMP-OLDSTAMP errors that had cropped up in some of our internal stress testing, most commonly for replica databases, but also causing transaction retries for non-replica databases.

We've done some tuning of the change since then (e.g. not holding on to the last hour of deleted fragments after a reindex), and we may do some further tuning so this is less surprising to people.

 

Note

The explanation above applies to new MarkLogic 7 installations. In the case of an upgrade from a release prior to MarkLogic 7, this solution might not work, as it requires a different approach to split single large stands into 32GB stands. Please read more in the following knowledge base article: Migrating to MarkLogic 7 and understanding the 1.5x disk rule (rather than 3x).

Introduction

Slow journal frame log entries will be logged at Warning level in your ErrorLog file and will mention something like this:

.....journal frame took 28158 ms to journal...

Examples

2016-11-17 18:38:28.476 Warning: forest Documents journal frame took 28152 ms to journal (sem=0 disk=28152 ja=0 dbrep=0 ld=0): {{fsn=121519836, chksum=0xd79a4bd0, words=33}, op=commit, time=1479425880, mfor=18383617934651757356, mtim=14445621353792290, mfsn=121519625, fmcl=16964678471847070106, fmf=18383617934651757356, fmt=14445621353792290, fmfsn=121519625, sk=10604213488372914348, pfo=116961308}

2016-11-17 18:38:28.482 Warning: forest Documents journal frame took 26308 ms to journal (sem=0 disk=26308 ja=0 dbrep=0 ld=0): {{fsn=113883463, chksum=0x10b1bd40, words=23}, op=fastQueryTimestamp, time=1479425882, mfor=959797732298875593, mtim=14701896887337160, mfsn=113882912, fmcl=16964678471847070106, fmf=959797732298875593, fmt=14701896887337160, fmfsn=113882912, sk=4596785426549375418, pfo=54687472}

2016-11-17 18:38:28.482 Warning: forest Documents journal frame took 28155 ms to journal (sem=0 disk=28155 ja=0 dbrep=0 ld=0):{{fsn=121740077, chksum=0xfd950360, words=31}, op=prepare, time=1479425880, mfor=10258363344370988969, mtim=14784083780681960, mfsn=121740077,fmcl=16964678471847070106, fmf=10258363344370988969, fmt=14784083780681960, fmfsn=121740077, sk=12062047643091825183, pfo=14672600}

Understanding the messages in further detail

These messages give you further hints on what is causing the delay. In most cases, you will probably want to involve the MarkLogic Support team in diagnosing the root cause of the problem, although the table below should help with further interpretation of these messages:

Item Description
sem time waiting on semaphore
disk time waiting on disk
ja time waiting if journal archive is lagged
dbrep time waiting if DR replication is lagged
ld time waiting to replicate the journal frame to a HA replica
fsn frame sequence number
chksum frame checksum
words length in words of the frame
op the type of frame
time UNIX time
mfor ID of master forest (if replica)
mtim when master became master
mfsn master forest fsn
fmcl foreign master cluster id
fmf foreign master forest id
fmt when foreign master became HA master
fmfsn foreign master fsn
sk sequence key (frame unique id)
pfo previous frame offset

Further reading / related articles

Summary

The XDMP-INMMTREEFULL, XDMP-INMMLISTFULL, XDMP-INMMINDXFULL, XDMP-INMREVIDXFULL, XDMP-INMMTRPLFULL & XDMP-INMMGEOREGIONIDXFULL messages are informational only.  These messages indicate that in-memory storage is full, resulting in the forest stands being written out to disk. There is no error as MarkLogic Server is working as expected.

Configuration Settings

If these messages consistently appear more frequently than once per minute, increasing the ‘in-memory’ settings in the affected database may be appropriate.

  • XDMP-INMMTREEFULL corresponds to the “in memory tree size” setting. "in memory tree size" specifies the amount of cache and buffer memory to be allocated for managing fragment data for an in-memory stand.
  • XDMP-INMMLISTFULL corresponds to the “in memory list size” setting. "in memory list size" specifies the amount of cache and buffer memory to be allocated for managing termlist data for an in-memory stand.
  • XDMP-INMMINDXFULL corresponds to the “in memory range index size” setting. "in memory range index size" specifies the amount of cache and buffer memory to be allocated for managing range index data for an in-memory stand.
  • XDMP-INMREVIDXFULL corresponds to the “in memory reverse index size” setting. "in memory reverse index size" specifies the amount of cache and buffer memory to be allocated for managing reverse index data for an in-memory stand. 
  • XDMP-INMMTRPLFULL corresponds to the “in memory triple index size” setting. "in memory triple index size" specifies the amount of cache and buffer memory to be allocated for managing triple index data for an in-memory stand. 
  • XDMP-INMMGEOREGIONIDXFULL corresponds to the “in memory geospatial region index size” setting. "in memory geospatial region index size" specifies the amount of cache and buffer memory to be allocated for managing geo region index data for an in-memory stand. 

Increasing the in-memory settings has implications for the ‘journal size’ setting. The default value of journal size should be sufficient for most systems; it is calculated at database configuration time based on the size of your system. If you change the other memory settings, however, the journal size should equal the sum of the in memory list size and the in memory tree size. Additionally, you should add space to the journal size if you use range indexes (particularly if you use a lot of range indexes or have extremely large range indexes), as range index data can take up journal space.
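
For illustration only, the related database settings can be adjusted with the Admin API. Sizes are in MB, the database name "Documents" is a placeholder, and the values shown are examples rather than recommendations; note how the journal size tracks the sum of the in-memory list and tree sizes:

  xquery version "1.0-ml";
  import module namespace admin = "http://marklogic.com/xdmp/admin"
      at "/MarkLogic/admin.xqy";

  let $config := admin:get-configuration()
  let $db     := xdmp:database("Documents")   (: placeholder database name :)
  let $list   := 512                          (: in memory list size, MB - example value :)
  let $tree   := 512                          (: in memory tree size, MB - example value :)
  let $config := admin:database-set-in-memory-list-size($config, $db, $list)
  let $config := admin:database-set-in-memory-tree-size($config, $db, $tree)
  (: journal size should equal the sum of the in memory list and tree sizes,
     plus headroom if you use many or very large range indexes :)
  let $config := admin:database-set-journal-size($config, $db, $list + $tree)
  return admin:save-configuration($config)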

Introduction

There are two ways of leveraging SSDs that can be used independently or simultaneously.

Fast Data Directory

In the forest configuration for each forest, you can configure a Fast Data Directory. The Fast Data Directory is designed for fast filesystems such as SSDs with built-in disk controllers. The Fast Data Directory stores the forest journals and as many stands as will fit onto the filesystem; if the forest never grows beyond the size of the Fast Data Directory, then the entire forest will be stored in that directory. If there are multiple forests on the same host that point to the same Fast Data Directory, MarkLogic Server divides the space equally between the different forests.

See Disk Storage.

Tiered Storage (licensed feature)

MarkLogic Server allows you to manage your data at different tiers of storage and computation environments, with the top-most tier providing the fastest access to your most-critical data and the lowest tier providing the slowest access to your least-critical data. As data ages and becomes less updated and queried, it can be migrated to less expensive and more densely packed storage devices to make room for newer, more frequently accessed and updated data.

See Tiered Storage.

 

Introduction

MarkLogic Server is a highly scalable, high performance Enterprise NoSQL database platform. Configuring a MarkLogic cluster to run as virtual machines follows tuning best practices associated with highly distributed, high performance database applications. Avoiding resource contention and oversubscription is critical for maintaining the performance of the cluster. The objective of this guide is to provide a set of recommendations for configuring virtual machines running MarkLogic for optimal performance. This guide is organized into sections for each computing resource, and provides a recommendation along with the rationale for that particular recommendation. The contents of this guide are intended for best practice recommendations and are not meant as a replacement for tuning resources based on specific usage profiles. Additionally, several of these recommendations trade off performance for flexibility in the virtualized environment.

General

Recommendation: Use the latest version of Virtual Hardware

The latest version of Virtual Hardware provides performance enhancements and higher configuration maximums than older Virtual Hardware versions. Be aware that you may have to update the host, cluster or data center. For example, ESXi 7.0 introduces virtual hardware version 17, but VMs imported or migrated from older versions may not be automatically upgraded.

Recommendation: Use paravirtualized device drivers in the guest operating system

Paravirtualized hardware provides advanced queuing and processing off-loading features to maximize Virtual Machine performance. Additionally, paravirtualized drivers provide batching of interrupts and requests to the physical hardware, which provides optimal performance for resource-intensive operations.

Recommendation: Keep VMware Tools up to date on guest operating systems

VMware Tools provides guest OS drivers for paravirtual devices that optimize the interaction with VMkernel and offload potentially processor-intensive tasks such as packet segmentation.

Recommendation: Disable VMWare Daemon Time Synchronization of the Virtual Machine

By default, the VMware daemon will synchronize the Guest OS to the Host OS (Hypervisor) once per minute, which may interfere with ntpd or chronyd settings. Through the vSphere Admin UI, you can disable time synchronization between the Guest OS and Host OS in the virtual machine settings.

VMWare Docs: Configuring Virtual Machine Options

Recommendation: Disable Time Synchronization during VMWare operations

Even when daemon time synchronization is disabled, time synchronization will still occur during some VMware operations such as Guest OS boots/reboots and resuming a virtual machine, among others. Disabling VMware clock sync completely requires editing the .vmx file for the virtual machine to set several synchronization properties to false. Details can be found in the following VMware Blog:

VMWare Blog: Completely Disable Time Synchronization for your VM

Recommendation: Use the noop scheduler for VMWare instances rather than deadline

The NOOP scheduler is a simple FIFO queue and uses the minimal amount of CPU/instructions per I/O to accomplish the basic merging and sorting functionality to complete the I/O.

Red Hat KB: IO Scheduler Recommendations for Virtualized Hosts

Recommendation: Remove any unused virtual hardware devices

Each virtual hardware device (Floppy disks, CD/DVD drives, COM/LPT ports) assigned to a VM requires interrupts on the physical CPU; reducing the number of unnecessary interrupts reduces the overhead associated with a VM.

Processor                                                                                                     

Socket and Core Allocation

Recommendation: Only allocate enough vCPUs for the expected server load, keeping in mind the general recommendation is to maintain two cores per forest.

Rationale: Context switching between physical CPUs for instruction execution on virtual CPUs creates overhead in the hypervisor.

Recommendation: Avoid oversubscription of physical CPU resources on hosts with MarkLogic Virtual Machines. Ensure proper accounting for hypervisor overhead, including interrupt operations, storage network operations, and monitoring overhead, when allocating vCPUs.

Rationale: Oversubscription of physical CPUs can cause contention for processor-intensive operations in MarkLogic. Proper accounting will ensure adequate CPU resources are available for both the hypervisor and any MarkLogic Virtual Machines.

 

Memory                                                                                                      

General

Recommendation: Set up memory reservations for MarkLogic Virtual Machines.

Rationale: Memory reservations guarantee the availability of Virtual Machine memory when leveraging advanced vSphere functions such as Dynamic Resource Scheduling. Creating a reservation reduces the likelihood that MarkLogic Virtual Machines will be vMotioned to an ESX host with insufficient memory resources.

Recommendation: Avoid combining MarkLogic Virtual Machines with other types of Virtual Machines.

Rationale: Overcommitting memory on a single ESX host can result in swapping, causing significant performance overhead. Additionally, memory optimization techniques in the hypervisor, such as Transparent Page Sharing, rely on Virtual Machines running the same operating systems and processes.

Swapping Optimizations

Recommendation: Configure VM swap space to leverage host cache when available.

Rationale: During swapping, leveraging the ESXi host's local SSD for swap will likely be substantially faster than using shared storage. This is unavailable when running a Fault Tolerant VM or using vMotion, but will provide a performance improvement for VMs in an HA cluster.

Huge / Large Pages

Recommendation: Configure Huge Pages in the guest operating system for Virtual Machines.

Rationale: Configuring Huge Pages in the guest operating system for a Virtual Machine prioritizes swapping of other memory first.

Recommendation: Disable Transparent Huge Pages in Linux kernels.

Rationale: The transparent Huge Page implementation in the Linux kernel includes functionality that provides compaction. Compaction operations are system level processes that are resource intensive, potentially causing resource starvation to the MarkLogic process. Using static Huge Pages is the preferred memory configuration for several high performance database platforms including MarkLogic Server.

 

Disk                                                                                                                        

General

Recommendation: Use Storage IO Control (SIOC) to prioritize MarkLogic VM disk IO.

Rationale: Several operations within MarkLogic require prioritized, consistent access to disk IO for consistent operation. Implementing a SIOC policy will help guarantee consistent performance when resources are contentious across multiple VMs accessing disk over shared links.

Recommendation: When possible, store VMDKs with MarkLogic forests on separate aggregates and LUNs.

Rationale: Storing data on separate aggregates and LUNs will reduce disk seek latency when IO intensive operations are taking place – for instance multiple hosts merging simultaneously.

Disk Provisioning

Recommendation: Use Thick Provisioning for MarkLogic data devices

Rationale: Thick provisioning prevents oversubscription of disk resources.  This will also prevent any issues where the storage appliance does not automatically reclaim free space, which can cause writes to a LUN to fail.

NetAPP Data ONTAP Discussion on Free Space with VMWare

SCSI Adapter Configuration

Recommendation: Allocate a SCSI adapter for guest operating system files and database storage independently. Additionally, add a storage adapter per tier of storage being used when configuring MarkLogic (i.e., an isolated adapter with a virtual disk for fast data directory).

Rationale: Leveraging two SCSI adapters provides additional queuing capacity for high IOPS virtual machines. Isolating IO also allows tuning of data sources to meet specific application demands.

Recommendation: Use paravirtualized SCSI controllers in Virtual Machines.

Rationale: Paravirtualized SCSI controllers reduce management overhead associated with operation queuing.

Virtual Disks versus Raw Device Mappings

Recommendation: Use Virtual Disks rather than Raw Device Mappings.

Rationale: VMFS provides optimized block alignment for virtual machines. Ensuring that MarkLogic VMs are placed on VMFS volumes with sufficient IO throughput and dedicated physical storage reduces management complexity while optimizing performance.

Multipathing

Recommendation: Use round robin multipathing for iSCSI, FCoE, and Fibre Channel LUNs.

Rationale: Multipathing allows the usage of multiple storage network links; using round robin ensures that all available paths will be used, reducing the possibility of storage network saturation.

vSphere Flash Read Cache

Recommendation: Enabling vSphere Flash Read Cache can enhance database performance. When possible, a Cache Size of 20% of the total database size should be configured.

Rationale: vSphere Flash Read Cache provides read caching for commonly accessed blocks. MarkLogic can take advantage of localized read cache for many operations including term list index resolution. Additionally, offloading read requests from the backing storage array reduces contention for write operations.

 

Network                                                                                                          

General

Recommendation: Use a dedicated physical NIC for MarkLogic cluster communications and a separate NIC for application communications. If multiple NICs are unavailable, use separate VLANs for cluster and application communication.

Rationale: Separating communications ensures optimal bandwidth is available for cluster communications while spreading the networking workload across multiple CPUs.

Recommendation: Use dedicated physical NICs for vMotion traffic on ESXi hosts running MarkLogic. If additional physical NICs are unavailable, move vMotion traffic to a separate VLAN.

Rationale: Separating vMotion traffic onto separate physical NICs, or at the very least a VLAN, reduces overall network congestion while providing optimal bandwidth for cluster communications. Additionally, NIOC policies can be configured to ensure resource shares are provided where necessary.

Recommendation: Use dedicated physical NICs for IP storage if possible. If additional physical NICs are unavailable, move IP storage traffic to a separate VLAN.

Rationale: Separating IP storage traffic onto separate physical NICs, or at the very least a VLAN, reduces overall network congestion while providing optimal bandwidth for cluster communications. Additionally, NIOC policies can be configured to ensure resource shares are provided where necessary.

Recommendation: Use Network I/O Control (NIOC) to prioritize MarkLogic inter-cluster communication
traffic.

Rationale: Since MarkLogic is a shared-nothing architecture, guaranteeing consistent network communication between nodes in the cluster provides consistent and optimal performance.

 

Network Adapter Configuration

Recommendation: Use enhanced vmxnet virtual network adapters in Virtual Machines.

Rationale: Enhanced vmxnet virtual network adapters can leverage both Jumbo Frames and TCP Segmentation Offloading to improve performance. Jumbo Frames allow for an increased MTU, reducing TCP header transmission overhead and CPU load. TCP Segmentation Offloading allows packets up to 64KB to be passed to the physical NIC for segmentation into smaller TCP packets, reducing CPU overhead and improving throughput.

Jumbo Frames

Recommendation: Use jumbo frames with MarkLogic Virtual Machines, ensuring that all components of the physical network support jumbo frames and are configured with an MTU of 9000.

Rationale: Jumbo frames increase the payload size of each TCP/IP frame sent across a network. As a result, the number of packets required to send a set of data is reduced, reducing overhead associated with the header of a TCP/IP frame. Jumbo frames are advantageous for optimizing the utilization of the physical network. However, if any components of the physical network do not support jumbo frames or are misconfigured, large frames are broken up and fragmented causing excessive overhead.

 


Analyzing Resource Contention                                                                               


Processor Contention

Virtual Machines with CPU utilization above 90% and CPU ready metrics of 20% or higher (really, any sustained CPU ready time) are contending for CPU.

The key metric for processor contention is %RDY.

Memory Contention

Metrics for memory contention require an understanding of VMware memory management techniques:

  • Transparent Page Sharing
    • Enabled by default in the hypervisor
    • Deduplicates memory pages between virtual machines on a single host
  • Balloon Driver
    • Leverages the guest operating system's internal swapping mechanisms
    • Implemented as a high-priority system process that balloons to consume memory, forcing the operating system to swap older pages
    • Indicated by the MEMCTL/VMMEMCTL metric
  • Memory Page Compression
    • Only enabled when memory becomes constrained on the host
    • Breaks large pages into smaller pages, then compresses the smaller pages
    • Can generate up to a 2:1 compression ratio
    • Increases processor load during reading and writing of pages
    • Indicated by the ZIP metric
  • Swapping
    • Hypervisor-level swapping
    • Swapping usually happens to the vswp file allocated for the VM
    • Storage can be with the VM, or in a custom area (local disk on the ESXi host, for instance)
    • Can use SSD Host Cache for faster swapping, but is still very bad for performance
    • Indicated by the SWAP metric

Free memory metrics less than 6% or memory utilization above 94% indicate VM memory contention.


Disk Contention

Disk contention exists if the value for kernelLatency exceeds 4ms or deviceLatency exceeds 15ms. Device latencies greater than 15ms indicate an issue with the storage array, potentially an oversubscription of LUNs being used by VMFS or RDMs on the VM, or a misconfiguration in the storage processor. Additionally, a queueLength counter greater than zero may indicate a less than optimal queue size set for an HBA, or queuing on the storage array.

Network Contention

Dropped packets, indicated by the droppedTx and droppedRx metrics, are usually a good sign of a network bottleneck or misconfiguration.

High latency for a virtual switch configured with load balancing can indicate a misconfiguration in the selected load balancing algorithm. Particularly if using the IP-hash balancing algorithm, check the switch to ensure all ports on the switch are configured for EtherChannel or 802.3ad. High latency may also indicate a misconfiguration of jumbo frames somewhere along the network path. Ensure all devices in the network have jumbo frames enabled.

References                                                                                                                                        

VMware Performance Best Practices for vSphere 5.5 - http://www.vmware.com/pdf/Perf_Best_Practices_vSphere5.5.pdf

VMware Resource Management - http://pubs.vmware.com/vsphere-51/topic/com.vmware.ICbase/PDF/vsphere-esxi-vcenter-server-51-resource-management-guide.pdf

Introduction

MarkLogic Server uses its list cache to hold search term lists in memory. If you attempt to execute a particularly non-selective or inefficient query, the query can fail because the size of the fetched search term lists exceeds the allocated list cache.

What do I do about XDMP-LISTCACHEFULL errors?

This error really boils down to the amount of an available resource (in this case, the list cache) vs. resource usage (here, the size of the fetched search term lists). The available options are:

  1. Increasing the amount of available resource - in this case, increasing the list cache size.
  2. Decreasing the amount of resource usage - here, either:
    • Finding and preventing inefficient queries.  That is, tune the queries in order to select fewer terms during index resolution;
    • Ensure appropriate indexes are enabled. For example, enabling positions may improve multi-term and proximity queries;
    • Reducing the size of forests, as smaller forests will have less data and consequently smaller term lists for a given query.

Note that option #1 (increasing the list cache size) is at best a temporary solution. There is a maximum size for the list cache of 32768 MB (73728 MB as of v7.0-5 and v8.0-2), and each partition should be no more than 8192 MB. Ultimately, if you're often seeing XDMP-LISTCACHEFULL errors, you're likely running insufficiently selective queries, and/or your forests are too big.
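
To see which term lists a query will fetch during index resolution (and therefore how selective it is), you can inspect its query plan from Query Console. This is a sketch only; the word query shown is just a placeholder:

  xquery version "1.0-ml";
  (: shows the index plan, including the term lists the query will pull from the list cache :)
  xdmp:plan(cts:search(fn:collection(), cts:word-query("example")))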

MarkLogic Server is designed to scale horizontally, and goes to great effort to make sure queries can be parallelized independently of one another. Nevertheless, there are occasions where users run into an issue where, when invoked in parallel, some subset of their queries takes much longer than usual to execute. The longer-running parallel invocations are almost always due to those queries' runtime being informed by

a. the runtime of the query in isolation (in other words, absent any other activity on the database) but also

b. the time that query instance spends waiting on resources necessary for it to make forward progress.

Resolving this kind of performance issue requires a careful analysis of how queries use resources as they work their way through your application stack. For example, if you have a web or application server in front of MarkLogic Server, how many threads do you have configured? How does that number compare to the number of threads configured on the MarkLogic application server to which it is connected? If the number of MarkLogic application server threads is much smaller than the number of potential incoming requests, then some of your queries will be fast because all they need to do is execute - and some of your queries will be slower because, in addition to the time needed to execute, they will also need to wait for a MarkLogic application server thread to free up. In this case, you will want to bring the numbers of threads into better alignment with one another - either by reducing the number of threads on the web or application server in front of MarkLogic, or by increasing the number of MarkLogic application server threads - or both.

You'll want to minimize the amount of time queries spend waiting for resources in general. Application server threads are just one example of a resource on which queries can wait. Queries can also wait for all sorts of other resources to free up - threads, RAM, storage I/O, read or write locks, etc. Ultimately, if you're seeing a performance profile where a single query invocation is fast but, among parallel invocations, some are fast and some are slow (often seen as higher query runtime averages and larger query runtime standard deviations), then you very likely have a resource bottleneck somewhere in your application stack. Resolving such a bottleneck will involve some combination of increasing the amount of available resource, reducing the amount of parallel traffic, or improving the overall efficiency of any one instance of your query.

Summary

XDMP-CANCELED indicates that a query or operation was cancelled either explicitly or as a result of a system event. XDMP-EXTIME also indicates that a query or operation was cancelled, but the reason for the cancellation is the result of the elapsed processing time exceeding a timeout setting.

XDMP-CANCELED:  Canceled request

The XDMP-CANCELED error message usually indicates that an operation such as a merge, backup or query was explicitly canceled. The message includes information about what operation was canceled. Cancellation may occur through the Admin Interface or by calling an explicit cancellation function, such as xdmp:request-cancel().

An XDMP-CANCELED error message can also occur when a client breaks the network socket connection to the server while a query is running (i.e. if the client abandons the request), resulting in the query being canceled.

try/catch:

An XDMP-CANCELED exception will not be caught in a try/catch block.

XDMP-EXTIME: Time limit exceeded

An XDMP-EXTIME error will occur if a query or other operation exceeded its processing time limit. Surrounding messages in the ErrorLog.txt file may pinpoint the operation which timed out.

Inefficient Queries

If the cause of the timeout is an inefficient or incorrect query, you should tune the query. This may involve tuning your query to minimize the amount of filtering required. Tuning queries in MarkLogic often includes maintaining the proper indexes for the database so that the queries can be resolved during the index resolution phase of query evaluation. If a query requires filtering of many documents, then the performance will be adversely affected. To learn more about query evaluation, refer to Section 2.1 'Understanding the Search Process' of the MarkLogic Server's Query Performance and Tuning Guide available in our documentation at https://docs.marklogic.com/guide/performance.pdf.  

MarkLogic has tools that can be used to help evaluate the characteristics of your queries. The best way to analyze a single query is to instrument the query with query trace, query meters, and query profiling API calls: query trace can be used to determine whether the queries are resolvable in the index, or whether filtering is involved; query meters gives statistics from a query execution; and query profiling provides information regarding how long each statement in your query took. Information regarding these APIs is available in the Query Performance and Tuning Guide.
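
As a brief sketch of how these APIs can be combined in Query Console (the search expression shown is only a placeholder):

  xquery version "1.0-ml";
  (: enable query trace for this request; trace events are written to the ErrorLog :)
  xdmp:query-trace(fn:true()),

  (: run the candidate expression - placeholder query :)
  cts:search(fn:collection(), cts:word-query("example"))[1 to 10],

  (: report index and cache statistics gathered for this request :)
  xdmp:query-meters()

To see per-statement timings instead, the same placeholder expression can be profiled, for example with prof:eval('cts:search(fn:collection(), cts:word-query("example"))[1 to 10]'), or by using the Profile tab in Query Console.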

The Query Console makes it easy to profile a query in order to view sub-statement execution times. Once you have identified the poor performing statements, you can focus on optimizing that part of the code.

Inadequate Processing Limit

If the cause of the timeout is an inadequate processing limit, you may be able to configure a more generous limit through the Admin Interface. 

A common setting which can contribute to the XDMP-EXTIME error message is the 'default time limit' setting for an Application Server or the Task Server.  An alternative to increasing the 'default time limit' is to use xdmp:set-request-time-limit() within your query.  Note that neither the 'default time limit' nor the request time limit can be larger than the "max time limit".
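
For example, a request that is expected to run long can raise its own limit from within the query; the value shown is just an illustration, and it cannot exceed the App Server 'max time limit':

  xquery version "1.0-ml";
  (: allow this request up to one hour; capped by the App Server "max time limit" :)
  xdmp:set-request-time-limit(3600),
  (: ...the long-running work would go here... :)
  fn:count(fn:collection())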

Resource Bottlenecks

If the cause of the timeout is the result of a resource bottleneck where the query or operation was not being serviced adequately, you will need to tune your system to eliminate the resource bottleneck. MarkLogic recommends that all systems where MarkLogic Server is installed should monitor the resource usage of its system components (i.e. CPU, memory, I/O, swap, network, ...) so that resource bottlenecks can easily be detected.

try/catch

XDMP-EXTIME can be caught in a try/catch block.

Introduction

The first time a query is executed, it will take longer to execute than subsequent runs. This extra time for the first run becomes more pronounced when importing large libraries. Why is this so, and is there anything that can be done to improve the performance?

Details

When MarkLogic evaluates an XQuery script, it first compiles it into a complete XQuery program. When compiling the program, the transitive closure of all imported modules is linked together and all function and variable names are resolved.

MarkLogic maintains a cache of pre-parsed library modules, so library modules are not re-parsed when a program is compiled. But every unique program needs to be globally linked together with all its included library modules before it is executed. The time for this linking can result in "First Runs" being slower than subsequent runs.

Performance recommendation

When using library modules, you will likely see better performance if you parameterize frequently used queries through variables rather than through code, so that the same compiled program can be reused. For example, use external variables in XDBC requests, or use request fields with application server requests.
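
A minimal sketch of this pattern follows. The module path /search.xqy and the variable name are hypothetical; the invoked module declares an external variable, and the caller supplies a fresh value per request so the program text (and its compiled, linked form) stays the same:

  (: /search.xqy - invoked module, parameterized through an external variable :)
  xquery version "1.0-ml";
  declare variable $term as xs:string external;
  cts:search(fn:collection(), cts:word-query($term))[1 to 10]

The caller then passes the value at request time instead of generating new query text:

  xquery version "1.0-ml";
  xdmp:invoke("/search.xqy", (xs:QName("term"), "marklogic"))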