Knowledgebase: Performance Tuning
Data Balancing in MarkLogic
22 March 2023 11:52 AM

Summary

MarkLogic Server clusters are built on a distributed, shared nothing architecture.  Typical query loads will maximize resource utilization only when database content is evenly distributed across D-node hosts in a cluster.  That is, optimal performance will occur when the amount of concurrent work required of each node in a cluster is equivalent.   Having your data balanced across the forests in your cluster is necessary in order to achieve optimal performance.

If all of the forests in a multi-forest database are present from the time when the database was created, the forests will likely each have approximately the same number of documents. If forests were added later on, the newer forests will tend to have fewer documents. In cases like this, rebalancing the forests may be in order.

Default Document Forest Assignment (bucket assignment policy)

Earlier versions used a default document forest assignment policy (or legacy policy). Currently bucket is the default assignment policy and the rebalancer enable configuration for a database is set to true.

In bucket assignment policy, in a multi-forest database, a new document gets assigned to a forest based on the URI hash.  For practical purposes, the default forest assignment is random. In most cases, the default behavior is sufficient to guarantee evenly distributed content.  

There are API functions that allow you to determine where a document resides or will reside:

  • The xdmp:document-assign() function can be used to determine the forest for which a document URI will be assigned. 
  • For existing documents, document updates will occur in the same forest as the existing document. The xdmp:document-forest() function can be used to determine which forest the document is assigned to. 

In-Forest Placement

'In-forest placement' is a technique that is used to override the default document forest assignments.  

Both xdmp:document-insert() and xdmp:document-load() allow you to specify the forest in which the document will be inserted.

mlcp has a -fastload option which will insert content directly.  See Time vs. Correctness: Understanding -fastload Tradeoffs to understand the tradeoffs when using this option.

Some common open source document loading tools also support in-forest placement. RecordLoader (http://developer.marklogic.com/code/recordloader) and XQsync (http://developer.marklogic.com/code/xqsync) support in-forest placement with the OUTPUT_FORESTS property setting.

Rebalancing

MarkLogic 7 introduced database rebalancing using a database rebalancer configured with one of several assignment policies.

A database rebalancer consists of two parts:

  1. an assignment policy for data insert and rebalancing, and
  2. a rebalancer for data movement.

The rebalancer can be configured with one of several assignment policies, which define what is considered 'balanced' for a database. The rebalancer runs on each forest and consults the database's assignment policy to determine which documents do not 'belong to' this forest and then pushes them to the correct forests. You can read more about database rebalancing at http://docs.marklogic.com/guide/admin/database-rebalancing

For a brand new database, the rebalancer is enabled by default and the assignment policy is bucket.  For older versions (before ML 7), by default, the assignment was done using legacy policy.  Upgrades do not change the assignment policy for a database.

(Note that rebalancing forests may result in forests that contain many deleted fragments. To recover disk space, you may wish to force some forests to merge.)

Before Rebalancing, Consider This …

Before embarking on a process to rebalance the documents in your database, consider that rebalancing is generally slower than clearing the database and reloading. The reason is that rebalancing involves updating documents, and updates are more expensive than inserts. Rebalancing the forests may not be the best to solution. If you have the luxury of clearing the database and reloading everything, do it.  However, if the database must be available throughout the rebalancing process, then using the rebalancer may be appropriate.

(4 vote(s))
Helpful
Not helpful

Comments (0)