Data Balancing in MarkLogic
27 September 2016 01:58 PM
MarkLogic Server clusters are built on a distributed, shared-nothing architecture. Typical query loads maximize resource utilization only when database content is evenly distributed across the D-node hosts in a cluster. That is, performance is best when each node in the cluster is asked to do an equivalent amount of concurrent work. Balancing your data across the forests in your cluster is therefore necessary to achieve optimal performance.
If all of the forests in a multi-forest database have been present since the database was created, each forest will likely hold approximately the same number of documents. If forests were added later, the newer forests will tend to have fewer documents. In such cases, rebalancing the forests may be in order.
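One rough way to judge whether a database is skewed is to compare the fullest forest against the average. The sketch below is illustrative only: the forest names and document counts are made up, and the function name forest_skew is hypothetical rather than part of any MarkLogic API. In practice the counts would come from the forest status information for your own database.

```python
def forest_skew(doc_counts):
    """Return the ratio of the fullest forest to the mean forest size.

    A value of 1.0 means perfectly even distribution; larger values mean
    one forest (and so one host) carries more than its share of documents.
    """
    if not doc_counts:
        raise ValueError("at least one forest is required")
    mean = sum(doc_counts.values()) / len(doc_counts)
    if mean == 0:
        return 1.0  # an empty database is trivially balanced
    return max(doc_counts.values()) / mean


# Hypothetical counts: three original forests plus one added later.
counts = {
    "forest-1": 400_000,
    "forest-2": 410_000,
    "forest-3": 390_000,
    "forest-4": 20_000,  # newer forest, far fewer documents
}
skew = forest_skew(counts)
print(round(skew, 2))  # 1.34
```

A ratio well above 1.0, as in this made-up example, suggests the newer forest is underused and rebalancing may be worth considering.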
Default Document Forest Assignment (Legacy assignment policy)
Versions before MarkLogic 7 used a default document forest assignment policy, now known as the legacy policy. In MarkLogic 7, this is the assignment policy used by default when the 'rebalancer enable' setting for a database is set to false.
Under the legacy assignment policy, a new document in a multi-forest database is assigned to a forest based on a hash of its URI. For practical purposes, the default forest assignment is random. In most cases, the default behavior is sufficient to guarantee evenly distributed content.
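The idea of hash-based assignment can be sketched in a few lines. This is a simulation under stated assumptions, not MarkLogic's implementation: the article does not describe the actual hash function, so MD5 is used here purely as a stand-in stable hash, and the function name assign_forest and the sample URIs are hypothetical.

```python
import hashlib
from collections import Counter


def assign_forest(uri, forest_count):
    """Map a URI to a forest index using a stable hash (MD5 as a stand-in)."""
    digest = hashlib.md5(uri.encode("utf-8")).digest()
    # Use the first 8 bytes of the digest as an integer, then take the modulus.
    return int.from_bytes(digest[:8], "big") % forest_count


# Hypothetical URIs spread over a 4-forest database.
uris = [f"/docs/{i}.xml" for i in range(10_000)]
counts = Counter(assign_forest(uri, 4) for uri in uris)

for forest_index in sorted(counts):
    print(forest_index, counts[forest_index])
```

Because the hash depends only on the URI, the same document always maps to the same forest, while the spread over many URIs is effectively random and therefore close to even.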
There are API functions that allow you to determine where a document resides (xdmp:document-forest()) or will reside (xdmp:document-assign()).
'In-forest placement' is a technique that is used to override the default document forest assignments.
Both xdmp:document-insert() and xdmp:document-load() allow you to specify the forest in which the document will be inserted.
mlcp has a -fastload option that inserts content directly into forests. See Time vs. Correctness: Understanding -fastload Tradeoffs to understand the tradeoffs when using this option.
Some common open source document loading tools also support in-forest placement. RecordLoader (http://developer.marklogic.com/code/recordloader) and XQsync (http://developer.marklogic.com/code/xqsync) support in-forest placement with the OUTPUT_FORESTS property setting.
MarkLogic 7 introduced database rebalancing using a database rebalancer configured with one of several assignment policies.
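A database rebalancer consists of two parts: an assignment policy, which governs where documents are placed on insert and during rebalancing, and a rebalancer, which moves the data.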
The rebalancer can be configured with one of several assignment policies, which define what is considered 'balanced' for a database. The rebalancer runs on each forest and consults the database's assignment policy to determine which documents do not 'belong to' this forest and then pushes them to the correct forests. You can read more about database rebalancing at http://docs.marklogic.com/guide/admin/database-rebalancing
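The per-forest check described above can be sketched as a small simulation. Everything here is an assumption made for illustration: the policy is a simple stable hash modulo the forest count (not the bucket policy or any other real MarkLogic policy), forests are modeled as in-memory sets of URIs, and the names policy_forest, misplaced, and rebalance are hypothetical.

```python
import hashlib


def policy_forest(uri, forest_count):
    """Stand-in assignment policy: stable hash of the URI modulo forest count."""
    digest = hashlib.md5(uri.encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big") % forest_count


def misplaced(forest_index, uris, forest_count):
    """List (uri, target) for documents in this forest that belong elsewhere."""
    moves = []
    for uri in uris:
        target = policy_forest(uri, forest_count)
        if target != forest_index:
            moves.append((uri, target))
    return moves


def rebalance(forests):
    """Check every forest against the policy, then push misplaced documents.

    Returns the number of documents moved.
    """
    forest_count = len(forests)
    # Build the full plan first so no set is modified while being iterated.
    plan = [
        (source, uri, target)
        for source, uris in enumerate(forests)
        for uri, target in misplaced(source, uris, forest_count)
    ]
    for source, uri, target in plan:
        forests[source].remove(uri)
        forests[target].add(uri)
    return len(plan)


# Three forests filled under a 3-forest policy, then an empty fourth is added.
uris = [f"/docs/{i}.xml" for i in range(1000)]
forests = [set() for _ in range(3)]
for uri in uris:
    forests[policy_forest(uri, 3)].add(uri)
forests.append(set())

moved = rebalance(forests)
print(moved, [len(f) for f in forests])
```

After one pass every document sits in the forest the policy names, so a second call to rebalance moves nothing. How many documents have to move when a forest is added depends heavily on the policy chosen, which is one reason several policies are offered.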
For a newly created database, the rebalancer is enabled by default and the assignment policy is bucket. In versions before MarkLogic 7, documents were assigned using the legacy policy by default.
(Note that rebalancing forests may result in forests that contain many deleted fragments. To recover disk space, you may wish to force some forests to merge.)
Before Rebalancing, Consider This …
Before embarking on a process to rebalance the documents in your database, consider that rebalancing is generally slower than clearing the database and reloading. The reason is that rebalancing involves updating documents, and updates are more expensive than inserts. Rebalancing the forests may not be the best solution. If you have the luxury of clearing the database and reloading everything, do it. However, if the database must remain available throughout the process, then using the rebalancer may be appropriate.