Data distribution imbalance with a long URI prefix under the bucket assignment policy
21 July 2022 01:55 PM

Summary

With the bucket assignment policy, document URIs that share a long common prefix may lead to an imbalance in data distribution among the forests of a database.

Observation

The database assignment policy is set to 'Bucket'. The rebalancer is enabled and no fragments are pending rebalancing; however, data is unevenly distributed across the forests attached to the database. A few forests hold a higher number of fragments than the other forests in the same database.

Root cause

With the bucket assignment policy, the document URI is hashed to select a specific bucket. The bucket policy algorithm maps a document’s URI to one of 16K “buckets,” with each bucket associated with a forest. A table mapping buckets to forests is stored in memory for fast assignment.

The bucket algorithm does not take the full length of the URI into account when it computes the hash that determines the bucket. URI-based bucket determination relies largely on the initial characters of the URI.

If document URIs share a long common prefix, they all produce the same hash value and are assigned to the same bucket, even when their suffixes differ. The longer the common prefix, the more skewed the resulting distribution.

Analysis

To investigate an uneven number of fragments across the forests of a database, run the query below. It returns a sample of up to 100 document URIs from each forest, so you can check whether the URIs in the forests with the most fragments share a common prefix.

xquery version "1.0-ml";

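(: Replace <dbname> with the database name. cts:uris requires the URI lexicon to be enabled. :)
for $i in xdmp:database-forests(xdmp:database('<dbname>'))
    (: Take the first 100 URIs stored in this forest. :)
    let $uri := for $j in fn:subsequence(cts:uris((),(),(),(), $i), 1, 100)
                return <uri>{$j}</uri>
return <forests><forest>{$i}</forest><uris>{$uri}</uris></forests>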
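
Before sampling URIs, it can help to see how uneven the forests actually are. The following is a minimal sketch, not part of the original query, that estimates the number of document fragments in each forest; it assumes the same <dbname> placeholder and uses an empty cts:and-query to match every document.

```xquery
xquery version "1.0-ml";

(: Sketch: estimated number of document fragments per forest. :)
for $forest in xdmp:database-forests(xdmp:database('<dbname>'))
(: An empty and-query matches all documents; the fifth argument restricts the search to one forest. :)
let $count := xdmp:estimate(cts:search(fn:doc(), cts:and-query(()), (), (), $forest))
order by $count descending
return
  <forest>
    <name>{xdmp:forest-name($forest)}</name>
    <estimate>{$count}</estimate>
  </forest>
```

Forests at the top of this list are the ones whose sampled URIs are worth reviewing for a common prefix.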

Recommendation

We recommend that document URIs avoid long names and long common prefixes. Values that are common to many document URIs can be moved into collections instead.

Example URI: /Prime/InternationalTradeDay/Activity/AccountId/ABC0001/BusinessDate/2021-06-14/CurrencyCode/USD/ID/ABC0001-XYZ-123.json

This can become /ABC0001-XYZ-123.json, with the document placed in the collections "USD" and "Prime" and carrying a date element with the value "2021-06-14".

The above is only an example; the suggestion is to adopt a URI naming pattern that avoids a long common prefix, or to store the shared values as collections instead.
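
As an illustration of this recommendation, the sketch below inserts a document under the short URI and moves the former path segments into collections and a date property. The property names (businessDate, accountId) and the choice of default permissions are assumptions made for this example only; adapt them to your own data model and security requirements.

```xquery
xquery version "1.0-ml";

(: Sketch: short URI, with the shared values stored as collections and a date property. :)
xdmp:document-insert(
  "/ABC0001-XYZ-123.json",
  (: Hypothetical JSON content; the property names are illustrative. :)
  object-node {
    "accountId": "ABC0001",
    "businessDate": "2021-06-14"
  },
  xdmp:default-permissions(),
  ("USD", "Prime")
)
```

Documents stored this way can still be selected by the former path values, for example with cts:collection-query("USD") combined with a query on the date property.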

You can use the xdmp:document-assign built-in to check how URIs are distributed by the bucket algorithm.

https://docs.marklogic.com/xdmp:document-assign
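
A minimal sketch of such a check follows. It generates 1000 hypothetical URIs that share the long prefix from the example and tallies the forest position that xdmp:document-assign returns for each. The forest count of 4, the number of URIs, and the suffix pattern are assumptions for illustration; use the number of forests in your own database and a sample of your real URIs.

```xquery
xquery version "1.0-ml";

(: Assumed values for illustration only. :)
let $forest-count := 4
let $prefix := "/Prime/InternationalTradeDay/Activity/AccountId/ABC0001/BusinessDate/2021-06-14/CurrencyCode/USD/ID/"
(: Forest position (1-based) assigned to each generated URI under the bucket policy. :)
let $assignments :=
  for $n in 1 to 1000
  return xdmp:document-assign(
    fn:concat($prefix, "ABC0001-XYZ-", $n, ".json"),
    $forest-count,
    "bucket")
(: Tally how many URIs were assigned to each forest position. :)
for $position in fn:distinct-values($assignments)
order by $position
return fn:concat("forest position ", $position, ": ",
                 fn:count($assignments[. eq $position]), " URIs")
```

Running the same tally with the prefix shortened to "/" gives a baseline to compare against when judging whether a prefix is skewing the assignment.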
