Data distribution imbalance with long URI prefix in bucket assignment
21 July 2022 01:55 PM
A long URI prefix may lead to an imbalance in data distribution among the forests.
The database assignment policy is set to 'Bucket'. The rebalancer is enabled and no fragments are pending rebalancing; however, data is imbalanced across the forests attached to the database. A few forests have a higher number of fragments than the other forests in the database.
With the bucket assignment policy, the document URI is hashed to select a specific bucket. The bucket policy algorithm maps a document’s URI to one of 16K “buckets,” with each bucket associated with a forest. A table mapping buckets to forests is stored in memory for fast assignment.
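The URI-to-bucket-to-forest lookup described above can be sketched roughly as follows. This is only an illustration: SHA-256 stands in for the server's real hash function (which this article does not specify), the round-robin table layout is an assumption, and the forest names are hypothetical.

```python
import hashlib

BUCKET_COUNT = 16 * 1024  # the "16K buckets" described above


def build_bucket_table(forest_names):
    """Assign every bucket to a forest, round-robin (illustrative layout only)."""
    return [forest_names[b % len(forest_names)] for b in range(BUCKET_COUNT)]


def bucket_for_uri(uri):
    """Stand-in hash (NOT the server's algorithm): SHA-256 reduced to a bucket number."""
    digest = hashlib.sha256(uri.encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big") % BUCKET_COUNT


def forest_for_uri(uri, bucket_table):
    """Look up the forest that owns the URI's bucket in the in-memory table."""
    return bucket_table[bucket_for_uri(uri)]


forests = ["forest-1", "forest-2", "forest-3"]  # hypothetical forest names
table = build_bucket_table(forests)
print(forest_for_uri("/orders/12345.xml", table))
```

The point of the table is that assignment is a hash plus one list lookup, and that the same URI always lands in the same forest as long as the table does not change.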
The bucket algorithm does not consider the whole length of the URI when it computes the hash that determines the bucket. In the bucket assignment policy, URI-based bucket determination relies largely on the initial characters of the URI.
If document URIs share a long common prefix, they will all produce the same hash value and land in the same bucket, even if they end in different suffix numbers. The longer the common prefix, the more skewed the resulting distribution.
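The effect can be reproduced with a toy model. The sketch below is not the server's hash: it assumes a hash that reads only the first 32 characters of the URI (the cut-off value is hypothetical, chosen purely for illustration) and compares URIs that share a long prefix against short URIs.

```python
import hashlib
from collections import Counter

BUCKET_COUNT = 16 * 1024
PREFIX_CHARS = 32  # hypothetical cut-off; the real value is not stated in this article


def toy_bucket(uri, prefix_chars=PREFIX_CHARS):
    """Toy hash that only looks at the first `prefix_chars` characters of the URI."""
    head = uri[:prefix_chars].encode("utf-8")
    return int.from_bytes(hashlib.sha256(head).digest()[:8], "big") % BUCKET_COUNT


# Hypothetical URIs: the shared prefix is longer than PREFIX_CHARS.
long_prefix = "/application/data/customers/orders/archive/"
skewed = [f"{long_prefix}{n}.xml" for n in range(1000)]
spread = [f"/{n}.xml" for n in range(1000)]

skewed_buckets = Counter(toy_bucket(u) for u in skewed)
spread_buckets = Counter(toy_bucket(u) for u in spread)

print(len(skewed_buckets))  # distinct buckets used by the long-prefix URIs
print(len(spread_buckets))  # distinct buckets used by the short URIs
```

In this model every long-prefix URI falls into a single bucket, because the differing suffix lies beyond the characters the hash reads, while the short URIs spread over many buckets.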
To confirm whether this is the cause of the uneven fragment counts between the forests of a database, you can run a query that returns a sample of 100 documents from each forest, and then review whether the document URIs in the forests with more fragments share a common prefix.
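Once the sampled URIs have been exported, the review for a shared prefix can be automated. The following is a minimal sketch that assumes the samples are available as a dictionary keyed by forest name; the function name and the sample data are hypothetical.

```python
import os.path


def common_prefix_report(samples_by_forest):
    """Return {forest: (shared prefix, its length)} for each forest's sampled URIs."""
    report = {}
    for forest, uris in samples_by_forest.items():
        prefix = os.path.commonprefix(uris)  # compares character by character
        report[forest] = (prefix, len(prefix))
    return report


# Hypothetical sample data; in practice, use the URIs returned by the sampling query.
samples = {
    "forest-1": ["/application/data/customers/orders/1.xml",
                 "/application/data/customers/orders/2.xml"],
    "forest-2": ["/a/1.xml", "/b/2.xml"],
}

for forest, (prefix, length) in common_prefix_report(samples).items():
    print(forest, length, repr(prefix))
```

A forest whose sample shows a long shared prefix, combined with a high fragment count, is a candidate for the skew described in this article.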
We recommend that document URIs not be long and not share a long common prefix. A value that is common to many document URIs can be moved into a collection instead.
This is only one example; the general suggestion is to adopt a URI naming pattern that avoids a long common prefix, or to store the shared part as a collection.
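As a rough illustration of moving a shared prefix into a collection, the sketch below splits a URI into a shorter URI and a collection name. The helper name, the prefix, and the resulting naming pattern are all hypothetical; the right pattern depends on the application.

```python
def split_prefix_into_collection(uri, common_prefix):
    """Return (short_uri, collection) with the shared prefix moved to a collection name."""
    if not uri.startswith(common_prefix):
        return uri, None  # URI does not carry the prefix; leave it unchanged
    short_uri = "/" + uri[len(common_prefix):].lstrip("/")
    collection = common_prefix.rstrip("/")
    return short_uri, collection


prefix = "/application/data/customers/orders/"  # hypothetical shared prefix
print(split_prefix_into_collection(prefix + "12345.xml", prefix))
```

Note that the shortened URIs must still be unique across the database, so this only works when the part after the prefix is already unique or is made unique by the naming pattern.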
You can use the xdmp:document-assign built-in to verify whether URIs are distributed as the bucket algorithm dictates.