How reindexing/rebalancing works, and the impact on performance
26 May 2021 02:04 PM
While reindexing should be an infrequent operation in a production environment, it is important to understand how the process can impact a MarkLogic environment. This article describes the process of reindexing and explores how it may affect performance of the server.
MarkLogic enables some default full-text search indexes and any inserted content populates these indexes. There are situations where these indexes may need to be changed, including:
When the indexes are changed, the server begins reindexing all affected content. In most cases this includes every document in the database, although the server tries to reindex only the fragments whose content would populate the added or removed index.
How reindexing works
When configuration changes are made in the admin interface or the Admin API, the server writes a new version of its configuration files. Directly after these changes are made, or on startup, the server automatically starts reindexing forests. If no index changes have been made, the server simply reindexes zero fragments. If the changes include index settings, however, the server will find that some or all fragments need to be reindexed. The server queries the content, picks up the first 500 fragments that have not been reindexed, and reinserts this content into the database with the new index settings. In this way, reindexing is very much like a simple document update, except that the process is automated and the index settings are different. Once these 500 fragments have been completed, the server gets the next 500 fragments, and the process continues until the query returns zero fragments to reindex.
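The batch loop described above can be sketched as a small simulation. This is an illustration only, not MarkLogic's implementation: the `Fragment` class, its `index_version` field, and the `reindex` function are hypothetical names introduced here, and the in-memory list stands in for a forest.

```python
from dataclasses import dataclass

BATCH_SIZE = 500  # batch size stated in the article


@dataclass
class Fragment:
    uri: str
    index_version: int  # hypothetical marker for the settings that indexed it


def reindex(fragments, target_version, batch_size=BATCH_SIZE):
    """Reindex in batches until no stale fragments remain; return batch sizes."""
    batch_sizes = []
    while True:
        # "Query" for the next fragments still indexed under the old settings.
        stale = [f for f in fragments if f.index_version != target_version]
        batch = stale[:batch_size]
        batch_sizes.append(len(batch))
        if not batch:
            break  # the query returned zero fragments, so reindexing is done
        for fragment in batch:
            # Stand-in for reinserting the fragment with the new index settings.
            fragment.index_version = target_version
    return batch_sizes


forest = [Fragment(f"/doc/{i}.xml", 1) for i in range(1200)]
print(reindex(forest, target_version=2))  # [500, 500, 200, 0]
print(reindex(forest, target_version=2))  # [0]
```

The second call mirrors the case where no index settings have changed: the first query finds nothing to do and the loop ends after reindexing zero fragments.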
Reindexing consumes additional disk space during the process itself. In particular, at any point in the reindexing process, the server can have up to three instances of a single fragment:
In a worst-case scenario, more likely to happen towards the end of reindexing, the disk footprint of all the forests could be 3x the original size. For this reason, MarkLogic requires extra disk space for reindexing. This design choice ensures integrity of the content and allows for zero downtime when reindexing.
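As a rough planning aid, the worst-case figure above can be turned into a simple headroom check. This is a minimal sketch that takes the 3x multiplier from the article as an upper bound; the function names and the example forest sizes are made up for illustration.

```python
def worst_case_footprint_gb(forest_sizes_gb, copies_per_fragment=3):
    """Upper bound on disk use if every fragment exists in three instances."""
    return copies_per_fragment * sum(forest_sizes_gb)


def has_headroom(forest_sizes_gb, disk_capacity_gb):
    """True if the disk can hold the worst-case reindexing footprint."""
    return worst_case_footprint_gb(forest_sizes_gb) <= disk_capacity_gb


forests = [100, 120, 80]                # hypothetical forest sizes in GB
print(worst_case_footprint_gb(forests))  # 900
print(has_headroom(forests, 1000))       # True
print(has_headroom(forests, 800))        # False
```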
Reindexing is a resource-intensive operation, as it uses both CPU and disk bandwidth. The CPU will be busy parsing the content and generating index entries, while the disk will be reading fragments for reindexing, writing new stands, and running merges on these newly created stands. You can expect a significant performance impact in environments that are normally heavily utilized. You can decrease the impact of reindexing by using the reindexer throttle setting in the database configuration page. Reducing the value from 5 introduces a delay between the completion of one batch of 500 fragments and the query for the next 500 fragments.
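The trade-off behind the throttle can be illustrated with a back-of-the-envelope estimate: a pause between batches lowers the sustained load but lengthens the total run. The sketch below assumes a fixed time per batch and a caller-supplied pause; the actual delay that each throttle value introduces is not stated in this article, so `delay_s` is a hypothetical input rather than a MarkLogic setting.

```python
import math


def estimated_duration_s(total_fragments, seconds_per_batch, delay_s, batch_size=500):
    """Estimate total reindexing time with a pause between consecutive batches."""
    batches = math.ceil(total_fragments / batch_size)
    pauses = max(batches - 1, 0)  # one pause between each pair of batches
    return batches * seconds_per_batch + pauses * delay_s


# Hypothetical numbers: 1,000,000 fragments at 2 seconds per batch.
print(estimated_duration_s(1_000_000, 2.0, delay_s=0.0))  # 4000.0
print(estimated_duration_s(1_000_000, 2.0, delay_s=1.0))  # 5999.0
```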
Here are some recommendations when considering reindexing:
Changes in the cluster configuration may require rebalancing content across forests. Rebalancing works similarly to reindexing: batches of documents are marked deleted in one forest and inserted into another forest. The performance impact and recommendations are thus the same as for reindexing.
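The batch movement described above can be sketched in the same spirit. This is a simplified model, not MarkLogic's rebalancer: the `rebalance` function is hypothetical, forests are plain lists, the batch size of 500 is borrowed from the reindexing description, and the goal of evening out document counts is an assumption standing in for whatever assignment policy the database uses.

```python
def rebalance(source, target, batch_size=500):
    """Move documents from source to target in batches until counts are even."""
    moved = 0
    while len(source) - len(target) > 1:
        surplus = (len(source) - len(target)) // 2
        batch = source[:min(batch_size, surplus)]
        del source[:len(batch)]  # stand-in for marking the batch deleted
        target.extend(batch)     # stand-in for inserting into the other forest
        moved += len(batch)
    return moved


old_forest = [f"/doc/{i}.xml" for i in range(1200)]
new_forest = []
print(rebalance(old_forest, new_forest))  # 600
print(len(old_forest), len(new_forest))   # 600 600
```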