Knowledgebase: MarkLogic Server
How reindexing/rebalancing works, and the impact on performance
26 May 2021 02:04 PM

Summary

While reindexing should be an infrequent operation in a production environment, it is important to understand how the process can impact a MarkLogic environment. This article describes the process of reindexing and explores how it may affect performance of the server.

Why reindex?

MarkLogic enables some default full-text search indexes and any inserted content populates these indexes. There are situations where these indexes may need to be changed, including:

  • to enable additional search functionality in a MarkLogic application
  • to increase accuracy of unfiltered searches
  • to create additional facets or lexicon-based functions
  • to recognize enhancements or bug fixes between MarkLogic versions
  • to remove unused indexes to reclaim disk space

When the indexes are changed, the server begins reindexing all affected content. In most cases this includes every document in the database, although the server tries to reindex only the fragments whose content would populate the added or removed index.

How reindexing works

When configuration changes are made in the admin interface or the Admin API, the server writes a new version of its configuration files. Directly after these configuration changes are made, or on startup, the server automatically starts reindexing forests. If no index changes have been made, the server simply reindexes zero fragments. If the changes include index settings, however, the server will find that some or all fragments need to be reindexed. The server queries the content, picks up the first 500 fragments that have not been reindexed, and reinserts this content into the database with the new index settings. In this way, reindexing is very much like a simple document update, except that the process is automated and the index settings are different. Once these 500 fragments have been completed, the server gets the next 500 fragments, and this process continues until the query returns zero fragments to reindex.
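The batch loop described above can be pictured with a small toy model. This is only a sketch of the control flow, not MarkLogic's implementation: the `Fragment` class, the `index_version` field, and the `reindex` function are hypothetical names introduced for illustration, and only the batch size of 500 comes from this article.

```python
from dataclasses import dataclass

BATCH_SIZE = 500  # batch size stated in the article


@dataclass
class Fragment:
    fragment_id: int
    index_version: int  # hypothetical marker for which index settings were applied


def reindex(fragments, target_version, batch_size=BATCH_SIZE):
    """Reinsert stale fragments in batches; return the number of batches that did work."""
    batches = 0
    while True:
        # "Query" for the next fragments that still carry the old index settings.
        stale = [f for f in fragments if f.index_version != target_version][:batch_size]
        if not stale:
            break  # the query returned zero fragments, so reindexing is complete
        for fragment in stale:
            # Reinserting with the new settings is modeled as a version bump.
            fragment.index_version = target_version
        batches += 1
    return batches


forest = [Fragment(i, index_version=1) for i in range(1200)]
print(reindex(forest, target_version=2))  # 3 (batches of 500, 500, and 200)
print(reindex(forest, target_version=2))  # 0 (no index change, nothing to reindex)
```

The second call mirrors the case where no index settings changed: the query finds nothing stale and the loop exits immediately.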

Reindexing consumes additional disk space during the process itself. In particular, at any point in the reindexing process, the server can have up to three instances of a single fragment:

  • the original document (original indexes)
  • updated document (new indexes)
  • merged document (only if the stand where this document resides is currently being merged)

In a worst-case scenario, more likely to happen towards the end of reindexing, the disk footprint of all the forests could be 3x the original size. For this reason, MarkLogic requires extra disk space for reindexing. This design choice ensures integrity of the content and allows for zero downtime when reindexing.
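As a rough illustration of the 3x worst case, the arithmetic can be written down and compared with the free space on a volume. This is a sketch based only on the figure quoted in this article, not MarkLogic's official sizing formula; the function names and the 200 GiB forest size are hypothetical.

```python
import shutil


def worst_case_reindex_bytes(forest_bytes, copies=3):
    """Worst-case on-disk size during reindexing: up to `copies` instances per fragment."""
    return forest_bytes * copies


def has_headroom(forest_bytes, free_bytes, copies=3):
    """True if free space covers the growth from the current size to the worst case."""
    extra_needed = worst_case_reindex_bytes(forest_bytes, copies) - forest_bytes
    return free_bytes >= extra_needed


forest_bytes = 200 * 1024**3  # hypothetical: 200 GiB of forest data
free_bytes = shutil.disk_usage(".").free  # free space on the current volume
print(has_headroom(forest_bytes, free_bytes))
```

With these assumptions, a 200 GiB forest could need up to 400 GiB of additional space before merges reclaim the older copies.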

Performance Impact

Reindexing is a resource-intensive operation that uses both CPU and disk bandwidth. The CPU is busy parsing the content and generating index entries, while the disk is reading fragments for reindexing, writing new stands, and running merges on these newly created stands. You can expect a significant performance impact in environments that are normally heavily utilized. You can decrease the impact of reindexing by using the reindexer throttle setting on the database configuration page. Reducing the value below 5 introduces a delay between the completion of one batch of 500 fragments and the query for the next 500 fragments.
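The effect of the throttle can be sketched as a pause inserted between batches. The article does not state how long the server waits at each setting, so the mapping in `pause_seconds` (half a second per level below 5) and the assumed 1 to 5 range are purely hypothetical; only the idea that a value below 5 adds a delay between batches comes from the text above.

```python
import time

MAX_THROTTLE = 5  # the article's starting value; lower values add a pause


def pause_seconds(throttle, step=0.5):
    """Hypothetical mapping: no pause at 5, one extra `step` per level below 5."""
    if not 1 <= throttle <= MAX_THROTTLE:
        raise ValueError("throttle must be between 1 and 5")
    return (MAX_THROTTLE - throttle) * step


def run_batches(batch_count, throttle, sleep=time.sleep):
    """Process batches, pausing between consecutive batches; return total pause time."""
    total = 0.0
    for batch in range(batch_count):
        # ... a batch of 500 fragments would be reindexed here ...
        if batch < batch_count - 1:
            delay = pause_seconds(throttle)
            sleep(delay)
            total += delay
    return total


# A no-op sleep keeps the demonstration instant.
print(run_batches(3, throttle=5, sleep=lambda s: None))  # 0.0
print(run_batches(3, throttle=3, sleep=lambda s: None))  # 2.0
```

The trade-off is the one the article describes: each pause leaves more CPU and disk bandwidth for regular traffic, at the cost of a longer overall reindex.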

Recommendations

Here are some recommendations when considering reindexing:

  • Plan to make multiple index changes at once to avoid reindexing multiple times
  • Disable reindexing (database configuration) to avoid accidentally forcing a reindex, only re-enabling it when reindexing is explicitly planned
  • Only enable reindexing (database configuration) during off-peak hours. The reindex will take longer to complete, but performance during peak hours will be better.
  • Check for free disk space before the reindex process begins (see Understanding MarkLogic Minimum Disk Space Requirements)
  • Ensure the environment has sufficient I/O bandwidth
  • Disable application/user access if you can afford the downtime, as this may improve overall reindex performance

Rebalancing

Changes in the cluster configuration may require rebalancing content across forests. Rebalancing works similarly to reindexing: batches of documents are marked deleted in one forest and inserted into another forest. The performance impact and recommendations are thus the same as for reindexing.
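The mark-deleted-and-insert pattern can be illustrated with a toy model in which each forest is a dictionary. This is a sketch under stated assumptions, not the server's rebalancer: the `rebalance` function, the `should_move` rule, and the document URIs are all hypothetical.

```python
def rebalance(source, target, should_move, batch_size=500):
    """Move selected documents from source to target in batches; return the batch count.

    Each forest is modeled as a dict: uri -> {"content": str, "deleted": bool}.
    """
    batches = 0
    while True:
        batch = [
            uri for uri, doc in source.items()
            if not doc["deleted"] and should_move(uri)
        ][:batch_size]
        if not batch:
            break  # nothing left to move
        for uri in batch:
            target[uri] = {"content": source[uri]["content"], "deleted": False}
            source[uri]["deleted"] = True  # marked deleted, not physically removed
        batches += 1
    return batches


forest_a = {f"/doc/{i}.xml": {"content": "...", "deleted": False} for i in range(1000)}
forest_b = {}
# Hypothetical assignment rule: even-numbered documents belong in forest_b.
moved_batches = rebalance(
    forest_a, forest_b,
    should_move=lambda uri: int(uri.split("/")[-1].split(".")[0]) % 2 == 0,
    batch_size=200,
)
print(moved_batches, len(forest_b))  # 3 500
```

Because the moved documents remain in the source forest until they are cleaned up, the model also shows why rebalancing temporarily needs extra disk space, just as reindexing does.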
