Knowledgebase: MarkLogic Server
How reindexing works, and its impact on performance
05 November 2019 02:03 PM

Summary

While reindexing should be an infrequent operation in a production environment, it is important to understand how the process can impact a MarkLogic environment. This article describes the process of reindexing and explores how it may affect performance of the server.

Why reindex?

MarkLogic enables some full-text search indexes by default, and any inserted content populates these indexes. There are situations where these indexes may need to be changed, including:

  • to enable additional search functionality in a MarkLogic application
  • to increase accuracy of unfiltered searches
  • to create additional facets or lexicon-based functions
  • to recognize enhancements or bug fixes between MarkLogic versions
  • to remove unused indexes to reclaim disk space

When the indexes are changed, the server begins reindexing all affected content. In most cases this includes all documents in the database, although the server tries to reindex only the fragments containing content that the added or removed index would cover.

How reindexing works

When configuration changes are made in the admin interface or through the Admin API, the server writes a new version of its configuration files. Directly after these configuration changes are made, or on startup, the server automatically starts reindexing forests. If no index changes have been made, the server simply reindexes zero fragments. If the changes include index settings, however, the server will find that some or all fragments need to be reindexed. The server queries the content, picks up the first 500 fragments that have not been reindexed, and reinserts this content into the database with the new index settings. In this way, reindexing is very much like a simple document update, except that the process is automated and the index settings are different. Once these 500 fragments have been completed, the server gets the next 500 fragments, and this process continues until the query returns zero fragments to reindex.
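
The batch loop described above can be sketched in a few lines of Python. This is only an illustrative model, not MarkLogic's implementation: the function name `reindex_all` and the in-memory set standing in for "fragments not yet reindexed" are assumptions made for the example. Only the batch size of 500 comes from the article.

```python
from typing import List, Set

BATCH_SIZE = 500  # batch size stated in the article


def reindex_all(fragment_ids: List[int], batch_size: int = BATCH_SIZE) -> List[int]:
    """Simulate the reindex loop and return the size of each batch processed.

    Hypothetical model: 'pending' stands in for the fragments that still
    carry the old index settings.
    """
    pending: Set[int] = set(fragment_ids)
    batch_sizes: List[int] = []
    while True:
        # "Query" for the next fragments that have not been reindexed yet.
        batch = sorted(pending)[:batch_size]
        if not batch:
            break  # the query returned zero fragments, so reindexing is done
        # "Reinsert" each fragment with the new index settings
        # (modelled here as removing it from the pending set).
        pending.difference_update(batch)
        batch_sizes.append(len(batch))
    return batch_sizes


if __name__ == "__main__":
    print(reindex_all(list(range(1200))))  # [500, 500, 200]
```

With 1,200 fragments, the loop runs three batches and then stops when the query comes back empty. With no affected fragments, it stops at once, matching the "reindex zero fragments" case.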

Reindexing consumes additional disk space during the process itself. In particular, at any point in the reindexing process, the server can have up to three instances of a single fragment:

  • the original document (original indexes)
  • the updated document (new indexes)
  • the merged document (only if the stand where this document resides is currently being merged)

In a worst-case scenario, which is more likely towards the end of reindexing, the disk footprint of all the forests could be 3x the original size. For this reason, MarkLogic requires 3x disk space for reindexing. For example, if there are six 100GB forests, each forest would require 300GB available (totalling 1.8TB). This design choice ensures the integrity of the content and allows for zero downtime when reindexing.
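
The arithmetic in the example above can be checked with a short Python sketch. The helper name `required_disk_gb` is hypothetical. The 3x multiplier and the six 100GB forests are taken from the article.

```python
from typing import List


def required_disk_gb(forest_sizes_gb: List[float], multiplier: int = 3) -> List[float]:
    """Return the disk space each forest should have available for a reindex."""
    return [size * multiplier for size in forest_sizes_gb]


if __name__ == "__main__":
    forests = [100] * 6                    # six 100GB forests
    per_forest = required_disk_gb(forests)
    total_gb = sum(per_forest)
    print(per_forest[0], total_gb)         # 300 1800 (300GB per forest, 1.8TB in total)
```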

Performance Impact

Reindexing is a resource-intensive operation that uses both CPU and disk bandwidth. The CPU is busy parsing content and generating index entries, while the disk is reading fragments for reindexing, writing new stands, and running merges on those newly created stands. Expect a significant performance impact in environments that are already heavily utilized. You can reduce the impact of reindexing by lowering the reindexer throttle setting on the database configuration page. Reducing the value from 5 introduces a delay between the completion of one batch of 500 fragments and the query for the next 500 fragments.
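
To get a feel for the trade-off, the rough model below estimates how an inter-batch delay stretches total reindex time. It is a sketch under stated assumptions. The article does not say how long the delay is at each throttle level, so `seconds_per_batch` and `delay_between_batches` are purely illustrative inputs, and `estimate_reindex_seconds` is a hypothetical helper.

```python
import math


def estimate_reindex_seconds(
    total_fragments: int,
    seconds_per_batch: float,        # assumed time to reindex one batch
    delay_between_batches: float,    # assumed pause added by a lower throttle
    batch_size: int = 500,           # batch size stated in the article
) -> float:
    """Estimate total reindex time when a pause separates consecutive batches."""
    batches = math.ceil(total_fragments / batch_size)
    if batches == 0:
        return 0.0
    # A pause sits between batches, so there is one fewer pause than batches.
    return batches * seconds_per_batch + (batches - 1) * delay_between_batches


if __name__ == "__main__":
    # Illustrative numbers only: 1,000,000 fragments, 2 seconds per batch.
    print(estimate_reindex_seconds(1_000_000, 2.0, 0.0))  # 4000.0
    print(estimate_reindex_seconds(1_000_000, 2.0, 1.0))  # 5999.0
```

In this model, the pause lengthens the overall reindex but leaves CPU and disk free between batches for regular application traffic.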

Recommendations

Here are some recommendations when considering reindexing:

  • Plan to make multiple index changes at once to avoid reindexing multiple times
  • Disable reindexing (database configuration) to avoid accidentally forcing a reindex, only re-enabling it when reindexing is explicitly planned
  • Only enable reindexing (database configuration) during off-peak hours. The reindex will take longer to complete, but performance during peak hours will be better.
  • Check for 3x disk space before the reindex process begins
  • Ensure the environment has sufficient I/O bandwidth
  • Disable application/user access if you can afford the downtime, as this may improve overall reindex performance