Knowledgebase:
Indexing Best Practices
04 May 2020 03:10 PM


MarkLogic Server indexes records (or documents/fragments) on ingest. When a database's index configuration is changed, the server will consequently reindex all matching records.

Indexing and reindexing can be CPU- and I/O-intensive operations. Reindexing creates a lot of new fragments, and the original fragments are marked for deletion. These deleted fragments then need to be merged out. All of this activity can affect query performance, especially on systems with under-provisioned hardware.

Reindexing in Production

If you need to add or modify an index on a production cluster, consider scheduling the reindex during a time when your cluster is less busy. If your database is too large to completely reindex during a single period of low usage, consider running the reindex over several periods of time. For example, if your low usage period is during a weekend, the process may look like:

  • Change your index configuration on a Friday night
  • Let the reindex run for most of the weekend
  • To pause the reindex, set the reindexer-enable field to 'false' for the database being reindexed. Be sure to allow sufficient time for the associated merging to complete before system load comes back.
  • If needed, reindexing can continue over the next weekend - the reindexer process will pick up where it left off before it was disabled.
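
The pause and resume steps above can also be scripted rather than toggled by hand. The following is a minimal sketch, assuming the Management REST API is reachable on its default port 8002 and that the database property is exposed there as "reindexer-enable". The host name, the database name "Documents", and the helper function are hypothetical. The sketch only builds the request; sending it requires credentials for an administrative user, which are left out here.

```python
import json
import urllib.parse
import urllib.request


def build_reindexer_toggle(host, database, enable, port=8002):
    """Build (but do not send) a PUT request that sets reindexer-enable.

    Assumes the Management REST API database properties endpoint;
    host, port, and database are placeholders for your own cluster.
    """
    db = urllib.parse.quote(database, safe="")
    url = f"http://{host}:{port}/manage/v2/databases/{db}/properties"
    body = json.dumps({"reindexer-enable": enable}).encode("utf-8")
    return urllib.request.Request(
        url,
        data=body,
        method="PUT",
        headers={"Content-Type": "application/json"},
    )


# Sunday evening: pause the reindexer before load returns.
pause = build_reindexer_toggle("localhost", "Documents", enable=False)
# Next Friday night: let it pick up where it left off.
resume = build_reindexer_toggle("localhost", "Documents", enable=True)

print(pause.get_method(), pause.full_url, pause.data.decode())
```

Keeping the request construction separate from the sending step makes it easy to review exactly what a scheduled job would change before it runs against production.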

You can refer to https://help.marklogic.com/Knowledgebase/Article/View/18/15/how-reindexing-works-and-its-impact-on-performance for more details on invoking reindexing on production.

When You Have Database Replication Configured

If you have to add or modify indexes on a database which has database replication configured, make sure the same changes are made on the Replica cluster as well. Starting with MarkLogic Server version 9.0-7, index data is also replicated from the Master to the Replica, but the server does not automatically check whether both sides have the same index settings. Reindexing is disabled by default on a replica cluster. However, when the database replication configuration is removed (such as after a disaster), the replica database will reindex as necessary. So it is important that the Replica database index configuration matches the Master's to avoid unnecessary reindexing.
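
Because the server does not compare index settings between the two sides, a periodic check of your own can help. The following is a rough sketch that compares two database property documents, such as ones exported as JSON from each cluster. The property names in INDEX_KEYS are assumed to match the Management REST API, and the sample data is hypothetical and heavily simplified.

```python
import json

# Property names assumed to match the Management REST API's database
# properties; adjust the list to the index types you actually use.
INDEX_KEYS = (
    "range-element-index",
    "range-element-attribute-index",
    "range-path-index",
    "range-field-index",
    "field",
)


def _canonical(items):
    """Order-insensitive form of a list of index definitions."""
    return sorted(json.dumps(item, sort_keys=True) for item in items)


def index_settings_diff(master_props, replica_props, keys=INDEX_KEYS):
    """Return {key: (master_value, replica_value)} for settings that differ."""
    diff = {}
    for key in keys:
        master_value = master_props.get(key, [])
        replica_value = replica_props.get(key, [])
        if _canonical(master_value) != _canonical(replica_value):
            diff[key] = (master_value, replica_value)
    return diff


# Hypothetical, simplified property documents for illustration only.
master = {
    "range-element-index": [
        {"scalar-type": "string", "localname": "partition"},
        {"scalar-type": "date", "localname": "created"},
    ]
}
replica = {
    "range-element-index": [
        {"scalar-type": "string", "localname": "partition"},
    ]
}

mismatches = index_settings_diff(master, replica)
print(sorted(mismatches))
```

An empty result means the listed index settings match; any key that appears names a setting to reconcile before replication is ever removed.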

Further reading:

Master and Replica Database Index Settings

Database Replication - Indexing on Replica Explained

Avoid Unused Range Indexes, Fields, and Path Indexes

In addition to taking up extra disk space, Range, Field, and Path Indexes require extra work when it's time to reindex. Field and Path indexes may also require extra indexing passes.

Avoid Using Namespaces to Implement Multi-Tenancy

It's a common use case to want to create some kind of partition (or multiple partitions) between documents in a particular database. In such a scenario it's far better to 1) constrain the partitioning information to a particular element in a document (then include a clause over that element in your searches), than it is to 2) attempt to manage partitions via unique element namespaces corresponding to each partition. For example, given two documents in two different partitions, you'll want them to look like this:

1a. <doc><partition>partition1</partition><name>Joe Smith</name></doc>

1b. <doc><partition>partition2</partition><name>John Smith</name></doc>

...vs. something like this:

2a. <doc xmlns:p="http://partition1"><p:name>Joe Smith</p:name></doc>

2b. <doc xmlns:p="http://partition2"><p:name>John Smith</p:name></doc>

Why is #1 better? In terms of searching the data once it's indexed, there's not much of a difference: one could easily create searches to accommodate both approaches. The issue is how the indexing works in practice. MarkLogic Server indexes all content on ingest. In scenario #2, every time a new partition is created, a new range element index needs to be defined in the Admin UI, which means your index settings have changed, which means the server now needs to reindex all of your content, not just the documents corresponding to the newly introduced partition. In contrast, for scenario #1, all that would need to be done is to ingest the documents corresponding to the new partition, which would then be indexed just like all the other existing content.

Scenario #1 does require changing your searches, as they would not yet include a clause for the new partition (for example: cts:element-value-query(xs:QName("partition"), "partition2")). But changing searches is far less intrusive than reindexing your entire database, as scenario #2 requires. Note that scenario #2 would also require changing searches, in addition to the database-wide reindex.
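
The difference between the two scenarios can be illustrated outside the server. The sketch below uses Python's standard XML parser, not MarkLogic, to list the distinct element names in each set of sample documents. The point it tries to show is that scenario #1 keeps the same element names as partitions are added, while scenario #2 introduces a new qualified name for every partition.

```python
import xml.etree.ElementTree as ET

scenario_1 = [
    "<doc><partition>partition1</partition><name>Joe Smith</name></doc>",
    "<doc><partition>partition2</partition><name>John Smith</name></doc>",
]
scenario_2 = [
    '<doc xmlns:p="http://partition1"><p:name>Joe Smith</p:name></doc>',
    '<doc xmlns:p="http://partition2"><p:name>John Smith</p:name></doc>',
]


def element_names(docs):
    """Collect the distinct element names (with namespace) across documents."""
    names = set()
    for xml_text in docs:
        for elem in ET.fromstring(xml_text).iter():
            names.add(elem.tag)
    return names


names_1 = element_names(scenario_1)
names_2 = element_names(scenario_2)

# Scenario 1: one "partition" element and one "name" element, regardless
# of how many partitions exist.
print(sorted(names_1))
# Scenario 2: a separate namespaced "name" element per partition, so each
# new partition adds a name that existing index settings do not cover.
print(sorted(names_2))
```

In scenario #1 a single index on the partition element covers every current and future partition, which is why adding a partition leaves the index configuration untouched.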

Keep an Eye on I/O Throughput

Reindexing can lead to heavy merge activity and may lead to disk I/O bottlenecks if not managed carefully. If you have a system that is available 24-7 with no downtime window, then you may need to throttle the reindexer in order to keep the disk I/O to a minimum. We suggest the following database settings for reindexing a system that must always remain in use:

  • reindexer-throttle = 3
  • large-size-threshold = 1048576

You can also adjust the following group settings to help limit background I/O:

  • background-io-limit = 100

This limits background I/O for that group to 100 MB/sec per host, on every host in the group. Configure it only if merges are causing problems; it is a way of throttling the I/O used by the merging process. A value of 100 is a good starting point, and it may be increased in increments of 50 if you find that your merges are progressing too slowly. Proceed with caution: a background I/O limit that is too low can hurt performance or even have catastrophic consequences.
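
The suggested values can be kept together in a small script so they are applied consistently. This is a sketch only, assuming the settings are exposed through the Management REST API under the property names "reindexer-throttle", "large-size-threshold", and "background-io-limit". The function names are hypothetical, and nothing here contacts a server.

```python
import json


def throttle_payloads(reindexer_throttle=3,
                      large_size_threshold=1048576,
                      background_io_limit=100):
    """Return (database_props, group_props) using the values suggested above."""
    database_props = {
        "reindexer-throttle": reindexer_throttle,
        "large-size-threshold": large_size_threshold,
    }
    group_props = {"background-io-limit": background_io_limit}
    return database_props, group_props


def next_io_limit(current, step=50):
    """Raise the background I/O limit by one increment if merges lag."""
    return current + step


database_props, group_props = throttle_payloads()

# Assumed targets: the database payload would go to
# /manage/v2/databases/{database}/properties and the group payload to
# /manage/v2/groups/{group}/properties.
print(json.dumps(database_props))
print(json.dumps(group_props))
print(next_io_limit(group_props["background-io-limit"]))
```

Raising the limit one step at a time, and watching merge progress between steps, follows the cautious approach described above.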

General Recommendations

In general, your indexing/reindexing and subsequent search experience will be better if you
