Indexing Best Practices
01 November 2017 02:36 PM
Indexing in MarkLogic occurs when a document is added or updated. When you add a new index, the server runs an estimate of all the fragments that match and then proceeds to reload the matching URIs.
Indexing and reindexing can be CPU- and disk-IO-intensive. Reindexing creates a lot of new fragments and marks the original fragments for deletion; the deleted fragments are only cleaned up when the stands are merged. All of this activity can affect query performance, which leads to our first recommendation.
If you need to add or modify an index on a production cluster, consider scheduling the reindex for a time when your cluster is not busy. If your database is too large to reindex during a single period of low usage, consider running the reindex over several such periods. For example, if your low-usage period is the weekend, the process could look like:
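Pausing and resuming a reindex between low-usage windows can be scripted by toggling the database's "reindexer enable" setting. The following is a minimal sketch using the Admin API, not a prescribed procedure; the database name "Documents" is a placeholder assumption.

```xquery
xquery version "1.0-ml";

import module namespace admin = "http://marklogic.com/xdmp/admin"
  at "/MarkLogic/admin.xqy";

(: "Documents" is a placeholder database name; substitute your own. :)
let $config := admin:get-configuration()
let $db-id := xdmp:database("Documents")
(: fn:false() pauses the reindexer; pass fn:true() to resume it. :)
let $config := admin:database-set-reindexer-enable($config, $db-id, fn:false())
return admin:save-configuration($config)
```

Running the same script with fn:true() at the start of the next low-usage window lets the reindexer continue.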
Avoid Unused Range Indexes, Fields, and Path Indexes
In addition to taking up extra disk space, Range Indexes, Fields and Path Indexes require extra work when it's time to reindex. Field and Path indexes may require extra loading passes.
Avoid Using Namespaces to Implement Multi-Tenancy
It's a common use case to want to create some kind of partition (or multiple partitions) between documents in a particular database. In such a scenario it is far better to (1) keep the partitioning information in a particular element of each document, and include a clause over that element in your searches, than to (2) attempt to manage partitions via a unique element namespace for each partition. For example, given two documents in two different partitions, you'll want them to look like this:
1a. <doc><partition>partition1</partition><name>Joe Smith</name></doc>
1b. <doc><partition>partition2</partition><name>John Smith</name></doc>
...vs. something like this:
2a. <doc xmlns:p="http://partition1"><p:name>Joe Smith</p:name></doc>
2b. <doc xmlns:p="http://partition2"><p:name>John Smith</p:name></doc>
Why is #1 better? In terms of searching the data once it's indexed, there's actually not much of a difference: one could easily create searches to accommodate both approaches. The issue is how the indexing works in practice. MarkLogic Server indexes all content on ingest. In scenario 2, every time a new partition is created, a new range element index needs to be defined in the Admin UI, which means your index settings have changed, which means the server now needs to reindex all of your content, not just the documents corresponding to the newly introduced partition. In contrast, for scenario 1, all that would need to be done is to ingest the documents corresponding to the new partition, which would then be indexed just like all the other existing content. The searches in scenario 1 would need to change, as they would not yet include a clause to accommodate the new partition (for example: cts:element-value-query(xs:QName("partition"), "partition2")), but changing searches costs far less than reindexing your entire database, as scenario 2 would require. Note that scenario 2 would also need its searches changed, in addition to the database-wide reindex.
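As an illustration of the scenario 1 approach, the partition clause can be combined with the rest of a search in a cts:and-query. This is a sketch only; the word query on "Smith" is an assumed stand-in for whatever the application actually searches for.

```xquery
xquery version "1.0-ml";

(: Scenario 1: restrict a search to one partition by adding a clause
   over the <partition> element. The word query is illustrative only. :)
cts:search(
  fn:doc(),
  cts:and-query((
    cts:element-value-query(xs:QName("partition"), "partition2"),
    cts:word-query("Smith")
  ))
)
```

Supporting a new partition here means passing a different value to the partition clause; no index settings change.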
Keep an Eye on IO Throughput
Reindexing can lead to heavy merge activity and, if not managed carefully, to disk-IO bottlenecks. If your system must be available 24/7 with no downtime window, you may need to throttle the reindexer to keep the IO to a minimum. We suggest the following database settings for reindexing a system that must always remain in use:
You can also adjust the following group settings to help limit background IO:
This will limit the background IO for that group to 100 MB/sec. This is a good starting point, and may be increased in increments of 50 MB/sec if you find that your merges are progressing too slowly. Proceed with caution, as too low a background IO limit can hurt performance or even have catastrophic consequences.
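The background IO limit can also be set from a script instead of the Admin UI. The following is a minimal sketch using the Admin API, assuming the group is named "Default" (substitute your own group name); the value 100 is the starting point mentioned above.

```xquery
xquery version "1.0-ml";

import module namespace admin = "http://marklogic.com/xdmp/admin"
  at "/MarkLogic/admin.xqy";

(: "Default" is an assumed group name; substitute your own. :)
let $config := admin:get-configuration()
let $group-id := admin:group-get-id($config, "Default")
(: 100 is the starting value suggested in the text. :)
let $config := admin:group-set-background-io-limit($config, $group-id, 100)
return admin:save-configuration($config)
```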
In general, your indexing/reindexing and subsequent search experience will be better if you