Indexing Best Practices | MarkLogic Support

Indexing Best Practices 04 May 2020 03:10 PM
MarkLogic Server indexes records (or documents/fragments) on ingest. When a database's index configuration is changed, the server will consequently reindex all matching records.

Indexing and reindexing can be a CPU and I/O intensive operation. Reindexing creates a lot of new fragments, with the original fragments being marked for deletion. These deleted fragments will then need to be merged out. All of this activity can potentially affect query performance, especially in systems with under-provisioned hardware.

Reindexing in Production

If you need to add or modify an index on a production cluster, consider scheduling the reindex during a time when your cluster is less busy. If your database is too large to completely reindex during a single period of low usage, consider running the reindex over several periods of time. For example, if your low usage period is during a weekend, the process may look like:

- Change your index configuration on a Friday night
- Let the reindex run for most of the weekend
- To pause the reindex, set the reindexer-enable field to 'false' for the database being reindexed. Be sure to allow sufficient time for the associated merging to complete before system load comes back.

If needed, reindexing can continue over the next weekend - the reindexer process will pick up where it left off before it was disabled.

You can refer to https://help.marklogic.com/Knowledgebase/Article/View/18/15/how-reindexing-works-and-its-impact-on-performance for more details on invoking reindexing on production.

When you have Database Replication Configured

If you have to add or modify indexes on a database which has database replication configured, make sure the same changes are made on the Replica cluster as well. Starting with ML server version 9.0-7, index data is also replicated from the Master to the Replica, but it does not automatically check if both sides have the same index settings. Reindexing is disabled by default on a replica cluster.
However, when database replication configuration is removed (such as after a disaster), the replica database will reindex as necessary. So it is important that the Replica database index configuration matches the Master’s to avoid unnecessary reindexing.

Further reading:

- Master and Replica Database Index Settings
- Database Replication - Indexing on Replica Explained

Avoid Unused Range Indexes, Fields, and Path Indexes

In addition to taking up extra disk space, Range, Field, and Path Indexes require extra work when it's time to reindex. Field and Path indexes may also require extra indexing passes.

Avoid Using Namespaces to Implement Multi-Tenancy

It's a common use case to want to create some kind of partition (or multiple partitions) between documents in a particular database. In such a scenario it's far better to 1) constrain the partitioning information to a particular element in a document (then include a clause over that element in your searches), than it is to 2) attempt to manage partitions via unique element namespaces corresponding to each partition. For example, given two documents in two different partitions, you'll want them to look like this:

1a. <doc><partition>partition1</partition><name>Joe Smith</name></doc>
1b. <doc><partition>partition2</partition><name>John Smith</name></doc>

...vs. something like this:

2a. <doc xmlns:p="http://partition1"><p:name>Joe Smith</p:name></doc>
2b. <doc xmlns:p="http://partition2"><p:name>John Smith</p:name></doc>

Why is #1 better? In terms of searching the data once it's indexed, there's actually not much of a difference - one could easily create searches to accommodate both approaches. The issue is how the indexing works in practice. MarkLogic Server indexes all content on ingest.
In scenario #2, every time a new partition is created, a new range element index needs to be defined in the Admin UI, which means your index settings have changed, which means the server now needs to reindex all of your content - not just the documents corresponding to the newly introduced partition. In contrast, for scenario #1, all that would need to be done is to ingest the documents corresponding to the new partition, which would then be indexed just like all the other existing content. The searches in scenario #1 would need to change, as they would not yet include a clause for the new partition (for example: cts:element-value-query(xs:QName("partition"), "partition2")), but changing the searches is far less intrusive than reindexing your entire database, as would be required in scenario #2. Note that in scenario #2 the searches would also need to change, in addition to the database-wide reindex.

Keep an Eye on I/O Throughput

Reindexing can lead to heavy merge activity and may lead to disk I/O bottlenecks if not managed carefully. If you have a system that is available 24-7 with no downtime window, then you may need to throttle the reindexer in order to keep the disk I/O to a minimum. We suggest the following database settings for reindexing a system that must always remain in use:

reindexer-throttle = 3
large-size-threshold = 1048576

You can also adjust the following group setting to help limit background I/O:

background-io-limit = 100

This will limit the background I/O for that group to 100 MB/sec per host across all hosts in that group. This should only be configured if merges are causing problems; it is a way of throttling back the I/O used by the merging process. A value of 100 is a good starting point, and may be increased in increments of 50 if you find that your merges are progressing too slowly.
Proceed with caution, as a background I/O limit that is too low can hurt performance or even have catastrophic consequences.

General Recommendations

In general, your indexing/reindexing and subsequent search experience will be better if you:

- Think of documents as rows in a relational system, not as tables (refer to https://developer.marklogic.com/learn/data-model/ for more details).
- Ensure that a document-unique property that requires a range index has a unique QName. Be aware of cases where a range index can match more than one thing per document (for example, ingest date and edit date could both be represented by xs:date). This avoids a required index having many more entries, and using up RAM unnecessarily for entries that are never sorted or accessed from a lexicon. (Refer to https://docs.marklogic.com/guide/admin/range_index#id_93351 for more details on range indexes.)
- Store dates and times as Schema datatypes.
- Avoid huge fragment sizes (refer to https://docs.marklogic.com/guide/admin/fragments for further details).
- Avoid fragmentation via fragment roots, unless absolutely necessary. See the KB article https://help.marklogic.com/Knowledgebase/Article/View/529/0/search-and-fragmentation and https://www.marklogic.com/blog/fragment-ed-thoughts/ for more details on fragmentation.
- Avoid indiscriminately using properties fragments (which doubles the fragment count). Use them only when you need them, for example for CPF.
- Avoid leaving directory creation on if not needed (it is needed for WebDAV, but not for directory-query).
- Avoid leaving last-modified on if not needed.
- Avoid using a schema that defines elements in no-namespace (which will leak to all other non-namespaced elements).
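
The reindexer and I/O settings named in this article (reindexer-enable, reindexer-throttle, large-size-threshold, background-io-limit) can also be changed by script rather than through the Admin UI. The following is a minimal Python sketch, not an official procedure. It only builds the request URLs and JSON bodies, and it assumes that the Management REST API is reachable on port 8002, that database and group properties are updated at /manage/v2/databases/{name}/properties and /manage/v2/groups/{name}/properties, and that the JSON property names match the setting names used above. The host, database, and group names and all function names are hypothetical; verify the endpoints and property names against the documentation for your server version before use.

```python
import json
from urllib.parse import quote

# Assumption: the Management REST API listens on port 8002.
MANAGE_PORT = 8002


def properties_url(host: str, kind: str, name: str) -> str:
    """Build the assumed properties endpoint; kind is 'databases' or 'groups'."""
    if kind not in ("databases", "groups"):
        raise ValueError("kind must be 'databases' or 'groups'")
    # Percent-encode the resource name so spaces and slashes are safe in the path.
    return (
        f"http://{host}:{MANAGE_PORT}/manage/v2/{kind}/"
        f"{quote(name, safe='')}/properties"
    )


def reindexer_enable_payload(enabled: bool) -> str:
    """Body that pauses (False) or resumes (True) the reindexer for a database."""
    return json.dumps({"reindexer-enable": enabled})


def throttled_reindex_payload() -> str:
    """Database settings suggested above for a system that must stay in use."""
    return json.dumps({"reindexer-throttle": 3, "large-size-threshold": 1048576})


def background_io_payload(limit_mb_per_sec: int = 100) -> str:
    """Group setting that caps background I/O, in MB/sec per host."""
    return json.dumps({"background-io-limit": limit_mb_per_sec})


def next_background_io_limit(current: int, step: int = 50) -> int:
    """Raise the limit by one step when merges are progressing too slowly."""
    return current + step


if __name__ == "__main__":
    # Hypothetical host, database, and group names, for illustration only.
    print("PUT", properties_url("localhost", "databases", "Documents"))
    print(reindexer_enable_payload(False))
    print(throttled_reindex_payload())
    print("PUT", properties_url("localhost", "groups", "Default"))
    print(background_io_payload())
```

Each body would be sent as an authenticated PUT request with a Content-Type of application/json, using whatever HTTP client you prefer. Keeping the payload construction separate from the network call makes it easy to review exactly which settings a maintenance script will change before it touches a production cluster.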