Indexing Best Practices
01 November 2017 02:36 PM

Indexing in MarkLogic occurs when a document is added or updated. When you add a new index, the server first estimates which fragments match the new index and then proceeds to reload the URIs that match.

Indexing and reindexing can be CPU- and disk-IO-intensive operations. Reindexing creates a lot of new fragments, and the original fragments are marked for deletion; these then need to be merged. All of this activity can affect query performance, which leads to our first recommendation.

Reindexing Production

If you need to add or modify an index on a production cluster, consider scheduling the reindex for a time when your cluster is not busy. If your database is too large to reindex during a single period of low usage, consider running the reindex over several periods. For example, if your low-usage period is the weekend, the process could look like this:

  • Change your index configuration on a Friday night
  • Let it run for most of the weekend
  • Set the reindexer enable setting to 'false' for the database being reindexed. Be sure to disable reindexing well before your cluster begins to see heavy usage, so that the associated merging has time to complete.
  • If needed, reindexing can continue over the next weekend. The reindexer process will pick up where it left off before it was disabled.
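
The pause step above can also be scripted instead of being set through the Admin UI. The following is a minimal sketch using the Admin API; the database name "Documents" is a placeholder assumption, so substitute your own.

```xquery
xquery version "1.0-ml";

import module namespace admin = "http://marklogic.com/xdmp/admin"
    at "/MarkLogic/admin.xqy";

(: "Documents" is a placeholder database name, not taken from this article :)
let $config := admin:get-configuration()
let $db := xdmp:database("Documents")
(: fn:false() pauses the reindexer; pass fn:true() to resume it :)
let $config := admin:database-set-reindexer-enable($config, $db, fn:false())
return admin:save-configuration($config)
```

Running the same script with fn:true() at the start of the next low-usage period lets the reindexer resume.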

Avoid Unused Range Indexes, Fields, and Path Indexes

In addition to taking up extra disk space, Range Indexes, Fields and Path Indexes require extra work when it's time to reindex. Field and Path indexes may require extra loading passes.
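
As a starting point for finding unused indexes, you can list what is currently configured on a database. This is a sketch only: the database name "Documents" is a placeholder assumption, and the listing does not tell you whether an index is used; that still has to be checked against your queries.

```xquery
xquery version "1.0-ml";

import module namespace admin = "http://marklogic.com/xdmp/admin"
    at "/MarkLogic/admin.xqy";

(: "Documents" is a placeholder database name :)
let $config := admin:get-configuration()
let $db := xdmp:database("Documents")
return (
  (: each call returns the configuration elements for one index type :)
  admin:database-get-range-element-indexes($config, $db),
  admin:database-get-range-path-indexes($config, $db),
  admin:database-get-fields($config, $db)
)
```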

Avoid Using Namespaces to Implement Multi-Tenancy

It's a common use case to want to create some kind of partition (or multiple partitions) between documents in a particular database. In such a scenario it's far better to 1) constrain the partitioning information to a particular element in a document (then include a clause over that element in your searches), than it is to 2) attempt to manage partitions via unique element namespaces corresponding to each partition. For example, given two documents in two different partitions, you'll want them to look like this:

1a. <doc><partition>partition1</partition><name>Joe Smith</name></doc>

1b. <doc><partition>partition2</partition><name>John Smith</name></doc>

...vs. something like this:

2a. <doc xmlns:p="http://partition1"><p:name>Joe Smith</p:name></doc>

2b. <doc xmlns:p="http://partition2"><p:name>John Smith</p:name></doc>

Why is #1 better? In terms of searching the data once it's indexed, there's not much of a difference: one could easily write searches to accommodate either approach. The issue is how the indexing works in practice. MarkLogic Server indexes all content on ingest. In scenario 2, every time a new partition is created, a new range element index needs to be defined in the Admin UI. That changes your index settings, so the server must reindex all of your content, not just the documents belonging to the newly introduced partition. In scenario 1, by contrast, you only need to ingest the documents for the new partition, and they are indexed just like all the other existing content.

Searches in scenario 1 do need to change, since they would not yet include a clause for the new partition (for example: cts:element-value-query(xs:QName("partition"), "partition2")). Still, changing those searches costs far less than reindexing your entire database, as scenario 2 requires. Note that scenario 2 would need its searches changed as well, in addition to the database-wide reindex.
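
To illustrate the scenario 1 approach, here is a hedged sketch of a search that restricts results to one partition by adding a clause over the partition element. The word query on name is an illustrative assumption and stands in for whatever your application actually searches on.

```xquery
xquery version "1.0-ml";

(: Restrict the search to a single partition with an extra clause :)
cts:search(
  fn:doc(),
  cts:and-query((
    (: partition clause, as in the example query in the text :)
    cts:element-value-query(xs:QName("partition"), "partition2"),
    (: illustrative application clause; an assumption, not from the article :)
    cts:element-word-query(xs:QName("name"), "Smith")
  ))
)
```

Of the two scenario 1 documents shown earlier, only 1b is in partition2, so only it can satisfy both clauses. Supporting a new partition means changing the value in the first clause, with no index configuration change.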

Keep an Eye on IO Throughput

Reindexing can lead to heavy merge activity and may lead to disk-IO bottlenecks if not managed carefully. If your system must be available 24/7 with no downtime window, you may need to throttle the reindexer to keep its IO to a minimum. We suggest the following database settings for reindexing a system that must always remain in use:

  • reindexer-throttle = 3
  • large-size-threshold = 1048576
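
These two database settings can be applied with the Admin API. The sketch below assumes a placeholder database name, "Documents"; the values are the ones suggested above.

```xquery
xquery version "1.0-ml";

import module namespace admin = "http://marklogic.com/xdmp/admin"
    at "/MarkLogic/admin.xqy";

(: "Documents" is a placeholder database name :)
let $config := admin:get-configuration()
let $db := xdmp:database("Documents")
(: values are cast explicitly to the unsigned integer type the setters expect :)
let $config := admin:database-set-reindexer-throttle($config, $db, xs:unsignedInt(3))
let $config := admin:database-set-large-size-threshold($config, $db, xs:unsignedInt(1048576))
return admin:save-configuration($config)
```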

You can also adjust the following group settings to help limit background IO:

  • background-io-limit = 100

This will limit the background IO for that group to 100 MB per second. This is a good starting point, and it may be increased in increments of 50 if you find that your merges are progressing too slowly. Proceed with caution, as a background IO limit that is too low can degrade performance or even have catastrophic consequences.
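
The group setting can be scripted in the same way. This sketch assumes the group is named "Default"; use your own group name if it differs.

```xquery
xquery version "1.0-ml";

import module namespace admin = "http://marklogic.com/xdmp/admin"
    at "/MarkLogic/admin.xqy";

(: "Default" is an assumed group name :)
let $config := admin:get-configuration()
let $group := admin:group-get-id($config, "Default")
(: start at 100; raise in steps of 50 if merges fall behind :)
let $config := admin:group-set-background-io-limit($config, $group, xs:unsignedInt(100))
return admin:save-configuration($config)
```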

In general, your indexing/reindexing and subsequent search experience will be better if you

  • Think of documents as more like rows in a relational system, not like tables
  • Ensure that each document-unique property that requires a range index has a unique QName. Be aware of cases where a range index can match more than one thing per document (e.g. ingest date and edit date, both represented by xs:date). A unique QName keeps a required index from holding many more entries than needed and using up RAM for entries that are never sorted or accessed from a lexicon.
  • Store dates and times as XML Schema datatypes
  • Avoid huge fragment sizes (tens of MB and up, depending on queries)
  • Avoid fragmentation via fragment roots, unless absolutely necessary
  • Avoid indiscriminately using properties fragments (which doubles the fragment count). Use them only when you need them, for CPF for example.
  • Avoid leaving directory creation on if not needed (needed for WebDAV, but not for directory-query)
  • Avoid leaving last-modified on if not needed
  • Avoid using a schema that defines elements in no-namespace (which will leak to all other non-namespaced elements)
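
Two of the items above, directory creation and last-modified, are database settings that can be turned off with the Admin API. This is a sketch under assumptions: "Documents" is a placeholder database name, and you should confirm that nothing in your application (WebDAV, for example) depends on these settings before changing them.

```xquery
xquery version "1.0-ml";

import module namespace admin = "http://marklogic.com/xdmp/admin"
    at "/MarkLogic/admin.xqy";

(: "Documents" is a placeholder database name :)
let $config := admin:get-configuration()
let $db := xdmp:database("Documents")
(: "manual" turns off automatic directory creation :)
let $config := admin:database-set-directory-creation($config, $db, "manual")
(: stop maintaining the last-modified property on documents :)
let $config := admin:database-set-maintain-last-modified($config, $db, fn:false())
return admin:save-configuration($config)
```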