When should I look into query or data model tuning?
26 January 2016 03:07 PM
MarkLogic Server can ingest and query many kinds of data, including XML, text, JSON, binary, and generic documents, and it can do so with both great speed and scale. This article contains a simple rule of thumb to determine when you should do some query or data model tuning to take advantage of the speed and scale that MarkLogic Server can deliver.
A very common development practice is to build your application against a smaller subset of your data, then to deploy that application against your entire production dataset. That practice works when using MarkLogic Server, too. However, do keep in mind that MarkLogic Server was engineered to deliver sub-second response time across very large amounts of data. In general, if your queries against MarkLogic Server are taking multiple seconds to return results when running against your development data subset, then it's very, very likely those same queries will take tens of seconds or even fail to return at all due to timing out when run against your larger production dataset.
When building your application, if you're seeing queries that take significantly longer than one second to return, you should absolutely begin the effort to optimize the relevant query, your data model, or both, especially if runtime increases as the amount of data increases.
1) For query tuning, consider the following:
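As a rough sketch of the kind of query in question (the collection name and starting value are hypothetical, chosen only for illustration, and the exact shape of the expression is an assumption), an iterate-and-evaluate approach might look like this:

```xquery
xquery version "1.0-ml";

(: Hypothetical values, for illustration only :)
declare variable $collection as xs:string := "/example/records";
declare variable $start-rowkey as xs:int := 1000;

(: Every document in the collection is fetched, then tested one at a time :)
for $doc in fn:collection($collection)
where $doc//rowkey > $start-rowkey
return $doc
```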
This code needs to step through a result set made up of every document in a given collection and evaluate whether each of those documents has a rowkey value greater than a given value. The runtime of the query as written will consequently increase as the number of documents, and therefore the number of evaluations, increases.
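A revised query along the following lines lets the indexes do the work. This is a minimal sketch rather than a definitive implementation: the collection name and starting value are hypothetical, and it assumes an element range index of type int already exists on rowkey.

```xquery
xquery version "1.0-ml";

(: Hypothetical values, for illustration only :)
declare variable $collection as xs:string := "/example/records";
declare variable $start-rowkey as xs:int := 1000;

(: Both constraints are resolved from the indexes. Assumes an element
   range index of type int is configured on the rowkey element. :)
cts:search(
  fn:doc(),
  cts:and-query((
    cts:collection-query($collection),
    cts:element-range-query(xs:QName("rowkey"), ">", $start-rowkey)
  )),
  "unfiltered"
)
```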
Here, instead of iterating over a result set made up of every document in a given collection and evaluating each document in that result set against a given criterion, the cts:search returns a result set composed of only the documents that are both in the given collection and match the supplied query terms (in this case, rowkey > $start-rowkey). Note that you'll also need to define an element range index of type int on the rowkey element to take advantage of the much faster index resolution instead of iteration/evaluation; otherwise this query will return the error XDMP-ELEMRIDXNOTFOUND.
In addition to avoiding overly large result sets via query terms, you'll also want to consider the kind of query you'll want to run and what that means in terms of your data model. It's possible to run queries either filtered or unfiltered (note the presence of the "unfiltered" option in the query revision above). While you can run your queries filtered (where the slower filtering pass removes any false positives returned during the faster unfiltered index resolution phase of your search), for maximum performance you'll want to construct your data model in such a way that unfiltered queries return accurate results without the need for a filtering pass. This leads us to:
2) Data model tuning - see our Best practices around data modeling and data loading Knowledge Base article, as well as the "XML and JSON Data Modeling Best Practices" on-demand MarkLogic University course, available here.
There's much, much more information in our Query Tuning and Performance Guide documentation. Additionally, to see how a given expression will be processed, you'll want to make use of xdmp:plan. To optimize query performance, you'll want to make use of xdmp:query-meters and xdmp:query-trace.
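As one possible way to combine these functions in Query Console, the sketch below turns on tracing, prints the plan for a search, runs the search, and then reports the meters for the request. The rowkey query and its value are hypothetical and assume an int range index on rowkey; substitute the expression you are actually tuning.

```xquery
xquery version "1.0-ml";

(: Hypothetical query, for illustration only; assumes an element
   range index of type int on rowkey :)
declare variable $query :=
  cts:element-range-query(xs:QName("rowkey"), ">", xs:int(1000));

(
  (: Write trace events for the rest of this request to the error log :)
  xdmp:query-trace(fn:true()),
  (: Show how the search expression will be resolved against the indexes :)
  xdmp:plan(cts:search(fn:doc(), $query, "unfiltered")),
  (: Run the search itself so there is work to measure :)
  fn:count(cts:search(fn:doc(), $query, "unfiltered")),
  (: Report timing and cache statistics gathered so far in this request :)
  xdmp:query-meters()
)
```

Comparing the meters before and after a change to the query or data model gives a simple way to check whether the change actually helped.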