Knowledgebase:
Search and Fragmentation
16 April 2018 05:36 PM

Summary

This article explores fragmentation policy decisions for a MarkLogic database, and how search results may be influenced by your fragmentation settings.

Discussion

Fragments versus Documents

Consider the below example.

1) Load 20 test documents in your database by running

let $doc := <test>{
for $i in 1 to 20 return <node>foo {$i}</node>
}</test>
for $i in 1 to 20
return xdmp:document-insert ('/'||$i||'.xml', $doc)

Each of the 20 documents will have a structure like so:

<test>
    <node>foo 1</node>
    <node>foo 2</node>
           .
           .
           .
    <node>foo 20</node>
</test>

2) Observe the database status: 20 documents and 20 fragments.

3) Create a fragment root on 'node' and allow the database to reindex.

4) Observe the database status: 20 documents and 420 fragments. There are now 400 extra fragments for the 'node' elements.

We will use the data with fragmentation in the examples below.


Fragments and cts:search counts

Searches in MarkLogic work against fragments (not documents). In fact, MarkLogic indexes, retrieves, and stores everything as fragments.

While the terms fragments and documents are often used interchangeably, all the search-related operations happen at fragment level. Without any fragmentation policy defined, one fragment is the same as one document. However, with a fragmentation policy defined (e.g., a fragment root), the picture changes. Every fragment acts as its own self-contained unit and is the unit of indexing. A term list doesn't truly reference documents; it references fragments. The filtering and retrieval process doesn't actually load documents; it loads fragments. This means a single document can be split internally into multiple fragments but they are accessed by a single URI for the document.

Since the indexes only work at the fragment level, operations that work at the level of indexing can only know about fragments.

Thus, xdmp:estimate returns the number of matching fragments:

xdmp:estimate (cts:search (/, 'foo')) (: returns 400 :)

while fn:count counts the actual number of items in the returned sequence:

fn:count (cts:search (/, 'foo')) (: returns 20 :)


Fragments and search:search counts

When using search:search, "... the total attribute is an estimate, based on the index resolution of the query, and it is not filtered for accuracy." This can be seen since


import module namespace search = "http://marklogic.com/appservices/search" at "/MarkLogic/appservices/search/search.xqy";
search:search("foo",
<options xmlns="http://marklogic.com/appservices/search">
<transform-results apply="empty-snippet"/>
</options>
)

returns

<search:response snippet-format="empty-snippet" total="400" start="1" page-length="10" xmlns:search="http://marklogic.com/appservices/search">
<search:result index="1" uri="/3.xml" path="fn:doc(&quot;/3.xml&quot;)" score="2048" confidence="0.09590387" fitness="1">
<search:snippet/>
</search:result>
<search:result index="2" uri="/5.xml" path="fn:doc(&quot;/5.xml&quot;)" score="2048" confidence="0.09590387" fitness="1">
<search:snippet/>
</search:result>
.
.
.
<search:result index="10" uri="/2.xml" path="fn:doc(&quot;/2.xml&quot;)" score="2048" confidence="0.09590387" fitness="1">
<search:snippet/>
</search:result>


Notice that the total attribute gives the estimate of the results, starting from the first result in the page, similar to the xdmp:estimate result above, and is based on unfiltered index (fragment-level) information. Thus the value of 400 is returned.

When using search:search:

  • Each result in the report provided by the Search API reflects a document -- not a fragment. That is, the units in the Search API are documents. For instance, the report above has 10 results/documents.
  • Search has to estimate the number of result documents based on the indexes.
  • Indexes are based on fragments and not documents.
  • If no filtering is required to produce an accurate result set and if each fragment is a separate document, the document estimate based on the indexes will be accurate.
  • If filtering is required or if documents aggregate multiple matching fragments, the estimate will be inaccurate. The only way to get an accurate document total in these cases would be to retrieve each document, which would not scale.

Fragmentation and relevance

Fragmentation also has an effect on relevance.  See Fragments.


Should I use fragmentation?

Fragmentation can be useful at times, but generally it should not be used unless you are sure you need it and understand all the tradeoffs. Alternatively, you can break your document into subdocuments instead. In general, the search API is designed to work better without fragmentation in play.

(2 vote(s))
Helpful
Not helpful

Comments (0)