Knowledgebase:
Stemming and element-value-query
28 November 2017 11:59 AM

Introduction

Stemming is handled differently by a word-query and a value-query: a value-query is indexed using basic stemming only, even on a database configured for advanced stemming.

Discussion

A word may have more than one stem. For example,

cts:stem ('placing')

returns

place
placing

To see how this works with a word-query we can use xdmp:plan. Running

xdmp:plan (cts:search (/, cts:word-query ('placing')))

on a database with basic stemming returns

<qry:final-plan>
<qry:and-query>
<qry:term-query weight="1">
<qry:key>17061320528361807541</qry:key>
<qry:annotation>word("placing")</qry:annotation>
</qry:term-query>
</qry:and-query>
</qry:final-plan>

Since basic stemming uses only the first/shortest stem, this is searching just for the stem 'place'.

Searching with

cts:search (/, cts:word-query ('placing'))

will match 'a place of my own' ('placing' and 'place' both stem to 'place') but not 'new placings' ('placings' stems to just 'placing').

However, on a database with advanced stemming the plan is

<qry:final-plan>
<qry:and-query>
<qry:or-two-queries>
<qry:term-query weight="1">
<qry:key>17061320528361807541</qry:key>
<qry:annotation>word("placing")</qry:annotation>
</qry:term-query>
<qry:term-query weight="1">
<qry:key>17769756368104569500</qry:key>
<qry:annotation>word("placing")</qry:annotation>
</qry:term-query>
</qry:or-two-queries>
</qry:and-query>
</qry:final-plan>

Here you can see that there are two term queries OR-ed together (note the two different key values). The result is that the same cts:word-query('placing') now also matches 'new placings' because it queries using both stems for 'placing' ('place' and 'placing') and so matches the stemmed version of 'placings' ('placing').

In contrast, xdmp:plan for a search with

cts:element-value-query(xs:QName('title'), 'new placing')

returns

<qry:final-plan>
<qry:and-query>
<qry:term-query weight="1">
<qry:key>10377808623468699463</qry:key>
<qry:annotation>element(title,value("new","placing"))</qry:annotation>
</qry:term-query>
</qry:and-query>
</qry:final-plan>

whether the database has basic or advanced stemming, showing that multiple stems are not used.

The reason for this is that MarkLogic only does basic stemming when indexing the keys for a value, so there is a single key for each value. If MarkLogic Server were designed to support multiple stems for values (which it does not), this would expand the indexes dramatically and slow down indexing, merging, and querying. Consider a value of N words: if each word had two stems, there would be 2^N keys for that one value. With additional stems per word the base grows as well, so the number of keys would increase even faster.
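The counting argument above can be written out as a short worked calculation. The symbols are introduced here only for illustration and do not come from MarkLogic: s_i is the number of stems of the i-th word in a value, and K is the number of keys that would be needed if every combination of stems were indexed.

```latex
K = \prod_{i=1}^{N} s_i
\qquad\text{so}\qquad
s_i = 1 \;\Rightarrow\; K = 1,
\qquad
s_i = 2 \;\Rightarrow\; K = 2^{N}
```

For example, under the assumption of two stems per word, a five-word title would need 2^5 = 32 keys rather than the single key used with basic stemming, and a ten-word title would need 2^10 = 1024.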

More information on value-queries is available at Understanding Search: value queries.

 
