Knowledgebase:
Relevance Scores and Stemmed Searches
12 March 2018 06:36 PM

Stemming:

MarkLogic Server supports stemming in English and other languages. If stemmed searches are enabled in the database configuration, MarkLogic Server automatically searches for words that share a stem with the word specified in the query, not just for the exact string. A stemmed search for a word therefore finds the exact term as well as terms that derive from the same stem, with the same meaning and part of speech as the search term.

For example, in a stemmed search, a query for 'running' matches 'running', 'run' and 'ran', because all three stem to 'run'. The query term is itself stemmed before the query is resolved, so queries for 'running' and for 'ran' are both executed as queries for 'run', and they return similar results.
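To check which stem a given word form reduces to, the built-in cts:stem function can be called directly. The following is a minimal sketch, assuming English content (the "en" language code is an assumption, not something stated in this article); based on the description above, each of the three word forms would be expected to report the stem 'run'.

```xquery
xquery version "1.0-ml";

(: Sketch: list the stem(s) MarkLogic computes for each word form.
   The language code "en" is an assumption; use the language of your content. :)
for $word in ("running", "run", "ran")
return element stems {
  attribute word { $word },
  (: cts:stem can return more than one stem, so join them for display :)
  fn:string-join(cts:stem($word, "en"), ", ")
}
```

Words that share a stem are the ones a stemmed search treats as equivalent.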

 

Relevance score for stemmed searches:

 

Search results in MarkLogic Server are returned in relevance order: the result most relevant to the cts:query expression in the search is the first item in the returned sequence, and the least relevant is the last. (The documentation at http://docs.marklogic.com/guide/search-dev/relevance#chapter describes in detail how the relevance score is computed.)

However, in a stemmed search, exact matches of the original query term and matches of its other stemmed forms are ranked equally. That is, an exact match of the word does not receive a higher relevance score.

 

For example, consider the following three documents:

 

run.xml

<root>
  <id>001</id>
  <text>run out of time</text>
</root>

running.xml

<root>
  <id>002</id>
  <text>running out of time</text>
</root>

ran.xml

<root>
  <id>003</id>
  <text>ran out of time</text>
</root>

 

The following search query for "running" returns all three documents with the same score:

 

let $query := cts:word-query("running")
for $hit in cts:search(doc(), $query, "relevance-trace")
return element hit {
  attribute score { cts:score($hit) },
  xdmp:node-uri($hit)
}

 

==>

<hit score="2048">run.xml</hit>
<hit score="2048">running.xml</hit>
<hit score="2048">ran.xml</hit>
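The query above passes the "relevance-trace" option but does not read the trace. As a hedged sketch of how the trace could be inspected, the variant below adds the output of cts:relevance-info to each hit; the uri attribute is an illustrative addition, and the exact contents of the trace depend on the server version and database configuration.

```xquery
xquery version "1.0-ml";

(: Sketch: same stemmed query as above, with the relevance trace included
   so the contribution of each term to the score can be examined. :)
let $query := cts:word-query("running")
for $hit in cts:search(fn:doc(), $query, "relevance-trace")
return element hit {
  attribute score { cts:score($hit) },
  attribute uri { xdmp:node-uri($hit) },
  (: Only populated when "relevance-trace" is passed to cts:search :)
  cts:relevance-info($hit)
}
```

Collecting the trace adds overhead, so it is usually reserved for debugging rather than production queries.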

Ranking all stemmed matches equally is desirable in most search applications. However, to give the original query term a higher score, so that exact matches come first in the search results, combine a stemmed and an unstemmed word-query in an or-query:

let $query :=
  cts:or-query((
    cts:word-query("running", "stemmed"),
    cts:word-query("running", "unstemmed")
  ))
for $hit in cts:search(doc(), $query)
return element hit {
  attribute score { cts:score($hit) },
  xdmp:node-uri($hit)
}

 

==>

<hit score="11264">running.xml</hit>
<hit score="1024">run.xml</hit>
<hit score="1024">ran.xml</hit>

Note that the above cts:or-query requires the 'word searches' option to be enabled for the database; otherwise the query fails with an XDMP-WORDSEARCH error.
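The 'word searches' setting can be changed in the Admin Interface or, as sketched below, with the Admin API. This is an illustrative example rather than a prescribed procedure: it assumes the query runs against the database to be changed (xdmp:database() returns the current database), that the user has the required administrative privileges, and that any reindexing triggered by the change is acceptable.

```xquery
xquery version "1.0-ml";

import module namespace admin = "http://marklogic.com/xdmp/admin"
  at "/MarkLogic/admin.xqy";

(: Sketch: enable 'word searches' on the current database if it is off.
   Targeting the current database is an assumption made for illustration. :)
let $config := admin:get-configuration()
let $db := xdmp:database()
return
  if (admin:database-get-word-searches($config, $db))
  then "word searches already enabled"
  else (
    admin:save-configuration(
      admin:database-set-word-searches($config, $db, fn:true())
    ),
    "word searches enabled; existing content may need to be reindexed"
  )
```

Until the index change has taken effect, unstemmed queries may still fail or return incomplete results.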
