Knowledgebase: MarkLogic Server
Case Sensitive Search with Stemming
04 June 2012 10:38 AM

Summary

Stemming in MarkLogic Server is a case-sensitive operation.

Stemmed, Case Insensitive

When you run a stemmed, case-insensitive search, MarkLogic maps every word to lowercase and then calculates the stems.

In English, this works fairly well because words are generally lowercase. For other languages, such as German, where nouns are capitalized, this doesn't always work as well.
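
The lowercase-then-stem behavior can be approximated by hand with the built-in cts:stem function. This is only a sketch of the idea, not the server's internal implementation; the sample words are illustrative, and the stems returned depend on your MarkLogic version and language configuration.

```xquery
xquery version "1.0-ml";

(: Approximate the case-insensitive path: lowercase first, then stem. :)
(: The sample words are illustrative only. :)
for $word in ("TESTS", "Testing", "test")
let $lowered := fn:lower-case($word)
return fn:concat($word, " -> ", $lowered, " -> ",
                 fn:string-join(cts:stem($lowered, "en"), ", "))
```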

Stemmed, Case Sensitive

When a search is case-sensitive, the stems are different depending on the case of the word.

In English, a case-sensitive search with stemming specified is not treated as a stemmed search because words with uppercase letters stem to themselves. You would not expect proper names or acronyms to be stemmed to something else. For example, "Mr. Mark Cutting" should not match "marks cuts".
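
One way to check this on your own server is to ask cts:stem whether a word appears among its own stems when its case is preserved. This is a minimal sketch; the word list is an assumption chosen to mirror the example above, and the results should be verified against your installation rather than taken as given.

```xquery
xquery version "1.0-ml";

(: Report whether each word is among its own English stems. :)
(: The general comparison (=) is true if any returned stem equals the word. :)
for $word in ("Mark", "Cutting", "TESTS", "cutting")
return fn:concat($word, ": ", fn:string($word = cts:stem($word, "en")))
```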

For German and other languages where stems exist for mixed-case words, a case-sensitive search with stemming is recommended.
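
As a rough sketch, a case-sensitive stemmed query for German content might look like the following. It assumes that stemmed searches are enabled on the database and that German language support is available in your installation; the search term "Häuser" is a hypothetical example, not taken from this article.

```xquery
xquery version "1.0-ml";

(: Case-sensitive, stemmed word query with the query language set to German. :)
(: "Häuser" is a placeholder term; substitute a word from your own content. :)
cts:search(
  fn:doc(),
  cts:word-query("Häuser", ("case-sensitive", "stemmed", "lang=de"))
)
```

The "lang=de" option tells the server which language's stemming rules to apply to the query term.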

Examples

These example queries demonstrate stemmed searches:

Documents
xquery version "1.0-ml";
xdmp:document-insert("1.xml", <a>This is test.</a>),
xdmp:document-insert("2.xml", <a>This is TESTING.</a>), 
xdmp:document-insert("3.xml", <a>This is TESTS.</a>), 
xdmp:document-insert("4.xml", <a>This is TEST.</a>);

Case insensitive with stemming
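xquery version "1.0-ml";
(: The Search API module must be imported before search:search can be called. :)
import module namespace search = "http://marklogic.com/appservices/search"
    at "/MarkLogic/appservices/search/search.xqy";

search:search("TESTS",
    <options xmlns="http://marklogic.com/appservices/search">
      <term>
        <term-option>case-insensitive</term-option>
        <term-option>stemmed</term-option>
      </term>
    </options>)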

Matches: test, TESTS, TESTING, and TEST.

Case sensitive with stemming

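xquery version "1.0-ml";
(: The Search API module must be imported before search:search can be called. :)
import module namespace search = "http://marklogic.com/appservices/search"
    at "/MarkLogic/appservices/search/search.xqy";

search:search("TESTS",
    <options xmlns="http://marklogic.com/appservices/search">
      <term>
        <term-option>case-sensitive</term-option>
        <term-option>stemmed</term-option>
      </term>
    </options>)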

Matches: TESTS

